### Today we are going to perform a simple classification of the sentiment of Amazon reviews.

### Please download the dataset `amazon_baby.csv`.

In [91]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import string
from sklearn.linear_model import LogisticRegression

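def remove_punctuation(text):
    # Build a translation table that deletes every ASCII punctuation character
    # (the string module is already imported at the top of the cell).
    translator = str.maketrans('', '', string.punctuation)
    return text.translate(translator)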

baby_df = pd.read_csv('amazon_baby.csv')
baby_df.head()

Unnamed: 0,name,review,rating
0,Planetwise Flannel Wipes,"These flannel wipes are OK, but in my opinion ...",3
1,Planetwise Wipe Pouch,it came early and was not disappointed. i love...,5
2,Annas Dream Full Quilt with 2 Shams,Very soft and comfortable and warmer than it l...,5
3,Stop Pacifier Sucking without tears with Thumb...,This is a product well worth the purchase. I ...,5
4,Stop Pacifier Sucking without tears with Thumb...,All of my kids have cried non-stop when I trie...,5


## Exercise 1 (data preparation)
a) Remove punctuation from reviews using the given function.   
b) Replace all missing (NaN) reviews with the empty string `""`.  
c) Drop all the entries with rating = 3, as they have neutral sentiment.   
d) Set all positive ($\geq$ 4) ratings to 1 and negative ($\leq$ 2) to -1.

1. **Cleaning the `review` column**
   - Replace all missing values (`NaN`) in the `review` column with an empty string `""` so that later text operations do not raise errors.
   - Apply the `remove_punctuation` function to every review to remove punctuation marks. This cleans the text and makes it easier to use for further processing (e.g., tokenization or model training).


In [92]:
#a) b)
baby_df['review'] = baby_df['review'].fillna('')
baby_df['review'] = baby_df['review'].apply(remove_punctuation)

#short tests:
baby_df["review"][4] == 'All of my kids have cried nonstop when I tried to ween them off their pacifier until I found Thumbuddy To Loves Binky Fairy Puppet  It is an easy way to work with your kids to allow them to understand where their pacifier is going and help them part from itThis is a must buy book and a great gift for expecting parents  You will save them soo many headachesThanks for this book  You all rock'
remove_punctuation(baby_df["review"][4]) == 'All of my kids have cried nonstop when I tried to ween them off their pacifier until I found Thumbuddy To Loves Binky Fairy Puppet  It is an easy way to work with your kids to allow them to understand where their pacifier is going and help them part from itThis is a must buy book and a great gift for expecting parents  You will save them soo many headachesThanks for this book  You all rock'
baby_df["review"][38] == baby_df["review"][38]

True
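
The order of the two steps matters: `remove_punctuation` calls `str.translate`, which a missing value (a float `NaN`) does not have. A minimal sketch on a made-up three-row frame illustrates this; the name `toy_df` and its contents are illustrative and not taken from the dataset.

```python
import string

import numpy as np
import pandas as pd


def remove_punctuation(text):
    # Same helper as above: delete every ASCII punctuation character.
    translator = str.maketrans('', '', string.punctuation)
    return text.translate(translator)


# Hypothetical toy data with one missing review.
toy_df = pd.DataFrame({'review': ["Great, really great!", np.nan, "Don't buy."]})

# Fill missing values first, then strip punctuation.
toy_df['review'] = toy_df['review'].fillna('').apply(remove_punctuation)
print(toy_df['review'].tolist())
```

Swapping the two calls would raise an `AttributeError` on the missing row, because a float has no `translate` method.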

3. **Removing neutral reviews**
   - Filter out all rows where the `rating` is equal to 3, since these reviews are considered neutral and we only want clearly positive or negative examples.


In [94]:
#c)
baby_df = baby_df[baby_df['rating'] != 3]

#short test:
sum(baby_df["rating"] == 3)

0

4. **Mapping ratings to sentiment labels**
   - Convert numerical ratings into sentiment labels:
     - Ratings ≥ 4 are mapped to `1` (positive sentiment)
     - Ratings ≤ 2 are mapped to `-1` (negative sentiment)
   - This simplifies the task into a binary sentiment classification problem.


In [95]:
#d)
def map_sentiment(rating):
    if rating >= 4:
        return 1
    else:
        return -1

baby_df['rating'] = baby_df['rating'].apply(
    map_sentiment
)

#short test:
sum(baby_df["rating"]**2 != 1)

0
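
As an alternative to `apply` with a Python function, the same mapping can be written as one vectorized expression. This is only a sketch on a made-up `ratings` Series, and it assumes that rating 3 has already been dropped, as in step c).

```python
import numpy as np
import pandas as pd

# Hypothetical ratings with the neutral value 3 already removed.
ratings = pd.Series([5, 1, 4, 2])

# np.where picks 1 where the condition holds and -1 elsewhere.
labels = pd.Series(np.where(ratings >= 4, 1, -1), index=ratings.index)
print(labels.tolist())
```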

## CountVectorizer
In order to analyze strings, we need to assign them numerical values. We will use one of the simplest string representations, which transforms strings into $n$-dimensional vectors. The number of dimensions is the size of our dictionary, and each value of the vector is the number of appearances of the corresponding word in the sentence.

In [96]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
reviews_train_example = ["We like apples",
                   "We hate oranges",
                   "I adore bananas",
                   "We like like apples and oranges",
                   "They dislike bananas"]

X_train_example = vectorizer.fit_transform(reviews_train_example)

print(vectorizer.get_feature_names_out())
print(X_train_example.todense())

['adore' 'and' 'apples' 'bananas' 'dislike' 'hate' 'like' 'oranges' 'they'
 'we']
[[0 0 1 0 0 0 1 0 0 1]
 [0 0 0 0 0 1 0 1 0 1]
 [1 0 0 1 0 0 0 0 0 0]
 [0 1 1 0 0 0 2 1 0 1]
 [0 0 0 1 1 0 0 0 1 0]]


In [97]:
reviews_test_example = ["They like bananas",
                   "We hate oranges bananas and apples",
                   "We love bananas"] #New word!

X_test_example = vectorizer.transform(reviews_test_example)

print(X_test_example.todense())

[[0 0 0 1 0 0 1 0 1 0]
 [0 1 1 1 0 1 0 1 0 1]
 [0 0 0 1 0 0 0 0 0 1]]


We should acknowledge a few facts. Firstly, CountVectorizer does not take word order into account. Secondly, it ignores one-letter words by default (this can be changed during initialization). Finally, for test values, CountVectorizer ignores words which are not in its dictionary.

## Exercise 2
a) Split dataset into training and test sets.     
b) Transform reviews into vectors using CountVectorizer.

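a) **Split dataset into training and test sets**  
- Use `train_test_split` to keep 80% of the rows for training and hold out the remaining 20% as a test set.  
- No `random_state` is passed, so the split, and every count and score reported below, changes from run to run.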

In [98]:
from sklearn.model_selection import train_test_split
#a)
train_data, test_data = train_test_split(baby_df, train_size=0.8, test_size=0.2)
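
A hedged sketch of a reproducible, stratified variant of this split, on made-up data: `random_state=0` is an arbitrary seed chosen for illustration, and `stratify` keeps the ratio of positive to negative labels the same in both parts. The notebook itself does not use either option.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical labelled data: 80 positive and 20 negative rows.
toy_df = pd.DataFrame({
    'review': [f"review {i}" for i in range(100)],
    'rating': [1] * 80 + [-1] * 20,
})

train_toy, test_toy = train_test_split(
    toy_df,
    test_size=0.2,
    random_state=0,              # fixed seed: the same split on every run
    stratify=toy_df['rating'],   # preserve the class proportions
)
print(train_toy['rating'].value_counts().to_dict())
print(test_toy['rating'].value_counts().to_dict())
```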

b) **Transform reviews into vectors**  
- Use `CountVectorizer` to convert the text reviews into numerical feature vectors.  
- `fit_transform` is applied on the training data to learn the vocabulary and transform the text into vectors.  
- `transform` is applied on the test data using the same vocabulary learned from training, ensuring consistent feature mapping.


In [99]:
vectorizer = CountVectorizer()

X_train = vectorizer.fit_transform(train_data['review'])
X_test = vectorizer.transform(test_data['review'])

Y_train = train_data['rating']
Y_test = test_data['rating']



In [101]:
print(Y_train.value_counts())
print(Y_test.value_counts())

rating
 1    112181
-1     21220
Name: count, dtype: int64
rating
 1    28078
-1     5273
Name: count, dtype: int64


## Exercise 3
a) Train LogisticRegression model on training data (reviews processed with CountVectorizer, ratings as they were).   
b) Print 10 most positive and 10 most negative words.

a) **Train the model**  
- Fit a `LogisticRegression` classifier on the training data (`X_train`) with the target labels (`Y_train`).  
- `max_iter=1000` raises the solver's iteration limit (the default is 100), which gives it room to converge on a large, sparse feature set.


In [102]:
#a)
model = LogisticRegression(max_iter=1000)
model.fit(X_train, Y_train)

b) **Identify the most positive and negative words**  
- Extract the model's coefficients (`model.coef_`), whose sign and magnitude indicate how strongly each word pushes a prediction toward positive or negative sentiment.  
- Map the coefficients to the corresponding words from the `CountVectorizer` vocabulary.  
- Sort the coefficients to find:
  - The 10 words with the most negative impact (most negative coefficients).  
  - The 10 words with the most positive impact (most positive coefficients).  


In [103]:
#b)
coefs = model.coef_[0]
words = np.array(vectorizer.get_feature_names_out())
sorted_indices = np.argsort(coefs)

print("10 most negative words:")
print(words[sorted_indices[:10]])

print("\n10 most positive words:")
print(words[sorted_indices[-10:]])

#hint: model.coef_, vectorizer.get_feature_names_out()

10 most negative words:
['dissapointed' 'disappointing' 'worst' 'theory' 'worthless' 'unusable'
 'useless' 'poor' 'poorly' 'ineffective']

10 most positive words:
['perfect' 'minor' 'penny' 'saves' 'excellent' 'sooner' 'awesome' 'worry'
 'rich' 'ply']


## Exercise 4
a) Predict the sentiment of test data reviews.   
b) Predict the sentiment of test data reviews in terms of probability.   
c) Find five most positive and most negative reviews.   
d) Calculate the accuracy of predictions.

a) **Predict sentiment labels**  
- Use `model.predict` on `X_test` to get the predicted sentiment labels (1 for positive, -1 for negative) for each review in the test set.


In [104]:
#a)
y_pred = model.predict(X_test)

print(y_pred)


[ 1  1  1 ...  1 -1  1]


b) **Predict sentiment probabilities**  
- Use `model.predict_proba` to obtain the probability estimates for each class.  
- This gives a sense of the model's confidence in each prediction.


In [105]:
#b)
probability = model.predict_proba(X_test)

probability

#hint: model.predict_proba()

array([[5.37268365e-04, 9.99462732e-01],
       [3.06128031e-06, 9.99996939e-01],
       [1.15699682e-03, 9.98843003e-01],
       ...,
       [2.09036126e-02, 9.79096387e-01],
       [9.00270332e-01, 9.97296681e-02],
       [1.08673988e-03, 9.98913260e-01]])
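
The column order of `predict_proba` follows `model.classes_`, which scikit-learn stores in sorted order. With the labels -1 and 1, column 0 is therefore the probability of the negative class and column 1 that of the positive class. A small sketch on made-up one-feature data shows the correspondence:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical one-feature data: larger values belong to the positive class.
X_toy = np.array([[0.0], [1.0], [2.0], [3.0]])
y_toy = np.array([-1, -1, 1, 1])

clf = LogisticRegression().fit(X_toy, y_toy)
proba = clf.predict_proba(X_toy)

print(clf.classes_)       # column order of predict_proba
print(proba.sum(axis=1))  # each row sums to 1
```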

c) **Find the most positive and negative reviews**  
- Sort the test reviews by predicted probabilities for positive and negative classes.  
- Identify and display the top 5 reviews that the model considers most positive and most negative.


In [106]:
#c)

top5_pos_idx = np.argsort(probability[:, 1])[-5:][::-1]
top5_neg_idx = np.argsort(probability[:, 0])[-5:][::-1]

print("5 most positive reviews:")
for i in top5_pos_idx:
    print(f"\n Review {i+1}:\n {test_data.iloc[i]['review']}")

print("\n5 most negative reviews:")
for i in top5_neg_idx:
    print(f"\n Review {i+1}:\n {test_data.iloc[i]['review']}")


5 most positive reviews:

 Review 20455:
 The joovy zoom 360 was the perfect solution for us We couldnt justify spending the money on a mountain buggy terrain but we wanted a very sturdy allterrain jogger with a locking swivel wheel I tried out a BOB as well in the store I also wanted a large sun canopy and a seat that my daughter would be able to fit in for years This stroller is affordable while still having most of the features I was looking for in a jogger The biggest compromise for me was that I had wanted a hand brake but honestly I probably dont need it This stroller is so easy to push and stop that it is unnecessaryThe fabric is sturdy and feels like it will really lastThe foot rest is far enough away from the seat that my daughter will be able to fit comfortably in the seat for several years without outgrowing it It is sturdy metal with drainage holes I didnt like the BOBs foot rest because it was made of fabricThe locking swivel wheel is easy to lock or unlock It doesnt shake
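
The index selection used in this cell can be followed on a small array. `np.argsort` returns positions rather than DataFrame labels, which is why the cell looks reviews up with `.iloc`. The probabilities below are made up for illustration.

```python
import numpy as np

# Hypothetical positive-class probabilities for five reviews.
p_pos = np.array([0.10, 0.95, 0.40, 0.99, 0.70])

# argsort sorts ascending, so the last k indices belong to the k largest
# values; reversing them puts the largest first.
top2 = np.argsort(p_pos)[-2:][::-1]
print(top2)  # positions, to be used with .iloc on the test DataFrame
```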

d) **Calculate accuracy**  
- Use `model.score` to compute the overall accuracy on the test set, i.e., the proportion of correctly predicted labels.


In [107]:
#d)
accuracy = model.score(X_test, Y_test)

print("Accuracy on test set:", accuracy)


Accuracy on test set: 0.9317861533387305
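
Accuracy is the fraction of predictions that match the true labels, and it is best read against a simple baseline. The sketch below first computes accuracy by hand on made-up labels, then derives the majority-class baseline from the test-set counts printed earlier (28078 positive, 5273 negative); a model that always answers "positive" would reach that baseline.

```python
import numpy as np

# Accuracy is the fraction of predictions that match the true labels.
y_true_toy = np.array([1, 1, -1, 1, -1])   # hypothetical labels
y_pred_toy = np.array([1, -1, -1, 1, 1])   # hypothetical predictions
toy_accuracy = np.mean(y_true_toy == y_pred_toy)
print(toy_accuracy)

# Majority-class baseline from the test-set counts printed earlier.
n_pos, n_neg = 28078, 5273
baseline = max(n_pos, n_neg) / (n_pos + n_neg)
print(round(baseline, 4))
```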


## Exercise 5
In this exercise we will limit the dictionary of CountVectorizer to the set of significant words, defined below.


a) Redo exercises 2-4 using limited dictionary.   
b) Check the impact of all the words from the dictionary.   
c) Compare accuracy of predictions and the time of evaluation.

a) **Train and evaluate with limited dictionary**  
- Define a smaller set of significant words (`significant_words`) based on their relevance to sentiment.  
- Use `CountVectorizer` with this limited vocabulary to vectorize reviews.  
- Train a new `LogisticRegression` model (`light_model`) on this reduced feature set.  
- Predict sentiment labels and probabilities on the test set and identify the top 5 most positive and most negative reviews.
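
When a vocabulary is passed to the constructor, `CountVectorizer` learns nothing from the data: the columns follow the order of the given list and every other word is ignored. A minimal sketch with a made-up three-word vocabulary (`toy_vocab` is hypothetical):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical three-word vocabulary; the column order follows this list.
toy_vocab = ['love', 'broke', 'return']
fixed_vec = CountVectorizer(vocabulary=toy_vocab)

# With a fixed vocabulary nothing is learned, so transform works without fit.
X_fixed = fixed_vec.transform(["I love it, love it", "It broke, so I will return it"])
print(X_fixed.toarray())
```

For this reason `fit_transform` and `transform` give the same result when the vocabulary is fixed.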


In [108]:
significant_words = ['love','great','easy','old','little','perfect','loves','well','able','car','broke','less','even','waste','disappointed','work','product','money','would','return']

In [109]:
vectorizer_small = CountVectorizer(vocabulary=significant_words)

X_train_small = vectorizer_small.fit_transform(train_data['review'])
X_test_small = vectorizer_small.transform(test_data['review'])

y_train_small = train_data['rating']
y_test_small = test_data['rating']

light_model = LogisticRegression(max_iter=1000)
light_model.fit(X_train_small, y_train_small)

y_pred2 = light_model.predict(X_test_small)
probability2 = light_model.predict_proba(X_test_small)


top5_pos_idx = np.argsort(probability2[:, 1])[-5:][::-1]
top5_neg_idx = np.argsort(probability2[:, 0])[-5:][::-1]




In [110]:
print("5 most positive reviews:")
for i in top5_pos_idx:
    print(f"\n Review {i+1}:\n {test_data.iloc[i]['review']}")

print("\n5 most negative reviews:")
for i in top5_neg_idx:
    print(f"\n Review {i+1}:\n {test_data.iloc[i]['review']}")

5 most positive reviews:

 Review 16716:
 I did tons of research on strollers I knew I wanted a stroller that was light all terrain that could handle jogging and easy maneuverability My husband thought I was nuts for putting so much time into our small investment It came down to a mountain buggy urban and bumbleride indie I am beyond elated I chose the Bumbleride for our 5 month old DD1 easy to out together2 very light3 movespushes like buttah So easy especially when jogging I dont jog in a straight line yet My path has lots of turns so I dont lock the front wheel and it does great I easly jog with 1 arm freeswinging the other has the jogging strap around my wrist and pushing4 big basket I can put much more in it than my Chicco stroller5 canopy  it does exactly what its meant for Its huge and does its job well I live in Louisiana and its quite sunny here The canopy can shade my DD all the way to her toes It almost is like a cocoon  She loves it6 folds well overall Id give it a 7 I dont

b) **Check the impact of individual words**  
- Print the coefficient of each word in the limited vocabulary.  
- This shows which words have the strongest positive or negative influence on predictions.


In [113]:
#b)
for word, coef in zip(vectorizer_small.get_feature_names_out(), light_model.coef_[0]):
    print(f"{word}: {coef}")


love: 1.386016484877855
great: 0.936978702112791
easy: 1.1866647659373306
old: 0.08140197987416038
little: 0.5272424036745277
perfect: 1.4632091882739175
loves: 1.6758439544514716
well: 0.48614261537774883
able: 0.2298311415099273
car: 0.06368005269244824
broke: -1.6830256315956766
less: -0.21400592294924428
even: -0.5099792154979386
waste: -1.951197334187572
disappointed: -2.351665363551018
work: -0.6404249113734013
product: -0.32111672458863105
money: -0.9168770100357132
would: -0.3417209262307177
return: -2.060700754514235
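
One way to read these numbers: in logistic regression, `exp(coefficient)` is the factor by which the odds of a positive label are multiplied for each additional occurrence of the word, with all other counts held fixed. The sketch below applies this to a few coefficients copied from the output above and rounded to four decimals.

```python
import math

# A few coefficients copied from the output above (rounded to four decimals).
coefs = {'loves': 1.6758, 'perfect': 1.4632, 'car': 0.0637, 'disappointed': -2.3517}

# exp(coefficient) is the multiplicative change in the odds of a positive
# label for each additional occurrence of the word, other counts held fixed.
odds_ratios = {word: math.exp(c) for word, c in coefs.items()}
for word, ratio in odds_ratios.items():
    print(f"{word}: {ratio:.2f}")
```

A ratio close to 1, as for `car`, means the word carries almost no sentiment information in this model.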


c) **Compare accuracy of predictions and the time of evaluation**  
- Score both models on the same test set, then time `predict` for each of them.

In [115]:
#c)
print("\nOriginal model score:")
print(model.score(X_test, Y_test))

print("\nLight model score:")
print(light_model.score(X_test_small, y_test_small))


Original model score:
0.9317861533387305

Light model score:
0.8697790171209259


In [116]:
%%time
%%timeit
model.predict(X_test)

7.35 ms ± 94.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
CPU times: user 5.97 s, sys: 9.16 ms, total: 5.98 s
Wall time: 5.98 s


In [117]:
%%time
%%timeit
light_model.predict(X_test_small)

867 µs ± 81.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
CPU times: user 7.08 s, sys: 5.15 ms, total: 7.08 s
Wall time: 7.14 s
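
`%%time` and `%%timeit` are IPython cell magics and are not available in a plain Python script. A rough equivalent with the standard `timeit` module is sketched below. The random count matrices merely stand in for `X_test` and `X_test_small`; their sizes, the helper name `time_predict`, and the random labels are all assumptions made for illustration.

```python
import timeit

import numpy as np
from scipy import sparse
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)


def time_predict(n_features, n_rows=2000, number=20):
    # Random sparse counts as a stand-in for vectorized reviews.
    X = sparse.csr_matrix(rng.poisson(0.05, size=(n_rows, n_features)))
    y = rng.choice([-1, 1], size=n_rows)
    clf = LogisticRegression(max_iter=1000).fit(X, y)
    # Best of three repeats, averaged over `number` calls, in seconds.
    return min(timeit.repeat(lambda: clf.predict(X), number=number, repeat=3)) / number


print(f"wide model:   {time_predict(1000):.6f} s per call")
print(f"narrow model: {time_predict(20):.6f} s per call")
```

Taking the minimum over repeats reduces the influence of unrelated system load on the measurement.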


**Performance comparison (prediction time):**

- **Original model:**  
  - Prediction on the full test set took about **7.35 ms per loop**.  
- **Light model (limited vocabulary):**  
  - Prediction was much faster: **~0.87 ms per loop**, roughly 8.5 times faster.  

**Interpretation:**  
- Reducing the number of features (words) significantly speeds up prediction.  
- The price is accuracy: the light model scores ~87% against ~93% for the full model. This drop of about 6 percentage points is not negligible, given that roughly 84% of the test reviews are positive (28078 of 33351). The light model remains a reasonable choice for applications where speed and resource usage matter more than accuracy.






