### Today we are going to perform a simple classification of the sentiment of Amazon reviews.

### Please download the dataset amazon_baby.csv.

In [133]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import string
from sklearn.linear_model import LogisticRegression

def remove_punctuation(text):
    translator = str.maketrans('', '', string.punctuation)
    return text.translate(translator)

baby_df = pd.read_csv('amazon_baby.csv')
baby_df.head()

Unnamed: 0,name,review,rating
0,Planetwise Flannel Wipes,"These flannel wipes are OK, but in my opinion ...",3
1,Planetwise Wipe Pouch,it came early and was not disappointed. i love...,5
2,Annas Dream Full Quilt with 2 Shams,Very soft and comfortable and warmer than it l...,5
3,Stop Pacifier Sucking without tears with Thumb...,This is a product well worth the purchase. I ...,5
4,Stop Pacifier Sucking without tears with Thumb...,All of my kids have cried non-stop when I trie...,5


## Exercise 1 (data preparation)
a) Remove punctuation from the reviews using the given function.   
b) Replace all missing (NaN) reviews with an empty string "".  
c) Drop all entries with rating = 3, as they have neutral sentiment.   
d) Set all positive ($\geq$4) ratings to 1 and all negative ($\leq$2) ratings to -1.

In [134]:
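#a)
# Strip punctuation from string reviews only. Missing values (NaN) pass through
# unchanged so that step b) can handle them; casting with astype(str) would
# turn NaN into the literal text 'nan'.
baby_df['review'] = baby_df['review'].apply(
    lambda text: remove_punctuation(text) if isinstance(text, str) else text
)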

#short test: 
baby_df["review"][4] == 'All of my kids have cried nonstop when I tried to ween them off their pacifier until I found Thumbuddy To Loves Binky Fairy Puppet  It is an easy way to work with your kids to allow them to understand where their pacifier is going and help them part from itThis is a must buy book and a great gift for expecting parents  You will save them soo many headachesThanks for this book  You all rock'
remove_punctuation(baby_df["review"][4]) == 'All of my kids have cried nonstop when I tried to ween them off their pacifier until I found Thumbuddy To Loves Binky Fairy Puppet  It is an easy way to work with your kids to allow them to understand where their pacifier is going and help them part from itThis is a must buy book and a great gift for expecting parents  You will save them soo many headachesThanks for this book  You all rock'

True

In [135]:
#b)
# Replace missing reviews with an empty string, assigning the result back to the column.
baby_df['review'] = baby_df['review'].fillna("")

#short test:
baby_df["review"][38] == baby_df["review"][38]

True

In [136]:
#c)

baby_df.drop(baby_df[baby_df['rating']==3].index, inplace=True)

#short test:
sum(baby_df["rating"] == 3)

0

In [137]:
#d) 
def change_ratings(rating):
    if rating <=2:
        return -1
    return 1

baby_df['rating'] = baby_df['rating'].apply(change_ratings)

#short test:
sum(baby_df["rating"]**2 != 1)

0
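
Steps c) and d) can also be written without a row-wise `apply`. The following is a minimal sketch on a made-up `toy_df` (a hypothetical stand-in for `baby_df`), assuming the same rule as above: ratings of 3 are dropped, ratings of 4 or more become 1, and ratings of 2 or less become -1.

```python
import numpy as np
import pandas as pd

# Hypothetical toy ratings; the notebook itself works on baby_df['rating'].
toy_df = pd.DataFrame({"rating": [1, 2, 3, 4, 5, 3, 5]})

# Keep only non-neutral reviews by filtering with a boolean mask.
toy_df = toy_df[toy_df["rating"] != 3].copy()

# Map ratings >= 4 to +1 and the remaining ratings (<= 2) to -1 in one step.
toy_df["sentiment"] = np.where(toy_df["rating"] >= 4, 1, -1)

print(toy_df)
```

Writing the labels to a separate `sentiment` column is a choice made for this sketch only; it keeps the original ratings available for inspection.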

## CountVectorizer
In order to analyze strings, we need to assign them numerical values. We will use one of the simplest string representations, which transforms strings into $n$-dimensional vectors. The number of dimensions is the size of our dictionary, and each value of the vector is the number of appearances of the corresponding word in the sentence.

In [138]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
reviews_train_example = ["We like apples",
                   "We hate oranges",
                   "I adore bananas",
                   "We like like apples and oranges",
                   "They dislike bananas"]

X_train_example = vectorizer.fit_transform(reviews_train_example)

print(vectorizer.get_feature_names_out())
print(X_train_example.todense())



['adore' 'and' 'apples' 'bananas' 'dislike' 'hate' 'like' 'oranges' 'they'
 'we']
[[0 0 1 0 0 0 1 0 0 1]
 [0 0 0 0 0 1 0 1 0 1]
 [1 0 0 1 0 0 0 0 0 0]
 [0 1 1 0 0 0 2 1 0 1]
 [0 0 0 1 1 0 0 0 1 0]]


In [139]:
reviews_test_example = ["They like bananas",
                   "We hate oranges bananas and apples",
                   "We love bananas"] #New word!

X_test_example = vectorizer.transform(reviews_test_example)

print(X_test_example.todense())

[[0 0 0 1 0 0 1 0 1 0]
 [0 1 1 1 0 1 0 1 0 1]
 [0 0 0 1 0 0 0 0 0 1]]


We should note a few facts. Firstly, CountVectorizer does not take word order into account. Secondly, it ignores one-letter words (this can be changed during initialization). Finally, when transforming test data, CountVectorizer ignores words which are not in its dictionary.

## Exercise 2 
a) Split dataset into training and test sets.     
b) Transform reviews into vectors using CountVectorizer. 

In [140]:
#a)
from sklearn.model_selection import train_test_split
x_train_raw, x_test_raw, y_train, y_test = train_test_split(baby_df['review'].values, baby_df['rating'].values, test_size=0.2, random_state=42)

In [141]:
#b)
vectorizer = CountVectorizer()

x_train = vectorizer.fit_transform(x_train_raw)
x_test = vectorizer.transform(x_test_raw)

## Exercise 3 
a) Train a LogisticRegression model on the training data (reviews processed with CountVectorizer, ratings as they were).   
b) Print the 10 most positive and 10 most negative words.

In [142]:
#a)
model = LogisticRegression(max_iter=300)
model.fit(x_train, y_train)


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [143]:
#b)
coefs = model.coef_.flatten()
indices_of_most_positive = np.argpartition(coefs, -10)[-10:]
indices_of_most_negative = np.argpartition(coefs, 10)[:10]

words = vectorizer.get_feature_names_out()

most_positive = words[indices_of_most_positive]
most_negative = words[indices_of_most_negative]

print(most_positive)
print(most_negative)



#hint: model.coef_, vectorizer.get_feature_names_out()

['rich' 'pleasantly' 'worry' 'minor' 'thankful' 'penny' 'skeptical' 'con'
 'lifesaver' 'saves']
['dissapointed' 'unusable' 'disappointing' 'disappointed' 'useless'
 'poorly' 'worthless' 'unacceptable' 'worst' 'nope']
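
Note that `np.argpartition` selects the 10 largest and 10 smallest coefficients but does not order them within each group. If a ranked list is wanted, a full sort with `np.argsort` is one option. The sketch below uses a hypothetical helper `top_words` and made-up coefficients, not the values of the trained model.

```python
import numpy as np

def top_words(coefs, words, k):
    """Return the k most positive and k most negative words, strongest first."""
    order = np.argsort(coefs)                # indices in ascending coefficient order
    most_negative = words[order[:k]]         # most negative coefficient first
    most_positive = words[order[::-1][:k]]   # most positive coefficient first
    return most_positive, most_negative

# Hypothetical coefficients and vocabulary, for illustration only.
toy_coefs = np.array([0.2, -1.5, 1.7, -0.3, 0.9])
toy_words = np.array(["ok", "broke", "loves", "less", "great"])

positive, negative = top_words(toy_coefs, toy_words, k=2)
print(positive)
print(negative)
```

With the notebook's objects, the call would presumably be `top_words(model.coef_.flatten(), vectorizer.get_feature_names_out(), k=10)`.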


## Exercise 4 
a) Predict the sentiment of test data reviews.   
b) Predict the sentiment of test data reviews in terms of probability.   
c) Find the five most positive and five most negative reviews.   
d) Calculate the accuracy of predictions.

In [144]:
#a)
model.predict(x_test)

array([ 1, -1, -1, ...,  1,  1,  1])

In [145]:
#b)
probability = model.predict_proba(x_test)
probability
#hint: model.predict_proba()

array([[4.81240333e-01, 5.18759667e-01],
       [7.85151735e-01, 2.14848265e-01],
       [9.79541684e-01, 2.04583161e-02],
       ...,
       [9.89536574e-06, 9.99990105e-01],
       [1.33219364e-04, 9.99866781e-01],
       [7.30126914e-02, 9.26987309e-01]])
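
Each row of `predict_proba` holds one probability per class, and the columns follow the order of `model.classes_`. With labels -1 and 1, column 0 is the negative class and column 1 the positive class. A minimal sketch on made-up reviews (all `toy_*` names are hypothetical):

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data, for illustration only.
toy_reviews = ["love it", "great product", "broke quickly", "waste of money"]
toy_labels = np.array([1, 1, -1, -1])

toy_vectorizer = CountVectorizer()
toy_model = LogisticRegression()
toy_model.fit(toy_vectorizer.fit_transform(toy_reviews), toy_labels)

toy_proba = toy_model.predict_proba(toy_vectorizer.transform(["love it"]))

# classes_ lists the label that belongs to each predict_proba column.
print(toy_model.classes_)
positive_column = list(toy_model.classes_).index(1)
print(toy_proba[0, positive_column])
```

Looking the column up through `classes_` avoids hard-coding the index if the label encoding ever changes.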

In [146]:
#c) 

# Indices of the five test reviews with the highest predicted probability of each
# class: column 0 is the negative class (-1), column 1 the positive class (1).
# A stable sort keeps tied probabilities in their original order.
sorted_negatives_indexes = np.argsort(-probability[:, 0], kind="stable")[:5]
sorted_positives_indexes = np.argsort(-probability[:, 1], kind="stable")[:5]

print(">>>most positive reviews")
print(x_test_raw[sorted_positives_indexes])
print(">>>most negative reviews")
print(x_test_raw[sorted_negatives_indexes])

>>>most positive reviews
['BOTTOM LINE I would buy this again in a heartbeat and I would buy it over any other travel crib or pack n play I have been so impressed with this crib 100 worth every penny It has been used every night for seven months and shows no signs of wearMy husband and I bought this crib as our only crib for our first baby We were nervous because it was such a large purchase for us This is the most expensive thing we have purchased besides our car After extensive research we decided on the babybjorn because my husband is in graduate school and we will be doing a lot of moving in the next few years She has slept in it since she was three days old and she is now seven months and the crib looks and functions as if it was newOur BabyBjorn has been in six states and five countries We have checked it at airports all over Europe and the US We have checked it in its carrying case alone and it has not been damaged in anyway It has been on trains buses metros trams in taxis cars

In [147]:
#d) 
model.score(x_test, y_test)


0.9327456448082516
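
For a classifier, `score` returns the mean accuracy, that is, the fraction of predictions that match the true labels. The sketch below computes it by hand on made-up labels and, as an added point of comparison that is not part of the original exercise, also computes the accuracy of always predicting the most frequent class.

```python
import numpy as np

# Hypothetical labels and predictions, for illustration only.
toy_y_true = np.array([1, 1, 1, -1, 1, -1, 1, 1])
toy_y_pred = np.array([1, 1, 1, -1, 1, 1, 1, 1])

# Accuracy is the fraction of predictions that match the true labels.
accuracy = np.mean(toy_y_pred == toy_y_true)

# Baseline: always predict the most frequent class among the true labels.
values, counts = np.unique(toy_y_true, return_counts=True)
majority_class = values[np.argmax(counts)]
baseline_accuracy = np.mean(toy_y_true == majority_class)

print(accuracy, baseline_accuracy)
```

When one class dominates, as positive ratings may in review data, such a baseline helps put an accuracy figure into context.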

## Exercise 5
In this exercise we will limit the dictionary of CountVectorizer to the set of significant words, defined below.


a) Redo exercises 2-4 using the limited dictionary.   
b) Check the impact of each word from the dictionary.   
c) Compare the accuracy of predictions and the evaluation time.

In [148]:
significant_words = ['love','great','easy','old','little','perfect','loves','well','able','car','broke','less','even','waste','disappointed','work','product','money','would','return']

In [149]:
#a)
vectorizer = CountVectorizer(vocabulary=significant_words)
x_train_new = vectorizer.fit_transform(x_train_raw)
x_test_new = vectorizer.transform(x_test_raw)
vectorizer.get_feature_names_out()

array(['love', 'great', 'easy', 'old', 'little', 'perfect', 'loves',
       'well', 'able', 'car', 'broke', 'less', 'even', 'waste',
       'disappointed', 'work', 'product', 'money', 'would', 'return'],
      dtype=object)

In [150]:
new_model = LogisticRegression()
new_model.fit(x_train_new, y_train)

probability = new_model.predict_proba(x_test_new)

# Five highest-probability test reviews per class (column 0: negative, column 1: positive);
# a stable sort keeps tied probabilities in their original order.
sorted_negatives_indexes = np.argsort(-probability[:, 0], kind="stable")[:5]
sorted_positives_indexes = np.argsort(-probability[:, 1], kind="stable")[:5]

print(">>>most positive reviews")
print(x_test_raw[sorted_positives_indexes])
print(">>>most negative reviews")
print(x_test_raw[sorted_negatives_indexes])

new_model.score(x_test_new, y_test)

>>>most positive reviews
['We bought this stroller after selling our beloved BOB rev on craigslist We used the BOB for 9 months for my son but it just wasnt practical I dont jogrun it didnt have a big basket and was very bulky to take into stores quickly However I did love how it unfolded easily but it was heavy to fold up and lift into my small trunk myself Overall I didnt realize what Id need in a stroller until AFTER I had my son Live  learn We did love how easily the BOB would go over pretty much anything Nevertheless we sold it and after extensive research on strollers we decided it was between the uppababy brand because of the large baskets OR the city mini GT because of its easy fold up design After looking over both strollers I decided on the uppababy cruz because of a few main factors It SITS UP I cant tell you how much my son hates being reclined when he is just riding in the stroller and not napping The BOB and the City Mini had a slight recline and he always tried sitting m

0.8689994303019399

In [151]:
#b)
for word, coef in zip(vectorizer.get_feature_names_out(), new_model.coef_[0]):
    print(f"{word}: {coef}")

love: 1.356642692388926
great: 0.9304455970455212
easy: 1.1911831942495426
old: 0.07193430042721062
little: 0.5015727915334721
perfect: 1.5210846987612965
loves: 1.7004351873496204
well: 0.49631522350373436
able: 0.19393899578834797
car: 0.07434929700403688
broke: -1.6680562824339067
less: -0.2071430591846399
even: -0.4910435852746328
waste: -2.0079159425522124
disappointed: -2.3889031315891938
work: -0.6377398258925681
product: -0.31252413709867266
money: -0.9391017677299048
would: -0.3398058943932005
return: -2.0782374013451674


In [152]:
import sys, time

In [160]:
%%time
%%timeit
new_model.predict(x_test_new)

341 µs ± 2.57 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
CPU times: user 2.77 s, sys: 332 µs, total: 2.77 s
Wall time: 2.77 s


In [161]:
%%time
%%timeit
model.predict(x_test)

4.26 ms ± 169 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
CPU times: user 3.39 s, sys: 2.24 ms, total: 3.4 s
Wall time: 3.39 s


In [162]:
print(f"first model score: {model.score(x_test, y_test)}")
print(f"second model score: {new_model.score(x_test_new, y_test)}")

first model score: 0.9327456448082516
second model score: 0.8689994303019399
