### Today we are going to perform a simple sentiment classification of Amazon reviews.

### Please download the dataset amazon_baby.csv.

In [264]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import string
from sklearn.linear_model import LogisticRegression

def remove_punctuation(text):
    import string
    translator = str.maketrans('', '', string.punctuation)
    return text.translate(translator)

baby_df = pd.read_csv('amazon_baby.csv')
baby_df.head()

Unnamed: 0,name,review,rating
0,Planetwise Flannel Wipes,"These flannel wipes are OK, but in my opinion ...",3
1,Planetwise Wipe Pouch,it came early and was not disappointed. i love...,5
2,Annas Dream Full Quilt with 2 Shams,Very soft and comfortable and warmer than it l...,5
3,Stop Pacifier Sucking without tears with Thumb...,This is a product well worth the purchase. I ...,5
4,Stop Pacifier Sucking without tears with Thumb...,All of my kids have cried non-stop when I trie...,5


## Exercise 1 (data preparation)
a) Remove punctuation from reviews using the given function.   
b) Replace all missing (nan) reviews with an empty "" string.  
c) Drop all the entries with rating = 3, as they have neutral sentiment.   
d) Set all positive ($\geq$4) ratings to 1 and negative ($\leq$2) to -1.

In [265]:
#a)
baby_df.loc[~(baby_df['review'].isna()), 'review'] = baby_df.loc[~(baby_df['review'].isna()), 'review'].apply(remove_punctuation)
#short test: 
baby_df["review"][4] == 'All of my kids have cried nonstop when I tried to ween them off their pacifier until I found Thumbuddy To Loves Binky Fairy Puppet  It is an easy way to work with your kids to allow them to understand where their pacifier is going and help them part from itThis is a must buy book and a great gift for expecting parents  You will save them soo many headachesThanks for this book  You all rock'
remove_punctuation(baby_df["review"][4]) == 'All of my kids have cried nonstop when I tried to ween them off their pacifier until I found Thumbuddy To Loves Binky Fairy Puppet  It is an easy way to work with your kids to allow them to understand where their pacifier is going and help them part from itThis is a must buy book and a great gift for expecting parents  You will save them soo many headachesThanks for this book  You all rock'

True

Here I use two pandas methods:
<br/>
isna() - detects missing values, so that only rows with an actual review are selected
<br/>
apply() - calls remove_punctuation on each selected review.
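
As a small illustration of how these two ideas combine, the sketch below applies the same masking pattern to a tiny hand-made dataframe. The sample strings and the names `toy` and `has_text` are made up for this example and are not taken from the dataset; `notna()` is used as the direct opposite of `isna()`.

```python
import string

import numpy as np
import pandas as pd


def remove_punctuation(text):
    # Same idea as the helper above: delete every punctuation character.
    translator = str.maketrans('', '', string.punctuation)
    return text.translate(translator)


# Toy data (hypothetical): one review is missing.
toy = pd.DataFrame({'review': ["Great, love it!", np.nan, "So-so... meh."]})

# Boolean mask of the rows that actually contain text.
has_text = toy['review'].notna()

# apply() runs the function on each selected value; NaN rows are skipped,
# so text.translate() is never called on a float.
toy.loc[has_text, 'review'] = toy.loc[has_text, 'review'].apply(remove_punctuation)

print(toy)
```

Selecting with the mask on both sides of the assignment keeps the missing rows untouched until they are filled in step b).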

In [266]:
#b)
baby_df.fillna("", inplace=True)
#short test:
baby_df["review"][38] == baby_df["review"][38]

True

The fillna() method replaces every NA/NaN value in the dataframe, including the missing reviews, with the empty string "".
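
The short test in this cell works because NaN is the one value that does not compare equal to itself, so the comparison can only be True once the missing review has been replaced. A minimal sketch of that behaviour on a made-up Series (the names `toy`, `filled`, `before` and `after` are illustrative only):

```python
import numpy as np
import pandas as pd

# Hypothetical sample: the second review is missing.
toy = pd.Series(["fine", np.nan], name='review')

# NaN compares unequal to itself, so this is False before filling.
before = toy[1] == toy[1]

filled = toy.fillna("")

# After filling, the value is an ordinary empty string and equals itself.
after = filled[1] == filled[1]

print(before, after)
```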

In [267]:
#c)
baby_df = baby_df[~(baby_df['rating']==3)]
#short test:
sum(baby_df["rating"] == 3)

0

Here we drop all the entries with rating = 3. The filtered dataframe replaces the old one.

In [268]:
#d) 
baby_df['rating'] = baby_df["rating"].apply(lambda x: -1 if x<3 else 1)
#short test:
sum(baby_df["rating"]**2 != 1)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  


0

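The order of the steps matters: the lambda maps every rating that is not below 3 to 1, so the neutral reviews must already have been dropped in c). The warning appears because baby_df was created by filtering another dataframe; the assignment still takes effect, as the test result 0 shows.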
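
One common way to avoid the warning shown above is to take an explicit copy right after filtering. The sketch below shows this on a made-up miniature of the rating column (`toy` is a hypothetical name); it is one possible variant, not the code that produced the outputs in this notebook.

```python
import pandas as pd

# Hypothetical miniature of the rating column.
toy = pd.DataFrame({'rating': [1, 2, 3, 4, 5]})

# Take an explicit copy after filtering so that later assignments
# modify an independent dataframe rather than a slice of the original.
toy = toy[toy['rating'] != 3].copy()

# Map ratings below 3 to -1 and the remaining ones (4 and 5) to 1.
toy['rating'] = toy['rating'].apply(lambda x: -1 if x < 3 else 1)

print(toy['rating'].tolist())
```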

## CountVectorizer
In order to analyze strings, we need to assign them numerical values. We will use one of the simplest string representations, which transforms strings into $n$-dimensional vectors. The number of dimensions is the size of our dictionary, and each value of the vector is the number of appearances of the corresponding word in the sentence.

In [269]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
reviews_train_example = ["We like apples",
                   "We hate oranges",
                   "I adore bananas",
                   "We like like apples and oranges",
                   "They dislike bananas"]

X_train_example = vectorizer.fit_transform(reviews_train_example)

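# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
print(vectorizer.get_feature_names_out().tolist())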
print(X_train_example.todense())



['adore', 'and', 'apples', 'bananas', 'dislike', 'hate', 'like', 'oranges', 'they', 'we']
[[0 0 1 0 0 0 1 0 0 1]
 [0 0 0 0 0 1 0 1 0 1]
 [1 0 0 1 0 0 0 0 0 0]
 [0 1 1 0 0 0 2 1 0 1]
 [0 0 0 1 1 0 0 0 1 0]]




In [270]:
reviews_test_example = ["They like bananas",
                   "We hate oranges bananas and apples",
                   "We love bananas"] #New word!

X_test_example = vectorizer.transform(reviews_test_example)

print(X_test_example.todense())

[[0 0 0 1 0 0 1 0 1 0]
 [0 1 1 1 0 1 0 1 0 1]
 [0 0 0 1 0 0 0 0 0 1]]


We should acknowledge a few facts. Firstly, CountVectorizer does not take word order into account. Secondly, it ignores one-letter words (this can be changed during initialization). Finally, when transforming test data, CountVectorizer ignores words that are not in its dictionary.
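
The three facts above can be checked on a couple of made-up sentences. In this sketch, `docs`, `default_vec` and `short_vec` are hypothetical names, and the custom `token_pattern` is just one way to keep one-letter words.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["I like apples", "apples like I"]  # hypothetical sentences

# Default tokenizer: words of two or more characters, so "I" is dropped.
default_vec = CountVectorizer()
dense = default_vec.fit_transform(docs).toarray()

# Word order does not matter: both sentences get the same count vector.
same_vector = bool((dense[0] == dense[1]).all())

# A token pattern that also accepts one-character words keeps "i".
short_vec = CountVectorizer(token_pattern=r"(?u)\b\w+\b")
short_vec.fit(docs)

# Words unseen during fit are ignored at transform time.
unseen = default_vec.transform(["pears"])

print(sorted(default_vec.vocabulary_))
print(sorted(short_vec.vocabulary_))
print(same_vector, unseen.nnz)
```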

## Exercise 2 
a) Split dataset into training and test sets.     
b) Transform reviews into vectors using CountVectorizer. 

In [271]:
#a)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(baby_df['review'], baby_df['rating'], test_size=0.33, random_state=0)

Here I split the data into X_train, X_test, y_train and y_test, with 33% of the original data held out as the test set.

In [272]:
#b)
vectorizer = CountVectorizer()
vectorizer.fit(baby_df['review'])

X_train_vec = vectorizer.transform(X_train.values) # .astype('uint8')
X_test_vec = vectorizer.transform(X_test.values) # .astype('uint8')

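Here I initialize CountVectorizer(), fit its vocabulary on all reviews, and convert the training and test reviews into sparse matrices that share the same word-to-column mapping. Note that fitting on the whole dataset lets words that occur only in the test reviews into the vocabulary; fitting on X_train alone would avoid that.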
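
A variant that learns the vocabulary from the training reviews only could look like the sketch below. The toy reviews and the names `r_train`, `r_test`, `vec`, `M_train` and `M_test` are hypothetical stand-ins; this is not the code that produced the outputs in this notebook, and it would give different column indices and possibly a different accuracy.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

# Hypothetical toy reviews standing in for baby_df['review'].
reviews = ["love it", "hate it", "works well",
           "broke fast", "great buy", "total waste"]
labels = [1, -1, 1, -1, 1, -1]

r_train, r_test, l_train, l_test = train_test_split(
    reviews, labels, test_size=0.33, random_state=0)

# Learn the vocabulary from the training reviews only ...
vec = CountVectorizer()
M_train = vec.fit_transform(r_train)

# ... and reuse the same word-to-column mapping for the test reviews.
M_test = vec.transform(r_test)

print(M_train.shape, M_test.shape)
```

Both matrices have the same number of columns, so a model trained on one can be applied to the other.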

In [273]:
print(X_train_vec)
print(X_test_vec)

  (0, 2170)	1
  (0, 12183)	3
  (0, 12746)	1
  (0, 15775)	3
  (0, 19790)	1
  (0, 26710)	1
  (0, 31167)	3
  (0, 31732)	1
  (0, 41827)	1
  (0, 49594)	1
  (0, 51725)	1
  (0, 52961)	1
  (0, 60164)	1
  (0, 66290)	6
  (0, 66632)	3
  (0, 73610)	2
  (0, 75590)	1
  (0, 77762)	1
  (0, 80219)	1
  (0, 81008)	3
  (0, 82634)	1
  (0, 83317)	1
  (0, 85373)	1
  (0, 85955)	1
  (0, 86832)	1
  :	:
  (111722, 18908)	1
  (111722, 24555)	1
  (111722, 25214)	1
  (111722, 51832)	1
  (111722, 52352)	1
  (111722, 54290)	1
  (111722, 59152)	1
  (111722, 63049)	1
  (111722, 66290)	1
  (111722, 66632)	4
  (111722, 68477)	1
  (111722, 77579)	1
  (111722, 86246)	2
  (111722, 86832)	1
  (111722, 107046)	2
  (111722, 112658)	1
  (111722, 116676)	1
  (111722, 119617)	1
  (111722, 123063)	1
  (111722, 123173)	1
  (111722, 123234)	1
  (111722, 124682)	1
  (111722, 137846)	1
  (111722, 138104)	1
  (111722, 138556)	1
  (0, 15775)	1
  (0, 51725)	3
  (0, 52773)	1
  (0, 55887)	1
  (0, 60175)	2
  (0, 66632)	1
  (0, 66876)	1
  (0

## Exercise 3 
a) Train LogisticRegression model on training data (reviews processed with CountVectorizer, ratings as they were).   
b) Print 10 most positive and 10 most negative words.

In [274]:
#a)
model=LogisticRegression(max_iter=3000)
# model = LogisticRegression()
model.fit(X_train_vec, y_train)



LogisticRegression(max_iter=3000)

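The model is trained on the training data after its transformation by CountVectorizer. max_iter is raised from the default of 100 to 3000 to give the solver more iterations to converge.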

In [275]:
#b)
indices = np.argsort(model.coef_, axis=1)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
words = np.asarray(vectorizer.get_feature_names_out())
print("The most positive: ", words[indices[0, -10:]])
print("The most negative: ", words[indices[0, :10]])
#hint: model.coef_, vectorizer.get_feature_names_out()



The most positive:  ['saves' 'amazed' 'perfect' 'worry' 'utilize' 'awesome' 'excellent'
 'outstanding' 'pleasantly' 'minor']
The most negative:  ['disappointing' 'dissapointed' 'worst' 'worthless' 'shame' 'useless'
 'unusable' 'concept' 'poorly' 'poor']


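Here I use argsort(), which returns the indices that would sort an array. indices holds the column indices ordered by their coefficients, and words holds the vocabulary, so the indices can be mapped back to words. Both lists are in ascending order of coefficient, so the strongest positive word is printed last.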
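
The argsort-based lookup can be followed step by step on a made-up coefficient row. In the sketch below, `coef` and `vocab` are hypothetical stand-ins for model.coef_ and the vectorizer's vocabulary, and the positive slice is reversed so that the strongest word comes first.

```python
import numpy as np

# Hypothetical coefficients and vocabulary, shaped like model.coef_ (1 x n_words).
coef = np.array([[0.5, -2.0, 1.5, -0.1, 3.0]])
vocab = np.array(['fine', 'awful', 'good', 'meh', 'superb'])

# Column indices in ascending order of coefficient.
order = np.argsort(coef, axis=1)[0]

# Smallest coefficients first.
top_negative = vocab[order[:2]]

# Largest coefficients, reversed so the strongest word comes first.
top_positive = vocab[order[-2:][::-1]]

print(top_negative, top_positive)
```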

## Exercise 4 
a) Predict the sentiment of test data reviews.   
b) Predict the sentiment of test data reviews in terms of probability.   
c) Find five most positive and most negative reviews.   
d) Calculate the accuracy of predictions.

In [276]:
#a)
y_pred = model.predict(X_test_vec)
print(y_pred.tolist())

[1, 1, 1, -1, 1, -1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, -1, 1, 1, 1, 1, 1, -1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, -1, 1, -1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, -1, 1, 1, 1, -1, -1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, -1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, -1, 1, 1, 1, 1, 1, -1, -1, 1, 1, 1, -1, 1, 1, 1, -1, 1, 1, 1, 1, 1, 1, 1, 1, -1, 1, 1, -1, 1, 1, 1, 1, -1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, -1, 1, 1, 1, 1, -1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, -1, -1, -1, 1, -1, -1, 1, -1, 1, 1, 1, 1, 1, 1, 1, 1, 1, -1, 1, 1, 1, 1, 1, 1, -1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, -1, 1, 1, 1, 1, -1, 1, 1, -1, 1, 1, 1, -1, 1, 1, -1, 1, 1, -1, -1, 1, 1, 1, 1, 1, 1, 1, 1, -1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, -1, 1, -1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, -1, -1, -1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, -1, 1, 1, 1, 1, -1, 1, 1, -1, -1, 1, 1, 1, 1, 1, 1, 1, 1, 1, -1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, -1, 1, 1, 1, -1, 1, 1, 1, 1, 1, 1, -1, 1, 1, 1, 1, 1, 1, 1, -1, 1, 1, 1, 1, -1, 1, 1, 1, -1,

These are the labels predicted by the model for the test reviews.

In [277]:
#b)
y_pred_proba = model.predict_proba(X_test_vec)
y_pred_proba

#hint: model.predict_proba()

array([[1.32609364e-02, 9.86739064e-01],
       [2.24344742e-01, 7.75655258e-01],
       [1.27847467e-01, 8.72152533e-01],
       ...,
       [7.16350555e-02, 9.28364944e-01],
       [1.63594533e-06, 9.99998364e-01],
       [5.91837570e-03, 9.94081624e-01]])

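These are the predicted probabilities for each class: the first column is the probability of class -1 (negative), the second of class 1 (positive).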
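
Which column of predict_proba belongs to which label can be read from the fitted model's classes_ attribute. A minimal sketch on made-up one-feature data (`X`, `y`, `clf` and `positive_col` are hypothetical names):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical one-feature data: larger values belong to class 1.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([-1, -1, 1, 1])

clf = LogisticRegression().fit(X, y)

# classes_ lists the label that each predict_proba column refers to.
print(clf.classes_)

proba = clf.predict_proba(X)

# Look the positive column up instead of hard-coding its position.
positive_col = list(clf.classes_).index(1)
print(proba[:, positive_col])
```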

In [278]:
#c) 
X_test_res = X_test.reset_index()
pos_indx = np.argsort(y_pred_proba[:,1])[-5:]
pos_rew = X_test_res.iloc[pos_indx]
display(pos_rew)

print(pos_indx)
#hint: use the results of b)

Unnamed: 0,index,review
45737,14008,I bought the tower despite the bad reviews and...
46280,41763,After considering several lightweight stroller...
868,129722,This is a review of the 2012 Bumbleride Flite ...
51863,42430,new to cloth diapering trying to figure out if...
1649,48158,Were keeping this stroller After much research...


[45737 46280   868 51863  1649]


In [279]:
neg_indx = np.argsort(y_pred_proba[:,0])[-5:]
neg_rew = X_test_res.iloc[neg_indx]
display(neg_rew)

print(neg_indx)

Unnamed: 0,index,review
42135,131738,I purchased this in the black color For some ...
24937,10370,This product should be in the hall of fame sol...
40754,92570,I would recommend in the strongest possible wa...
38899,89902,I am so incredibly disappointed with the strol...
45356,10180,Please see my email to the companyHelloI am wr...


[42135 24937 40754 38899 45356]


Here we can see the 5 most positive and the 5 most negative reviews. The results are consistent with the words found in 3.b).

In [280]:
#d) 
from sklearn.metrics import accuracy_score
print(accuracy_score(y_test,y_pred))

0.9297461338566937


As we can see, the predictions match the true labels for about 93% of the test reviews.
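
Since the predicted labels above are mostly 1, it may be worth comparing the accuracy with a baseline that always predicts the most frequent class. The sketch below uses made-up labels (`y_true` and `y_model` are hypothetical), so its numbers say nothing about this dataset; it only shows how such a baseline could be computed.

```python
import numpy as np

# Hypothetical labels: positives clearly outnumber negatives.
y_true = np.array([1, 1, 1, 1, -1, 1, 1, -1, 1, 1])
y_model = np.array([1, 1, 1, 1, -1, 1, 1, 1, 1, 1])

# Baseline that always predicts the most frequent class.
values, counts = np.unique(y_true, return_counts=True)
majority = values[np.argmax(counts)]
baseline_acc = np.mean(y_true == majority)

# Accuracy of the (hypothetical) model predictions.
model_acc = np.mean(y_true == y_model)

print(baseline_acc, model_acc)
```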

## Exercise 5
In this exercise we will limit the dictionary of CountVectorizer to the set of significant words, defined below.


a) Redo exercises 2-4 using the limited dictionary.   
b) Check the impact of all the words from the dictionary.   
c) Compare accuracy of predictions and the time of evaluation.

In [281]:
significant_words = ['love','great','easy','old','little','perfect','loves','well','able','car','broke','less','even','waste','disappointed','work','product','money','would','return']

In [282]:
#a)
vectorizer_2 = CountVectorizer(vocabulary=significant_words)
X_train_2v = vectorizer_2.fit_transform(X_train)
X_test_2v = vectorizer_2.transform(X_test)

model_2 = LogisticRegression(max_iter=10000)
model_2.fit(X_train_2v , y_train)

indices = np.argsort(model_2.coef_, axis=1)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
words = np.asarray(vectorizer_2.get_feature_names_out())
print("The most positive: ", words[indices[0, -10:]])
print("The most negative: ", words[indices[0, :10]])

y_pred_2 = model_2.predict(X_test_2v)
y_pred_proba = model_2.predict_proba(X_test_2v)

print(y_pred_proba)

X_test_res = X_test.reset_index()
pos_indx = np.argsort(y_pred_proba[:,1])[-5:]
pos_rew = X_test_res.iloc[pos_indx]

display(pos_rew)

neg_indx = np.argsort(y_pred_proba[:,0])[-5:]
neg_rew = X_test_res.iloc[neg_indx]

display(neg_rew)

print(accuracy_score(y_test, y_pred_2))

The most positive:  ['car' 'old' 'able' 'little' 'well' 'great' 'easy' 'love' 'perfect'
 'loves']
The most negative:  ['disappointed' 'return' 'waste' 'broke' 'money' 'work' 'even' 'would'
 'product' 'less']
[[0.21185247 0.78814753]
 [0.21185247 0.78814753]
 [0.19717838 0.80282162]
 ...
 [0.18275693 0.81724307]
 [0.04524728 0.95475272]
 [0.21185247 0.78814753]]




Unnamed: 0,index,review
46088,26882,we needed a gate to fit in between the kitchen...
54433,38061,I literally have owned almost every stroller N...
13875,135152,Weve been using Britax for our boy now 14 mont...
44428,74899,We love this highchair We have a 4 year old a...
27855,134265,We bought this stroller after selling our belo...


Unnamed: 0,index,review
54216,140418,I am a researchaholic in general and have rese...
46562,123630,I am a researchaholic in general and have rese...
46481,77138,beware that the quality is not as good as it l...
7942,121156,I added this product Dr Browns BPA Free Deluxe...
29833,41581,Looks really cute however the cloth smells fun...


0.8675062239909865


This model is less accurate (0.868 vs. 0.930). The likely reason is that CountVectorizer now counts only the 20 significant words instead of the full vocabulary.

In [283]:
#b)
for w, k in sorted(zip(significant_words, model_2.coef_[0]), key = lambda x: x[1]):
  print("{} and impact: {}".format(w, k))

disappointed and impact: -2.3745236563319208
return and impact: -2.0849257206147502
waste and impact: -2.020418882334948
broke and impact: -1.6483030176119617
money and impact: -0.9050163671106995
work and impact: -0.6434689058918271
even and impact: -0.5328669066340904
would and impact: -0.33265238488024645
product and impact: -0.32303260939682876
less and impact: -0.16301774470905087
car and impact: 0.06752893415392877
old and impact: 0.09022861756646108
able and impact: 0.18398442528160544
little and impact: 0.4855317360214128
well and impact: 0.48604894397027926
great and impact: 0.939021939557738
easy and impact: 1.162236906837408
love and impact: 1.3293596928608336
perfect and impact: 1.5137725493268426
loves and impact: 1.7054934029863555


The word with the most negative impact is "disappointed" and the word with
the most positive impact is "loves".
Some words, like "old" and "car" have positive impact, but it is very low.
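
In logistic regression each coefficient is the change in the log-odds of a positive review per additional occurrence of the word, so exponentiating it gives the factor by which the odds are multiplied. The sketch below does this for the two extreme coefficients copied from the output above; the names `coefs` and `odds_ratio` are illustrative.

```python
import numpy as np

# Coefficients copied from the output above.
coefs = {'disappointed': -2.3745236563319208, 'loves': 1.7054934029863555}

# exp(coefficient) is the factor by which the odds of a positive review
# are multiplied for each additional occurrence of the word.
odds_ratio = {w: float(np.exp(c)) for w, c in coefs.items()}

for w, r in odds_ratio.items():
    print(f"{w}: odds multiplied by {r:.3f}")
```

A factor below 1 lowers the odds of a positive prediction, a factor above 1 raises them.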

In [284]:
%%time
#c)
print(accuracy_score(y_test, y_pred))
#hint: %time, %timeit

0.9297461338566937
CPU times: user 12.7 ms, sys: 3.79 ms, total: 16.5 ms
Wall time: 11.3 ms


In [285]:
%%time
print(accuracy_score(y_test, y_pred_2))

0.8675062239909865
CPU times: user 7.52 ms, sys: 0 ns, total: 7.52 ms
Wall time: 7.95 ms


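Both cells take about the same time (wall time 11.3 ms vs. 7.95 ms, a difference of about 3 ms). Note that they time only the accuracy computation, not training or prediction.
The accuracy difference on the test set is about 0.06 (0.930 vs. 0.868).
It can be concluded that it doesn't take long to build a pretty good model.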
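
To compare evaluation time rather than the cost of accuracy_score, the prediction step itself can be timed. In the notebook this could be done by putting %%time on a cell that calls model.predict(X_test_vec). The standalone sketch below uses time.perf_counter on randomly generated stand-in matrices (`X_wide`, `X_narrow` and `timings` are hypothetical names), so its timings are not those of the review models.

```python
import time

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Hypothetical stand-ins for the wide and the 20-column feature matrices.
X_wide = rng.integers(0, 3, size=(2000, 500))
X_narrow = X_wide[:, :20]
y = np.where(X_wide[:, 0] > 0, 1, -1)

timings = {}
for name, X in [('wide', X_wide), ('narrow', X_narrow)]:
    clf = LogisticRegression(max_iter=1000).fit(X, y)
    start = time.perf_counter()
    clf.predict(X)  # the step we actually want to time
    timings[name] = time.perf_counter() - start

print(timings)
```

A single measurement is noisy; repeating the call several times and averaging gives a steadier comparison.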