### Today we are going to perform a simple sentiment classification of Amazon reviews.

### Please download the dataset amazon_baby.csv.

In [28]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import string
from sklearn.linear_model import LogisticRegression

def remove_punctuation(text):
    # Delete every character listed in string.punctuation (string is imported above)
    translator = str.maketrans('', '', string.punctuation)
    return text.translate(translator)

baby_df = pd.read_csv('amazon_baby.csv')
baby_df.head()

Unnamed: 0,name,review,rating
0,Planetwise Flannel Wipes,"These flannel wipes are OK, but in my opinion ...",3
1,Planetwise Wipe Pouch,it came early and was not disappointed. i love...,5
2,Annas Dream Full Quilt with 2 Shams,Very soft and comfortable and warmer than it l...,5
3,Stop Pacifier Sucking without tears with Thumb...,This is a product well worth the purchase. I ...,5
4,Stop Pacifier Sucking without tears with Thumb...,All of my kids have cried non-stop when I trie...,5


## Exercise 1 (data preparation)
a) Remove punctuation from reviews using the given function.   
b) Replace all missing (nan) reviews with an empty string ("").  
c) Drop all the entries with rating = 3, as they have neutral sentiment.   
d) Set all positive ($\geq$4) ratings to 1 and negative ($\leq$2) to -1.

In [29]:
#a)
baby_df['review'] = baby_df['review'].apply(lambda x: remove_punctuation(x) if isinstance(x, str) else x)
#short test: 
baby_df["review"][4] == 'All of my kids have cried nonstop when I tried to ween them off their pacifier until I found Thumbuddy To Loves Binky Fairy Puppet  It is an easy way to work with your kids to allow them to understand where their pacifier is going and help them part from itThis is a must buy book and a great gift for expecting parents  You will save them soo many headachesThanks for this book  You all rock'
remove_punctuation(baby_df["review"][4]) == 'All of my kids have cried nonstop when I tried to ween them off their pacifier until I found Thumbuddy To Loves Binky Fairy Puppet  It is an easy way to work with your kids to allow them to understand where their pacifier is going and help them part from itThis is a must buy book and a great gift for expecting parents  You will save them soo many headachesThanks for this book  You all rock'

True
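The test string above contains joined words such as "itThis" and "headachesThanks", because deleting punctuation also deletes the only separator between two sentences. The sketch below reproduces that effect on a short sample and contrasts it with a hypothetical alternative, `punctuation_to_spaces`, which is not part of the notebook and is shown only as one possible way to keep such words apart.

```python
import string

def remove_punctuation(text):
    # Same approach as the notebook: delete every punctuation character.
    return text.translate(str.maketrans('', '', string.punctuation))

def punctuation_to_spaces(text):
    # Hypothetical alternative: map each punctuation character to a space,
    # then collapse repeated whitespace.
    table = str.maketrans(string.punctuation, ' ' * len(string.punctuation))
    return ' '.join(text.translate(table).split())

sample = "part from it.This is a must-buy book!"
removed = remove_punctuation(sample)
spaced = punctuation_to_spaces(sample)
print(removed)
print(spaced)
```

Which variant is preferable depends on the vocabulary you want: deleting punctuation turns "must-buy" into one token, while replacing it with spaces yields two.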

In [30]:
#b)
print(baby_df["review"][38])
baby_df['review'] = baby_df['review'].fillna("")
#short test:
baby_df["review"][38]

nan


''

In [31]:
#c)
baby_df = baby_df[baby_df['rating'] != 3].copy()  # copy so the later column assignment does not hit a view
#short test:
sum(baby_df["rating"] == 3)

0

In [32]:
#d) 
baby_df['rating'] = baby_df['rating'].apply(lambda x: 1 if x >= 4 else -1)
#short test:
sum(baby_df["rating"]**2 != 1)
baby_df['rating'].unique()

array([ 1, -1])

#### I removed all the rows that had a rating of 3 and relabeled the rest as per point d): 1 for positive and -1 for negative.

## CountVectorizer
In order to analyze strings, we need to assign them numerical values. We will use one of the simplest string representations, which transforms strings into $n$-dimensional vectors. The number of dimensions is the size of our dictionary, and each value of the vector is the number of appearances of the corresponding word in the sentence.

In [33]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
reviews_train_example = ["We like apples",
                   "We hate oranges",
                   "I adore bananas",
                   "We like like apples and oranges",
                   "They dislike bananas"]

X_train_example = vectorizer.fit_transform(reviews_train_example)

print(vectorizer.get_feature_names_out())
print(X_train_example.todense())



['adore' 'and' 'apples' 'bananas' 'dislike' 'hate' 'like' 'oranges' 'they'
 'we']
[[0 0 1 0 0 0 1 0 0 1]
 [0 0 0 0 0 1 0 1 0 1]
 [1 0 0 1 0 0 0 0 0 0]
 [0 1 1 0 0 0 2 1 0 1]
 [0 0 0 1 1 0 0 0 1 0]]


In [34]:
reviews_test_example = ["They like bananas",
                   "We hate oranges bananas and apples",
                   "We love bananas"] #New word!

X_test_example = vectorizer.transform(reviews_test_example)

print(X_test_example.todense())

[[0 0 0 1 0 0 1 0 1 0]
 [0 1 1 1 0 1 0 1 0 1]
 [0 0 0 1 0 0 0 0 0 1]]


We should acknowledge a few facts. Firstly, CountVectorizer does not take word order into account. Secondly, it ignores one-letter words such as "I" (this can be changed during initialization via the `token_pattern` parameter). Finally, for test data, CountVectorizer ignores words that are not in its dictionary, such as "love" in the last test sentence.
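These three properties can be reproduced with a few lines of plain Python. The sketch below is not scikit-learn's implementation; it assumes the default behaviour (lowercasing, tokens of two or more word characters) and uses hypothetical helper names `tokenize`, `fit_vocabulary`, and `count_vector`.

```python
import re
from collections import Counter

# Assumption: mirrors CountVectorizer's default tokenization,
# i.e. lowercase text and keep runs of 2+ word characters.
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")

def tokenize(text):
    return TOKEN_PATTERN.findall(text.lower())

def fit_vocabulary(docs):
    # Sorted list of every distinct token seen in the training documents.
    return sorted({token for doc in docs for token in tokenize(doc)})

def count_vector(text, vocabulary):
    # Words missing from the vocabulary are silently dropped.
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocabulary]

train_docs = ["We like apples", "I adore bananas"]
vocabulary = fit_vocabulary(train_docs)            # "I" is too short to be a token
vector_unknown = count_vector("We love bananas", vocabulary)    # "love" is unseen
vector_reordered = count_vector("apples like we", vocabulary)   # order is irrelevant
print(vocabulary)
print(vector_unknown)
print(vector_reordered)
```

Changing the pattern to `r"\b\w+\b"` in this sketch would keep one-letter words, which corresponds to overriding `token_pattern` when constructing the vectorizer.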

## Exercise 2 
a) Split dataset into training and test sets.     
b) Transform reviews into vectors using CountVectorizer. 

In [35]:
#a)
from sklearn.model_selection import train_test_split
Train, Test = train_test_split(baby_df, test_size=0.2)

In [36]:
#b)
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(list(Train['review']))
y_train = Train["rating"]
X_test = vectorizer.transform(list(Test['review']))
y_test = Test["rating"]

#### In this task, I split the dataset into training and test sets, and converted the reviews into word-count vectors with CountVectorizer.

## Exercise 3 
a) Train LogisticRegression model on training data (reviews processed with CountVectorizer, ratings as they were).   
b) Print 10 most positive and 10 most negative words.

In [37]:
#a)
model = LogisticRegression(max_iter=50)
model.fit(X_train, y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [38]:
coefficients = model.coef_[0]
feature_names = vectorizer.get_feature_names_out()
coef_dict = dict(zip(feature_names, coefficients))
sorted_coef = sorted(coef_dict.items(), key=lambda x: x[1])
print("10 most positive words:")
for word, coef in sorted_coef[-10:][::-1]:
    print(f"{word}: {coef}")

print("\n10 most negative words:")
for word, coef in sorted_coef[:10]:
    print(f"{word}: {coef}")

10 most positive words:
perfectly: 1.9535787357866645
glad: 1.8834263255859534
highly: 1.8719155500856872
pleased: 1.8560705729179956
perfect: 1.7978266647846062
excellent: 1.7595556191821333
awesome: 1.7521802752976288
exactly: 1.6667605522079887
love: 1.6041593420682658
loves: 1.5686527011338085

10 most negative words:
returned: -2.492845935682586
poor: -2.3200229772736263
returning: -2.308139063037786
disappointed: -2.259339387902345
waste: -2.1490829164879237
useless: -2.073698316502393
worst: -2.060851187080608
terrible: -2.008043163531774
disappointing: -1.8333656517880446
horrible: -1.7350224463531352


#### I trained a logistic regression model that learns to distinguish positive from negative reviews based on the training data. Then I printed the 10 words with the largest positive coefficients and the 10 words with the largest negative coefficients.
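To see why a large positive or negative coefficient marks a word as "positive" or "negative", it helps to write out what logistic regression computes: a weighted sum of word counts passed through the sigmoid function. The sketch below uses a few coefficients rounded from the output above; the intercept value is a hypothetical placeholder (the fitted one is not shown in the notebook), and words outside the small `coefs` dictionary are treated as having zero weight.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Rounded from the printed coefficients; only a tiny subset of the real model.
coefs = {"love": 1.60, "perfect": 1.80, "disappointed": -2.26, "returned": -2.49}
intercept = 0.0  # hypothetical value, the fitted intercept is not shown above

def positive_probability(word_counts):
    # score = intercept + sum(coefficient * count) over the words in the review
    score = intercept + sum(coefs.get(word, 0.0) * count
                            for word, count in word_counts.items())
    return sigmoid(score)

p_pos = positive_probability({"love": 2, "perfect": 1})
p_neg = positive_probability({"disappointed": 1, "returned": 1})
print(round(p_pos, 3), round(p_neg, 3))
```

Each occurrence of a word shifts the score by its coefficient, so words with coefficients near zero barely move the predicted probability.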

## Exercise 4 
a) Predict the sentiment of test data reviews.   
b) Predict the sentiment of test data reviews in terms of probability.   
c) Find five most positive and most negative reviews.   
d) Calculate the accuracy of predictions.

In [39]:
#a)
y_pred = model.predict(X_test)
y_pred

array([1, 1, 1, ..., 1, 1, 1])

In [40]:
#b)
y_pred_proba = model.predict_proba(X_test)
y_pred_proba
#hint: model.predict_proba()

array([[7.20999242e-04, 9.99279001e-01],
       [3.20389469e-04, 9.99679611e-01],
       [8.40631819e-03, 9.91593682e-01],
       ...,
       [2.87506058e-03, 9.97124939e-01],
       [2.55734897e-03, 9.97442651e-01],
       [5.17799311e-03, 9.94822007e-01]])

In [41]:
#c) 
print("Positive \n: ", Test.iloc[y_pred_proba.T[1].argmax()]["review"])
print("\n Negative: ", Test.iloc[y_pred_proba.T[0].argmax()]["review"])
#hint: use the results of b)

Positive 
:  I bought this seat for my tall 38in and thin 28lb 2 year old daughter I was impressed with the height the seat allowed for given that my child will out grow a seat in height long before weight I needed a seat that would fit in a 2012 Focus and was FAA cerified The seat is very compact It was easy to install using the seat belt and the latch Since the seat is so compact I had enough room to get in the car and put my knee in the seat to help get a solid install I was able to do this myself in just a few minutes even in an airport parking lotMy kiddo loved the purple color That particular color is a micro suede type fabric that is super easy to clean The wings support her head well when she is sleeping She can climb in and out of the seat very easily since it is much shorter in rise than her old seat an Evenflo Symphony The clips are all great quality and the shoulder pads are nice and soft She feels less boxed in sitting in this seat She loves that she can reach things if sh
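The cell above picks only the single most extreme review for each class, while point c) asks for five. A minimal sketch of selecting the top `k` indices is given below; the probabilities are hypothetical stand-ins for the positive-class column of `y_pred_proba`, and `k` is set to 3 only because the toy list is short.

```python
import heapq

# Hypothetical positive-class probabilities, standing in for y_pred_proba[:, 1].
positive_proba = [0.99, 0.12, 0.75, 0.03, 0.98, 0.55, 0.40, 0.91]
k = 3  # the exercise asks for 5

indices = range(len(positive_proba))
# Row positions with the highest and lowest probability of the positive class.
most_positive = heapq.nlargest(k, indices, key=positive_proba.__getitem__)
most_negative = heapq.nsmallest(k, indices, key=positive_proba.__getitem__)
print(most_positive, most_negative)
```

In the notebook, the resulting positions could be passed to `Test.iloc[...]` to retrieve the review texts. Column 1 of `predict_proba` corresponds to label 1 here because scikit-learn orders classes as in `model.classes_`, which is sorted (-1, then 1).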

In [42]:
#d) 
model.score(X_test, y_test)

0.9289676471470121
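For a classifier, `model.score` returns the mean accuracy, i.e. the share of test reviews whose predicted label matches the true label. The sketch below computes the same quantity by hand on hypothetical labels, together with a majority-class baseline, which is a useful reference point whenever one class is much more frequent than the other.

```python
from collections import Counter

# Hypothetical labels, used only to illustrate the computation.
y_true = [1, 1, -1, 1, -1, 1, 1, 1]
y_hat = [1, 1, -1, 1, 1, 1, 1, 1]

# Accuracy: fraction of positions where prediction and truth agree.
accuracy = sum(t == p for t, p in zip(y_true, y_hat)) / len(y_true)

# Baseline: accuracy of always predicting the most frequent true label.
majority_label, majority_count = Counter(y_true).most_common(1)[0]
baseline_accuracy = majority_count / len(y_true)
print(accuracy, baseline_accuracy)
```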

In [43]:
temp_X_test = X_test

#### In this task, I predicted the sentiment of the test reviews and the probability of each sentiment. I also printed the accuracy score based on the test data.

## Exercise 5
In this exercise we will limit the dictionary of CountVectorizer to the set of significant words, defined below.


a) Redo exercises 2-4 using the limited dictionary.   
b) Check the impact of all the words from the dictionary.   
c) Compare the accuracy of predictions and the time of evaluation.

In [44]:
significant_words = ['love','great','easy','old','little','perfect','loves','well','able','car','broke','less','even','waste','disappointed','work','product','money','would','return']

In [45]:
#a)
vectorizer = CountVectorizer(vocabulary=significant_words)

X_train = vectorizer.fit_transform(list(Train['review']))
y_train = Train["rating"]
X_test = vectorizer.transform(list(Test['review']))
y_test = Test["rating"]

In [46]:
model2 = LogisticRegression(max_iter=50)
model2.fit(X_train, y_train)
coefficients = model2.coef_[0]
feature_names = vectorizer.get_feature_names_out()
coef_dict = dict(zip(feature_names, coefficients))
sorted_coef = sorted(coef_dict.items(), key=lambda x: x[1])
print("10 most positive words:")
for word, coef in sorted_coef[-10:][::-1]:
    print(f"{word}: {coef}")

print("\n10 most negative words:")
for word, coef in sorted_coef[:10]:
    print(f"{word}: {coef}")

10 most positive words:
loves: 1.706828037752739
perfect: 1.521129431612671
love: 1.3439075578970332
easy: 1.2021338040291283
great: 0.9329650494541082
well: 0.5133242090864254
little: 0.5090150069761615
able: 0.2222710410010122
old: 0.08777808125498429
car: 0.05890451947624577

10 most negative words:
disappointed: -2.3258132820958752
return: -2.0484741492654397
waste: -2.0302995245571056
broke: -1.7132976788988898
money: -0.9319705214499334
work: -0.625601827406585
even: -0.5448820509860358
would: -0.34233393217065655
product: -0.31062280211003035
less: -0.17304222733066404


In [47]:
y_pred = model2.predict(X_test)
y_pred

array([1, 1, 1, ..., 1, 1, 1])

In [48]:
y_pred_proba = model2.predict_proba(X_test)
y_pred_proba

array([[0.01352711, 0.98647289],
       [0.06670452, 0.93329548],
       [0.02304548, 0.97695452],
       ...,
       [0.06670452, 0.93329548],
       [0.20063672, 0.79936328],
       [0.01350256, 0.98649744]])

In [49]:
print("most positive: ", Test.iloc[y_pred_proba.T[1].argmax()]["review"])
print("\n most negative: ", Test.iloc[y_pred_proba.T[0].argmax()]["review"])

most positive:  Ive posted an UPDATE at the endFirst let me state that I have been buying products from Amazon for the past 12 years  I love Amazon  I love the reviews because they have always played a part sometimes more sometimes less in the products I decide to purchase  For some reason I have never left a review I know  Until now  I feel such an urge to express how much my husband and I love this strollerMy first son will turn 7 next month  We purchased an Evenflo strollercar seat travel system for him in 2005  We hated it  We used the car seat but never EVER used the stroller  It was big bulky heavy and a pain in the rear  I think my husband hated it even more than I did  I wore my first son everywhere instead of putting him in a stroller  Fast forward to 6 years later and we have our second son in September of 2011  I researched for months  I wanted to find a stroller that had several qualities  Looks  As shallow as it may seem I wanted a stroller that wasnt ugly  I wasnt interes

In [50]:
model2.score(X_test, y_test)

0.869059398518785

#### In this task, I repeated the operations from the previous tasks with a limited set of words. With the limited dictionary, test accuracy dropped from about 0.929 to about 0.869, roughly 6 percentage points, so the sentiment of reviews is classified noticeably worse.

In [51]:
#b)
for word, coef in zip(vectorizer.get_feature_names_out(), model2.coef_[0]):
    print(f"{word}: {coef}")

love: 1.3439075578970332
great: 0.9329650494541082
easy: 1.2021338040291283
old: 0.08777808125498429
little: 0.5090150069761615
perfect: 1.521129431612671
loves: 1.706828037752739
well: 0.5133242090864254
able: 0.2222710410010122
car: 0.05890451947624577
broke: -1.7132976788988898
less: -0.17304222733066404
even: -0.5448820509860358
waste: -2.0302995245571056
disappointed: -2.3258132820958752
work: -0.625601827406585
product: -0.31062280211003035
money: -0.9319705214499334
would: -0.34233393217065655
return: -2.0484741492654397


In [52]:
import sys, time

In [53]:
%%time
%%timeit
model.predict(temp_X_test)

4.78 ms ± 534 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
CPU times: user 3.84 s, sys: 12.9 ms, total: 3.85 s
Wall time: 3.85 s


In [54]:
%%time
%%timeit
model2.predict(X_test)

559 µs ± 25 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
CPU times: user 4.6 s, sys: 0 ns, total: 4.6 s
Wall time: 4.6 s


#### Finally, I checked how each word affects the prediction and compared the prediction time of both models. With the smaller dictionary the model is about 6 percentage points less accurate (0.869 vs 0.929), but prediction is over 8 times faster (559 µs vs 4.78 ms per call).
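The trade-off can be checked directly from the numbers reported in the outputs above. The snippet below only does the arithmetic on those reported values; the timings come from a single `%%timeit` run each, so the speed-up should be read as approximate.

```python
# Values copied from the notebook outputs above.
full_accuracy = 0.9289676471470121      # model.score with the full dictionary
limited_accuracy = 0.869059398518785    # model2.score with 20 significant words
full_time_s = 4.78e-3                   # mean time per predict call, full dictionary
limited_time_s = 559e-6                 # mean time per predict call, limited dictionary

accuracy_drop = full_accuracy - limited_accuracy   # in absolute accuracy
speedup = full_time_s / limited_time_s             # how many times faster

print(f"accuracy drop: {accuracy_drop * 100:.1f} percentage points")
print(f"speed-up: {speedup:.2f}x")
```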