### Today we are going to perform the simple classification of the amazon reviews' sentiment.

### Please, download the dataset amazon_baby.csv.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import string

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

def remove_punctuation(text):
    import string
    translator = str.maketrans('', '', string.punctuation)
    return text.translate(translator)

baby_df = pd.read_csv('amazon_baby.csv')
baby_df.head()

Unnamed: 0,name,review,rating
0,Planetwise Flannel Wipes,"These flannel wipes are OK, but in my opinion ...",3
1,Planetwise Wipe Pouch,it came early and was not disappointed. i love...,5
2,Annas Dream Full Quilt with 2 Shams,Very soft and comfortable and warmer than it l...,5
3,Stop Pacifier Sucking without tears with Thumb...,This is a product well worth the purchase. I ...,5
4,Stop Pacifier Sucking without tears with Thumb...,All of my kids have cried non-stop when I trie...,5


## Exercise 1 (data preparation)
a) Remove punctuation from reviews using the given function.   
b) Replace all missing (nan) revies with empty "" string.  
c) Drop all the entries with rating = 3, as they have neutral sentiment.   
d) Set all positive ($\geq$4) ratings to 1 and negative($\leq$2) to -1.

In [2]:
#b)
baby_df["review"] = baby_df["review"].fillna("")
#short test:
baby_df["review"][38] == baby_df["review"][38]

True

In [3]:
#a)
baby_df["review"] = baby_df["review"].apply(remove_punctuation)
#short test: 
baby_df["review"][4] == 'All of my kids have cried nonstop when I tried to ween them off their pacifier until I found Thumbuddy To Loves Binky Fairy Puppet  It is an easy way to work with your kids to allow them to understand where their pacifier is going and help them part from itThis is a must buy book and a great gift for expecting parents  You will save them soo many headachesThanks for this book  You all rock.,'
remove_punctuation(baby_df["review"][4]) == 'All of my kids have cried nonstop when I tried to ween them off their pacifier until I found Thumbuddy To Loves Binky Fairy Puppet  It is an easy way to work with your kids to allow them to understand where their pacifier is going and help them part from itThis is a must buy book and a great gift for expecting parents  You will save them soo many headachesThanks for this book  You all rock'

True

In [4]:
#c)
baby_df = baby_df[baby_df["rating"] != 3]
#short test:
sum(baby_df["rating"] == 3)

0

In [5]:
#d) 
baby_df["rating"] = baby_df["rating"].apply(lambda x: 1 if x >= 4 else -1)
#short test:
sum(baby_df["rating"]**2 != 1)

0

## CountVectorizer
In order to analyze strings, we need to assign them numerical values. We will use one of the simplest string representation, which transforms strings into the $n$ dimensional vectors. The number of dimensions will be the size of our dictionary, and then the values of the vector will represent the number of appereances of the given word in the sentence.

In [6]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
reviews_train_example = ["We like apples",
                   "We hate oranges",
                   "I adore bananas",
                   "We like like apples and oranges",
                   "They dislike bananas"]

X_train_example = vectorizer.fit_transform(reviews_train_example)

print(vectorizer.get_feature_names_out())
print(X_train_example.todense())



['adore' 'and' 'apples' 'bananas' 'dislike' 'hate' 'like' 'oranges' 'they'
 'we']
[[0 0 1 0 0 0 1 0 0 1]
 [0 0 0 0 0 1 0 1 0 1]
 [1 0 0 1 0 0 0 0 0 0]
 [0 1 1 0 0 0 2 1 0 1]
 [0 0 0 1 1 0 0 0 1 0]]


In [7]:
reviews_test_example = ["They like bananas",
                   "We hate oranges bananas and apples",
                   "We love bananas"] #New word!

X_test_example = vectorizer.transform(reviews_test_example)

print(X_test_example.todense())

[[0 0 0 1 0 0 1 0 1 0]
 [0 1 1 1 0 1 0 1 0 1]
 [0 0 0 1 0 0 0 0 0 1]]


We should acknowledge few facts. Firstly, CountVectorizer does not take order into account. Secondly, it ignores one-letter words (this can be changed during initialization). Finally, for test values, CountVectorizer ignores words which are not in it's dictionary.

## Exercise 2 
a) Split dataset into training and test sets.     
b) Transform reviews into vectors using CountVectorizer. 

In [8]:
#a)
train_df, test_df = train_test_split(baby_df, test_size=0.2, random_state=42)

In [9]:
#b)
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_df["review"])
print(vectorizer.get_feature_names_out())

['00' '000' '0001' ... 'zzzzzz' 'zzzzzzz' 'zzzzzzzzzzz']


## Exercise 3 
a) Train LogisticRegression model on training data (reviews processed with CountVectorizer, ratings as they were).   
b) Print 10 most positive and 10 most negative words.

In [10]:
#a)
model = LogisticRegression()
model.max_iter = 1000
model.fit(X_train, train_df["rating"])


In [11]:
#b)
print(model.coef_)
coefficients = model.coef_[0]
words = vectorizer.get_feature_names_out()
sorted_names = np.argsort(coefficients)
print("10 most negative words:")
for i in range(10):
    print(words[sorted_names[i]])
print("10 most positive words:")
for i in range(10):
    print(words[sorted_names[-i]])
#hint: model.coef_, vectorizer.get_feature_names()

[[ 0.0004661   0.00888183  0.00654983 ...  0.00893896  0.00385359
  -0.00011139]]
10 most negative words:
dissapointed
worthless
worst
useless
poorly
disappointing
unusable
disappointed
poor
unacceptable
10 most positive words:
dissapointed
lifesaver
minor
con
skeptical
ply
thankful
saves
rich
wonderfully


## Exercise 4 
a) Predict the sentiment of test data reviews.   
b) Predict the sentiment of test data reviews in terms of probability.   
c) Find five most positive and most negative reviews.   
d) Calculate the accuracy of predictions.

In [12]:
#a)
y_pred = model.predict(vectorizer.transform(test_df["review"]))
print(y_pred)

[ 1 -1 -1 ...  1  1  1]


In [13]:
#b)
y_pred_proba = model.predict_proba(vectorizer.transform(test_df["review"]))
print(y_pred_proba)
#hint: model.predict_proba()

[[4.62077715e-01 5.37922285e-01]
 [7.81199266e-01 2.18800734e-01]
 [9.71662347e-01 2.83376528e-02]
 ...
 [2.20355151e-05 9.99977964e-01]
 [8.81505081e-05 9.99911849e-01]
 [6.96894765e-02 9.30310523e-01]]


In [14]:
#c) 
best_reviews = np.argsort(y_pred_proba[:,1])[-5:]
worst_reviews = np.argsort(y_pred_proba[:,1])[:5]
print("5 most positive reviews:")
for i in range(5):
    print(test_df["review"].iloc[best_reviews[i]], "\n")

print("5 most negative reviews:")
for i in range(5):
    print(test_df["review"].iloc[worst_reviews[i]], "\n")
#hint: use the results of b)

5 most positive reviews:
We love this highchair  We have a 4 year old and an 8 month old  This is our 3rd highchairFeatures we loveFit  This chair FITS my infant daughter  She fits in this chair without the extra insert way better than in the basic Evenflo chair we had before  I only use the 3point harness and let her shoulders be free and she sits at a correct level so her arms can move around well and she can lean and reach for things on the tray  Many other chairs have a real problem with fit  So I do believe that with the insert this is the perfect chair to start your 4 month infant in for feeding  The insert will make them more secure kind of like their carseatTray Insert  With our other chairs the tray clicks down into the larger tray all the way aroundyou can remove it for cleaning  Fine  But I always hated that food got into the crack nearest to the baby so I couldnt just wipe it downI HAD to remove it to get out all the gunk  The tray on this is so smartthe edge closest to the

In [15]:
#d) 
accuracy = accuracy_score(test_df["rating"], y_pred)
print(accuracy)

0.9325657401577164


## Exercise 5
In this exercise we will limit the dictionary of CountVectorizer to the set of significant words, defined below.


a) Redo exercises 2-5 using limited dictionary.   
b) Check the impact of all the words from the dictionary.   
c) Compare accuracy of predictions and the time of evaluation.

In [16]:
significant_words = ['love','great','easy','old','little','perfect','loves','well','able','car','broke','less','even','waste','disappointed','work','product','money','would','return']

In [17]:
#a)
vectorizer = CountVectorizer(vocabulary=significant_words)
X_train = vectorizer.fit_transform(train_df["review"])
print(vectorizer.get_feature_names_out())

model_smaller = LogisticRegression()
model_smaller.max_iter = 1000
model_smaller.fit(X_train, train_df["rating"])
print(model_smaller.coef_)
coefficients = model_smaller.coef_[0]
words = vectorizer.get_feature_names_out()
sorted_names = np.argsort(coefficients)
print("10 most negative words:")
for i in range(10):
    print(words[sorted_names[i]])
print("10 most positive words:")
for i in range(10):
    print(words[sorted_names[-i]])

y_pred = model_smaller.predict(vectorizer.transform(test_df["review"]))
print(y_pred)

y_pred_proba = model_smaller.predict_proba(vectorizer.transform(test_df["review"]))
print(y_pred_proba)

best_reviews = np.argsort(y_pred_proba[:,1])[-5:]
worst_reviews = np.argsort(y_pred_proba[:,1])[:5]
print("5 most positive reviews:")
for i in range(5):
    print(test_df["review"].iloc[best_reviews[i]], "\n")

print("5 most negative reviews:")
for i in range(5):
    print(test_df["review"].iloc[worst_reviews[i]], "\n")
accuracy = accuracy_score(test_df["rating"], y_pred)
print(accuracy)

['love' 'great' 'easy' 'old' 'little' 'perfect' 'loves' 'well' 'able'
 'car' 'broke' 'less' 'even' 'waste' 'disappointed' 'work' 'product'
 'money' 'would' 'return']
[[ 1.35900018  0.93088153  1.19322423  0.07344137  0.50243139  1.51506778
   1.68497171  0.49619639  0.19326954  0.0745294  -1.68064024 -0.20157026
  -0.4897191  -1.97957053 -2.39875109 -0.63564887 -0.31372724 -0.94642423
  -0.34223904 -2.09283648]]
10 most negative words:
disappointed
return
waste
broke
money
work
even
would
product
less
10 most positive words:
disappointed
loves
perfect
love
easy
great
little
well
able
car
[1 1 1 ... 1 1 1]
[[0.07624749 0.92375251]
 [0.21395717 0.78604283]
 [0.21395717 0.78604283]
 ...
 [0.04229028 0.95770972]
 [0.10860443 0.89139557]
 [0.09165205 0.90834795]]
5 most positive reviews:
UPDATE 112013  I went ahead and used a tiny bit of WD40 and also thoroughly cleaned the carseat base and stroller and all the mechanisms started working like new again  No more issues opening or closing the

In [18]:
#b)
specific_word = "good"  # przykład konkretnego słowa
index = vectorizer.vocabulary_.get(specific_word)  # Indeks słowa w macierzy wektorów
print(f"Współczynnik dla słowa '{specific_word}': {coefficients[index]}")


Współczynnik dla słowa 'good': [[ 1.35900018  0.93088153  1.19322423  0.07344137  0.50243139  1.51506778
   1.68497171  0.49619639  0.19326954  0.0745294  -1.68064024 -0.20157026
  -0.4897191  -1.97957053 -2.39875109 -0.63564887 -0.31372724 -0.94642423
  -0.34223904 -2.09283648]]


In [19]:
#c)

time = %timeit -o model.predict(vectorizer.transform(test_df["review"]))
print(time)
time = %timeit -o model_smaller.predict(vectorizer.transform(test_df["review"]))
print(time)
#hint: %time, %timeit

ValueError: X has 20 features, but LogisticRegression is expecting 121805 features as input.