### Today we are going to perform the simple classification of the amazon reviews' sentiment.

### Please, download the dataset amazon_baby.csv.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import string
from sklearn.linear_model import LogisticRegression

def remove_punctuation(text):
    import string
    translator = str.maketrans('', '', string.punctuation)
    return text.translate(translator)

baby_df = pd.read_csv('amazon_baby.csv')
baby_df.head()

Unnamed: 0,name,review,rating
0,Planetwise Flannel Wipes,"These flannel wipes are OK, but in my opinion ...",3
1,Planetwise Wipe Pouch,it came early and was not disappointed. i love...,5
2,Annas Dream Full Quilt with 2 Shams,Very soft and comfortable and warmer than it l...,5
3,Stop Pacifier Sucking without tears with Thumb...,This is a product well worth the purchase. I ...,5
4,Stop Pacifier Sucking without tears with Thumb...,All of my kids have cried non-stop when I trie...,5


The dataset conatains 3 columns:
* name - The name of the product
* review - text review of given product
* rating - numercial rating of given product

## Exercise 1 (data preparation)
a) Remove punctuation from reviews using the given function.   
b) Replace all missing (nan) revies with empty "" string.  
c) Drop all the entries with rating = 3, as they have neutral sentiment.   
d) Set all positive ($\geq$4) ratings to 1 and negative($\leq$2) to -1.

In [2]:
#a)
baby_df['review'] = baby_df['review'].apply(lambda x: remove_punctuation(str(x)))
baby_df.head()

Unnamed: 0,name,review,rating
0,Planetwise Flannel Wipes,These flannel wipes are OK but in my opinion n...,3
1,Planetwise Wipe Pouch,it came early and was not disappointed i love ...,5
2,Annas Dream Full Quilt with 2 Shams,Very soft and comfortable and warmer than it l...,5
3,Stop Pacifier Sucking without tears with Thumb...,This is a product well worth the purchase I h...,5
4,Stop Pacifier Sucking without tears with Thumb...,All of my kids have cried nonstop when I tried...,5


In [3]:
#b)
baby_df['review'] = baby_df['review'].fillna('')

In [4]:
#c)
print('Before: ', sum(baby_df["rating"] == 3))
baby_df = baby_df[baby_df.rating != 3]
print('After: ', sum(baby_df["rating"] == 3))

Before:  16779
After:  0


In [5]:
#d) 
print('Before:', sum(baby_df["rating"]**2 != 1))
baby_df['rating'] = baby_df['rating'].apply(lambda x: 1 if x>=4 else -1)
print('After:', sum(baby_df["rating"]**2 != 1))
baby_df.head()

Before: 151569
After: 0


Unnamed: 0,name,review,rating
1,Planetwise Wipe Pouch,it came early and was not disappointed i love ...,1
2,Annas Dream Full Quilt with 2 Shams,Very soft and comfortable and warmer than it l...,1
3,Stop Pacifier Sucking without tears with Thumb...,This is a product well worth the purchase I h...,1
4,Stop Pacifier Sucking without tears with Thumb...,All of my kids have cried nonstop when I tried...,1
5,Stop Pacifier Sucking without tears with Thumb...,When the Binky Fairy came to our house we didn...,1


## CountVectorizer
In order to analyze strings, we need to assign them numerical values. We will use one of the simplest string representation, which transforms strings into the $n$ dimensional vectors. The number of dimensions will be the size of our dictionary, and then the values of the vector will represent the number of appereances of the given word in the sentence.

In [6]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
reviews_train_example = ["We like apples",
                   "We hate oranges",
                   "I adore bananas",
                   "We like like apples and oranges",
                   "They dislike bananas"]

X_train_example = vectorizer.fit_transform(reviews_train_example)

print(vectorizer.get_feature_names())
print(X_train_example.todense())



['adore', 'and', 'apples', 'bananas', 'dislike', 'hate', 'like', 'oranges', 'they', 'we']
[[0 0 1 0 0 0 1 0 0 1]
 [0 0 0 0 0 1 0 1 0 1]
 [1 0 0 1 0 0 0 0 0 0]
 [0 1 1 0 0 0 2 1 0 1]
 [0 0 0 1 1 0 0 0 1 0]]




In [7]:
reviews_test_example = ["They like bananas",
                   "We hate oranges bananas and apples",
                   "We love bananas"] #New word!

X_test_example = vectorizer.transform(reviews_test_example)

print(X_test_example.todense())

[[0 0 0 1 0 0 1 0 1 0]
 [0 1 1 1 0 1 0 1 0 1]
 [0 0 0 1 0 0 0 0 0 1]]


We should acknowledge few facts. Firstly, CountVectorizer does not take order into account. Secondly, it ignores one-letter words (this can be changed during initialization). Finally, for test values, CountVectorizer ignores words which are not in it's dictionary.

## Exercise 2 
a) Split dataset into training and test sets.     
b) Transform reviews into vectors using CountVectorizer. 

In [8]:
#a)
from sklearn.model_selection import train_test_split
X = baby_df['review'][:25000]
y = baby_df['rating'][:25000]
reviews_train, reviews_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=43)

In [9]:
#b)
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(reviews_train)
X_test = vectorizer.transform(reviews_test)

After these transormations we have test and train sets which contain numerical values assigned to strings. This means they are ready to be analyzed.

## Exercise 3 
a) Train LogisticRegression model on training data (reviews processed with CountVectorizer, ratings as they were).   
b) Print 10 most positive and 10 most negative words.

In [10]:
#a)
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

In [11]:
#b)
coef = model.coef_.tolist().pop()
indexes_neg = sorted(range(len(coef)), key=lambda x: coef[x])[:10]
indexes_pos = sorted(range(len(coef)), key=lambda x: coef[x])[-10:]
print(f'Most negative words: {[vectorizer.get_feature_names()[i] for i in indexes_neg]}')
print(f'Most positive words: {[vectorizer.get_feature_names()[i] for i in indexes_pos]}')
#hint: model.coef_, vectorizer.get_feature_names()



Most negative words: ['poor', 'disappointed', 'worst', 'returned', 'returning', 'concept', 'worse', 'unsafe', 'horrible', 'waste']
Most positive words: ['awesome', 'best', 'pleased', 'lifesaver', 'loves', 'love', 'perfect', 'easy', 'highly', 'excellent']


As we can see - the given results of most positive and negative words contains words that are usually associated with positive/negative wording. This means top 10 results are satysfying.

## Exercise 4 
a) Predict the sentiment of test data reviews.   
b) Predict the sentiment of test data reviews in terms of probability.   
c) Find five most positive and most negative reviews.   
d) Calculate the accuracy of predictions.

In [12]:
#a)
sentiment = model.predict(X_test)

In [13]:
#b)
sentiment_prob = model.predict_proba(X_test)
#hint: model.predict_proba()

In [14]:
#c) 
neg_sentiment = sorted(range(len(sentiment_prob)), key=lambda x:sentiment_prob[:, 0][x])[-5:]
pos_sentiment = sorted(range(len(sentiment_prob)), key=lambda x:sentiment_prob[:, 1][x])[-5:]
print(f'Most negative reviews indices: {neg_sentiment}')
print(f'Most positive reviews indices: {pos_sentiment}')
reviews_neg = [np.array(baby_df['review'])[x] for x in neg_sentiment]
reviews_pos = [np.array(baby_df['review'])[x] for x in pos_sentiment]
print('-----------------------------------')
print('Most negative reviews:')
for i in range(len(reviews_neg)):
    print(f'Most negative review {i+1}: {reviews_neg[i]}')
    print('-----------------------------------')
print('Most positive reviews:')
for i in range(len(reviews_pos)):
    print(f'Most positive review {i+1}: {reviews_pos[i]}')
    print('-----------------------------------')

#hint: use the results of b)

Most negative reviews indices: [9928, 9170, 11864, 3399, 8710]
Most positive reviews indices: [4943, 3232, 8080, 747, 4614]
-----------------------------------
Most negative reviews:
Most negative review 1: I originally had the deluxe swing and my baby loved it so much so he wore it out I immediately went to purchase another and found only the Aquarium Take Along was avaliable I am concerned about its safety as it pushes his chin to his chest and also allows his head to lean to each side to the point it rests on his shoulder I believe this is set up for airway impairment I will be returning this item and will not purchase another Fisher Price product with the same seat design
-----------------------------------
Most negative review 2: For lightweight quick and easy transport of a toddler nothing beats this Chicco  Its built much better than the 1015 variety you can get at several big box stores or drugstores but is nearly as compact and boasts many other features the cheapest dont offe

Comparing to top 10 positive/negative words, there is a difference between results in positive/negative reviews. The reviews from given results usually have a positive sentiment, however the prediction is based on weighting of each word. This means that if there is a word with large negative sentiment in a review with mostly positive sentiment words but with low weigthings - the overall review sentiment will be negative. Because of that, the results are expected to be as they are given.

In [15]:
#d) 
acc_1 = model.score(X_test, y_test)
print(f'Accuracy: {acc_1}')

Accuracy: 0.9008


The score of the model is fairly high so the model seems well trained.

## Exercise 5
In this exercise we will limit the dictionary of CountVectorizer to the set of significant words, defined below.


a) Redo exercises 2-5 using limited dictionary.   
b) Check the impact of all the words from the dictionary.   
c) Compare accuracy of predictions and the time of evaluation.

In [16]:
significant_words = ['love','great','easy','old','little','perfect','loves','well','able','car','broke','less','even','waste','disappointed','work','product','money','would','return']

In [17]:
#a)
X = baby_df['review']
y = baby_df['rating']
reviews_train, reviews_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=43)

vectorizer = CountVectorizer(vocabulary=significant_words)
X_train = vectorizer.fit_transform(reviews_train)
X_test = vectorizer.transform(reviews_test)

model_ = LogisticRegression(max_iter=1000)
model_.fit(X_train, y_train)

coef = model_.coef_.tolist().pop()
indexes_neg = sorted(range(len(coef)), key=lambda x: coef[x])[:10]
indexes_pos = sorted(range(len(coef)), key=lambda x: coef[x])[-10:]
print(f'Most negative words: {[vectorizer.get_feature_names()[i] for i in indexes_neg]}')
print(f'Most positive words: {[vectorizer.get_feature_names()[i] for i in indexes_pos]}')

sentiment = model_.predict(X_test)

sentiment_prob = model_.predict_proba(X_test)

neg_sentiment = sorted(range(len(sentiment_prob)), key=lambda x:sentiment_prob[:, 0][x])[-5:]
pos_sentiment = sorted(range(len(sentiment_prob)), key=lambda x:sentiment_prob[:, 1][x])[-5:]
print(f'Most negative reviews indices: {neg_sentiment}')
print(f'Most positive reviews indices: {pos_sentiment}')
reviews_neg = [np.array(baby_df['review'])[x] for x in neg_sentiment]
reviews_pos = [np.array(baby_df['review'])[x] for x in pos_sentiment]
print('-----------------------------------')
print('Most negative reviews:')
for i in range(len(reviews_neg)):
    print(f'Most negative review {i+1}: {reviews_neg[i]}')
    print('-----------------------------------')
print('Most positive reviews:')
for i in range(len(reviews_pos)):
    print(f'Most positive review {i+1}: {reviews_pos[i]}')
    print('-----------------------------------')

acc_2 = model_.score(X_test, y_test)

Most negative words: ['disappointed', 'waste', 'return', 'broke', 'money', 'work', 'even', 'would', 'product', 'less']
Most positive words: ['old', 'car', 'able', 'well', 'little', 'great', 'easy', 'love', 'perfect', 'loves']
Most negative reviews indices: [54636, 59656, 24581, 23107, 74902]
Most positive reviews indices: [60336, 60515, 10503, 57036, 66569]
-----------------------------------
Most negative reviews:
Most negative review 1: These are so useful when going out to eat  My son loves the images on them as well  Glad I found out about them
-----------------------------------
Most negative review 2: This bed works well with my special needs grandson He sleeps great on it The only thing I did not like was it takes awhile to inflate  its Loud
-----------------------------------
Most negative review 3: I got this for my son when he was still not steady sitting up for long periods I really like the extra cushioning and it packed up really well But it did not fit the carts at Target



Comparing to results from model that was using no limitations of dictionary:
* the results of some negative/positive words from dictionary are not very intuitive: for example words old and car are positive words. But overall results of wording are expected.
* the results of positive/negative reviews are simillar, however there is a clear negative review in most positive sentiment reviews. This is caused by words that are negative but not classified because of given dictionary.

In [18]:
#b)
coef = model_.coef_.tolist().pop()
for i, x in enumerate(significant_words):
    print(f'{x} : {coef[i]}')

love : 1.3698976524524904
great : 0.9369617326186344
easy : 1.119406685765037
old : 0.07748633667492044
little : 0.5156942318384039
perfect : 1.4459681844126886
loves : 1.7400455398877799
well : 0.4680548341402263
able : 0.1883177748776173
car : 0.09254342689592518
broke : -1.7724302419436613
less : -0.11846486377784393
even : -0.5432893015603202
waste : -2.0993806893927776
disappointed : -2.45668585062283
work : -0.6417252618292573
product : -0.3047002359770949
money : -0.8745018680409571
would : -0.33243375548346604
return : -2.0164142040923516


The impact of the words are expected and intuitive. However, there are some words with interesting impact values such as 'work', 'would', and 'product' that thave negative values, where 'old' has slightly positive impact.

For c) I wam going to use a slice of whole baby_df because the evaluation time with %timeit was way too long

In [19]:
#c)
X = baby_df['review'][:30000]
y = baby_df['rating'][:30000]
reviews_train, reviews_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=43)
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(reviews_train)
X_test = vectorizer.transform(reviews_test)
model_a = LogisticRegression(max_iter=1000)
print('Timing of fit of model without limited dictionary: ')
%timeit model_a.fit(X_train, y_train)
acc_a = model_a.score(X_test, y_test)

reviews_train, reviews_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=43)
vectorizer = CountVectorizer(vocabulary=significant_words)
X_train = vectorizer.fit_transform(reviews_train)
X_test = vectorizer.transform(reviews_test)
model_b = LogisticRegression(max_iter=1000)
print('Timing of fit of model with limited dictionary: ')
%timeit model_b.fit(X_train, y_train)
acc_b = model_b.score(X_test, y_test)
print('\nAccuracy from these models:')
print(f'Accuracy 1 = {acc_a}    |    Accuracy 2 = {acc_b}')

print('\nAccuracy from previous models:')
print(f'Accuracy 1 = {acc_1}    |    Accuracy 2 = {acc_2}')

#hint: %time, %timeit

Timing of fit of model without limited dictionary: 
14.3 s ± 2.7 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
Timing of fit of model with limited dictionary: 
25.5 ms ± 728 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Accuracy from these models:
Accuracy 1 = 0.9060666666666667    |    Accuracy 2 = 0.8426

Accuracy from previous models:
Accuracy 1 = 0.9008    |    Accuracy 2 = 0.8657407407407407


Comparing the evaluation time - the model that uses given dictionary has much better times. However, the better calculation time is at the cost of words limitation. With fewer words in consideration, the second model has worse score from the first one. It is visible both on models that used sliced data and on model that used whole columns.