### Today we are going to perform a simple sentiment classification of Amazon reviews.

### Please download the dataset amazon_baby.csv.

In [1]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

import warnings
warnings.filterwarnings('ignore')  # suppress the "too many iterations" convergence warning

def remove_punctuation(text):
    import string
    translator = str.maketrans('', '', string.punctuation)
    return text.translate(translator)

baby_df = pd.read_csv('amazon_baby.csv')
baby_df.head()
# First five records in DataFrame

Unnamed: 0,name,review,rating
0,Planetwise Flannel Wipes,"These flannel wipes are OK, but in my opinion ...",3
1,Planetwise Wipe Pouch,it came early and was not disappointed. i love...,5
2,Annas Dream Full Quilt with 2 Shams,Very soft and comfortable and warmer than it l...,5
3,Stop Pacifier Sucking without tears with Thumb...,This is a product well worth the purchase. I ...,5
4,Stop Pacifier Sucking without tears with Thumb...,All of my kids have cried non-stop when I trie...,5


## Exercise 1 (data preparation)
a) Remove punctuation from reviews using the given function.   
b) Replace all missing (nan) reviews with an empty "" string.  
c) Drop all the entries with rating = 3, as they have neutral sentiment.   
d) Set all positive ($\geq$4) ratings to 1 and negative ($\leq$2) to -1.

In [2]:
#a)
# fill missing reviews first: str(NaN) would otherwise turn them into the text "nan"
baby_df.review = baby_df.review.fillna("").apply(remove_punctuation)
#short test:
print(baby_df["review"][4] == 'All of my kids have cried nonstop when I tried to ween them off their pacifier until I found Thumbuddy To Loves Binky Fairy Puppet  It is an easy way to work with your kids to allow them to understand where their pacifier is going and help them part from itThis is a must buy book and a great gift for expecting parents  You will save them soo many headachesThanks for this book  You all rock')
print(remove_punctuation(baby_df["review"][4]) == 'All of my kids have cried nonstop when I tried to ween them off their pacifier until I found Thumbuddy To Loves Binky Fairy Puppet  It is an easy way to work with your kids to allow them to understand where their pacifier is going and help them part from itThis is a must buy book and a great gift for expecting parents  You will save them soo many headachesThanks for this book  You all rock')

True
True


In [3]:
#b)
baby_df = baby_df.fillna("")
#short test (NaN != NaN, so this is True only when the entry is not missing):
print(baby_df["review"][38] == baby_df["review"][38])

True


In [20]:
def estimation(rate):
    # map a star rating to a label: 1 (positive), -1 (negative), 0 (neutral)
    if rate == 3:
        return 0
    elif rate > 3:
        return 1
    else:
        return -1

#c)
# note: neutral reviews are relabelled 0 here; their rows stay in baby_df
baby_df.rating = baby_df.rating.apply(estimation)
#short test:
sum(baby_df["rating"] == 3) # after the mapping no entry keeps the raw value 3

0

In [21]:
#d)
# the estimation function applied in c) already produces the labels 1 and -1
#short test:
sum(baby_df["rating"]**2 != 1)
# counts the entries whose label is neither -1 nor 1

0
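The four preparation steps can also be written so that neutral rows are really removed, as step c) asks. The following is only a minimal sketch on a hypothetical toy DataFrame (`toy_df` and its contents are made up for illustration); it fills missing reviews before stripping punctuation so that the helper only ever receives strings.

```python
import string

import pandas as pd


def remove_punctuation(text):
    # Same idea as the helper given at the top of the notebook.
    return text.translate(str.maketrans('', '', string.punctuation))


# Hypothetical toy data standing in for amazon_baby.csv.
toy_df = pd.DataFrame({
    "review": ["Love it!", None, "It's okay...", "Broke in a day."],
    "rating": [5, 4, 3, 1],
})

# b) fill missing reviews first, so that a) only ever sees strings
toy_df["review"] = toy_df["review"].fillna("").apply(remove_punctuation)

# c) drop the neutral rows entirely
toy_df = toy_df[toy_df["rating"] != 3].copy()

# d) map the remaining ratings to +1 / -1
toy_df["rating"] = toy_df["rating"].apply(lambda r: 1 if r >= 4 else -1)

print(toy_df)
```

With the neutral rows gone, only the labels -1 and 1 remain, which makes the later classification a two-class problem.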

## CountVectorizer
In order to analyze strings, we need to assign them numerical values. We will use one of the simplest string representations, which transforms strings into $n$-dimensional vectors. The number of dimensions is the size of our dictionary, and each value of the vector is the number of appearances of the corresponding word in the sentence.

In [6]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
reviews_train_example = ["We like apples",
                   "We hate oranges",
                   "I adore bananas",
                   "We like like apples and oranges",
                   "They dislike bananas"]

X_train_example = vectorizer.fit_transform(reviews_train_example)

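# get_feature_names() was removed in newer scikit-learn releases; get_feature_names_out() replaces it
print(vectorizer.get_feature_names_out().tolist())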
print(X_train_example.todense())

['adore', 'and', 'apples', 'bananas', 'dislike', 'hate', 'like', 'oranges', 'they', 'we']
[[0 0 1 0 0 0 1 0 0 1]
 [0 0 0 0 0 1 0 1 0 1]
 [1 0 0 1 0 0 0 0 0 0]
 [0 1 1 0 0 0 2 1 0 1]
 [0 0 0 1 1 0 0 0 1 0]]


In [7]:
reviews_test_example = ["They like bananas",
                   "We hate oranges bananas and apples",
                   "We love bananas"] #New word!

X_test_example = vectorizer.transform(reviews_test_example)

print(X_test_example.todense())

[[0 0 0 1 0 0 1 0 1 0]
 [0 1 1 1 0 1 0 1 0 1]
 [0 0 0 1 0 0 0 0 0 1]]


We should acknowledge a few facts. Firstly, CountVectorizer does not take word order into account. Secondly, it ignores one-letter words (this can be changed during initialization). Finally, when transforming test data, CountVectorizer ignores words which are not in its dictionary.

## Exercise 2 
a) Split dataset into training and test sets.     
b) Transform reviews into vectors using CountVectorizer. 

In [8]:
#a)
import time
start = time.time()

review_train, review_test, rating_train, rating_test = train_test_split(
    baby_df.review, baby_df.rating, test_size=0.3, random_state=44)
# split the dataset into training and test sets in a 7:3 ratio

In [9]:
#b)
vectorizer = CountVectorizer()
vectorizer.fit(baby_df.review)
# build the dictionary from all reviews (training and test)
review_train_V = vectorizer.transform(review_train)
review_test_V = vectorizer.transform(review_test)
# turn each review into a sparse vector of word counts

## Exercise 3 
a) Train LogisticRegression model on training data (reviews processed with CountVectorizer, ratings as they were).   
b) Print 10 most positive and 10 most negative words.

In [10]:
#a)
model = LogisticRegression()
model.fit(review_train_V, rating_train)
# create a logistic regression model and fit it to the training data

LogisticRegression()

In [11]:
#b)
# with the labels -1, 0 and 1, model.coef_ has one row per label; row 0 belongs to -1
indexes = np.argsort(model.coef_) # per row: word indexes sorted by ascending coefficient
words = vectorizer.get_feature_names_out() # feature names, ordered by column index

print("The most positive: ", words[indexes[0, :10]]) # 10 lowest coefficients for label -1
print("The most negative: ", words[indexes[0,-10:]]) # 10 highest coefficients for label -1

The most positive:  ['awesome' 'highly' 'excellent' 'exactly' 'glad' 'perfectly' 'pleased'
 'beautiful' 'helps' 'complaint']
The most negative:  ['horrible' 'poorly' 'returned' 'returning' 'disappointing' 'poor'
 'useless' 'terrible' 'waste' 'worst']


## Exercise 4 
a) Predict the sentiment of test data reviews.   
b) Predict the sentiment of test data reviews in terms of probability.   
c) Find five most positive and most negative reviews.   
d) Calculate the accuracy of predictions.

In [12]:
#a)
rating_pred = model.predict(review_test_V) # predict one label (-1, 0 or 1) per test review with the fitted model

In [13]:
#b)
rating_pred_proba = model.predict_proba(review_test_V)
# predict returns a single label per review (-1, 0 or 1)
# predict_proba returns one probability per label, with columns ordered as in model.classes_

In [14]:
#c)
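# columns of predict_proba follow model.classes_, here the labels -1, 0, 1 from Exercise 1
indexes_P = np.argsort(rating_pred_proba[:,1])[-5:] # column 1 is the neutral label 0; label 1 sits in column 2
print(review_test.iloc[indexes_P])

indexes_N = np.argsort(rating_pred_proba[:,0])[-5:] # column 0 is the negative label -1
print(review_test.iloc[indexes_N])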

time1 = time.time() - start

121321    I cant say this worked as a trainer cup since ...
181321    I just bought this after my Withings stopped w...
33939     I bought the Natures Touch Papasan Cradle Swin...
10416     February 2011 3 starsI was excited when one of...
183236    I am not going to preach to you all about how ...
Name: review, dtype: object
140655    I had an Advocate 65 prior to buying this 70 m...
10180     Please see my email to the companyHelloI am wr...
120209    This is the first review I have ever written o...
77987     Bought August 2010 after reading reviews about...
147902    My disappointment with this product prompted m...
Name: review, dtype: object
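Because the column order of `predict_proba` follows `model.classes_`, it is safer to look the column up by label than to hard-code its position. The sketch below uses made-up probabilities and a hypothetical `classes` array in place of a fitted model.

```python
import numpy as np

# Hypothetical values: class labels as stored in model.classes_ and
# a matching predict_proba output (one row per review, one column per class).
classes = np.array([-1, 0, 1])
proba = np.array([
    [0.70, 0.20, 0.10],
    [0.10, 0.80, 0.10],
    [0.05, 0.15, 0.80],
    [0.30, 0.10, 0.60],
])

# Look the column up by label instead of hard-coding its position.
positive_col = int(np.where(classes == 1)[0][0])
negative_col = int(np.where(classes == -1)[0][0])

# Row positions of the two most positive / most negative reviews, strongest first.
most_positive = np.argsort(proba[:, positive_col])[::-1][:2]
most_negative = np.argsort(proba[:, negative_col])[::-1][:2]

print(positive_col, most_positive)
print(negative_col, most_negative)
```

The resulting positions can be passed to `.iloc` on the test reviews, as the cell above does.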


In [15]:
#d)
AoP = sum(rating_pred==rating_test)/len(rating_test)
print(AoP)

0.8469124591354885
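Accuracy is simply the share of test reviews whose predicted label equals the true one. The sketch below, on made-up labels, computes it with NumPy and, as an optional point of comparison that is not part of the exercise, the share of the most frequent true label.

```python
import numpy as np

# Hypothetical labels, standing in for rating_test and rating_pred.
y_true = np.array([1, -1, 1, 0, 1, -1])
y_hat = np.array([1, -1, 0, 0, 1, 1])

# Accuracy is the share of positions where prediction and truth agree.
accuracy = np.mean(y_hat == y_true)

# Share of the most frequent true label: what always guessing it would score.
labels, counts = np.unique(y_true, return_counts=True)
majority_share = counts.max() / len(y_true)

print(accuracy, majority_share)
```

`sklearn.metrics.accuracy_score` computes the same quantity as the first expression.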


## Exercise 5
In this exercise we will limit the dictionary of CountVectorizer to the set of significant words, defined below.


a) Redo exercises 2-4 using the limited dictionary.   
b) Check the impact of all the words from the dictionary.   
c) Compare the accuracy of predictions and the time of evaluation.

In [16]:
significant_words = ['love','great','easy','old','little','perfect','loves','well','able','car','broke','less','even','waste','disappointed','work','product','money','would','return']

In [17]:
#a)

X_train, X_test, y_train, y_test = train_test_split(baby_df.review,baby_df.rating, test_size=0.3, random_state=44)


In [18]:
#b)
start2= time.time()
vectorizer = CountVectorizer()
vectorizer.fit(significant_words)

X_train_V = vectorizer.transform(X_train.values)
X_test_V = vectorizer.transform(X_test.values)

model = LogisticRegression()
model.fit(X_train_V, y_train)

# as before: row 0 of model.coef_ belongs to the label -1
indexes = np.argsort(model.coef_)
words = vectorizer.get_feature_names_out()
print("The most positive: ", words[indexes[0,:10]])
print("The most negative: ", words[indexes[0, -10:]])


y_pred = model.predict(X_test_V)

y_pred_proba = model.predict_proba(X_test_V)

indexes_P = np.argsort(y_pred_proba[:,1])[-5:]
print(X_test.iloc[indexes_P])
indexes_N = np.argsort(y_pred_proba[:,0])[-5:]
print(X_test.iloc[indexes_N])

end2 = time.time()

AoP2 = sum(y_pred==y_test)/len(y_test)
# with only 20 words in the dictionary, the ranked words and the selected reviews differ completely from the full-dictionary run

The most positive:  ['loves' 'perfect' 'love' 'easy' 'great' 'little' 'well' 'able' 'old'
 'car']
The most negative:  ['less' 'would' 'product' 'work' 'even' 'money' 'broke' 'return'
 'disappointed' 'waste']
11978     After reading the other reviews I have a coupl...
9592      My husband and I purchased Avent bottles for o...
134999    Let me begin with the fact that the monitor wo...
56757     I bought this bedding set for my little guy be...
171041    Ill just say in advance that the haters can ju...
Name: review, dtype: object
99430     Before My husband  I bought these we did read ...
143020    Totally cheap Wish I could return it to China ...
3746      Prior to parenthood I had heard several parent...
168391    I loved all the features of the car seat  It i...
6286      I am extremely disappointed with this productw...
Name: review, dtype: object


In [19]:
#c)
print("Time without limited dictionary ", time1)
print("Time with limited dictionary ", end2 - start2)
print("Accuracy of predictions without limited dictionary ", AoP)
print("Accuracy of predictions with limited dictionary ", AoP2)
# The limited dictionary cuts the run time from about 59 s to about 19 s, but lowers the accuracy from about 0.847 to about 0.790

Time without limited dictionary  59.23413372039795
Time with limited dictionary  19.462774753570557
Accuracy of predictions without limited dictionary  0.8469124591354885
Accuracy of predictions with limited dictionary  0.789847439157283
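The two timings above do not cover the same steps: the first starts before the train/test split and spans several cells, the second starts after the split and covers one cell. A small helper that times one call at a time makes such comparisons easier to control. This is only a sketch; `timed` is a hypothetical helper and the workload is a stand-in.

```python
import time


def timed(fn, *args, **kwargs):
    # Hypothetical helper: run fn once and return (result, elapsed seconds).
    begin = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - begin


# Stand-in workload; in the notebook this would be the vectorise/fit/predict steps.
total, seconds = timed(sum, range(1_000_000))
print(total, f"{seconds:.4f} s")
```

Wrapping the same steps (for example only `model.fit`) in both runs would give numbers that can be compared directly.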
