### Today we are going to perform a simple classification of the sentiment of Amazon reviews.

### Please, download the dataset amazon_baby.csv.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import string
from sklearn.linear_model import LogisticRegression

def remove_punctuation(text):
    import string
    translator = str.maketrans('', '', string.punctuation)
    return text.translate(translator)

baby_df = pd.read_csv('amazon_baby.csv')
baby_df.head()

Unnamed: 0,name,review,rating
0,Planetwise Flannel Wipes,"These flannel wipes are OK, but in my opinion ...",3
1,Planetwise Wipe Pouch,it came early and was not disappointed. i love...,5
2,Annas Dream Full Quilt with 2 Shams,Very soft and comfortable and warmer than it l...,5
3,Stop Pacifier Sucking without tears with Thumb...,This is a product well worth the purchase. I ...,5
4,Stop Pacifier Sucking without tears with Thumb...,All of my kids have cried non-stop when I trie...,5


## Exercise 1 (data preparation)
a) Remove punctuation from reviews using the given function.   
b) Replace all missing (NaN) reviews with the empty string "".  
c) Drop all the entries with rating = 3, as they have neutral sentiment.   
d) Set all positive ($\geq$4) ratings to 1 and negative ($\leq$2) to -1.

In [2]:
#b)
baby_df['review'] = baby_df['review'].fillna("")
#short test:
baby_df["review"][38] == baby_df["review"][38]

True

I changed the order of operations. First, I had to replace the missing values with empty strings, so that the later functions would have no problem transforming the data.

Cleansing is essential because you can't obtain good results from bad data, even with the best algorithm.
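The ordering matters because pandas represents a missing review as a float `NaN`, which has no `translate` method. The following is a minimal sketch of that failure and of the fix, using a plain float and a made-up sample string rather than the real dataset:

```python
import string

def remove_punctuation(text):
    translator = str.maketrans('', '', string.punctuation)
    return text.translate(translator)

missing = float('nan')  # stand-in for a missing review value

# Calling the cleaner on a missing value fails: floats have no .translate().
try:
    remove_punctuation(missing)
    failed = False
except AttributeError:
    failed = True

# NaN is the only value that is not equal to itself, which is also
# what the short test `x == x` checks after fillna("").
filled = "" if missing != missing else missing
cleaned = remove_punctuation(filled)

print(failed, repr(cleaned))
print(remove_punctuation("Hello, world!"))
```

Once the missing value has been replaced by `""`, the same function runs without error and simply returns an empty string.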

In [3]:
#a)
baby_df['review'] = baby_df['review'].apply(remove_punctuation)

#short test: 
baby_df["review"][4] == 'All of my kids have cried nonstop when I tried to ween them off their pacifier until I found Thumbuddy To Loves Binky Fairy Puppet  It is an easy way to work with your kids to allow them to understand where their pacifier is going and help them part from itThis is a must buy book and a great gift for expecting parents  You will save them soo many headachesThanks for this book  You all rock'
remove_punctuation(baby_df["review"][4]) == 'All of my kids have cried nonstop when I tried to ween them off their pacifier until I found Thumbuddy To Loves Binky Fairy Puppet  It is an easy way to work with your kids to allow them to understand where their pacifier is going and help them part from itThis is a must buy book and a great gift for expecting parents  You will save them soo many headachesThanks for this book  You all rock'


True

Here we get rid of the neutral reviews. The maximum rating is 5 and the minimum is 1, so the neutral rating is the midpoint, 3.

In [4]:
#c)
print(f"Max rating [{baby_df.rating.max()}]")
print(f"Min rating [{baby_df.rating.min()}]")

baby_df = baby_df[baby_df.rating != 3]
#short test:
sum(baby_df["rating"] == 3)

Max rating [5]
Min rating [1]


0

Here we set the positive ratings (4 and 5) to 1, so there is no difference between them. Likewise, the negative ratings (1 and 2) are set to -1, so the two classes get opposite signs.

In [5]:
#d) 
def set_rate(rate):
    if rate <= 2:
        return -1
    if rate >= 4:
        return 1

baby_df['rating'] = baby_df['rating'].apply(set_rate)
    

#short test:
sum(baby_df["rating"]**2 != 1)

0

Now everything is clean and ready for next steps 
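Steps c) and d) also depend on their order: `set_rate` has no branch for a rating of 3, so it would return `None` for any neutral row that was not dropped first. A small sketch of this, where a plain Python list of made-up ratings stands in for the `rating` column:

```python
def set_rate(rate):
    if rate <= 2:
        return -1
    if rate >= 4:
        return 1
    # A rating of 3 matches neither branch, so the function returns None.

ratings = [5, 1, 3, 4, 2]  # made-up sample ratings

kept = [r for r in ratings if r != 3]   # step c): drop neutral ratings first
labels = [set_rate(r) for r in kept]    # step d): map the rest to -1 / 1

print(labels)        # [1, -1, 1, -1]
print(set_rate(3))   # None
```

The check `rating**2 != 1` used in the short test works for the same reason: every valid label squares to 1.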

## CountVectorizer
In order to analyze strings, we need to assign them numerical values. We will use one of the simplest string representations, which transforms strings into $n$-dimensional vectors. The number of dimensions is the size of our dictionary, and each value of the vector is the number of appearances of the corresponding word in the sentence.

In [6]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
reviews_train_example = ["We like apples",
                   "We hate oranges",
                   "I adore bananas",
                   "We like like apples and oranges",
                   "They dislike bananas"]

X_train_example = vectorizer.fit_transform(reviews_train_example)

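# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it.
print(vectorizer.get_feature_names_out().tolist())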
print(X_train_example.todense())



['adore', 'and', 'apples', 'bananas', 'dislike', 'hate', 'like', 'oranges', 'they', 'we']
[[0 0 1 0 0 0 1 0 0 1]
 [0 0 0 0 0 1 0 1 0 1]
 [1 0 0 1 0 0 0 0 0 0]
 [0 1 1 0 0 0 2 1 0 1]
 [0 0 0 1 1 0 0 0 1 0]]


In [7]:
reviews_test_example = ["They like bananas",
                   "We hate oranges bananas and apples",
                   "We love bananas"] #New word!

X_test_example = vectorizer.transform(reviews_test_example)

print(X_test_example.todense())

[[0 0 0 1 0 0 1 0 1 0]
 [0 1 1 1 0 1 0 1 0 1]
 [0 0 0 1 0 0 0 0 0 1]]


We should acknowledge a few facts. Firstly, CountVectorizer does not take word order into account. Secondly, it ignores one-letter words (this can be changed during initialization). Finally, when transforming test data, CountVectorizer ignores words that are not in its dictionary.
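These three properties can be checked directly on a tiny corpus. The sketch below is illustrative only: the `token_pattern` shown is one possible way to keep one-letter words, not a setting used elsewhere in this notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["We like apples", "I adore bananas"]

# The default tokenizer keeps only tokens of two or more characters, so "I" is dropped.
default_vec = CountVectorizer().fit(docs)
print(sorted(default_vec.vocabulary_))

# Assumed alternative pattern that also accepts single-character tokens.
one_letter_vec = CountVectorizer(token_pattern=r"(?u)\b\w+\b").fit(docs)
print(sorted(one_letter_vec.vocabulary_))

# Word order does not change the counts.
a = default_vec.transform(["we like apples"]).toarray()
b = default_vec.transform(["apples like we"]).toarray()
print((a == b).all())

# Words unseen during fit ("love") contribute nothing at transform time:
# only "we" and "bananas" are counted here.
unseen = default_vec.transform(["we love bananas"]).toarray()
print(unseen.sum())
```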

## Exercise 2 
a) Split dataset into training and test sets.     
b) Transform reviews into vectors using CountVectorizer. 

In [8]:
#a)
from sklearn.model_selection import train_test_split

train_df, test_df = train_test_split(baby_df, test_size=0.2, random_state=42)


In [9]:
#b)

vectorizer = CountVectorizer()

x_train = vectorizer.fit_transform(train_df['review'])
x_test = vectorizer.transform(test_df['review'])

y_train = train_df['rating']
y_test = test_df['rating']
vectorizer.get_feature_names_out().tolist()


['00',
 '000',
 '0001',
 '001',
 '001cm',
 '002',
 '01',
 '010',
 '012',
 '012010',
 '012013',
 '01202012',
 '013005',
 '01302012my',
 '01312009',
 '015a',
 '017',
 '0182196',
 '02',
 '020',
 '02000z',
 '02060',
 '02072',
 '02090',
 '021',
 '02100',
 '02172014after',
 '02180',
 '021914',
 '021meal',
 '02220',
 '024',
 '02534',
 '02640a',
 '02644',
 '03',
 '030',
 '030611fantastic',
 '032010',
 '034',
 '03mo',
 '03mos',
 '03mosbut',
 '04',
 '0427',
 '046060us',
 '04a',
 '05',
 '050',
 '051',
 '052',
 '05202013',
 '05oz',
 '06',
 '06182012my',
 '0635',
 '065',
 '06a',
 '06m',
 '06mfor',
 '06mo',
 '06month',
 '06mosit',
 '06mths',
 '07',
 '072012',
 '073',
 '075',
 '075long',
 '08',
 '0804',
 '080412',
 '08120firms',
 '0813',
 '08132011',
 '081713',
 '08280',
 '08all',
 '08m',
 '08while',
 '09',
 '09082009',
 '09092011',
 '09282012',
 '093',
 '093010c',
 '097',
 '099',
 '0bviously',
 '0fast',
 '0good',
 '0gt',
 '0i',
 '0m',
 '0negatives',
 '0r',
 '0star',
 '0the',
 '0these',
 '0up',
 '0z'

Here I set aside 20% of the data as a test set, so that the results can be evaluated on reviews the model has not seen.

## Exercise 3 
a) Train LogisticRegression model on training data (reviews processed with CountVectorizer, ratings as they were).   
b) Print 10 most positive and 10 most negative words.

In [10]:
#a)
model = LogisticRegression(solver='sag', max_iter=200)
model.fit(x_train, y_train)




LogisticRegression(max_iter=200, solver='sag')

The dataset is large and its count matrix has a very large number of columns, so I used the `sag` (Stochastic Average Gradient) solver, which scikit-learn recommends for large datasets. I also raised `max_iter` to 200 to give the solver more iterations to converge.

As the amount of data and the complexity of the model grow, training takes longer. This model took about a minute to train, whereas models used in industry can take several hours because they are much larger and more complex.

In [11]:
#b)
all_names = vectorizer.get_feature_names_out().tolist()
all_coefs = model.coef_[0]

paired_coefs = list(zip(all_coefs, all_names))
sorted_coefs = sorted(paired_coefs, key=lambda x: x[0])

print(f"Most positive words: {[x[1] for x in sorted_coefs[-10:]]}")
print(f"Most negative words: {[x[1] for x in sorted_coefs[:10]]}")

#hint: model.coef_, vectorizer.get_feature_names_out()

Most positive words: ['glad', 'excellent', 'great', 'best', 'happy', 'easy', 'perfectly', 'love', 'perfect', 'loves']
Most negative words: ['disappointed', 'returned', 'waste', 'useless', 'poor', 'return', 'idea', 'returning', 'terrible', 'worst']


Using the coefficients of the decision function (`model.coef_`), we can easily find the most negative and the most positive words. As we might have guessed, the positive words mostly express emotions, while the negative ones are mostly about dissatisfaction and the willingness to return an item.
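The same ranking can be done with `np.argsort` instead of sorting pairs. The sketch below uses a made-up six-review corpus and the default solver, so the numbers are illustrative only. The word list is rebuilt from `vocabulary_`, which maps each word to its column index.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Tiny made-up corpus: three positive and three negative reviews.
texts = ["love it", "love love this", "great love",
         "terrible waste", "terrible terrible product", "waste terrible"]
labels = [1, 1, 1, -1, -1, -1]

vec = CountVectorizer()
X = vec.fit_transform(texts)
clf = LogisticRegression().fit(X, labels)

# Column i of X belongs to the word whose vocabulary_ index is i.
words = np.array(sorted(vec.vocabulary_, key=vec.vocabulary_.get))

order = np.argsort(clf.coef_[0])            # ascending: most negative first
most_negative = words[order[:2]].tolist()
most_positive = words[order[-2:]].tolist()

print("Most negative:", most_negative)
print("Most positive:", most_positive)
```

With a binary target, `coef_[0]` holds one weight per column, and a positive weight pushes the prediction towards the second entry of `model.classes_` (here `1`).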

## Exercise 4 
a) Predict the sentiment of test data reviews.   
b) Predict the sentiment of test data reviews in terms of probability.   
c) Find five most positive and most negative reviews.   
d) Calculate the accuracy of predictions.

In [12]:
#a)
prediction = model.predict(x_test)

# Count the test reviews whose predicted label differs from the true one.
diff = int((prediction != y_test.to_numpy()).sum())
        
pred_number = len(prediction)

print(f"Number of test elements [{pred_number}]")        
print(f"Number of mistakes in predictions [{diff}]")
print(f"Percentage of correct predictions [{np.round((pred_number - diff) * 100 / pred_number, 2)}%]")


Number of test elements [33351]
Number of mistakes in predictions [2306]
Percentage of correct predictions [93.09%]


The model fits very well: only about 7% of the predictions are incorrect.
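The percentage follows directly from the two counts printed above. A short arithmetic check, using only those numbers:

```python
n_test = 33351      # number of test reviews, from the output above
n_mistakes = 2306   # misclassified reviews, from the output above

accuracy = (n_test - n_mistakes) / n_test
error_rate = n_mistakes / n_test

print(f"accuracy   = {accuracy:.4f}")
print(f"error rate = {error_rate:.4f}")
```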

In [13]:
#b)
proba_prediction = model.predict_proba(x_test)
print(proba_prediction)
print(f"Model classes {model.classes_}")

#hint: model.predict_proba()

[[4.76091089e-01 5.23908911e-01]
 [7.16374867e-01 2.83625133e-01]
 [4.75378466e-01 5.24621534e-01]
 ...
 [2.46177612e-04 9.99753822e-01]
 [7.43107542e-03 9.92568925e-01]
 [1.27799017e-01 8.72200983e-01]]
Model classes [-1  1]


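Each row holds the probabilities of the two classes in the order given by `model.classes_`, so the first column is the probability of class -1 and the second the probability of class 1. A value close to 1 means the model is confident that the review belongs to that class, and a value close to 0 means it is confident that it does not.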
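Because the column order comes from `classes_`, it is safer to look the column up than to hard-code its index. A minimal sketch on made-up one-dimensional data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up data: small feature values are negative, large ones positive.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([-1, -1, 1, 1])

clf = LogisticRegression().fit(X, y)
proba = clf.predict_proba(X)

# Look up which column holds the probability of the positive class.
pos_col = list(clf.classes_).index(1)
p_positive = proba[:, pos_col]

print(clf.classes_)
print(p_positive.round(3))
```

Each row of `proba` sums to 1, and `predict` returns the class whose column has the larger value.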

In [14]:
#c) 
paired_probs = list(zip(proba_prediction[:, 0], test_df['review']))
sorted_probs = sorted(paired_probs, key=lambda x: x[0])

print("Most positive:")
for i in sorted_probs[:5]:
    print(f"* {i[1]}\n")

print("\nMost negative:")
for i in sorted_probs[-5:]:
    print(f"* {i[1]}\n")
#hint: use the results of b)

Most positive:
* I bought this carrier when my daughter was about 4 weeks old shes now 10 weeks old  I had a Moby that I borrowed from a friend but could never quite get to work and my daughter hated being in it  I also have a Bjorn Active but she seemed pretty precarious in that when she was so littleThis carrier is nearly perfect for infants  Its not quite as easy to put on as the Bjorn but MUCH easier than the Moby and it gives that nice snug fit that the Moby did  Its much lighter weight than the Bjorn so its easier on my back  I have the khakicolored version so I havent had any problems with it showing dirt and dust  I wish the straps werent so long one size fits all but I just wrap them around to my back and tie them again loosely  Im 54 and 145 lbs  I think this carrier would fit most people  The fabric is thick and the construction seems to be of good quality It might be hot in hot weather but most carriers areThis has been a life saver  My daughter wont sleep anywhere except w

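All of these reviews are very long, which is probably why the model ranks them as the most and least positive: with count features, every extra sentiment word adds to the decision score, so long reviews reach more extreme probabilities.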
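The effect of length can be seen from the form of the model: the score is the intercept plus the sum of weight times count, passed through a sigmoid. The weights and intercept below are hypothetical values chosen only for illustration, not the ones learned above.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical weights and intercept, for illustration only.
weights = {"love": 1.4, "easy": 1.2, "broke": -1.7}
intercept = 0.5

def positive_probability(counts):
    # Linear score: intercept + sum of (weight * word count).
    score = intercept + sum(weights[word] * n for word, n in counts.items())
    return sigmoid(score)

short_review = {"love": 1, "easy": 1}
long_review = {"love": 5, "easy": 4, "broke": 1}   # same tone, more words

p_short = positive_probability(short_review)
p_long = positive_probability(long_review)
print(p_short, p_long)
```

Even though the long review contains a negative word, its repeated positive words push the score much further from zero, so its probability ends up closer to 1.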

In [15]:
#d) 
from sklearn.metrics import accuracy_score

# score() and accuracy_score() both return a fraction in [0, 1], not a percentage.
print(f"Model accuracy [{round(model.score(x_test, y_test), 2)}]")
print(f"Model accuracy [{round(accuracy_score(y_test, prediction), 2)}]")

Model accuracy [0.93]
Model accuracy [0.93]


## Exercise 5
In this exercise we will limit the dictionary of CountVectorizer to the set of significant words, defined below.


a) Redo exercises 2-4 using the limited dictionary.   
b) Check the impact of all the words from the dictionary.   
c) Compare accuracy of predictions and the time of evaluation.

In [16]:
significant_words = ['love','great','easy','old','little','perfect','loves','well','able','car','broke','less','even','waste','disappointed','work','product','money','would','return']

In [17]:
#a)
vectorizer_limited = CountVectorizer()
# Fit on the word list only to build the vocabulary; the transformed output is not needed.
vectorizer_limited.fit(significant_words)

x_train_limited = vectorizer_limited.transform(train_df['review'])
x_test_limited = vectorizer_limited.transform(test_df['review'])

all_names_less = vectorizer_limited.get_feature_names_out().tolist()
print(f"Limited words: {all_names_less}")

model_limited = LogisticRegression(solver='sag', max_iter=200)
model_limited.fit(x_train_limited, y_train)

paired_coefs = list(zip(model_limited.coef_[0], all_names_less))
sorted_coefs = sorted(paired_coefs, key=lambda x: x[0])

print(f"Most positive words: {[x[1] for x in sorted_coefs[-10:]]}")
print(f"Most negative words: {[x[1] for x in sorted_coefs[:10]]}")

Limited words: ['able', 'broke', 'car', 'disappointed', 'easy', 'even', 'great', 'less', 'little', 'love', 'loves', 'money', 'old', 'perfect', 'product', 'return', 'waste', 'well', 'work', 'would']
Most positive words: ['old', 'car', 'able', 'well', 'little', 'great', 'easy', 'love', 'perfect', 'loves']
Most negative words: ['disappointed', 'return', 'waste', 'broke', 'money', 'work', 'even', 'would', 'product', 'less']
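An alternative way to restrict the dictionary, not used in this notebook, is to pass the word list through the `vocabulary` parameter of `CountVectorizer`. A minimal sketch with a three-word subset and a made-up sentence:

```python
from sklearn.feature_extraction.text import CountVectorizer

vocab = ["love", "great", "car"]   # short illustrative subset of the word list

# With a fixed vocabulary, no fit step is needed before transform.
vec = CountVectorizer(vocabulary=vocab)
counts = vec.transform(["I love this great great car seat"]).toarray()

print(counts)            # columns follow the order of `vocab`
print(vec.vocabulary_)
```

One difference to keep in mind: with `vocabulary=`, the columns follow the order of the given list, whereas fitting on the word list orders them alphabetically, as in the output above.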


In [18]:

prediction = model_limited.predict(x_test_limited)
proba_prediction = model_limited.predict_proba(x_test_limited)
print(proba_prediction)
print(f"Model classes {model_limited.classes_}")

paired_probs = list(zip(proba_prediction[:, 0], test_df['review']))
sorted_probs = sorted(paired_probs, key=lambda x: x[0])

print("Most positive reviews:")
for i in sorted_probs[:5]:
    print(f"* {i[1]}\n")

print("\nMost negative reviews:")
for i in sorted_probs[-5:]:
    print(f"* {i[1]}\n")

[[0.07665783 0.92334217]
 [0.21469346 0.78530654]
 [0.21469346 0.78530654]
 ...
 [0.04226248 0.95773752]
 [0.10949687 0.89050313]
 [0.09187263 0.90812737]]
Model classes [-1  1]
Most positive reviews:
* We bought this stroller after selling our beloved BOB rev on craigslist We used the BOB for 9 months for my son but it just wasnt practical I dont jogrun it didnt have a big basket and was very bulky to take into stores quickly However I did love how it unfolded easily but it was heavy to fold up and lift into my small trunk myself Overall I didnt realize what Id need in a stroller until AFTER I had my son Live  learn We did love how easily the BOB would go over pretty much anything Nevertheless we sold it and after extensive research on strollers we decided it was between the uppababy brand because of the large baskets OR the city mini GT because of its easy fold up design After looking over both strollers I decided on the uppababy cruz because of a few main factors It SITS UP I cant t

Now, with the limited dictionary, different reviews are ranked as the most positive and the most negative.

In [19]:
# score() returns a fraction in [0, 1], not a percentage.
print(f"Model accuracy [{round(model_limited.score(x_test_limited, y_test), 2)}]")

Model accuracy [0.87]


We can see that the model is now less accurate, but it works much faster.

In [20]:
for pair in sorted_coefs:
    print("{:>12} - [{}]".format(pair[1], pair[0]))

disappointed - [-2.388319013380369]
      return - [-2.077206990927258]
       waste - [-2.0072577142469235]
       broke - [-1.6666980106122395]
       money - [-0.9384763534057065]
        work - [-0.6372887914440477]
        even - [-0.4905926898546979]
       would - [-0.3393604334780837]
     product - [-0.31197904805136506]
        less - [-0.2068813101029875]
         old - [0.07259771239041735]
         car - [0.07459277895225933]
        able - [0.19446499463265007]
        well - [0.49705929974884044]
      little - [0.5021900591092869]
       great - [0.9313163231934577]
        easy - [1.1917852116899987]
        love - [1.3574981123231504]
     perfect - [1.5221597093305508]
       loves - [1.701383834544173]


In [None]:
#c)
print(f"First model accuracy {model.score(x_test, y_test)}")
print(f"Limited model accuracy {model_limited.score(x_test_limited, y_test)}")

print("\nFirst model prediction time")
%timeit -n100 -r10 model.predict(x_test)

print("\nLimited model prediction time")
%timeit -n100 -r10 model_limited.predict(x_test_limited)

print("\nFirst model learning time")
%timeit -r1 model.fit(x_train, y_train)

print("\nLimited model learning time")
%timeit -r1 model_limited.fit(x_train_limited, y_train)

#hint: %time, %timeit

First model accuracy 0.9308566459776318
Limited model accuracy 0.869059398518785

First model prediction time
6.69 ms ± 438 µs per loop (mean ± std. dev. of 10 runs, 100 loops each)

Limited model prediction time
601 µs ± 44.4 µs per loop (mean ± std. dev. of 10 runs, 100 loops each)

First model learning time




The limited model predicts about 11 times faster (601 µs versus 6.69 ms per loop), while its accuracy drops from about 0.93 to 0.87.
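`%timeit` is an IPython magic, so it is not available in a plain Python script. A rough equivalent can be built with the standard `timeit` module, as sketched below. The two functions are hypothetical stand-ins for the two `predict` calls, and the helper `seconds_per_call` is introduced here only for illustration.

```python
import timeit

def small_job():
    # Hypothetical stand-in for the cheaper call (e.g. the limited model).
    return sum(range(1_000))

def large_job():
    # Hypothetical stand-in for the more expensive call (e.g. the full model).
    return sum(range(20_000))

def seconds_per_call(func, number=100, repeat=5):
    """Best average time of one call over `repeat` runs of `number` calls."""
    totals = timeit.repeat(func, number=number, repeat=repeat)
    return min(totals) / number

t_small = seconds_per_call(small_job)
t_large = seconds_per_call(large_job)
print(f"small: {t_small:.2e} s, large: {t_large:.2e} s, ratio: {t_large / t_small:.1f}x")
```

Note that this sketch reports the best run, while `%timeit` reports the mean and standard deviation, so the two are not directly comparable.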