In [27]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import string
from sklearn.linear_model import LogisticRegression

def remove_punctuation(text):
    import string
    translator = str.maketrans('', '', string.punctuation)
    return text.translate(translator)

baby_df = pd.read_csv('amazon_baby.csv')
baby_df.head()

Unnamed: 0,name,review,rating
0,Planetwise Flannel Wipes,"These flannel wipes are OK, but in my opinion ...",3
1,Planetwise Wipe Pouch,it came early and was not disappointed. i love...,5
2,Annas Dream Full Quilt with 2 Shams,Very soft and comfortable and warmer than it l...,5
3,Stop Pacifier Sucking without tears with Thumb...,This is a product well worth the purchase. I ...,5
4,Stop Pacifier Sucking without tears with Thumb...,All of my kids have cried non-stop when I trie...,5


## Exercise 1 (data preparation)
a) Remove punctuation from reviews using the given function.   
b) Replace all missing (NaN) reviews with the empty string "".  
c) Drop all the entries with rating = 3, as they have neutral sentiment.   
d) Set all positive ($\geq$4) ratings to 1 and negative ($\leq$2) to -1.

In [28]:
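# b) Replace all missing (NaN) reviews with the empty string "".
# The given function fails on NaN, so we first remove every NaN from the "review" column of our dataframe.

print("before replacement: ", baby_df["review"][38])
# Assign the result back instead of calling replace(..., inplace=True) on a column selection,
# which newer pandas versions may not propagate to the dataframe.
baby_df["review"] = baby_df["review"].fillna("")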
print("after replacement: ", baby_df["review"][38])

# short test: NaN is never equal to itself, so this is True only once the NaN has been replaced
baby_df["review"][38] == baby_df["review"][38]

before replacement:  nan
after replacement:  


True

In [29]:
#a) Remove punctuation from reviews using the given function.
baby_df["review"] = baby_df["review"].apply(remove_punctuation)

# short test: 
baby_df["review"][4] == 'All of my kids have cried nonstop when I tried to ween them off their pacifier until I found Thumbuddy To Loves Binky Fairy Puppet  It is an easy way to work with your kids to allow them to understand where their pacifier is going and help them part from itThis is a must buy book and a great gift for expecting parents  You will save them soo many headachesThanks for this book  You all rock'
remove_punctuation(baby_df["review"][4]) == 'All of my kids have cried nonstop when I tried to ween them off their pacifier until I found Thumbuddy To Loves Binky Fairy Puppet  It is an easy way to work with your kids to allow them to understand where their pacifier is going and help them part from itThis is a must buy book and a great gift for expecting parents  You will save them soo many headachesThanks for this book  You all rock'

True

In [30]:
#c) Drop all the entries with rating = 3, as they have neutral sentiment.
# Drop every row whose rating equals 3 in one vectorized call instead of a Python loop.
baby_df.drop(baby_df.index[baby_df["rating"] == 3], inplace=True)

#short test:
sum(baby_df["rating"] == 3)

0

In [31]:
baby_df.head(50)

Unnamed: 0,name,review,rating
1,Planetwise Wipe Pouch,it came early and was not disappointed i love ...,5
2,Annas Dream Full Quilt with 2 Shams,Very soft and comfortable and warmer than it l...,5
3,Stop Pacifier Sucking without tears with Thumb...,This is a product well worth the purchase I h...,5
4,Stop Pacifier Sucking without tears with Thumb...,All of my kids have cried nonstop when I tried...,5
5,Stop Pacifier Sucking without tears with Thumb...,When the Binky Fairy came to our house we didn...,5
6,A Tale of Baby's Days with Peter Rabbit,Lovely book its bound tightly so you may not b...,4
7,"Baby Tracker&reg; - Daily Childcare Journal, S...",Perfect for new parents We were able to keep t...,5
8,"Baby Tracker&reg; - Daily Childcare Journal, S...",A friend of mine pinned this product on Pinter...,5
9,"Baby Tracker&reg; - Daily Childcare Journal, S...",This has been an easy way for my nanny to reco...,4
10,"Baby Tracker&reg; - Daily Childcare Journal, S...",I love this journal and our nanny uses it ever...,4


In [32]:
# d) Set all positive (≥ 4) ratings to 1 and negative (≤ 2) to -1.
# Rating 3 was dropped in c), so every remaining rating is either >= 4 or <= 2.
baby_df["rating"] = np.where(baby_df["rating"] >= 4, 1, -1)
# short test: every label must now be 1 or -1, so its square is 1
sum(baby_df["rating"]**2 != 1)

0

In [33]:
baby_df.head(50)

Unnamed: 0,name,review,rating
1,Planetwise Wipe Pouch,it came early and was not disappointed i love ...,1
2,Annas Dream Full Quilt with 2 Shams,Very soft and comfortable and warmer than it l...,1
3,Stop Pacifier Sucking without tears with Thumb...,This is a product well worth the purchase I h...,1
4,Stop Pacifier Sucking without tears with Thumb...,All of my kids have cried nonstop when I tried...,1
5,Stop Pacifier Sucking without tears with Thumb...,When the Binky Fairy came to our house we didn...,1
6,A Tale of Baby's Days with Peter Rabbit,Lovely book its bound tightly so you may not b...,1
7,"Baby Tracker&reg; - Daily Childcare Journal, S...",Perfect for new parents We were able to keep t...,1
8,"Baby Tracker&reg; - Daily Childcare Journal, S...",A friend of mine pinned this product on Pinter...,1
9,"Baby Tracker&reg; - Daily Childcare Journal, S...",This has been an easy way for my nanny to reco...,1
10,"Baby Tracker&reg; - Daily Childcare Journal, S...",I love this journal and our nanny uses it ever...,1


## Summary: 
In c), after dropping every row whose rating equals 3, the test counts the remaining rows with rating 3 and returns 0, as expected.

In d) we map the remaining ratings to sentiment labels: values below 3 become -1 and values above 3 become 1. The test confirms that every label squares to 1.
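
The four preparation steps can also be written in vectorized form. The following is a minimal sketch on a made-up four-row dataframe (`toy_df` and its contents are hypothetical stand-ins, not rows from `amazon_baby.csv`); it uses `fillna` and `np.where`, which is one possible approach among several.

```python
import string

import numpy as np
import pandas as pd

# Hypothetical toy data standing in for amazon_baby.csv.
toy_df = pd.DataFrame({
    "review": ["Great, love it!", np.nan, "It's OK.", "Broke in a day..."],
    "rating": [5, 4, 3, 1],
})

# b) fill missing reviews first, then a) strip punctuation
translator = str.maketrans("", "", string.punctuation)
toy_df["review"] = toy_df["review"].fillna("").map(lambda t: t.translate(translator))

# c) keep only the non-neutral rows
toy_df = toy_df[toy_df["rating"] != 3].copy()

# d) ratings >= 4 become 1, everything else (here: <= 2) becomes -1
toy_df["rating"] = np.where(toy_df["rating"] >= 4, 1, -1)

print(toy_df)
```

The order matters: missing values are filled before the punctuation step, because `str.translate` is not defined for NaN.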

## Exercise 2 
a) Split dataset into training and test sets.     
b) Transform reviews into vectors using CountVectorizer. 

In [34]:
# a)
from sklearn.model_selection import train_test_split

train, test = train_test_split(baby_df, train_size=0.8, test_size=0.2)

In [35]:
# print("train:", list(train["review"])[3])
# print("test:", list(test["review"])[3])

# b)
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
x_train = vectorizer.fit_transform(list(train["review"]))
x_test = vectorizer.transform(list(test["review"]))

## Exercise 3 
a) Train a LogisticRegression model on the training data (reviews processed with CountVectorizer, ratings as they were).   
b) Print the 10 most positive and the 10 most negative words.

In [47]:
#a)
y_train = train["rating"]
# print(list(y_train))

model = LogisticRegression(solver='lbfgs', max_iter=10000)
model.fit(x_train, y_train)

In [55]:
#b)
# Get the indices of the sorted coefficients
ascending = np.argsort(model.coef_.flatten()) 
descending = ascending[::-1]

vocab = np.array(vectorizer.get_feature_names_out())

# Print most positive words
print("Most positive words: ", end="")
for i in range(10):
    print(vocab[descending[i]], end=", ")
print("\n")

# Print most negative words
print("Most negative words: ", end="")
for i in range(10):
    print(vocab[ascending[i]], end=", ")
print("\n")

#hint: model.coef_, vectorizer.get_feature_names_out()

Most positive words: lifesaver, minor, rich, ply, saves, penny, flipit, perfect, highly, excellent, 

Most negative words: worst, dissapointed, disappointing, worthless, useless, theory, poorly, disappointed, poor, unacceptable, 



## Summary:
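In this exercise we interpret the coefficients of a logistic regression model fitted on the product ratings: a large positive coefficient marks a word associated with positive reviews, a large negative one a word associated with negative reviews. \
`descending[0]` holds the index of the largest coefficient. \
`vocab[descending[0]]` holds the word corresponding to that coefficient.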
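
The indexing can be checked by hand on a tiny example. This sketch uses a made-up vocabulary and made-up coefficients (`toy_vocab` and `toy_coef` are hypothetical, shaped like `model.coef_` for a binary problem).

```python
import numpy as np

# Hypothetical vocabulary and coefficients; toy_coef has shape (1, n_features).
toy_vocab = np.array(["bad", "fine", "great", "awful"])
toy_coef = np.array([[-1.5, 0.1, 2.0, -2.5]])

order = np.argsort(toy_coef.ravel())        # indices from smallest to largest coefficient
most_negative = toy_vocab[order[:2]]        # the two smallest coefficients
most_positive = toy_vocab[order[::-1][:2]]  # the two largest coefficients

print("Most positive:", most_positive)
print("Most negative:", most_negative)
```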

## Exercise 4 
a) Predict the sentiment of the test data reviews.   
b) Predict the sentiment of the test data reviews in terms of probability.   
c) Find the five most positive and the five most negative reviews.   
d) Calculate the accuracy of the predictions.
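
One possible way to approach these four steps is sketched below on made-up toy data, not on the review dataset. All names prefixed with `toy_`, as well as `clf`, are hypothetical; on the real data the same calls would presumably be applied to `model`, `x_test`, and `test["rating"]`.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data standing in for the train/test split of the reviews.
toy_train_texts = ["great love", "great love works", "terrible broke", "terrible broke fast"]
toy_train_labels = [1, 1, -1, -1]
toy_test_texts = ["love great stroller", "terrible broke"]
toy_test_labels = np.array([1, -1])

toy_vec = CountVectorizer()
clf = LogisticRegression(solver="lbfgs", max_iter=1000)
clf.fit(toy_vec.fit_transform(toy_train_texts), toy_train_labels)
X_toy_test = toy_vec.transform(toy_test_texts)

# a) hard class predictions (-1 or 1)
toy_pred = clf.predict(X_toy_test)

# b) probability of the positive class; the columns follow clf.classes_
pos_col = list(clf.classes_).index(1)
toy_proba_pos = clf.predict_proba(X_toy_test)[:, pos_col]

# c) test indices ordered from most positive to most negative
toy_ranking = np.argsort(toy_proba_pos)[::-1]

# d) accuracy = share of correct predictions
toy_accuracy = np.mean(toy_pred == toy_test_labels)

print(toy_pred, toy_proba_pos, toy_ranking, toy_accuracy)
```

For c), the first five entries of the ranking would give the most positive reviews and the last five the most negative ones. Looking up the column via `clf.classes_` avoids assuming which column of `predict_proba` belongs to the label 1.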

In [38]:
# a)

In [39]:
# b)

#hint: model.predict_proba()

In [40]:
# c) 

#hint: use the results of b)

In [41]:
# d) 

## Exercise 5
In this exercise we will limit the dictionary of CountVectorizer to the set of significant words, defined below.


a) Redo exercises 2-4 using the limited dictionary.   
b) Check the impact of all the words from the dictionary.   
c) Compare the accuracy of the predictions and the time of evaluation.

In [42]:
# a)

In [43]:
# b)

In [44]:
#c)

#hint: %time, %timeit