### Today we are going to perform a simple sentiment classification of Amazon reviews.

### Please download the dataset amazon_baby.csv.

In [11]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import string
from sklearn.linear_model import LogisticRegression

class colors:
    POSITIVE = '\033[92m'
    NEGATIVE = '\033[91m'
    RESET = '\033[0m'

iterations = 2000

def remove_punctuation(text):
    import string
    translator = str.maketrans('', '', string.punctuation)
    return text.translate(translator)

baby_df = pd.read_csv('amazon_baby.csv')
baby_df.head()

Unnamed: 0,name,review,rating
0,Planetwise Flannel Wipes,"These flannel wipes are OK, but in my opinion ...",3
1,Planetwise Wipe Pouch,it came early and was not disappointed. i love...,5
2,Annas Dream Full Quilt with 2 Shams,Very soft and comfortable and warmer than it l...,5
3,Stop Pacifier Sucking without tears with Thumb...,This is a product well worth the purchase. I ...,5
4,Stop Pacifier Sucking without tears with Thumb...,All of my kids have cried non-stop when I trie...,5


## Exercise 1 (data preparation)
a) Remove punctuation from reviews using the given function.   
b) Replace all missing (nan) reviews with an empty string "".  
c) Drop all the entries with rating = 3, as they have neutral sentiment.   
d) Set all positive ($\geq$ 4) ratings to 1 and negative ($\leq$ 2) to -1.

In [12]:
#a)

#short test: 
baby_df["review"][4] == 'All of my kids have cried nonstop when I tried to ween them off their pacifier until I found Thumbuddy To Loves Binky Fairy Puppet  It is an easy way to work with your kids to allow them to understand where their pacifier is going and help them part from itThis is a must buy book and a great gift for expecting parents  You will save them soo many headachesThanks for this book  You all rock'
print(remove_punctuation(baby_df["review"][4]) == 'All of my kids have cried nonstop when I tried to ween them off their pacifier until I found Thumbuddy To Loves Binky Fairy Puppet  It is an easy way to work with your kids to allow them to understand where their pacifier is going and help them part from itThis is a must buy book and a great gift for expecting parents  You will save them soo many headachesThanks for this book  You all rock')
###########

print(f'Review before: {baby_df["review"][3]}', end='\n\n')

baby_df['review'] = baby_df['review'].apply(lambda x: remove_punctuation(x) if isinstance(x, str) else '')

print(f'Review after: {baby_df["review"][3]}', end='\n\n')

True
Review before: This is a product well worth the purchase.  I have not found anything else like this, and it is a positive, ingenious approach to losing the binky.  What I love most about this product is how much ownership my daughter has in getting rid of the binky.  She is so proud of herself, and loves her little fairy.  I love the artwork, the chart in the back, and the clever approach of this tool.

Review after: This is a product well worth the purchase  I have not found anything else like this and it is a positive ingenious approach to losing the binky  What I love most about this product is how much ownership my daughter has in getting rid of the binky  She is so proud of herself and loves her little fairy  I love the artwork the chart in the back and the clever approach of this tool



# Note:
    As we can see, the punctuation in the review disappeared. This will be useful when we vectorize our data later (to analyze it), because punctuation symbols will not end up in the vectors, where they could hurt our model: most people use some punctuation, and it carries no meaning for us because we are interested in sentiment.
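
A small side effect is visible in the test string above: "non-stop" became "nonstop", because deleting a punctuation character glues its neighbours together. The sketch below reproduces this on a made-up sentence and, as a hypothetical variant that is not part of the exercise, replaces punctuation with spaces instead (the name `punctuation_to_spaces` is an illustration only).

```python
import string

def remove_punctuation(text):
    # Same helper as in the notebook: delete every character in string.punctuation.
    return text.translate(str.maketrans('', '', string.punctuation))

def punctuation_to_spaces(text):
    # Hypothetical variant: map each punctuation character to a space instead.
    table = str.maketrans(string.punctuation, ' ' * len(string.punctuation))
    return text.translate(table)

sample = "It's a must-buy, non-stop favourite!"  # made-up example sentence
cleaned = remove_punctuation(sample)
spaced = punctuation_to_spaces(sample)

print(cleaned)         # Its a mustbuy nonstop favourite
print(spaced.split())  # tokens after splitting on whitespace
```

Which variant is preferable depends on whether "nonstop" or the pair "non", "stop" is the more useful token for the model.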

In [13]:
#b) done in a)
# check whether all rows do not have nan as the value in review column
print(baby_df[baby_df['review'].isna()])
#short test:
baby_df["review"][38] == baby_df["review"][38]

Empty DataFrame
Columns: [name, review, rating]
Index: []


True

# Note:
    All reviews with a missing value (nan) are now empty strings, which the vectorizer can process like any other review.

In [14]:
#c)
baby_df = baby_df[baby_df.rating != 3]
#short test:
print(f'Amount of rows in which rating column is 3: {sum(baby_df["rating"] == 3)}')

Amount of rows in which rating column is 3: 0


# Note:
    We successfully dropped all rows in which the rating was 3 (the neutral ones).

In [15]:
#d) 
baby_df.loc[baby_df["rating"] <= 2, "rating"] = -1
baby_df.loc[baby_df["rating"] >= 4 , "rating"] = 1

#short test:
print(f'Amount of ratings with rating different than -1 or 1: {sum(baby_df["rating"]**2 != 1)}')

Amount of ratings with rating different than -1 or 1: 0


# Note:
    We successfully changed all negative ratings to -1 and positive ones to 1, so that the labels are easier to work with.
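
The two assignments in d) work in the order shown because -1 is never $\geq$ 4. In the opposite order the freshly written 1s would satisfy `<= 2` and be overwritten with -1. A minimal sketch of an order-independent alternative with `np.where`, on a made-up toy frame (the names `toy` and `sentiment` are illustrative):

```python
import numpy as np
import pandas as pd

# Made-up ratings; the neutral 3s are assumed to be dropped already.
toy = pd.DataFrame({"rating": [1, 2, 4, 5, 2, 5]})

# np.where evaluates the condition once on the original values,
# so no label can be overwritten by a later assignment.
toy["sentiment"] = np.where(toy["rating"] >= 4, 1, -1)
print(toy)
```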

## CountVectorizer
In order to analyze strings, we need to assign them numerical values. We will use one of the simplest string representations, which transforms strings into $n$-dimensional vectors. The number of dimensions is the size of our dictionary, and each value of the vector is the number of appearances of the corresponding word in the sentence.

In [16]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
reviews_train_example = [
    "We like apples",
    "We hate oranges",
    "I adore bananas",
    "We like like apples and oranges",
    "They dislike bananas"
]

X_train_example = vectorizer.fit_transform(reviews_train_example)

print(vectorizer.get_feature_names_out())
print(X_train_example.todense())

['adore' 'and' 'apples' 'bananas' 'dislike' 'hate' 'like' 'oranges' 'they'
 'we']
[[0 0 1 0 0 0 1 0 0 1]
 [0 0 0 0 0 1 0 1 0 1]
 [1 0 0 1 0 0 0 0 0 0]
 [0 1 1 0 0 0 2 1 0 1]
 [0 0 0 1 1 0 0 0 1 0]]


In [17]:
reviews_test_example = ["They like bananas",
                   "We hate oranges bananas and apples",
                   "We love bananas"] #New word!

X_test_example = vectorizer.transform(reviews_test_example)

print(X_test_example.todense())

[[0 0 0 1 0 0 1 0 1 0]
 [0 1 1 1 0 1 0 1 0 1]
 [0 0 0 1 0 0 0 0 0 1]]


We should acknowledge a few facts. Firstly, CountVectorizer does not take word order into account. Secondly, it ignores one-letter words (this can be changed during initialization). Finally, for test values, CountVectorizer ignores words which are not in its dictionary.

## Exercise 2 
a) Split dataset into training and test sets.     
b) Transform reviews into vectors using CountVectorizer. 

In [18]:
#a)
from sklearn.model_selection import train_test_split

smaller_df = baby_df[:]
X_train, X_test, y_train, y_test = train_test_split(smaller_df['review'], smaller_df['rating'], test_size=0.3, random_state=44)

In [19]:
#b)
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(X_train)
# print(np.where(np.array(vectorizer.get_feature_names_out()) == ' '))
X_test = vectorizer.transform(X_test)
print(np.shape(X_train))

(116726, 111592)


# Note:
    We fitted the vectorizer on the training set only (70% of the dataset, since test_size=0.3) and reused its vocabulary to transform the test set.
    
    The resulting matrix (116726 x 111592) is huge: one row per training review and one column per distinct word in the training vocabulary.
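
A matrix of this shape is only manageable because `CountVectorizer` returns a SciPy sparse matrix, which stores the non-zero counts only. A minimal sketch on a made-up corpus showing how to inspect the shape and the share of non-zero entries (the variable names are illustrative):

```python
from sklearn.feature_extraction.text import CountVectorizer

toy_reviews = ["We like apples", "We hate oranges", "They dislike bananas"]
X_toy = CountVectorizer().fit_transform(toy_reviews)

n_rows, n_cols = X_toy.shape
density = X_toy.nnz / (n_rows * n_cols)  # fraction of entries that are non-zero

print(type(X_toy).__name__)
print(f"shape={X_toy.shape}, stored non-zeros={X_toy.nnz}, density={density:.3f}")
```

The same two attributes (`shape` and `nnz`) can be read from `X_train` in the notebook.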

## Exercise 3 
a) Train LogisticRegression model on training data (reviews processed with CountVectorizer, ratings as they were).   
b) Print 10 most positive and 10 most negative words.

In [20]:
#a)
model = LogisticRegression(max_iter=iterations)
model.fit(X_train, y_train)

# Note:
    We successfully trained our LogisticRegression model using the training data.

In [21]:
#b)
coefs = model.coef_.reshape(-1, 1)
new_coefs = np.array([coef[0] for coef in coefs])

most_positive_idx = (-new_coefs).argsort()[:10]
most_negative_idx = new_coefs.argsort()[:10]

most_positive = [vectorizer.get_feature_names_out()[i] for i in most_positive_idx]
most_negative = [vectorizer.get_feature_names_out()[i] for i in most_negative_idx]

print('Most positive words:', *most_positive)
print('Most negative words:', *most_negative)

#hint: model.coef_, vectorizer.get_feature_names_out()

Most positive words: rich ply thankful awesome pleasantly minor worry lifesaver perfect downside
Most negative words: dissapointed disappointing worst worthless poorly useless nope shame pointless theory


# Note:
    We can see that the positive and the negative words fit nicely into their categories. There are some exceptions, like 'downside' among the positive words, probably because people who write positive reviews also like to list minor downsides, so that the review is more useful to other buyers.

## Exercise 4 
a) Predict the sentiment of test data reviews.   
b) Predict the sentiment of test data reviews in terms of probability.   
c) Find five most positive and most negative reviews.   
d) Calculate the accuracy of predictions.

In [22]:
#a)
y_pred = model.predict(X_test)

In [23]:
#b)
y_pred_prob = model.predict_proba(X_test)
#hint: model.predict_proba()

In [24]:
#c) 
negative = y_pred_prob[:, 0]
positive = y_pred_prob[:, 1]
most_negative_idx = (-negative).argsort()[:5]
most_positive_idx = positive.argsort()[-5:]

print('Most negative reviews row numbers:', *most_negative_idx)
print('Most positive reviews row numbers:', *most_positive_idx)

print(f'{colors.NEGATIVE}red {colors.RESET} means a negative review (value == -1)')
print(f'{colors.POSITIVE}green {colors.RESET} means a positive review (value == 1)')

_ = 'Most positive reviews according to probability'
print(f'\n{_:-^70}')
for idx in most_positive_idx:
    color = colors.POSITIVE if np.array(smaller_df['rating'])[idx] == 1 else colors.NEGATIVE
    print(f'\t{color}{idx}:{colors.RESET}', np.array(smaller_df['review'])[idx])

_ = _.replace('positive', 'negative')
print(f'\n{_:-^70}')
for idx in most_negative_idx:
    color = colors.POSITIVE if np.array(smaller_df['rating'])[idx] == 1 else colors.NEGATIVE
    print(f'\t{color}{idx}:{colors.RESET}',np.array(smaller_df['review'])[idx])
print('-'*70)
#hint: use the results of b)

Most negative reviews row numbers: 22799 24865 40068 46409 17196
Most positive reviews row numbers: 28974 24539 38946 709 29309
red means a negative review (value == -1)
green means a positive review (value == 1)

------------Most positive reviews according to probability------------
	28974 (green): Great for a night over at Grandmas house  My grandson enjoyed it and the parents gave it a thumbs up
	24539 (green): this gate and extension are grate they look good and work good  that is if you follow the directions If you dont reed the directions you will thank it is broken but it is not it will all make seance wants it is installed
	38946 (green): Both my 4 year old and 6 month old are wonderful shoppers but on a marathon shopping day when I am by myself my 4 year old does get a little weary of treading around behind me  He thinks he is too grown up for a baby stroller but was very excited to see the big boy seat on the back of his babys stroller  This stroller is

# Note:
    Judging by the original ratings, most of the listed reviews are positive, yet the model placed some of them among the most positive and others among the most negative.
    
    What I found really interesting is that the longer a review is, the more likely the model is to label it as negative. This is probably because long reviews contain many words, and words can carry several meanings.
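
One detail is worth checking when reading these lists. The row numbers come from `argsort` over `y_pred_prob`, so they are positions within the test split. Looking them up positionally in the full frame (as `np.array(smaller_df['rating'])[idx]` does) generally selects different rows from the ones the model scored. Below is a minimal sketch of mapping test positions back to the original rows through the index of the test labels. The toy data and all variable names are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Toy frame with gaps in the index, as happens after filtering rows out.
toy = pd.DataFrame(
    {"review": ["love it", "great buy", "waste of money", "broke fast",
                "love love", "total waste", "great great", "it broke"],
     "rating": [1, 1, -1, -1, 1, -1, 1, -1]},
    index=[0, 1, 2, 4, 5, 7, 8, 9],
)
rev_train, rev_test, y_tr, y_te = train_test_split(
    toy["review"], toy["rating"], test_size=0.5, random_state=0,
    stratify=toy["rating"])

vec = CountVectorizer()
clf = LogisticRegression().fit(vec.fit_transform(rev_train), y_tr)
proba = clf.predict_proba(vec.transform(rev_test))

pos_col = list(clf.classes_).index(1)               # column with P(rating == 1)
top_pos = np.argsort(proba[:, pos_col])[::-1][:2]   # positions inside the test split

for pos in top_pos:
    label = y_te.index[pos]  # original row label in `toy`
    print(label, y_te.iloc[pos], rev_test.iloc[pos], round(proba[pos, pos_col], 3))
```

In the notebook, `y_test.index[idx]` would be the counterpart of `y_te.index[pos]`, since `X_test` is overwritten by the vectorized matrix while `y_test` keeps the original index.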

In [25]:
#d) 
print(f'Model score: {model.score(X_test, y_test)}')

Model score: 0.9318154559628993


# Note:
    The model achieved an accuracy of about 0.932 on the test set, which is high. It shows that the model generalizes well to the 30% of reviews held out from training; on its own it does not prove that the split was done correctly.
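
For a classifier, `model.score` returns the mean accuracy, i.e. the fraction of predictions that match the true labels. The sketch below computes it by hand on made-up labels and compares it with a majority-class baseline. How the classes are actually balanced in the test set is not shown in this notebook, so the numbers here are purely illustrative.

```python
import numpy as np
from sklearn.metrics import accuracy_score

y_true = np.array([1, 1, 1, -1, 1, -1, 1, 1])   # made-up true labels
y_hat = np.array([1, 1, -1, -1, 1, -1, 1, 1])   # made-up predictions

manual = np.mean(y_true == y_hat)          # fraction of matching predictions
library = accuracy_score(y_true, y_hat)    # same quantity via scikit-learn
majority_baseline = np.mean(y_true == 1)   # accuracy of always predicting 1

print(manual, library, majority_baseline)
```

Comparing the score with such a baseline (for example via `np.mean(y_test == 1)`) indicates how much of the accuracy comes from the model rather than from the class balance.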

## Exercise 5
In this exercise we will limit the dictionary of CountVectorizer to the set of significant words, defined below.


a) Redo exercises 2-4 using the limited dictionary.   
b) Check the impact of all the words from the dictionary.   
c) Compare accuracy of predictions and the time of evaluation.

In [26]:
significant_words = ['love','great','easy','old','little','perfect','loves','well','able','car','broke','less','even','waste','disappointed','work','product','money','would','return']

In [27]:
#a)
X_train, X_test, y_train, y_test = train_test_split(smaller_df['review'], smaller_df['rating'], test_size=0.3, random_state=44)
vectorizer = CountVectorizer(vocabulary=significant_words)
X_train = vectorizer.fit_transform(X_train)
X_test = vectorizer.transform(X_test)
model = LogisticRegression(max_iter=iterations)
model.fit(X_train, y_train)

coefs = model.coef_.reshape(-1, 1)
new_coefs = np.array([coef[0] for coef in coefs])

most_positive_idx = (-new_coefs).argsort()[:10]
most_negative_idx = new_coefs.argsort()[:10]

most_positive = [vectorizer.get_feature_names_out()[i] for i in most_positive_idx]
most_negative = [vectorizer.get_feature_names_out()[i] for i in most_negative_idx]

print('Most positive words:', *most_positive)
print('Most negative words:', *most_negative, end='\n\n')
y_pred = model.predict(X_test)
y_pred_prob = model.predict_proba(X_test)
negative = y_pred_prob[:, 0]
positive = y_pred_prob[:, 1]

most_negative_idx = (-negative).argsort()[:5]
most_positive_idx = positive.argsort()[-5:]
print('Most positive reviews row numbers:', *most_positive_idx)
print('Most negative reviews row numbers:', *most_negative_idx, end='\n\n')

print(f'{colors.NEGATIVE}red {colors.RESET} means a negative review (value == -1)')
print(f'{colors.POSITIVE}green {colors.RESET} means a positive review (value == 1)')

_ = 'Most positive reviews according to probability'
print(f'\n{_:-^70}')
for idx in most_positive_idx:
    color = colors.POSITIVE if np.array(smaller_df['rating'])[idx] == 1 else colors.NEGATIVE
    print(f'\t{color}{idx}:{colors.RESET}', np.array(smaller_df['review'])[idx])

_ = _.replace('positive', 'negative')
print(f'\n{_:-^70}')
for idx in most_negative_idx:
    color = colors.POSITIVE if np.array(smaller_df['rating'])[idx] == 1 else colors.NEGATIVE
    print(f'\t{color}{idx}:{colors.RESET}',np.array(smaller_df['review'])[idx])
print('-'*70)

print(f'Model score: {model.score(X_test, y_test)}')

Most positive words: loves perfect love easy great little well able old car
Most negative words: disappointed return waste broke money work even would product less

Most positive reviews row numbers: 6473 19348 16281 40378 28924
Most negative reviews row numbers: 11091 1403 10376 11231 39845

red means a negative review (value == -1)
green means a positive review (value == 1)

------------Most positive reviews according to probability------------
	6473 (red): I registered for this monitor because of the two receiver feature and because the 900 mHz technology was supposed to be the best  We already used the Fisher Price prenatal to nursery monitor during my pregnancy but we thought that this one would be better for everyday use  From day one there was static but I just thought that all monitors must do that  It got worse with time neither channel would come in clearly the second channel never did work we would get a loud static noise about every 30 seconds that wa

# Note:
    Using the limited dictionary we achieved worse results: the model score is now 0.868, about 6.4 percentage points lower than before (6.82% in relative terms).
    
    The reviews predicted as the most positive by the model were nevertheless negative in the original dataset.

    Interestingly, the empty review (39845), although rated as positive in the original dataset, ended up among the most negative reviews according to the model. This should be taken into consideration when analysing such data (we may want to drop all rows with empty reviews, as they contribute little to the model's prediction).
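
An empty review is transformed into an all-zero vector, so the coefficients play no role and the predicted probability is determined by the intercept alone. A minimal sketch on made-up data, with a hypothetical four-word vocabulary:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

vocab = ["love", "great", "waste", "broke"]  # hypothetical mini vocabulary
reviews = ["love it great", "great love", "waste broke",
           "broke waste waste", "love great great"]
labels = [1, 1, -1, -1, 1]

vec = CountVectorizer(vocabulary=vocab)
clf = LogisticRegression().fit(vec.transform(reviews), labels)

empty_vec = vec.transform([""])                  # no known words -> all zeros
p_positive = clf.predict_proba(empty_vec)[0, 1]  # column 1 is class 1 here
from_intercept = 1.0 / (1.0 + np.exp(-clf.intercept_[0]))  # sigmoid(intercept)

print(empty_vec.toarray(), p_positive, from_intercept)
```

The same holds for any review that contains none of the dictionary words, which is far more common with 20 words than with the full vocabulary.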

In [28]:
#b)
coefs = np.array([coef for coef in model.coef_[0]])
idx = coefs.argsort()
for i in idx:
    color = colors.POSITIVE if coefs[i] > 0 else colors.NEGATIVE
    print(f'{color}{round(coefs[i],5):>8}{colors.RESET}: {vectorizer.get_feature_names_out()[i]}')

-2.32503: disappointed
-2.17067: return
 -1.9979: waste
-1.73447: broke
-0.92054: money
-0.63547: work
-0.51369: even
-0.34632: would
-0.30603: product
-0.17897: less
 0.05671: car
 0.08259: old
 0.19439: able
 0.48143: well
 0.48502: little
 0.94037: great
 1.14975: easy
 1.34271: love
 1.47255: perfect
 1.70305: loves


# Note:
    We can see that the most negative words were: disappointed, return, waste - all having a negative meaning in English.

    The most positive words were: loves, perfect, love - all of them having a very positive meaning.
    
    So the model got the positive and negative words right. Words like 'car' and 'product' don't have a strongly positive/negative meaning in English; they are listed because every word of the limited dictionary gets a coefficient, and their small coefficients reflect that they occur slightly more often in positive/negative reviews.
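
To make the size of a coefficient more tangible: in logistic regression each additional occurrence of a word adds its coefficient to the log-odds of the positive class, which multiplies the odds by $e^{\text{coef}}$. A short sketch using four of the coefficients printed above (the dictionary names are illustrative):

```python
import math

# Coefficients copied from the printed output of part b).
coefs = {"disappointed": -2.32503, "product": -0.30603,
         "car": 0.05671, "loves": 1.70305}

# exp(coefficient) = factor applied to the odds of a positive review
# for each extra occurrence of the word.
multipliers = {word: math.exp(w) for word, w in coefs.items()}

for word, factor in multipliers.items():
    print(f"{word:>12}: coefficient {coefs[word]:+.5f} -> odds x {factor:.3f}")
```

A factor close to 1, as for 'car', means the word barely moves the prediction.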

In [29]:
#c)
X_train, X_test, y_train, y_test = train_test_split(smaller_df['review'], smaller_df['rating'], test_size=0.3, random_state=44)
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(X_train)
X_test = vectorizer.transform(X_test)
model = LogisticRegression(max_iter=iterations)
print('Timing fit of model without limited dictionary...')
%timeit model.fit(X_train, y_train)
print(f'Model score: {model.score(X_test, y_test)}', end='\n\n')

X_train, X_test, y_train, y_test = train_test_split(smaller_df['review'], smaller_df['rating'], test_size=0.3, random_state=44)
vectorizer = CountVectorizer(vocabulary=significant_words)
X_train = vectorizer.fit_transform(X_train)
X_test = vectorizer.transform(X_test)
model = LogisticRegression(max_iter=iterations)
print('Timing fit of model with limited dictionary...')
%timeit model.fit(X_train, y_train)
print(f'Model score: {model.score(X_test, y_test)}')
#hint: %time, %timeit

Timing fit of model without limited dictionary...
38.5 s ± 873 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Model score: 0.9318154559628993

Timing fit of model with limited dictionary...
151 ms ± 1.81 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Model score: 0.8682285211689921


# Note:
    The difference in fitting time is enormous: about 151 ms with the predefined vocabulary versus 38.5 s with the full one (roughly 250 times faster), because the model only has to fit 20 features instead of one for every word in the reviews.
    The price is a worse score (0.868 versus 0.932), as already discussed above, since it is the same model as in a).
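
`%timeit` is an IPython magic and works only inside a notebook or IPython shell. For a plain Python script, a rough equivalent can be sketched with `time.perf_counter`. The corpus below is made up and tiny, so the measured times will not show the gap seen above; the helper `time_fit` is a hypothetical name.

```python
import time
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

reviews = ["love it great", "waste of money", "great product", "broke after a week"] * 50
labels = [1, -1, 1, -1] * 50

def time_fit(vectorizer):
    """Vectorize the toy corpus, then time a single LogisticRegression fit."""
    X = vectorizer.fit_transform(reviews)
    start = time.perf_counter()
    LogisticRegression(max_iter=2000).fit(X, labels)
    return time.perf_counter() - start, X.shape[1]

full_time, full_cols = time_fit(CountVectorizer())
small_time, small_cols = time_fit(
    CountVectorizer(vocabulary=["love", "great", "waste", "broke"]))

print(f"full vocabulary:    {full_cols} columns, {full_time:.4f} s")
print(f"limited vocabulary: {small_cols} columns, {small_time:.4f} s")
```

Unlike `%timeit`, this measures a single run, so repeating the call and averaging gives a steadier estimate.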