### Today we are going to perform a simple classification of the sentiment of Amazon reviews.

### Please, download the dataset amazon_baby.csv.

In [29]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression


def remove_punctuation(text):
    import string
    if isinstance(text, str):
        translator = str.maketrans('', '', string.punctuation)
        return text.translate(translator)
    return text
baby_df = pd.read_csv('./data/amazon_baby.csv')
baby_df.head()

Unnamed: 0,name,review,rating
0,Planetwise Flannel Wipes,"These flannel wipes are OK, but in my opinion ...",3
1,Planetwise Wipe Pouch,it came early and was not disappointed. i love...,5
2,Annas Dream Full Quilt with 2 Shams,Very soft and comfortable and warmer than it l...,5
3,Stop Pacifier Sucking without tears with Thumb...,This is a product well worth the purchase. I ...,5
4,Stop Pacifier Sucking without tears with Thumb...,All of my kids have cried non-stop when I trie...,5


## Exercise 1 (data preparation)
a) Remove punctuation from reviews using the given function.   
b) Replace all missing (NaN) reviews with the empty string "".  
c) Drop all the entries with rating = 3, as they have neutral sentiment.   
d) Set all positive ($\geq$4) ratings to 1 and negative ($\leq$2) to -1.

In [30]:
#a)

baby_df["review"] = baby_df["review"].apply(remove_punctuation)
#short test: 
baby_df["review"][4] == 'All of my kids have cried nonstop when I tried to ween them off their pacifier until I found Thumbuddy To Loves Binky Fairy Puppet  It is an easy way to work with your kids to allow them to understand where their pacifier is going and help them part from itThis is a must buy book and a great gift for expecting parents  You will save them soo many headachesThanks for this book  You all rock'
remove_punctuation(baby_df["review"][4]) == 'All of my kids have cried nonstop when I tried to ween them off their pacifier until I found Thumbuddy To Loves Binky Fairy Puppet  It is an easy way to work with your kids to allow them to understand where their pacifier is going and help them part from itThis is a must buy book and a great gift for expecting parents  You will save them soo many headachesThanks for this book  You all rock'

True

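I had to modify the function slightly: missing reviews are read as NaN, which is a float, so the original version failed when it tried to remove punctuation from them. The function now returns non-string values unchanged and works as expected.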
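
To illustrate why the type check matters, here is a small standalone sketch of the same function applied to a string and to a missing value. The sample sentence is made up and is not taken from the dataset.

```python
import string

def remove_punctuation(text):
    # Only strings have .translate; NaN (a float) and other types pass through unchanged.
    if isinstance(text, str):
        return text.translate(str.maketrans('', '', string.punctuation))
    return text

cleaned = remove_punctuation("Great product, isn't it?!")
missing = remove_punctuation(float("nan"))

print(cleaned)   # Great product isnt it
print(missing)   # nan
```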

In [31]:
#b)
baby_df["review"] = baby_df["review"].fillna("")

#short test:
baby_df["review"][38] == baby_df["review"][38]

True

The tested review is no longer NaN. We know that because NaN is never equal to itself, so the comparison above would have returned False for a missing value. Note that this test covers a single row only.
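
A check that covers the whole column could look like the sketch below. It uses a small made-up Series rather than the real dataset, and the variable names are only illustrative.

```python
import numpy as np
import pandas as pd

reviews = pd.Series(["nice", np.nan, "bad", np.nan])

# NaN is the only value that is not equal to itself.
print(np.nan == np.nan)  # False

# Count missing values before and after filling them with "".
missing_before = int(reviews.isna().sum())
reviews = reviews.fillna("")
missing_after = int(reviews.isna().sum())

print(missing_before, missing_after)
```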

In [32]:
#c)
baby_df = baby_df[baby_df['rating'] != 3].reset_index(drop=True)

#short test:
sum(baby_df["rating"] == 3)

0

As we can see above, no rating equal to 3 remains in the dataset.

In [33]:
#d) 
baby_df["rating"] = np.where(baby_df["rating"] >= 4, 1, -1)

#short test:
sum(baby_df["rating"]**2 != 1)


0

Now we have only two classes of ratings: 1 and -1.
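
Steps c) and d) can be traced on a toy table. This is only a sketch with made-up ratings; it also shows why the neutral rows have to be dropped first, since otherwise np.where would send rating 3 to -1.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"rating": [1, 2, 3, 4, 5]})

# c) drop the neutral rating before mapping
toy = toy[toy["rating"] != 3].reset_index(drop=True)

# d) ratings >= 4 become 1, everything that is left (<= 2) becomes -1
toy["sentiment"] = np.where(toy["rating"] >= 4, 1, -1)

labels = toy["sentiment"].tolist()
print(labels)  # [-1, -1, 1, 1]
```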

## CountVectorizer
In order to analyze strings, we need to assign them numerical values. We will use one of the simplest string representations, which transforms strings into $n$-dimensional vectors. The number of dimensions is the size of our dictionary, and each value of the vector is the number of appearances of the given word in the sentence.

In [34]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
reviews_train_example = ["We like apples",
                   "We hate oranges",
                   "I adore bananas",
                   "We like like apples and oranges",
                   "They dislike bananas"]

X_train_example = vectorizer.fit_transform(reviews_train_example)

print(vectorizer.get_feature_names_out())
print(X_train_example.todense())



['adore' 'and' 'apples' 'bananas' 'dislike' 'hate' 'like' 'oranges' 'they'
 'we']
[[0 0 1 0 0 0 1 0 0 1]
 [0 0 0 0 0 1 0 1 0 1]
 [1 0 0 1 0 0 0 0 0 0]
 [0 1 1 0 0 0 2 1 0 1]
 [0 0 0 1 1 0 0 0 1 0]]


I had to change get_feature_names() to get_feature_names_out(): the old method was deprecated in scikit-learn 1.0 and removed in version 1.2.

## Vectorizer: 
is a tool used in Natural Language Processing (NLP) to convert text data into numerical data. This process is essential because machine learning models cannot process raw text data directly; they require numerical input.
## CountVectorizer:
is a type of text vectorization technique that converts a collection of text documents (such as sentences or paragraphs) into a matrix of token counts, where each token is a word. It essentially transforms text into a bag-of-words (BoW) representation.
## How does CountVectorizer Work?
- **Tokenization**: It breaks down the text into individual words, known as tokens.
- **Building the Vocabulary**: It creates a vocabulary (dictionary) of all unique words present in the dataset.
- **Counting Word Frequencies**: For each document, it counts the occurrences of each word in the vocabulary.
- **Creating the Matrix**: It represents the document as a sparse matrix, where:
    - Rows correspond to each document in the dataset.
    - Columns correspond to each unique word in the vocabulary.
    - The values represent the frequency of the word in that specific document.
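
The four steps above can be sketched in plain Python. This is a simplified illustration and not scikit-learn's implementation: it assumes lowercasing and tokens of at least two word characters (the CountVectorizer defaults), and it builds a dense list of lists instead of a sparse matrix.

```python
import re
from collections import Counter

docs = ["We like apples",
        "We hate oranges",
        "I adore bananas",
        "We like like apples and oranges",
        "They dislike bananas"]

def tokenize(text):
    # Lowercase, then keep runs of two or more word characters.
    return re.findall(r"\b\w\w+\b", text.lower())

# Vocabulary: every unique token, sorted alphabetically.
vocabulary = sorted({tok for doc in docs for tok in tokenize(doc)})

# One row per document, one column per vocabulary word.
matrix = []
for doc in docs:
    counts = Counter(tokenize(doc))
    matrix.append([counts[word] for word in vocabulary])

print(vocabulary)
for row in matrix:
    print(row)
```

On these five sentences the sketch gives the same vocabulary and counts as the CountVectorizer output shown earlier.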

In [35]:
reviews_test_example = ["They like bananas",
                   "We hate oranges bananas and apples",
                   "We love bananas"] #New word!

X_test_example = vectorizer.transform(reviews_test_example)

print(X_test_example.todense())

[[0 0 0 1 0 0 1 0 1 0]
 [0 1 1 1 0 1 0 1 0 1]
 [0 0 0 1 0 0 0 0 0 1]]


We should acknowledge a few facts. Firstly, CountVectorizer does not take word order into account. Secondly, it ignores one-letter words (this can be changed during initialization). Finally, when transforming test data, CountVectorizer ignores words which are not in its dictionary, such as "love" above.

## Exercise 2 
a) Split dataset into training and test sets.     
b) Transform reviews into vectors using CountVectorizer. 

In [36]:
#a)
from sklearn.model_selection import train_test_split
train_x, test_x, train_y, test_y = train_test_split(baby_df["review"], baby_df["rating"], test_size=0.2, random_state=42)


In [37]:
#b)

vectorizer = CountVectorizer()

train_x = vectorizer.fit_transform(train_x)
test_x = vectorizer.transform(test_x)


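Using train_test_split I split the data from baby_df into a training set (80%) and a test set (20%).
Then I used CountVectorizer to transform the reviews into vectors: the vocabulary is fitted on the training reviews only, and the test reviews are transformed with that same vocabulary.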
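
A toy-sized sketch of the same pattern is shown below. The six texts and labels are made up; the point is only that fit_transform is called on the training part, transform on the test part, and both matrices end up with the same columns.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

texts = ["love it", "hate it", "works well", "broke fast", "great toy", "bad toy"]
labels = [1, -1, 1, -1, 1, -1]

# test_size=2 means two samples go to the test set.
tr_x, te_x, tr_y, te_y = train_test_split(texts, labels, test_size=2, random_state=42)

vec = CountVectorizer()
train_matrix = vec.fit_transform(tr_x)  # vocabulary learned from training texts only
test_matrix = vec.transform(te_x)       # same columns, no refitting

print(train_matrix.shape, test_matrix.shape)
```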

## Exercise 3 
a) Train LogisticRegression model on training data (reviews processed with CountVectorizer, ratings as they were).   
b) Print 10 most positive and 10 most negative words.

In [38]:
#a)
model = LogisticRegression(max_iter = 1000)

model.fit(train_x, train_y)


In [39]:
#b)
feature_names = vectorizer.get_feature_names_out()
coefficients = model.coef_[0]

sorted_indices = np.argsort(coefficients)

most_negative = [feature_names[i] for i in sorted_indices[:10]]
most_positive = [feature_names[i] for i in sorted_indices[-10:]]

print("Most negative words: ", most_negative)
print("Most positive words: ", most_positive)


#hint: model.coef_, vectorizer.get_feature_names_out()

Most negative words:  ['dissapointed', 'worthless', 'worst', 'useless', 'poorly', 'disappointing', 'unusable', 'disappointed', 'unacceptable', 'poor']
Most positive words:  ['wonderfully', 'hinder', 'saves', 'skeptical', 'rich', 'thankful', 'con', 'ply', 'minor', 'lifesaver']


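The most negative words clearly match the sentiment of the reviews: worthless, worst, useless and disappointed are all plausible signs of a bad review. The most positive list is less obvious. Words such as wonderfully, thankful and lifesaver make sense, while hinder, skeptical, con and minor are not positive on their own; they probably appear in positive reviews in phrases like "skeptical at first" or "minor con", but this is only a guess. 
I also see that dissapointed and disappointed are the same word, but the model treats them as different features because the reviews contain spelling mistakes.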
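
How the argsort-based selection picks these words can be seen on a tiny made-up example. The words and coefficients below are invented for illustration. np.argsort sorts in ascending order, so the slice of the last k indices ends with the most positive word; the sketch reverses it to list the most positive word first.

```python
import numpy as np

words = np.array(["bad", "great", "ok", "awful", "love"])
coefs = np.array([-1.2, 0.9, 0.1, -2.0, 1.5])  # made-up coefficients

order = np.argsort(coefs)  # indices from the lowest to the highest coefficient

most_negative = words[order[:2]].tolist()         # most negative first
most_positive = words[order[-2:][::-1]].tolist()  # reversed: most positive first

print(most_negative)  # ['awful', 'bad']
print(most_positive)  # ['love', 'great']
```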
  

## Code below 
The cell below measures the running time, which we need in the last exercise. 
It is placed here because it is easier to time the same instructions, in the same order, in one cell.
<br>For now, we can skip it. 

In [40]:
import datetime

train_x, test_x, train_y, test_y = train_test_split(baby_df["review"], baby_df["rating"], test_size=0.2, random_state=42)

s_time = datetime.datetime.now()

vectorizer = CountVectorizer()

train_x = vectorizer.fit_transform(train_x)
test_x = vectorizer.transform(test_x)

model = LogisticRegression(max_iter = 1000)
model.fit(train_x, train_y)

feature_names = vectorizer.get_feature_names_out()
coefficients = model.coef_[0]

sorted_indices = np.argsort(coefficients)

most_negative = [feature_names[i] for i in sorted_indices[:10]]
most_positive = [feature_names[i] for i in sorted_indices[-10:]]

print("Most negative words: ", most_negative)
print("Most positive words: ", most_positive)


predicted_sentiments = model.predict(test_x)
predicted_sentiments_proba = model.predict_proba(test_x)

e_time = datetime.datetime.now()
unlimited_vocab_time = e_time - s_time


Most negative words:  ['dissapointed', 'worthless', 'worst', 'useless', 'poorly', 'disappointing', 'unusable', 'disappointed', 'unacceptable', 'poor']
Most positive words:  ['wonderfully', 'hinder', 'saves', 'skeptical', 'rich', 'thankful', 'con', 'ply', 'minor', 'lifesaver']


## Exercise 4 
a) Predict the sentiment of test data reviews.   
b) Predict the sentiment of test data reviews in terms of probability.   
c) Find five most positive and most negative reviews.   
d) Calculate the accuracy of predictions.

In [41]:
#a)
predicted_sentiments = model.predict(test_x)

In [42]:
#b)
predicted_sentiments_proba = model.predict_proba(test_x)
#hint: model.predict_proba()

## What is the difference between model.predict() and model.predict_proba()
These two methods are commonly used with classification models, but they serve different purposes.<br>
- model.predict():
    - purpose: used to make a prediction about the class label for each sample in your input data. It returns the most likely class based on the learned model.
    - output: An array of class labels. Each element in the output corresponds to the predicted class for a sample.
- model.predict_proba():   
    - purpose: used to obtain the probability estimates for each class. It returns the probability that a given sample belongs to each possible class.
    - output: 
        - A 2D array where each row corresponds to a sample and each column corresponds to a class.  
        -  The values in each row are probabilities that sum to 1.0.
## When to use each method?
- model.predict() when: 
    - You only care about the predicted class and don't need to know the confidence level.
    - You need a simple classification decision (e.g., spam vs. not spam).
- model.predict_proba() when:
    - You want to understand the model's confidence in its predictions.
    - You need to set custom thresholds for classifying samples.
        - For example, you may only want to classify a sample as "positive" if the probability is greater than 0.7.
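
The difference can be sketched without training a model. The probability array below is made up and only stands in for the output of predict_proba; its columns are assumed to follow the class order [-1, 1], as model.classes_ would for our labels.

```python
import numpy as np

classes = np.array([-1, 1])        # assumed class order, as in model.classes_
proba = np.array([[0.10, 0.90],
                  [0.35, 0.65],
                  [0.80, 0.20]])   # made-up predict_proba output

# For a binary model, predict() corresponds to picking the more probable class.
default_labels = classes[np.argmax(proba, axis=1)].tolist()

# Custom rule: call a review positive only if P(positive) > 0.7.
strict_labels = np.where(proba[:, 1] > 0.7, 1, -1).tolist()

print(default_labels)  # [1, 1, -1]
print(strict_labels)   # [1, -1, -1]
```

The second review changes label under the stricter threshold, which is exactly the kind of decision predict() alone cannot express.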

In [43]:
#c) #d)

df = pd.DataFrame({
    "review": baby_df["review"].loc[test_y.index],  # test_y.index holds labels, so use label-based .loc
    "true_rating": test_y.values,
    "predicted_rating": predicted_sentiments,
    "predicted_rating_proba": predicted_sentiments_proba[:, 1]
})

most_positive = df.sort_values("predicted_rating_proba", ascending=False).head(5)
most_negative = df.sort_values("predicted_rating_proba").head(5)
accuracy_1 = np.mean(predicted_sentiments == test_y)

print(f"Most positive reviews: {most_positive}")
print(f"Most negative reviews: {most_negative}")
print(f"Accuracy: {accuracy_1}")

#hint: use the results of b)

Most positive reviews:                                                    review  true_rating  \
130545  I was a little nervous about ordering this bab...            1   
99578   My wifes 5 and Im about 56 our baby is within ...            1   
112343  I did a TON of research before I purchased thi...            1   
51918   I started wearing the Babyplus when I was 18 w...            1   
164117  After much research I purchased an Urbo2 Its e...            1   

        predicted_rating  predicted_rating_proba  
130545                 1                     1.0  
99578                  1                     1.0  
112343                 1                     1.0  
51918                  1                     1.0  
164117                 1                     1.0  
Most negative reviews:                                                    review  true_rating  \
134430  My disappointment with this product prompted m...           -1   
159179  I had to return this stroller for three reason.

## Results:
- It seems like the most positive and most negative reviews were **found correctly**. 
- **The accuracy** of the model is quite high, which is a good sign. It is **0.93**, which means that 93% of the test reviews were classified correctly.
- We can also see that **predicted_rating_proba** is 1.0 for the 5 most positive reviews and very close to 0 for each of the 5 most negative reviews. It is the probability of the positive class, so it cannot be negative.
- **Predicted rating** is always -1 or 1, as expected.
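
To put an accuracy value in context, it can help to compare it with a baseline that always predicts the most frequent class. The sketch below uses made-up labels, not the notebook's results, and the baseline idea is an addition that the exercise itself does not ask for.

```python
import numpy as np

true = np.array([1, 1, 1, 1, -1, 1, -1, 1])   # made-up true labels
pred = np.array([1, 1, -1, 1, -1, 1, -1, 1])  # made-up predictions

# Accuracy: share of predictions that match the true label.
accuracy = float(np.mean(pred == true))

# Baseline: always predict the most frequent class.
values, counts = np.unique(true, return_counts=True)
majority_class = int(values[np.argmax(counts)])
baseline = float(np.mean(true == majority_class))

print(accuracy, baseline)  # 0.875 0.75
```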

## Exercise 5
In this exercise we will limit the dictionary of CountVectorizer to the set of significant words, defined below.


a) Redo exercises 2-4 using limited dictionary.   
b) Check the impact of all the words from the dictionary.   
c) Compare accuracy of predictions and the time of evaluation.

In [44]:
significant_words = ['love','great','easy','old','little','perfect','loves','well','able','car','broke','less','even','waste','disappointed','work','product','money','would','return']

In [45]:
#a)
import datetime

train_x, test_x, train_y, test_y = train_test_split(baby_df["review"], baby_df["rating"], test_size=0.2, random_state=42)

s_time = datetime.datetime.now()

vectorizer = CountVectorizer(vocabulary=significant_words)

train_x = vectorizer.fit_transform(train_x)
test_x = vectorizer.transform(test_x)

model = LogisticRegression(max_iter = 1000)
model.fit(train_x, train_y)

feature_names = vectorizer.get_feature_names_out()
coefficients = model.coef_[0]

sorted_indices = np.argsort(coefficients)

most_negative = [feature_names[i] for i in sorted_indices[:10]]
most_positive = [feature_names[i] for i in sorted_indices[-10:]]

print("Most negative words: ", most_negative)
print("Most positive words: ", most_positive)


predicted_sentiments = model.predict(test_x)
predicted_sentiments_proba = model.predict_proba(test_x)

e_time = datetime.datetime.now()
limited_vocab_time = e_time - s_time

df = pd.DataFrame({
    "review": baby_df["review"].loc[test_y.index],  # test_y.index holds labels, so use label-based .loc
    "true_rating": test_y.values,
    "predicted_rating": predicted_sentiments,
    "predicted_rating_proba": predicted_sentiments_proba[:, 1]
})


most_positive = df.sort_values("predicted_rating_proba", ascending=False).head(5)
most_negative = df.sort_values("predicted_rating_proba").head(5)

accuracy_2 = np.mean(predicted_sentiments == test_y)

print(f"Most positive reviews: {most_positive}")
print(f"Most negative reviews: {most_negative}")
print(f"Accuracy: {accuracy_2}")


Most negative words:  ['disappointed', 'return', 'waste', 'broke', 'money', 'work', 'even', 'would', 'product', 'less']
Most positive words:  ['old', 'car', 'able', 'well', 'little', 'great', 'easy', 'love', 'perfect', 'loves']
Most positive reviews:                                                    review  true_rating  \
122030  We bought this stroller after selling our belo...            1   
68033   We love this highchair  We have a 4 year old a...            1   
122843  Weve been using Britax for our boy now 14 mont...            1   
137273  I did tons of research on strollers I knew I w...            1   
66949   UPDATE 112013  I went ahead and used a tiny bi...            1   

        predicted_rating  predicted_rating_proba  
122030                 1                     1.0  
68033                  1                     1.0  
122843                 1                     1.0  
137273                 1                     1.0  
66949                  1                     1.0 

## Comment:
- **Accuracy**:
    - The accuracy of the model with the limited vocabulary is **0.87**, which is noticeably worse than the previous **0.93**.
    - A likely reason is that with the full vocabulary the model can learn by itself which words are useful for classification, instead of being restricted to 20 hand-picked ones.
- **Result**:
    - The most positive and negative reviews are different compared to the previous model's.
    - It does not necessarily mean that the model is worse, but it is undoubtedly different.    

In [46]:
#b)

feature_names = vectorizer.get_feature_names_out()
coefficients = model.coef_[0]

word_impact = pd.DataFrame({
    "word": feature_names,
    "impact": coefficients
}).sort_values("impact", ascending=False)

print("\nWord_impact\n", word_impact)



Word_impact
             word    impact
6          loves  1.684972
5        perfect  1.515068
0           love  1.359000
2           easy  1.193224
1          great  0.930882
4         little  0.502431
7           well  0.496196
8           able  0.193270
9            car  0.074529
3            old  0.073441
11          less -0.201570
16       product -0.313727
18         would -0.342239
12          even -0.489719
15          work -0.635649
17         money -0.946424
10         broke -1.680640
13         waste -1.979571
19        return -2.092836
14  disappointed -2.398751


## What is word_impact in our model?
- word_impact represents the influence of each word (feature) on the predictions made by our model.
- In a linear model like Logistic Regression, each feature (word) is assigned a coefficient. The sign gives the direction of the impact (positive pushes towards class 1, negative towards class -1) and the magnitude gives its strength.
- **model.coef_** is a NumPy array of shape (1, number of words) for binary classification, where each element is the coefficient of the corresponding word.
- **model.coef_[0]** selects that single row; its values describe the positive class (1), i.e. the second entry of model.classes_.
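
As a worked example, the coefficients can be combined by hand into a probability. The four coefficients below are copied from the word_impact output above; the intercept is a hypothetical 0.0, because the notebook does not print model.intercept_, so the resulting numbers are illustrative only.

```python
import math

# Coefficients copied from the word_impact output above.
coef = {"loves": 1.684972, "great": 0.930882,
        "disappointed": -2.398751, "return": -2.092836}
intercept = 0.0  # hypothetical value; the real model.intercept_ is not shown

def positive_probability(word_counts):
    # Linear score: intercept + sum(count * coefficient), then the logistic function.
    score = intercept + sum(coef[word] * n for word, n in word_counts.items())
    return 1.0 / (1.0 + math.exp(-score))

p_good = positive_probability({"loves": 1, "great": 1})
p_bad = positive_probability({"disappointed": 1, "return": 1})

print(round(p_good, 3), round(p_bad, 3))
```

Under this assumption a review containing "loves" and "great" gets a probability well above 0.5, and one containing "disappointed" and "return" gets a probability close to 0.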

In [48]:
#c)
print(f"Time of evaluation without limited vocabulary (1): {unlimited_vocab_time}")
print(f"Time of evaluation with limited vocabulary (2): {limited_vocab_time}")
print(f"Time of evaluation difference (1-2): {abs(unlimited_vocab_time - limited_vocab_time)}")

print(f"Accuracy without limited vocabulary (1): {accuracy_1}")
print(f"Accuracy with limited vocabulary (2): {accuracy_2}")
print(f"Accuracy difference (2): {abs(accuracy_2 - accuracy_1)}")

#hint: %time, %timeit

Time of evaluation without limited vocabulary (1): 0:00:28.292363
Time of evaluation with limited vocabulary (2): 0:00:06.167744
Time of evaluation difference (1-2): 0:00:22.124619
Accuracy without limited vocabulary (1): 0.9326856765914066
Accuracy with limited vocabulary (2): 0.8689994303019399
Accuracy difference (2): 0.06368624628946662


## Comment:
- **Time**:
    - time of evaluation with limited vocabulary is **much shorter** than without limited vocabulary
    - it is because the model has to process far fewer features (20 words instead of the full vocabulary)
    - the difference is about 22 seconds (28.3 s vs 6.2 s), which is significant
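
The timings above were taken with datetime.datetime.now(). An alternative sketch uses time.perf_counter(), a clock intended for measuring short intervals. The helper name timed is hypothetical, and the summation is just a stand-in for the vectorize, fit and predict steps.

```python
import time

def timed(fn, *args, **kwargs):
    # Run fn once and return its result together with the elapsed seconds.
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed

# Stand-in workload; in the notebook this would be the whole pipeline.
result, elapsed = timed(sum, range(1_000_000))

print(f"result={result}, elapsed={elapsed:.4f} s")
```

A single run is still noisy; repeating the measurement a few times (or using the %timeit hint) gives a more reliable comparison.
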
## Conclusion:
- There is no universally best choice here. Sometimes using a limited vocabulary is more efficient.
- In our case the model with the limited vocabulary is less accurate (0.87 vs 0.93) but much faster, which is a trade-off. 