# Explore here

In [19]:
#Imports
import pandas as pd 
import numpy as np
import string
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV

Loading the data

In [2]:
df = pd.read_csv('/workspaces/https-github.com-4GeeksAcademy-machine-learning-python-template/data/interim/playstore_reviews.csv')
df

Unnamed: 0,package_name,review,polarity
0,com.facebook.katana,privacy at least put some option appear offli...,0
1,com.facebook.katana,"messenger issues ever since the last update, ...",0
2,com.facebook.katana,profile any time my wife or anybody has more ...,0
3,com.facebook.katana,the new features suck for those of us who don...,0
4,com.facebook.katana,forced reload on uploading pic on replying co...,0
...,...,...,...
886,com.rovio.angrybirds,loved it i loooooooooooooovvved it because it...,1
887,com.rovio.angrybirds,all time legendary game the birthday party le...,1
888,com.rovio.angrybirds,ads are way to heavy listen to the bad review...,0
889,com.rovio.angrybirds,fun works perfectly well. ads aren't as annoy...,1


Next, we clean the df following the dashboard instructions

In [3]:
#Dropping package_name variable as per instructions
df.drop(columns=['package_name'], inplace=True)

In [4]:
#Removing leading/trailing whitespace and converting the text to lowercase
df["review"] = df["review"].str.strip().str.lower()
df

Unnamed: 0,review,polarity
0,privacy at least put some option appear offlin...,0
1,"messenger issues ever since the last update, i...",0
2,profile any time my wife or anybody has more t...,0
3,the new features suck for those of us who don'...,0
4,forced reload on uploading pic on replying com...,0
...,...,...
886,loved it i loooooooooooooovvved it because it ...,1
887,all time legendary game the birthday party lev...,1
888,ads are way to heavy listen to the bad reviews...,0
889,fun works perfectly well. ads aren't as annoyi...,1


In [5]:
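#Delete punctuation marks (re.escape keeps characters such as ] and \ literal inside the character class)
import re
df['cleaned_review_text'] = df['review'].str.replace(f"[{re.escape(string.punctuation)}]", "", regex=True)
df.drop(columns=['review'], inplace=True)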

In [6]:
df

Unnamed: 0,polarity,cleaned_review_text
0,0,privacy at least put some option appear offlin...
1,0,messenger issues ever since the last update in...
2,0,profile any time my wife or anybody has more t...
3,0,the new features suck for those of us who dont...
4,0,forced reload on uploading pic on replying com...
...,...,...
886,1,loved it i loooooooooooooovvved it because it ...
887,1,all time legendary game the birthday party lev...
888,0,ads are way to heavy listen to the bad reviews...
889,1,fun works perfectly well ads arent as annoying...
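
To make the punctuation step concrete, here is a minimal sketch on a made-up Series. The sample strings and the names `sample`, `pattern` and `cleaned` are hypothetical, and the pattern is escaped with `re.escape` so that every character in `string.punctuation` is treated literally.

```python
import re
import string

import pandas as pd

# Hypothetical sample reviews, only for illustration (not rows from the dataset)
sample = pd.Series(["fun, works perfectly well.", "ads aren't as annoying!"])

# Build a character class that matches any single punctuation character
pattern = f"[{re.escape(string.punctuation)}]"

# Remove every punctuation character from each string
cleaned = sample.str.replace(pattern, "", regex=True)

print(cleaned.tolist())
```

Note that apostrophes are removed too, which is why "aren't" becomes "arent" in the cleaned column.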


In [7]:
df.shape

(891, 2)

In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 2 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   polarity             891 non-null    int64 
 1   cleaned_review_text  891 non-null    object
dtypes: int64(1), object(1)
memory usage: 14.1+ KB


In [9]:
df.value_counts()

polarity  cleaned_review_text
0         this update sucks i cant open the game anymore just crashed damn it i have an on going war but the app wont run if there is a half star option to rate this i would give you that fix this asap    2
1         5stars this app has saved my life on multiple occasionspictures are life and 

In [10]:
df.nunique()

polarity                 2
cleaned_review_text    890
dtype: int64
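
The two outputs above suggest that one review text appears twice (890 unique texts in 891 rows). The notebook keeps that row. If we wanted to inspect or drop such duplicates, a minimal sketch could look like the following; `toy` is a hypothetical miniature frame, not the real data.

```python
import pandas as pd

# Hypothetical miniature of the cleaned frame, only for illustration
toy = pd.DataFrame({
    "polarity": [0, 0, 1],
    "cleaned_review_text": ["this update sucks", "this update sucks", "loved it"],
})

# Count rows whose text repeats an earlier row
n_duplicates = toy.duplicated(subset="cleaned_review_text").sum()

# Keep only the first occurrence of each review text
deduplicated = toy.drop_duplicates(subset="cleaned_review_text").reset_index(drop=True)

print(n_duplicates, len(deduplicated))
```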

After cleaning our df, we can save a copy, cleaned_df, that contains the text in lowercase and without punctuation marks

In [11]:
cleaned_df = df.copy()
#index=False avoids writing the row index as an extra unnamed column
cleaned_df.to_csv('cleaned_df.csv', index=False)

Now let's split the data into training and test sets

In [12]:
X = cleaned_df['cleaned_review_text']
y = cleaned_df['polarity']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
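
As a quick check of what `test_size=0.2` means for 891 rows, here is a small sketch that splits a stand-in list of row numbers instead of the real reviews. The names `rows`, `train_rows` and `test_rows` are hypothetical.

```python
from sklearn.model_selection import train_test_split

# Stand-in for the 891 reviews: just the row numbers 0..890
rows = list(range(891))

# Same split settings as in the notebook
train_rows, test_rows = train_test_split(rows, test_size=0.2, random_state=42)

# The test size is rounded up, so 20% of 891 gives 179 test rows
print(len(train_rows), len(test_rows))
```

The 179 test rows match the `Length: 179` shown for `X_test` and the total support in the classification reports.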

Now let's vectorize these values

In [13]:
#CountVectorizer 
vectorizer = CountVectorizer()
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

In [17]:
X_test

709    lovehate has bug and security issues i tried t...
439    whatsapp i use this app now that blackberry me...
840                             usefully verry  nice app
720    fonts why in the heck is this thing analysing ...
39     app doesnt work after latest upgrade the faceb...
                             ...                        
433    app continuously losses connection at times i ...
773    way below expection why does it lag so much sc...
25     cant install error code 505 have samsung galax...
84     sort it out why can i not get my networks post...
10     what the heck cant get status updates to be in...
Name: cleaned_review_text, Length: 179, dtype: object

In [15]:
#Initialize and train the Multinomial Naive Bayes classifier
clf = MultinomialNB().fit(X_train_vec, y_train)

#Make predictions on the test data 
y_pred = clf.predict(X_test_vec)

#Metrics
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.84      0.95      0.89       126
           1       0.83      0.57      0.67        53

    accuracy                           0.84       179
   macro avg       0.84      0.76      0.78       179
weighted avg       0.84      0.84      0.83       179



The model has an overall accuracy of 84%, meaning it correctly classified 84% of all test reviews (both positive and negative). However, it misses a relatively large share of the positive reviews. Let's compare the two classes:

``Negative Reviews (class 0)``

Precision: 0.84 - When the model predicts a review is negative, it's correct 84% of the time.

Recall: 0.95 - Out of all actual negative reviews, it correctly identified 95% of them.

This means the model is very good at identifying negative reviews.


``Positive Reviews (class 1)``

Precision: 0.83 - When the model predicts a review is positive, it's right 83% of the time.

Recall: 0.57 - It only identifies 57% of the actual positive reviews correctly.

This means the model misses many positive reviews, classifying them as negative instead.
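
The numbers above can be traced back to counts of correct and incorrect predictions. The counts in this sketch are not notebook output: they are inferred from the rounded report and the supports (126 negative, 53 positive), so treat them as an assumption. Running `confusion_matrix(y_test, y_pred)` from `sklearn.metrics` would give the actual values.

```python
# Counts inferred from the rounded report (assumption, not notebook output)
tn, fp = 120, 6    # actual negatives: predicted negative / predicted positive
fn, tp = 23, 30    # actual positives: predicted negative / predicted positive

# Positive class (1)
precision_pos = tp / (tp + fp)
recall_pos = tp / (tp + fn)

# Negative class (0)
precision_neg = tn / (tn + fn)
recall_neg = tn / (tn + fp)

accuracy = (tp + tn) / (tp + tn + fp + fn)

print("class 1:", round(precision_pos, 2), round(recall_pos, 2))
print("class 0:", round(precision_neg, 2), round(recall_neg, 2))
print("accuracy:", round(accuracy, 2))
```

Under these inferred counts, about 23 of the 53 positive reviews are predicted as negative, which is what pulls the positive recall down to 0.57.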

Although the overall performance is good, let's tune the hyperparameters to improve recall on class 1 (positive reviews)

In [24]:
#Define the model
nb = MultinomialNB()

#Define parameter grid to search over
param_grid = {'alpha': [0.1, 0.5, 1.0, 2.0, 5.0]}

#Grid search, using recall as the scoring metric
grid_search = GridSearchCV(nb, param_grid, scoring='recall', cv=5)

In [25]:
#Fit on training data
grid_search.fit(X_train_vec, y_train)

GridSearchCV
Parameter,Value
estimator,MultinomialNB()
param_grid,"{'alpha': [0.1, 0.5, ...]}"
scoring,'recall'
n_jobs,
refit,True
cv,5
verbose,0
pre_dispatch,'2*n_jobs'
error_score,
return_train_score,False

best_estimator_: MultinomialNB
Parameter,Value
alpha,0.1
force_alpha,True
fit_prior,True
class_prior,

In [None]:
#Best model
best_nb = grid_search.best_estimator_

#Predict on test 
y_pred1 = best_nb.predict(X_test_vec)

#metrics
print("Best alpha:", grid_search.best_params_['alpha'])
print(classification_report(y_test, y_pred1))

Best alpha: 0.1
              precision    recall  f1-score   support

           0       0.88      0.90      0.89       126
           1       0.76      0.70      0.73        53

    accuracy                           0.84       179
   macro avg       0.82      0.80      0.81       179
weighted avg       0.84      0.84      0.84       179



This project focused on building a sentiment analysis model using a Multinomial Naive Bayes classifier to classify app reviews as positive or negative. The initial model achieved an overall accuracy of 84%, but showed low recall for the positive class (0.57), indicating that it frequently misclassified positive reviews as negative. Tuning the alpha hyperparameter through grid search selected α = 0.1, which improved the recall for positive reviews to 0.70 while overall accuracy stayed at 84%. The gain came with a trade-off: precision for the positive class fell from 0.83 to 0.76, and recall for the negative class fell from 0.95 to 0.90.