# Table of contents

1. [Problem statement and data collection](#1-problem-statement-and-data-collection)
2. [Exploration and data cleaning](#2-exploration-and-data-cleaning)
3. [Split train and test](#3-split-train-and-test)
4. [Build a naive bayes model](#4-build-a-naive-bayes-model)
    - [4.1 GaussianNB](#41-gaussiannb)
    - [4.2 MultinomialNB](#42-multinomialnb)
    - [4.3 BernoulliNB](#43-bernoullinb)
5. [Optimise the previous model](#5-optimise-the-previous-model)
6. [Explore other alternatives](#6-explore-other-alternatives)

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

## 1. Problem statement and data collection

In [43]:
# Standard library
import json
import pickle
import warnings
from pickle import dump

# Third-party libraries
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.model_selection import train_test_split

# Silence warnings so they do not clutter the notebook output.
def warn(*args, **kwargs):
    pass

warnings.warn = warn
warnings.filterwarnings("ignore", category=FutureWarning)

In [44]:
total_data = pd.read_csv("https://raw.githubusercontent.com/4GeeksAcademy/naive-bayes-project-tutorial/main/playstore_reviews.csv")
total_data.head()

| | package_name | review | polarity |
|---|---|---|---|
| 0 | com.facebook.katana | privacy at least put some option appear offli... | 0 |
| 1 | com.facebook.katana | messenger issues ever since the last update, ... | 0 |
| 2 | com.facebook.katana | profile any time my wife or anybody has more ... | 0 |
| 3 | com.facebook.katana | the new features suck for those of us who don... | 0 |
| 4 | com.facebook.katana | forced reload on uploading pic on replying co... | 0 |


In this project, we aim to analyse the sentiment of a collection of reviews from the Play Store. We are working with three variables: two predictors and a dichotomous label. Of the two predictors, our primary focus is on the review text, as the classification of a comment as positive or negative depends on its content rather than the application from which it originates.

## 2. Exploration and data cleaning

In [45]:
total_data.shape

(891, 3)

The dataframe has 891 rows and three columns.

In [46]:
total_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 3 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   package_name  891 non-null    object
 1   review        891 non-null    object
 2   polarity      891 non-null    int64 
dtypes: int64(1), object(2)
memory usage: 21.0+ KB


After a quick check, we verify that there are no null values in the dataframe.

In [47]:
total_data.drop(["package_name"], axis = 1, inplace = True)
total_data.head()

| | review | polarity |
|---|---|---|
| 0 | privacy at least put some option appear offli... | 0 |
| 1 | messenger issues ever since the last update, ... | 0 |
| 2 | profile any time my wife or anybody has more ... | 0 |
| 3 | the new features suck for those of us who don... | 0 |
| 4 | forced reload on uploading pic on replying co... | 0 |


We remove the *package_name* variable, as we are only interested in the review text for analysing the sentiment.

In [48]:
total_data["review"] = total_data["review"].str.strip().str.lower()

To prepare the text for analysis, we strip leading and trailing whitespace and convert all text to lower case.
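To make the effect of this step concrete, here is a minimal sketch using plain Python strings; pandas' `.str.strip()` and `.str.lower()` apply the same string methods to each element of the column. The sample reviews and the `normalise` helper are made up for illustration and are not part of the dataset or the notebook.

```python
# Hypothetical sample reviews (not taken from the dataset).
raw_reviews = ["  Great APP!  ", "Keeps   crashing ", "\tLove it\n"]


def normalise(text: str) -> str:
    """Trim leading/trailing whitespace and lower-case the text."""
    return text.strip().lower()


clean_reviews = [normalise(review) for review in raw_reviews]
print(clean_reviews)
# Whitespace inside a review is left untouched: only the ends are trimmed.
```

Note that repeated spaces between words survive this step; they are handled later by the tokenisation of the vectoriser.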

## 3. Split train and test

In [49]:
x = total_data["review"]
y = total_data["polarity"]
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2, random_state = 42)


x_train.head()

331    just did the latest update on viber and yet ag...
733    keeps crashing it only works well in extreme d...
382    the fail boat has arrived the 6.0 version is t...
704    superfast, just as i remember it ! opera mini ...
813    installed and immediately deleted this crap i ...
Name: review, dtype: object

## 4. Build a naive bayes model

The first step is to transform the text into a word count matrix.

In [50]:
from sklearn.feature_extraction.text import CountVectorizer

vec_model = CountVectorizer(stop_words = "english")
x_train_vec = vec_model.fit_transform(x_train).toarray()
x_test_vec = vec_model.transform(x_test).toarray()
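As a rough illustration of what a word count matrix is, the sketch below builds one by hand with the standard library. It is a simplified stand-in, not scikit-learn's implementation: the toy corpus and the helper names `tokenize` and `transform` are assumptions, and stop-word removal (used in the notebook via `stop_words = "english"`) is left out for brevity.

```python
import re
from collections import Counter

# Hypothetical toy corpus (not from the dataset).
train_docs = ["app keeps crashing", "love this app", "crashing again and again"]
test_docs = ["love the new update"]


def tokenize(text):
    # Lower-case and keep tokens of two or more word characters,
    # which mirrors CountVectorizer's default token pattern.
    return re.findall(r"\b\w\w+\b", text.lower())


# "fit": build a sorted vocabulary from the training documents only.
vocabulary = sorted({tok for doc in train_docs for tok in tokenize(doc)})


def transform(docs):
    # "transform": count each vocabulary word per document;
    # words that were not seen during fit are dropped.
    rows = []
    for doc in docs:
        counts = Counter(tokenize(doc))
        rows.append([counts[word] for word in vocabulary])
    return rows


x_train_toy = transform(train_docs)
x_test_toy = transform(test_docs)
print(vocabulary)
print(x_train_toy)
print(x_test_toy)
```

The vocabulary is learned from the training documents only, which is why the notebook calls `fit_transform` on `x_train` but only `transform` on `x_test`.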

### 4.1 GaussianNB

In [51]:
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score
from pickle import dump

accuracy_results = []

model = GaussianNB()

model.fit(x_train_vec, y_train)
y_pred = model.predict(x_train_vec)

accuracy = accuracy_score(y_train, y_pred)
accuracy_results.append(accuracy_score(y_train, y_pred))

print("Accuracy:", accuracy_results)

Accuracy: [0.9859550561797753]


In [52]:
model = GaussianNB()

model.fit(x_train_vec, y_train)
y_pred_test = model.predict(x_test_vec)

accuracy = accuracy_score(y_test, y_pred_test)
print("Test accuracy:", accuracy)

Test accuracy: 0.8044692737430168


We trained a GaussianNB model: the train accuracy is 98.6% and the test accuracy is 80.4%. This is a reasonable baseline for a model whose parameters have not been optimised yet, although the gap of about 18 percentage points between train and test suggests overfitting.
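For reference, accuracy is simply the fraction of predictions that match the true labels. The sketch below spells that out with a hand-written `accuracy` helper (a hypothetical name, shown only for illustration alongside scikit-learn's `accuracy_score`) and then computes the train/test gap from the two GaussianNB figures printed above.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Hypothetical labels, for illustration only.
y_true_toy = [0, 1, 1, 0, 1]
y_pred_toy = [0, 1, 0, 0, 1]
print(accuracy(y_true_toy, y_pred_toy))  # 4 of 5 predictions are correct

# Gap between the train and test accuracy reported above for GaussianNB.
train_acc, test_acc = 0.9859550561797753, 0.8044692737430168
gap_points = (train_acc - test_acc) * 100
print(f"gap: {gap_points:.1f} percentage points")
```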

**SAVING THE MODEL**:

In [53]:
# Use a context manager so the file is closed after writing.
with open("/workspaces/E-Pablos-Naive-Bayes-Reviews/models/naive_bayes_GaussianNB.sav", "wb") as f:
    dump(model, f)
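A saved model is only useful if it can be read back. The following minimal sketch shows a pickle round trip with a plain dictionary standing in for the fitted model; the object, the temporary directory and the file name `toy_model.sav` are all assumptions made for the example.

```python
import pickle
import tempfile
from pathlib import Path

# Stand-in for a fitted model: any picklable Python object works here.
toy_model = {"name": "naive_bayes_demo", "alpha": 1.0}

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = Path(tmp_dir) / "toy_model.sav"  # hypothetical file name

    # Write the object; the context manager closes the file afterwards.
    with open(model_path, "wb") as f:
        pickle.dump(toy_model, f)

    # Read it back to confirm the round trip.
    with open(model_path, "rb") as f:
        restored = pickle.load(f)

print(restored == toy_model)
```

The same `pickle.load` call would apply to the `.sav` files written in this notebook, provided the same scikit-learn version is installed when loading.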

### 4.2 MultinomialNB

In [54]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

accuracy_results = []

model = MultinomialNB(
    alpha=1.0,
    fit_prior=True,
)

model.fit(x_train_vec, y_train)
y_pred = model.predict(x_train_vec)

accuracy = accuracy_score(y_train, y_pred)
accuracy_results.append(accuracy_score(y_train, y_pred))

print("Accuracy:", accuracy_results)

Accuracy: [0.9606741573033708]


In [55]:
model = MultinomialNB(
    alpha=1.0,
    fit_prior=True,
)


model.fit(x_train_vec, y_train)
y_pred_test = model.predict(x_test_vec)

accuracy = accuracy_score(y_test, y_pred_test)
print("Test accuracy:", accuracy)

Test accuracy: 0.8156424581005587


After training the model with MultinomialNB, we observed a train accuracy of 96.1% and a test accuracy of 81.6%. Although the train accuracy dropped by about 2.5 percentage points compared with GaussianNB, the test accuracy improved by about 1 percentage point.

**SAVING THE MODEL**:

In [56]:
# Use a context manager so the file is closed after writing.
with open("/workspaces/E-Pablos-Naive-Bayes-Reviews/models/naive_bayes_MultinomialNB.sav", "wb") as f:
    dump(model, f)

### 4.3 BernoulliNB

In [57]:
from sklearn.naive_bayes import BernoulliNB
from sklearn.metrics import accuracy_score

accuracy_results = []

model = BernoulliNB(
    alpha=1.0,
    binarize=0.0,
)

model.fit(x_train_vec, y_train)
y_pred = model.predict(x_train_vec)

accuracy = accuracy_score(y_train, y_pred)
accuracy_results.append(accuracy_score(y_train, y_pred))

print("Accuracy:", accuracy_results)

Accuracy: [0.9199438202247191]


In [58]:
model = BernoulliNB(
    alpha=1.0,
    binarize=0.0,
)
model.fit(x_train_vec, y_train)
y_pred_test = model.predict(x_test_vec)

accuracy = accuracy_score(y_test, y_pred_test)
print("Test accuracy:", accuracy)

Test accuracy: 0.770949720670391


BernoulliNB turns out not to be the best approach for this dataset: the train accuracy drops to 92.0% and the test accuracy to only 77.1%.
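To compare the three variants side by side, the sketch below collects the accuracies printed in the cells above and reports the train/test gap for each. The numbers are copied from the notebook outputs; the `results` dictionary and the layout of the printed table are just one possible way to present them.

```python
# Accuracies copied from the outputs above: (train, test).
results = {
    "GaussianNB": (0.9859550561797753, 0.8044692737430168),
    "MultinomialNB": (0.9606741573033708, 0.8156424581005587),
    "BernoulliNB": (0.9199438202247191, 0.770949720670391),
}

print(f"{'model':<15}{'train %':>9}{'test %':>9}{'gap':>7}")
for name, (train_acc, test_acc) in results.items():
    gap = (train_acc - test_acc) * 100  # gap in percentage points
    print(f"{name:<15}{train_acc * 100:>9.1f}{test_acc * 100:>9.1f}{gap:>7.1f}")

# Pick the model with the highest test accuracy.
best_on_test = max(results, key=lambda name: results[name][1])
print("Best test accuracy:", best_on_test)
```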

**SAVING THE MODEL**:

In [59]:
# Use a context manager so the file is closed after writing.
with open("/workspaces/E-Pablos-Naive-Bayes-Reviews/models/naive_bayes_BernoulliNB.sav", "wb") as f:
    dump(model, f)

## 5. Optimise the previous model

We use a **GridSearch** to optimise the previous model, using MultinomialNB.

In [60]:
from sklearn.model_selection import GridSearchCV

model = MultinomialNB(
    alpha=1.0,
    fit_prior=True,
)

param_grid = {
    'alpha': [0.1, 0.5, 1.0, 2.0, 5.0],
    'fit_prior': [True, False]
}


grid_search = GridSearchCV(
    estimator=model,
    param_grid=param_grid,
    scoring='accuracy',     
    cv=5,                   
    n_jobs=-1,              
    verbose=2               
)
grid_search.fit(x_train_vec, y_train)

print("Best parameters:", grid_search.best_params_)
print("Best accuracy:", grid_search.best_score_)

Fitting 5 folds for each of 10 candidates, totalling 50 fits
[CV] END ..........................alpha=0.1, fit_prior=True; total time=   0.0s
[CV] END ..........................alpha=0.1, fit_prior=True; total time=   0.1s


[CV] END ..........................alpha=0.1, fit_prior=True; total time=   0.0s
[CV] END ..........................alpha=0.1, fit_prior=True; total time=   0.1s
[CV] END ..........................alpha=0.1, fit_prior=True; total time=   0.0s
[CV] END .........................alpha=0.1, fit_prior=False; total time=   0.0s
[CV] END .........................alpha=0.1, fit_prior=False; total time=   0.0s
[CV] END .........................alpha=0.1, fit_prior=False; total time=   0.1s
[CV] END .........................alpha=0.1, fit_prior=False; total time=   0.0s
[CV] END ..........................alpha=0.5, fit_prior=True; total time=   0.0s
[CV] END ..........................alpha=0.5, fit_prior=True; total time=   0.0s
[CV] END .........................alpha=0.1, fit_prior=False; total time=   0.1s
[CV] END ..........................alpha=0.5, fit_prior=True; total time=   0.0s
[CV] END ..........................alpha=0.5, fit_prior=True; total time=   0.0s
[CV] END ...................
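The first line of the log ("5 folds for each of 10 candidates, totalling 50 fits") follows directly from the size of the parameter grid. A small sketch of that arithmetic, using the same values as `param_grid` above (the variable names `cv_folds`, `candidates` and `total_fits` are chosen for this example):

```python
from itertools import product

param_grid = {
    "alpha": [0.1, 0.5, 1.0, 2.0, 5.0],
    "fit_prior": [True, False],
}
cv_folds = 5

# Every combination of the listed values is one candidate.
candidates = [
    dict(zip(param_grid, values)) for values in product(*param_grid.values())
]
total_fits = len(candidates) * cv_folds

print(len(candidates), "candidates,", total_fits, "fits")
print(candidates[0])
```

With `refit=True` (the default), the search additionally retrains the best candidate once on the full training set; that final fit is not included in the 50.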

In [61]:
best_modelo = grid_search.best_estimator_

y_pred = best_modelo.predict(x_test_vec)
print("Test accuracy:", accuracy_score(y_test, y_pred))


Test accuracy: 0.8212290502793296


In [62]:
grid_search.fit(x_train_vec, y_train)

Fitting 5 folds for each of 10 candidates, totalling 50 fits
[CV] END ..........................alpha=0.1, fit_prior=True; total time=   0.1s
[CV] END ..........................alpha=0.1, fit_prior=True; total time=   0.1s


[CV] END ..........................alpha=0.1, fit_prior=True; total time=   0.0s
[CV] END ..........................alpha=0.1, fit_prior=True; total time=   0.0s
[CV] END ..........................alpha=0.1, fit_prior=True; total time=   0.0s
[CV] END .........................alpha=0.1, fit_prior=False; total time=   0.0s
[CV] END .........................alpha=0.1, fit_prior=False; total time=   0.0s
[CV] END .........................alpha=0.1, fit_prior=False; total time=   0.0s
[CV] END .........................alpha=0.1, fit_prior=False; total time=   0.0s
[CV] END ..........................alpha=0.5, fit_prior=True; total time=   0.0s
[CV] END ..........................alpha=0.5, fit_prior=True; total time=   0.0s
[CV] END .........................alpha=0.1, fit_prior=False; total time=   0.1s
[CV] END ..........................alpha=0.5, fit_prior=True; total time=   0.0s
[CV] END .........................alpha=0.5, fit_prior=False; total time=   0.0s
[CV] END ...................

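**GridSearchCV parameters**:

| Parameter | Value |
|---|---|
| estimator | MultinomialNB() |
| param_grid | {'alpha': [0.1, 0.5, ...], 'fit_prior': [True, False]} |
| scoring | 'accuracy' |
| n_jobs | -1 |
| refit | True |
| cv | 5 |
| verbose | 2 |
| pre_dispatch | '2*n_jobs' |
| error_score | |
| return_train_score | False |

**MultinomialNB parameters (best estimator)**:

| Parameter | Value |
|---|---|
| alpha | 2.0 |
| force_alpha | True |
| fit_prior | False |
| class_prior | |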


Taking into account the results provided by GridSearch, we adjusted our parameters to: fit_prior = False and alpha = 2.0 in order to improve the model’s accuracy.
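The `alpha` parameter tuned here controls additive smoothing of the per-class word probabilities: each probability is estimated as (count + alpha) / (total count + alpha × number of words). The sketch below illustrates that formula on made-up counts; it is a simplified illustration under those assumptions, not scikit-learn's actual code, and `smoothed_likelihoods` is a hypothetical helper.

```python
def smoothed_likelihoods(word_counts, alpha):
    """Return P(word | class) with additive (Laplace/Lidstone) smoothing."""
    total = sum(word_counts.values())
    n_features = len(word_counts)
    return {
        word: (count + alpha) / (total + alpha * n_features)
        for word, count in word_counts.items()
    }


# Hypothetical counts of three words within one class.
counts_in_class = {"crash": 8, "love": 2, "update": 0}

for alpha in (0.1, 1.0, 2.0):
    probs = smoothed_likelihoods(counts_in_class, alpha)
    print(alpha, {word: round(p, 3) for word, p in probs.items()})
```

A larger `alpha` moves probability mass from frequent words towards rare or unseen ones, which acts as regularisation and is consistent with a lower train accuracy alongside a similar or better test accuracy.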

In [68]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score
from pickle import dump

accuracy_results = []

model = MultinomialNB(
    alpha=2.0,
    fit_prior=False,
)

model.fit(x_train_vec, y_train)
y_pred = model.predict(x_train_vec)

accuracy = accuracy_score(y_train, y_pred)
accuracy_results.append(accuracy_score(y_train, y_pred))

print("Train accuracy:", accuracy_results)

Train accuracy: [0.9480337078651685]


In [69]:
model = MultinomialNB(
    alpha=2.0,
    fit_prior=True,
)
model.fit(x_train_vec, y_train)
y_pred_test = model.predict(x_test_vec)

accuracy = accuracy_score(y_test, y_pred_test)
print("Test accuracy:", accuracy)

Test accuracy: 0.8324022346368715


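After using GridSearch to tune the parameters, the train accuracy dropped by about 1.3 percentage points (from 96.1% to 94.8%), while the test accuracy improved by about 1.7 percentage points (from 81.6% to 83.2%). Note that the two cells above are not fully consistent: the train accuracy was computed with `fit_prior=False`, whereas the test accuracy and the saved model use `fit_prior=True`. The best estimator returned by GridSearch (`alpha=2.0`, `fit_prior=False`) scored 82.1% on the test set.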

**SAVING THE MODEL**:

In [70]:
# Use a context manager so the file is closed after writing.
with open("/workspaces/E-Pablos-Naive-Bayes-Reviews/models/naive_bayes__GridSearch_a2_fp_true.sav", "wb") as f:
    dump(model, f)

-----

We use a **RandomizedSearch** to optimise the previous model, with MultinomialNB.

In [73]:
from sklearn.model_selection import RandomizedSearchCV


model = MultinomialNB(
    alpha=1.0,
    fit_prior=True,
)

param_distributions = {
    'alpha': [0.1, 0.5, 1.0, 2.0, 5.0],
    'fit_prior': [True, False]
}

random_search = RandomizedSearchCV(
    estimator=model,
    param_distributions=param_distributions,
    scoring='accuracy',     
    cv=5,                   
    n_jobs=-1,              
    verbose=2               
)

random_search.fit(x_train_vec, y_train)

print("Best hyperparams:")
print(random_search.best_params_)

print("Best accuracy found in random search:")
print(random_search.best_score_)

mejor_modelo = random_search.best_estimator_
y_pred = mejor_modelo.predict(x_test_vec)
print("Test accuracy:", accuracy_score(y_test, y_pred))

Fitting 5 folds for each of 10 candidates, totalling 50 fits
[CV] END ..........................alpha=0.1, fit_prior=True; total time=   0.0s
[CV] END ..........................alpha=0.1, fit_prior=True; total time=   0.0s
[CV] END ..........................alpha=0.1, fit_prior=True; total time=   0.0s
[CV] END ..........................alpha=0.1, fit_prior=True; total time=   0.0s
[CV] END ..........................alpha=0.1, fit_prior=True; total time=   0.0s
[CV] END .........................alpha=0.1, fit_prior=False; total time=   0.0s
[CV] END .........................alpha=0.1, fit_prior=False; total time=   0.0s
[CV] END .........................alpha=0.1, fit_prior=False; total time=   0.0s
[CV] END .........................alpha=0.1, fit_prior=False; total time=   0.0s
[CV] END ..........................alpha=0.5, fit_prior=True; total time=   0.0s
[CV] END .........................alpha=0.1, fit_prior=False; total time=   0.0s
[CV] END ..........................alpha=0.5, fi

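As a strategy to improve the model, we also applied Randomised Search. It did not improve on Grid Search: both the best cross-validated accuracy (`best_score_`) and the test accuracy were about 82%. Note that `best_score_` is the mean validation accuracy over the five folds rather than a train accuracy, so it should not be compared directly with the 94.8% train accuracy obtained after Grid Search. Because the search space contains only 10 combinations and the search evaluated 10 candidates, Randomised Search covered the same candidates as Grid Search here.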
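The log reports 10 candidates, and the search space defined in `param_distributions` contains 5 × 2 = 10 combinations. `RandomizedSearchCV` evaluates `n_iter` candidates (10 by default) and, when every parameter is given as a list, samples them without replacement. The sketch below mimics that sampling with the standard library to show that, in this setting, every combination ends up being drawn; the seed and variable names are arbitrary choices for the example.

```python
import random
from itertools import product

alphas = [0.1, 0.5, 1.0, 2.0, 5.0]
fit_priors = [True, False]

# The full search space: 5 x 2 = 10 combinations.
full_grid = list(product(alphas, fit_priors))

# Drawing 10 candidates without replacement from 10 combinations
# necessarily returns every combination, only in a different order.
rng = random.Random(0)  # arbitrary seed, for repeatability
sampled = rng.sample(full_grid, k=10)

print(sorted(sampled) == sorted(full_grid))
```

Randomised search becomes worthwhile when the space is much larger than `n_iter`, for example when `alpha` is drawn from a continuous distribution instead of a short list.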

## 6. Explore other alternatives

**Decision Tree**:

In [74]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

accuracy_results = []

model = DecisionTreeClassifier(
    max_depth=5,
    min_samples_split=10,
    min_samples_leaf=5,
    criterion="entropy",
    max_features=None,
    random_state=42
)

model.fit(x_train_vec, y_train)
y_pred = model.predict(x_train_vec)
accuracy = accuracy_score(y_train, y_pred)
accuracy_results.append(accuracy_score(y_train, y_pred))

print("Accuracy:", accuracy_results)

Accuracy: [0.7514044943820225]


We applied a decision tree model in an attempt to outperform the Naive Bayes model; however, its train accuracy was only 75.1%, almost 20 percentage points lower than that of the optimised Naive Bayes model (94.8%).