# Naive Bayes Project

Naive Bayes models are well suited to sentiment analysis, topic classification of texts, and recommendation tasks, because these problems tend to fit the model's theoretical and methodological assumptions.
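
For reference, the main assumption can be written down compactly. The notation here is my own choice, not something taken from the dataset: $c$ is a class (here, polarity 0 or 1) and $x_1, \dots, x_n$ are the features of one review (for example, word counts). The model treats the features as conditionally independent given the class and picks the most probable class:

```latex
\hat{c} = \arg\max_{c} \; P(c) \prod_{i=1}^{n} P(x_i \mid c)
```

The independence assumption rarely holds exactly for text, but the rule is cheap to estimate and often works well for bag-of-words features.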

In this project I use a practice dataset to build a review classifier for the Google Play Store.

In [45]:
import pandas as pd 
import optuna
import numpy as np

In [6]:
# import dataset

raw_data = pd.read_csv("https://raw.githubusercontent.com/4GeeksAcademy/naive-bayes-project-tutorial/main/playstore_reviews.csv")
raw_data

Unnamed: 0,package_name,review,polarity
0,com.facebook.katana,privacy at least put some option appear offli...,0
1,com.facebook.katana,"messenger issues ever since the last update, ...",0
2,com.facebook.katana,profile any time my wife or anybody has more ...,0
3,com.facebook.katana,the new features suck for those of us who don...,0
4,com.facebook.katana,forced reload on uploading pic on replying co...,0
...,...,...,...
886,com.rovio.angrybirds,loved it i loooooooooooooovvved it because it...,1
887,com.rovio.angrybirds,all time legendary game the birthday party le...,1
888,com.rovio.angrybirds,ads are way to heavy listen to the bad review...,0
889,com.rovio.angrybirds,fun works perfectly well. ads aren't as annoy...,1


In [51]:
raw_data.to_csv('/workspaces/Naive-Bayes-Project/data/raw/raw_data.csv', index=False)

In [8]:
# "package_name" isn't necessary in this case, so I drop it

total_data = raw_data[['review', 'polarity']].copy()
total_data

Unnamed: 0,review,polarity
0,privacy at least put some option appear offli...,0
1,"messenger issues ever since the last update, ...",0
2,profile any time my wife or anybody has more ...,0
3,the new features suck for those of us who don...,0
4,forced reload on uploading pic on replying co...,0
...,...,...
886,loved it i loooooooooooooovvved it because it...,1
887,all time legendary game the birthday party le...,1
888,ads are way to heavy listen to the bad review...,0
889,fun works perfectly well. ads aren't as annoy...,1


In [18]:
# Process the text in the "review" feature: strip leading and trailing whitespace and convert to lowercase

total_data["review"] = total_data["review"].str.strip().str.lower()
total_data

Unnamed: 0,review,polarity
0,privacy at least put some option appear offlin...,0
1,"messenger issues ever since the last update, i...",0
2,profile any time my wife or anybody has more t...,0
3,the new features suck for those of us who don'...,0
4,forced reload on uploading pic on replying com...,0
...,...,...
886,loved it i loooooooooooooovvved it because it ...,1
887,all time legendary game the birthday party lev...,1
888,ads are way to heavy listen to the bad reviews...,0
889,fun works perfectly well. ads aren't as annoyi...,1


In [52]:
total_data.to_csv('/workspaces/Naive-Bayes-Project/data/processed/total_data.csv', index=False)

In [20]:
from sklearn.model_selection import train_test_split

# Divide the dataset into train and test

X = total_data["review"]
y = total_data["polarity"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)

X_train.head()

331    just did the latest update on viber and yet ag...
733    keeps crashing it only works well in extreme d...
382    the fail boat has arrived the 6.0 version is t...
704    superfast, just as i remember it ! opera mini ...
813    installed and immediately deleted this crap i ...
Name: review, dtype: object
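
The split above is purely random. If the classes are imbalanced, one option (not used in this notebook) is to pass `stratify` to `train_test_split` so both subsets keep the same class proportions. A minimal sketch with hypothetical labels, not the Play Store data:

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 70 negatives and 30 positives
labels_demo = [0] * 70 + [1] * 30
items_demo = list(range(100))

# stratify keeps the 70/30 ratio in both the train and the test subset
_, _, y_tr_demo, y_te_demo = train_test_split(
    items_demo, labels_demo, test_size=0.2, random_state=42, stratify=labels_demo
)

print(Counter(y_tr_demo))
print(Counter(y_te_demo))
```

In the notebook this would correspond to adding `stratify = y` to the existing call, which would also change the reported accuracies.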

In [22]:
# Transform the text into a word count matrix. This is a way to obtain numerical features from the text

from sklearn.feature_extraction.text import CountVectorizer


vec_model = CountVectorizer(stop_words = "english")
X_train = vec_model.fit_transform(X_train).toarray()
X_test = vec_model.transform(X_test).toarray()

In [26]:
# The predictors are now ready to train the model.
# I select MultinomialNB because the predictors are discrete word counts (the target is binary).

from sklearn.naive_bayes import MultinomialNB

model_multinomial = MultinomialNB()
model_multinomial.fit(X_train, y_train)

In [31]:
y_pred_multinomial = model_multinomial.predict(X_test)
y_pred_multinomial

array([0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0,
       1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0,
       0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,
       1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0,
       1, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0,
       0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0,
       0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1,
       0, 0, 0])

In [32]:
from sklearn.metrics import accuracy_score

accuracy_score(y_test, y_pred_multinomial)

0.8156424581005587
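
Accuracy is a single number, and the predictions above are mostly 0, so it can help to look at each class separately. A minimal sketch with hypothetical labels; in the notebook the inputs would be `y_test` and `y_pred_multinomial`:

```python
from sklearn.metrics import classification_report, confusion_matrix

# Hypothetical labels for illustration only
y_true_demo = [0, 0, 0, 0, 1, 1, 1, 0]
y_pred_demo = [0, 0, 0, 1, 1, 0, 1, 0]

# Rows are the true classes, columns are the predicted classes
cm_demo = confusion_matrix(y_true_demo, y_pred_demo)
print(cm_demo)

# Precision, recall and F1 per class
print(classification_report(y_true_demo, y_pred_demo))
```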

Let's try the other Naive Bayes models.

In [30]:
from sklearn.naive_bayes import GaussianNB
model_gaussian = GaussianNB()
model_gaussian.fit(X_train, y_train)

In [35]:
y_pred_gaussian = model_gaussian.predict(X_test)
y_pred_gaussian

array([0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0,
       1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0,
       0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,
       0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1,
       1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0,
       0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0,
       1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1,
       1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1,
       0, 0, 0])

In [36]:
accuracy_score(y_test, y_pred_gaussian)

0.8044692737430168

In [37]:
from sklearn.naive_bayes import BernoulliNB

model_bernoulli = BernoulliNB()
model_bernoulli.fit(X_train, y_train)

In [38]:
y_pred_bernoulli = model_bernoulli.predict(X_test)
y_pred_bernoulli

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0,
       1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0,
       0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0,
       1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
       0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0,
       0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0,
       0, 0, 0])

In [39]:
accuracy_score(y_test, y_pred_bernoulli)

0.770949720670391
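
One possible reason BernoulliNB behaves differently here is that, with its default `binarize=0.0`, it only looks at whether a word is present, not how often it appears. The sketch below uses a hypothetical count matrix to show that fitting on counts and fitting on presence/absence flags gives the same model:

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Hypothetical word counts for four documents and three words
X_counts_demo = np.array([[3, 0, 1], [0, 2, 0], [1, 0, 0], [0, 1, 4]])
y_demo = np.array([1, 0, 1, 0])

# Same data reduced to presence (1) or absence (0) of each word
X_binary_demo = (X_counts_demo > 0).astype(int)

nb_counts = BernoulliNB().fit(X_counts_demo, y_demo)
nb_binary = BernoulliNB().fit(X_binary_demo, y_demo)

# The learned per-class feature log-probabilities are identical
same_model = np.allclose(nb_counts.feature_log_prob_, nb_binary.feature_log_prob_)
print(same_model)
```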

Best model so far: MultinomialNB, with an accuracy of 0.8156424581005587.

In [47]:
# let's optimize the model with optuna

def objective(trial):
    param = {
        "alpha": trial.suggest_float("alpha", 0.01, 10.0),
        "fit_prior": trial.suggest_categorical("fit_prior", [True, False])
    }
    
    # Create and train the model
    model = MultinomialNB(**param)
    model.fit(X_train, y_train)
    
    # Predict on the test set and compute the accuracy
    preds = model.predict(X_test)
    accuracy = accuracy_score(y_test, preds)
    
    return accuracy
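
The `alpha` hyperparameter searched above is the additive smoothing term of MultinomialNB: each word probability within a class is estimated as (count + alpha) / (total + alpha * vocabulary size). The helper below is a hypothetical name written for illustration, using invented counts, to show how a larger `alpha` flattens the estimates:

```python
import numpy as np

def smoothed_word_probs(word_counts, alpha):
    """Return (count + alpha) / (total + alpha * vocab_size) for one class."""
    word_counts = np.asarray(word_counts, dtype=float)
    vocab_size = word_counts.shape[0]
    return (word_counts + alpha) / (word_counts.sum() + alpha * vocab_size)

# Hypothetical counts of three words within one class
counts_demo = [8, 2, 0]

for alpha in (0.01, 1.0, 10.0):
    print(alpha, smoothed_word_probs(counts_demo, alpha).round(3))
```

With a small `alpha`, a word never seen in the class gets a probability close to zero; with a large `alpha`, all words move toward equal probability.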

In [48]:
study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=200)

print('Number of trials: ', len(study.trials))
print('Best parameters: ', study.best_params)
print('Best accuracy: ', study.best_value)

[I 2024-05-16 17:36:07,269] A new study created in memory with name: no-name-09ee533c-85ae-47fe-b94b-4adf7a113590
[I 2024-05-16 17:36:07,286] Trial 0 finished with value: 0.8324022346368715 and parameters: {'alpha': 4.013446233504225, 'fit_prior': True}. Best is trial 0 with value: 0.8324022346368715.
[I 2024-05-16 17:36:07,312] Trial 1 finished with value: 0.8268156424581006 and parameters: {'alpha': 3.3768148497199086, 'fit_prior': True}. Best is trial 0 with value: 0.8324022346368715.
[I 2024-05-16 17:36:07,349] Trial 2 finished with value: 0.8212290502793296 and parameters: {'alpha': 1.159058032735426, 'fit_prior': True}. Best is trial 0 with value: 0.8324022346368715.
[I 2024-05-16 17:36:07,370] Trial 3 finished with value: 0.8268156424581006 and parameters: {'alpha': 9.410376988184623, 'fit_prior': False}. Best is trial 0 with value: 0.8324022346368715.
[I 2024-05-16 17:36:07,396] Trial 4 finished with value: 0.8044692737430168 and parameters: {'alpha': 0.7348990116154732, 'fit_p

Number of trials:  200
Best parameters:  {'alpha': 5.618086997574443, 'fit_prior': False}
Best accuracy:  0.8491620111731844

In [49]:
model = MultinomialNB(alpha = 5.618086997574443, fit_prior = False)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
accuracy_score(y_test, y_pred)

0.8491620111731844
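
One caveat on this figure: the Optuna objective scores every trial on `X_test`, so the best accuracy was selected on the same data used to report it and may be optimistic. A common alternative is to score each candidate with cross-validation on the training data only. The sketch below assumes a hypothetical helper `cv_accuracy` and uses synthetic counts in place of the vectorized reviews:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB

def cv_accuracy(alpha, fit_prior, X, y, folds=5):
    """Mean cross-validated accuracy of MultinomialNB on the given data."""
    candidate = MultinomialNB(alpha=alpha, fit_prior=fit_prior)
    return cross_val_score(candidate, X, y, cv=folds, scoring="accuracy").mean()

# Synthetic count matrix standing in for the vectorized training reviews
rng = np.random.default_rng(42)
y_synth = np.array([0, 1] * 50)
X_synth = rng.poisson(lam=1.0, size=(100, 20))
X_synth[y_synth == 1, :5] += 2  # first five "words" are more frequent in class 1

cv_score = cv_accuracy(alpha=1.0, fit_prior=True, X=X_synth, y=y_synth)
print(round(cv_score, 3))
```

Inside the Optuna objective, the return value would then be `cv_accuracy(param["alpha"], param["fit_prior"], X_train, y_train)`, leaving `X_test` for a single final evaluation.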

In [50]:
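# Save the model

from pickle import dump

# The context manager closes the file once the model has been written
with open("/workspaces/Naive-Bayes-Project/models/MultinomialNB_alpha_5.618086997574443_fit_prior_False.sav", "wb") as model_file:
    dump(model, model_file)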
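
The saved model expects input with the same columns that `vec_model` produced, so the fitted vectorizer is also needed at prediction time. One way to keep both together, sketched here on a hypothetical mini dataset rather than the Play Store reviews, is to wrap them in a scikit-learn pipeline and pickle that single object:

```python
import pickle
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical mini training set for illustration
texts_demo = ["love this app", "fun and fast", "keeps crashing", "terrible ads"]
labels_demo = [1, 1, 0, 0]

pipeline_demo = make_pipeline(CountVectorizer(stop_words="english"), MultinomialNB())
pipeline_demo.fit(texts_demo, labels_demo)

# Round-trip through pickle in memory; writing to a file works the same way
restored_demo = pickle.loads(pickle.dumps(pipeline_demo))

# The restored object accepts raw text directly
prediction_demo = restored_demo.predict(["fun app"])
print(prediction_demo)
```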