# Naive Bayes Algorithm

## Step 1: Load the data

In [333]:
# Load the data from a CSV file.
import pandas as pd

df = pd.read_csv("https://raw.githubusercontent.com/4GeeksAcademy/naive-bayes-project-tutorial/main/playstore_reviews.csv")

In [334]:
df.head()

index,package_name,review,polarity
0,com.facebook.katana,privacy at least put some option appear offli...,0
1,com.facebook.katana,"messenger issues ever since the last update, ...",0
2,com.facebook.katana,profile any time my wife or anybody has more ...,0
3,com.facebook.katana,the new features suck for those of us who don...,0
4,com.facebook.katana,forced reload on uploading pic on replying co...,0


By the definition of the exercise, only a target variable (polarity) and a predictor variable (review) are needed, so the "package_name" column is dropped:

In [335]:
# Drop the "package_name" column.
df.drop("package_name", axis=1, inplace=True)
df.head()

index,review,polarity
0,privacy at least put some option appear offli...,0
1,"messenger issues ever since the last update, ...",0
2,profile any time my wife or anybody has more ...,0
3,the new features suck for those of us who don...,0
4,forced reload on uploading pic on replying co...,0


## Step 2: Study of the variables and their content

Because the predictor variable is text, a conventional EDA is not needed here; text calls for a different treatment.

The review column contains plain text, so it needs some normalization first: leading and trailing whitespace is stripped and the text is converted to lowercase:

In [336]:
df["review"] = df["review"].str.strip().str.lower()
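The effect of this normalization can be checked on a few strings. The sketch below uses a hypothetical `sample` Series invented for illustration; it shows that `str.strip()` trims only the ends of each string, so runs of spaces inside a review are left as they are.

```python
import pandas as pd

# Hypothetical sample reviews, invented for illustration only.
sample = pd.Series(["  Great APP!  ", "Crashes   Often ", "ok"])

# strip() trims leading and trailing whitespace; lower() lowercases the text.
# Whitespace between words is not touched.
cleaned = sample.str.strip().str.lower()

print(cleaned.tolist())
```

Extra spaces between words do not matter for the later vectorization step, since the tokenizer splits on word boundaries.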

In [337]:
df

index,review,polarity
0,privacy at least put some option appear offlin...,0
1,"messenger issues ever since the last update, i...",0
2,profile any time my wife or anybody has more t...,0
3,the new features suck for those of us who don'...,0
4,forced reload on uploading pic on replying com...,0
...,...,...
886,loved it i loooooooooooooovvved it because it ...,1
887,all time legendary game the birthday party lev...,1
888,ads are way to heavy listen to the bad reviews...,0
889,fun works perfectly well. ads aren't as annoyi...,1


In [338]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   review    891 non-null    object
 1   polarity  891 non-null    int64 
dtypes: int64(1), object(1)
memory usage: 14.1+ KB


In [339]:
df.describe()

statistic,polarity
count,891.0
mean,0.344557
std,0.47549
min,0.0
25%,0.0
50%,0.0
75%,1.0
max,1.0
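Since polarity only takes the values 0 and 1, its mean is the share of positive reviews, so the classes are imbalanced (roughly one third positive). A minimal sketch of that check follows. The counts 307 and 584 are inferred from the summary above (891 rows, mean 0.344557), not printed by the notebook, and the Series is rebuilt by hand so the example runs on its own.

```python
import pandas as pd

# Counts inferred from the summary statistics: 891 rows with mean 0.344557
# imply 307 positive and 584 negative reviews (an inference, not notebook output).
polarity = pd.Series([0] * 584 + [1] * 307, name="polarity")

counts = polarity.value_counts()                # absolute class counts
shares = polarity.value_counts(normalize=True)  # class proportions

print(counts.to_dict())
print(shares.round(3).to_dict())
```

On the real data the same check would be `df["polarity"].value_counts(normalize=True)`.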


Next, the dataset is split into train and test sets:

In [340]:
from sklearn.model_selection import train_test_split
X = df.drop("polarity", axis=1)
y = df["polarity"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)
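With imbalanced classes, a plain random split can leave the test set with a different class ratio than the full data. One option, not used in this notebook, is the `stratify` argument of `train_test_split`. The sketch below uses a toy frame invented for illustration and is only meant to show how the argument behaves.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy, imbalanced data (30% positive), invented for illustration.
toy = pd.DataFrame({
    "review": [f"review {i}" for i in range(100)],
    "polarity": [1] * 30 + [0] * 70,
})

X_toy = toy[["review"]]
y_toy = toy["polarity"]

# stratify=y_toy keeps the class proportions the same in both partitions.
Xtr, Xte, ytr, yte = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

print(ytr.mean(), yte.mean())
```

Whether stratifying would change the scores reported in this notebook is not known; it only makes the evaluation set more representative of the class balance.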

A word-count matrix is used to turn the text into numeric features:

In [341]:
from sklearn.feature_extraction.text import CountVectorizer

vec_model = CountVectorizer(stop_words = "english")
X_train = vec_model.fit_transform(X_train['review']).toarray()
X_test = vec_model.transform(X_test['review']).toarray()

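## Step 3: Prediction with the BernoulliNB model

### Model training and choice of implementation (GaussianNB, MultinomialNB, or BernoulliNB)

### Bernoulli method

Because the target variable is dichotomous, BernoulliNB is selected as the starting point. Note that in scikit-learn the Bernoulli variant models binary features (whether each word is present) rather than a binary target; with the default binarize=0.0 it converts the word counts to 0/1 before fitting: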

In [342]:
from sklearn.naive_bayes import BernoulliNB

model = BernoulliNB()
model.fit(X_train, y_train)

parameter,value
alpha,1.0
force_alpha,True
binarize,0.0
fit_prior,True
class_prior,None
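The `binarize` value of 0.0 listed above means that any count greater than zero is treated as 1 before the model is fitted or used for prediction. A minimal sketch, on a toy count matrix invented for illustration, comparing a model fitted on raw counts with one fitted on counts that were collapsed to presence/absence by hand:

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Toy word-count matrix (rows: documents, columns: words), invented for illustration.
X_counts = np.array([[3, 0, 1], [0, 2, 0], [1, 0, 4], [0, 1, 0]])
y_toy = np.array([1, 0, 1, 0])

# The same data with counts collapsed to presence/absence.
X_binary = (X_counts > 0).astype(int)

nb_counts = BernoulliNB().fit(X_counts, y_toy)  # default binarize=0.0
nb_binary = BernoulliNB().fit(X_binary, y_toy)

# Both models should produce the same class probabilities.
same = np.allclose(
    nb_counts.predict_proba(X_counts), nb_binary.predict_proba(X_binary)
)
print(same)
```

In other words, BernoulliNB discards how often a word occurs and keeps only whether it occurs.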


With the model fitted, predictions are generated for the test set:

In [343]:
y_pred = model.predict(X_test)
y_pred

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0,
       1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0,
       0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0,
       1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
       0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0,
       0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0,
       0, 0, 0])

In [363]:
from sklearn.metrics import accuracy_score, recall_score, f1_score

print(f'Accuracy: {accuracy_score(y_test, y_pred)}')
print(f'Recall: {recall_score(y_test, y_pred)}')
print(f'F1-score: {f1_score(y_test, y_pred)}')

Accuracy: 0.770949720670391
Recall: 0.39622641509433965
F1-score: 0.5060240963855421


In [345]:
from sklearn.metrics import classification_report

print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.79      0.93      0.85       126
           1       0.70      0.40      0.51        53

    accuracy                           0.77       179
   macro avg       0.74      0.66      0.68       179
weighted avg       0.76      0.77      0.75       179
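The BernoulliNB scores can be reproduced by hand from a confusion matrix. The counts below are inferred from the reported figures (179 test rows, 53 positives, accuracy 0.7709, recall 0.3962); the notebook does not print them, so treat them as a reconstruction.

```python
# Confusion-matrix counts for class 1, inferred from the reported metrics
# (a reconstruction, not an output of the notebook).
tp, fp, fn, tn = 21, 9, 32, 117

accuracy = (tp + tn) / (tp + fp + fn + tn)  # share of all correct predictions
precision = tp / (tp + fp)                  # predicted positives that are right
recall = tp / (tp + fn)                     # true positives that were found
f1 = 2 * precision * recall / (precision + recall)

print(f"Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1-score: {f1:.4f}")
```

The low recall comes from the 32 positive reviews that were predicted as negative, out of 53 positives in the test set.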

To check whether the right model was selected, GaussianNB and MultinomialNB are tried as well:

### Gaussian method

In [354]:
from sklearn.naive_bayes import GaussianNB

model_g = GaussianNB()
model_g.fit(X_train, y_train)

parameter,value
priors,None
var_smoothing,1e-09
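GaussianNB is the reason the count matrices were converted with `.toarray()` earlier: it expects dense input, while `CountVectorizer` returns a sparse matrix. The sketch below uses a toy matrix invented for illustration and assumes a recent scikit-learn, where passing sparse data to `GaussianNB.fit` raises a `TypeError`.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.naive_bayes import GaussianNB, MultinomialNB

# Toy count matrix, invented for illustration.
X_dense = np.array([[2, 0], [0, 1], [3, 0], [0, 2]])
X_sparse = csr_matrix(X_dense)
y_toy = np.array([1, 0, 1, 0])

# MultinomialNB accepts sparse input directly.
MultinomialNB().fit(X_sparse, y_toy)

# GaussianNB rejects sparse input (assumed to raise TypeError).
try:
    GaussianNB().fit(X_sparse, y_toy)
    sparse_rejected = False
except TypeError:
    sparse_rejected = True

# Converting to a dense array makes the fit succeed.
model_dense = GaussianNB().fit(X_sparse.toarray(), y_toy)
print(sparse_rejected)
```

Dense arrays use far more memory than sparse ones, which is manageable for 891 reviews but may not be for a large corpus.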


In [355]:
y_pred_g = model_g.predict(X_test)
y_pred_g

array([0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0,
       1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0,
       0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,
       0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1,
       1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0,
       0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0,
       1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1,
       1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1,
       0, 0, 0])

In [364]:
from sklearn.metrics import accuracy_score, recall_score, f1_score

print(f'Accuracy: {accuracy_score(y_test, y_pred_g)}')
print(f'Recall: {recall_score(y_test, y_pred_g)}')
print(f'F1-score: {f1_score(y_test, y_pred_g)}')

Accuracy: 0.8044692737430168
Recall: 0.6226415094339622
F1-score: 0.6534653465346535


In [362]:
print(classification_report(y_test, y_pred_g))

              precision    recall  f1-score   support

           0       0.85      0.88      0.86       126
           1       0.69      0.62      0.65        53

    accuracy                           0.80       179
   macro avg       0.77      0.75      0.76       179
weighted avg       0.80      0.80      0.80       179

### MultinomialNB method

In [357]:
from sklearn.naive_bayes import MultinomialNB

model_m = MultinomialNB()
model_m.fit(X_train, y_train)

parameter,value
alpha,1.0
force_alpha,True
fit_prior,True
class_prior,None
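The `alpha` value of 1.0 listed above is the additive (Laplace) smoothing parameter. As documented for scikit-learn's MultinomialNB, the probability of word $i$ given class $y$ is estimated from smoothed counts, so a word that never appears in a class during training does not force the class probability to zero. The symbols below follow the scikit-learn user guide and do not appear elsewhere in this notebook.

```latex
\hat{\theta}_{yi} = \frac{N_{yi} + \alpha}{N_{y} + \alpha n}
% N_{yi}: total count of word i in training documents of class y
% N_{y} : total count of all words in class y, N_y = \sum_{i=1}^{n} N_{yi}
% n     : vocabulary size (number of columns in the count matrix)
% alpha : smoothing parameter (1.0 here, i.e. Laplace smoothing)
```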


In [358]:
y_pred_m = model_m.predict(X_test)
y_pred_m

array([0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0,
       1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0,
       0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,
       1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0,
       1, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0,
       0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0,
       0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1,
       0, 0, 0])

In [365]:
from sklearn.metrics import accuracy_score, recall_score, f1_score

print(f'Accuracy: {accuracy_score(y_test, y_pred_m)}')
print(f'Recall: {recall_score(y_test, y_pred_m)}')
print(f'F1-score: {f1_score(y_test, y_pred_m)}')

Accuracy: 0.8156424581005587
Recall: 0.6037735849056604
F1-score: 0.6597938144329897


In [361]:
print(classification_report(y_test, y_pred_m))

              precision    recall  f1-score   support

           0       0.84      0.90      0.87       126
           1       0.73      0.60      0.66        53

    accuracy                           0.82       179
   macro avg       0.79      0.75      0.77       179
weighted avg       0.81      0.82      0.81       179

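## Interpretation

Although BernoulliNB was the initial choice, mainly because it treats the vectorized text as arrays of 0s and 1s, MultinomialNB gave the best overall results: accuracy 0.82 and F1-score 0.66, compared with 0.80 and 0.65 for GaussianNB and 0.77 and 0.51 for BernoulliNB. GaussianNB reached a slightly higher recall on the positive class (0.62 versus 0.60). A likely reason is that MultinomialNB models word counts directly, which suits text represented as a count matrix.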
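The three-way comparison can also be written as a single loop that collects the metrics in one table. This is a sketch on a tiny toy corpus invented for illustration, so the scores it prints say nothing about the Play Store dataset; the variable names are hypothetical and the pattern would apply unchanged to `X_train`, `X_test`, `y_train`, and `y_test`.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score, f1_score, recall_score
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

# Tiny toy corpus, invented for illustration only.
train_docs = ["love this app", "great game love it",
              "crashes all the time", "too many ads terrible"]
train_y = [1, 1, 0, 0]
test_docs = ["love this game", "terrible ads"]
test_y = [1, 0]

vec = CountVectorizer(stop_words="english")
Xtr = vec.fit_transform(train_docs).toarray()  # dense, so GaussianNB accepts it
Xte = vec.transform(test_docs).toarray()

models = {
    "BernoulliNB": BernoulliNB(),
    "GaussianNB": GaussianNB(),
    "MultinomialNB": MultinomialNB(),
}

rows = []
for name, model in models.items():
    pred = model.fit(Xtr, train_y).predict(Xte)
    rows.append({
        "model": name,
        "accuracy": accuracy_score(test_y, pred),
        "recall": recall_score(test_y, pred, zero_division=0),
        "f1": f1_score(test_y, pred, zero_division=0),
    })

results = pd.DataFrame(rows).set_index("model")
print(results)
```

Keeping the scores side by side makes it easier to see trade-offs such as one model winning on accuracy while another wins on recall.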