# Explore here

### Step 1: Load the dataset

In [1]:
import pandas as pd

df = pd.read_csv('https://raw.githubusercontent.com/4GeeksAcademy/naive-bayes-project-tutorial/main/playstore_reviews.csv')
df

Unnamed: 0,package_name,review,polarity
0,com.facebook.katana,privacy at least put some option appear offli...,0
1,com.facebook.katana,"messenger issues ever since the last update, ...",0
2,com.facebook.katana,profile any time my wife or anybody has more ...,0
3,com.facebook.katana,the new features suck for those of us who don...,0
4,com.facebook.katana,forced reload on uploading pic on replying co...,0
...,...,...,...
886,com.rovio.angrybirds,loved it i loooooooooooooovvved it because it...,1
887,com.rovio.angrybirds,all time legendary game the birthday party le...,1
888,com.rovio.angrybirds,ads are way to heavy listen to the bad review...,0
889,com.rovio.angrybirds,fun works perfectly well. ads aren't as annoy...,1


### Step 2: Study the variables and their content

In [2]:
df = df.drop('package_name', axis = 1)
df.head()

Unnamed: 0,review,polarity
0,privacy at least put some option appear offli...,0
1,"messenger issues ever since the last update, ...",0
2,profile any time my wife or anybody has more ...,0
3,the new features suck for those of us who don...,0
4,forced reload on uploading pic on replying co...,0


We drop the 'package_name' column because we are only interested in the review text, not in the name of the mobile app.

In [3]:
df['review'] = df['review'].str.strip().str.lower()
df.head()

Unnamed: 0,review,polarity
0,privacy at least put some option appear offlin...,0
1,"messenger issues ever since the last update, i...",0
2,profile any time my wife or anybody has more t...,0
3,the new features suck for those of us who don'...,0
4,forced reload on uploading pic on replying com...,0


We remove all leading and trailing whitespace with "strip" and convert all the text to lowercase with "lower".

In [4]:
from sklearn.model_selection import train_test_split
X = df['review']
y = df['polarity']

X_train,X_test,y_train,y_test = train_test_split(X,y, test_size= 0.2, random_state= 42)
X_train.info()

<class 'pandas.Series'>
Index: 712 entries, 331 to 102
Series name: review
Non-Null Count  Dtype
--------------  -----
712 non-null    str  
dtypes: str(1)
memory usage: 11.1 KB
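As a quick check on the 712 entries reported above: with a float test_size, scikit-learn rounds the test set size up and gives the remaining rows to the training set. The total of 891 rows is an assumption inferred from the 712 training rows and the 179 predictions shown later in the notebook; it is not printed anywhere.

```python
import math

n_samples = 891      # assumed total number of reviews (712 train + 179 test)
test_size = 0.2

n_test = math.ceil(test_size * n_samples)   # test size is rounded up
n_train = n_samples - n_test                # the remaining rows go to training

print(n_train, n_test)
```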


In [5]:
from sklearn.feature_extraction.text import CountVectorizer

vec_model = CountVectorizer(stop_words = "english")
X_train = vec_model.fit_transform(X_train).toarray()
X_test = vec_model.transform(X_test).toarray()

X_train

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], shape=(712, 3310))

We transform the text into a matrix of word counts.
Passing stop_words = "english" to CountVectorizer filters out common words that add no real meaning to the analysis.
We fit the vectorizer on the training set only, apply the same vocabulary to the test set, and use .toarray() to convert the sparse result into a regular dense numeric array.
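To make the counting step concrete, here is a minimal pure-Python sketch of a bag-of-words transform. It is not how CountVectorizer is implemented: the tiny stop-word set, the helper functions tokenize, fit_vocabulary and transform, and the example reviews are all hypothetical. The point it illustrates is that the vocabulary is learned from the training texts only, so words that appear only in the test texts are ignored.

```python
import re
from collections import Counter

# Hypothetical, tiny stop-word list; scikit-learn's "english" list is much longer.
STOP_WORDS = {"the", "is", "a", "and", "it", "this", "are"}

def tokenize(text):
    """Lowercase the text and keep word tokens of 2+ characters that are not stop words."""
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

def fit_vocabulary(texts):
    """Map each distinct training token to a column index, in alphabetical order."""
    vocab = sorted({tok for text in texts for tok in tokenize(text)})
    return {tok: i for i, tok in enumerate(vocab)}

def transform(texts, vocabulary):
    """Build one row of counts per text; tokens outside the vocabulary are skipped."""
    rows = []
    for text in texts:
        counts = Counter(tokenize(text))
        rows.append([counts.get(tok, 0) for tok in vocabulary])
    return rows

train_texts = ["Great game, great fun", "The ads are annoying"]
test_texts = ["Annoying ads and terrible lag"]

vocabulary = fit_vocabulary(train_texts)          # learned from training texts only
train_counts = transform(train_texts, vocabulary)
test_counts = transform(test_texts, vocabulary)   # "terrible" and "lag" are dropped

print(vocabulary)
print(train_counts)
print(test_counts)
```

Every row has one column per vocabulary word, which is why the real training matrix above has shape (712, 3310): 712 reviews by 3,310 distinct words.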


### Step 3: Build a naive Bayes model

We choose MultinomialNB because we want to model how often each word appears, and because the features are integer counts: they are neither decimal (continuous) values nor binary indicators.
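The reasoning above can be illustrated with a from-scratch sketch of multinomial naive Bayes on integer counts. This is a simplified illustration with additive smoothing (alpha = 1), not scikit-learn's implementation; the toy count matrix, the labels, and the function names are made up for the example.

```python
import math

def fit_multinomial_nb(X, y, alpha=1.0):
    """Estimate log priors and smoothed log word probabilities per class."""
    classes = sorted(set(y))
    n_features = len(X[0])
    log_prior, log_likelihood = {}, {}
    for c in classes:
        rows = [x for x, label in zip(X, y) if label == c]
        log_prior[c] = math.log(len(rows) / len(X))
        # Total count of each word across the documents of class c, plus smoothing.
        word_totals = [sum(r[j] for r in rows) + alpha for j in range(n_features)]
        denom = sum(word_totals)
        log_likelihood[c] = [math.log(t / denom) for t in word_totals]
    return classes, log_prior, log_likelihood

def predict(x, classes, log_prior, log_likelihood):
    """Pick the class with the highest log posterior (up to a constant)."""
    scores = {
        c: log_prior[c] + sum(n * lp for n, lp in zip(x, log_likelihood[c]))
        for c in classes
    }
    return max(scores, key=scores.get)

# Toy counts for a hypothetical vocabulary ["bad", "crash", "great", "love"].
X = [[2, 1, 0, 0], [1, 2, 0, 0], [0, 0, 2, 1], [0, 0, 1, 2]]
y = [0, 0, 1, 1]

params = fit_multinomial_nb(X, y)
print(predict([1, 1, 0, 0], *params))  # counts dominated by "bad" and "crash"
print(predict([0, 0, 3, 0], *params))  # counts dominated by "great"
```

Each word count multiplies that word's log probability, so a word that appears three times weighs three times as much as one that appears once. That is the sense in which this variant uses frequencies rather than mere presence.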

In [6]:
from sklearn.naive_bayes import GaussianNB, BernoulliNB, MultinomialNB

model = MultinomialNB()
model.fit(X_train,y_train)

Parameter,Value
alpha,1.0
force_alpha,True
fit_prior,True
class_prior,None


In [7]:
y_pred = model.predict(X_test)
y_pred

array([0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0,
       1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0,
       0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,
       1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0,
       1, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0,
       0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0,
       0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1,
       0, 0, 0])

In [8]:
from sklearn.metrics import accuracy_score

accuracy_score(y_test,y_pred)

0.8156424581005587
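For reference, accuracy is simply the fraction of predictions that match the true labels. A minimal sketch with made-up labels (not the notebook's data):

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    matches = sum(t == p for t, p in zip(y_true, y_pred))
    return matches / len(y_true)

y_true_demo = [0, 0, 1, 1, 0]   # hypothetical true labels
y_pred_demo = [0, 1, 1, 1, 0]   # hypothetical predictions

print(accuracy(y_true_demo, y_pred_demo))  # 4 of 5 predictions are correct
```

On the 179 test reviews predicted above, a score of 0.8156 corresponds to 146 correct predictions.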

We compute the accuracy of the other two naive Bayes variants using a loop.

In [9]:
# Fit each alternative on the same features and score it on the test set
for model_auxiliar in [GaussianNB(),BernoulliNB()]:
    model_auxiliar.fit(X_train,y_train)
    y_pred_auxiliar = model_auxiliar.predict(X_test)
    print(f'{model_auxiliar} accuracy: {accuracy_score(y_test,y_pred_auxiliar)}')


GaussianNB() accuracy: 0.8044692737430168
BernoulliNB() accuracy: 0.770949720670391


In this case MultinomialNB is the best of the three models, with a test accuracy of about 81.6%, compared with 80.4% for GaussianNB and 77.1% for BernoulliNB.

### Step 4: Optimize the previous model

In [16]:
from sklearn.model_selection import GridSearchCV

nb_model = MultinomialNB()
param_grid = {
    'alpha': [0.01,1,10.0,100],
    'fit_prior': [True,False]
}

grid_nb = GridSearchCV(nb_model, param_grid, cv=5, scoring='accuracy')
grid_nb.fit(X_train, y_train)

print(f"Best score (accuracy): {grid_nb.best_score_}")
print(f"Best parameters: {grid_nb.best_params_}")

Best score (accuracy): 0.8159755737220525
Best parameters: {'alpha': 0.01, 'fit_prior': False}
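To see how much work this search does, the sketch below enumerates the same grid by hand: 4 values of alpha times 2 values of fit_prior give 8 candidates, and with cv=5 each one is fitted 5 times, for 40 fits during the search. It only counts combinations; it does not reproduce the cross-validation itself.

```python
from itertools import product

param_grid = {
    "alpha": [0.01, 1, 10.0, 100],
    "fit_prior": [True, False],
}
cv_folds = 5

# Cartesian product of all parameter values, one dict per candidate.
names = list(param_grid)
candidates = [dict(zip(names, values)) for values in product(*param_grid.values())]

print(len(candidates))             # number of parameter combinations
print(len(candidates) * cv_folds)  # number of model fits during the search
print(candidates[0])               # first candidate that gets evaluated
```

Note that the best score reported by the search is the mean cross-validated accuracy on the training data, not the test accuracy. With the default refit=True, GridSearchCV also refits the best candidate on the full training set and exposes it as grid_nb.best_estimator_.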


Now that we have found the best hyperparameters, we retrain the model with them.

In [17]:
model = MultinomialNB(alpha = 0.01, fit_prior = False)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
accuracy_score(y_test, y_pred)

0.8212290502793296

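The test accuracy rises from 0.8156 to 0.8212, an improvement of about 0.6 percentage points, which on the 179-review test set amounts to one additional correct prediction.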

### Step 5: Save the model

In [19]:
from pickle import dump

# Use a context manager so the file is flushed and closed after writing
with open("../models/naive_bayes_alpha_0-01_fit_prior_False_42.sav", "wb") as f:
    dump(model, f)
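The saved file contains only the classifier, so whoever loads it also needs the fitted CountVectorizer to turn new text into the same columns. One possible approach, sketched here on toy data rather than the notebook's dataset, is to bundle both steps in a scikit-learn pipeline and pickle that single object. The example texts, labels, and the temporary file name are placeholders.

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy stand-in data (not the Play Store reviews).
texts = ["love this app", "great game love it",
         "crashes all the time", "terrible ads and crashes"]
labels = [1, 1, 0, 0]

# Vectorizer and classifier travel together as one object.
pipeline = make_pipeline(
    CountVectorizer(stop_words="english"),
    MultinomialNB(alpha=0.01, fit_prior=False),
)
pipeline.fit(texts, labels)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "nb_pipeline_demo.sav")  # placeholder file name
    with open(path, "wb") as f:
        pickle.dump(pipeline, f)
    with open(path, "rb") as f:
        restored = pickle.load(f)

# The restored pipeline accepts raw text directly.
new_reviews = ["love this game", "ads and crashes"]
print(restored.predict(new_reviews))
```

As with any pickle file, it should only be loaded from a trusted source and with a compatible scikit-learn version.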

### Step 6: Explore other alternatives

Logistic regression

In [20]:
from sklearn.linear_model import LogisticRegression

lr_model = LogisticRegression()
lr_model.fit(X_train, y_train)
print(f"Accuracy Logistic Regression: {lr_model.score(X_test, y_test)}")

Accuracy Logistic Regression: 0.8324022346368715


We choose logistic regression as the alternative because, unlike naive Bayes (which treats each word as independent of the others given the class), logistic regression fits the weight of each word jointly with all the others. This tends to make it more robust on real language, where words are correlated and the meaning of a review depends on how its words combine rather than on each one appearing in isolation, although it is still a linear model over word counts. Here it reaches a test accuracy of 0.8324, compared with 0.8212 for the tuned MultinomialNB.
In addition, like naive Bayes, logistic regression is used for binary classification and, less commonly, for multiclass classification.
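As a rough illustration of the "weights" idea, a fitted binary logistic regression scores a review by taking a weighted sum of its word counts plus an intercept and passing the result through the sigmoid function. The vocabulary, weights, and intercept below are invented for the example; they are not taken from lr_model.

```python
import math

def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba_positive(counts, weights, intercept):
    """Probability of the positive class for one count vector."""
    z = intercept + sum(w * n for w, n in zip(weights, counts))
    return sigmoid(z)

# Hypothetical vocabulary and learned parameters.
vocabulary = ["ads", "crash", "great", "love"]
weights = [-0.8, -1.5, 1.2, 1.6]
intercept = -0.1

positive_review = [0, 0, 1, 1]   # "great" and "love" appear once each
negative_review = [2, 1, 0, 0]   # "ads" twice, "crash" once

for counts in (positive_review, negative_review):
    p = predict_proba_positive(counts, weights, intercept)
    print(f"P(positive) = {p:.3f} -> predicted class {int(p >= 0.5)}")
```

In a fitted scikit-learn model, the corresponding values are stored in lr_model.coef_ and lr_model.intercept_.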