# Naive Bayes Project Tutorial



Goal of the task: build a classifier for Google Play Store reviews using Naive Bayes sentiment analysis. The classifier must label each review as positive or negative.

First, run this from the console: `pip install -r requirements.txt`

**Step 1:** Load the data

In [5]:
# install the libraries that were not installed when running requirements.txt from the console
# (the PyPI package is named scikit-learn; `sklearn` there is only a deprecated placeholder)
! pip install scikit-learn
! pip install pandas

In [7]:
# libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
import pickle

In [8]:
# load the dataset
data = pd.read_csv('https://raw.githubusercontent.com/4GeeksAcademy/naive-bayes-project-tutorial/main/playstore_reviews_dataset.csv')

In [9]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 3 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   package_name  891 non-null    object
 1   review        891 non-null    object
 2   polarity      891 non-null    int64 
dtypes: int64(1), object(2)
memory usage: 21.0+ KB


The dataset contains 891 observations and 3 variables, with no missing values.

In [10]:
data.head()

| | package_name | review | polarity |
|---|---|---|---|
| 0 | com.facebook.katana | privacy at least put some option appear offli... | 0 |
| 1 | com.facebook.katana | messenger issues ever since the last update, ... | 0 |
| 2 | com.facebook.katana | profile any time my wife or anybody has more ... | 0 |
| 3 | com.facebook.katana | the new features suck for those of us who don... | 0 |
| 4 | com.facebook.katana | forced reload on uploading pic on replying co... | 0 |


In [11]:
# drop the first variable (package_name)
data = data.drop(columns = 'package_name')

In [12]:
data.head()

| | review | polarity |
|---|---|---|
| 0 | privacy at least put some option appear offli... | 0 |
| 1 | messenger issues ever since the last update, ... | 0 |
| 2 | profile any time my wife or anybody has more ... | 0 |
| 3 | the new features suck for those of us who don... | 0 |
| 4 | forced reload on uploading pic on replying co... | 0 |


In [13]:
# strip leading and trailing whitespace and convert to lowercase
data['review'] = data['review'].str.strip().str.lower()

In [14]:
# check the distribution of the target variable
data['polarity'].value_counts()

0    584
1    307
Name: polarity, dtype: int64
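
These counts give a useful reference point: a classifier that always predicts the majority class (0) would already be right about two thirds of the time. A minimal sketch of that arithmetic, using only the counts shown above (the variable names are illustrative):

```python
# class counts taken from the value_counts() output above
n_negative = 584  # polarity == 0
n_positive = 307  # polarity == 1

total = n_negative + n_positive
# accuracy of a trivial model that always predicts the majority class
majority_baseline = max(n_negative, n_positive) / total

print(f"total reviews: {total}")
print(f"majority-class baseline accuracy: {majority_baseline:.4f}")
```

Any accuracy reported later is best read against this baseline rather than against 0.5.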

**Step 2:** Create X and y and split them into train and test sets

In [15]:
# split into X and y
X = data['review']
y = data['polarity']

In [16]:
# split into train and test, stratifying so that both sets keep the dataset's proportion of y = 0 and y = 1
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify = y, test_size = 0.25, random_state = 42)
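
To see what `stratify` does, the following sketch repeats the split on synthetic labels that have the same class counts as the dataset (584 zeros, 307 ones). The labels, the index list, and the helper `positive_share` are assumptions made for illustration; only the `train_test_split` arguments come from the cell above.

```python
from sklearn.model_selection import train_test_split

# synthetic labels with the same class counts as the dataset (584 zeros, 307 ones)
labels = [0] * 584 + [1] * 307
indices = list(range(len(labels)))

idx_train, idx_test, lab_train, lab_test = train_test_split(
    indices, labels, stratify=labels, test_size=0.25, random_state=42
)

def positive_share(values):
    """Fraction of labels equal to 1."""
    return sum(values) / len(values)

print(f"full set : {positive_share(labels):.3f}")
print(f"train set: {positive_share(lab_train):.3f}")
print(f"test set : {positive_share(lab_test):.3f}")
```

The three shares should come out nearly identical, which is the point of stratifying on an imbalanced target.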

**Step 3:** Vectorize the reviews and apply Naive Bayes

In [17]:
# vectorizer for the review text (X); English stop words are dropped
vec = CountVectorizer(stop_words = 'english')
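
A quick way to see what this vectorizer keeps from a sentence is to call its analyzer directly. The sketch below builds a separate vectorizer with the same `stop_words='english'` setting; the names `analyzer`, `negative_tokens`, and `positive_tokens` are illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer

# same configuration as `vec` above; build_analyzer() returns the function
# that lowercases, tokenizes, and drops stop words for a single document
analyzer = CountVectorizer(stop_words='english').build_analyzer()

negative_tokens = analyzer(' this app is very bad ')
positive_tokens = analyzer(' I love this app ')

print(negative_tokens)
print(positive_tokens)
```

Common words such as "this" and "is" are discarded, so only a couple of tokens per sentence reach the classifier.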

In [18]:
# fit the vectorizer on X_train only, then apply the same vocabulary to X_test
X_train = vec.fit_transform(X_train).toarray()
X_test = vec.transform(X_test).toarray()

In [20]:
# look at the resulting X_train
X_train

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])

In [21]:
X_train.shape

(668, 3161)

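X_train is a sparse matrix in the sense that almost all of its entries are zero (it is stored as a dense array here because of `.toarray()`). It has 668 rows and 3161 columns, and each column represents a distinct word. In other words, the training reviews contain 3161 distinct words once stop words are removed.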

In [23]:
# vocabulary: a dictionary mapping each word to its column index in X_train (not to its frequency)
vec.vocabulary_

{'fantastic': 1032,
 'helpful': 1309,
 'app': 194,
 'paid': 1962,
 'version': 2989,
 'wouldn': 3104,
 'really': 2207,
 'want': 3028,
 'free': 1121,
 'love': 1654,
 'able': 78,
 'photos': 2022,
 'business': 416,
 'cards': 454,
 'turn': 2867,
 'contacts': 643,
 'ability': 76,
 'text': 2765,
 'pdfs': 1991,
 'extremely': 1014,
 'useful': 2961,
 'awsome': 267,
 'great': 1217,
 'type': 2876,
 'social': 2550,
 'media': 1717,
 'wrong': 3110,
 'used': 2960,
 'crash': 680,
 'phone': 2018,
 'update': 2936,
 'doesnt': 838,
 'lag': 1540,
 'bit': 336,
 'pleaz': 2055,
 'use': 2959,
 'properly': 2142,
 'žŕ': 3157,
 'šŕ': 3151,
 'żŕ': 3155,
 'şŕ': 3149,
 'ŕľ': 3146,
 'żŕľ': 3156,
 'šŕľ': 3152,
 'let': 1572,
 'edit': 898,
 'review': 2320,
 'using': 2968,
 'keyboard': 1505,
 'week': 3053,
 'sorry': 2570,
 'unusable': 2932,
 'doesn': 837,
 'pick': 2024,
 'right': 2332,
 'word': 3091,
 'user': 2964,
 'unfriendly': 2906,
 'interface': 1447,
 'pls': 2057,
 'make': 1678,
 'usable': 2957,
 'thanks': 2774,
 'ho

In [58]:
# build a data frame from the vocabulary; note that 'frec' holds each word's column index, not its frequency
words = pd.DataFrame(data = [vec.vocabulary_.keys(), vec.vocabulary_.values()]).T
words.columns = ['word', 'frec']
words.head()

| | word | frec |
|---|---|---|
| 0 | fantastic | 1032 |
| 1 | helpful | 1309 |
| 2 | app | 194 |
| 3 | paid | 1962 |
| 4 | version | 2989 |


In [62]:
# sort the data frame by 'frec' in descending order
words_ord = words.sort_values(by = 'frec', ascending = False)

In [63]:
print('Words with the highest index (alphabetically last)')
print(words_ord.head())
print('')

print('Words with the lowest index (alphabetically first)')
print(words_ord.tail())
print('')

print('Randomly chosen words')
print(words_ord.sample(5))
print('')



Words with the highest index (alphabetically last)
     word  frec
911   ˇŕľ  3160
2259   ˇŕ  3159
909   žŕľ  3158
37     žŕ  3157
42    żŕľ  3156

Words with the lowest index (alphabetically first)
     word frec
2162  101    4
379   100    3
1480   10    2
2267   0x    1
2266   04    0

Randomly chosen words
           word  frec
1870    posting  2084
2892    effects   905
138      record  2226
820        tall  2735
207   messenger  1737



In [24]:
# fit the Naive Bayes model
model = MultinomialNB()
model.fit(X_train, y_train)
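
To make the model less of a black box, here is a plain-Python sketch of the score a multinomial Naive Bayes classifier computes: the log prior of each class plus the smoothed log likelihood of every word. The toy data and all names are invented for illustration; this is a simplified rendition of the idea, not scikit-learn's implementation.

```python
import math
from collections import Counter, defaultdict

# toy training data, invented for illustration: (text, polarity)
toy_data = [
    ("love this app", 1),
    ("great app", 1),
    ("bad app", 0),
    ("app crashes bad", 0),
]

alpha = 1.0  # Laplace smoothing, the default value in MultinomialNB
word_counts = defaultdict(Counter)  # per-class word counts
class_docs = Counter()              # per-class document counts

for text, label in toy_data:
    class_docs[label] += 1
    word_counts[label].update(text.split())

vocabulary = {w for counter in word_counts.values() for w in counter}

def log_score(text, label):
    """log P(class) + sum over words of log P(word | class), with smoothing."""
    total_words = sum(word_counts[label].values())
    score = math.log(class_docs[label] / sum(class_docs.values()))
    for word in text.split():
        if word in vocabulary:  # words never seen in training are ignored
            numerator = word_counts[label][word] + alpha
            denominator = total_words + alpha * len(vocabulary)
            score += math.log(numerator / denominator)
    return score

def predict(text):
    """Return the class with the highest log score."""
    return max(class_docs, key=lambda label: log_score(text, label))

print(predict("love app"), predict("bad app"))
```

Smoothing matters because a word seen in only one class would otherwise give the other class a probability of zero.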

In [65]:
# model accuracy on the test set
print(f'The accuracy is {round(model.score(X_test, y_test), 4)}')

The accuracy is 0.8565
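
Since the classes are imbalanced, a single accuracy figure can hide how each class fares. The sketch below counts the four confusion-matrix cells by hand; the labels are made up for illustration and are not the tutorial's predictions.

```python
from collections import Counter

# made-up labels for illustration only; these are NOT the tutorial's predictions
y_true_demo = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred_demo = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]

# count each (true, predicted) pair
pairs = Counter(zip(y_true_demo, y_pred_demo))
tn, fp = pairs[(0, 0)], pairs[(0, 1)]
fn, tp = pairs[(1, 0)], pairs[(1, 1)]

accuracy = (tp + tn) / len(y_true_demo)
recall_negative = tn / (tn + fp)  # share of true 0s that were found
recall_positive = tp / (tp + fn)  # share of true 1s that were found

print(f"accuracy={accuracy:.2f}")
print(f"recall for class 0={recall_negative:.2f}, recall for class 1={recall_positive:.2f}")
```

For the real model, `sklearn.metrics.confusion_matrix(y_test, model.predict(X_test))` would return the same four counts.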


Despite being a very basic model, the fit is reasonably good: 0.8565 is well above the roughly 0.66 that always predicting the majority class would give. Next I test the model on a few hand-picked texts.

In [67]:
# prediction for a negative comment
model.predict(vec.transform([' this app is very bad ']))

array([0])

The model correctly predicts the comment as negative.

In [70]:
# prediction for a positive comment
model.predict(vec.transform([" I love this app "]))

array([1])

Here the model also predicts the positive comment correctly.

In [72]:
# prediction for a neutral comment
model.predict(vec.transform([' neither good nor bad ']))

array([0])

Here the model gets it wrong, labelling a neutral comment as negative. It only knows two classes, so a neutral text has to land in one of them.
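
For borderline texts, the class probabilities say more than the hard label. The following sketch trains a tiny model on invented data and calls `predict_proba`; `demo_texts`, `demo_labels`, `demo_vec`, and `demo_model` are hypothetical names, and the probabilities it prints say nothing about the tutorial's model.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# toy training set, invented for illustration
demo_texts = ["love this app", "great and useful", "bad app", "crashes all the time"]
demo_labels = [1, 1, 0, 0]

demo_vec = CountVectorizer(stop_words='english')
demo_model = MultinomialNB().fit(demo_vec.fit_transform(demo_texts), demo_labels)

# one row per text, one column per class in demo_model.classes_ order
probabilities = demo_model.predict_proba(demo_vec.transform(["neither good nor bad"]))
print(demo_model.classes_)
print(probabilities.round(3))
```

A probability close to 0.5 signals that the model has little evidence either way, which is a more honest reading of a neutral comment than a forced 0 or 1.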

In [74]:
# prediction for a long text
string = '''

I love Google Play Store. The website and mobile site are both very easy to maneuver. I think it's great that Google has so many free applications available. I review for Google often and they usually send me an update to say thank you. I use Google Play often and will continue to do so. #GooglePlayStore #Techie #DigitalArt created by yours truly. Follow me on IG to see more original content. #MKR #IamMimisMusic Google Play Store

'''
model.predict(vec.transform([string]))

array([0])

In this case the comment is positive, but the model labels it as negative; apparently, being so long, it gives the model trouble.

In short: when the comment is short and clear the model works well, but on long or ambiguous texts the prediction can fail.

**Step 4:** Save the model

In [75]:
# save the model; the with block closes the file once the dump finishes
filename = '../models/nb_model.sav'
with open(filename, 'wb') as f:
    pickle.dump(model, f)
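
One caveat: the saved classifier expects vectorized input, so whoever loads it also needs the fitted vectorizer `vec`. One possible approach, sketched here on invented toy data, is to bundle both steps in a scikit-learn `Pipeline` and pickle that single object. The names `text_clf`, `demo_path`, and `loaded_clf` and the temporary path are assumptions for illustration, not part of the tutorial.

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# toy training set, invented for illustration
demo_texts = ["love this app", "great and useful", "bad app", "crashes all the time"]
demo_labels = [1, 1, 0, 0]

# vectorizer and classifier bundled into one object
text_clf = Pipeline([
    ("vec", CountVectorizer(stop_words='english')),
    ("nb", MultinomialNB()),
])
text_clf.fit(demo_texts, demo_labels)

# hypothetical temporary path, used only for this demonstration
demo_path = os.path.join(tempfile.mkdtemp(), "nb_pipeline_demo.sav")
with open(demo_path, "wb") as f:
    pickle.dump(text_clf, f)

with open(demo_path, "rb") as f:
    loaded_clf = pickle.load(f)

# the loaded pipeline accepts raw text directly
print(loaded_clf.predict(["love it"]))
```

With this layout, the loading side never has to rebuild or refit the vocabulary.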