# Naive Bayes Project

Sentiment analysis
- Naive Bayes models are well suited to sentiment analysis, topic classification, and recommendation tasks: text represented as word counts yields discrete features, which fit the theoretical and methodological assumptions of the model well.

- In this project you will practice with a dataset to create a review classifier for the Google Play store.

#### Step 1: Loading the dataset

In [89]:
import pandas as pd

total_data = pd.read_csv("https://raw.githubusercontent.com/4GeeksAcademy/naive-bayes-project-tutorial/main/playstore_reviews.csv")
total_data.head()

| | package_name | review | polarity |
|---|---|---|---|
| 0 | com.facebook.katana | privacy at least put some option appear offli... | 0 |
| 1 | com.facebook.katana | messenger issues ever since the last update, ... | 0 |
| 2 | com.facebook.katana | profile any time my wife or anybody has more ... | 0 |
| 3 | com.facebook.katana | the new features suck for those of us who don... | 0 |
| 4 | com.facebook.katana | forced reload on uploading pic on replying co... | 0 |


#### Step 2: Study of variables and their content

In [90]:
# Delete package_name variable:

total_data = total_data.drop(columns = 'package_name')
total_data.head()

| | review | polarity |
|---|---|---|
| 0 | privacy at least put some option appear offli... | 0 |
| 1 | messenger issues ever since the last update, ... | 0 |
| 2 | profile any time my wife or anybody has more ... | 0 |
| 3 | the new features suck for those of us who don... | 0 |
| 4 | forced reload on uploading pic on replying co... | 0 |


In [91]:
total_data.shape

(891, 2)

In [92]:
total_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   review    891 non-null    object
 1   polarity  891 non-null    int64 
dtypes: int64(1), object(1)
memory usage: 14.0+ KB


In [93]:
# Strip leading and trailing whitespace and convert the text to lowercase:

total_data['review'] = total_data['review'].str.strip().str.lower()
total_data.head()

| | review | polarity |
|---|---|---|
| 0 | privacy at least put some option appear offlin... | 0 |
| 1 | messenger issues ever since the last update, i... | 0 |
| 2 | profile any time my wife or anybody has more t... | 0 |
| 3 | the new features suck for those of us who don'... | 0 |
| 4 | forced reload on uploading pic on replying com... | 0 |


In [94]:
# Divide the dataset into train and test:

from sklearn.model_selection import train_test_split

X = total_data['review']
y = total_data['polarity']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)
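
The split above is purely random. As an optional variant, `train_test_split` also accepts a `stratify` argument that keeps the class proportions equal in both subsets. The sketch below uses a made-up toy frame (not the real reviews) to show the effect; the results reported in this notebook were obtained without stratification.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy frame with an imbalanced label (15 negatives, 5 positives)
toy = pd.DataFrame({
    "review": [f"text {i}" for i in range(20)],
    "polarity": [0] * 15 + [1] * 5,
})

# stratify=... preserves the 75/25 class ratio in both train and test
X_tr, X_te, y_tr, y_te = train_test_split(
    toy["review"], toy["polarity"],
    test_size=0.2, random_state=42, stratify=toy["polarity"],
)

print(y_tr.value_counts(normalize=True))
print(y_te.value_counts(normalize=True))
```

Stratification mostly matters when one class is rare; with balanced labels a random split usually gives similar proportions anyway.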

In [95]:
# Transform the text into a word count matrix:

from sklearn.feature_extraction.text import CountVectorizer

vec_model = CountVectorizer(stop_words = "english")
X_train = vec_model.fit_transform(X_train).toarray()
X_test = vec_model.transform(X_test).toarray()

In [96]:
X_train

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])

#### Step 3: Build a naive bayes model

- It does not make sense to use GaussianNB here, because our features are discrete word counts rather than continuous data.

In [97]:
# MultinomialNB

from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

model = MultinomialNB()
model.fit(X_train, y_train)


In [98]:
# Make predictions on test data:

y_pred = model.predict(X_test)

# Calculating model accuracy on test data:

accuracy_score(y_test, y_pred)

0.8156424581005587

In [99]:
# Optimize the Multinomial Model:

from sklearn.model_selection import GridSearchCV

# Define the hyperparameter grid to search:

hyperparams = {
    'alpha': [0.001, 0.01, 0.1, 1, 10],
    'fit_prior': [True, False]
}

# Initialize the grid search (5-fold cross-validation, scored by accuracy):

grid = GridSearchCV(model, hyperparams, scoring = "accuracy", cv = 5)
grid
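
The `alpha` values in the grid control additive (Laplace/Lidstone) smoothing. As a rough illustration on an invented count matrix, the sketch below recomputes the smoothed per-class term probabilities by hand and compares them with what `MultinomialNB` stores in `feature_log_prob_`. The toy data and variable names are assumptions made for this example only.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy count matrix: 4 documents, 3 terms
X_toy = np.array([[2, 0, 1],
                  [1, 0, 0],
                  [0, 3, 1],
                  [0, 1, 2]])
y_toy = np.array([0, 0, 1, 1])
alpha = 0.1

clf = MultinomialNB(alpha=alpha).fit(X_toy, y_toy)

# Manual estimate for class 0: (term count + alpha) / (total count + alpha * n_terms)
counts_0 = X_toy[y_toy == 0].sum(axis=0)    # term counts in class 0: [3, 0, 1]
manual = (counts_0 + alpha) / (counts_0.sum() + alpha * X_toy.shape[1])

print(np.exp(clf.feature_log_prob_[0]))
print(manual)
```

A term that never appears in a class (the second term for class 0 here) still receives a small non-zero probability, which is why very small `alpha` values make the model more sensitive to rare words.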

In [100]:
# Retrieve the best parameters:

grid.fit(X_train, y_train)

print(f"Best hyperparameters: {grid.best_params_}")

Best hyperparameters: {'alpha': 0.01, 'fit_prior': False}
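
Copying the best hyperparameters by hand into a new model is easy to get wrong. One alternative, sketched below on randomly generated toy counts (not the review data), is to take `best_estimator_` from the fitted search: with the default `refit=True`, it is already retrained on the full training set using `best_params_`.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy count matrix: 40 documents, 6 terms, balanced labels
rng = np.random.default_rng(0)
X_toy = rng.integers(0, 5, size=(40, 6))
y_toy = np.array([0, 1] * 20)

search = GridSearchCV(
    MultinomialNB(),
    {"alpha": [0.01, 0.1, 1.0], "fit_prior": [True, False]},
    scoring="accuracy",
    cv=5,
)
search.fit(X_toy, y_toy)

# refit=True (the default) retrains the best configuration on all of X_toy
best_model = search.best_estimator_

print(search.best_params_)
print(best_model.alpha, best_model.fit_prior)
```

Using `best_estimator_` directly guarantees that the retrained model and the reported best parameters agree.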


In [101]:
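# Retrain the model:
# Note: the grid search above selected fit_prior=False, but this cell uses
# fit_prior=True, so the accuracy reported below corresponds to fit_prior=True.

model_grid = MultinomialNB(alpha=0.01, fit_prior=True)
model_grid.fit(X_train, y_train)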

In [102]:
# Make predictions with the retrained model:

y_pred = model_grid.predict(X_test)

# Calculating model accuracy on test data:

accuracy_score(y_test, y_pred)

0.8212290502793296

In [103]:
# Save the tuned model (model_grid, not the default-parameter model):

from pickle import dump

with open("/workspaces/machine-learning-naive-bayes-Juli-MM/models/nbayes_multinomial_opt_alpha-0.01_prior-true.sav", "wb") as f:
    dump(model_grid, f)

In [104]:
# BernoulliNB

from sklearn.naive_bayes import BernoulliNB

model = BernoulliNB()
model.fit(X_train, y_train)

In [105]:
# Make predictions on test data:

y_pred = model.predict(X_test)

# Calculating model accuracy on test data:

accuracy_score(y_test, y_pred)

0.770949720670391

In [106]:
# Optimize Bernoulli model:

from sklearn.model_selection import GridSearchCV

# Define the hyperparameter grid to search:
hyperparams = {
    'alpha': [0.1, 0.5, 1.0, 2.0],
    'binarize': [0.0, 0.5, 1.0],
    'fit_prior': [True, False],
}

# Initialize the grid search (5-fold cross-validation, scored by accuracy):
grid = GridSearchCV(model, hyperparams, scoring = "accuracy", cv = 5)
grid

In [107]:
# Retrieve the best parameters:

grid.fit(X_train, y_train)

print(f"Best hyperparameters: {grid.best_params_}")

Best hyperparameters: {'alpha': 0.1, 'binarize': 0.0, 'fit_prior': True}
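
The `binarize` threshold is what lets BernoulliNB accept a count matrix: every value above the threshold becomes 1 and everything else becomes 0. The sketch below, on an invented toy matrix, checks that fitting with `binarize=0.0` on counts is equivalent to binarizing the counts first and fitting with `binarize=None`.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB
from sklearn.preprocessing import binarize

# Hypothetical toy count matrix: 4 documents, 3 terms
X_toy = np.array([[3, 0, 1],
                  [0, 2, 0],
                  [1, 1, 0],
                  [0, 0, 4]])
y_toy = np.array([1, 0, 1, 0])

# Option 1: let the model threshold the counts internally
on_counts = BernoulliNB(binarize=0.0).fit(X_toy, y_toy)

# Option 2: threshold explicitly, then tell the model the input is already binary
X_bin = binarize(X_toy, threshold=0.0)
on_binary = BernoulliNB(binarize=None).fit(X_bin, y_toy)

print(X_bin)
print(np.allclose(on_counts.feature_log_prob_, on_binary.feature_log_prob_))
```

With `binarize=0.0` the model therefore only uses whether a word is present in a review, not how often it occurs.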


In [108]:
# Retrain the model:

model_grid = BernoulliNB(alpha=0.1, binarize=0.0, fit_prior=True)
model_grid.fit(X_train, y_train)

In [109]:
# Make predictions with the retrained model:

y_pred = model_grid.predict(X_test)

# Calculating model accuracy on test data:

accuracy_score(y_test, y_pred)

0.8324022346368715

In [110]:
# Save the tuned model (model_grid, not the default-parameter model):

from pickle import dump

with open("/workspaces/machine-learning-naive-bayes-Juli-MM/models/nbayes_bernoulli_opt_alpha-0.1_bina-0.0_prior-true.sav", "wb") as f:
    dump(model_grid, f)
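
A saved classifier can only score new reviews if the fitted vectorizer is available as well, since it holds the vocabulary. One possible approach, shown here as a sketch with invented toy reviews and a temporary file path, is to bundle both steps in a scikit-learn pipeline and pickle that single object.

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy reviews and labels (1 = positive, 0 = negative)
texts = ["love this app", "great update", "app crashes constantly", "terrible bugs"]
labels = [1, 1, 0, 0]

# Bundling vectorizer and classifier keeps the vocabulary with the model
pipe = make_pipeline(CountVectorizer(), BernoulliNB(alpha=0.1))
pipe.fit(texts, labels)

# Temporary path used only for this illustration
path = os.path.join(tempfile.mkdtemp(), "nb_pipeline.sav")
with open(path, "wb") as f:
    pickle.dump(pipe, f)

# Reload and predict directly on raw text
with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored.predict(["love the update"]))
```

The reloaded pipeline accepts raw strings, so no separate preprocessing step has to be repeated at prediction time.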

#### Conclusions:

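- With default parameters, MultinomialNB outperformed BernoulliNB on the test set (accuracy 0.816 vs. 0.771).
- After tuning, BernoulliNB scored slightly higher (0.832) than MultinomialNB (0.821). Note that MultinomialNB was retrained with fit_prior=True, although the grid search selected fit_prior=False.
- Both models suit the discrete nature of our dataset: MultinomialNB models word counts and BernoulliNB models word presence. The gap between the tuned models is small, so these results do not show a clear winner.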

Dataset variables, for reference:

- package_name. Name of the mobile application (categorical)
- review. Comment about the mobile application (free text)
- polarity. Class variable (numeric): 0 for a negative comment, 1 for a positive one