# Naive-Bayes

---

Imported Libraries

---

In [189]:
# Data processing
# ==================================================================================
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Preprocessing and modeling
# ==================================================================================
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB
from sklearn.model_selection import GridSearchCV
from pickle import dump


# Warnings Configuration
# ==================================================================================
import warnings

warnings.filterwarnings("ignore") # Silence noisy warnings (e.g. from sklearn)

# Limit float output to 3 decimal places.
# NOTE: This affects only the display and not the underlying data, which remains unchanged.
pd.set_option('display.float_format', lambda x: '{:.3f}'.format(x))

---

## Step 1: Load the dataset

In [190]:
df = pd.read_csv("https://raw.githubusercontent.com/4GeeksAcademy/naive-bayes-project-tutorial/main/playstore_reviews.csv")
df.head(3)

|   | package_name | review | polarity |
|---|---|---|---|
| 0 | com.facebook.katana | privacy at least put some option appear offli... | 0 |
| 1 | com.facebook.katana | messenger issues ever since the last update, ... | 0 |
| 2 | com.facebook.katana | profile any time my wife or anybody has more ... | 0 |


**Description and types of Data**

- `package_name` --> Name of the mobile application (categorical)
- `review` --> Comment about the mobile application (free text)
- `polarity` --> Class variable (numeric): 0 for a negative comment, 1 for a positive one

---

## Step 2: Study of variables and their content

In [191]:
# Obtain dimensions

rows, columns = df.shape

print(f"The dimensions of this dataset are: {rows} Rows and {columns} Columns")

The dimensions of this dataset are: 891 Rows and 3 Columns


In [192]:
# Obtain information about data types and non-null values

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 3 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   package_name  891 non-null    object
 1   review        891 non-null    object
 2   polarity      891 non-null    int64 
dtypes: int64(1), object(2)
memory usage: 21.0+ KB


In [193]:
# Check null values

null_var = df.isnull().sum().loc[lambda x: x > 0] # Null count for each variable that has any nulls.

num_of_null_var = len(null_var) # Number of variables with at least 1 null.

print(f"{null_var}\n\nThe number of variables with null values is {num_of_null_var}")

Series([], dtype: int64)

The number of variables with null values is 0


In this case, we have only 3 variables: 2 predictors and a dichotomous label. Of the two predictors, we are only interested in the comment, since whether a comment is positive or negative depends on its content, not on the application it was written about. Therefore, the `package_name` variable should be removed.

In [194]:
# Eliminate irrelevant columns

df.drop(columns = ["package_name"], inplace = True)

df.head(3)

|   | review | polarity |
|---|---|---|
| 0 | privacy at least put some option appear offli... | 0 |
| 1 | messenger issues ever since the last update, ... | 0 |
| 2 | profile any time my wife or anybody has more ... | 0 |


When we work with text, as in this case, a conventional EDA does not make sense, because the only variable we are interested in is the one that contains the text. When the text is one part of a larger dataset with other numeric predictors and a different prediction objective, an EDA does make sense.

However, we cannot work with plain text; it must first be processed. This process consists of several steps:

- ### 2.1 Removing leading and trailing whitespace and converting the text to lowercase:

In [195]:
df["review"] = df["review"].str.strip().str.lower()
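As a minimal sketch of what this cell does, the snippet below applies the same two string methods to a few made-up reviews (the strings are hypothetical, not taken from the dataset). Note that `str.strip()` only removes whitespace at the start and end of each string; spaces between words and punctuation are left untouched.

```python
import pandas as pd

# Hypothetical reviews used only for illustration
reviews = pd.Series(["  Great APP!  ", "Keeps CRASHING\n", "ok"])

# Same chain as in the notebook: trim the ends, then lowercase
cleaned = reviews.str.strip().str.lower()

print(cleaned.tolist())
# ['great app!', 'keeps crashing', 'ok']
```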

- ### 2.2 Divide the dataset into train and test

In [196]:
# Train - Test - Split
# ===============================================================================

X = df['review'] # Predictor
y = df['polarity'] # Target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size = 0.2, random_state = 42
)

X_train.head(3)

331    just did the latest update on viber and yet ag...
733    keeps crashing it only works well in extreme d...
382    the fail boat has arrived the 6.0 version is t...
Name: review, dtype: object

In [197]:
# Print the shape of each split
# =====================================================================================

for name, data in [("X_train", X_train), ("X_test", X_test),
                   ("y_train", y_train), ("y_test", y_test)]:
    print(f"{name} shape: {data.shape}")

X_train shape: (712,)
X_test shape: (179,)
y_train shape: (712,)
y_test shape: (179,)



- ### 2.3 Transform the text into a word count matrix

This is a way to obtain numerical features from the text. We fit the transformer on the training set only and then apply it to the test set, so that no information from the test set leaks into the vocabulary.

In [198]:
# Convert a collection of text documents to a matrix of token counts.

vec_model = CountVectorizer(stop_words = "english") # If ‘english’, a built-in stop word list for English is used

vec_model

In [199]:
X_train = vec_model.fit_transform(X_train).toarray()
X_test = vec_model.transform(X_test).toarray()

In [200]:
X_train

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)

Once we have finished we will have the predictors ready to train the model.

---

## Step 3: Build a Naive Bayes model

Start solving the problem by implementing a model. You have to choose one of three implementations: `GaussianNB`, `MultinomialNB` or `BernoulliNB`, according to what we have studied in the module. Then train the other two implementations as well and confirm whether the model you chose is the right one.

---

- ### 3.1 GaussianNB

In [201]:
model_gauss = GaussianNB()
model_gauss.fit(X_train, y_train)

In [202]:
y_pred_gauss = model_gauss.predict(X_test)
y_pred_gauss

array([0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0,
       1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0,
       0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,
       0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1,
       1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0,
       0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0,
       1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1,
       1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1,
       0, 0, 0], dtype=int64)

In [203]:
gaussian_accuracy = accuracy_score(y_test, y_pred_gauss)
print(f"The Gaussian Naive-Bayes Accuracy is: {gaussian_accuracy}")

The Gaussian Naive-Bayes Accuracy is: 0.8044692737430168


---

- ### 3.2 MultinomialNB

In [204]:
model_multi = MultinomialNB()
model_multi.fit(X_train, y_train)

In [205]:
y_pred_multi = model_multi.predict(X_test)
y_pred_multi

array([0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0,
       1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0,
       0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,
       1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0,
       1, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0,
       0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0,
       0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1,
       0, 0, 0], dtype=int64)

In [206]:
multinomial_accuracy = accuracy_score(y_test, y_pred_multi)
print(f"The Multinomial Naive-Bayes Accuracy is: {multinomial_accuracy}")

The Multinomial Naive-Bayes Accuracy is: 0.8156424581005587


---

- ### 3.3 BernoulliNB

In [207]:
model_ber = BernoulliNB()
model_ber.fit(X_train, y_train)

In [208]:
y_pred_ber = model_ber.predict(X_test)
y_pred_ber

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0,
       1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0,
       0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0,
       1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
       0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0,
       0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0,
       0, 0, 0], dtype=int64)

In [209]:
ber_accuracy = accuracy_score(y_test, y_pred_ber)
print(f"The Bernoulli Naive-Bayes Accuracy is: {ber_accuracy}")

The Bernoulli Naive-Bayes Accuracy is: 0.770949720670391


---

- ### 3.4 Result

In [210]:
print(f"Multinomial Naive-Bayes has the best accuracy: {multinomial_accuracy}")

Multinomial Naive-Bayes has the best accuracy: 0.8156424581005587
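The three fit/predict/score cells above follow the same pattern, so the comparison can also be written as a loop that picks the winner programmatically instead of hard-coding it. The sketch below uses synthetic count data generated with NumPy (an assumption for illustration, not the Play Store reviews), so its scores say nothing about the real dataset.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

# Synthetic "word counts": three features whose rates depend on the class
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=200)
rates = np.where(y[:, None] == 1, [3.0, 0.5, 1.0], [0.5, 3.0, 1.0])
X = rng.poisson(rates)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

models = {
    "GaussianNB": GaussianNB(),
    "MultinomialNB": MultinomialNB(),
    "BernoulliNB": BernoulliNB(),
}

# Fit each model and record its accuracy on the held-out split
scores = {
    name: accuracy_score(y_te, model.fit(X_tr, y_tr).predict(X_te))
    for name, model in models.items()
}

best_name = max(scores, key=scores.get)
print(scores)
print(f"Best on this toy data: {best_name}")
```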


---

## Step 4: Model hyperparameters optimization

If you have a small number of hyperparameters, each with a large set of possible values, and a model that takes a lot of processing time to train, then `RandomizedSearchCV` will save you a lot of time while still giving a good estimate of the optimal parameters.

Furthermore, you can use its results to run `RandomizedSearchCV` again with a smaller set of possible values. Better yet, once `RandomizedSearchCV` has given you a rough idea of where the optimum lies, run `GridSearchCV` on that smaller set.

If your model does not take long to train, or if you already have a rough idea of where the optimal values are (from inference or theoretical knowledge), use `GridSearchCV`: it evaluates every combination you pass, so you know with certainty which of them gives the best cross-validated score.
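The two-stage strategy described above can be sketched as follows. This is an illustrative example on synthetic count data, not the notebook's method: the search ranges, the number of iterations and the use of `scipy.stats.loguniform` are all assumptions chosen for the demonstration.

```python
import numpy as np
from scipy.stats import loguniform
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.naive_bayes import MultinomialNB

# Synthetic "word counts" standing in for a vectorized corpus
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=120)
rates = np.where(y[:, None] == 1, [3.0, 0.5, 1.0], [0.5, 3.0, 1.0])
X = rng.poisson(rates)

# Stage 1: sample 15 values of alpha from a wide log-uniform range
coarse = RandomizedSearchCV(
    MultinomialNB(),
    param_distributions={"alpha": loguniform(1e-3, 1e1)},
    n_iter=15,
    scoring="accuracy",
    cv=5,
    random_state=42,
)
coarse.fit(X, y)
coarse_alpha = coarse.best_params_["alpha"]

# Stage 2: exhaustive grid in a narrow band around the stage-1 winner
fine_grid = {"alpha": np.linspace(coarse_alpha / 2, coarse_alpha * 2, 5)}
fine = GridSearchCV(MultinomialNB(), fine_grid, scoring="accuracy", cv=5)
fine.fit(X, y)

print(f"Coarse alpha: {coarse_alpha:.4f}")
print(f"Refined alpha: {fine.best_params_['alpha']:.4f}")
```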

In [211]:
# We define the hyperparameter values that we want to search over

param_grid = {
    "alpha": [0.05, 0.07, 0.09, 0.1, 0.12],
    "fit_prior": [True, False],
    "class_prior": [None] # Must be None or an array of prior probabilities, one per class
}

# We initialize the grid

grid = GridSearchCV(model_multi,
                    param_grid,
                    scoring = "accuracy",
                    verbose = 0,
                    refit = True)
grid

In [212]:
grid.fit(X_train, y_train)

print(f"Best hyperparameters: {grid.best_params_}")

Best hyperparameters: {'alpha': 0.09, 'class_prior': None, 'fit_prior': True}


In [213]:
model_grid = MultinomialNB(
    alpha = 0.09,
    class_prior = None,
    fit_prior = True
)

model_grid.fit(X_train, y_train)

y_pred = model_grid.predict(X_test)

grid_accuracy = accuracy_score(y_test, y_pred)
grid_accuracy

0.8324022346368715
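Because the grid was created with `refit = True`, the best model does not have to be rebuilt by hand: after fitting, `GridSearchCV` exposes it as `best_estimator_`, already retrained on the full training data. The sketch below checks this on synthetic count data (an assumption for illustration); on this toy example the manually rebuilt model and `best_estimator_` give identical predictions.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB

# Synthetic "word counts" standing in for the vectorized reviews
rng = np.random.default_rng(1)
y = rng.integers(0, 2, size=120)
rates = np.where(y[:, None] == 1, [3.0, 0.5, 1.0], [0.5, 3.0, 1.0])
X = rng.poisson(rates)

grid = GridSearchCV(
    MultinomialNB(),
    {"alpha": [0.05, 0.1, 0.5]},
    scoring="accuracy",
    refit=True,
)
grid.fit(X, y)

# Option 1: take the refitted model straight from the search
best_model = grid.best_estimator_

# Option 2: rebuild it from the best parameters, as the notebook does
manual_model = MultinomialNB(**grid.best_params_).fit(X, y)

same = np.array_equal(best_model.predict(X), manual_model.predict(X))
print(grid.best_params_, same)
```

Unpacking `grid.best_params_` also avoids copying the values by hand, which keeps the code correct if the grid changes later.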

In [214]:
print(f"Relative improvement in accuracy: {round(((grid_accuracy - multinomial_accuracy)/multinomial_accuracy)*100, 2)}%")

Relative improvement in accuracy: 2.05%
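The 2.05% figure is a relative change, which is easy to confuse with a change in percentage points. The short calculation below reuses the two accuracies and the test-set size printed earlier in this notebook to show both views, plus what the gain means in terms of individual predictions.

```python
multinomial_accuracy = 0.8156424581005587  # default MultinomialNB (output above)
grid_accuracy = 0.8324022346368715         # tuned MultinomialNB (output above)
n_test = 179                               # rows in the test set

absolute_gain = grid_accuracy - multinomial_accuracy
relative_gain = absolute_gain / multinomial_accuracy

# Accuracy is correct / n_test, so the counts can be recovered exactly
extra_correct = round(grid_accuracy * n_test) - round(multinomial_accuracy * n_test)

print(f"Absolute gain: {absolute_gain * 100:.2f} percentage points")
print(f"Relative gain: {relative_gain * 100:.2f}%")
print(f"Additional correct predictions: {extra_correct} of {n_test}")
# Absolute gain: 1.68 percentage points
# Relative gain: 2.05%
# Additional correct predictions: 3 of 179
```

With a test set this small, a difference of three predictions is worth treating with some caution.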


---

## Step 5: Save the model

In [215]:
with open("../models/MultinomialNB.sav", "wb") as model_file:
    dump(model_grid, model_file)
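A saved classifier can only score new reviews if the fitted vectorizer is available too, since the model expects exactly the columns learned during training. The sketch below shows one possible way to handle this: pickling both objects together and loading them back. The toy texts, the dictionary layout and the file name `nb_bundle.pkl` are all hypothetical, and a temporary directory is used so that nothing is written to the project folder.

```python
import pickle
import tempfile
from pathlib import Path

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy data used only for illustration
texts = ["love this app", "great update", "keeps crashing", "terrible bug"]
labels = [1, 1, 0, 0]

vectorizer = CountVectorizer()
model = MultinomialNB().fit(vectorizer.fit_transform(texts), labels)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "nb_bundle.pkl"  # hypothetical file name

    # Save the vectorizer and the model side by side
    with open(path, "wb") as f:
        pickle.dump({"vectorizer": vectorizer, "model": model}, f)

    # Load them back, as a separate script would
    with open(path, "rb") as f:
        bundle = pickle.load(f)

new_reviews = ["great app", "crashing bug"]
features = bundle["vectorizer"].transform(new_reviews)
predictions = bundle["model"].predict(features)
print(predictions)
```

Only unpickle files from sources you trust, since loading a pickle can execute arbitrary code.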