# Naive-Bayes

---

Naive Bayes is a simple yet powerful probabilistic classification algorithm based on Bayes' Theorem. It is called "naive" because it makes a key assumption: all the features (or predictors) in the data are independent of each other once the class is known. In real-world data this assumption rarely holds, yet Naive Bayes still performs well on many complex problems.


#### **How Naive Bayes Works**

The algorithm calculates the posterior probability for each class and assigns the class with the highest probability. There are several variants of Naive Bayes, which differ mainly in the assumptions they make about the distribution of the data:

1. **Gaussian Naive Bayes**: Assumes that the continuous values associated with each feature are distributed according to a Gaussian (normal) distribution. This is used when features are continuous and can be modeled well by a bell-shaped curve.

2. **Multinomial Naive Bayes**: Typically used for discrete data, such as word counts in text classification problems. It works well for features that represent counts or frequencies.

3. **Bernoulli Naive Bayes**: Suitable for binary/boolean data, where features are either present or absent (e.g., spam email classification based on the presence or absence of certain words).
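As a rough illustration of how the three variants map onto feature types, the sketch below fits each scikit-learn estimator on synthetic data of the matching kind. The data, the rates, and the variable names are assumptions made up for this example; they are not taken from the dataset used later.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

rng = np.random.default_rng(0)
y = np.array([0] * 50 + [1] * 50)

# Continuous features: class 1 is shifted by +2 on every feature
X_continuous = rng.normal(size=(100, 3)) + 2.0 * y[:, None]

# Count features: each class favours a different feature (hypothetical rates)
rates = np.where(y[:, None] == 1, [1.0, 1.0, 3.0], [3.0, 1.0, 1.0])
X_counts = rng.poisson(lam=rates)

# Binary features: 1 if the count is above 1, else 0
X_binary = (X_counts > 1).astype(int)

scores = {}
for name, model, X in [
    ("GaussianNB", GaussianNB(), X_continuous),
    ("MultinomialNB", MultinomialNB(), X_counts),
    ("BernoulliNB", BernoulliNB(), X_binary),
]:
    model.fit(X, y)
    scores[name] = model.score(X, y)  # training accuracy, for illustration only

print(scores)
```

The scores are training accuracies on toy data, so they only show that each variant accepts its matching feature type; they say nothing about real-world performance.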

#### **Assumptions**

The key assumption of Naive Bayes is that the features are conditionally independent given the class label. While this assumption is rarely true in real-world data, Naive Bayes often works surprisingly well in practice.
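Written out, the conditional independence assumption turns Bayes' Theorem into a product of per-feature terms. The notation here ($y$ for the class, $x_1, \dots, x_n$ for the features) is introduced only for this illustration:

$$
P(y \mid x_1, \dots, x_n) = \frac{P(y)\, P(x_1, \dots, x_n \mid y)}{P(x_1, \dots, x_n)}
\;\;\Longrightarrow\;\;
P(y \mid x_1, \dots, x_n) \propto P(y) \prod_{i=1}^{n} P(x_i \mid y)
$$

$$
\hat{y} = \arg\max_{y} \; P(y) \prod_{i=1}^{n} P(x_i \mid y)
$$

The denominator is the same for every class, so it can be dropped when only the most probable class is needed.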

#### **Advantages**

- **Efficient**: Naive Bayes is computationally efficient and works well with large datasets.
- **Simple to Implement**: The algorithm is simple to understand and implement.
- **Performs Well with High-Dimensional Data**: It is effective in text classification tasks, such as spam filtering and sentiment analysis.
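- **Can Tolerate Missing Data**: Because each feature contributes its own independent term, a missing value can in principle be skipped during probability estimation. Note that scikit-learn's Naive Bayes estimators do not do this automatically and expect inputs without missing values.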

#### **Disadvantages**
- **Strong Feature Independence Assumption**: The assumption of independence between features is often unrealistic.
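- **Data Scarcity (Zero Frequency)**: If a feature value (for example, a word) never appears together with a given class in the training data, its estimated conditional probability is zero, which zeroes out the whole product for that class and leads to poor predictions. Additive (Laplace) smoothing is the usual remedy.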
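The zero-probability problem and additive smoothing can be illustrated in a few lines of plain Python. The word counts are invented for the example, and `smoothed_likelihood` is a hypothetical helper, not a library function. The same idea sits behind the `alpha` parameter of scikit-learn's `MultinomialNB`.

```python
def smoothed_likelihood(count, total_count, vocab_size, alpha=1.0):
    """Estimate P(word | class) with additive smoothing.

    alpha = 0 gives the raw relative frequency; alpha > 0 keeps unseen
    words from receiving a probability of exactly zero.
    """
    return (count + alpha) / (total_count + alpha * vocab_size)


# Hypothetical counts of three words in the "spam" class
spam_counts = {"free": 3, "winner": 2, "meeting": 0}
total = sum(spam_counts.values())  # 5 word occurrences in total
vocab = len(spam_counts)           # 3 distinct words

unsmoothed = smoothed_likelihood(spam_counts["meeting"], total, vocab, alpha=0.0)
smoothed = smoothed_likelihood(spam_counts["meeting"], total, vocab, alpha=1.0)

print(unsmoothed)  # 0.0
print(smoothed)    # 0.125
```

With `alpha = 1`, the smoothed estimates over the whole vocabulary still sum to 1, so they remain a valid probability distribution.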

#### **Applications**
- **Text Classification**: Spam detection, sentiment analysis, and news categorization.
- **Medical Diagnosis**: Predicting the likelihood of diseases based on symptoms.
- **Document Categorization**: Classifying documents into topics.

#### **Example**

Imagine you want to classify emails as spam or not spam (binary classification) based on features like the presence of certain words. Naive Bayes would estimate, for each class, the probability of observing each feature (word) given that class, multiply these probabilities together with the prior probability of the class, and assign the email to the class with the highest resulting score.
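The spam example can be worked through by hand. The sketch below uses made-up priors and word likelihoods (all numbers and the `posterior` helper are assumptions for illustration) and combines them the way Naive Bayes does.

```python
import math

# Hypothetical probabilities, chosen only for illustration
priors = {"spam": 0.4, "ham": 0.6}
likelihoods = {
    "spam": {"free": 0.30, "meeting": 0.02},
    "ham": {"free": 0.03, "meeting": 0.20},
}


def posterior(words):
    """Return P(class | words) under the naive independence assumption."""
    # Sum logs instead of multiplying, so small probabilities do not underflow
    log_scores = {
        label: math.log(priors[label])
        + sum(math.log(likelihoods[label][word]) for word in words)
        for label in priors
    }
    # Normalize so that the class probabilities sum to 1
    max_log = max(log_scores.values())
    unnormalized = {label: math.exp(s - max_log) for label, s in log_scores.items()}
    total = sum(unnormalized.values())
    return {label: value / total for label, value in unnormalized.items()}


result = posterior(["free"])
print(result)
print(max(result, key=result.get))  # spam
```

For an email containing only "free", the unnormalized scores are 0.4 × 0.30 = 0.12 for spam and 0.6 × 0.03 = 0.018 for ham, so spam wins with a posterior of about 0.87.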


---

Imported Libraries

In [156]:
# Data processing
# ==================================================================================
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Preprocessing and modeling
# ==================================================================================
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import GaussianNB
from sklearn.naive_bayes import MultinomialNB
from sklearn.naive_bayes import BernoulliNB
from sklearn.model_selection import GridSearchCV
from pickle import dump


# Warnings Configuration
# ==================================================================================
import warnings

warnings.filterwarnings("ignore") # Ignore library warnings (e.g. from sklearn)

# Limit float output to 3 decimal places.
# NOTE: This affects only the display and not the underlying data, which remains unchanged.
pd.set_option('display.float_format', lambda x: '{:.3f}'.format(x))

---

## Step 1: Loading the dataset

In [157]:
df = pd.read_csv("https://raw.githubusercontent.com/4GeeksAcademy/naive-bayes-project-tutorial/main/playstore_reviews.csv")
df.head(3)

|   | package_name | review | polarity |
|---|---|---|---|
| 0 | com.facebook.katana | privacy at least put some option appear offli... | 0 |
| 1 | com.facebook.katana | messenger issues ever since the last update, ... | 0 |
| 2 | com.facebook.katana | profile any time my wife or anybody has more ... | 0 |


**Description and types of Data**

- `package_name` --> Name of the mobile application <i>(categorical)</i>
- `review` --> Comment about the mobile application <i>(text)</i>
- `polarity` --> Class variable (0 or 1), where 0 is a negative comment and 1 a positive one <i>(numerical)</i>

---

## Step 2: Study of variables and their content

In [158]:
# Obtain dimensions

rows, columns = df.shape

print(f"The dimensions of this dataset are: {rows} Rows and {columns} Columns")

The dimensions of this dataset are: 891 Rows and 3 Columns


In [159]:
# Obtain information about data types and non-null values

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 3 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   package_name  891 non-null    object
 1   review        891 non-null    object
 2   polarity      891 non-null    int64 
dtypes: int64(1), object(2)
memory usage: 21.0+ KB


In [160]:
# Check null values

null_var = df.isnull().sum().loc[lambda x: x > 0] # Number of nulls in each variable.

num_of_null_var = len(null_var) # Number of variables with at least 1 null.

print(f"{null_var}\n\nThe number of variables with null values is {num_of_null_var}")

Series([], dtype: int64)

The number of variables with null values is 0


In this case, we have only 3 variables: 2 predictors and a binary label. Of the two predictors, we are only interested in the review text, since whether a comment is positive or negative depends on its content, not on the application it was written about. Therefore, the `package_name` variable should be removed.

In [161]:
# Eliminate irrelevant columns

df.drop(columns = ['package_name'], inplace = True)

df.head(3)

|   | review | polarity |
|---|---|---|
| 0 | privacy at least put some option appear offli... | 0 |
| 1 | messenger issues ever since the last update, ... | 0 |
| 2 | profile any time my wife or anybody has more ... | 0 |


When we work with text, as in this case, a conventional EDA does not make sense, because the only variable we are interested in is the one that contains the text. When the text is one part of a larger dataset with numeric predictors and a different prediction objective, applying an EDA does make sense.

However, we cannot work with plain text; it must first be processed. This process consists of several steps:

- ### 2.1 Removing leading and trailing whitespace and converting the text to lowercase:

In [162]:
df["review"] = df["review"].str.strip().str.lower()
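To make the effect of this step concrete, here is a small sketch on a toy `Series` (the two reviews are invented for the example). Note that `str.strip()` only removes whitespace at the start and end of each string; spaces between words are kept.

```python
import pandas as pd

# Toy reviews, invented for this example
reviews = pd.Series(["  Great APP!  ", "Crashes ALL the time\n"])

cleaned = reviews.str.strip().str.lower()
print(cleaned.tolist())  # ['great app!', 'crashes all the time']
```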

- ### 2.2 Divide the dataset into train and test

In [163]:
# Train - Test - Split
# ===============================================================================
def split(variable,
          target,
          test_size=0.2,
          random_state=42):
    """Split the global `df` into train and test sets for one predictor and one target."""

    X = df[variable] # Predictor
    y = df[target] # Target

    X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                        test_size = test_size,
                                                        random_state = random_state)

    return X_train, X_test, y_train, y_test

In [164]:
X_train, X_test, y_train, y_test = split('review', 'polarity')

In [165]:
# Print the shape of each split
# =====================================================================================

for name, part in [("X_train", X_train), ("X_test", X_test),
                   ("y_train", y_train), ("y_test", y_test)]:
    print(f"{name} shape: {part.shape}")

X_train shape: (712,)
X_test shape: (179,)
y_train shape: (712,)
y_test shape: (179,)



- ### 2.3 Transform the text into a word count matrix

This is a way to obtain numerical features from the text. We fit the transformer on the training set only and then apply it to the test set, so that no information from the test set leaks into the vocabulary.

In [166]:
# Convert a collection of text documents to a matrix of token counts.

vec_model = CountVectorizer(stop_words = "english") # If ‘english’, a built-in stop word list for English is used

vec_model

In [167]:
X_train = vec_model.fit_transform(X_train).toarray()
X_test = vec_model.transform(X_test).toarray()

In [168]:
X_train

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)

Once we have finished we will have the predictors ready to train the model.
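The fit-on-train, transform-on-test pattern is easier to see on a tiny corpus. In the sketch below the documents are invented for the example; the point is that the vocabulary is learned from the training documents only, so a word that appears only in the test document (here "update") gets no column.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus, invented for this example
train_docs = ["great app love it", "bad app crashes", "love love love"]
test_docs = ["great update but crashes"]

vectorizer = CountVectorizer(stop_words="english")
train_counts = vectorizer.fit_transform(train_docs)  # learn the vocabulary on train only
test_counts = vectorizer.transform(test_docs)        # reuse it; unseen words are dropped

vocabulary = sorted(vectorizer.vocabulary_)  # vocabulary_ maps each term to its column index
print(vocabulary)
print(train_counts.toarray())
print(test_counts.toarray())
```

Both matrices have the same number of columns, one per term in the training vocabulary, which is what lets a model trained on `train_counts` score `test_counts`.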

---

## Step 3: Build a naive bayes model

Start by choosing which of the three implementations to use, `GaussianNB`, `MultinomialNB` or `BernoulliNB`, according to what we have studied in the module, and train it. Then train the other two implementations and confirm whether the model you chose is the right one.

---


In [169]:
# Models imported from the scikit-learn library
models = [GaussianNB, MultinomialNB, BernoulliNB]

# List of model names, in the same order as `models`
models_ = ['Gaussian_Naive_Bayes', 'Multinomial_Naive_Bayes', 'Bernoulli_Naive_Bayes']

# Empty list for accuracies
accuracy_model = []

# Process the models and select the best
def build(models):

    for model_NB in models:

        model = model_NB()
        model.fit(X_train, y_train)

        y_pred = model.predict(X_test)
        
        accuracy = accuracy_score(y_test, y_pred)

        accuracy_model.append(accuracy)

    print(best_model())


def best_model():

    dict_models = dict(zip(models_, accuracy_model)) # Make a dictionary with accuracy and model's names

    sort_dict_models = sorted(dict_models.items(), key=lambda item: item[1], reverse=True) # Sort by accuracy (not by name), highest first
    
    return f"{sort_dict_models[0][0]} has the best accuracy with: {sort_dict_models[0][1]}"

In [170]:
# Execute the build() function

build(models)

Multinomial_Naive_Bayes has the best accuracy with: 0.8156424581005587


---

## Step 4: Model hyperparameters optimization

If you have a small number of hyperparameters, each with a large set of possible values, and a model that is slow to train, then `RandomizedSearchCV` will save you a lot of time while still giving you a good estimate of the optimal parameters.

Furthermore, you can use its results to run `RandomizedSearchCV` again over a narrower range of values or, better yet, run `GridSearchCV` on that narrower range once `RandomizedSearchCV` has given you a rough idea of where the optimum lies.

If your model does not take long to train, or if you already have a rough idea of where the optimal values are (from inference or theoretical knowledge), `GridSearchCV` is the better choice: it evaluates every combination in the grid, so you know exactly which of the values you passed gives the best cross-validated score.
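For reference, scikit-learn's randomized search class is named `RandomizedSearchCV`. The sketch below shows how it could be applied to `MultinomialNB`; the synthetic count data, the log-uniform range for `alpha`, and `n_iter=20` are assumptions chosen for illustration, not settings taken from this project.

```python
import numpy as np
from scipy.stats import loguniform
from sklearn.model_selection import RandomizedSearchCV
from sklearn.naive_bayes import MultinomialNB

# Synthetic count data standing in for a document-term matrix (assumption)
rng = np.random.default_rng(42)
X_demo = rng.poisson(lam=1.0, size=(200, 30))
y_demo = rng.integers(0, 2, size=200)

# Sample alpha from a continuous log-uniform range instead of a fixed list
param_distributions = {
    "alpha": loguniform(1e-3, 1e1),
    "fit_prior": [True, False],
}

random_search = RandomizedSearchCV(
    MultinomialNB(),
    param_distributions,
    n_iter=20,          # number of sampled parameter combinations
    scoring="accuracy",
    cv=5,
    random_state=42,
)
random_search.fit(X_demo, y_demo)

print(random_search.best_params_)
```

The `best_params_` found this way can then be used to centre a small `GridSearchCV` grid around the most promising `alpha`.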

In [171]:
# We define the parameters that we want to adjust by hand

param_grid = {
    "alpha": [0.05, 0.07, 0.09, 0.1, 0.12],
    "fit_prior": [True, False],
    "class_prior": [None] # Must be None or an array with one prior per class; the string 'array-like' is not a valid value
}

# We initialize the grid

grid = GridSearchCV(MultinomialNB(),
                    param_grid,
                    scoring = "accuracy",
                    verbose = 0,
                    refit = True)
grid

In [172]:
grid.fit(X_train, y_train)

print(f"Best hyperparameters: {grid.best_params_}")

Best hyperparameters: {'alpha': 0.09, 'class_prior': None, 'fit_prior': True}


In [173]:
model_grid = MultinomialNB(
    alpha = 0.09,
    class_prior = None,
    fit_prior = True
)

model_grid.fit(X_train, y_train)

y_pred = model_grid.predict(X_test)

grid_accuracy = accuracy_score(y_test, y_pred)
grid_accuracy

0.8324022346368715
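When a grid search is created with `refit = True`, the fitted search object already holds a model retrained on the whole training set with the best parameters, so copying the values by hand is optional. A minimal sketch on synthetic data (the data and the names `demo_grid`, `X_demo`, `y_demo` are assumptions for illustration):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB

# Synthetic count data (assumption), used only to have something to fit
rng = np.random.default_rng(0)
X_demo = rng.poisson(lam=1.0, size=(120, 20))
y_demo = rng.integers(0, 2, size=120)

demo_grid = GridSearchCV(
    MultinomialNB(),
    {"alpha": [0.05, 0.1, 0.5]},
    scoring="accuracy",
    refit=True,  # retrain the best configuration on all of X_demo
)
demo_grid.fit(X_demo, y_demo)

best = demo_grid.best_estimator_  # already fitted with the best parameters
print(best.alpha == demo_grid.best_params_["alpha"])  # True

# Calling predict on the search object delegates to best_estimator_
predictions = demo_grid.predict(X_demo[:5])
print(predictions)
```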

In [176]:
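# `sort_dict_models` is local to best_model(), so read the baseline from the global accuracy list
model_accuracy = accuracy_model[models_.index('Multinomial_Naive_Bayes')]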

print(f"We have an increment of {round(((grid_accuracy - model_accuracy)/model_accuracy)*100, 2)}%")

We have an increment of 2.05%


---

## Step 5: Save the model

In [175]:
with open("../models/MultinomialNB.sav", "wb") as file: # The context manager closes the file after writing
    dump(model_grid, file)
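A saved classifier is only usable on new text if the fitted `CountVectorizer` is available as well, since the model expects the same columns it was trained on. One possible approach, sketched here under assumptions, is to bundle both in a scikit-learn pipeline and pickle that. The toy texts and labels are invented; `alpha = 0.09` is the value found in Step 4.

```python
import pickle

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy training data, invented for this example
texts = ["love this app", "great app works well", "terrible app crashes", "hate the new update"]
labels = [1, 1, 0, 0]

# Vectorizer and classifier travel together as one object
text_clf = make_pipeline(CountVectorizer(stop_words="english"), MultinomialNB(alpha=0.09))
text_clf.fit(texts, labels)

# Round trip through pickle (in memory here; writing to a file works the same way)
restored = pickle.loads(pickle.dumps(text_clf))

new_reviews = ["love it", "crashes constantly"]
print(restored.predict(new_reviews))  # raw strings go in, no manual vectorizing needed
```

Only unpickle files that come from a source you trust, since loading a pickle can execute arbitrary code.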

---

## Step 6: Explore other alternatives

Which of the other models we have studied could you use to try to improve on the Naive Bayes results? Justify your choice and train the model.