# Naive Bayes

### Introduction

In this notebook, I will be implementing two types of Naive Bayes models based on the dataset features and task requirements.

For reference and additional details, please go through [Chapter 4](https://web.stanford.edu/~jurafsky/slp3/) of the SLP3 book.

In this assignment, I have two datasets. One is suitable for **Multinomial Naive Bayes**, while the other is appropriate for **Bernoulli Naive Bayes**. My task is to:
1. Analyze both datasets and determine which Naive Bayes model to apply based on the dataset’s characteristics.
2. Implement both **Multinomial** and **Bernoulli Naive Bayes** from scratch, adhering to the guidelines below regarding allowed libraries.
3. Finally, apply the corresponding models using the `sklearn` library and compare the results with my implementation.

### Guidelines:
- Use only **numpy** and **pandas** for the manual implementation of Naive Bayes classifiers. No other libraries should be used for this part.
- For the final part of the assignment, I will use **sklearn** to compare my implementation results.

All necessary libraries for this assignment have already been added. We do not need to install any additional libraries.

In [24]:
# !pip install datasets
# !pip install nltk

In [1]:
# Third-party library imports
import numpy as np
import regex as re  # third-party `regex` package, used here like the built-in `re` module
import pandas as pd
from sklearn.preprocessing import LabelBinarizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix
import nltk
from datasets import load_dataset

# NLTK-specific download
nltk.download("punkt")

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\hp\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

## 1. Loading the Datasets

We are provided with two datasets:

- **Dataset 1**: Golf Dataset (available in CSV format in the given Repo)
- **Dataset 2**: Tweet Evaluation Dataset (to be loaded from Hugging Face)

### Instructions:

1. **Golf Dataset**: We can find the CSV file of the Golf Dataset in the resources provided in the Repo. This dataset aims to explore factors that influence the decision to play golf, which could be valuable for predictive modeling tasks.

2. **Tweet Evaluation Dataset**: Instead of downloading the dataset manually, we will be using the [`datasets`](https://huggingface.co/docs/datasets) library from Hugging Face to automatically download and manage the Tweet Eval dataset. This library is part of the Hugging Face ecosystem, widely used for Natural Language Processing (NLP) tasks. The `datasets` library not only downloads the dataset but also offers a standardized interface for accessing and handling the data, making it compatible with other popular libraries like Pandas and PyTorch. Format each split of the dataset into a Pandas DataFrame. The columns should be `text` and `label`, where `text` is the sentence and `label` is the emotion label. The goal is to classify tweets into various emotional categories (e.g., joy, sadness, anger) by analyzing their content.

   We can explore the extensive list of datasets available on Hugging Face [here](https://huggingface.co/datasets).

### Why Use Hugging Face?

Familiarizing ourselves with Hugging Face tools now will be beneficial for future projects and NLP-related tasks. The library simplifies data handling and ensures smooth integration with machine learning workflows.

### Task:

- Explore both datasets and identify their key features. This will help us determine which dataset is best suited for **Multinomial Naive Bayes** and which is better suited for **Bernoulli Naive Bayes**. You can read more about Bernoulli Naive Bayes [here](https://medium.com/@gridflowai/part-2-dive-into-bernoulli-naive-bayes-d0cbcbabb775).
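One way to approach this exploration is to summarize, for each column, how many distinct values it takes and whether it is binary. The sketch below is only an illustration: `summarize_features` is a hypothetical helper and `toy` is made-up data, not the Golf dataset.

```python
import pandas as pd

def summarize_features(df):
    """Return one row per column: dtype, number of distinct values, binary flag."""
    summary = pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "n_unique": df.nunique(),
    })
    # A column with exactly two distinct values can be treated as a 0/1 indicator
    summary["is_binary"] = summary["n_unique"] == 2
    return summary

# Hypothetical toy data, only for illustration
toy = pd.DataFrame({
    "Windy": [1, 0, 1, 0],
    "Outlook": ["sunny", "rainy", "sunny", "overcast"],
    "Play": [1, 0, 1, 1],
})
feature_summary = summarize_features(toy)
print(feature_summary)
```

Columns that are binary (or become binary after one-hot encoding) point towards Bernoulli Naive Bayes, while count-valued features point towards Multinomial Naive Bayes.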


In [None]:

golf_data = pd.read_csv("golf_data.csv")
golf_data.head()


   Holiday   Month  Season Temperature Humidity  Windy Outlook Crowdedness  Play
0        1  Winter  Winter         low      low      1   sunny        high     1
1        1  Winter  Winter         low      low      1   sunny        high     0
2        1  Winter  Winter         low      low      1   sunny        high     0
3        1  Winter  Winter         low      low      1   sunny        high     1
4        1  Winter  Winter         low      low      1   sunny        high     1



In [None]:
# verification_mode="no_checks" skips verification of the split sizes. It was added
# (with permission from two TAs) because loading raised a mismatch error without it.
tweet_data = load_dataset('tweet_eval', 'emotion', cache_dir="datasets", verification_mode="no_checks")

# tweet_data.head()

##### Before proceeding with further tasks, ensure that you have determined which type of Naive Bayes is most suitable for each dataset.

## 2. Data Preprocessing

### 2.1 Preprocessing the Golf Dataset

In this task, we will apply one-hot encoding to the categorical columns of the Golf dataset and split the data into training and test sets. We can use `sklearn`'s `train_test_split`, which has been imported for us above. Ensure that the `test_size` parameter is set to 0.3.
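To make the effect of one-hot encoding concrete, here is a minimal sketch on a made-up frame (`toy` is hypothetical and much smaller than the Golf data). It assumes the same `pd.get_dummies(..., drop_first=True)` call that is used for the real dataset.

```python
import pandas as pd

# Hypothetical toy frame with one binary column and one three-level categorical column
toy = pd.DataFrame({
    "Windy": [1, 0, 1],
    "Outlook": ["sunny", "rainy", "overcast"],
})

# drop_first=True drops the first (alphabetically sorted) level of each
# categorical column, so k levels become k - 1 indicator columns
encoded = pd.get_dummies(toy, drop_first=True).astype(int)
print(encoded)
```

Numeric columns such as `Windy` pass through unchanged, while `Outlook` is replaced by the indicator columns `Outlook_rainy` and `Outlook_sunny`; a row with both set to 0 corresponds to the dropped level `overcast`.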

In [None]:

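# Features are every column except the last; the target is the `Play` column
X = golf_data.iloc[:, :-1]
y = golf_data['Play']

# One-hot encode the categorical columns, dropping the first level of each,
# and cast the resulting indicator columns to integers
X_encoded = pd.get_dummies(X, drop_first=True)
X_encoded = X_encoded.astype(int)
print(X_encoded)

print(y)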

X_train, X_test, y_train, y_test = train_test_split(X_encoded, y, test_size=0.3, random_state=42)

X_train = X_train.values
X_test = X_test.values
y_train = y_train.values
y_test = y_test.values



      Holiday  Windy  Month_Winter  Season_Winter  Temperature_low  \
0           1      1             1              1                1   
1           1      1             1              1                1   
2           1      1             1              1                1   
3           1      1             1              1                1   
4           1      1             1              1                1   
...       ...    ...           ...            ...              ...   
7660        0      0             1              1                1   
7661        0      0             1              1                1   
7662        0      0             1              1                1   
7663        0      0             1              1                1   
7664        0      0             1              1                1   

      Humidity_low  Outlook_sunny  Crowdedness_not high  
0                1              1                     0  
1                1              1          

In [17]:
print("X_train shape:", X_train.shape)
print("y_train shape:", y_train.shape)
print("X_test shape:", X_test.shape)
print("y_test shape:", y_test.shape)


X_train shape: (5365, 8)
y_train shape: (5365,)
X_test shape: (2300, 8)
y_test shape: (2300,)


### 2.2 Preprocessing the Tweet Eval Dataset

At this stage, we need to pre-process our data to ensure it's in a clean format for further analysis. The following steps should be performed:

- Remove any URLs.
- Remove punctuation and non-alphanumeric characters.
- Convert all text to lowercase.
- Remove any extra whitespace.
- Eliminate common stopwords.

In the cell below, we'll implement a function that carries out these tasks. We can utilize the `re` library for cleaning text and the `nltk` library for removing stopwords.

Once the function is complete, we'll apply it to the `text` column of our dataset to obtain the preprocessed text.
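The steps listed above can be traced on a single made-up tweet. This sketch uses the built-in `re` module and a tiny hypothetical stopword set (`TOY_STOPWORDS`) in place of NLTK's English list, so that it runs without any downloads; `clean_tweet` is an illustrative name, not part of the assignment.

```python
import re

# Hypothetical stand-in for NLTK's English stopword list
TOY_STOPWORDS = {"i", "this", "so", "the", "a"}

def clean_tweet(text, stop_words=TOY_STOPWORDS):
    text = re.sub(r"http\S+|www\S+", "", text)    # 1. remove URLs
    text = re.sub(r"[^a-zA-Z0-9\s]", "", text)    # 2. drop punctuation / non-alphanumerics
    text = text.lower()                           # 3. lowercase
    text = re.sub(r"\s+", " ", text).strip()      # 4. collapse extra whitespace
    # 5. drop stopwords
    return " ".join(word for word in text.split() if word not in stop_words)

cleaned = clean_tweet("Check THIS out!!! https://t.co/abc   I can't believe it's so good :)")
print(cleaned)
```

Note that with this ordering, apostrophes are stripped before the stopword filter runs, so a contraction such as "can't" becomes "cant" and will only be filtered if that exact form is in the stopword set.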


In [None]:
tweet_data


DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 3257
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 1421
    })
    validation: Dataset({
        features: ['text', 'label'],
        num_rows: 374
    })
})

In [21]:
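# nltk.download('stopwords')  # uncomment on the first run to fetch the stopword list
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))


def preprocess(text):
    # Remove URLs (`http\S+` also covers https links)
    text = re.sub(r'http\S+|www\S+', '', text)
    # Remove punctuation and other non-alphanumeric characters
    text = re.sub(r'[^a-zA-Z0-9\s]', '', text)
    # Convert to lowercase
    text = text.lower()
    # Collapse repeated whitespace and trim the ends
    text = re.sub(r'\s+', ' ', text).strip()
    # Drop stopwords
    tokens = text.split()
    filtered_text = [word for word in tokens if word not in stop_words]
    return ' '.join(filtered_text)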


train_texts = [preprocess(tweet) for tweet in tweet_data["train"]["text"]]  # Apply preprocess to each tweet
validation_texts = [preprocess(tweet) for tweet in tweet_data["validation"]["text"]]
test_texts = [preprocess(tweet) for tweet in tweet_data["test"]["text"]]
train_labels = tweet_data['train']['label'] 
validation_labels = tweet_data['validation']['label']
test_labels = tweet_data['test']['label']

# Creating DataFrames
train_df = pd.DataFrame({'processed_text': train_texts, 'label': train_labels})
validation_df = pd.DataFrame({'processed_text': validation_texts, 'label': validation_labels})
test_df = pd.DataFrame({'processed_text': test_texts, 'label': test_labels})

print("Training DataFrame:")
print(train_df.head())
print("\nValidation DataFrame:")
print(validation_df.tail())
print("\nTest DataFrame:")
print(test_df.tail())



Training DataFrame:
                                      processed_text  label
0  worry payment problem may never joyce meyer mo...      2
1  roommate okay cant spell autocorrect terrible ...      0
2  thats cute atsu probably shy photos cherry hel...      1
3  rooneys fucking untouchable isnt fucking dread...      0
4  pretty depressing u hit pan ur favourite highl...      3

Validation DataFrame:
                                        processed_text  label
369  user user trump whitehouse arent held accounta...      0
370      user chutiya producer invested crap deshdrohi      0
371  russia story infuriate trump today media other...      0
372                             shit getting irritated      0
373       user user didnt make angry id laughing tweet      0

Test DataFrame:
                                         processed_text  label
1416  need sparkling bodysuit occasion case emergenc...      1
1417  user ive finished reading simply mindblogging ...      3
1418  shaft abrasio

In [34]:
test_df.columns

Index(['processed_text', 'label'], dtype='object')

## 3. Implementing Naive Bayes from Scratch

## 3.1 Bernoulli Naive Bayes

### From Scratch

Recall that the Bernoulli Naive Bayes model is based on **Bayes' Theorem**:

$$
P(y \mid x) = \frac{P(x \mid y)P(y)}{P(x)}
$$

What we really want is to find the class $c$ that maximizes $P(c \mid x)$, so we can use the following equation:

$$
\hat{c} = \underset{c}{\text{argmax}} \ P(c \mid x) = \underset{c}{\text{argmax}} \ P(x \mid c)P(c)
$$

In the case of **Bernoulli Naive Bayes**, we assume that each word $x_i$ in a sentence follows a **Bernoulli distribution**, meaning that the word either appears (1) or does not appear (0) in the document. We can simplify the formula using this assumption:

$$
\hat{c} = \underset{c}{\text{argmax}} \ P(c) \prod_{i=1}^{n} P(x_i = 1 \mid c)^{x_i} P(x_i = 0 \mid c)^{1 - x_i}
$$

Where:

- $x_i = 1$ if the $i^{\text{th}}$ word is present in the document.
- $x_i = 0$ if the $i^{\text{th}}$ word is not present in the document.


We can estimate $P(c)$ by counting the number of times each class appears in our training data, and dividing by the total number of training examples. We can estimate $P(x_i = 1 \mid c)$ by counting the number of documents in class $c$ that contain the word $x_i$, and dividing by the total number of documents in class $c$.

### **Important: Laplace Smoothing**

When calculating $P(x_i = 1 \mid c)$ and $P(x_i = 0 \mid c)$, we apply **Laplace smoothing** to avoid zero probabilities. This is essential because, without it, any word that has not appeared in a document of class $c$ will have a probability of zero, which would make the overall product zero, leading to incorrect classification.

**Reason**: Laplace smoothing ensures that we don't encounter zero probabilities by adding a small constant (typically 1) to the numerator and the number of possible feature values (2 for a binary feature) to the denominator, so that $P(x_i = 1 \mid c)$ and $P(x_i = 0 \mid c)$ still sum to 1. This is particularly useful when a word has never appeared in the training data for a specific class.

The smoothed probability formula is:

$$
P(x_i = 1 \mid c) = \frac{(\text{count of documents in class } c \text{ where } x_i = 1) + 1}{(\text{total documents in class } c) + 2}
$$

This ensures no word has a zero probability, even if it was unseen in the training data.
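As a small worked example of the smoothed estimate, the sketch below computes the raw and Laplace-smoothed probabilities for one class. The matrix `X_c` is hypothetical data chosen so that one feature is always present and one is never present.

```python
import numpy as np

# Hypothetical binary feature matrix: 4 documents of one class, 3 features
X_c = np.array([
    [1, 0, 0],
    [1, 1, 0],
    [1, 0, 0],
    [1, 1, 0],
])
n_c = X_c.shape[0]

# Unsmoothed estimates: feature 0 gets probability 1 and feature 2 gets probability 0
p1_raw = X_c.sum(axis=0) / n_c

# Laplace-smoothed estimates: (count + 1) / (n_c + 2)
p1_smooth = (X_c.sum(axis=0) + 1) / (n_c + 2)
p0_smooth = 1.0 - p1_smooth

print("raw:     ", p1_raw)
print("smoothed:", p1_smooth)
```

After smoothing, every probability lies strictly between 0 and 1, so taking logarithms later never produces `-inf`.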

### Avoiding Underflow with Logarithms:

To avoid underflow errors due to multiplying small probabilities, we apply logarithms, which convert the product into a sum:

$$
\hat{c} = \underset{c}{\text{argmax}} \ \log P(c) + \sum_{i=1}^{n} \left[ x_i \log P(x_i = 1 \mid c) + (1 - x_i) \log P(x_i = 0 \mid c) \right]
$$
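The sketch below illustrates both points: why the raw product is numerically unsafe, and how the log-space rule above can be written with matrix operations. The function name `bernoulli_log_scores` and all parameter values are assumptions made for this illustration.

```python
import numpy as np

# Multiplying many small probabilities underflows to exactly 0.0 in float64 ...
probs = np.full(500, 1e-3)
product = np.prod(probs)
# ... while the equivalent sum of logs stays finite
log_sum = np.sum(np.log(probs))

def bernoulli_log_scores(X, log_prior, p1):
    """Log-posterior scores (up to a constant) for binary X of shape (n_samples, n_features).

    log_prior has shape (n_classes,); p1 has shape (n_classes, n_features)
    and holds the smoothed estimates of P(x_i = 1 | c).
    """
    log_p1 = np.log(p1)
    log_p0 = np.log(1.0 - p1)
    # x_i * log P(x_i = 1 | c) + (1 - x_i) * log P(x_i = 0 | c), summed over features
    return log_prior + X @ log_p1.T + (1 - X) @ log_p0.T

# Hypothetical parameters for two classes and three features
log_prior = np.log(np.array([0.5, 0.5]))
p1 = np.array([[0.9, 0.2, 0.5],
               [0.1, 0.8, 0.5]])
X = np.array([[1, 0, 1],
              [0, 1, 0]])
predicted = np.argmax(bernoulli_log_scores(X, log_prior, p1), axis=1)
print(product, log_sum, predicted)
```

Because the logarithm is monotonic, taking the argmax of the log scores selects the same class as taking the argmax of the original product.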

We will now implement this algorithm.

<span style="color: red;"> For this part, the only external library we will need is `numpy`.</span>


Now we'll use our implementation to train a Bernoulli Naive Bayes model on the Golf training data and generate predictions for the test set (the Golf data has only a train/test split).

We'll also report the Accuracy, Precision, Recall, and F1 score of our model on the test data, and display the Confusion Matrix. We can use `sklearn.metrics` for this.

In [23]:
def fit_naive_bayes(X, y):
    classes = np.unique(y)
    n_features = X.shape[1]
    class_priors = {}
    feature_probs = {}
    
    for c in classes:
        X_c = X[y == c]
        n_c = len(X_c)
        class_priors[c] = n_c / len(y)  # No Laplace smoothing for class priors, probs calculated on entire sample
        feature_probs[c] = np.zeros((n_features, 2))
        
        for i in range(n_features):
            n_c1 = np.sum(X_c[:, i] == 1)
            n_c0 = n_c - n_c1
            # Laplace smoothing for feature probabilities
            feature_probs[c][i, 1] = (n_c1 + 1) / (n_c + 2)  
            feature_probs[c][i, 0] = (n_c0 + 1) / (n_c + 2) 
    
    return classes, class_priors, feature_probs

def predict_naive_bayes(X, classes, class_priors, feature_probs):
    predictions = []
    for x in X:
        scores = {}
        for c in classes:
            # Start from the log prior, then add each feature's log-likelihood
            score = np.log(class_priors[c])
            for i, xi in enumerate(x):
                # xi is 0 or 1, so exactly one of the two terms is active
                score += xi * np.log(feature_probs[c][i, 1]) + (1 - xi) * np.log(feature_probs[c][i, 0])
            scores[c] = score

        # Predict the class with the highest log-posterior score
        predictions.append(max(scores, key=scores.get))

    return np.array(predictions)

# training
classes, class_priors, feature_probs = fit_naive_bayes(X_train, y_train)

y_pred = predict_naive_bayes(X_test, classes, class_priors, feature_probs)

print("Number of predictions:", len(y_pred))
print("Unique predicted classes:", np.unique(y_pred))
print("Unique actual classes:", np.unique(y_test))

# Metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, average='weighted')
recall = recall_score(y_test, y_pred, average='weighted')
f1 = f1_score(y_test, y_pred, average='weighted')
conf_matrix = confusion_matrix(y_test, y_pred)

# Results
print(f"Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1 Score: {f1:.4f}")
print("\nConfusion Matrix:")
print(conf_matrix)


Number of predictions: 2300
Unique predicted classes: [0 1]
Unique actual classes: [0 1]
Accuracy: 0.8200
Precision: 0.7685
Recall: 0.8200
F1 Score: 0.7494

Confusion Matrix:
[[1872   13]
 [ 401   14]]
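The weighted averages above are dominated by the majority class, so it can help to read per-class figures directly off the confusion matrix. The sketch below uses the matrix reported above and assumes `sklearn`'s convention that rows are actual classes and columns are predicted classes.

```python
import numpy as np

# Confusion matrix reported above (rows: actual class, columns: predicted class)
cm = np.array([[1872, 13],
               [401, 14]])

true_positives = np.diag(cm)
recall_per_class = true_positives / cm.sum(axis=1)     # share of each actual class that is recovered
precision_per_class = true_positives / cm.sum(axis=0)  # share of each predicted class that is correct
accuracy = true_positives.sum() / cm.sum()

print("Recall per class:   ", np.round(recall_per_class, 4))
print("Precision per class:", np.round(precision_per_class, 4))
print("Accuracy:           ", round(float(accuracy), 4))
```

Only 14 of the 415 actual `Play = 1` examples are predicted as class 1, so the 0.82 accuracy mostly reflects the model predicting the majority class.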


## 3.2 Multinomial Naive Bayes (Manual Implementation)

### Vectorizing sentences with Bag of Words

Now that we have loaded in our data, we will need to vectorize our sentences - this is necessary to be able to numericalize our inputs before feeding them into our model. 

We will be using a Bag of Words approach to vectorize our sentences. This is a simple approach that counts the number of times each word appears in a sentence. 

The element at index $i$ of the vector will be the number of times the $i^{\text{th}}$ word in our vocabulary appears in the sentence. So, for example, if our vocabulary is `["the", "cat", "sat", "on", "mat"]`, and our sentence is `"the cat sat on the mat"`, then our vector will be `[2, 1, 1, 1, 1]`.

We will now create a `BagOfWords` class to vectorize our sentences. This will involve creating:

1. A vocabulary from our corpus

2. A mapping from words to indices in our vocabulary

3. A function to vectorize a sentence in the fashion described above

It will help us to define something along the lines of a `fit` and a `vectorize` method.

In [25]:
class BagOfWords:
    def __init__(self):
        self.vocabulary = []
        self.word_to_index = {}

    def fit(self, corpus):
        # Collect every distinct word, in order of first appearance
        word_exists = {}
        for sentence in corpus:
            for word in sentence.split():
                word_exists[word] = word_exists.get(word, 0) + 1

        self.vocabulary = list(word_exists.keys())
        # Map each word to its column index in the vector
        self.word_to_index = {word: i for i, word in enumerate(self.vocabulary)}
        print("Word to index mapping:", self.word_to_index)

    def vectorize(self, sentence):
        vector = np.zeros(len(self.vocabulary), dtype=int) 
        words = sentence.split()

        for word in words:
            if word in self.word_to_index:
                idx = self.word_to_index[word]
                vector[idx] += 1  

        return vector

    def vectorize_corpus(self, corpus):
        return np.array([self.vectorize(sentence) for sentence in corpus])


In [41]:
# class NaiveBayes:
#     def __init__(self):
#         self.class_log_prior = {}
#         self.log_likelihood = {}
#         self.vocab_size = 0
#         self.classes = None

#     def fit(self, X, y):
#         self.vocab_size = len(X[0])
#         self.classes = np.unique(y)
        
#         N_doc = len(y)
        
#         for c in self.classes:
#             N_c = np.sum(y == c)
#             self.class_log_prior[c] = np.log(N_c / N_doc)
            
#             X_c = np.array([x for x, label in zip(X, y) if label == c])
#             word_count = np.sum(X_c, axis=0) + 1  # Add-one smoothing
#             total_words = np.sum(word_count)
            
#             self.log_likelihood[c] = np.log(word_count / total_words)

#     def predict(self, X):
#         predictions = []
#         for x in X:
#             class_scores = {}
#             for c in self.classes:
#                 score = self.class_log_prior[c] + np.sum(x * self.log_likelihood[c])
#                 class_scores[c] = score
#             predictions.append(max(class_scores, key=class_scores.get))
#         return np.array(predictions)

#     def score(self, X, y):
#         predictions = self.predict(X)
#         return np.mean(predictions == y)


In [27]:
bow_check = BagOfWords()

# Fit on a one-sentence corpus so the vocabulary matches the example above
example_corpus = ["the cat sat on the mat"]
bow_check.fit(example_corpus)

test_sentence = "the cat sat on the mat"
vector = bow_check.vectorize(test_sentence)
print("Vocabulary:", bow_check.vocabulary)
print("Test sentence:", test_sentence)
print("Vectorized sentence:", vector)

expected_vector = np.array([2, 1, 1, 1, 1])
print("Expected: ", expected_vector)
print("Vectorization works: ",vector == expected_vector)


Word to index mapping: {'the': 0, 'cat': 1, 'sat': 2, 'on': 3, 'mat': 4}
Vocabulary: ['the', 'cat', 'sat', 'on', 'mat']
Test sentence: the cat sat on the mat
Vectorized sentence: [2 1 1 1 1]
Expected:  [2 1 1 1 1]
Vectorization works:  [ True  True  True  True  True]


As a sanity check, we fit a `BagOfWords` object on the example sentence above and confirm that the vectorization of the sentence matches the expected vector.

Once we have implemented the `BagOfWords` class, we need to fit it to the training data, and vectorize the training, validation, and test data.

In [29]:
bow = BagOfWords()

bow.fit(train_df['processed_text'])

X_train_mnb = bow.vectorize_corpus(train_df['processed_text'])

X_validation_mnb = bow.vectorize_corpus(validation_df['processed_text'])
X_test_mnb = bow.vectorize_corpus(test_df['processed_text'])




In [None]:
# print(f"Vocabulary size: {len(bow.vocabulary)}")
# print(f"Shape of training vectors: {X_train_mnb.shape}")
# print(f"Shape of validation vectors: {X_validation_mnb.shape}")
# print(f"Shape of test vectors: {X_test_mnb.shape}")



### From Scratch

Now that we have vectorized our sentences, we can implement our Naive Bayes model. Recall that the Naive Bayes model is based on Bayes' Theorem:

$$
P(y \mid x) = \frac{P(x \mid y)P(y)}{P(x)}
$$

What we really want is to find the class $c$ that maximizes $P(c \mid x)$, so we can use the following equation:

$$
\hat{c} = \underset{c}{\text{argmax}} \ P(c \mid x) = \underset{c}{\text{argmax}} \ P(x \mid c)P(c)
$$

We can then use the Naive Bayes assumption to simplify this:

$$
\hat{c} = \underset{c}{\text{argmax}} \ P(c \mid x) = \underset{c}{\text{argmax}} \ P(c) \prod_{i=1}^{n} P(x_i \mid c)
$$

Where $x_i$ is the $i^{\text{th}}$ word in our sentence.

All of these probabilities can be estimated from our training data. We can estimate $P(c)$ by counting the number of times each class appears in our training data, and dividing by the total number of training examples. We can estimate $P(x_i \mid c)$ by counting the number of times the $i^{\text{th}}$ word in our vocabulary appears in sentences of class $c$, and dividing by the total number of words in sentences of class $c$.
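The estimate described above assigns probability zero to any word that never occurs in a class, which is why the implementation later in this section adds one to each count. The sketch below compares the two estimates on a hypothetical count matrix `X_c`; the values are made up for illustration.

```python
import numpy as np

# Hypothetical count matrix for one class: 3 documents, vocabulary of 4 words
X_c = np.array([
    [2, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 2, 0, 0],
])

word_counts = X_c.sum(axis=0)    # occurrences of each word in the class
total_words = word_counts.sum()  # all word occurrences in the class
vocab_size = X_c.shape[1]

# Raw estimate: the last word never occurs in this class, so it gets probability 0
p_raw = word_counts / total_words
# Add-one (Laplace) smoothing: (count + 1) / (total + |V|)
p_smooth = (word_counts + 1) / (total_words + vocab_size)

print("raw:     ", p_raw)
print("smoothed:", p_smooth)
```

Adding the vocabulary size to the denominator keeps the smoothed probabilities summing to 1 across the vocabulary.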

It would help to apply logarithms to the above equation so that we translate the product into a sum, and avoid underflow errors. This will give us the following equation:

$$
\hat{c} = \underset{c}{\text{argmax}} \ \log P(c) + \sum_{i=1}^{n} \log P(x_i \mid c)
$$
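When the input is a bag-of-words count vector, the sum over word positions becomes a sum over the vocabulary in which each word's log-likelihood is multiplied by its count. The following is a minimal vectorized sketch of that rule, assuming add-one smoothing; the function names and toy data are hypothetical and this is not the implementation used later in the notebook.

```python
import numpy as np

def fit_multinomial_nb(X, y):
    """Estimate log priors and add-one smoothed log-likelihoods from a count matrix X."""
    y = np.asarray(y)
    classes = np.unique(y)
    log_prior = np.log(np.array([(y == c).mean() for c in classes]))
    # Per-class word counts plus one; row sums then equal total_words + vocab_size
    counts = np.array([X[y == c].sum(axis=0) for c in classes]) + 1
    log_likelihood = np.log(counts / counts.sum(axis=1, keepdims=True))
    return classes, log_prior, log_likelihood

def predict_multinomial_nb(X, classes, log_prior, log_likelihood):
    # Each word contributes count * log P(word | c), i.e. once per occurrence
    scores = log_prior + X @ log_likelihood.T
    return classes[np.argmax(scores, axis=1)]

# Hypothetical toy data: 4 documents, vocabulary of 3 words, 2 classes
X_toy = np.array([[2, 0, 1],
                  [3, 1, 0],
                  [0, 2, 1],
                  [0, 3, 0]])
y_toy = [0, 0, 1, 1]

classes, log_prior, log_likelihood = fit_multinomial_nb(X_toy, y_toy)
pred = predict_multinomial_nb(np.array([[1, 0, 0], [0, 1, 0]]), classes, log_prior, log_likelihood)
print(pred)
```

Expressing the score as a single matrix product avoids Python-level loops over documents and words, which matters once the vocabulary contains thousands of entries.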

We will now implement this algorithm. It would help to go through [this chapter from SLP3](https://web.stanford.edu/~jurafsky/slp3/4.pdf) to get a better understanding of the model - **it is recommended to base our implementation off the pseudocode that has been provided on Page 6**. We can either make a `NaiveBayes` class, or just implement the algorithm across two functions.

<span style="color: red;"> For this part, the only external library we will need is `numpy`. We are not allowed to use anything else.</span>

Now we use our implementation to train a Naive Bayes model on the training data and generate predictions for the validation and test sets.

We report the Accuracy, Precision, Recall, and F1 score of our model on both sets, and also display the Confusion Matrix. We are allowed to use `sklearn.metrics` for this.

In [31]:
print("Columns in train_df:", train_df.columns)
print("Columns in validation_df:", validation_df.columns)
print("Columns in test_df:", test_df.columns)


Columns in train_df: Index(['processed_text', 'label'], dtype='object')
Columns in validation_df: Index(['processed_text', 'label'], dtype='object')
Columns in test_df: Index(['processed_text', 'label'], dtype='object')


In [33]:
class MNaiveBayes:
    def fit(self, X, y):
        y = np.asarray(y)  # allow plain Python lists of labels
        self.classes = np.unique(y)
        self.vocab_size = X.shape[1]
        self.class_word_counts = {}
        self.class_counts = {}
        self.total_word_counts = {}

        for cls in self.classes:
            class_indices = np.where(y == cls)[0]
            class_word_counts = {}
            total_words = 0

            # Accumulate per-word and total counts over the documents of this class
            for idx in class_indices:
                for word_idx in range(X.shape[1]):
                    count = X[idx, word_idx]
                    if count > 0:  # Only consider words with non-zero counts
                        class_word_counts[word_idx] = class_word_counts.get(word_idx, 0) + count
                        total_words += count

            # Store counts for each class
            self.class_word_counts[cls] = class_word_counts
            self.class_counts[cls] = len(class_indices)
            self.total_word_counts[cls] = total_words

    def predict(self, X):
        predictions = []
        n_docs = sum(self.class_counts.values())
        for x in X:
            class_probabilities = {}

            for cls in self.classes:
                # Log prior of the class
                class_prob = np.log(self.class_counts[cls] / n_docs)
                total_words = self.total_word_counts[cls]
                class_word_counts = self.class_word_counts[cls]

                # Add the add-one smoothed log-likelihood of every word present in x.
                # Each distinct word contributes once here, regardless of its count in x.
                for word_idx, count in enumerate(x):
                    if count > 0:
                        word_freq = class_word_counts.get(word_idx, 0) + 1
                        class_prob += np.log(word_freq / (total_words + self.vocab_size))

                class_probabilities[cls] = class_prob

            # Predict the class with the highest log score
            predictions.append(max(class_probabilities, key=class_probabilities.get))

        return predictions




In [35]:
# VALIDATION DATASET
mnb = MNaiveBayes()
mnb.fit(X_train_mnb, train_labels)

val_predictions = mnb.predict(X_validation_mnb)


accuracy_mnb_val = accuracy_score(validation_labels, val_predictions)
precision_mnb_val = precision_score(validation_labels, val_predictions, average='weighted')
recall_mnb_val= recall_score(validation_labels, val_predictions, average='weighted') 
f1_mnb_val = f1_score(validation_labels, val_predictions, average='weighted')
conf_matrix_mnb_val = confusion_matrix(validation_labels, val_predictions)


In [37]:
# MODEL 2 ON TEST DATASET

mnbtest = MNaiveBayes()
mnbtest.fit(X_train_mnb, train_labels) 

# test dataset
test_predictions = mnbtest.predict(X_test_mnb)


accuracy_mnb_test = accuracy_score(test_labels, test_predictions)
precision_mnb_test = precision_score(test_labels, test_predictions, average='weighted')
recall_mnb_test= recall_score(test_labels, test_predictions, average='weighted') 
f1_mnb_test = f1_score(test_labels, test_predictions, average='weighted')
conf_matrix_mnb_test = confusion_matrix(test_labels, test_predictions)

In [None]:
# nb_model = NaiveBayes()
# print("object created successfully")
# nb_model.fit(train_vectors, train_df['processed_text'].values)
# print("model has been fit on the training data")
# y_pred = nb_model.predict(validation_vectors)
# print("working up to this point")

# # Calculate metrics
# from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix
# print("metrics are being calculated")
# accuracy = accuracy_score(validation_df['processed_text'].values, y_pred)
# precision = precision_score(validation_df['processed_text'].values, y_pred, average='weighted')
# recall = recall_score(validation_df['processed_text'].values, y_pred, average='weighted')
# f1 = f1_score(validation_df['processed_text'].values, y_pred, average='weighted')
# conf_matrix = confusion_matrix(validation_df['processed_text'].values, y_pred)

In [39]:
print(f"Val. Accuracy: {accuracy_mnb_val:.4f}")
print(f"Val Precision: {precision_mnb_val:.4f}")
print(f"Val Recall: {recall_mnb_val:.4f}")
print(f"Val F1 Score: {f1_mnb_val:.4f}")
print("\n Val Confusion Matrix:")
print(conf_matrix_mnb_val)

Val. Accuracy: 0.6578
Val Precision: 0.6762
Val Recall: 0.6578
Val F1 Score: 0.6340

 Val Confusion Matrix:
[[142   7   0  11]
 [ 36  46   1  14]
 [ 14   2   4   8]
 [ 28   7   0  54]]


In [41]:
print("\n THIS IS FOR THE TEST DATASET \n")

print(f" Test Accuracy: {accuracy_mnb_test:.4f}")
print(f" Test Precision: {precision_mnb_test:.4f}")
print(f" Test Recall: {recall_mnb_test:.4f}")
print(f" Test F1 Score: {f1_mnb_test:.4f}")
print("\n Confusion Matrix:")
print(conf_matrix_mnb_test)


 THIS IS FOR THE TEST DATASET 

 Test Accuracy: 0.6467
 Test Precision: 0.6653
 Test Recall: 0.6467
 Test F1 Score: 0.6209

 Confusion Matrix:
[[495  20   2  41]
 [123 166   3  66]
 [ 73  11  15  24]
 [117  19   3 243]]


## 4. Implementing Naive Bayes using sklearn

In this section, we will compare our manual implementations with `sklearn`'s implementations of both of the Naive Bayes models we have covered above.

In [45]:
from sklearn.naive_bayes import MultinomialNB, BernoulliNB
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score, confusion_matrix,
                             precision_recall_fscore_support)


print("\nSklearn Bernoulli Naive Bayes Results:")
bnb = BernoulliNB()
bnb.fit(X_train, y_train)

y_pred_nb = bnb.predict(X_test)

accuracy_nb = accuracy_score(y_test, y_pred_nb)
precision_nb = precision_score(y_test, y_pred_nb, average='weighted', zero_division=0)
recall_nb = recall_score(y_test, y_pred_nb, average='weighted', zero_division=0)
f1_nb = f1_score(y_test, y_pred_nb, average='weighted', zero_division=0)

# sklearn results on the Golf test data
print(f"{'-'*20}\nAccuracy : {accuracy_nb:.3f}\nPrecision: {precision_nb:.3f}\nRecall   : {recall_nb:.3f}\nF1 Score : {f1_nb:.3f}")

# Confusion matrix on the Golf test data
conf_matrix_nb = confusion_matrix(y_test, y_pred_nb)
print("\nConfusion Matrix:")
print(conf_matrix_nb)

# Multinomial Naive Bayes (Model 1) on the tweet validation data
msklearn = MultinomialNB()
msklearn.fit(X_train_mnb, train_labels)

msklearn_predictions = msklearn.predict(X_validation_mnb)

accuracy_mnb = accuracy_score(validation_labels, msklearn_predictions)
precision_mnb, recall_mnb, f1_mnb, _ = precision_recall_fscore_support(validation_labels, msklearn_predictions, average='weighted')
print("\nSklearn Multinomial Naive Bayes Results on Validation :\n\n")
print(f"Accuracy: {accuracy_mnb:.3f}")
print(f"Precision: {precision_mnb:.3f}")
print(f"Recall: {recall_mnb:.3f}")
print(f"F1 Score: {f1_mnb:.3f}")
# Separate name so the manual implementation's validation matrix is not overwritten
conf_matrix_sk_val = confusion_matrix(validation_labels, msklearn_predictions)
print("Confusion Matrix for Validation: ", conf_matrix_sk_val)




Sklearn Bernoulli Naive Bayes Results:
--------------------
Accuracy : 0.820
Precision: 0.769
Recall   : 0.820
F1 Score : 0.749

Confusion Matrix:
[[1872   13]
 [ 401   14]]

Sklearn Multinomial Naive Bayes Results on Validation :


Accuracy: 0.650
Precision: 0.685
Recall: 0.650
F1 Score: 0.625
Confusion Matrix for Validation:  [[141   7   0  12]
 [ 38  44   0  15]
 [ 15   2   4   7]
 [ 29   6   0  54]]


In [47]:
# Model 2 on the tweet test data

msklearn2 = MultinomialNB()
msklearn2.fit(X_train_mnb, train_labels)

# Generate predictions
msklearn2_predictions = msklearn2.predict(X_test_mnb)

accuracy_mnb2 = accuracy_score(test_labels, msklearn2_predictions)
precision_mnb2, recall_mnb2, f1_mnb2, _ = precision_recall_fscore_support(test_labels, msklearn2_predictions, average='weighted')
print("\nSklearn Multinomial Naive Bayes Results on Test:\n")
print(f"Accuracy: {accuracy_mnb2:.3f}")
print(f"Precision: {precision_mnb2:.3f}")
print(f"Recall: {recall_mnb2:.3f}")
print(f"F1 Score: {f1_mnb2:.3f}")

conf_matrix_mnb2 = confusion_matrix(test_labels, msklearn2_predictions)
print("Confusion Matrix: " ,conf_matrix_mnb2)



Sklearn Multinomial Naive Bayes Results on Test:

Accuracy: 0.652
Precision: 0.672
Recall: 0.652
F1 Score: 0.626
Confusion Matrix:  [[501  19   2  36]
 [120 173   2  63]
 [ 74  12  14  23]
 [120  20   3 239]]
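To compare the manual and `sklearn` Multinomial models class by class, the per-class recall can be computed from the two test-set confusion matrices reported above. The helper `per_class_recall` is a hypothetical name introduced for this sketch, which assumes rows are actual classes and columns are predicted classes.

```python
import numpy as np

# Test-set confusion matrices reported above (rows: actual, columns: predicted)
cm_manual = np.array([[495, 20, 2, 41],
                      [123, 166, 3, 66],
                      [73, 11, 15, 24],
                      [117, 19, 3, 243]])
cm_sklearn = np.array([[501, 19, 2, 36],
                       [120, 173, 2, 63],
                       [74, 12, 14, 23],
                       [120, 20, 3, 239]])

def per_class_recall(cm):
    """Correct predictions for each class divided by the number of actual examples."""
    return np.diag(cm) / cm.sum(axis=1)

recall_manual = per_class_recall(cm_manual)
recall_sklearn = per_class_recall(cm_sklearn)

for label, (r_manual, r_sklearn) in enumerate(zip(recall_manual, recall_sklearn)):
    print(f"class {label}: manual recall = {r_manual:.3f}, sklearn recall = {r_sklearn:.3f}")
```

Both implementations show the same pattern: recall is highest for class 0 and lowest for class 2, which has the fewest test examples (123).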


## 5. Conclusion

1. Explain the key factors we considered when determining which dataset is more suitable for **Multinomial Naive Bayes** and which is better suited for **Bernoulli Naive Bayes**.

The deciding factor was the kind of values each feature takes. BNB is designed for binary features, which is the case for the Golf data: after one-hot encoding, every feature is a 0/1 indicator. In the Twitter data, each feature is a word count that can take many values, which led to the use of MNB.
BNB works best with binary/boolean data because it treats each feature as a binary indicator (0 or 1), regardless of its frequency. MNB is particularly effective with frequency-based data, as it expects the input features to be counts of occurrences (for example, the number of times a word appears in a document). Thus, MNB was the most suitable choice for the Twitter data, for which we took a bag-of-words approach.

Feature distribution and data sparsity were considered as well. BNB assumes that each feature follows a Bernoulli distribution, i.e. it is either present or absent, and it explicitly models absence through the $P(x_i = 0 \mid c)$ term, so zero-valued features, such as those in the Golf data, still carry information. MNB, in contrast, only accumulates evidence from the features that actually occur, weighted by how often they occur.
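The difference between the two feature representations can be illustrated on a small count matrix. The data below is hypothetical; the sketch only shows how count features (suited to MNB) relate to presence/absence features (suited to BNB).

```python
import numpy as np

# Hypothetical bag-of-words counts: 2 documents, vocabulary of 4 words
X_counts = np.array([[3, 0, 1, 0],
                     [0, 2, 0, 0]])

# Presence/absence view of the same documents: any positive count becomes 1
X_binary = (X_counts > 0).astype(int)

print("counts:\n", X_counts)
print("binary:\n", X_binary)
```

Binarizing discards how often a word occurs, which is exactly the information MNB relies on.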