# CS535/EE514 Machine Learning - Spring 2024 - Assignment 3

# Naive Bayes and Sentiment Classification

**Date Assigned**: Friday, February 23, 2024

**Date Due**: Monday, March 8, 2024 (11:55 pm)

**Important Notes**

1.   The assignment integrates tasks spanning methods as well as principles. Some tasks will involve implementation (in Python) and some may require mathematical analysis.  
2.   All cells must be run once before submission and should be displaying the results(graphs/plots etc). Failure to do so will result in deduction of points.
3.   While discussions with your peers is strongly encouraged, please keep in mind that the assignment is to be attempted on an individual level. Any plagiarism (from your peers) will be referred to the DC without hestitation.
4. For tasks requiring mathematical analysis, students familiar with latex may type their solutions directly in the appropriate cells of this notebook. Alternatively, they may submit a hand-written solution as well.
5. Use procedural programming style and comment your code properly.
5. Upload your solutions as a zip folder with name `RollNumber_A3` on the Assignment tab and submit your hand-written solutions in the drop-box next to the instructor's office.
5. **Policy on Usage of Generative AI Tools**. Students are most welcome to use generative AI tools as partners in their learning journey. However, it should be kept in mind that these tools cannot be blindly trusted for the tasks in this assignment (hopefully) and therefore it is important for students to rely on their own real intelligence (pun intended) before finalizing their solution/code. It is also mandatory for students to write a statement on how exactly have they used any AI tool in completing this assignment.
5. **Vivas** The teaching staff reserves the right to conduct a viva for any student.   
5. **Policy on Late Submission**. Late solutions will be accepted with a 10% penalty per day till Wednesday, March 6 2024 (11:55 pm) . No submissions will be accepted after that.      




The following packages are required for this assignment.

In [1]:
# Required Libraries
import pandas as pd
import numpy as np
import sklearn
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.metrics import confusion_matrix
from sklearn.naive_bayes import MultinomialNB
import re
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import classification_report

# Comment below if not on Goolge Colab
# from google.colab import drive
# drive.mount('/content/drive')

  from pandas.core import (


### Overview

Naive Bayes classifiers are a family of simple probabilistic classifiers based on applying Bayes' theorem with strong (naive) independence assumptions between the features. They are highly scalable and can quickly predict the class of a given data point. The effectiveness of Naive Bayes, however, significantly relies on the estimation of probability distributions of features, which is where Maximum Likelihood Estimation (MLE) estimation comes into play.

There are four tasks in this assignment. Tasks 1 and 2 will solely focus on mathematical derivations/analysis that will help students get a better understanding of MLE. Task 3 with focus on implementing NBC (from scratch) for sentiment classification using the bag of words approach and Task 4 will ask you to use the `sklearn` library to implement the same task.


## Task 1: Maximum Likelihood Estimation (20 points)

1. Consider i.i.d drawing of random variables $X_1, X_2, \ldots, X_N$ from a Gaussian distribution with unknown mean and variance. Given the observation $X_1 = x_1, \ldots, X_N = x_N$, derive the MLE for the unknown mean and variance. Recall that the MLE for the unknown parameters can be obtained as 
$$
\hat{\mu}, \hat{\sigma}^2 = \arg \max_{\mu, \sigma^2} f_{X_1, \ldots, X_N}\left(x_1, \ldots, x_N \Big| \mu, \sigma^2 \right)
$$


2. Now consider an i.i.d. drawing $X_1, X_2, \ldots, X_N$ from a Poisson distribution with unknown parameter  $\lambda$ (e.g., $X_i$ could represent the number of customers arriving at a service desk per hour over a day). Given the observation $X_1 = x_1, \ldots, X_N = x_N$, show that the MLE for the unknown parameter is given as
$$
\hat{\lambda} = \frac{1}{N} \sum_{i=1}^N x_i 
$$
Recall that the MLE for $\lambda$ will be obtained as
$$
\hat{\lambda} = \arg \max_{\lambda} P_{X_1, \ldots, X_N}\left(x_1, \ldots, x_N \Big| \lambda \right)
$$
Also recall that the Poisson distribution is given as $P_{X} \left(x\right) = \frac{\lambda^x e^{-\lambda}}{x!}$ for $x$ being a non-negative integer.



## Task 2: Poisson Naive Bayes (20 Points)

Recall that (most of) our class discussions focused on the case where our features were binary, i.e., $X_j \in \{0,1\}$. Thus, conditioned on each class label, we were required to only estimate one parameter $p_{j|c} = P(X_j = 1|Y=c)$.
Now consider the case where the feature $X_j$ is non-binary, e.g., the frequency with which the word index-$j$ of a dictionary appears in a document. This feature can take on an any non-negative integer and therefore a brute force method would require us to estimate infinitely many parameters,

$$
p_{j|c}^{(k)} \triangleq P\left(X_j = k \Big| Y = c\right), \:\:\: \text{for } k = 0, 1, 2,3, \ldots
$$
This brute-force method is cleary infeasible in practice. Consequently, we resort to some modeling assumptions.

One possible modeling assumption (another moedling assumption is the multinomial model discussed in Task 3) is to assume that the feature $X_j$ conditioned on $Y=c$ follows a Poisson distribution which is described  by a single parameter $\lambda_{j|c}$. Because of this modeling assumption, the training process is simplified requiring one to estimate only one parameter per feature per class.

Now consider that you are given the training data $\mathcal{D} = \{ \mathbf{x}_i, y_i\}_{i=1}^N$, where $\mathbf{x}_i = [x_{i,1}, \ldots, x_{i,M}]^T$ is a feature vector with $x_{i,j}$ being the observation of feature-$j$ of example-$i$ (a non-negative integer). Extending the Poisson parameter estimate derived in Task 1, the training process in Poisson Naive Bayes will involve estimating $\lambda_{j|c}$ for all $j$ and $c$ as follows:
$$ \hat{\lambda_{j|c}} = \frac{1}{N_c} \sum_{i=1}^N x_{i,j} \times \: \mathbb{I}\left(y_i = c\right)  $$

Estimating the class priors will be done similar to what was discussed in class.

1. Consider the Poisson Naive Bayes' binary classification problem where the class label is either 0 or 1. Given the test point $\mathbf{x} = [x_1, x_2, \ldots, x_M]^T$, show that the classifier is linear with the decision rule being Class-0 (and Class-1 otherwise) if
$$ \sum_{j=1}^M w_j x_j + b > 0$$
with
$$ b = \log \hat{\pi}_0 - \log \hat{\pi}_1 +  \sum_{j=1}^M \left( \hat{\lambda}_{j|1} - \hat{\lambda}_{j|0} \right) $$
$$w_j = \log \frac{\hat{\lambda}_{j|0}}{\hat{\lambda}_{j|1}} $$

2. Let $M=2$, $w_1 = 2$, $w_2 = -1$, and $b =3$. Sketch the region in a two-dimensional feature space (with $x_1$ on the x-axis and $x_2$ on the y-axis) for which your decision will always be Class-0.

3. Suppose that for a certain feature set with $M=2$, the training data yields $w_1 = 2$, $w_2 = 1$ and $b = 3$. Mr. X claims that training data to be rubbish? Do you agree or disagree with Mr. X? Justify your answer.

## Task 3: Naive Bayes and Sentiment Classification (50 Points)

For text classification, a commonly utilized method is the bag of words approach. The Bernoulli feature model only checks if a certain word in the dictionary is present in the document or not. However, ignoring the number of times that word was used in the document may result in loss of information. To account for this, an alternate feature model that is often used is where $X_j$, $j=1, \ldots, M$ indicates the frequency with which the word index-$j$ of the dictionary appears in the document.

The probabilistic model followed is a multinomial model, where every time a word is generated for class-$c$, it would be one of the $M$ words in the vocabulary/dictionary, with word at index-$j$ appearing with probability $p_{j|c}$ and  $p_{1|c} + p_{2|c} + \ldots, p_{M|c} = 1$. Moreover, it is assumed that each word is generated independently of the others. For a given class $c$, the likelihood of observing a feature vector $\mathbf{x} = [x_1, \ldots, x_M]^T$ is given as

$$ P(\mathbf{x} | c) = \frac{L!}{\prod_{j=1}^M x_j!} \prod_{j=1}^{M} p_{j|c}^{x_j} $$

where $L = \sum_{j=1}^M x_j$ is the total number of words in the document/tweet.
The term $\frac{L!}{\prod_{j=1}^M x_j!}$ is just a constant that stays the same for all class labels.

The probabilities (with Laplace smoothing) are estimated from the training data as follows:

$$ \hat{p}_{j|c} = \frac{N_{j,c} + 1}{N_{c} + M} $$

where:
- $N_{j,c}$ is the number of dictionary word at index $j$ in class-$c$ examples,
- $N_{c}$ is the total count of all words all class-$c$ examples,
- $M$ is the size of the dictionary



In this task you will be implementing Naive Bayes Classifier (from scratch) for sentiment classification using the bag of words approach. You are given a dataset with two columns:

- `Tweet` → Column of type `string` containing sentences
- `Sentiment` → Column of type `object (categorical)`. It contains the sentiment of each of the tweets either *positive*, *neutral* or *negative*.

The datasets have already been split into `train` and `test` for you. Please import the provided `train.csv` and `test.csv`. The file `stopwords.txt` has been imported for you. Just change your directory path accordingly. More on that soon. Apply the following encoding to your `Sentiment` column:

- positive: 0
- neutral: 1
- negative: 2

In [2]:
# Code Here

# Enclose your path within the below ' '
df_train = pd.read_csv('train.csv')
df_test = pd.read_csv('test.csv')
stopwords = pd.read_table('stop_words.txt', header = None)[0]

# Apply Encoding of Sentiment Column here
# Define the mapping
sentiment_map = {
    "positive":0,
    "neutral":1,
    "negative":2,
}

# Apply the mapping to train and test
df_train['Sentiment'] = df_train['Sentiment'].replace(sentiment_map)
df_test['Sentiment'] = df_test['Sentiment'].replace(sentiment_map)

df_train

  df_train['Sentiment'] = df_train['Sentiment'].replace(sentiment_map)
  df_test['Sentiment'] = df_test['Sentiment'].replace(sentiment_map)


Unnamed: 0,Sentiment,Tweet
0,1,"@united to be clear on my luggage comment, I a..."
1,0,@united I wanna be grand staff
2,2,@united. DAY to IAD and CVG to IAD both Cancel...
3,0,@JetBlue Looking cool
4,2,@USAirways disappointment? Making own arrangem...
...,...,...
5851,0,@VirginAmerica is the best airline I have flow...
5852,2,@americanair never fails to disappoint. waitin...
5853,1,That's not the case.. They sent me a email wit...
5854,2,@USAirways 1 hour and counting on hold :( why?...


Now that the data has been imported, next comes preprocessing the dataset. As you have seen from the small glimpse of the data above, there is much cleaning to be done including removing hyperlinks, urls, digits, excess spaces and more. For naive bayes to work properly, such preprocessing is neccessary when data is in a bad shape like this one.

Complete the below function which when given the tweet column of any dataframe performs the aforementioned cleaning tasks. It is recommended you use the powerful library of *regular expressions* - `re` for this purpose. This library is an essential skill tool for dealing with *Natural Language Problems (NLP)*.

To get used to the art of regular expressions, you can use this website https://regexr.com/. Here you can enter any text you want and experiment what regular expressions capture what.

If these are too intimidating for you, you can use methods that work with strings. Documentation is also provided.

*   [.casefold()](https://www.w3schools.com/python/ref_string_casefold.asp)
*   [.lstrip()](https://www.w3schools.com/python/ref_string_lstrip.asp)
*   [re.sub()](https://www.w3schools.com/python/python_regex.asp)
*   [.rstrip()](https://www.w3schools.com/python/ref_string_rstrip.asp)
*   [.replace](https://www.w3schools.com/python/ref_string_replace.asp)

In [3]:
def preprocess_df(df):

    """
    Preprocesses the tweet column of the train/test dataframe

    Args:
        df (Pandas.DataFrame or Pandas.Series)  : Dataframe before tweet column preprocessed

    Returns:
        df (Pandas.DataFrame or Pandas.Series)  : Dataframe after tweet column preprocessed

    """
    # converting words to lower casse
    df['Tweet'] = df['Tweet'].apply(lambda x: x.lower() if isinstance(x, str) else x)

    # Remove digits and next line symbols if present
    digit_pattern = r"[0-9]"
    df['Tweet']= df["Tweet"].replace(digit_pattern, ' ', regex=True)

    nextline_pattern = r"[\n]+"
    df['Tweet']= df["Tweet"].replace(nextline_pattern, ' ', regex=True)

    # Remove usernames and hyperlinks if present
    username_pattern = r"@[^\s]+"
    df['Tweet'] = df["Tweet"].replace(username_pattern, ' ', regex=True)

    hyperlink_pattern = r"(http[^\s]+)"
    df['Tweet'] = df["Tweet"].replace(hyperlink_pattern, 'url', regex=True)

    # Remove punctuation and symbols if present
    punctuation_symbol_pattern = r"[!@$%\^&\[\]\{\}…,;:#'\"?.()/\-]"
    df['Tweet'] = df["Tweet"].replace(punctuation_symbol_pattern, ' ', regex=True)

    # Remove excess space if present
    whitespaces_pattenr = r"[\s]+"
    df['Tweet'] = df["Tweet"].replace(whitespaces_pattenr, ' ', regex=True)

    # Removing stop words i.e. words in the stopwords.txt file should not be present in any tweet
    def remove_stopwords(tweet):
        words = tweet.split()
        filtered_words = []
        for word in words:
            if word not in stopwords:
                filtered_words.append(word)
        result = ' '.join(filtered_words)
        return result

    with open("stop_words.txt", "r") as file:
        stopwords = set(word.strip() for word in file.readlines())
        stopwords = np.array(stopwords)

    df["Tweet"] = df["Tweet"].apply(lambda x: remove_stopwords(x))

    # Any other preprocessing you deem neccessary

    # keeping everything alphabetical
    alphabet_pattern = r"[^a-z]"
    df['Tweet'] = df["Tweet"].replace(alphabet_pattern, ' ', regex=True)

    return df

Apply the `preprocess_df` function on your train and test data, Split both into their seperate columns i.e. `train_tweet`, `train_sentiment` and same for test.

In [4]:
# Code Here

# Apply preprocess_df here
df_train, df_test = preprocess_df(df_train), preprocess_df(df_test)

# Split into their respective columns here

train_tweet = df_train["Tweet"]
train_labels = df_train["Sentiment"]

test_tweet = df_test["Tweet"]
test_labels = df_test["Sentiment"]

df_train

Unnamed: 0,Sentiment,Tweet
0,1,to be clear on my luggage comment i am referen...
1,0,i wanna be grand staff
2,2,day to iad and cvg to iad both cancelled fligh...
3,0,looking cool
4,2,disappointment making own arrangements for me ...
...,...,...
5851,0,is the best airline i have flown on easy to ch...
5852,2,never fails to disappoint waiting at jfk an ho...
5853,1,that s not the case they sent me a email with ...
5854,2,hour and counting on hold why url qwmxr


### Vectorizing sentences with Bag of Words

Now that we have loaded in our data, we will need to vectorize our tweets - this is necessary to be able to numericalize our inputs before feeding them into our model.

We will be using a Bag of Words (BOW) approach to vectorize our sentences. This is a simple approach that counts the number of times each word appears in a sentence.

The element at index $\text{i}$ of the vector will be the number of times the $\text{i}^{\text{th}}$ word in our vocabularly appears in the sentence. So, for example, if our vocabularly is `["the", "cat", "sat", "on", "mat"]`, and our sentence is `"the cat sat on the mat"`, then our vector will be `[2, 1, 1, 1, 1]`.

You will now create a `BagOfWords` class to vectorize our sentences. This will involve creating

1. A vocabularly from our corpus

2. A mapping from words to indices in our vocabularly

3. A function to vectorize a sentence in the fashion described above

The class below contains the skeleton code of the functions you need to fill in.

*Note #1:* It will benefit you later on to orgamize your vocabularly to contain words in the exact order they come in the corpus or tweets data.

*Note #2:* Note that each $(i, j)$ entry of vectorized data shows how many times the $j^{th}$ word in the vocabularly (list object) appears in the $i^{th}$ observation of the tweet column assuming you have followed **Note #1**

*Helpful Function:* You are encouraged to make use of [.split()](https://www.w3schools.com/python/ref_string_split.asp) method.

In [5]:
class BagOfWords:
    def __init__(self):
        """
        Initializes an empty vocabularly list to store unique words from the corpus.

        Attributes:
        - vocab (list): A list that will contain unique words from the corpus.
        """
        self.vocab = []

    def fit(self, corpus, df = True):
        """
        Builds the vocabularly from the given corpus. The corpus can be a DataFrame/Series with a 'text' column or a single string.

        Args:
        - corpus (DataFrame/Series str): The corpus to build the vocabularly from.
          If `df` is True, expects a DataFrame with a column named 'text'.
          If `df` is False, expects a single string.
        - df (bool): A flag to indicate whether the input corpus is a DataFrame or a single string.
          Defaults to True.

        Returns:
        None. Modifies the `self.dict` attribute in place by adding unique words from the corpus.
        """

        if not df:
            # Process a single string of text
            words = corpus.split()
            self.vocab = words
        else:
            # Process a DataFrame corpus
            words = (' '.join(corpus["Tweet"])).split()
            for word in words:
                if word not in self.vocab:
                    self.vocab.append(word)

    def get_idx(self, word):
        """
        Returns the index of a word in the vocabularly.

        Args:
        - word (str): The word to find in the vocabularly.

        Returns:
        int: The index of the word in the vocabularly. Raises a ValueError if the word is not in the vocabularly.

        """
        if word in self.vocab:
            return self.vocab.index(word)
        
        return -1

    def vectorize(self, sentence):
        """
        Converts a sentence into a vector based on the Bag of Words model, where each element of the vector represents
        the frequency of a vocabularly word in the sentence.

        Args:
        - sentence (str): The sentence to vectorize.

        Returns:
        np.array: A numpy array where each element corresponds to the frequency of a word from the vocabularly in the sentence.
        """
        vector = np.zeros(len(self.vocab))
        words = sentence.split()
        for word in words:
            index = self.get_idx(word)
            if index != -1:
                vector[index] += 1

        
        return vector

# Example usage
# Create an instance of BagOfWordsthi
bow = BagOfWords()

# Fit the model to the training corpus
bow.fit("the cat sat on mat", df = False)

# Vectorize an example sentence
example_sentence = "the cat sat on the mat"
vector = bow.vectorize(example_sentence)

print(vector)
# >>> array([2., 1., 1., 1., 1.])

[2. 1. 1. 1. 1.]


Once you have implemented the `BagOfWords` class, fit it to the training data, and vectorize the training, and test data.

*Note:* The `vectorize` method of the `BagOfWords` class vectorizes a single sentence but not a column of text.

*Helpful Functions:* You are encouraged to look into the methods [df.apply()](https://www.w3schools.com/python/pandas/ref_df_apply.asp), [np.vstack()](https://www.w3resource.com/numpy/manipulation/vstack.php), [lambda operator](https://www.geeksforgeeks.org/python-lambda-anonymous-functions-filter-map-reduce/).

In [6]:
# Initalize BOW object
bow = BagOfWords()

# Fit it to Training Data
bow.fit(df_train)

# Vectorize Train and Test Data to get their BOW matrix.
train_bow_matrix = np.vstack(df_train["Tweet"].apply(bow.vectorize))
test_bow_matrix = np.vstack(df_test["Tweet"].apply(bow.vectorize))

train_bow_matrix

array([[2., 1., 1., ..., 0., 0., 0.],
       [0., 1., 0., ..., 0., 0., 0.],
       [2., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 1., 0., 0.],
       [0., 0., 0., ..., 0., 1., 0.],
       [0., 0., 0., ..., 0., 0., 1.]])

### Naive Bayes From Scratch

Now that we have vectorized our tweets, we can implement our Naive Bayes model. Recall that the Naive Bayes model is based off of the Bayes Theorem:

$$
P(y \mid x) = \frac{P(x \mid y)P(y)}{P(x)}
$$

What we really want is to find the class $c$ that maximizes $P(c \mid x)$, so we can use the following equation:

$$
\hat{c} = \underset{c}{\text{argmax}} \ P(c \mid x) = \underset{c}{\text{argmax}} \ P(x \mid c)P(c)
$$

where $c \in \{positive, neutral, negative\}$

We can then use the Naive Bayes assumption to simplify this:

$$
\hat{c} = \underset{c}{\text{argmax}} \ P(c \mid x) = \underset{c}{\text{argmax}} \ P(c) \prod_{i=1}^{L} P(x_i \mid c)
$$

Where $L$ is the total number of words in our tweet and  $x_i$ is the $i^{\text{th}}$ word in our tweet.

All of these probabilities can be estimated from our training data. We can estimate $P(c)$ by counting the number of times each class appears in our training data, and dividing by the total number of training examples. On the other hand, if $x_i$: the $i^{\text{th}}$ word in our tweet is equal to the $j^{\text{th}}$ word in our dictionary, we have $P(x_i \mid c) = \hat{p}_{j|c}$. It would help to apply logarithms to the above equation so that we translate the product into a sum, and avoid underflow errors. This will give us the following equation:

$$
\hat{c} = \underset{c}{\text{argmax}} \ \log \hat{\pi}_c  + \sum_{i=1}^{L} \log P(x_i \mid c)
$$

*Note* For words in the test data but not in the train data i.e. out of vocabularly/dictionary words (OOV), simply ignore them.

Many of you may be wondering how will the `BagOfWords` class come into play. This part of the assignment could have been solved by two ways. The first method does not vectorize the corpus and instead uses dictionary objects or look-up tables to store counts of each word for each class however building them is not too computationally efficient for large data corpus.

The other method involving `BOW` rests on the shoulders of `numpy` which enables extremely fast computation of the counts of each word/slicing/indexing and more!. Thumb of Rule: if you are able to translate your code into numpy do it. Dont think anything else! This is precisely how the `sklearn` library implements it which is what you have to learn.

You will now implement this algorithm in the following class `NaiveBayesClassifier`. It is highly encouraged for you to vectorize your computation as much as possible. For this make use of the `BOW` object being passed to the constructor of the class and the `bow_matrix` passed in the `fit()` method. This holds the key to everything in the aformentioned algorithm for computing the log probabilities of observation given a class so minimize your usage of the dataframe. The whole point of vectorizing the train and test corpus is to use the powerful and fast methods of matrix algebra.

*Note #1:* We have provided many attributes in the constructor and neccessary functions to solve this. If you require any other functions, feel free to do so and if you think some attributes in the constructor class are unncessary feel free to remove them but the essence of each method should remain the same and be respected.

*Note #2:* For those who have embraced vectorization, we have provided hints here and there in the class as in what kind of attributes will you be needing. For example `class_word_sum (np.ndarray)` which is a matrix where each row corresponds to a class and contains the count of each vocabularly word in that class. Think of how you can access these counts from the vectorized train matrix `bow_matrix` itself. Recall its intepretation from above and remember you are dealing with a numpy matrix so all operations such as `np.sum()`, `np.dot()` and more are all legal!


*Note #3:* Know that you have access to the dataframe as well as the vectorized train matrix `bow_matrix` in the `fit()` method of the below class. Since (hopefully) you have maintained the order of words as they come in the corpus in the `bow_matrix`, you can now easily index this matrix for a specific class since the indices in the dataframe of a specific class will be same for the matrix.

For this part, the only external library you will need is `numpy`. You are not allowed to use anything else.

Report the Accuracy, Precision, Recall, and F1 score of your model on train and test data. Also display the Confusion Matrix. You are allowed to use [.classification_report()](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html), [.confusion_matrix()](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html) methods from `sklearn.metrics` for the evaluation.

In [7]:
class NaiveBayesClassifier:
    def __init__(self, bow):
        """
        Initializes the Naive Bayes Classifier with a given Bag of Words model.

        Args:
        - bow (BagOfWords): An instance of the BagOfWords class containing the vocabularly.

        Attributes:
        - log_class_priors (np.ndarray): Logarithm of the class priors, log(P(c)) for each class c.
        - bow (BagOfWords): The Bag of Words model passed during instantiation.
        - classes (np.ndarray): Array containing unique class Sentiments.
        - class_word_sum (np.ndarray): Matrix where each row corresponds to a class and contains the count of each vocabularly word in that class.
        - total_words_in_class (np.ndarray): Array where each element is the total count of words in a class.
        - vocab_length (int): The length of the vocabularly in the Bag of Words model.
        """

        self.log_class_priors = None
        self.bow = bow
        self.classes = None
        self.class_word_sum = None
        self.total_words_in_class = None
        self.vocab_length = None

    def fit(self, df, bow_matrix):
        """
        Fits the classifier to the training data, initializing class priors, dictionary counts per class, and total word counts per class.

        Args:
        - df (pd.DataFrame): DataFrame containing the training data with at least two columns: 'Tweet' and 'Sentiment'.
        - bow_matrix (np.ndarray): Bag of Words matrix representation of the training data, where rows correspond to documents and columns to dictionary words.

        Outputs:
        None. Modifies the classifier's attributes in place to store the calculated values.
        """

        num_classes = len(df["Sentiment"].unique())

        self.log_class_priors = np.zeros(num_classes)
        self.classes = sorted(np.array(df["Sentiment"].unique()))
        self.class_word_sum = np.zeros((num_classes, bow_matrix.shape[1]))
        self.total_words_in_class = np.zeros(num_classes)
        self.vocab_length = bow_matrix.shape[1]

        # calculating class priors ==> P(0), P(1), and P(2)
        total_examples = len(df["Sentiment"])
        priors = np.zeros(num_classes)
        for i in range(num_classes):
            class_indices = df[df["Sentiment"] == i]
            class_indices = np.array(class_indices.index)

            class_examples = len(df[df["Sentiment"] == i])
            prior = class_examples / total_examples
            priors[i] = prior

            # updating class attributes
            self.log_class_priors[i] = np.log(prior)
            self.class_word_sum[i, :] = np.sum(bow_matrix[class_indices, :], axis=0)
            self.total_words_in_class[i] = np.sum(self.class_word_sum[i, :])

    def predict(self, X_bow):
        """
        Predicts the class for a given set of documents represented as Bag of Words vectors.

        Args:
        - X_bow (np.ndarray): A 2D array where each row represents a document vectorized using the dictionary.

        Returns:
        np.ndarray: Array of predicted class indices for each document.
        """
        # calculating log-likelyhood probability with laplace smoothing (gets rid of -inf)
        log_likelyhood_probabilities = np.array(np.log(self.class_word_sum + 1) - np.log(self.total_words_in_class.reshape(-1, 1) + self.vocab_length))

        # calculating the probability of a sentence being in each class and extracting the class with the highest probability
        class_probabilities = np.dot(X_bow, log_likelyhood_probabilities.T) + self.log_class_priors
        predictions = np.argmax(class_probabilities, axis=1)

        return predictions

    def evaluate(self, X_bow, y_true, split = 'Validation'):
        """
        Evaluates the classifier's performance on a given dataset.

        Args:
        - X_bow (np.ndarray): Bag of Words matrix of the dataset to evaluate.
        - y_true (np.ndarray): True class Sentiments for the dataset.
        - split (str): Sentiment for the dataset being evaluated (e.g.,'Train', 'Test'). Defaults to 'Train'.

        Outputs:
        Prints the classification report and confusion matrix.
        """
        y_pred = self.predict(X_bow)
        report = classification_report(y_true, y_pred)
        confusion_mat = confusion_matrix(y_true, y_pred)
        
        return report, confusion_mat
        

In [8]:
# Initializing NB class Object
train = df_train
test = df_test

X_train = train_bow_matrix
y_train = np.array(train["Sentiment"])

X_test = test_bow_matrix
y_test = np.array(test["Sentiment"])

modelNB = NaiveBayesClassifier(bow)

# Fitting NB to train
modelNB.fit(df_train, X_train)

# Evaluating performance on each split
train_report, confusion_matrix_train = modelNB.evaluate(X_train, y_train)
test_report, confusion_matrix_test = modelNB.evaluate(X_test, y_test)

# Training Performance
print("\nTrain Data Evaluation Report:")
print(train_report)
print("Train Data Confusion Matrix:")
print(confusion_matrix_train)

# Testing Performance
print("\nTest Data Evaluation Report:")
print(test_report)
print("Test Data Confusion Matrix:")
print(confusion_matrix_test)



Train Data Evaluation Report:
              precision    recall  f1-score   support

           0       0.91      0.69      0.79       948
           1       0.89      0.60      0.72      1288
           2       0.83      0.98      0.90      3620

    accuracy                           0.85      5856
   macro avg       0.88      0.76      0.80      5856
weighted avg       0.86      0.85      0.84      5856

Train Data Confusion Matrix:
[[ 656   41  251]
 [  40  772  476]
 [  24   55 3541]]

Test Data Evaluation Report:
              precision    recall  f1-score   support

           0       0.80      0.51      0.63       239
           1       0.71      0.34      0.46       301
           2       0.77      0.97      0.86       924

    accuracy                           0.76      1464
   macro avg       0.76      0.61      0.65      1464
weighted avg       0.76      0.76      0.74      1464

Test Data Confusion Matrix:
[[123  20  96]
 [ 22 102 177]
 [  8  22 894]]


## Task 2 Using `sklearn` (10 Points)

Now that you have implemented your own Naive Bayes model, you will use the `sklearn` library to train a Naive Bayes model on the same data. Alongside this, you will use their implementation of the Bag of Words model, the `CountVectorizer` class, to vectorize your sentences.

You can use the `MultinomialNB` class to train a Naive Bayes model. Go through the relevant documentation to figure out how to use it, and how it differs from the model you implemented.

When you finish training your model, report the same metrics as above on the Testing set

In [9]:
# Vectorize the text data for each split
train = df_train
test = df_test
count_vectorizer = CountVectorizer()

X_train_bow = count_vectorizer.fit_transform(train["Tweet"])
X_train_bow = np.array(X_train_bow.todense())

X_test_bow = count_vectorizer.transform(test["Tweet"])
X_test_bow = np.array(X_test_bow.todense())

# Initialize and Fit the Model on vectorized instances of train set
X_train = X_train_bow
y_train = train["Sentiment"]

X_test = X_test_bow
y_test = test["Sentiment"]

modelNB = MultinomialNB().fit(X_train, y_train)
y_pred_train = modelNB.predict(X_train)
y_pred_test = modelNB.predict(X_test)

# Evaluating performance on each split
train_report = classification_report(y_train, y_pred_train)
test_report = classification_report(y_test, y_pred_test)

# creating confusion matrices
confusion_matrix_train = confusion_matrix(y_train, y_pred_train)
confusion_matrix_test = confusion_matrix(y_test, y_pred_test)   

# Training Performance
print("\nTrain Data Evaluation Report:")
print(train_report)
print("Train Data Confusion Matrix:")
print(confusion_matrix_train)

# Testing Performance
print("\nTest Data Evaluation Report:")
print(test_report)
print("Test Data Confusion Matrix:")
print(confusion_matrix_test)


Train Data Evaluation Report:
              precision    recall  f1-score   support

           0       0.91      0.70      0.79       948
           1       0.88      0.59      0.71      1288
           2       0.83      0.98      0.90      3620

    accuracy                           0.85      5856
   macro avg       0.87      0.76      0.80      5856
weighted avg       0.85      0.85      0.84      5856

Train Data Confusion Matrix:
[[ 663   45  240]
 [  40  766  482]
 [  24   59 3537]]

Test Data Evaluation Report:
              precision    recall  f1-score   support

           0       0.81      0.51      0.63       239
           1       0.71      0.34      0.46       301
           2       0.77      0.97      0.86       924

    accuracy                           0.77      1464
   macro avg       0.76      0.61      0.65      1464
weighted avg       0.76      0.77      0.74      1464

Test Data Confusion Matrix:
[[122  21  96]
 [ 21 103 177]
 [  7  21 896]]


Once you have implemented the `BagOfWords` class, fit it to the training data, and vectorize the training, and test data.

*Note:* The `vectorize` method of the `BagOfWords` class vectorizes a single sentence but not a column of text.

*Helpful Functions:* You are encouraged to look into the methods [df.apply()](https://www.w3schools.com/python/pandas/ref_df_apply.asp), [np.vstack()](https://www.w3resource.com/numpy/manipulation/vstack.php), [lambda operator](https://www.geeksforgeeks.org/python-lambda-anonymous-functions-filter-map-reduce/).

## Student Statement on Usage of Generative AI Tools

Students MUST write a statement in this cell detailing their usage of any generative AI tools. If no such tool was used, write "*I have not used any generative AI tool for completing this assignment*".

In case such tools were used (and you are allowed to), the statement should read "*I have used generative AI tools for completing this assignment for Tasks (list the tasks) as per the following details*:". This should be followed by the following information:

1. What tools were used?
2. How exactly were they used?