# PA1.2 Naive Bayes for Text Classification

### Introduction

In this notebook, you will be implementing a Naive Bayes model to classify sentences based off their emotions.

The Naive Bayes model is a probabilistic model that uses Bayes' Theorem to calculate the probability of a label given some observed features. In this case, we will be using the Naive Bayes model to calculate the probability of a sentence belonging to a certain emotion given the words in the sentence.

For reference and additional details, please go through [Chapter 4](https://web.stanford.edu/~jurafsky/slp3/4.pdf) of the SLP3 book.


### Instructions

- Follow along with the notebook, filling out the necessary code where instructed.

- <span style="color: red;">Read the Submission Instructions, Plagiarism Policy, and Late Days Policy in the attached PDF.</span>

- <span style="color: red;">Make sure to run all cells for credit.</span>

- <span style="color: red;">Do not remove any pre-written code.</span>

- <span style="color: red;">You must attempt all parts.</span>

In [1]:
# import all required libraries here
%pip install datasets sklearn
import datasets
import numpy as np
import pandas as pd
import re
import nltk
from nltk.corpus import stopwords
import sklearn

Defaulting to user installation because normal site-packages is not writeable
You should consider upgrading via the '/Library/Developer/CommandLineTools/usr/bin/python3 -m pip install --upgrade pip' command.[0m
Note: you may need to restart the kernel to use updated packages.


  from .autonotebook import tqdm as notebook_tqdm


## Loading and Preprocessing the Dataset

We will be working with the [dair-ai/emotion](https://huggingface.co/datasets/dair-ai/emotion) dataset. This contains 6 classes of emotions: `joy`, `sadness`, `anger`, `fear`, `love`, and `surprise`.

Instead of downloading the dataset manually, we will be using the [`datasets`](https://huggingface.co/docs/datasets) library to download the dataset for us. This is a library in the HuggingFace ecosystem that allows us to easily download and use datasets for NLP tasks. Outside of just downloading the dataset, it also provides a standard interface for accessing the data, which makes it easy to use with other libraries like Pandas and PyTorch. You can take a look at the huge list of datasets available [here](https://huggingface.co/datasets).

In the following cells,

1. Load in the dataset (It should already be split into train, validation, and test sets.)

2. Define a dictionary mapping the emotion labels to integers. You can find these on the dataset page linked above.

3. Format each split of the dataset into a Pandas DataFrame. The columns should be `text` and `label`, where `text` is the sentence and `label` is the emotion label.

In [2]:
# code here
from datasets import load_dataset
## Data set looks like this:
# The data fields are:
# text: a string feature.
# label: a classification label, with possible values including sadness (0), joy (1), love (2), anger (3), fear (4), surprise (5).

dataset = load_dataset("dair-ai/emotion")
# print(dataset)

emotions_mapping = {  
    0: "sadness",
    1:"joy",
    2:"love",
    3:"anger",
    4:"fear",
    5:"surprise"
}

training_df = pd.DataFrame(dataset['train'])
print('Training Data frame: ', training_df)
validation_df = pd.DataFrame(dataset['validation'])
print('Validation Data frame: ', validation_df)
test_df = pd.DataFrame(dataset['test'])
print('Testing Data frame: ', test_df)

You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this dataset from the next major release of `datasets`.


Training Data frame:                                                      text  label
0                                i didnt feel humiliated      0
1      i can go from feeling so hopeless to so damned...      0
2       im grabbing a minute to post i feel greedy wrong      3
3      i am ever feeling nostalgic about the fireplac...      2
4                                   i am feeling grouchy      3
...                                                  ...    ...
15995  i just had a very brief time in the beanbag an...      0
15996  i am now turning and i feel pathetic that i am...      0
15997                     i feel strong and good overall      1
15998  i feel like this was such a rude comment and i...      3
15999  i know a lot but i feel so stupid because i ca...      0

[16000 rows x 2 columns]
Validation Data frame:                                                     text  label
0     im feeling quite sad and sorry for myself but ...      0
1     i feel like i am still looki

Now that we've gotten a feel for the dataset, we might want to do some cleaning or preprocessing before continuing. For example, we might want to remove punctuation and other alphanumeric characters, lowercase all the text, strip away extra whitespace, and remove stopwords.

In the cell below, write a function that does exactly the following described above. You can use the `re` library to help you with this. You can also use the `nltk` library to help you with removing stopwords.

Once you are done, you can simply `apply` this function to the `text` column of the dataset to get the preprocessed text.

In [3]:
nltk.download('stopwords')
nltk.download('punkt')
def preProcess(input_text):

    #1 converting text to lower case
    input_text = re.sub(r'[^a-zA-Z\s]', '', input_text)
    # print(input_text)

    #2 lower casing
    input_text= input_text.lower()

    #3 tokenizing the text and then removing stop words from it
    word_tokens = nltk.word_tokenize(input_text)
    stop_words = set(stopwords.words('english'))
    processed=[word for word in word_tokens if word not in stop_words]

    processed_sentence = ' '.join(processed)

    #4 removing whitespace
    processed_sentence=processed_sentence.strip()

    return processed_sentence

training_df['text'] = training_df['text'].apply(preProcess)
validation_df['text'] = validation_df['text'].apply(preProcess)
test_df['text'] = test_df['text'].apply(preProcess)

print('Training Data frame: ', training_df)
print('Validation Data frame: ', validation_df)
print('Testing Data frame: ', test_df)

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/mohsintanveer/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     /Users/mohsintanveer/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


Training Data frame:                                                      text  label
0                                  didnt feel humiliated      0
1      go feeling hopeless damned hopeful around some...      0
2              im grabbing minute post feel greedy wrong      3
3      ever feeling nostalgic fireplace know still pr...      2
4                                        feeling grouchy      3
...                                                  ...    ...
15995      brief time beanbag said anna feel like beaten      0
15996  turning feel pathetic still waiting tables sub...      0
15997                           feel strong good overall      1
15998                     feel like rude comment im glad      3
15999                       know lot feel stupid portray      0

[16000 rows x 2 columns]
Validation Data frame:                                                     text  label
0              im feeling quite sad sorry ill snap soon      0
1     feel like still looking blan

### Vectorizing sentences with Bag of Words

Now that we have loaded in our data, we will need to vectorize our sentences - this is necessary to be able to numericalize our inputs before feeding them into our model. 

We will be using a Bag of Words approach to vectorize our sentences. This is a simple approach that counts the number of times each word appears in a sentence. 

The element at index $\text{i}$ of the vector will be the number of times the $\text{i}^{\text{th}}$ word in our vocabulary appears in the sentence. So, for example, if our vocabulary is `["the", "cat", "sat", "on", "mat"]`, and our sentence is `"the cat sat on the mat"`, then our vector will be `[2, 1, 1, 1, 1]`.

You will now create a `BagOfWords` class to vectorize our sentences. This will involve creating

1. A vocabulary from our corpus

2. A mapping from words to indices in our vocabulary

3. A function to vectorize a sentence in the fashion described above

It may help you to define something along the lines of a `fit` and a `vectorize` method.

In [4]:
# code here
class BagOfWords:
    def __init__(self):
        self.__vocabulary={}
        self.__index0fWord=[]

    #Learning the vocabulary
    def fit(self,data):
        index_of_word=0
        for sentence in data:
            words=sentence.split()
            for w in words:
                if w not in self.__vocabulary:
                    self.__vocabulary[w]=index_of_word
                    self.__index0fWord.append(w)
                    index_of_word+=1

    #Vectorizing a sentence
    def vectorize_a_sentence(self,sentence):
        vocabulary_length=len(self.__vocabulary)
        processed_sentence=[0]*vocabulary_length

        words=sentence.split()
        for w in words:
            if w in self.__vocabulary:
                processed_sentence[self.__vocabulary[w]]+=1
        
        return processed_sentence
    
    def get_vocabulary(self):
        return self.__vocabulary

    def get_indexOfword(self):
        return self.__index0fWord




For a sanity check, you can manually set the vocabulary of your `BagOfWords` object to the vocabulary of the example above, and check that the vectorization of the sentence is correct.

Once you have implemented the `BagOfWords` class, fit it to the training data, and vectorize the training, validation, and test data.

In [5]:

# # Testing on sample
# sample_data=["the cat sat on the mat"]
# bag_of_words=BagOfWords()
# bag_of_words.fit(sample_data)
# print("vocab:", bag_of_words.get_vocabulary())
# s=bag_of_words.vectorize_a_sentence(sample_data[0])
# print(s)

bag_of_words=BagOfWords()

# Fitting to training data
bag_of_words.fit(training_df['text'])
print("vocab:", bag_of_words.get_vocabulary())

#Vectorizing each set
training_data_vectorized=training_df['text'].apply(bag_of_words.vectorize_a_sentence).tolist()
validation_data_vectorized= validation_df['text'].apply(bag_of_words.vectorize_a_sentence).tolist()
test_data_vectorized = test_df['text'].apply(bag_of_words.vectorize_a_sentence).tolist()


i =bag_of_words.get_indexOfword()
training_df_vectorized = pd.DataFrame(training_data_vectorized, columns=i)
validation_df_vectorized = pd.DataFrame(validation_data_vectorized, columns=i)
test_df_vectorized = pd.DataFrame(test_data_vectorized, columns=i)
print("Training data vectorized:",training_df_vectorized.head())
print("Validation data vectorized:",validation_df_vectorized.head())
print("Test data vectorized:",test_df_vectorized.head())


Training data vectorized:    didnt  feel  humiliated  go  feeling  hopeless  damned  hopeful  around  \
0      1     1           1   0        0         0       0        0       0   
1      0     0           0   1        1         1       1        1       1   
2      0     1           0   0        0         0       0        0       0   
3      0     0           0   0        1         0       0        0       0   
4      0     0           0   0        1         0       0        0       0   

   someone  ...  pandora  cosmopolitian  monkees  tearing  celebrities  \
0        0  ...        0              0        0        0            0   
1        1  ...        0              0        0        0            0   
2        0  ...        0              0        0        0            0   
3        0  ...        0              0        0        0            0   
4        0  ...        0              0        0        0            0   

   irrelevant  braeden  calvin  beanbag  subbing  
0        

## Naive Bayes

### From Scratch

Now that we have vectorized our sentences, we can implement our Naive Bayes model. Recall that the Naive Bayes model is based off of the Bayes Theorem:

$$
P(y \mid x) = \frac{P(x \mid y)P(y)}{P(x)}
$$

What we really want is to find the class $c$ that maximizes $P(c \mid x)$, so we can use the following equation:

$$
\hat{c} = \underset{c}{\text{argmax}} \ P(c \mid x) = \underset{c}{\text{argmax}} \ P(x \mid c)P(c)
$$

We can then use the Naive Bayes assumption to simplify this:

$$
\hat{c} = \underset{c}{\text{argmax}} \ P(c \mid x) = \underset{c}{\text{argmax}} \ P(c) \prod_{i=1}^{n} P(x_i \mid c)
$$

Where $x_i$ is the $i^{\text{th}}$ word in our sentence.

All of these probabilities can be estimated from our training data. We can estimate $P(c)$ by counting the number of times each class appears in our training data, and dividing by the total number of training examples. We can estimate $P(x_i \mid c)$ by counting the number of times the $i^{\text{th}}$ word in our vocabulary appears in sentences of class $c$, and dividing by the total number of words in sentences of class $c$.

It would help to apply logarithms to the above equation so that we translate the product into a sum, and avoid underflow errors. This will give us the following equation:

$$
\hat{c} = \underset{c}{\text{argmax}} \ \log P(c) + \sum_{i=1}^{n} \log P(x_i \mid c)
$$

You will now implement this algorithm. It would help to go through [this chapter from SLP3](https://web.stanford.edu/~jurafsky/slp3/4.pdf) to get a better understanding of the model - **it is recommended base your implementation off the pseudocode that has been provided on Page 6**. You can either make a `NaiveBayes` class, or just implement the algorithm across two functions.

<span style="color: red;"> For this part, the only external library you will need is `numpy`. You are not allowed to use anything else.</span>

In [6]:
# code here
# input to the fucntion will be the training data frame and the classes -> which is essentially a dictionary
# we created earlier, and the labels
def Train_Naive_Bayes(Document, Classes,labels):
    #Calculating log prior
    total_docs=len(Document)
    log_prior_dictionary={}
    log_likelihood_dictionary={}

    vocabulary=bag_of_words.get_vocabulary()
    vocab_length=len(vocabulary)

    for class_integer_mapping in range(len(Classes)):
        no_of_docs_of_that_class=0
        doc_per_class=[]
        
        for i in range(len(labels)):
            label=labels[i]
            if class_integer_mapping== label:
                no_of_docs_of_that_class+=1
                doc_per_class.append(Document.loc[i])
        
        #calculating log prior for each class and printing
        log_prior_of_a_class=np.log(no_of_docs_of_that_class/total_docs)
        #storing each of the log priors in dictionary
        log_prior_dictionary[class_integer_mapping]=log_prior_of_a_class
        
        log_likelihood_per_class=[]
        for word in vocabulary:
            count_per_word_of_that_class=0
            log_likelihood=0
            for doc in doc_per_class:
                if doc[word]>0:
                    count_per_word_of_that_class+=int(doc[word])

            #Ensuring smoothing
            log_likelihood=np.log((count_per_word_of_that_class+1)/(no_of_docs_of_that_class+vocab_length))
            log_likelihood_per_class.append(log_likelihood)
        
        log_likelihood_dictionary[class_integer_mapping]={'words':vocabulary,'loglikelihood':log_likelihood_per_class}

    return log_prior_dictionary,log_likelihood_dictionary
          

def TestNaiveBayes(testdoc,logprior,loglikelihood,Classes):
    class_belonging=-1
    max=-100000000000000000000
    vocab=bag_of_words.get_vocabulary()
    for class_integer_mapping in range(len(Classes)):
        sum_prob=0
        sum_prob=logprior[class_integer_mapping]
        words=testdoc.split()
        # Checking the probability word by word
        for word in words:
            if word in vocab:
                location=loglikelihood[class_integer_mapping]['words'][word]
                sum_prob=sum_prob+loglikelihood[class_integer_mapping]['loglikelihood'][location]
        # Need the max prob
        if sum_prob>max:
            max=sum_prob
            class_belonging=class_integer_mapping

    return class_belonging

Now use your implementation to train a Naive Bayes model on the training data, and generate predictions for the Validation Set.

Report the Accuracy, Precision, Recall, and F1 score of your model on the validation data. Also display the Confusion Matrix. You are allowed to use `sklearn.metrics` for this.

In [7]:
# code here
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix
#Calling the training function and learning the probabilities
log_prior,log_likelihood=Train_Naive_Bayes(training_df_vectorized,emotions_mapping,training_df['label'])

result_labels=[]
for sentence in validation_df['text']:
    label_class=TestNaiveBayes(sentence,log_prior,log_likelihood,emotions_mapping)
    result_labels.append(label_class)

true_labels=validation_df['label'].tolist()
predicted_labels=result_labels

accuracy = accuracy_score(true_labels, predicted_labels)
precision = precision_score(true_labels, predicted_labels, average='weighted')
recall = recall_score(true_labels, predicted_labels, average='weighted')
f1 = f1_score(true_labels, predicted_labels, average='weighted')
conf_matrix = confusion_matrix(true_labels, predicted_labels)

# Display the metrics
print("Accuracy:", accuracy*100, "%")
print("Precision:", precision*100, "%")
print("Recall:", recall*100, "%")
print("F1 Score:", f1*100, "%")
print("Confusion Matrix:")
print(conf_matrix)

Accuracy: 65.3 %
Precision: 72.80706571371847 %
Recall: 65.3 %
F1 Score: 57.0518040513073 %
Confusion Matrix:
[[500  50   0   0   0   0]
 [ 11 693   0   0   0   0]
 [ 31 140   7   0   0   0]
 [ 89 119   0  67   0   0]
 [ 69 102   0   2  39   0]
 [ 30  51   0   0   0   0]]


  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))


### Using `sklearn`

Now that you have implemented your own Naive Bayes model, you will use the `sklearn` library to train a Naive Bayes model on the same data. Alongside this, you will use their implementation of the Bag of Words model, the `CountVectorizer` class, to vectorize your sentences.

You can use the `MultinomialNB` class to train a Naive Bayes model. Go through the relevant documentation to figure out how to use it, and how it differs from the model you implemented.

When you finish training your model, report the same metrics as above on the Validation Set.

In [8]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

#Initialising
vectorizer=CountVectorizer()
clf=MultinomialNB()
Y=training_df['label']
true_labels=validation_df['label']

#Creating the bag of words
data = pd.concat([training_df['text'], validation_df['text']], ignore_index=True)
vectorizer.fit(data)
vectorized_training_data = vectorizer.transform(training_df['text'])
vectorized_validation_data=vectorizer.transform(validation_df['text'])

#Training the model and predicting
clf.fit(vectorized_training_data,Y)
predicted_labels = clf.predict(vectorized_validation_data)

#Metrics
accuracy = accuracy_score(true_labels, predicted_labels)
precision = precision_score(true_labels, predicted_labels, average='weighted')
recall = recall_score(true_labels, predicted_labels, average='weighted')
f1 = f1_score(true_labels, predicted_labels, average='weighted')
conf_matrix = confusion_matrix(true_labels, predicted_labels)

# Display the metrics
print("Accuracy:", accuracy*100, "%")
print("Precision:", precision*100, "%")
print("Recall:", recall*100, "%")
print("F1 Score:", f1*100, "%")
print("Confusion Matrix:")
print(conf_matrix)

Accuracy: 78.95 %
Precision: 80.0331134801813 %
Recall: 78.95 %
F1 Score: 76.87046662259364 %
Confusion Matrix:
[[516  18   2   5   9   0]
 [ 27 659   9   6   2   1]
 [ 36  74  65   2   1   0]
 [ 48  29   1 193   4   0]
 [ 44  23   0   8 135   2]
 [ 31  27   0   1  11  11]]
