# Naive Bayes for Text Classification

## Introduction

In this notebook, we implement a Naive Bayes model that classifies sentences by the emotion they express.

The Naive Bayes model is a probabilistic model that uses Bayes' Theorem to calculate the probability of a label given some observed features. In this case, we will be using the Naive Bayes model to calculate the probability of a sentence belonging to a certain emotion given the words in the sentence.

In [1]:
# import all required libraries here

import re
import numpy as np
import pandas as pd
import string
import math

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score, accuracy_score


## Loading and Preprocessing the Dataset

We will be working with the [dair-ai/emotion](https://huggingface.co/datasets/dair-ai/emotion) dataset. This contains 6 classes of emotions: `joy`, `sadness`, `anger`, `fear`, `love`, and `surprise`.

Instead of downloading the dataset manually, we will use the [`datasets`](https://huggingface.co/docs/datasets) library to download it for us. This library is part of the HuggingFace ecosystem and makes it easy to download and use datasets for NLP tasks. Beyond downloading, it provides a standard interface for accessing the data, which makes it easy to use with other libraries like Pandas and PyTorch. You can browse the full list of available datasets [here](https://huggingface.co/datasets).

In the following cells, we:

1. Load in the dataset (It should already be split into train, validation, and test sets.)

2. Define a dictionary mapping the emotion labels to integers. You can find these on the dataset page linked above.

3. Format each split of the dataset into a Pandas DataFrame. The columns should be `text` and `label`, where `text` is the sentence and `label` is the emotion label.

In [2]:
from datasets import load_dataset
dataset = load_dataset("dair-ai/emotion")



In [3]:
emotion_list = ["sadness", "joy", "love", "anger", "fear", "surprise"]
# Map each emotion name to its integer label (its position in the list)
emo_dict = {emotion: i for i, emotion in enumerate(emotion_list)}

print(emo_dict)

{'sadness': 0, 'joy': 1, 'love': 2, 'anger': 3, 'fear': 4, 'surprise': 5}


In [4]:
# Available splits: ['test', 'train', 'validation']

def convert_dataset_to_pd(data):
    new_format = pd.DataFrame.from_dict(data)
    return new_format

train_data = convert_dataset_to_pd(dataset['train'])
validation_data = convert_dataset_to_pd(dataset['validation'])
test_data = convert_dataset_to_pd(dataset['test'])

print(validation_data)

                                                   text  label
0     im feeling quite sad and sorry for myself but ...      0
1     i feel like i am still looking at a blank canv...      0
2                        i feel like a faithful servant      2
3                     i am just feeling cranky and blue      3
4     i can have for a treat or if i am feeling festive      1
...                                                 ...    ...
1995  im having ssa examination tomorrow in the morn...      0
1996  i constantly worry about their fight against n...      1
1997  i feel its important to share this info for th...      1
1998  i truly feel that if you are passionate enough...      1
1999  i feel like i just wanna buy any cute make up ...      1

[2000 rows x 2 columns]


Now that we've gotten a feel for the dataset, we might want to do some cleaning or preprocessing before continuing. For example, we might want to remove punctuation and digits, lowercase all the text, strip away extra whitespace, and remove stopwords.

The functions in the cell below perform these steps.

In [5]:
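def punc_rem(text):
    # Lowercase the text, then drop punctuation characters and digits
    text = text.lower()
    text = "".join([char for char in text if char not in string.punctuation])
    text = re.sub('[0-9]+', '', text)
    return text

def tokenize(text):
    # Split on runs of non-word characters (raw string avoids an invalid escape sequence)
    text = re.split(r'\W+', text)
    return text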

stopword = ['ourselves', 'hers', 'between', 'yourself', 'but', 'again', 'there', 'about', 'once', 
             'during', 'out', 'very', 'having', 'with', 'they', 'own', 'an', 'be', 'some','for', 'do', 'its', 'yours', 
             'such', 'into', 'of', 'most', 'itself', 'other', 'off', 'is', 's', 'am', 'or', 'who', 'as', 'from', 'him', 
             'each', 'the', 'themselves', 'until', 'below', 'are', 'we', 'these', 'your', 'his', 'through', 'don', 'nor', 
             'me', 'were', 'her', 'more', 'himself', 'this', 'down', 'should', 'our', 'their', 'while', 'above', 'both', 'up', 
             'to', 'ours', 'had', 'she', 'all', 'no', 'when', 'at', 'any', 'before', 'them', 'same', 'and', 'been', 'have', 
             'in', 'will', 'on', 'does', 'yourselves', 'then', 'that', 'because', 'what', 'over', 'why', 'so', 'can', 'did', 
             'not', 'now', 'under', 'he', 'you', 'herself', 'has', 'just', 'where', 'too', 'only', 'myself', 'which', 'those', 
             'i', 'after', 'few', 'whom', 't', 'being', 'if', 'theirs', 'my', 'against', 'a', 'by', 'doing', 'it', 'how', 
             'further', 'was', 'here', 'than', ''] 

def rem_stop(text):
    text = tokenize(text)
    text = [word for word in text if word not in stopword]
    return text

def preprocess(text):
    text = punc_rem(text)
    text = rem_stop(text)
    return text

In [6]:
train_data['Processed_Text'] = train_data['text'].apply(preprocess)
validation_data['Processed_Text'] = validation_data['text'].apply(preprocess)
test_data['Processed_Text'] = test_data['text'].apply(preprocess)

### Vectorizing sentences with Bag of Words

Now that we have loaded and preprocessed our data, we need to vectorize our sentences, that is, convert them into numeric vectors before feeding them into our model.

We will be using a Bag of Words approach to vectorize our sentences. This is a simple approach that counts the number of times each word appears in a sentence. 

The element at index $i$ of the vector will be the number of times the $i^{\text{th}}$ word in our vocabulary appears in the sentence. So, for example, if our vocabulary is `["the", "cat", "sat", "on", "mat"]`, and our sentence is `"the cat sat on the mat"`, then our vector will be `[2, 1, 1, 1, 1]`.
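The counting rule above can be sketched in a few lines of plain Python. This is only an illustration of the idea, not the implementation used later in the notebook; the helper name `bag_of_words` is made up for this example.

```python
from collections import Counter

def bag_of_words(tokens, vocab):
    """Count how often each vocabulary word occurs in a list of tokens."""
    counts = Counter(tokens)
    # A Counter returns 0 for words that never occur in the tokens
    return [counts[word] for word in vocab]

example_vocab = ["the", "cat", "sat", "on", "mat"]
example_tokens = "the cat sat on the mat".split()
example_vector = bag_of_words(example_tokens, example_vocab)
print(example_vector)  # [2, 1, 1, 1, 1]
```

Words that are not in the vocabulary simply contribute nothing to the vector.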

We will now build a bag-of-words vectorizer for our sentences. This will involve creating

1. A vocabulary from our corpus

2. A mapping from words to indices in our vocabulary

3. A function to vectorize a sentence in the fashion described above

In [7]:
# Making the Vocabulary using the Train Data
all_texts = train_data['Processed_Text']

vocabulary = set()
for row in all_texts:
    for element in row:
        vocabulary.add(element)
        
vocabulary = list(vocabulary)
print(len(vocabulary))

15086


In [8]:
# Function to make a bag-of-words vector from a list of tokens
def vectorize(text,vocab):
    words = text
    bag_vector = np.zeros(len(vocab),dtype=np.uint16)
    for w in words:
        # Out-of-vocabulary words can occur in unseen data. For classification purposes, we simply ignore them.
        if w in vocab:
            ind = vocab.index(w)
            bag_vector[ind] += 1
    return bag_vector

# Function to vectorize a pandas data structure
def vectorize_pd(text_list,vocab):
    total_bags = []
    for sentence in text_list:   
        bag_vector = vectorize(sentence,vocab)
        total_bags.append(np.array(bag_vector))     
    return total_bags
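The `vectorize` function above looks words up with `vocab.index`, which scans the vocabulary list on every call. A word-to-index dictionary, the mapping mentioned in step 2, gives the same vector with constant-time lookups. The sketch below is one possible variant, assuming the vocabulary contains no duplicate words; `build_index` and `vectorize_with_index` are hypothetical names, and the result is a plain list that could be wrapped in `np.array` if needed.

```python
def build_index(vocab):
    """Map each vocabulary word to its position (hypothetical helper)."""
    return {word: i for i, word in enumerate(vocab)}

def vectorize_with_index(tokens, word_to_index):
    """Bag-of-words vector built with dictionary lookups."""
    bag_vector = [0] * len(word_to_index)
    for w in tokens:
        ind = word_to_index.get(w)
        if ind is not None:  # ignore out-of-vocabulary words
            bag_vector[ind] += 1
    return bag_vector

index = build_index(["the", "cat", "sat", "on", "mat"])
print(vectorize_with_index("the cat sat on the mat".split(), index))
```

With a vocabulary of roughly 15,000 words, avoiding a list scan per token can make a noticeable difference when vectorizing the whole training set.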

As a sanity check, we vectorize the example sentence above against the example vocabulary and confirm that the result is correct. Running the sentence through `preprocess` first removes the stopwords `the` and `on`, so their counts drop to zero.

We then vectorize the training, validation, and test data using the vocabulary built from the training set.

In [9]:
# Sanity check
example_vocab = ["the", "cat", "sat", "on", "mat"]
s = "the cat sat on the mat"
print(vectorize(tokenize(s),example_vocab))
print(vectorize(preprocess(s),example_vocab))

[2 1 1 1 1]
[0 1 1 0 1]


In [10]:
bags = vectorize_pd(train_data['Processed_Text'],vocabulary)
train_data['bow'] = bags

bags2 = vectorize_pd(validation_data['Processed_Text'],vocabulary)
validation_data['bow'] = bags2

bags3 = vectorize_pd(test_data['Processed_Text'],vocabulary)
test_data['bow'] = bags3

## Naive Bayes

### From Scratch

Now that we have vectorized our sentences, we can implement our Naive Bayes model. Recall that the Naive Bayes model is based on Bayes' Theorem:

$$
P(y \mid x) = \frac{P(x \mid y)P(y)}{P(x)}
$$

What we really want is to find the class $c$ that maximizes $P(c \mid x)$, so we can use the following equation:

$$
\hat{c} = \underset{c}{\text{argmax}} \ P(c \mid x) = \underset{c}{\text{argmax}} \ P(x \mid c)P(c)
$$

We can then use the Naive Bayes assumption, that the words are conditionally independent given the class, to simplify this:

$$
\hat{c} = \underset{c}{\text{argmax}} \ P(c \mid x) = \underset{c}{\text{argmax}} \ P(c) \prod_{i=1}^{n} P(x_i \mid c)
$$

where $x_i$ is the $i^{\text{th}}$ word in our sentence and $n$ is the number of words in the sentence.

All of these probabilities can be estimated from our training data. We can estimate $P(c)$ by counting the number of times each class appears in our training data, and dividing by the total number of training examples. We can estimate $P(x_i \mid c)$ by counting the number of times the word $x_i$ appears in sentences of class $c$, and dividing by the total number of words in sentences of class $c$.
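Written out, the two estimates described above are as follows. The symbols $N_c$ and $N_{\text{doc}}$ follow the variable names used in the code later on; $V$ (the vocabulary) and $\text{count}(w, c)$ (the number of times word $w$ appears in sentences of class $c$) are notation introduced here for illustration.

$$
\hat{P}(c) = \frac{N_c}{N_{\text{doc}}}, \qquad \hat{P}(w \mid c) = \frac{\text{count}(w, c)}{\sum_{w' \in V} \text{count}(w', c)}
$$

A word that never appears with class $c$ would get probability zero under this estimate. A common remedy, stated here as the textbook form rather than as this notebook's exact formula, is add-one (Laplace) smoothing:

$$
\hat{P}(w \mid c) = \frac{\text{count}(w, c) + 1}{\sum_{w' \in V} \text{count}(w', c) + |V|}
$$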

It would help to apply logarithms to the above equation so that we translate the product into a sum, and avoid underflow errors. This will give us the following equation:

$$
\hat{c} = \underset{c}{\text{argmax}} \ \log P(c) + \sum_{i=1}^{n} \log P(x_i \mid c)
$$
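To see why the logarithm matters, consider a small sketch with made-up numbers: 100 words, each with a hypothetical probability of $10^{-5}$. The direct product is too small for a 64-bit float, while the sum of logs is an ordinary number.

```python
import math

# Hypothetical per-word probabilities for a long sentence
probs = [1e-5] * 100

product = 1.0
for p in probs:
    product *= p

log_sum = sum(math.log(p) for p in probs)

print(product)  # underflows to 0.0
print(log_sum)  # roughly -1151.29
```

Because the logarithm is monotonic, the class with the largest log score is also the class with the largest probability, so the argmax is unchanged.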

We will now implement this algorithm. 

In [11]:
# Helper: for each column (vocabulary word) of a 2D array of bag-of-words vectors,
# count the number of rows (sentences) in which that word appears at least once
def counter(twod):
    return (np.asarray(twod) > 0).sum(axis=0).astype(float)

In [12]:
def trainNaiveBayes(D,C,vocab):
    logprior = dict()
    loglikelihood = dict()
    N_doc = len(D)

    # Repeating for all classes
    for c in C:
        # Log prior: fraction of training sentences that belong to class c
        N_c = (D['label'] == c).sum()
        logprior[c] = math.log(N_c/N_doc)

        # Getting per-word frequencies over the sentences of class c
        bigdoc = D.loc[D['label'] == c, 'bow'].values.tolist()
        counts = counter(bigdoc)
        total = counts.sum()

        for index, word in enumerate(vocab):
            count_w = counts[index]
            # Add-one smoothed count of the word, divided by the smoothed counts
            # of every other word in the vocabulary. This denominator leaves out
            # the word itself; the usual Laplace estimate divides by
            # total + len(vocab) instead.
            count_not_w = len(vocab) - 1 + total - count_w
            frac = (count_w+1)/count_not_w
            loglikelihood[(word,c)] = math.log(frac)

    return logprior,loglikelihood

def testNaiveBayes(testdoc,logprior,loglikelihood,C):
    sums = dict()
    for c in C:
        sums[c] = logprior[c]
        for word in testdoc:
            # Out-of-vocabulary words have no entry in loglikelihood and are skipped
            if (word,c) in loglikelihood:
                sums[c] += loglikelihood[(word,c)]

    # Return the class with the highest log score
    return max(sums, key=sums.get)
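For comparison, here is a minimal sketch of the textbook multinomial Naive Bayes estimate with add-one (Laplace) smoothing, run on made-up toy data. It differs from the implementation above in two respects: it counts every occurrence of a word rather than the number of sentences containing it, and it divides by the total count plus the vocabulary size. The names `train_nb` and `predict_nb` and the toy sentences are assumptions for illustration only.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels):
    """Multinomial Naive Bayes with add-one smoothing.

    docs is a list of token lists, labels the matching class labels.
    """
    vocab = sorted({w for doc in docs for w in doc})
    class_docs = Counter(labels)
    word_counts = defaultdict(Counter)
    for doc, c in zip(docs, labels):
        word_counts[c].update(doc)

    log_prior, log_likelihood = {}, {}
    for c, n_c in class_docs.items():
        log_prior[c] = math.log(n_c / len(docs))
        # All tokens seen in class c, plus one pseudo-count per vocabulary word
        denom = sum(word_counts[c].values()) + len(vocab)
        for w in vocab:
            log_likelihood[(w, c)] = math.log((word_counts[c][w] + 1) / denom)
    return log_prior, log_likelihood

def predict_nb(doc, log_prior, log_likelihood):
    """Return the class with the highest log score, skipping unknown words."""
    scores = {}
    for c in log_prior:
        scores[c] = log_prior[c] + sum(
            log_likelihood[(w, c)] for w in doc if (w, c) in log_likelihood
        )
    return max(scores, key=scores.get)

# Toy data, made up for illustration
toy_docs = [["happy", "glad"], ["happy", "joyful"], ["sad", "gloomy"], ["sad", "blue"]]
toy_labels = ["joy", "joy", "sadness", "sadness"]
toy_prior, toy_likelihood = train_nb(toy_docs, toy_labels)
toy_prediction = predict_nb(["happy", "blue", "happy"], toy_prior, toy_likelihood)
print(toy_prediction)  # joy
```

With this denominator, the smoothed word probabilities of each class sum to one, which is what makes them a proper distribution over the vocabulary.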

Using this implementation, we train a Naive Bayes model on the training data and generate predictions for the test set.

We report the accuracy, precision, recall, and F1 score of the model on the test data.
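As a reminder of how the weighted metrics are defined, the sketch below computes a confusion matrix and support-weighted precision and recall by hand on made-up labels. It follows the scikit-learn convention of true labels first (rows) and predictions second (columns), and it treats the precision of a class that is never predicted as 0, which mirrors scikit-learn's default handling. The function names are hypothetical; this is not scikit-learn's implementation.

```python
def confusion(y_true, y_pred, n_classes):
    """Confusion matrix with true labels on rows and predictions on columns."""
    m = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        m[t][p] += 1
    return m

def weighted_precision_recall(m):
    """Support-weighted precision and recall from a rows=true matrix."""
    n = len(m)
    total = sum(map(sum, m))
    precision = recall = 0.0
    for c in range(n):
        support = sum(m[c])                         # true examples of class c
        predicted = sum(m[r][c] for r in range(n))  # predictions of class c
        tp = m[c][c]
        if predicted:
            precision += support / total * tp / predicted
        if support:
            recall += support / total * tp / support
    return precision, recall

y_true = [0, 0, 0, 1, 1, 2]
y_pred = [0, 1, 1, 1, 1, 1]
toy_matrix = confusion(y_true, y_pred, 3)
toy_precision, toy_recall = weighted_precision_recall(toy_matrix)
print(toy_matrix)
print(toy_precision, toy_recall)
```

When the weights are the true class supports, weighted recall always equals accuracy, which is a quick way to check which argument was treated as the true labels.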

In [13]:
# Training model
classes = [0,1,2,3,4,5]
prior, likelihoods = trainNaiveBayes(train_data,classes,vocabulary)

# Making predictions on test data
test_features = (test_data['Processed_Text']).values.tolist()
correct = (test_data['label']).values.tolist()
prediction_list = []
for f in test_features:
    p = testNaiveBayes(f,prior,likelihoods,classes)
    prediction_list.append(p)


In [14]:
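# Note: scikit-learn metrics expect (y_true, y_pred). The predictions are passed
# first here, so the confusion matrix has predicted labels on its rows and true
# labels on its columns, and precision and recall are computed with the roles of
# predictions and true labels exchanged. Accuracy does not depend on the order.
print("Accuracy is: ", accuracy_score(prediction_list,correct))
print("Precision is: ", precision_score(prediction_list,correct,average='weighted'))
print("Recall is: ", recall_score(prediction_list,correct,average='weighted'))
print("F1 is: ", f1_score(prediction_list,correct,average='weighted'))
print("Confusion Matrix is: ")
print(confusion_matrix(prediction_list,correct))

Accuracy is:  0.7865
Precision is:  0.8747333462672048
Recall is:  0.7865
F1 is:  0.8133268336502549
Confusion Matrix is: 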
[[541  15  26  64  49  17]
 [ 33 673  80  32  33  32]
 [  0   5  48   0   0   0]
 [  4   0   4 173   7   0]
 [  3   2   1   6 135  14]
 [  0   0   0   0   0   3]]


### Using `sklearn`

Now that we have implemented our own Naive Bayes model, we will use the `sklearn` library to train a Naive Bayes model on the same data. Alongside this, we will use its implementation of the Bag of Words model, the `CountVectorizer` class, to vectorize our sentences. Note that `CountVectorizer` is applied to the raw `text` column here and does its own lowercasing and tokenization.

In [15]:
# Making Bag of Words for train and test data
corpus = train_data["text"]
vectorizer = CountVectorizer()
sk_train_vectors = vectorizer.fit_transform(corpus)
sk_vocab = vectorizer.get_feature_names_out()

sk_test_vectors = vectorizer.transform(test_data["text"])

In [16]:
# Training model
clf = MultinomialNB()
clf.fit(sk_train_vectors, train_data['label'])

# Making predictions on test data
pred = clf.predict(sk_test_vectors)
correct = (test_data['label']).values.tolist()

In [17]:
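# Note: as above, the predictions are passed first although scikit-learn expects
# (y_true, y_pred), so the confusion matrix rows correspond to predicted labels
# and precision and recall are computed with the two roles exchanged.
print("Accuracy is: ", accuracy_score(pred,correct))
print("Precision is: ", precision_score(pred,correct,average='weighted'))
print("Recall is: ", recall_score(pred,correct,average='weighted'))
print("F1 is: ", f1_score(pred,correct,average='weighted'))
print("Confusion Matrix is: ")
print(confusion_matrix(pred,correct))

Accuracy is:  0.7655
Precision is:  0.8783994550412492
Recall is:  0.7655
F1 is:  0.8007682644551712
Confusion Matrix is: 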
[[546  16  27  66  59  21]
 [ 29 674  91  46  39  32]
 [  0   2  36   0   0   0]
 [  2   1   4 156   7   0]
 [  4   2   1   7 119  13]
 [  0   0   0   0   0   0]]


