# PA1.2 Naive Bayes for Text Classification

### Introduction

In this notebook, you will implement a Naive Bayes model to classify sentences by the emotion they express.

The Naive Bayes model is a probabilistic model that uses Bayes' Theorem to calculate the probability of a label given some observed features. In this case, we will be using the Naive Bayes model to calculate the probability of a sentence belonging to a certain emotion given the words in the sentence.

For reference and additional details, please go through [Chapter 4](https://web.stanford.edu/~jurafsky/slp3/4.pdf) of the SLP3 book.


### Instructions

- Follow along with the notebook, filling out the necessary code where instructed.

- <span style="color: red;">Read the Submission Instructions, Plagiarism Policy, and Late Days Policy in the attached PDF.</span>

- <span style="color: red;">Make sure to run all cells for credit.</span>

- <span style="color: red;">Do not remove any pre-written code.</span>

- <span style="color: red;">You must attempt all parts.</span>

In [1]:
# import all required libraries here
import pandas as pd
import numpy as np
from datasets import load_dataset_builder
from datasets import load_dataset



## Loading and Preprocessing the Dataset

We will be working with the [dair-ai/emotion](https://huggingface.co/datasets/dair-ai/emotion) dataset. This contains 6 classes of emotions: `joy`, `sadness`, `anger`, `fear`, `love`, and `surprise`.

Instead of downloading the dataset manually, we will use the [`datasets`](https://huggingface.co/docs/datasets) library to download it for us. This is a library in the HuggingFace ecosystem that allows us to easily download and use datasets for NLP tasks. Beyond downloading the dataset, it also provides a standard interface for accessing the data, which makes it easy to use with other libraries like Pandas and PyTorch. You can take a look at the huge list of datasets available [here](https://huggingface.co/datasets).

In the following cells,

1. Load in the dataset (It should already be split into train, validation, and test sets.)

2. Define a dictionary mapping the emotion labels to integers. You can find these on the dataset page linked above.

3. Format each split of the dataset into a Pandas DataFrame. The columns should be `text` and `label`, where `text` is the sentence and `label` is the emotion label.

In [2]:
from datasets import load_dataset_builder
ds_builder = load_dataset_builder("dair-ai/emotion")

# Inspect dataset description
ds_builder.info.description


# Inspect dataset features
ds_builder.info.features


You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this dataset from the next major release of `datasets`.


{'text': Value(dtype='string', id=None),
 'label': ClassLabel(names=['sadness', 'joy', 'love', 'anger', 'fear', 'surprise'], id=None)}

In [3]:
from datasets import load_dataset

dataset = load_dataset("dair-ai/emotion", trust_remote_code=True)


In [4]:
dataset

# Load train split into a pandas DataFrame
train_df = pd.DataFrame(dataset['train'])
print("Train DataFrame:")
print(train_df.head())

# Load validation split into a pandas DataFrame
validation_df = pd.DataFrame(dataset['validation'])
print("\nValidation DataFrame:")
print(validation_df.head())

# Load test split into a pandas DataFrame
test_df = pd.DataFrame(dataset['test'])
print("\nTest DataFrame:")
print(test_df.head())

Train DataFrame:
                                                text  label
0                            i didnt feel humiliated      0
1  i can go from feeling so hopeless to so damned...      0
2   im grabbing a minute to post i feel greedy wrong      3
3  i am ever feeling nostalgic about the fireplac...      2
4                               i am feeling grouchy      3

Validation DataFrame:
                                                text  label
0  im feeling quite sad and sorry for myself but ...      0
1  i feel like i am still looking at a blank canv...      0
2                     i feel like a faithful servant      2
3                  i am just feeling cranky and blue      3
4  i can have for a treat or if i am feeling festive      1

Test DataFrame:
                                                text  label
0  im feeling rather rotten so im not very ambiti...      0
1          im updating my blog because i feel shitty      0
2  i never make her separate from me becaus

In [5]:
dataset_map = {0:'sadness', 1:'joy', 2:'love', 3:'anger', 4:'fear', 5:'surprise'}
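
The dictionary above maps integer ids to emotion names. Where the opposite direction is needed (name to id), it can be derived rather than typed by hand. A minimal sketch, assuming the class order reported by the dataset's `ClassLabel` feature; `id2label` and `label2id` are illustrative names, not part of the notebook:

```python
# Class names in the order reported by the dataset's ClassLabel feature
names = ['sadness', 'joy', 'love', 'anger', 'fear', 'surprise']

# Integer id -> emotion name (same content as dataset_map)
id2label = dict(enumerate(names))

# Emotion name -> integer id
label2id = {name: idx for idx, name in id2label.items()}

print(label2id['anger'])  # 3
```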

Now that we've gotten a feel for the dataset, we might want to do some cleaning or preprocessing before continuing. For example, we might want to remove punctuation and other non-alphanumeric characters, lowercase all the text, strip away extra whitespace, and remove stopwords.

In the cell below, write a function that performs the steps described above. You can use the `re` library to help with the character filtering, and the `nltk` library to help with removing stopwords.

Once you are done, you can simply `apply` this function to the `text` column of the dataset to get the preprocessed text.
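
One possible shape for such a function, using `re` for the character filtering, is sketched below. The name `clean_text` and the tiny hard-coded stopword set are assumptions made to keep the example self-contained; in the notebook the stopwords would come from NLTK's English list.

```python
import re

# Tiny illustrative stopword set (an assumption); the notebook uses NLTK's English list
STOP_WORDS = {"i", "a", "an", "the", "and", "to", "so", "am", "from"}

def clean_text(text):
    # Lowercase first so the character class below only needs a-z
    text = text.lower()
    # Drop everything that is not a letter, digit, or whitespace
    text = re.sub(r"[^a-z0-9\s]", "", text)
    # split() discards extra whitespace; filter stopwords and rejoin
    return " ".join(word for word in text.split() if word not in STOP_WORDS)

cleaned = clean_text("  I didn't   feel HUMILIATED!! ")
print(cleaned)  # didnt feel humiliated
```

With pandas, the same function would be applied column-wise, for example `df['text'].apply(clean_text)`.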

In [6]:
# code here
import nltk
from nltk.corpus import stopwords
import string

train_df['text'] = train_df['text'].astype(str)
validation_df['text'] = validation_df['text'].astype(str)
test_df['text'] = test_df['text'].astype(str)

nltk.download('stopwords')



[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\sehar\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [7]:
# Build the stopword set once instead of on every call
STOP_WORDS = set(stopwords.words('english'))

def preprocess_text(text):
    # Lowercase so that stopword matching is case-insensitive
    text = text.lower()
    # Keep only letters and whitespace (this drops punctuation and digits)
    text = ''.join(char for char in text if char.isalpha() or char.isspace())
    # split() also collapses extra whitespace; drop stopwords and rejoin
    text = ' '.join(word for word in text.split() if word not in STOP_WORDS)

    return text


In [8]:
print(train_df.columns)
print(type(train_df))

validation_df['processed_text'] = validation_df['text'].apply(preprocess_text)
train_df['processed_text'] = train_df['text'].apply(preprocess_text)


Index(['text', 'label'], dtype='object')
<class 'pandas.core.frame.DataFrame'>


In [9]:
print(validation_df.columns)

Index(['text', 'label', 'processed_text'], dtype='object')


In [10]:
print("Train DataFrame:")
print(train_df.iloc[0])

print(train_df.iloc[5])

Train DataFrame:
text              i didnt feel humiliated
label                                   0
processed_text      didnt feel humiliated
Name: 0, dtype: object
text              ive been feeling a little burdened lately wasn...
label                                                             0
processed_text        ive feeling little burdened lately wasnt sure
Name: 5, dtype: object


### Vectorizing sentences with Bag of Words

Now that we have loaded and preprocessed our data, we need to vectorize our sentences: the model works on numbers, so each sentence must be turned into a numeric vector before we can feed it in.

We will be using a Bag of Words approach to vectorize our sentences. This is a simple approach that counts the number of times each word appears in a sentence. 

The element at index $i$ of the vector will be the number of times the $i^{\text{th}}$ word in our vocabulary appears in the sentence. So, for example, if our vocabulary is `["the", "cat", "sat", "on", "mat"]`, and our sentence is `"the cat sat on the mat"`, then our vector will be `[2, 1, 1, 1, 1]`.
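
The example can be checked by hand with a few lines of plain Python. This is only a sketch of the counting step, not the `BagOfWords` class you are asked to write:

```python
from collections import Counter

vocabulary = ["the", "cat", "sat", "on", "mat"]
sentence = "the cat sat on the mat"

# Count every token once, then read the counts off in vocabulary order
token_counts = Counter(sentence.split())
vector = [token_counts[word] for word in vocabulary]

print(vector)  # [2, 1, 1, 1, 1]
```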

You will now create a `BagOfWords` class to vectorize our sentences. This will involve creating

1. A vocabulary from our corpus

2. A mapping from words to indices in our vocabulary

3. A function to vectorize a sentence in the fashion described above

It may help you to define something along the lines of a `fit` and a `vectorize` method.

In [11]:
# BagOfWords class
class BagOfWords:
    def __init__(self):
        self.vocabulary = None

    def fit(self, corpus):
        # Build vocabulary from the corpus
        unique_words = set(word for sentence in corpus for word in sentence.split())
        # Assign each unique word an index in the vocabulary
        self.vocabulary = {word: i for i, word in enumerate(unique_words)}

    def vectorize(self, sentences):
        # Vectorize each sentence into a Bag of Words count vector
        vectors = []
        for sentence in sentences:
            # Start from all zeros, then increment the slot of every known word;
            # words that are not in the vocabulary are ignored
            vector = [0] * len(self.vocabulary)
            for word in sentence.split():
                index = self.vocabulary.get(word)
                if index is not None:
                    vector[index] += 1
            vectors.append(vector)

        return vectors

For a sanity check, you can manually set the vocabulary of your `BagOfWords` object to the vocabulary of the example above, and check that the vectorization of the sentence is correct.

Once you have implemented the `BagOfWords` class, fit it to the training data, and vectorize the training, validation, and test data.

In [12]:
# code here
# SANITY CHECK


data = {'text': ["the cat sat on the mat", "the dog barked", "the cat chased the mouse"],
        'labels': [1, 0, 1]}
df = pd.DataFrame(data)

corpus = df['text'].astype(str).values


# Initialize and fit the Bag of Words model
bow_model = BagOfWords()
bow_model.fit(corpus)

# Vectorize the sentences in the DataFrame
vec = np.array(bow_model.vectorize(corpus))

print(vec)



[[1 1 2 0 0 1 0 0 1]
 [0 0 1 0 0 0 1 1 0]
 [1 0 2 1 1 0 0 0 0]]


In [13]:

vectorizer = BagOfWords()
corpus1 = train_df['processed_text'].astype(str).values
corpus2 = validation_df['processed_text'].astype(str).values

vectorizer.fit(corpus1)

train_vec = vectorizer.vectorize(corpus1)
train_vec = np.array(train_vec)
validation_vec = vectorizer.vectorize(corpus2)
validation_vec = np.array(validation_vec)

## Naive Bayes

### From Scratch

Now that we have vectorized our sentences, we can implement our Naive Bayes model. Recall that the Naive Bayes model is based on Bayes' Theorem:

$$
P(y \mid x) = \frac{P(x \mid y)P(y)}{P(x)}
$$

What we really want is to find the class $c$ that maximizes $P(c \mid x)$, so we can use the following equation:

$$
\hat{c} = \underset{c}{\text{argmax}} \ P(c \mid x) = \underset{c}{\text{argmax}} \ P(x \mid c)P(c)
$$

We can then use the Naive Bayes assumption to simplify this:

$$
\hat{c} = \underset{c}{\text{argmax}} \ P(c \mid x) = \underset{c}{\text{argmax}} \ P(c) \prod_{i=1}^{n} P(x_i \mid c)
$$

Where $x_i$ is the $i^{\text{th}}$ word in our sentence.

All of these probabilities can be estimated from our training data. We can estimate $P(c)$ by counting the number of times each class appears in our training data, and dividing by the total number of training examples. We can estimate $P(x_i \mid c)$ by counting the number of times the $i^{\text{th}}$ word in our vocabulary appears in sentences of class $c$, and dividing by the total number of words in sentences of class $c$.
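
Written out, with $N_c$ the number of training sentences of class $c$, $N_{doc}$ the total number of training sentences, and $V$ the vocabulary, the estimates described above are:

$$
\hat{P}(c) = \frac{N_c}{N_{doc}}, \qquad \hat{P}(x_i \mid c) = \frac{\text{count}(x_i, c)}{\sum_{w \in V} \text{count}(w, c)}
$$

A word that never occurs with class $c$ in training would get probability zero and wipe out the whole product. The usual remedy, and the one used in the SLP3 pseudocode, is add-one (Laplace) smoothing: add 1 to every count, and add $|V|$ to the denominator so the probabilities still sum to one.

$$
\hat{P}(x_i \mid c) = \frac{\text{count}(x_i, c) + 1}{\sum_{w \in V} \text{count}(w, c) + |V|}
$$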

It would help to apply logarithms to the above equation so that we translate the product into a sum, and avoid underflow errors. This will give us the following equation:

$$
\hat{c} = \underset{c}{\text{argmax}} \ \log P(c) + \sum_{i=1}^{n} \log P(x_i \mid c)
$$
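
To see why the logarithm matters, consider multiplying many small probabilities directly. The numbers below are made up purely for illustration:

```python
import math

# 1000 words, each with likelihood 1e-5 under some class (made-up numbers)
word_probs = [1e-5] * 1000

# Direct product: falls below the smallest representable float and becomes 0.0
product = 1.0
for p in word_probs:
    product *= p

# Sum of logs: the same quantity in log space stays finite
log_sum = sum(math.log(p) for p in word_probs)

print(product)  # 0.0
print(log_sum)
```

Once every class's product has collapsed to `0.0`, the argmax can no longer tell the classes apart, whereas the log-space sums remain comparable.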

You will now implement this algorithm. It would help to go through [this chapter from SLP3](https://web.stanford.edu/~jurafsky/slp3/4.pdf) to get a better understanding of the model - **it is recommended that you base your implementation on the pseudocode provided on Page 6**. You can either make a `NaiveBayes` class, or just implement the algorithm across two functions.

<span style="color: red;"> For this part, the only external library you will need is `numpy`. You are not allowed to use anything else.</span>

In [14]:
#Train SET
train_x = train_vec
train_y = train_df['label']
#VAL SET
val_x = validation_vec
val_y = validation_df['label']

In [15]:
print(type(train_x))
train_x

<class 'numpy.ndarray'>


array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])

In [16]:
import numpy as np

class NaiveBayes:
    def __init__(self):
        self.prior = None
        self.conditional_log_prob = None
        self.classes = None

    # Fit the classifier on training data: estimates log P(y) and log P(x_i | y)
    def fit(self, input_data, output_labels):

        # Get the number of samples and features in input data
        num_samples, num_features = input_data.shape

        self.classes = np.unique(output_labels)  # Unique classes in output data or unique outcomes in y
        num_classes = len(self.classes)  # Number of classes

        # Log prior probability of each class, log P(y)
        self.prior = np.zeros(num_classes)
        
        # First initialize an array of size (num_classes x num_features)
        self.conditional_log_prob = np.zeros((num_classes, num_features))

        for i, class_label in enumerate(self.classes):  # Iterate over each unique outcome in y

            # Count the number of samples/instances with label 'class_label'
            num_samples_in_class = np.sum(output_labels == class_label)

            # Calculate the prior probability of the current class
            prior_prob_of_class = num_samples_in_class / num_samples

            # Store the log prior probability of the current class in the array
            self.prior[i] = np.log(prior_prob_of_class)

            # Extract the features of all samples with class 'class_label'.
            features_in_class = input_data[output_labels == class_label]

            # Calculate the frequency of each feature in class 'class_label'.
            # We add 1 to each frequency (Laplace smoothing) to avoid zero probabilities.
            feature_frequencies = features_in_class.sum(axis=0) + 1

            # Denominator of the smoothed estimate: the total count of all feature
            # occurrences in class 'class_label' plus the number of features.
            denominator = np.sum(features_in_class) + num_features

            # Calculate the conditional probability of each feature for class 'class_label' by dividing
            # the feature frequencies by the denominator and taking the logarithm.
            conditional_log_probabilities = np.log(feature_frequencies / denominator)

            # Set the conditional probability of each feature for class 'class_label' in the
            # conditional_log_prob array.
            self.conditional_log_prob[i, :] = conditional_log_probabilities

    # Predict class labels of new input data
    def predict(self, new_input_data):
        predicted_labels = []
        for x in new_input_data:
            posterior_log_probs = []  # log P(y | X), up to a constant shared by all classes
            # Naive assumption (independence):
            # P(x1, x2 | Y) = P(x1 | Y) * P(x2 | Y)

            for i, class_label in enumerate(self.classes):

                prior_log_prob = self.prior[i]
                # Log-likelihood of the features given class 'class_label':
                # each word's log probability is weighted by its count in x
                conditional_log_prob = np.sum(self.conditional_log_prob[i, :] * x)
                # Unnormalized log posterior of class 'class_label'
                posterior_log_probs.append(prior_log_prob + conditional_log_prob)

            # Predict class with the highest posterior probability
            predicted_labels.append(self.classes[np.argmax(posterior_log_probs)])

        return predicted_labels
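
The per-sentence loop in `predict` computes, for every class, the log prior plus a count-weighted sum of word log probabilities. The same scores for all sentences and classes can be obtained with a single matrix product, which is usually much faster on a large vocabulary. A minimal sketch with made-up toy parameters; `predict_class_indices` is a hypothetical helper, not part of the class above:

```python
import numpy as np

def predict_class_indices(counts, log_prior, log_likelihood):
    # counts:         (n_samples, n_features) Bag of Words matrix
    # log_prior:      (n_classes,)            log P(c)
    # log_likelihood: (n_classes, n_features) log P(x_i | c)
    # scores[s, c] = log P(c) + sum_i counts[s, i] * log P(x_i | c)
    scores = counts @ log_likelihood.T + log_prior
    return np.argmax(scores, axis=1)

# Toy parameters: 2 classes, 3 vocabulary words (made-up numbers)
log_prior = np.log(np.array([0.5, 0.5]))
log_likelihood = np.log(np.array([[0.7, 0.2, 0.1],
                                  [0.1, 0.2, 0.7]]))
counts = np.array([[2, 1, 0],
                   [0, 1, 3]])

predictions = predict_class_indices(counts, log_prior, log_likelihood)
print(predictions)  # [0 1]
```

The result is an index into the class array, so with the class above the label would be recovered as `self.classes[index]`.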


In [17]:
model = NaiveBayes()
model.fit(train_x,train_y)

Now use your implementation to train a Naive Bayes model on the training data, and generate predictions for the Validation Set.

Report the Accuracy, Precision, Recall, and F1 score of your model on the validation data. Also display the Confusion Matrix. You are allowed to use `sklearn.metrics` for this.

In [18]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix


# Predict on the validation set
y_pred_val = model.predict(val_x)

# Calculate metrics
accuracy = accuracy_score(val_y, y_pred_val)
precision = precision_score(val_y, y_pred_val, average='weighted')
recall = recall_score(val_y, y_pred_val, average='weighted')
f1 = f1_score(val_y, y_pred_val, average='weighted')
conf_matrix = confusion_matrix(val_y, y_pred_val)

# Display the metrics and confusion matrix
print(f"Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1 Score: {f1:.4f}")

print("\nConfusion Matrix:")
print(conf_matrix)


Accuracy: 0.7875
Precision: 0.8103
Recall: 0.7875
F1 Score: 0.7642

Confusion Matrix:
[[519  20   1   5   5   0]
 [ 31 668   3   2   0   0]
 [ 38  77  60   2   1   0]
 [ 52  30   0 189   4   0]
 [ 49  24   0   8 129   2]
 [ 31  30   0   1   9  10]]
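
The four reported scores can be recomputed from the confusion matrix alone, which is a useful check on what `average='weighted'` means: each per-class score is weighted by the number of true examples of that class. A sketch in plain Python, using the matrix printed above (rows are true labels, columns are predictions):

```python
# Confusion matrix from the output above: rows are true labels, columns are predictions
conf = [
    [519,  20,   1,   5,   5,   0],
    [ 31, 668,   3,   2,   0,   0],
    [ 38,  77,  60,   2,   1,   0],
    [ 52,  30,   0, 189,   4,   0],
    [ 49,  24,   0,   8, 129,   2],
    [ 31,  30,   0,   1,   9,  10],
]
n = len(conf)
support = [sum(row) for row in conf]                               # true count per class
predicted = [sum(conf[r][c] for r in range(n)) for c in range(n)]  # predicted count per class
total = sum(support)

accuracy = sum(conf[k][k] for k in range(n)) / total

# Per-class scores, then an average weighted by each class's support
precision = [conf[k][k] / predicted[k] for k in range(n)]
recall = [conf[k][k] / support[k] for k in range(n)]
f1 = [2 * p * r / (p + r) for p, r in zip(precision, recall)]

def weighted(scores):
    return sum(s * w for s, w in zip(scores, support)) / total

print(f"{accuracy:.4f} {weighted(precision):.4f} {weighted(recall):.4f} {weighted(f1):.4f}")
# 0.7875 0.8103 0.7875 0.7642
```

Note that support-weighted recall always equals accuracy, which is why those two numbers coincide in the report. The matrix also shows where the model struggles: most `love` and `surprise` sentences (rows 2 and 5) are predicted as `sadness` or `joy`, the two largest classes.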


### Using `sklearn`

Now that you have implemented your own Naive Bayes model, you will use the `sklearn` library to train a Naive Bayes model on the same data. Alongside this, you will use their implementation of the Bag of Words model, the `CountVectorizer` class, to vectorize your sentences.

You can use the `MultinomialNB` class to train a Naive Bayes model. Go through the relevant documentation to figure out how to use it, and how it differs from the model you implemented.

When you finish training your model, report the same metrics as above on the Validation Set.

In [19]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix


# Initialize and fit the CountVectorizer
vectorizer = CountVectorizer(stop_words='english')
X_train_sklearn = vectorizer.fit_transform(train_df['processed_text'])
X_validation_sklearn = vectorizer.transform(validation_df['processed_text'])

# Initialize and fit the Multinomial Naive Bayes model
nb_model_sklearn = MultinomialNB()
nb_model_sklearn.fit(X_train_sklearn, train_df['label'])

# Predict on the validation set
y_pred_validation_sklearn = nb_model_sklearn.predict(X_validation_sklearn)

# Calculate metrics
accuracy_sklearn = accuracy_score(validation_df['label'], y_pred_validation_sklearn)
precision_sklearn = precision_score(validation_df['label'], y_pred_validation_sklearn, average='weighted')
recall_sklearn = recall_score(validation_df['label'], y_pred_validation_sklearn, average='weighted')
f1_sklearn = f1_score(validation_df['label'], y_pred_validation_sklearn, average='weighted')
conf_matrix_sklearn = confusion_matrix(validation_df['label'], y_pred_validation_sklearn)

# Display the metrics and confusion matrix
print("Metrics using sklearn:")
print(f"Accuracy: {accuracy_sklearn:.4f}")
print(f"Precision: {precision_sklearn:.4f}")
print(f"Recall: {recall_sklearn:.4f}")
print(f"F1 Score: {f1_sklearn:.4f}")

print("\nConfusion Matrix:")
print(conf_matrix_sklearn)


Metrics using sklearn:
Accuracy: 0.7975
Precision: 0.8170
Recall: 0.7975
F1 Score: 0.7785

Confusion Matrix:
[[515  24   0   4   6   1]
 [ 29 664   5   4   2   0]
 [ 32  73  69   3   1   0]
 [ 51  27   0 195   2   0]
 [ 44  22   0   6 139   1]
 [ 27  29   0   1  11  13]]
