# Naive Bayes Text Classifier

## Overview

Naive Bayes is a probabilistic classification algorithm based on Bayes' Theorem, commonly used in Natural Language Processing (NLP) tasks such as spam detection, sentiment analysis, and document classification. It assumes that features (words in our case) are conditionally independent given the class label. This "naive" assumption of independence simplifies calculations and makes the model computationally efficient, even for large datasets.

## Bayes' Theorem

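Bayes' Theorem provides a way to calculate the posterior probability of a class given observed data. Combined with the conditional independence assumption, it takes the form:

#### P(C|X) = P(C) * Prod(P(Xi|C)) / P(X) ; i = 1 to n (number of features, i.e. words, in X)

- P(C|X) is the posterior probability of class C given the document X
- P(C) is the prior probability of class C
- P(Xi|C) is the probability of the ith feature (word) Xi given class C
- P(X) is the probability of the observed document; it is the same for every class, so it can be ignored when comparing classes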
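
To make the formula concrete, here is a minimal sketch with made-up numbers. The priors and word likelihoods are illustrative assumptions, not values estimated from the IMDB data, and `posterior` is a hypothetical helper. It scores a two-word document under two classes and normalizes by P(X).

```python
# Toy numbers chosen only for illustration (not estimated from IMDB).
priors = {"pos": 0.5, "neg": 0.5}
likelihoods = {
    "pos": {"great": 0.20, "boring": 0.02},
    "neg": {"great": 0.05, "boring": 0.10},
}

def posterior(words):
    # Numerator of Bayes' Theorem: P(C) * product of P(x_i | C).
    joint = {}
    for c in priors:
        score = priors[c]
        for w in words:
            score *= likelihoods[c][w]
        joint[c] = score
    # P(X) is the sum of the numerators over all classes.
    evidence = sum(joint.values())
    return {c: joint[c] / evidence for c in joint}

post = posterior(["great", "great"])
print(post)
```

Because P(X) is shared by both classes, the class with the larger numerator is also the class with the larger posterior.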

### Dataset used:
Stanford NLP's IMDB dataset:
`Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. (2011). Learning Word Vectors for Sentiment Analysis. The 49th Annual Meeting of the Association for Computational Linguistics (ACL 2011).`

### Steps
1. **Preprocessing**: Tokenize text and build a vocabulary of unique words.
2. **Feature Representation**: Represent each document as a Bag of Words (BoW) vector.
3. **Model Training**: Implement Naive Bayes to estimate the probabilities for each class and each word given a class.
4. **Prediction and Evaluation**: Use the model to classify new documents and evaluate its performance.
5. **Comparison**: Compare the custom implementation with `scikit-learn`'s `MultinomialNB` to observe differences in efficiency and accuracy.
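
Steps 1 and 2 can be sketched on a toy corpus with the standard library alone. This is only an illustration: it uses whitespace splitting instead of a real tokenizer, and `toy_docs`, `toy_vocab`, and `bow` are hypothetical names.

```python
from collections import Counter

toy_docs = ["good movie good plot", "bad movie"]  # hypothetical documents

# Step 1: whitespace tokenization and a sorted vocabulary of unique words.
tokenized = [d.lower().split() for d in toy_docs]
toy_vocab = sorted({w for doc in tokenized for w in doc})

# Step 2: one count vector per document, aligned with the vocabulary order.
def bow(tokens):
    counts = Counter(tokens)
    return [counts.get(w, 0) for w in toy_vocab]

vectors = [bow(t) for t in tokenized]
print(toy_vocab)  # ['bad', 'good', 'movie', 'plot']
print(vectors)    # [[0, 2, 1, 1], [1, 0, 1, 0]]
```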

### Discussion

Naive Bayes classifiers are popular for text classification because of their simplicity and efficiency, especially with BoW data. While the independence assumption is rarely true for natural language data, Naive Bayes still performs well in practice. This project demonstrates how the model's assumptions and mechanics work under the hood, and it provides insight into how simple probabilistic methods can be effective for text classification.

---

Through this project, we gain practical experience with both the fundamentals of the Naive Bayes algorithm and text classification workflows. By implementing the classifier from scratch and comparing it with a pre-built library, we gain a deeper understanding of model assumptions, optimizations, and real-world applications of machine learning in NLP.


In [1]:
from datasets import load_dataset
import numpy as np
from collections import defaultdict
import nltk
from nltk.corpus import stopwords
from sklearn.model_selection import train_test_split



In [2]:
imdb = load_dataset("imdb") # loading the dataset
imdb

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})

In [3]:
# Extract the texts and labels, then shuffle and split the original training set into our own train (20,000) and test (5,000) sets
train_X = imdb['train']['text']
train_Y = imdb['train']['label']
train_X, test_X, train_Y, test_Y = train_test_split(
    train_X, train_Y, train_size=20000, shuffle=True
) 


In [4]:
test_X = test_X[:100]
test_Y = test_Y[:100]

In [5]:
# Lowercases each text, tokenizes it, and removes stop words to reduce noise in the dataset.
# Returns the vocabulary (set of unique tokens) and the cleaned texts.
def preprocess(texts):
    stop_words = set(stopwords.words("english"))
    vocab = set()
    processed_texts = []

    for i, text in enumerate(texts):
        print(i, end='\r')
        words = [
            word for word in nltk.word_tokenize(text.lower()) if word not in stop_words
        ]
        vocab.update(words)  # Adds words to vocab
        processed_texts.append(" ".join(words))

    return vocab, processed_texts
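vocab, processed_texts = preprocess(train_X)
# Note: this second call rebinds `vocab`, so from here on it holds the vocabulary of the 100 test reviews, not of the training set.
vocab, processed_texts_test = preprocess(test_X)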


In [6]:
vocab = list(vocab)
vocab_size = len(vocab)

In [7]:
# It is more efficient to use the existing CountVectorizer than to implement the counting by hand, as the vocabulary is quite big
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(vocabulary=vocab)
bows = vectorizer.transform(processed_texts).toarray()
bows_test = vectorizer.transform(processed_texts_test).toarray()

In [8]:
X = np.array(bows)
Y = np.array(train_Y)
X_test = np.array(bows_test)
Y_test = np.array(test_Y)

In [9]:
# This class implements Naive Bayes
class NaiveBayes:
    def __init__(self):
        self.class_prior = {} # stores P(C) for every class C
        self.likelihood = defaultdict(lambda: defaultdict(lambda: 0)) # stores P(word|C) for every word index and class C
        self.classes = [] # unique class labels
        self.class_word_counts = defaultdict(int) # total number of word occurrences in the documents of each class
        self.vocab_size = 0

    def train(self, X, Y):
        self.classes = np.unique(Y) # Unique labels are the classes
        self.vocab_size = X.shape[1] # X has shape (number of samples, vocab_size)

        for c in self.classes:
            self.class_prior[c] = np.mean(Y == c) # Fraction of samples with label c, i.e. the probability of occurrence of this class in the dataset

        for c in self.classes:
            class_sample = X[Y == c] # Samples whose label is c
            self.class_word_counts[c] = np.sum(class_sample) # Total number of word occurrences in class c

            for word_idx in range(self.vocab_size):
                # Count of this word across all documents of class c, plus 1
                # (add-one / Laplace smoothing) so that words unseen in a class
                # do not get zero probability.
                self.likelihood[c][word_idx] = (
                    np.sum(class_sample[:, word_idx]) + 1
                )

            # Normalize by the total word count of class c plus vocab_size.
            # Adding vocab_size matches the +1 added to every word above,
            # so the smoothed probabilities for each class sum to 1.
            self.likelihood[c] = {
                word_idx: self.likelihood[c][word_idx]
                / (self.class_word_counts[c] + self.vocab_size)
                for word_idx in self.likelihood[c]
            }
    
    
    def predict(self, X):
        predictions = []
        for x in X:
            class_probs = {}
            for c in self.class_prior:
                log_prob = np.log(self.class_prior[c]) # work in log space to avoid floating-point underflow

                for word_idx in range(len(x)):
                    if x[word_idx] > 0:
                        # Add log probabilities instead of multiplying probabilities.
                        # Each word present in the document contributes once, regardless of its count.
                        log_prob += np.log(self.likelihood[c][word_idx])

                class_probs[c] = log_prob

            predictions.append(max(class_probs, key=class_probs.get))

        return np.array(predictions)
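
Two details of the class above are easy to check on a toy case: add-one smoothing keeps unseen words from zeroing out a class, and summing logs avoids the underflow that a direct product of many small probabilities runs into. The counts and the probability `p` below are made up for illustration.

```python
import math

# Hypothetical word counts for one class over a 3-word vocabulary.
counts = [3, 0, 1]
vocab_size = len(counts)
total = sum(counts)

# Add-one (Laplace) smoothing: (count + 1) / (total + vocab_size).
smoothed = [(c + 1) / (total + vocab_size) for c in counts]
print(smoothed)  # the unseen word gets 1/7 instead of 0

# Multiplying many small probabilities underflows to 0.0 in floating point,
# while the equivalent sum of logs stays finite.
p = 1e-5
product = 1.0
log_sum = 0.0
for _ in range(100):
    product *= p
    log_sum += math.log(p)
print(product, log_sum)
```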

In [10]:
nb_classifier = NaiveBayes()
nb_classifier.train(X, Y)
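
The per-word Python loop in `train` can also be written with array operations. The sketch below is an alternative formulation under the same add-one smoothing, not the code used for the results in this notebook; `fit_counts` and `predict_binary` are hypothetical helper names. `predict_binary` mirrors the class above by counting each present word once.

```python
import numpy as np

def fit_counts(X, Y):
    """Return classes, log priors, and add-one-smoothed log likelihoods."""
    classes = np.unique(Y)
    log_prior = np.log(np.array([np.mean(Y == c) for c in classes]))
    # Row k holds the per-word counts summed over documents of class k.
    word_counts = np.vstack([X[Y == c].sum(axis=0) for c in classes])
    totals = word_counts.sum(axis=1, keepdims=True)
    log_likelihood = np.log((word_counts + 1) / (totals + X.shape[1]))
    return classes, log_prior, log_likelihood

def predict_binary(X, classes, log_prior, log_likelihood):
    # Each word present in a document contributes its log likelihood once.
    present = (X > 0).astype(float)
    scores = present @ log_likelihood.T + log_prior
    return classes[np.argmax(scores, axis=1)]

# Toy data: columns are counts for the words ["good", "bad", "movie"].
X_toy = np.array([[2, 0, 1], [1, 0, 1], [0, 2, 1], [0, 1, 1]])
Y_toy = np.array([1, 1, 0, 0])
params = fit_counts(X_toy, Y_toy)
toy_pred = predict_binary(np.array([[1, 0, 1], [0, 3, 0]]), *params)
print(toy_pred)  # [1 0]
```

Replacing `present` with the raw counts `X` would weight each word by how often it occurs, which is the usual multinomial scoring rule.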

In [11]:
predictions = nb_classifier.predict(X_test)
accuracy = np.mean(predictions == Y_test)
print("Accuracy of implemented Naive Bayes Classifier: ", accuracy)

Accuracy of implemented Naive Bayes Classifier:  0.88
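
Accuracy alone hides which kind of error the classifier makes. Below is a minimal sketch of precision and recall, computed on hypothetical labels rather than on this notebook's predictions; `confusion_counts` is an illustrative helper, not part of the code above.

```python
def confusion_counts(y_true, y_pred, positive=1):
    # Count true/false positives and negatives for the chosen positive label.
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    return tp, fp, fn, tn

y_true = [1, 1, 0, 0, 1, 0]  # hypothetical labels
y_pred = [1, 0, 0, 1, 1, 0]  # hypothetical predictions
tp, fp, fn, tn = confusion_counts(y_true, y_pred)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
print(tp, fp, fn, tn)  # 2 1 1 2
print(precision, recall)
```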


In [30]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report

nb_model = MultinomialNB()

nb_model.fit(X, Y)

test_predictions = nb_model.predict(X_test)

print("Test Accuracy:", accuracy_score(Y_test, test_predictions))


Test Accuracy: 0.87
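
With only 100 test reviews, the two accuracies above (0.88 and 0.87) are within sampling noise of each other. A rough check, under the simplifying assumption that each prediction is an independent Bernoulli trial, uses the binomial standard error sqrt(p(1-p)/n). `accuracy_std_error` is a hypothetical helper.

```python
import math

def accuracy_std_error(acc, n):
    # Binomial standard error of an accuracy estimated from n samples.
    return math.sqrt(acc * (1 - acc) / n)

se_custom = accuracy_std_error(0.88, 100)
se_sklearn = accuracy_std_error(0.87, 100)
print(round(se_custom, 3), round(se_sklearn, 3))
```

Both standard errors are roughly 0.03, several times larger than the 0.01 gap between the two models, so a larger test set would be needed to rank them.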


In [28]:
# custom input
text = ["This was a very interseting movie!! I had tears in my eyes at the end but this movie will be very close to me!", "Oh man the movie is just my money going to water:("]
_, processed = preprocess(text)
vectorized_text = np.array(vectorizer.transform(processed).toarray())
prediction = nb_classifier.predict(vectorized_text)


In [29]:
prediction

array([1, 0])
