## **Naïve Bayes Classification**  
The Naïve Bayes classifier is a probabilistic model based on **Bayes' Theorem**. The key assumption in Naïve Bayes is that **words in a document are conditionally independent given the class**.

---

## **Bayes' Theorem**  

Bayes' Theorem describes the probability of a class \( Y \) given a document \( X \):  


$P(Y \mid X) = \frac{P(X \mid Y) P(Y)}{P(X)}$

<br>

### **Naïve Bayes for Text Classification**  

Since \( P(X) \) is constant for all classes, we only need to compute:  


$P(Y \mid X) \propto P(X \mid Y) P(Y)$

For **text classification**, assuming a document consists of words \( w_1, w_2, ..., w_n \) and applying the independence assumption to the likelihood, the posterior is modeled as:  

$
P(Y \mid w_1, w_2, ..., w_n) \propto P(Y) \prod_{i=1}^{n} P(w_i \mid Y)
$

where:  

- **$P(Y \mid X)$ → Posterior Probability**  
  - The probability of class \( Y \) given the words in the document.  

- **$P(w_i \mid Y)$ → Likelihood**  
  - The probability of word \( w_i \) occurring given class \( Y \).  

- **$P(Y)$ → Prior Probability**  
  - How common class \( Y \) is.  

- **$P(X)$ → Evidence**  
  - The total probability of document \( X \), which is ignored for classification.  

In **Naïve Bayes for text classification**, we assume that **words occur independently** of one another given the class, so word order is ignored and only word counts matter. This is the **Bag of Words model**.
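
As a rough illustration of the proportionality above, the sketch below scores a three-word document against two classes. The values in `priors` and `likelihoods` are made-up numbers chosen for illustration, not estimates from the IMDB data, and the helper name `unnormalized_posterior` is hypothetical.

```python
import math

# Hypothetical priors and per-word likelihoods (illustrative values only)
priors = {"positive": 0.5, "negative": 0.5}
likelihoods = {
    "positive": {"great": 0.05, "plot": 0.02, "boring": 0.001},
    "negative": {"great": 0.01, "plot": 0.02, "boring": 0.03},
}

def unnormalized_posterior(words, label):
    """P(Y) times the product of P(w_i | Y), using the independence assumption."""
    return priors[label] * math.prod(likelihoods[label][w] for w in words)

doc = ["great", "plot", "great"]
scores = {label: unnormalized_posterior(doc, label) for label in priors}

# Dividing by the evidence P(X) (the sum of the scores) normalizes them,
# but it does not change which class has the larger score.
evidence = sum(scores.values())
posteriors = {label: s / evidence for label, s in scores.items()}
predicted = max(scores, key=scores.get)
print(scores, posteriors, predicted)
```

Because the evidence only rescales every score by the same factor, the class ranking is the same before and after normalization.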



## **Naïve Bayes Classification Process**

### **1. Loading Data**
- Load the **IMDB dataset** containing movie reviews and their sentiment labels (positive/negative).

### **2. Preprocessing**
- Remove **HTML tags, URLs, and non-alphanumeric characters**.
- Convert text to **lowercase** and remove extra spaces.
- Remove **stopwords** (common words like "the", "is", "and", etc.).

### **3. Label Encoding**
- Convert categorical labels:
  - `positive → 1`
  - `negative → 0`
- This allows numerical processing of sentiment labels.

### **4. Calculating Prior Probabilities \( P(Y) \)**
- Compute the probability of a review being **positive or negative** in the dataset:

$P(Y) = \frac{\text{Count of class Y reviews}}{\text{Total reviews}}$

### **5. Building Vocabulary & Word Counts**
- Create a vocabulary of **unique words** from the dataset.
- Separate **positive** and **negative** reviews.
- Count how often each word appears in **positive and negative** reviews.

### **6. Naïve Bayes Classification**



- **Uses Laplace smoothing** to avoid zero probabilities.  
- Computes likelihood \( P(w \mid Y) \) for each word using:  

$
P(w \mid Y) = \frac{\text{Word count in class } Y + 1}{\text{Total words in class } Y + |V|}
$

- **Uses log probabilities** to prevent underflow.  
- Predicts the class using:  

$
Y_{\text{pred}} = \arg\max_Y \left[ \log P(Y) + \sum_{i=1}^{n} \log P(w_i \mid Y) \right]
$
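
The smoothing and log-scoring steps above can be sketched on a toy corpus. This is a minimal illustration under stated assumptions: the three training documents are made up, and the helper names `log_likelihood` and `predict` are hypothetical rather than the functions used later in this notebook.

```python
import math
from collections import Counter

# Tiny made-up training set: (tokens, label)
train = [
    (["great", "film"], 1),
    (["great", "acting"], 1),
    (["boring", "film"], 0),
]

vocab = {w for tokens, _ in train for w in tokens}
word_counts = {0: Counter(), 1: Counter()}
for tokens, label in train:
    word_counts[label].update(tokens)

def log_likelihood(word, label):
    # log of (count of word in class + 1) / (total words in class + |V|)
    total = sum(word_counts[label].values())
    return math.log((word_counts[label][word] + 1) / (total + len(vocab)))

def predict(tokens):
    # Score each class with log P(Y) + sum of log P(w_i | Y), then take the argmax
    n_docs = len(train)
    scores = {}
    for label in word_counts:
        prior = sum(1 for _, y in train if y == label) / n_docs
        scores[label] = math.log(prior) + sum(log_likelihood(w, label) for w in tokens)
    return max(scores, key=scores.get)

print(predict(["great", "film"]))
print(predict(["boring"]))
```

Note that `"acting"` never appears in class 0, yet its smoothed likelihood there is still non-zero, so its log is finite.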



### **7. Model Evaluation**  

- **Splits data** into training and test sets.  
- **Runs Naïve Bayes** on test data.  
- **Computes Confusion Matrix and F1 Score** to evaluate performance.  

<br>

----

##### Comments:
- To ensure that no word probability is ever zero, we use **Laplace Smoothing**: adding 1 to every word count means a word never seen in a class still gets a small non-zero probability.
- Prevents probabilities from being exactly zero → Avoids total loss of information, since a single zero factor would wipe out the whole product for that class.
- Does NOT fix the small number multiplication issue → Probabilities are still very small, so underflow can still happen. Use log probabilities to prevent underflow.


# Import libraries

In [None]:
import pandas as pd
import numpy as np

In [None]:
import kagglehub
import os
import zipfile

# Loading data

In [None]:
df = pd.read_csv("IMDB Dataset.csv")

In [None]:
df.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


In [None]:
len(df)

50000

In [None]:
len(df[df["sentiment"]=="positive"])

25000

# Preprocessing

- remove all unwanted tags, urls, non-alphanumeric characters
- convert to lowercase
- remove extra space
- remove stopwords





In [None]:
import re
import nltk
from nltk.corpus import stopwords

# Download stopwords if not already downloaded
nltk.download('stopwords')

stop_words = set(stopwords.words('english'))

def remove_tags(string):
    removelist = ""  # You can add characters to keep if needed

    result = re.sub(r'<.*?>', '', string)  # Remove HTML tags, including <br />
    result = re.sub(r'https?://\S+', '', result)  # Remove URLs
    result = re.sub(r'[^a-zA-Z0-9' + removelist + ']', ' ', result)  # Replace non-alphanumeric characters with a space
    result = result.lower().strip()  # Lowercase and trim the ends; repeated inner spaces are collapsed by split()/join in the stopword step
    return result

# Apply text cleaning to the 'review' column
df['review'] = df['review'].apply(remove_tags)

# Remove stopwords
df['review'] = df['review'].apply(lambda x: ' '.join([word for word in x.split() if word not in stop_words]))

print(df.head())  # Check the cleaned data


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


                                              review sentiment
0  one reviewers mentioned watching 1 oz episode ...  positive
1  wonderful little production filming technique ...  positive
2  thought wonderful way spend time hot summer we...  positive
3  basically family little boy jake thinks zombie...  negative
4  petter mattei love time money visually stunnin...  positive


Label encoding converts categorical values into numerical values, making them suitable for input in Naive Bayes.

In [None]:
df['sentiment'] = df['sentiment'].replace({'positive': 1, 'negative': 0})



In [None]:
df.head()

Unnamed: 0,review,sentiment
0,one reviewers mentioned watching 1 oz episode ...,1
1,wonderful little production filming technique ...,1
2,thought wonderful way spend time hot summer we...,1
3,basically family little boy jake thinks zombie...,0
4,petter mattei love time money visually stunnin...,1


# Naïve Bayes Classification

## Calculate prior Probabilities

Following Bayes' Theorem, start by calculating the prior $P(Y = y)$ for each class.

In [None]:

# Calculate prior P(Y=y)
import math
def calculate_prior(df,Y):
  arr = []
  classes = df[Y].unique()
  for clas in classes:
    clas_div_total = len(df[df[Y]==clas])/ len(df)
    arr.append(math.log(clas_div_total))
  return arr



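The prior probability for each class is the number of rows that belong to that class divided by the total number of rows. Since this value does not depend on the document being classified, it is easier to calculate it once and save it. `calculate_prior` returns the natural log of each prior, in the order of `df[Y].unique()`; for this balanced dataset both values are $\log 0.5 \approx -0.693$.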

In [None]:
calculate_prior(df,"sentiment")

[-0.6931471805599453, -0.6931471805599453]

## Building Vocabulary and word counts

Create the vocabulary: the set of all distinct words in the dataset. Because a set holds no repeated values, the vocabulary is much smaller than the total number of words across all reviews.

In [None]:
# Tokenize every review: total_vocab is a list of word lists, one per review
total_vocab = df["review"].apply(lambda x: x.split()).tolist()
vocab = set(" ".join(df["review"]).split()) # Unique words across all reviews

In [None]:
len(vocab)

103149

Calculate total number of words in each class.

In [None]:
# Split the reviews by class to build the per-class word lists for the denominator:
df_positive = df[df["sentiment"]== 1]
df_negative = df[df["sentiment"]== 0]

In [None]:
no_of_words_positive = " ".join(df_positive["review"]).split()
no_of_words_negative =  " ".join(df_negative["review"]).split()

In [None]:
len(no_of_words_positive)

3033174

In [None]:
len(no_of_words_negative)

2945946

Calculate the denominator for each class for the formula:

$P(w | Y) = \frac{\text{Word count in class Y} + 1}{\text{Total words in class Y} + |V|}$


In [None]:
def denominator_for_class(label):
  # Total words in the class + |V| (the Laplace-smoothed denominator)
  if label == 1:
    return len(no_of_words_positive) + len(vocab)
  if label == 0:
    return len(no_of_words_negative) + len(vocab)
  raise ValueError(f"Unknown label: {label}")



In [None]:
denominator_for_positive_class = len(no_of_words_positive) + len(vocab)
denominator_for_negative_class = len(no_of_words_negative) + len(vocab)

In [None]:
denominator_for_positive_class

3136323

In [None]:
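#inefficient code, but does the job.
# For each word in a document, count how many reviews in total_vocab contain it (plus 1 for smoothing).

doc = df["review"].iloc[0]  # example document (an assumption for illustration; any cleaned review string works)
array_of_words_in_doc = doc.split()
total_count= []
for word in array_of_words_in_doc:
      count = 0
      for i in total_vocab:   # i is the word list of one review
        for j in i:
          if word == j:
            count+=1
            break             # count each review at most once
      total_count.append(count+1)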


Efficient code for calculating how many reviews contain each word in total_vocab.

In [None]:
from collections import Counter

# Convert each review (list of words) into a set for fast lookups
review_sets = [set(review) for review in total_vocab]

# Count how many reviews contain each word at least once
word_counts = Counter(word for review in review_sets for word in review)

# word_counts[word] is the number of reviews that contain `word`


Efficient code for calculating how many positive reviews contain each word.

In [None]:
from collections import Counter
vocab_positive =  df_positive["review"].apply(lambda x: x.split()).tolist() # One list of words per positive review

# Convert each review (list of words) into a set for fast lookups
review_sets_positive = [set(review) for review in vocab_positive]

# Count how many reviews contain each word at least once
word_counts_positive = Counter(word for review in review_sets_positive for word in review)

# word_counts_positive[word] is the number of positive reviews that contain `word`

Efficient code for calculating how many negative reviews contain each word.

In [None]:
from collections import Counter
vocab_negative =  df_negative["review"].apply(lambda x: x.split()).tolist() # One list of words per negative review

# Convert each review (list of words) into a set for fast lookups
review_sets_negative = [set(review) for review in vocab_negative]

# Count how many reviews contain each word at least once
word_counts_negative = Counter(word for review in review_sets_negative for word in review)

# word_counts_negative[word] is the number of negative reviews that contain `word`

In [None]:
def call_word_count(label):
  if label ==1:
    return word_counts_positive
  if label ==0:
    return word_counts_negative
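
The counters above record how many reviews contain a word (document frequency), while the smoothing formula is written in terms of how often a word occurs in a class. The toy sketch below, using two made-up reviews, shows how the two counts differ. Whether to switch the notebook to occurrence counts is a modelling choice left open here.

```python
from collections import Counter

# Two made-up tokenized reviews
reviews = [["good", "good", "film"], ["good", "plot"]]

# Document frequency: each review contributes at most 1 per word
doc_freq = Counter(word for review in reviews for word in set(review))

# Term frequency: every occurrence is counted
term_freq = Counter(word for review in reviews for word in review)

print(doc_freq["good"], term_freq["good"])
print(sum(term_freq.values()), sum(len(r) for r in reviews))
```

Only the term-frequency counts add up to the total number of words in the reviews, which is the quantity used in the denominator.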

### **Underflow in Naïve Bayes?**  
In Naïve Bayes, we calculate the probability of a class \( Y \) given a document \( X \) using:  

$P(Y | X) \propto P(Y) \prod_{i=1}^{n} P(w_i | Y)$

<br>

- $P(w_i \mid Y)$ (the likelihood) is typically very small, because any single word accounts for only a tiny fraction of the words seen in a class.  
- Multiplying many small probabilities together makes the final probability **extremely small**, leading to **numerical underflow** (rounding to zero).  



### **How to Fix Underflow? Use Log Probabilities**  
Instead of multiplying probabilities, we use **logarithms** to convert **multiplication into addition**:  

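$\log P(Y | X) = \log P(Y) + \sum_{i=1}^{n} \log P(w_i | Y) - \log P(X)$

The $\log P(X)$ term is the same for every class, so it can be dropped when comparing classes.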
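
A small demonstration of the effect, assuming 200 hypothetical word likelihoods of $10^{-5}$ each (the numbers are illustrative, not taken from the dataset):

```python
import math

# 200 hypothetical word likelihoods of 1e-5 each
probs = [1e-5] * 200

# The true product is 1e-1000, far below the smallest positive float64
product = math.prod(probs)

# Summing logs keeps the same information in a comfortable numeric range
log_sum = sum(math.log(p) for p in probs)

print(product)
print(log_sum)
```

The direct product collapses to zero, so every class would tie, whereas the log sum remains finite and can still be compared across classes.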


## naive_bayes function

In [None]:
def naive_bayes(df,X,Y):
  # Log priors keyed by class label (calculate_prior follows the order of df[Y].unique())
  prior = dict(zip(df[Y].unique(), calculate_prior(df,Y)))
  unique_labels = sorted(df[Y].unique())
  extreme_final =[]
  for x in X: # loop over each row of test data
    doc = x[0]

    final = []

    for j in unique_labels:
      # Laplace-smoothed count (+1) for every word in the document
      total_count = [call_word_count(j).get(word, 0) + 1 for word in doc.split()]

      # denominator_for_class(j) already includes |V|, so it is not added again.
      # float64 is enough here because the scores are summed in log space.
      likelihood = np.array(total_count, dtype=np.float64) / denominator_for_class(j)

      # Class score = log P(Y) + sum of log P(w_i | Y)
      final.append(float(prior[j] + np.sum(np.log(likelihood))))

    # Predict the label with the highest log score
    extreme_final.append(unique_labels[int(np.argmax(final))])

  return extreme_final



# Run model and evaluate.

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, f1_score


train, test = train_test_split(df, test_size=.2, random_state=49)   # Train, test dataset split

X_test = test.iloc[:,: -1].values # Remove the Y column
Y_test = test.iloc[:,-1].values   # Keep the Y column

Y_pred = naive_bayes(train, X= X_test,Y= "sentiment")   # Apply Naive - Bayes algorithm

print("Confusion Matrix:")
print(confusion_matrix( Y_test, Y_pred))
print('\n')
print("f1 Score:")
print(f1_score(Y_test, Y_pred))

Confusion Matrix:
[[ 119 4808]
 [1577 3496]]


f1 Score:
0.5226881961575839
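
The F1 score can be recomputed by hand from the recorded confusion matrix, which also gives accuracy, precision, and recall. The sketch below assumes scikit-learn's convention that rows are true labels and columns are predicted labels, both in sorted order (0, then 1).

```python
# Recorded confusion matrix: rows are true labels (0, 1), columns are predictions (0, 1)
cm = [[119, 4808],
      [1577, 3496]]

tn, fp = cm[0]
fn, tp = cm[1]

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
```

Looking at accuracy alongside F1 is useful here: on a balanced test set, an accuracy below 0.5 means the predictions are worse than guessing, even though the F1 score on the positive class is above 0.5.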


## **Comments**  

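This notebook was used to learn **Naïve Bayes from scratch** and did not focus on improving model performance. Two observations about the recorded run: 3,615 of the 10,000 test reviews are classified correctly, which is below chance for a balanced dataset and suggests checking the scoring step (the class score should be $\log P(Y) + \sum_i \log P(w_i \mid Y)$); and the word counts are built from the full dataset before the train/test split, so the test reviews are not truly unseen.  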

### **Ways to Improve Model Performance:**  
- **Better Text Preprocessing:** Adding stemming or lemmatization on top of the stopword removal already applied.  
- **TF-IDF Features:** Using **Term Frequency-Inverse Document Frequency (TF-IDF)** instead of raw word counts.  
- **Handling Imbalanced Data:** Not needed for this balanced dataset (25,000 reviews per class), but **class weighting** or **oversampling/undersampling** methods help when classes are skewed.  
- **Feature Engineering:** Extracting **n-grams** (bigrams, trigrams) to capture word relationships.  
- **Hyperparameter Tuning:** Adjusting **Laplace smoothing** and feature selection methods.  
- **Using Word Embeddings:** Incorporating **word vectors (e.g., Word2Vec, GloVe)** for better representation.  
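
As one example from the list, n-gram features can be sketched in a few lines. The helper `ngrams` below is a hypothetical name, not part of this notebook. Note that the English stopword list used earlier contains "not", so negations would have to be kept during preprocessing for a bigram like "not good" to survive.

```python
def ngrams(tokens, n):
    """Return the n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "not good at all".split()

# Unigrams plus bigrams form an extended bag of features for the same document
features = tokens + ngrams(tokens, 2)
print(features)
```

The extended feature list can then be counted per class in the same way as single words.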
