# Rebuilding Naive Bayes — From First Principles

> This notebook began as a university assignment, but I kept going. 
> I wanted to answer: *What really happens inside a Naive Bayes classifier?*
>
> Here, I implement **Bernoulli** and **Multinomial Naive Bayes from scratch**, compare them to `sklearn`, and even explore Shannon-style text generation.
>
> Core models live in `src/naive_bayes.py`. This notebook is my playground for testing, reflecting, and learning.
>
> — Touseef Ali

In [None]:
# Install dependencies if needed (uncomment if running locally for the first time)
# !pip install numpy pandas scikit-learn matplotlib --quiet

In [None]:
# Core imports
import numpy as np
import pandas as pd
import re
from sklearn.model_selection import train_test_split
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    confusion_matrix, ConfusionMatrixDisplay
)
from sklearn.naive_bayes import MultinomialNB, BernoulliNB

# Import our from-scratch implementations
import sys
sys.path.append('..')
from src.naive_bayes import BernoulliNaiveBayes, MultinomialNaiveBayes
from src.vectorizer import BagOfWords

## 1. Loading the Datasets

In [None]:
# Mushroom Dataset
mushroom_df = pd.read_csv("../data/mushrooms.csv")
print("Mushroom Dataset (first 5 rows):")
print(mushroom_df.head())

In [None]:
# AG-News Dataset
agnews_train = pd.read_csv("../data/AG-News/train.csv")
agnews_test = pd.read_csv("../data/AG-News/test.csv")
print("\nAG-News Train (first 5 rows):")
print(agnews_train.head())

## 2. Data Preprocessing

### 2.1 Mushroom Dataset → One-hot encoding

In [None]:
mush_features = mushroom_df.drop('class', axis=1)
mush_labels = mushroom_df['class']
mush_features_encoded = pd.get_dummies(mush_features, dtype=int)

train_mush_features, test_mush_features, train_mush_labels, test_mush_labels = train_test_split(
    mush_features_encoded, mush_labels, test_size=0.3, random_state=42
)

print("Before split:", mush_features_encoded.shape)
print("Train shape:", train_mush_features.shape)
print("Test shape:", test_mush_features.shape)

### 2.2 AG-News → Text Cleaning

In [None]:
# Load stopwords
with open("../data/english_stopwords.txt", "r", encoding="utf-8") as f:
    stopwords = set(f.read().splitlines())

def clean_text(text, stopwords_set):
    """Strip URLs and punctuation, lowercase, and drop stopwords."""
    text = re.sub(r'http\S+', '', text)  # Remove URLs
    text = re.sub(r'[^a-zA-Z0-9\s]', ' ', text)  # Keep alphanum + space
    words = text.lower().split()  # split() never yields empty strings
    words = [w for w in words if w not in stopwords_set]
    return ' '.join(words)

agnews_train['cleaned'] = agnews_train['Description'].apply(lambda x: clean_text(x, stopwords))
agnews_test['cleaned'] = agnews_test['Description'].apply(lambda x: clean_text(x, stopwords))

print("First 5 cleaned AG-News descriptions:")
print(agnews_train[['Description', 'cleaned']].head())

## 3. Vectorizing Text with Bag-of-Words

In [None]:
bow = BagOfWords()
bow.fit(agnews_train['cleaned'])

X_train_news = bow.transform(agnews_train['cleaned'])
X_test_news = bow.transform(agnews_test['cleaned'])

y_train_news = agnews_train['Category'].values
y_test_news = agnews_test['Category'].values

print("Vocabulary size:", len(bow.vocab))
print("X_train_news shape:", X_train_news.shape)
print("X_test_news shape:", X_test_news.shape)
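
`BagOfWords` comes from `src/vectorizer.py`, which is not shown in this notebook. As a rough idea of what such a vectorizer does, here is a minimal sketch. The class name `TinyBagOfWords` and the toy documents are hypothetical, and the real class may differ (for example, it may return a sparse matrix). The only things assumed to match are the `fit`/`transform` methods and a `vocab` dict mapping each word to a column index, as used above.

```python
import numpy as np


class TinyBagOfWords:
    """Hypothetical stand-in for src.vectorizer.BagOfWords (not the author's code)."""

    def fit(self, documents):
        # Give each distinct whitespace-separated token a column index,
        # in order of first appearance.
        self.vocab = {}
        for doc in documents:
            for token in doc.split():
                if token not in self.vocab:
                    self.vocab[token] = len(self.vocab)
        return self

    def transform(self, documents):
        # Dense count matrix: one row per document, one column per vocab word.
        # Tokens that were not seen during fit are skipped.
        documents = list(documents)
        counts = np.zeros((len(documents), len(self.vocab)), dtype=np.int64)
        for row, doc in enumerate(documents):
            for token in doc.split():
                col = self.vocab.get(token)
                if col is not None:
                    counts[row, col] += 1
        return counts


# Toy documents, made up for illustration
toy_docs = ["stocks rally stocks", "team wins match"]
toy_bow = TinyBagOfWords().fit(toy_docs)
toy_counts = toy_bow.transform(toy_docs + ["stocks unknown"])
print(toy_bow.vocab)
print(toy_counts)
```

A dense matrix is fine for a toy example, but for a large corpus it can become very big, which is why a sparse representation is the usual choice in practice.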

## 4. From-Scratch: Bernoulli Naive Bayes (Mushroom)

In [None]:
bnb_model = BernoulliNaiveBayes()
bnb_model.fit(train_mush_features, train_mush_labels)
y_pred_mush = bnb_model.predict(test_mush_features)

In [None]:
accuracy = accuracy_score(test_mush_labels, y_pred_mush)
precision = precision_score(test_mush_labels, y_pred_mush, pos_label='e')
recall = recall_score(test_mush_labels, y_pred_mush, pos_label='e')
f1 = f1_score(test_mush_labels, y_pred_mush, pos_label='e')
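labels = np.unique(test_mush_labels)  # fixes the row/column order of the matrix
matrix = confusion_matrix(test_mush_labels, y_pred_mush, labels=labels)

print(f"Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1 Score: {f1:.4f}")
print("Confusion Matrix (rows = true class, columns = predicted class):")
ConfusionMatrixDisplay(matrix, display_labels=labels).plot()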

## 5. From-Scratch: Multinomial Naive Bayes (AG-News)

In [None]:
nb_model = MultinomialNaiveBayes()
nb_model.fit(X_train_news, y_train_news)
y_pred_news = nb_model.predict(X_test_news)

In [None]:
accuracy = accuracy_score(y_test_news, y_pred_news)
precision = precision_score(y_test_news, y_pred_news, average='macro')
recall = recall_score(y_test_news, y_pred_news, average='macro')
f1 = f1_score(y_test_news, y_pred_news, average='macro')
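labels = np.unique(y_test_news)  # fixes the row/column order of the matrix
matrix = confusion_matrix(y_test_news, y_pred_news, labels=labels)

print(f"Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1 Score: {f1:.4f}")
print("Confusion Matrix (rows = true class, columns = predicted class):")
ConfusionMatrixDisplay(matrix, display_labels=labels).plot()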

## 6. Comparison with scikit-learn

In [None]:
# MultinomialNB (AG-News)
mnb_sk = MultinomialNB(alpha=1.0)
mnb_sk.fit(X_train_news, y_train_news)
y_pred_mnb = mnb_sk.predict(X_test_news)

print("=== scikit-learn MultinomialNB (AG-News) ===")
print(f"Accuracy: {accuracy_score(y_test_news, y_pred_mnb):.4f}")

In [None]:
# BernoulliNB (Mushroom)
# binarize=None: the one-hot features are already 0/1, so no thresholding is needed
bnb_sk = BernoulliNB(alpha=1.0, binarize=None)
bnb_sk.fit(train_mush_features, train_mush_labels)
y_pred_bnb = bnb_sk.predict(test_mush_features)

print("=== scikit-learn BernoulliNB (Mushroom) ===")
print(f"Accuracy: {accuracy_score(test_mush_labels, y_pred_bnb):.4f}")
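
Matching accuracy does not guarantee that two models make the same predictions, so it can help to check agreement sample by sample. The helper below is a small sketch for that. The name `agreement_rate` and the toy label arrays are made up for illustration and are not results from this notebook.

```python
import numpy as np


def agreement_rate(pred_a, pred_b):
    """Fraction of samples on which two prediction arrays agree."""
    pred_a = np.asarray(pred_a)
    pred_b = np.asarray(pred_b)
    if pred_a.shape != pred_b.shape:
        raise ValueError("prediction arrays must have the same shape")
    return float(np.mean(pred_a == pred_b))


# Toy illustration with made-up labels
toy_scratch = np.array(["e", "p", "e", "e"])
toy_sklearn = np.array(["e", "p", "p", "e"])
toy_rate = agreement_rate(toy_scratch, toy_sklearn)
print(f"Agreement: {toy_rate:.2f}")
```

In this notebook the same check would be `agreement_rate(y_pred_mush, y_pred_bnb)` for the mushroom models and `agreement_rate(y_pred_news, y_pred_mnb)` for AG-News.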

## 7. Generative Fun: Shannon-Style Text Generation

In [None]:
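def shannon_generate(model, vocab, label_idx, n_words=10, rng=None):
    """Sample n_words independently from one class's word distribution."""
    rng = np.random.default_rng() if rng is None else rng
    # feature_log_prob_[k] holds log P(word | class k) for a fitted MultinomialNB
    probs = np.exp(model.feature_log_prob_[label_idx])
    probs /= probs.sum()  # renormalize to absorb floating-point error
    sampled_idx = rng.choice(len(vocab), size=n_words, p=probs)
    return [vocab[i] for i in sampled_idx]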

# Get vocab in correct order
vocab_list = [word for word, idx in sorted(bow.vocab.items(), key=lambda x: x[1])]

for idx, label in enumerate(mnb_sk.classes_):
    words = shannon_generate(mnb_sk, vocab_list, idx, n_words=10)
    print(f"Class '{label}': {' '.join(words)}")

## Reflection

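- **Why Bernoulli for Mushroom?** After one-hot encoding, every feature is binary (0/1), which is exactly what Bernoulli NB models: each feature is either present or absent.
- **Why Multinomial for AG-News?** Bag-of-words features are word counts, and Multinomial NB models how often each word occurs in a class.
- **Key insight**: My from-scratch versions come within ~1% of `sklearn`'s accuracy on the same splits, not because I copied, but because I *understood* the math.
- **Biggest surprise**: Even a "simple" model like NB can *generate* text that reflects class semantics. The words are sampled independently, so there is no grammar, but the vocabulary is recognizably on-topic for each class.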
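
To make the Bernoulli/Multinomial distinction above concrete, here is a compact sketch of both models with add-alpha smoothing. It is not the code in `src/naive_bayes.py`. The function names and toy data are hypothetical, and the smoothing formulas are the standard ones that `sklearn` also documents.

```python
import numpy as np


def fit_multinomial(X, y, alpha=1.0):
    """Return (classes, log_prior, log_likelihood) for count features X."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    classes = np.unique(y)
    # Per-class word counts with add-alpha smoothing, normalized over the vocabulary
    counts = np.vstack([X[y == c].sum(axis=0) for c in classes]) + alpha
    log_likelihood = np.log(counts / counts.sum(axis=1, keepdims=True))
    log_prior = np.log(np.array([(y == c).mean() for c in classes]))
    return classes, log_prior, log_likelihood


def predict_multinomial(X, classes, log_prior, log_likelihood):
    # Score for class c: log P(c) + sum_j x_j * log P(word_j | c)
    scores = np.asarray(X, dtype=float) @ log_likelihood.T + log_prior
    return classes[np.argmax(scores, axis=1)]


def fit_bernoulli(X, y, alpha=1.0):
    """Return (classes, log_prior, prob) for binary features X."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    classes = np.unique(y)
    # P(x_j = 1 | c) = (count of ones + alpha) / (samples in class + 2 * alpha)
    prob = np.vstack([
        (X[y == c].sum(axis=0) + alpha) / ((y == c).sum() + 2 * alpha)
        for c in classes
    ])
    log_prior = np.log(np.array([(y == c).mean() for c in classes]))
    return classes, log_prior, prob


def predict_bernoulli(X, classes, log_prior, prob):
    X = np.asarray(X, dtype=float)
    # Absent features contribute log(1 - p); Multinomial NB has no such term
    scores = X @ np.log(prob).T + (1 - X) @ np.log(1 - prob).T + log_prior
    return classes[np.argmax(scores, axis=1)]


# Toy data, made up for illustration: 4 documents, 4 vocabulary words
X_counts = np.array([[2, 1, 0, 0], [1, 2, 0, 0], [0, 0, 2, 1], [0, 0, 1, 2]])
y_toy = np.array(["biz", "biz", "sport", "sport"])
X_new = np.array([[1, 1, 0, 0], [0, 0, 1, 1]])

mnb_params = fit_multinomial(X_counts, y_toy)
mnb_preds = predict_multinomial(X_new, *mnb_params)

bnb_params = fit_bernoulli(X_counts > 0, y_toy)
bnb_preds = predict_bernoulli(X_new, *bnb_params)

print("Multinomial:", mnb_preds)
print("Bernoulli:  ", bnb_preds)
```

The `(1 - X)` term in `predict_bernoulli` is the practical difference between the two: Bernoulli NB penalizes a class when a feature it expects is missing, while Multinomial NB only scores the words that actually occur.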

This isn’t just code. It’s my path to deeper understanding.