## **1. Understanding the Formula**
The **Multinomial Naïve Bayes** classifier is used for **text classification** problems, such as **spam detection**.

### **Bayes' Theorem:**
\[
P(Class | Features) = \frac{P(Features | Class) \times P(Class)}{P(Features)}
\]

Where:
- \( P(Class | Features) \) = Posterior probability of a class (Spam or Normal) given the message features.
- \( P(Features | Class) \) = Likelihood (probability of word occurrence in a class).
- \( P(Class) \) = Prior probability of a class.
- \( P(Features) \) = Evidence (probability of message features occurring).
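
As a quick sanity check of Bayes' theorem, the sketch below plugs in made-up numbers for a single feature (the word "free"). All four probabilities are hypothetical and chosen only to make the arithmetic easy to follow; they are not estimated from the dataset.

```python
# Hypothetical numbers, chosen only to illustrate Bayes' theorem.
p_spam = 0.2                 # prior P(Spam)
p_normal = 1 - p_spam        # prior P(Normal)
p_word_given_spam = 0.30     # likelihood P("free" | Spam)
p_word_given_normal = 0.02   # likelihood P("free" | Normal)

# Evidence: total probability of seeing the word in any message.
p_word = p_word_given_spam * p_spam + p_word_given_normal * p_normal

# Posterior: P(Spam | "free")
p_spam_given_word = p_word_given_spam * p_spam / p_word
print(round(p_spam_given_word, 3))  # 0.789
```

Even with a prior of only 0.2, one strongly spam-associated word pushes the posterior to roughly 0.79 under these assumed values.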

### **Multinomial Naïve Bayes Formula:**
For a message containing words \( w_1, w_2, ..., w_n \), and assuming the words are conditionally independent given the class (the "naïve" assumption):

\[
P(Class | Message) \propto P(Class) \times \prod_{i=1}^{n} P(w_i | Class)
\]

- **Each word’s probability is calculated using:**
  \[
  P(w | Class) = \frac{\text{Count of word } w \text{ in class} + 1}{\text{Total words in class} + \text{Total unique words}}
  \]
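- Adding 1 to every word count (and the vocabulary size, i.e. the total number of unique words across all classes, to the denominator) is known as **Laplace Smoothing**. It prevents a word that never appears in a class from driving the whole product to zero.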
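
The smoothed estimate can be sketched on a toy corpus as follows. The helper name `smoothed_prob`, the word list, and the vocabulary are all hypothetical and exist only for this illustration.

```python
from collections import Counter

def smoothed_prob(word, class_counts, vocab_size, alpha=1.0):
    """P(word | class) with additive (Laplace) smoothing."""
    total_words = sum(class_counts.values())
    return (class_counts[word] + alpha) / (total_words + alpha * vocab_size)

# Toy spam corpus (made up for illustration).
spam_counts = Counter(["free", "prize", "free", "win"])
vocab = {"free", "prize", "win", "hello", "meeting"}

print(smoothed_prob("free", spam_counts, len(vocab)))     # (2 + 1) / (4 + 5)
print(smoothed_prob("meeting", spam_counts, len(vocab)))  # (0 + 1) / (4 + 5)
```

The word "meeting" never occurs in the toy spam corpus, yet it still receives a small non-zero probability, and the smoothed probabilities over the whole vocabulary still sum to 1.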

### **Final Decision Rule**
The class with the **highest probability** is chosen:
\[
Class^{*} = \arg\max_{Class} P(Class | Message)
\]
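
In practice the product of many small probabilities underflows to zero in floating point, so the comparison is usually done on log-probabilities: because the logarithm is monotonic, the class that maximizes the sum of logs also maximizes the product. The sketch below illustrates this with hypothetical per-word likelihoods and priors (none of these values come from the dataset).

```python
import math

# Hypothetical per-word likelihoods for a 400-word message.
probs_spam = [1e-3] * 400
probs_normal = [1e-4] * 400

# The direct products underflow to 0.0, so the classes cannot be compared.
print(math.prod(probs_spam), math.prod(probs_normal))  # 0.0 0.0

# Log prior plus the sum of log likelihoods stays well within float range.
log_scores = {
    "spam": math.log(0.2) + sum(math.log(p) for p in probs_spam),
    "normal": math.log(0.8) + sum(math.log(p) for p in probs_normal),
}
prediction = max(log_scores, key=log_scores.get)
print(prediction)  # spam
```

Note that `math.prod` requires Python 3.8 or later.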


In [27]:
# Import libraries
import numpy as np
import pandas as pd
import re
import string
from collections import defaultdict
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
file_path = "./SMSSpamCollection"

# Load the dataset
df = pd.read_csv(file_path, sep='\t', header=None, names=["Label", "Message"])

# Display first 5 rows
df.head()



# Convert labels to binary (0 = Normal, 1 = Spam)
df["Label"] = df["Label"].map({"ham": 0, "spam": 1})

# Display dataset structure
df.head()


|   | Label | Message |
|---|---|---|
| 0 | 0 | Go until jurong point, crazy.. Available only ... |
| 1 | 0 | Ok lar... Joking wif u oni... |
| 2 | 1 | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | 0 | U dun say so early hor... U c already then say... |
| 4 | 0 | Nah I don't think he goes to usf, he lives aro... |


In [29]:
# Function to clean and tokenize text
def clean_text(text):
    text = text.lower()  # Lowercase
    text = re.sub(f"[{re.escape(string.punctuation)}]", "", text)  # Remove punctuation (escaped so \, ], ^ and - are treated literally)
    words = text.split()  # Tokenize
    return words

# Apply text preprocessing
df["Processed"] = df["Message"].apply(clean_text)

# Split dataset into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(df["Processed"], df["Label"], test_size=0.2, random_state=42)
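
To see what the preprocessing step produces, the following standalone sketch runs a tokenizer of the same shape on one made-up message. It escapes the punctuation set with `re.escape`, which is an assumption of this sketch, and the sample message is invented.

```python
import re
import string

def clean_text(text):
    text = text.lower()  # Lowercase
    # re.escape keeps characters such as \, ], ^ and - literal inside [...]
    text = re.sub(f"[{re.escape(string.punctuation)}]", "", text)
    return text.split()  # Whitespace tokenization

print(clean_text("FREE entry!!! Text WIN to 80086, now."))
# ['free', 'entry', 'text', 'win', 'to', '80086', 'now']
```

Digits are kept as tokens, so numbers such as short codes remain available to the classifier as features.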


In [31]:
class MultinomialNaiveBayes:
    def __init__(self, alpha=1.0):  # Laplace smoothing parameter
        self.alpha = alpha
        self.class_probs = {}
        self.word_probs = defaultdict(lambda: defaultdict(float))
        self.vocab = set()

    def fit(self, X_train, y_train):
        class_counts = defaultdict(int)
        word_counts = defaultdict(lambda: defaultdict(int))

        # Count word occurrences for each class
        for words, label in zip(X_train, y_train):
            class_counts[label] += 1
            for word in words:
                word_counts[label][word] += 1
                self.vocab.add(word)

        # Calculate class probabilities
        total_docs = sum(class_counts.values())
        for label in class_counts:
            self.class_probs[label] = class_counts[label] / total_docs

        # Calculate word probabilities with Laplace smoothing
        vocab_size = len(self.vocab)
        for label in word_counts:
            total_words = sum(word_counts[label].values())
            for word in self.vocab:
                self.word_probs[label][word] = (word_counts[label][word] + self.alpha) / (total_words + self.alpha * vocab_size)

    def predict(self, X_test):
        predictions = []
        for words in X_test:
            class_scores = {}
            for label in self.class_probs:
                class_scores[label] = np.log(self.class_probs[label])  # Start with log prior probability
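                for word in words:
                    if word in self.vocab:  # Skip words never seen in training
                        # Every vocabulary word normally has an entry in word_probs[label].
                        # The fallback only applies to a class with no training words
                        # (total_words = 0), where it equals the smoothed probability.
                        class_scores[label] += np.log(self.word_probs[label].get(word, self.alpha / (self.alpha * len(self.vocab))))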

            # Predict class with highest log-probability
            predictions.append(max(class_scores, key=class_scores.get))
        return predictions


In [33]:
# Train the Multinomial Naïve Bayes Model
nb = MultinomialNaiveBayes(alpha=1.0)
nb.fit(X_train, y_train)


In [35]:
# Predict labels for the held-out test messages
y_pred = nb.predict(X_test)

In [37]:
# Fraction of test messages classified correctly
accuracy = accuracy_score(y_test, y_pred)
accuracy

0.9874439461883409
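
Accuracy alone can hide poor performance on the minority class, which in spam detection is typically the spam class. The sketch below computes precision and recall by hand on made-up labels; `precision_recall`, `y_true_toy`, and `y_pred_toy` are hypothetical names used only here. The already imported `classification_report` reports the same quantities and could be called on `y_test` and `y_pred`.

```python
def precision_recall(y_true, y_pred, positive=1):
    """Precision and recall for the class labelled `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Made-up labels: 8 normal messages and 2 spam; one spam message is missed.
y_true_toy = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred_toy = [0, 0, 0, 0, 0, 0, 0, 0, 1, 0]

print(precision_recall(y_true_toy, y_pred_toy))  # (1.0, 0.5)
```

In this toy case the accuracy is 0.9, yet only half of the spam messages are caught, which is why per-class metrics are worth checking alongside accuracy.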