# Group Assignment: Naïve Bayes Spam Filter

This notebook completes all required steps:

1. Load datasets and censored word lists.
2. Preprocess SMS messages (remove punctuation and numbers, lowercase).
3. Train and evaluate the provided Naïve Bayes classifiers (`train` and `train2`).
4. Answer all questions in the assignment.
5. Implement the missing-word (censored-word) modification and report test accuracies for `test1` and `test2`.


In [15]:
import re
import sys
import numpy as np
import pandas as pd

# Files (mounted in this environment)
TRAIN_PATH = "/content/training.txt"
VAL_PATH = "/content/validation.txt"
TEST1_PATH = "/content/test1.txt"
TEST2_PATH = "/content/test2.txt"

CENS1_PATH = "/content/censored_list_test1.txt"
CENS2_PATH = "/content/censored_list_test2.txt"

NB_PATH = "/content/naive_bayes.py"


## 1. Load the data

Load the four dataset files into pandas DataFrames and the two censored lists into Python lists. Do not shuffle.

In [16]:
def load_sms_dataset(path: str) -> pd.DataFrame:
    # Files are CSV-like with header: label,sms
    df = pd.read_csv(path)
    # Standardize column names just in case
    df.columns = [c.strip().lower() for c in df.columns]
    # Expect columns: label, sms
    if "label" not in df.columns or "sms" not in df.columns:
        raise ValueError(f"Unexpected columns in {path}: {df.columns.tolist()}")
    return df

train_df = load_sms_dataset(TRAIN_PATH)
val_df   = load_sms_dataset(VAL_PATH)
test1_df = load_sms_dataset(TEST1_PATH)
test2_df = load_sms_dataset(TEST2_PATH)

print("train:", train_df.shape)
print("validation:", val_df.shape)
print("test1:", test1_df.shape)
print("test2:", test2_df.shape)
train_df.head()


train: (2000, 2)
validation: (1000, 2)
test1: (1285, 2)
test2: (1286, 2)


|   | label | sms |
|---|-------|-----|
| 0 | ham | \Hi darlin i cantdo anythingtomorrow as mypare... |
| 1 | ham | K..k:)how about your training process? |
| 2 | ham | K actually can you guys meet me at the sunoco ... |
| 3 | ham | Ok lor. Msg me b4 u call. |
| 4 | spam | FreeMsg>FAV XMAS TONES!Reply REAL |


In [17]:
def load_censored_list(path: str) -> list[str]:
    # One word per line; blank lines are skipped
    with open(path, "r", encoding="utf-8", errors="ignore") as f:
        return [line.strip() for line in f if line.strip()]

censored_test1 = load_censored_list(CENS1_PATH)
censored_test2 = load_censored_list(CENS2_PATH)

print("censored_test1:", len(censored_test1))
print("censored_test2:", len(censored_test2))
print("example test1 words:", censored_test1[:20])


censored_test1: 485
censored_test2: 1456
example test1 words: ['god', 'search', 'passionate', 'lookatme', 'dearme', 'losing', 'convey', 'select', 'okok', 'more', 'themobyo', 'gang', 'salon', 'missed', 'dads', 'noice', 'upgrading', 'coffee', 'i', 'sory']


## 2. Preprocess the SMS messages

Remove punctuation and numbers and convert to lowercase.

In [18]:
# Remove punctuation and digits, keep letters and spaces.
# Also collapse repeated whitespace.
def preprocess_sms(text: str) -> str:
    text = str(text).lower()
    text = re.sub(r"[^a-z\s]", " ", text)     # remove numbers/punctuation/symbols
    text = re.sub(r"\s+", " ", text).strip()  # normalize whitespace
    return text

# Apply preprocessing
for df in [train_df, val_df, test1_df, test2_df]:
    df["sms_clean"] = df["sms"].apply(preprocess_sms)

train_df[["label","sms","sms_clean"]].head()


|   | label | sms | sms_clean |
|---|-------|-----|-----------|
| 0 | ham | \Hi darlin i cantdo anythingtomorrow as mypare... | hi darlin i cantdo anythingtomorrow as myparen... |
| 1 | ham | K..k:)how about your training process? | k k how about your training process |
| 2 | ham | K actually can you guys meet me at the sunoco ... | k actually can you guys meet me at the sunoco ... |
| 3 | ham | Ok lor. Msg me b4 u call. | ok lor msg me b u call |
| 4 | spam | FreeMsg>FAV XMAS TONES!Reply REAL | freemsg fav xmas tones reply real |


## 3. Load the provided Naïve Bayes implementation

We import the class from `naive_bayes.py`.

In [19]:
import importlib.util

spec = importlib.util.spec_from_file_location("naive_bayes", NB_PATH)
naive_bayes = importlib.util.module_from_spec(spec)
spec.loader.exec_module(naive_bayes)

NaiveBayesForSpam = naive_bayes.NaiveBayesForSpam
NaiveBayesForSpam


naive_bayes.NaiveBayesForSpam

## Helper: split ham/spam and evaluate

We train by passing ham and spam messages separately, as required by the provided API.

In [20]:
def split_messages(df: pd.DataFrame):
    ham = df.loc[df["label"] == "ham", "sms_clean"].tolist()
    spam = df.loc[df["label"] == "spam", "sms_clean"].tolist()
    return ham, spam

def evaluate_model(model: NaiveBayesForSpam, df: pd.DataFrame):
    messages = df["sms_clean"].tolist()
    labels = df["label"].tolist()
    acc, confusion = model.score(messages, labels)
    return acc, confusion

ham_train, spam_train = split_messages(train_df)
ham_val, spam_val = split_messages(val_df)

print("train ham/spam:", len(ham_train), len(spam_train))
print("val ham/spam:", len(ham_val), len(spam_val))

train ham/spam: 1752 248
val ham/spam: 860 140


## Question 4

**Explain the code:** purpose of each function, what `train` and `train2` do, difference between them, and where Bayes’ theorem is applied.

**Purpose of each method (from `naive_bayes.py`).**

- `train(hamMessages, spamMessages)`: builds a vocabulary of all unique words in training messages, estimates class priors $P(Y=\text{ham})$, $P(Y=\text{spam})$, and estimates per-word likelihoods $P(X_w=1\mid Y)$ for each class using smoothed counts.
- `train2(hamMessages, spamMessages)`: same as `train`, but it **keeps only strongly spam-indicative words** (a reduced vocabulary) using the condition `if prob1 * 20 < prob2`.
- `predict(message)`: applies Naïve Bayes by starting from priors and multiplying by likelihood terms for each word depending on whether the word appears in the message (presence) or not (absence). It normalises at each step to avoid numerical underflow.
- `score(messages, labels)`: evaluates accuracy and builds a 2×2 confusion matrix.

**What `train` and `train2` do, and the difference.**

Both compute priors and likelihoods, but `train2` performs feature selection by keeping only words that are much more likely in spam than ham. This shrinks the vocabulary, speeds up prediction, and can improve generalisation by removing noisy/weak words.
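
The selection rule can be illustrated on toy data. The sketch below is a minimal reconstruction, not the code in `naive_bayes.py`: the helper name `select_spam_keywords`, the add-one smoothing of document frequencies, and the whole-word matching are assumptions made for illustration. Only the factor of 20 in the comparison is taken from the condition quoted above.

```python
def select_spam_keywords(ham_messages, spam_messages, ratio=20.0):
    """Keep words whose smoothed spam frequency exceeds `ratio` times the ham frequency."""
    # Represent each message by its set of words (presence/absence features).
    ham_docs = [set(m.split()) for m in ham_messages]
    spam_docs = [set(m.split()) for m in spam_messages]
    vocabulary = set().union(*ham_docs, *spam_docs)

    keywords = []
    for w in sorted(vocabulary):
        # Assumed add-one smoothing so that unseen words never get probability 0.
        prob_ham = (1.0 + sum(w in d for d in ham_docs)) / len(ham_docs)
        prob_spam = (1.0 + sum(w in d for d in spam_docs)) / len(spam_docs)
        if prob_ham * ratio < prob_spam:
            keywords.append(w)
    return keywords


# Toy data, invented for this example.
toy_ham = ["see you at lunch"] * 40 + ["call me later"] * 40
toy_spam = ["win a free prize now", "free prize call now"]

print(select_spam_keywords(toy_ham, toy_spam))
# ['a', 'free', 'now', 'prize', 'win']
```

Note that `call` is dropped even though it occurs in spam, because it is also common in ham; only words that are rare in ham and frequent in spam survive the filter.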

**Where Bayes’ theorem is applied.**

Bayes’ theorem is implemented inside `predict` when it computes and updates the (unnormalised) posteriors:
$$
P(Y\mid X) \propto P(Y)\prod_i P(X_i\mid Y)
$$
It then normalises the posterior vector after each multiplication.
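
The update can be sketched on a two-word toy vocabulary. This is an illustration under assumptions, not the provided `predict`: the function name `sequential_posterior` and all numbers are invented, and the likelihood array is assumed to have one row per class and one column per word.

```python
import numpy as np


def sequential_posterior(priors, likelihoods, present):
    """Update [P(ham), P(spam)] one word at a time, renormalising after each step."""
    posteriors = np.array(priors, dtype=float)
    for i, is_present in enumerate(present):
        p = likelihoods[:, i]                      # P(word i present | class)
        posteriors *= p if is_present else 1.0 - p
        posteriors /= posteriors.sum()             # keep the vector summing to 1
    return posteriors


priors = [0.9, 0.1]                                # P(ham), P(spam)
likelihoods = np.array([[0.02, 0.30],              # P(word present | ham)
                        [0.60, 0.10]])             # P(word present | spam)

# Word 0 appears in the message, word 1 does not.
post = sequential_posterior(priors, likelihoods, present=[True, False])

# Same posterior computed in one shot from the product formula.
direct = np.array([0.9 * 0.02 * 0.70, 0.1 * 0.60 * 0.90])
direct /= direct.sum()

print(post)
print(np.allclose(post, direct))
```

Normalising after every factor does not change the final posterior, because each normalisation only rescales both entries by the same constant.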


## Question 5

Train classifiers `train` and `train2` on the training set. Evaluate both on training and validation.

In [21]:
nb1 = NaiveBayesForSpam()
nb1.train(ham_train, spam_train)

train_acc1, train_conf1 = evaluate_model(nb1, train_df)
val_acc1, val_conf1 = evaluate_model(nb1, val_df)

print("Classifier train()")
print("train accuracy:", train_acc1)
print("train confusion:\n", train_conf1)
print("val accuracy:", val_acc1)
print("val confusion:\n", val_conf1)


Classifier train()
train accuracy: 0.975
train confusion:
 [[1722.   20.]
 [  30.  228.]]
val accuracy: 0.958
val confusion:
 [[845.  27.]
 [ 15. 113.]]


In [22]:
nb2 = NaiveBayesForSpam()
nb2.train2(ham_train, spam_train)

train_acc2, train_conf2 = evaluate_model(nb2, train_df)
val_acc2, val_conf2 = evaluate_model(nb2, val_df)

print("Classifier train2()")
print("train accuracy:", train_acc2)
print("train confusion:\n", train_conf2)
print("val accuracy:", val_acc2)
print("val confusion:\n", val_conf2)


Classifier train2()
train accuracy: 0.982
train confusion:
 [[1750.   34.]
 [   2.  214.]]
val accuracy: 0.959
val confusion:
 [[855.  36.]
 [  5. 104.]]


## Question 6

Using the validation set, explore how each classifier performs out of sample.

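Out-of-sample performance is given by the validation accuracies and confusion matrices printed in Question 5. Both classifiers lose accuracy relative to the training set: `train` drops from 0.975 to 0.958 and `train2` from 0.982 to 0.959, so their validation accuracies are nearly identical.

The confusion matrices show where they differ. On validation, `train` flags 15 ham messages as spam and misses 27 spam messages, whereas `train2` flags only 5 ham messages but misses 36 spam messages. `train2` is therefore the more conservative spam detector: fewer false positives, more false negatives.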
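
Precision and recall for the spam class make this comparison more concrete. The sketch below recomputes them from the validation confusion matrices printed in Question 5; `spam_metrics` is a helper introduced here for illustration, and it assumes rows are predictions and columns are true labels, in the order (ham, spam).

```python
import numpy as np


def spam_metrics(confusion):
    """Return (precision, recall) for the spam class."""
    tp = confusion[1, 1]   # predicted spam, truly spam
    fp = confusion[1, 0]   # predicted spam, truly ham
    fn = confusion[0, 1]   # predicted ham, truly spam
    return tp / (tp + fp), tp / (tp + fn)


# Validation confusion matrices copied from the outputs above.
val_conf_train = np.array([[845.0, 27.0], [15.0, 113.0]])
val_conf_train2 = np.array([[855.0, 36.0], [5.0, 104.0]])

for name, conf in [("train", val_conf_train), ("train2", val_conf_train2)]:
    precision, recall = spam_metrics(conf)
    print(f"{name}: spam precision={precision:.3f} recall={recall:.3f}")
# train: spam precision=0.883 recall=0.807
# train2: spam precision=0.954 recall=0.743
```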


## Question 7

Why is `train2` faster? Why does it yield better accuracy on both training and validation?

`train2` is faster because it keeps a smaller set of words (features). Prediction loops over `self.words`, so fewer features means fewer multiplications and normalisations per message.

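It can yield better accuracy because it removes weak or noisy words and focuses on highly discriminative “spam keywords”. With the full vocabulary, many words occur at similar rates in both classes; each contributes a small, poorly estimated likelihood factor, and together these factors can outweigh the few informative words. Dropping them reduces variance and can improve generalisation. In this run the gain is 0.975 to 0.982 on the training set but only 0.958 to 0.959 on validation, so the out-of-sample improvement is marginal.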


## Question 8

How many false positives (ham classified as spam) on the validation set? How to reduce false positives at the expense of more false negatives?

In [23]:
# Confusion matrix layout from naive_bayes.py score()
# (rows = predicted label, columns = true label; spam is the positive class):
# confusion[0,0] = pred ham,  true ham   (true negatives)
# confusion[0,1] = pred ham,  true spam  (false negatives: spam missed)
# confusion[1,0] = pred spam, true ham   (false positives)
# confusion[1,1] = pred spam, true spam  (true positives)

fp_val_train = int(val_conf1[1,0])
fp_val_train2 = int(val_conf2[1,0])

print("False positives on validation (train):", fp_val_train)
print("False positives on validation (train2):", fp_val_train2)


False positives on validation (train): 15
False positives on validation (train2): 5


To reduce false positives, increase the threshold required to label a message as spam.

In this implementation, the decision rule is effectively `predict spam if posterior_spam >= 0.5`. Raising that threshold (for example to 0.7 or 0.8) makes the classifier more conservative about predicting spam, which reduces false positives but can increase false negatives.


In [24]:
# Example: thresholded prediction wrapper (does not modify the original file).
# Intended for spam_threshold >= 0.5: it can demote spam to ham but never promotes ham to spam.
def predict_with_threshold(model: NaiveBayesForSpam, message: str, spam_threshold: float = 0.5):
    label, prob = model.predict(message)
    # model.predict returns ['ham', posterior_ham] or ['spam', posterior_spam]
    if label == "spam":
        return "spam" if prob >= spam_threshold else "ham"
    # label == "ham": posterior_spam = 1 - posterior_ham < 0.5, so it is below any threshold >= 0.5
    return "ham"

def score_with_threshold(model: NaiveBayesForSpam, df: pd.DataFrame, spam_threshold: float):
    confusion = np.zeros((2,2))
    for m, true_label in zip(df["sms_clean"], df["label"]):
        pred_label = predict_with_threshold(model, m, spam_threshold=spam_threshold)
        if pred_label == "ham" and true_label == "ham":
            confusion[0,0] += 1
        elif pred_label == "ham" and true_label == "spam":
            confusion[0,1] += 1
        elif pred_label == "spam" and true_label == "ham":
            confusion[1,0] += 1
        elif pred_label == "spam" and true_label == "spam":
            confusion[1,1] += 1
    acc = (confusion[0,0] + confusion[1,1]) / confusion.sum()
    return acc, confusion

# Demonstrate the tradeoff for train2 model
for t in [0.5, 0.6, 0.7, 0.8]:
    acc_t, conf_t = score_with_threshold(nb2, val_df, spam_threshold=t)
    print(f"threshold={t:.1f} acc={acc_t:.4f} FP={int(conf_t[1,0])} FN(spam->ham)={int(conf_t[0,1])}")


threshold=0.5 acc=0.9590 FP=5 FN(spam->ham)=36
threshold=0.6 acc=0.9580 FP=5 FN(spam->ham)=37
threshold=0.7 acc=0.9590 FP=4 FN(spam->ham)=37
threshold=0.8 acc=0.9590 FP=3 FN(spam->ham)=38
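
The wrapper above only handles thresholds of 0.5 or more. Because the two posteriors sum to one, the spam posterior can be recovered from either return value, which also allows thresholds below 0.5. A minimal sketch, assuming `predict` returns the winning label together with that label's posterior (as noted in the wrapper's comment); `spam_posterior` and `classify` are names introduced here for illustration:

```python
def spam_posterior(label, prob):
    """Convert a (label, posterior-of-that-label) pair into P(spam | message)."""
    return prob if label == "spam" else 1.0 - prob


def classify(label, prob, spam_threshold=0.5):
    """Label a message as spam when its spam posterior reaches the threshold."""
    return "spam" if spam_posterior(label, prob) >= spam_threshold else "ham"


# A ham prediction with posterior_ham = 0.65 has posterior_spam = 0.35.
print(classify("ham", 0.65, spam_threshold=0.3))   # spam
# A spam prediction with posterior_spam = 0.65 is demoted at a strict threshold.
print(classify("spam", 0.65, spam_threshold=0.8))  # ham
```

Lowering the threshold below 0.5 moves the tradeoff the other way: fewer missed spam messages at the cost of more false positives.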


## Question 9

Assuming missing words are $X_j= x_j, \ldots, X_k=x_k$ with $k \le p$, how to change the formula for $P(Y=C_j\mid X_1=x_1,\ldots,X_p=x_p)$?

If the variables $X_j, \ldots, X_k$ are missing, the posterior should condition only on the observed features.

With Naïve Bayes, this corresponds to dropping the likelihood factors associated with the missing variables:

$$
P(Y = C \mid X_i = x_i,\ i \in \text{obs})
\;\propto\;
P(Y = C)\prod_{i \in \text{obs}} P(X_i = x_i \mid Y = C)
$$

where **obs** denotes the set of indices of the observed variables, that is, all indices in $1, \ldots, p$ except $j, \ldots, k$.

This is exactly what marginalising over the missing variables gives. Under the conditional-independence assumption, summing the joint likelihood over every value of a missing $X_i$ leaves the factor $\sum_{x} P(X_i = x \mid Y = C) = 1$, so the missing variable drops out of the product.
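
The equivalence between dropping a factor and marginalising over it can be checked numerically. The sketch below uses invented priors and likelihoods for three binary features, with the second one missing; all names and numbers are assumptions for illustration only.

```python
import itertools

import numpy as np

priors = np.array([0.8, 0.2])                  # P(ham), P(spam)
likelihoods = np.array([[0.05, 0.40, 0.10],    # P(X_i = 1 | ham)
                        [0.50, 0.30, 0.60]])   # P(X_i = 1 | spam)
x = [1, None, 0]                               # None marks a missing feature


def bernoulli(p, value):
    """P(X = value | class) for a binary feature with P(X = 1 | class) = p."""
    return p if value == 1 else 1.0 - p


def posterior_dropping(x):
    """Skip the likelihood factor of every missing feature."""
    post = priors.copy()
    for i, value in enumerate(x):
        if value is not None:
            post = post * bernoulli(likelihoods[:, i], value)
    return post / post.sum()


def posterior_marginalising(x):
    """Sum the joint probability over all values of the missing features."""
    missing = [i for i, value in enumerate(x) if value is None]
    total = np.zeros(2)
    for fill in itertools.product([0, 1], repeat=len(missing)):
        completed = list(x)
        for i, value in zip(missing, fill):
            completed[i] = value
        joint = priors.copy()
        for i, value in enumerate(completed):
            joint = joint * bernoulli(likelihoods[:, i], value)
        total += joint
    return total / total.sum()


print(posterior_dropping(x))
print(posterior_marginalising(x))
print(np.allclose(posterior_dropping(x), posterior_marginalising(x)))
```

Dropping the factor costs one skipped multiplication per missing word, whereas explicit marginalisation enumerates every combination of missing values, so the first form is the practical one to implement.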


## Question 10

Modify the prediction function to implement the missing-word change and report accuracies on `test1` with both `train` and `train2`.

In [25]:
# Modified predictor: skip updates for censored words (treat as missing, not absent).
class NaiveBayesForSpamMissing(NaiveBayesForSpam):
    def predict_missing(self, message: str, censored_words: set[str]):
        posteriors = np.copy(self.priors)
        msg = message.lower()
        for i, w in enumerate(self.words):
            if w in censored_words:
                # missing feature: do not multiply by P(X=1|Y) or P(X=0|Y)
                continue
            if w in msg:
                posteriors *= self.likelihoods[:, i]
            else:
                posteriors *= np.ones(2) - self.likelihoods[:, i]
            posteriors = posteriors / np.linalg.norm(posteriors, ord=1)
        if posteriors[0] > 0.5:
            return ['ham', posteriors[0]]
        return ['spam', posteriors[1]]

def score_missing(model: NaiveBayesForSpamMissing, df: pd.DataFrame, censored_words: set[str]):
    confusion = np.zeros((2,2))
    for m, true_label in zip(df["sms_clean"], df["label"]):
        pred_label = model.predict_missing(m, censored_words)[0]
        if pred_label == "ham" and true_label == "ham":
            confusion[0,0] += 1
        elif pred_label == "ham" and true_label == "spam":
            confusion[0,1] += 1
        elif pred_label == "spam" and true_label == "ham":
            confusion[1,0] += 1
        elif pred_label == "spam" and true_label == "spam":
            confusion[1,1] += 1
    acc = (confusion[0,0] + confusion[1,1]) / confusion.sum()
    return acc, confusion

cens1_set = set([preprocess_sms(w) for w in censored_test1])

nb1m = NaiveBayesForSpamMissing()
nb1m.train(ham_train, spam_train)

nb2m = NaiveBayesForSpamMissing()
nb2m.train2(ham_train, spam_train)

test1_acc1, test1_conf1 = score_missing(nb1m, test1_df, cens1_set)
test1_acc2, test1_conf2 = score_missing(nb2m, test1_df, cens1_set)

print("TEST1 with missing-word handling")
print("train()  accuracy:", test1_acc1)
print("train()  confusion:\n", test1_conf1)
print("train2() accuracy:", test1_acc2)
print("train2() confusion:\n", test1_conf2)


TEST1 with missing-word handling
train()  accuracy: 0.9657587548638132
train()  confusion:
 [[1090.   24.]
 [  20.  151.]]
train2() accuracy: 0.9735408560311284
train2() confusion:
 [[1105.   29.]
 [   5.  146.]]


## Question 11

Repeat Question 10 for `test2` and briefly report findings.

In [26]:
cens2_set = set([preprocess_sms(w) for w in censored_test2])

test2_acc1, test2_conf1 = score_missing(nb1m, test2_df, cens2_set)
test2_acc2, test2_conf2 = score_missing(nb2m, test2_df, cens2_set)

print("TEST2 with missing-word handling")
print("train()  accuracy:", test2_acc1)
print("train()  confusion:\n", test2_conf1)
print("train2() accuracy:", test2_acc2)
print("train2() confusion:\n", test2_conf2)


TEST2 with missing-word handling
train()  accuracy: 0.9657853810264385
train()  confusion:
 [[1088.   30.]
 [  14.  154.]]
train2() accuracy: 0.9673405909797823
train2() confusion:
 [[1099.   39.]
 [   3.  145.]]


**Brief findings.**

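With missing-word handling, the model no longer treats a censored word as evidence of absence; its likelihood factor is simply skipped. On `test1` the accuracies are 0.966 (`train`) and 0.974 (`train2`). On `test2`, where the censored list is about three times longer (1456 words versus 485), they are 0.966 and 0.967. Both classifiers therefore stay at or above their validation accuracies (0.958 and 0.959) despite the censoring. `train2` keeps its low false-positive count (5 on `test1`, 3 on `test2`) but misses more spam than `train` (29 versus 24 on `test1`, 39 versus 30 on `test2`).

The unmodified predictor was not run on the censored test sets, so the size of the improvement from this change is not measured here. The benefit is expected to be largest for `train2` if many of its retained spam keywords are censored, because each keyword wrongly treated as absent would push the posterior towards ham.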
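
One way to check how strongly censoring affects each classifier is to measure the share of its vocabulary that appears in a censored list. A minimal sketch on toy data; `censored_fraction` is a helper introduced here, not part of `naive_bayes.py`:

```python
def censored_fraction(vocabulary, censored_words):
    """Fraction of a model's vocabulary that appears in the censored list."""
    vocabulary = set(vocabulary)
    if not vocabulary:
        return 0.0
    return len(vocabulary & set(censored_words)) / len(vocabulary)


# Toy values, invented for this example.
toy_vocabulary = ["free", "prize", "call", "lunch"]
toy_censored = {"free", "prize", "coffee"}

print(censored_fraction(toy_vocabulary, toy_censored))  # 0.5
```

In this notebook the same helper could be called as `censored_fraction(nb2m.words, cens1_set)`, assuming the trained model exposes its vocabulary as `words`.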
