# Machine Learning - Group Assignment 2

- Ankita Kokkera - 06032419
- Aria Wang - 06047688
- Tsamara Esperanti Erwin - 06042275
- Jean-Marc Yao - 06055972
- Amer Mulla - 06027165

In [43]:
import pandas as pd
import numpy as np
from naive_bayes import NaiveBayesForSpam

### 1. Load the data files and list files

In [44]:
training = pd.read_csv("training.txt")
validation = pd.read_csv("validation.txt")
test1 = pd.read_csv("test1.txt")
test2 = pd.read_csv("test2.txt")

with open("censored_list_test1.txt", "r", encoding="utf-8") as f:
    test1_list = f.read().splitlines()

with open("censored_list_test2.txt", "r", encoding="utf-8") as f:
    test2_list = f.read().splitlines()


### 2. Pre-process the SMS messages

In [45]:
for df in [training, validation, test1, test2]:
    df["sms"] = (
        df["sms"]
          .str.lower()
          .str.replace(r"[0-9]", "", regex=True)
          .str.replace(r"[^\w\s]", "", regex=True)
          .str.replace(r"\s+", " ", regex=True)
          .str.strip()
    )
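
To make the effect of the cleaning steps concrete, the following minimal sketch applies the same four regular-expression steps to a single string using the standard `re` module. The helper name `clean_sms` and the sample message are hypothetical and not part of the assignment code.

```python
import re

def clean_sms(text):
    """Apply the same cleaning steps as the pandas pipeline to one string."""
    text = text.lower()                  # lower-case everything
    text = re.sub(r"[0-9]", "", text)    # drop digits
    text = re.sub(r"[^\w\s]", "", text)  # drop punctuation
    text = re.sub(r"\s+", " ", text)     # collapse runs of whitespace
    return text.strip()                  # trim leading and trailing spaces

print(clean_sms("WINNER!! Call 0800-123 now..."))  # winner call now
```

Note that `\w` also matches the underscore, so underscores survive this cleaning.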

### 3. Review the provided Naïve Bayes code

### 4. Explain the model functions

**`train`**: fits the Naïve Bayes model using the full vocabulary from the training set by estimating the prior probabilities for ham and spam and the word likelihood probabilities for each class. It stores these parameters so that posterior probabilities can be computed later in `predict` for new messages.

**`train2`**: fits a Naïve Bayes model by estimating the same prior probabilities and word likelihood probabilities, but applies feature selection: it retains only words that are at least 20 times more likely to appear in spam than in ham. It stores these filtered parameters so that `predict` can later compute posterior probabilities from a reduced set of spam-indicative words.

**Difference between `train` and `train2`**: The main difference is feature selection and the resulting vocabulary size. `train` uses the full vocabulary from the training set, including many neutral or common words, whereas `train2` keeps only words that are much more likely to appear in spam than in ham, producing a smaller, more spam-focused model.
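
The selection rule described for `train2` can be illustrated with a minimal sketch. It assumes the per-word likelihoods for each class have already been estimated. The function name `select_spam_words`, the `ratio` parameter and the toy numbers are hypothetical, and the provided `naive_bayes.py` may implement the filter differently.

```python
def select_spam_words(words, like_ham, like_spam, ratio=20.0):
    """Keep words whose spam likelihood is at least `ratio` times the ham likelihood."""
    kept = []
    for w, p_ham, p_spam in zip(words, like_ham, like_spam):
        if p_spam >= ratio * p_ham:
            kept.append(w)
    return kept

# Toy likelihoods P(word present | class), invented for illustration only.
words = ["free", "call", "prize", "the"]
like_ham = [0.010, 0.050, 0.001, 0.400]
like_spam = [0.250, 0.300, 0.080, 0.380]

print(select_spam_words(words, like_ham, like_spam))
```

With these toy values only `free` and `prize` pass the filter, while common words such as `the` are dropped even though they are frequent in spam.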

**`predict`**: applies Bayes’ Theorem under the Naïve Bayes independence assumption to compute the posterior probabilities of ham and spam for a given message. It then assigns the label with the higher posterior probability and returns that label with its probability.

**`score`**: evaluates the performance of the classifier on a labelled dataset by using `predict` to generate a predicted class for each message and comparing it to the true class. It returns the prediction accuracy and the confusion matrix.

**Bayes’ Theorem** is applied inside the `for` loop in `predict`, where the prior probabilities are updated using the learned word likelihood probabilities to form posterior probabilities for ham and spam under the Naïve Bayes independence assumption.
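
The posterior update described above can be sketched in plain Python. This is an illustration under stated assumptions rather than the provided implementation: the function name `posterior`, the two-word vocabulary and all probabilities are hypothetical, and presence is tested by substring matching.

```python
def posterior(message, words, priors, likelihoods):
    """Return normalised [P(ham | message), P(spam | message)].

    likelihoods[c][i] is P(word i present | class c), with c = 0 for ham
    and c = 1 for spam.
    """
    post = list(priors)
    for i, w in enumerate(words):
        present = w in message
        for c in range(2):
            p = likelihoods[c][i]
            # Multiply by P(present | c) or P(absent | c) for this word.
            post[c] *= p if present else 1.0 - p
        # Renormalise so that the two values sum to 1 after every word.
        total = sum(post)
        post = [v / total for v in post]
    return post

# Toy parameters, invented for illustration only.
words = ["free", "call"]
priors = [0.86, 0.14]
likelihoods = [[0.01, 0.05],   # ham
               [0.25, 0.30]]   # spam

p_ham, p_spam = posterior("free entry call now", words, priors, likelihoods)
print(round(p_ham, 3), round(p_spam, 3))
```

Even with a small spam prior, two words that are far more likely under spam are enough to push the posterior towards spam in this toy setting.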

### 5. Train the two classifiers

In [46]:
train_ham = training[training["label"] == "ham"]["sms"].tolist()
train_spam = training[training["label"] == "spam"]["sms"].tolist()

nb_all = NaiveBayesForSpam()
nb_all.train(train_ham, train_spam)

nb_spam = NaiveBayesForSpam()
nb_spam.train2(train_ham, train_spam)

### 6. Evaluate performance on the validation set

In [47]:
validation_labels = validation["label"].tolist()
validation_sms = validation["sms"].tolist()

In [48]:
# Evaluate train()
acc_all, conf_all = nb_all.score(validation_sms, validation_labels)
print("Classifier using train():")
print("Accuracy:", acc_all)
print("Confusion matrix:\n", conf_all)

Classifier using train():
Accuracy: 0.955
Confusion matrix:
 [[844.  29.]
 [ 16. 111.]]


In [49]:
# Evaluate train2()
acc_spam, conf_spam = nb_spam.score(validation_sms, validation_labels)
print("Classifier using train2():")
print("Accuracy:", acc_spam)
print("Confusion matrix:\n", conf_spam)

Classifier using train2():
Accuracy: 0.963
Confusion matrix:
 [[856.  33.]
 [  4. 107.]]


### 7. Explain why `train2` is faster and more accurate

The classifier fitted with `train2` is faster than the one fitted with `train` because feature selection reduces the dimensionality of the model. By keeping only words that are much more likely to appear in spam than in ham, `train2` produces a much smaller `self.words` list, so `predict` (and therefore `score`) performs fewer posterior updates per message.

`train2` also yields better accuracy on the validation set (0.963 versus 0.955) because it concentrates on highly informative spam indicators and reduces the influence of common words that appear in both classes, which can add noise. This more focused feature set can generalise better when the removed words are largely uninformative.


### 8. Count false positives and reduce them at the expense of false negatives

False positives are ham messages classified as spam. In the confusion matrix returned by `score`, rows correspond to the predicted class and columns correspond to the true class, so false positives are the entries in row predicted spam and column true ham, which is `conf[1, 0]`. Using `train`, the number of false positives on the validation set is 16, and using `train2`, it is 4.
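
As a check on this reading of the matrix, the sketch below recomputes the accuracy and error counts from the two validation confusion matrices printed in section 6. The helper name `summarise` is hypothetical, and the layout (rows are predicted classes, columns are true classes, index 0 is ham) is the one stated above.

```python
def summarise(conf):
    """conf[r][c]: r = predicted class, c = true class; 0 = ham, 1 = spam."""
    total = sum(sum(row) for row in conf)
    correct = conf[0][0] + conf[1][1]
    return {
        "accuracy": correct / total,
        "false_positives": conf[1][0],  # predicted spam, truly ham
        "false_negatives": conf[0][1],  # predicted ham, truly spam
    }

conf_train = [[844, 29], [16, 111]]    # validation output for train()
conf_train2 = [[856, 33], [4, 107]]    # validation output for train2()

print(summarise(conf_train))
print(summarise(conf_train2))
```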

To reduce false positives at the expense of potentially increasing false negatives, the decision threshold for predicting spam can be raised in `predict`. Instead of predicting spam whenever ham fails to exceed a posterior of 0.5, require $P(\text{spam}\mid X) > \tau$ for some $\tau > 0.5$ (for example, $\tau = 0.8$). This makes the classifier more conservative about labelling a message as spam and so reduces ham-to-spam errors.
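
A minimal sketch of such a thresholded decision rule is given below. The function name `classify` and the default value of `tau` are hypothetical; in the actual class the same comparison would replace the final `if` statement in `predict`.

```python
def classify(p_spam, tau=0.8):
    """Label a message as spam only when P(spam | X) exceeds the threshold tau."""
    if p_spam > tau:
        return "spam"
    return "ham"

# Compare the usual 0.5 threshold with a stricter one on a few posteriors.
for p in (0.55, 0.75, 0.90):
    print(p, classify(p, tau=0.5), classify(p, tau=0.8))
```

Messages with a spam posterior between 0.5 and $\tau$ switch from spam to ham, which is exactly the trade of false positives for false negatives.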


### 9. Handle missing words

When no features are missing, the formula takes the product over all $p$ features, so the index runs over $i = 1,\ldots,p$, as in the standard Naïve Bayes formula. When some words are missing, the corresponding features $X_j,\ldots,X_k$ are not observed, so the product can no longer run over all indices. Instead, it is restricted to the observed indices only.

That is, we replace the full index set $\{1,\ldots,p\}$ with the set $Q = \{1,\ldots,p\} \setminus \{j,\ldots,k\}$, which contains every index from 1 to $p$ except the missing indices $j$ to $k$. The formula is then given by

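$$
P(Y = C_c \mid X_{\text{obs}}) \propto P(Y = C_c)\prod_{i \in Q} P(X_i = x_i \mid Y = C_c),
$$

where $c$ indexes the class (ham or spam), so that the class index is not confused with the missing feature indices $j,\ldots,k$.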
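
The restriction to $Q$ follows from marginalising over the unobserved features. As a small worked illustration (the choice of $p = 4$ with $X_2$ and $X_3$ missing is hypothetical, and $C$ stands for any one fixed class), we have $Q = \{1, 4\}$ and

$$
\begin{aligned}
P(Y = C \mid x_1, x_4) &\propto P(Y = C)\, P(x_1 \mid C)\, P(x_4 \mid C) \sum_{x_2} P(x_2 \mid C) \sum_{x_3} P(x_3 \mid C) \\
&= P(Y = C)\, P(x_1 \mid C)\, P(x_4 \mid C),
\end{aligned}
$$

because each sum runs over all values of a missing feature and therefore equals 1. Under the Naïve Bayes independence assumption, a missing feature simply drops out of the product.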

### 10. Modify `predict` for missing words and evaluate on `test1`

In [50]:
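def predict(self, message):
    # Start from the class priors: index 0 is ham, index 1 is spam.
    posteriors = np.copy(self.priors)
    msg = message.lower()

    # Words censored in the current test set; empty if the attribute is not set.
    censored_words = getattr(self, "censored_words", set())

    for i, w in enumerate(self.words):
        # A censored word is unobserved, so it contributes no factor to the product.
        if w in censored_words:
            continue

        if w in msg:
            posteriors *= self.likelihoods[:, i]
        else:
            posteriors *= 1.0 - self.likelihoods[:, i]

        # Renormalise so the two posteriors sum to 1 after every update.
        posteriors = posteriors / np.linalg.norm(posteriors, ord=1)

    if posteriors[0] > 0.5:
        return ["ham", posteriors[0]]
    return ["spam", posteriors[1]]


# Bind the modified function to the class so that `score` uses it,
# including for the instances that were already trained above.
NaiveBayesForSpam.predict = predict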

In [None]:
test1_labels = test1["label"].tolist()
test1_sms = test1["sms"].tolist()
censored_test1 = set(test1_list)

nb_all.censored_words = censored_test1
acc_test1_all, conf_test1_all = nb_all.score(test1_sms, test1_labels)
print("Test1 classifier using train():")
print("Accuracy:", acc_test1_all)
print("Confusion matrix:\n", conf_test1_all)

nb_spam.censored_words = censored_test1
acc_test1_spam, conf_test1_spam = nb_spam.score(test1_sms, test1_labels)
print("\nTest1 classifier using train2():")
print("Accuracy:", acc_test1_spam)
print("Confusion matrix:\n", conf_test1_spam)

### 11. Evaluate on `test2` and summarise the results

In [None]:
test2_labels = test2["label"].tolist()
test2_sms = test2["sms"].tolist()
censored_test2 = set(test2_list)

nb_all.censored_words = censored_test2
acc_test2_all, conf_test2_all = nb_all.score(test2_sms, test2_labels)
print("Test2 classifier using train():")
print("Accuracy:", acc_test2_all)
print("Confusion matrix:\n", conf_test2_all)

nb_spam.censored_words = censored_test2
acc_test2_spam, conf_test2_spam = nb_spam.score(test2_sms, test2_labels)
print("\nTest2 classifier using train2():")
print("Accuracy:", acc_test2_spam)
print("Confusion matrix:\n", conf_test2_spam)

Both models perform better on `test1` than on `test2`. This is because the greater censoring in `test2` reduces the amount of observable evidence available to update the posterior probabilities. Using `train`, the accuracy drops from 0.970 on `test1` to 0.949 on `test2` (a fall of 0.021), and the confusion matrix shows more misclassifications on `test2`, most notably 57 false negatives (spam messages predicted as ham) compared with 28 on `test1`. Using `train2`, the accuracy also drops, from 0.974 on `test1` to 0.961 on `test2` (a fall of 0.013), but it still outperforms `train` on both test sets and produces fewer false positives (ham messages predicted as spam) and fewer false negatives.