## EX4
`Author:    Hongru He`<br>
`Date:      01/23/2026`

#### Setting Notice
1. Place the dataset folders in the same directory as this notebook.
2. Set `BASE_DIR` to the exact path of that directory.
3. Make sure the dataset folder names match the entries in `ham_dirs` and `spam_dirs`, which are passed to the function `load_emails_from_dir`.
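
The setup steps above can be checked before any email is loaded. The following is a minimal sketch; the helper name `missing_dataset_dirs` is an assumption introduced here for illustration and is not part of the notebook, and `"."` stands in for `BASE_DIR`.

```python
import os

def missing_dataset_dirs(base_dir, dir_names):
    """Return the names in dir_names that are not directories under base_dir."""
    return [d for d in dir_names if not os.path.isdir(os.path.join(base_dir, d))]

# Folder names used later in this notebook (ham_dirs + spam_dirs).
expected = ["easy_ham", "easy_ham_2", "hard_ham", "spam", "spam_2"]
missing = missing_dataset_dirs(".", expected)  # "." is a placeholder for BASE_DIR
if missing:
    print("Missing dataset folders:", missing)
```

Running a check like this first gives a clear message instead of a `FileNotFoundError` from `os.listdir` halfway through loading.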

#### Dataset Selection Reasoning
The 2003 (20030228) SpamAssassin dataset was chosen because it provides a clean, self-contained snapshot that includes both spam and non-spam (ham) emails, which allows a consistent binary classification setup. The 2002 datasets were excluded because they are older and less thoroughly cleaned. The 2005 data was not used because it contains only spam messages and no corresponding ham, so adding it would skew the class ratio and mix messages from different time periods. Restricting the data to a single snapshot also reduces the risk of data leakage between the training and test sets and keeps the evaluation fair and interpretable.
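
The leakage concern mentioned above can be probed directly by looking for identical messages in the loaded texts. This is a rough sketch under stated assumptions: the helper `find_duplicate_groups` is hypothetical, and it only catches messages that are identical after whitespace normalization, so copies that differ in their headers would go unnoticed.

```python
import hashlib

def find_duplicate_groups(texts):
    """Group indices of texts whose whitespace-normalized content is identical."""
    seen = {}
    for i, text in enumerate(texts):
        normalized = " ".join(text.split())
        digest = hashlib.sha1(normalized.encode("utf-8", errors="ignore")).hexdigest()
        seen.setdefault(digest, []).append(i)
    # Keep only hashes shared by more than one message.
    return [indices for indices in seen.values() if len(indices) > 1]

# Toy input; in the notebook this would be the list of loaded email texts.
toy = ["Buy now!", "hello  world", "hello world", "Buy now!"]
duplicate_groups = find_duplicate_groups(toy)
print(duplicate_groups)
```

If any group spans both the training and the test set, the reported metrics would be somewhat inflated.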

### 1. Load emails and assign labels

In [1]:
import os
import re

import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.metrics import classification_report, confusion_matrix

BASE_DIR = "/Users/hhe/Desktop/Academia/MSCS/CPSC5310/EXs/EX4"      # Change this to the path of your own base directory

def load_emails_from_dir(dir_path, label):
    texts = []
    labels = []
    for filename in os.listdir(dir_path):
        # Skip hidden files like .DS_Store
        if filename.startswith('.'):
            continue
        file_path = os.path.join(dir_path, filename)
        if not os.path.isfile(file_path):
            continue
        with open(file_path, "r", encoding="latin-1", errors="ignore") as f:
            text = f.read()
        texts.append(text)
        labels.append(label)
    return texts, labels

ham_dirs = [
    "easy_ham",
    "easy_ham_2",
    "hard_ham",
]
spam_dirs = [
    "spam",
    "spam_2",
]

X_texts = []
y_labels = []

# Ham = 0, Spam = 1
for d in ham_dirs:
    texts, labels = load_emails_from_dir(os.path.join(BASE_DIR, d), label=0)
    X_texts.extend(texts)
    y_labels.extend(labels)

for d in spam_dirs:
    texts, labels = load_emails_from_dir(os.path.join(BASE_DIR, d), label=1)
    X_texts.extend(texts)
    y_labels.extend(labels)

X_texts = np.array(X_texts)
y_labels = np.array(y_labels)

print("Total emails:", len(X_texts))
print("Spam ratio:", y_labels.mean())

Total emails: 6052
Spam ratio: 0.31378056840713814
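
`load_emails_from_dir` treats every regular, non-hidden file as an email. If a corpus folder also contains a helper or index file (some copies of the SpamAssassin archives are reported to include a file named `cmds`), that file would be counted as a message. The sketch below shows one way to skip such files. The `SKIP_NAMES` set and the helper `list_message_files` are assumptions for illustration; check your own copy of the corpus before relying on them.

```python
import os
import tempfile

SKIP_NAMES = {"cmds"}  # assumed non-message file names; adjust to your corpus copy

def list_message_files(dir_path, skip_names=SKIP_NAMES):
    """Return sorted paths of regular, non-hidden files, excluding skip_names."""
    paths = []
    for filename in sorted(os.listdir(dir_path)):
        if filename.startswith(".") or filename in skip_names:
            continue
        file_path = os.path.join(dir_path, filename)
        if os.path.isfile(file_path):
            paths.append(file_path)
    return paths

# Tiny demonstration on a temporary folder.
with tempfile.TemporaryDirectory() as tmp:
    for name in ["0001.msg", "cmds", ".DS_Store"]:
        with open(os.path.join(tmp, name), "w") as f:
            f.write("x")
    demo_names = [os.path.basename(p) for p in list_message_files(tmp)]
    print(demo_names)
```

Sorting the file names makes the load order independent of the filesystem. Note that changing the order or the set of loaded files changes which emails land in the test set, so the metrics reported in this notebook may shift slightly.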


In [2]:
emails_df = pd.DataFrame({"text": X_texts, "label": y_labels})
emails_df.head()

| | text | label |
|---|---|---|
| 0 | From fork-admin@xent.com Tue Sep 24 17:55:30 ... | 0 |
| 1 | From rpm-list-admin@freshrpms.net Mon Sep 9 ... | 0 |
| 2 | From secprog-return-625-jm=jmason.org@security... | 0 |
| 3 | Return-Path: nas@python.ca\nDelivery-Date: Thu... | 0 |
| 4 | From fork-admin@xent.com Thu Aug 29 11:03:51 ... | 0 |


### 2. Split train/test datasets

In [3]:
X_train, X_test, y_train, y_test = train_test_split(
    X_texts,
    y_labels,
    test_size=0.2,
    random_state=42,
    stratify=y_labels,
)

print("Train size:", len(X_train), "Test size:", len(X_test))
print("Train spam ratio:", y_train.mean(), "Test spam ratio:", y_test.mean())

Train size: 4841 Test size: 1211
Train spam ratio: 0.3137781450113613 Test spam ratio: 0.3137902559867878
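
The split sizes and class ratios printed above can be reproduced from the labels alone. The sketch below rebuilds a synthetic label vector with the same class counts (the spam count is recovered from the printed ratio, which is an inference rather than a value printed by the notebook) and applies the same `train_test_split` settings.

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_total = 6052
n_spam = round(0.31378056840713814 * n_total)  # recovered from the printed spam ratio
y = np.array([1] * n_spam + [0] * (n_total - n_spam))
X = np.arange(n_total)  # placeholder features; only the labels matter here

_, _, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print("Train size:", len(y_tr), "Test size:", len(y_te))
print("Test spam:", int(y_te.sum()), "Test ham:", int((y_te == 0).sum()))
```

With `stratify` set, each class is split in roughly the same 80/20 proportion, which is why the train and test spam ratios agree to four decimal places.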


### 3. Text cleaning and vectorization

In [4]:
URL_RE = re.compile(r"(http|https)://\S+|www\.\S+")
NUM_RE = re.compile(r"\d+")

def simple_preprocess(text):
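    # 1. lower-case
    text = text.lower()
    # 2. replace URLs and numbers with placeholders
    #    NOTE: the placeholders are upper-case, so step 3 strips them again;
    #    in effect, URLs and numbers are removed rather than kept as tokens.
    text = URL_RE.sub(" URL ", text)
    text = NUM_RE.sub(" NUMBER ", text)
    # 3. keep only the letters a-z; any other run of characters becomes one space
    text = re.sub(r"[^a-z]+", " ", text)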
    return text
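
In `simple_preprocess`, the placeholders `URL` and `NUMBER` are inserted after lower-casing, so the final `[^a-z]+` substitution removes them again. The variant below is a sketch of how the placeholders could be kept as tokens by writing them in lower case. The function name `preprocess_keep_placeholders` is hypothetical, and none of the results in this notebook were produced with it.

```python
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")
NUM_RE = re.compile(r"\d+")

def preprocess_keep_placeholders(text):
    """Variant of simple_preprocess whose placeholders survive the final filter."""
    text = text.lower()
    # Lower-case placeholders contain only a-z, so the filter below keeps them.
    text = URL_RE.sub(" url ", text)
    text = NUM_RE.sub(" number ", text)
    text = re.sub(r"[^a-z]+", " ", text)
    return text.strip()

example = preprocess_keep_placeholders("Win $1000 at http://example.com/offer NOW!")
print(example)
```

One trade-off of this design is that the placeholder tokens become indistinguishable from the literal words "url" and "number" when those appear in an email.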

#### Pipeline with Logistic Regression

In [5]:
binary_vectorizer = CountVectorizer(
    preprocessor=simple_preprocess,
    binary=True,          # presence/absence
    min_df=2,             # ignore very rare words
)

log_reg_clf = Pipeline([
    ("vect", binary_vectorizer),
    ("clf", LogisticRegression(
        max_iter=1000,
        n_jobs=-1,
    )),
])

log_reg_clf.fit(X_train, y_train)

y_pred = log_reg_clf.predict(X_test)
print(classification_report(y_test, y_pred, target_names=["ham", "spam"]))
print(confusion_matrix(y_test, y_pred))

              precision    recall  f1-score   support

         ham       0.99      0.99      0.99       831
        spam       0.98      0.97      0.98       380

    accuracy                           0.99      1211
   macro avg       0.98      0.98      0.98      1211
weighted avg       0.99      0.99      0.99      1211

[[824   7]
 [ 10 370]]
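
The figures in the classification report follow directly from the confusion matrix. As a worked check, the sketch below recomputes them from the four counts printed above, where rows are the true class and columns the predicted class (ham first, spam second).

```python
import numpy as np

# Confusion matrix for Logistic Regression: rows = true, columns = predicted
cm = np.array([[824, 7],
               [10, 370]])
tn, fp, fn, tp = cm.ravel()

spam_precision = tp / (tp + fp)   # 370 / 377
spam_recall = tp / (tp + fn)      # 370 / 380
ham_precision = tn / (tn + fn)    # 824 / 834
ham_recall = tn / (tn + fp)       # 824 / 831
accuracy = (tp + tn) / cm.sum()   # 1194 / 1211

print(f"spam: precision={spam_precision:.2f} recall={spam_recall:.2f}")
print(f"ham:  precision={ham_precision:.2f} recall={ham_recall:.2f}")
print(f"accuracy={accuracy:.2f}")
```

Here a false positive is a ham email flagged as spam (7 cases) and a false negative is a spam email that was let through (10 cases).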


### 4. Multiple classifiers
#### Multinomial Naive Bayes

In [6]:
count_vectorizer = CountVectorizer(
    preprocessor=simple_preprocess,
    binary=False,   # use counts
    min_df=2,
)

nb_clf = Pipeline([
    ("vect", count_vectorizer),
    ("clf", MultinomialNB()),
])

nb_clf.fit(X_train, y_train)
y_pred_nb = nb_clf.predict(X_test)

print("=== MultinomialNB ===")
print(classification_report(y_test, y_pred_nb, target_names=["ham", "spam"]))
print(confusion_matrix(y_test, y_pred_nb))

=== MultinomialNB ===
              precision    recall  f1-score   support

         ham       0.95      0.96      0.95       831
        spam       0.92      0.88      0.90       380

    accuracy                           0.94      1211
   macro avg       0.93      0.92      0.93      1211
weighted avg       0.94      0.94      0.94      1211

[[800  31]
 [ 46 334]]
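
The Naive Bayes pipeline uses `binary=False` (word counts), whereas the Logistic Regression pipeline uses `binary=True` (word presence). The toy example below, built on two made-up documents, illustrates what that switch changes in the feature matrix.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["free free money", "meeting at noon"]  # made-up documents

counts = CountVectorizer(binary=False).fit(docs)
presence = CountVectorizer(binary=True).fit(docs)

# Column order of both matrices (terms sorted alphabetically)
vocab = sorted(counts.vocabulary_, key=counts.vocabulary_.get)
count_matrix = counts.transform(docs).toarray()
presence_matrix = presence.transform(docs).toarray()

print(vocab)
print(count_matrix)      # repeated words are counted
print(presence_matrix)   # repeated words are capped at 1
```

Only the entry for the repeated word "free" differs between the two matrices, so the two representations diverge most on emails that repeat the same terms many times.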


#### Linear SVM

In [7]:
svm_vectorizer = CountVectorizer(
    preprocessor=simple_preprocess,
    binary=True,
    min_df=2,
)

svm_clf = Pipeline([
    ("vect", svm_vectorizer),
    ("clf", LinearSVC()),
])

svm_clf.fit(X_train, y_train)
y_pred_svm = svm_clf.predict(X_test)

print("=== LinearSVC ===")
print(classification_report(y_test, y_pred_svm, target_names=["ham", "spam"]))
print(confusion_matrix(y_test, y_pred_svm))

=== LinearSVC ===
              precision    recall  f1-score   support

         ham       0.99      0.99      0.99       831
        spam       0.99      0.98      0.98       380

    accuracy                           0.99      1211
   macro avg       0.99      0.99      0.99      1211
weighted avg       0.99      0.99      0.99      1211

[[826   5]
 [  9 371]]


### Summary of Findings
Using the Apache SpamAssassin 2003 corpus, I trained and evaluated three machine learning models for spam classification: Logistic Regression, Multinomial Naive Bayes, and a linear SVM. The dataset contained 6,052 emails with a spam ratio of approximately 31%, and a stratified 80/20 train–test split (4,841 training and 1,211 test emails) preserved this distribution across both sets.

All models used the same basic text preprocessing and bag-of-words feature representations, but their performance varied noticeably. Multinomial Naive Bayes, trained on word counts, served as a reasonable baseline, achieving 94% overall accuracy. However, it misclassified 77 of the 1,211 test emails, 46 of them false negatives (spam classified as ham), which resulted in the lowest spam recall of the three models (0.88). This behavior is consistent with Naive Bayes’ strong independence assumptions, which are often violated in real email text.

The two discriminative linear models, both trained on binary word-presence features, performed substantially better. Logistic Regression reached 99% accuracy, with spam precision of 0.98 and spam recall of 0.97 (7 false positives and 10 false negatives), indicating a good balance between the two error types. The Linear Support Vector Machine (LinearSVC) achieved the best overall results, with spam precision of 0.99, spam recall of 0.98, and the fewest total errors on the test set (14, compared with 17 for Logistic Regression).
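
The comparison of the three models can be recomputed from the confusion matrices printed in the earlier sections. The sketch below counts the off-diagonal entries (total errors) and the unrounded accuracy for each model.

```python
import numpy as np

# Confusion matrices copied from the outputs above (rows = true, columns = predicted)
confusion = {
    "LogisticRegression": np.array([[824, 7], [10, 370]]),
    "MultinomialNB": np.array([[800, 31], [46, 334]]),
    "LinearSVC": np.array([[826, 5], [9, 371]]),
}

summary = {}
for name, cm in confusion.items():
    errors = int(cm.sum() - np.trace(cm))   # sum of off-diagonal entries
    accuracy = float(np.trace(cm) / cm.sum())
    summary[name] = (errors, accuracy)
    print(f"{name:<20} errors={errors:>3} accuracy={accuracy:.4f}")
```

The gap between LinearSVC and Logistic Regression is only three emails on a test set of 1,211, so the ranking of those two models should be read with some caution.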

Overall, the results show that simple bag-of-words features combined with linear classifiers are highly effective for spam detection. In particular, using binary word-presence features proved sufficient to separate spam from ham with very high accuracy. While these results are likely optimistic compared to real-world email filtering due to the curated nature of the dataset, they clearly demonstrate the effectiveness of linear text classifiers. The role of feature representation is less clear-cut: Naive Bayes was trained on counts while the other two models used binary features, so the comparison does not separate the effect of the classifier from that of the features.
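
One way to examine the effect of feature representation separately from the choice of classifier is to run every classifier with both `binary=False` and `binary=True`. The sketch below shows the shape of such an experiment on a made-up toy corpus. The documents, labels, and the `scores` dictionary are illustrative assumptions, and the resulting numbers say nothing about the SpamAssassin data.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Toy stand-in corpus (hypothetical); replace with X_train/y_train and X_test/y_test.
train_docs = ["free money now", "win free prize", "cheap money offer",
              "meeting at noon", "project status update", "lunch at noon today"]
train_labels = [1, 1, 1, 0, 0, 0]
test_docs = ["free prize offer", "status meeting today"]
test_labels = [1, 0]

# Factories so that each run gets a fresh, unfitted classifier.
classifiers = {
    "MultinomialNB": MultinomialNB,
    "LogisticRegression": lambda: LogisticRegression(max_iter=1000),
    "LinearSVC": LinearSVC,
}

scores = {}
for binary in (False, True):
    for name, make_clf in classifiers.items():
        pipe = Pipeline([
            ("vect", CountVectorizer(binary=binary)),
            ("clf", make_clf()),
        ])
        pipe.fit(train_docs, train_labels)
        scores[(name, binary)] = pipe.score(test_docs, test_labels)

for key, acc in scores.items():
    print(key, acc)
```

On the real data, the vectorizer would also need `preprocessor=simple_preprocess` and `min_df=2` to match the pipelines used in this notebook.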