# Spam Filter  
**Data Science I, Assignment B**  
**Student:** Fabian Augschöll  
**Date:** June 2025  

**Abstract**  
In this notebook we build and evaluate three spam-classification models (Naive Bayes, Logistic Regression, and a linear SVM) on the Apache SpamAssassin corpus. We also compare several text-preprocessing configurations, ranking them by the sum of precision and recall on the spam class.

## Objectives  
1. Load “easy_ham_2” vs. “spam_2” emails  
2. Build a flexible preprocessing pipeline  
3. Vectorize text with a bag‑of‑words model  
4. Train & evaluate three classifiers  
5. Experiment with preprocessing hyperparameters  
6. Demonstrate the best model on a fresh sample  

## Imports & helper functions

In [3]:
import os
import re

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC

# Move up one directory so that the relative 'data' path used below resolves.
os.chdir('..')

def load_emails(directory, label):
    """Read every file in `directory` and pair its text with `label` (0 = ham, 1 = spam)."""
    emails = []
    for fname in os.listdir(directory):
        path = os.path.join(directory, fname)
        try:
            # errors='ignore' drops undecodable bytes instead of raising
            with open(path, 'r', encoding='utf-8', errors='ignore') as f:
                emails.append((f.read(), label))
        except OSError:
            # Skip entries that cannot be opened (e.g. subdirectories)
            continue
    return emails

class EmailPreprocessor:
    """Flexible email cleaner: strip headers, lowercase, URL/NUM replacement, punctuation removal."""
    def __init__(self, strip_headers=True, lowercase=True,
                 remove_punct=True, replace_urls=True, replace_nums=True):
        self.strip_headers, self.lowercase = strip_headers, lowercase
        self.remove_punct, self.replace_urls = remove_punct, replace_urls
        self.replace_nums = replace_nums

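    def preprocess(self, text):
        if self.strip_headers:
            # Headers end at the first blank line; keep the text unchanged if there is none
            parts = text.split('\n\n', 1)
            text = parts[1] if len(parts) > 1 else parts[0]
        if self.lowercase:
            text = text.lower()
        if self.replace_urls:
            # Collapse every http(s) link into a single placeholder token
            text = re.sub(r'https?://\S+', 'URL', text)
        if self.replace_nums:
            # Collapse every run of digits into a single placeholder token
            text = re.sub(r'\d+', 'NUMBER', text)
        if self.remove_punct:
            # Replace anything that is neither a word character nor whitespace
            text = re.sub(r'[^\w\s]', ' ', text)
        # Normalize all whitespace runs to single spaces
        return ' '.join(text.split())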


## Data loading
We’ll load both ham and spam folders, shuffle, then split out texts & labels.

In [4]:
def prepare_data(base_path='data'):
    ham = load_emails(os.path.join(base_path, 'easy_ham_2'), 0)
    spam = load_emails(os.path.join(base_path, 'spam_2'), 1)
    data = ham + spam
    np.random.seed(42)
    np.random.shuffle(data)
    texts, labels = zip(*data)
    return list(texts), np.array(labels)

texts, labels = prepare_data('data')
print(f"Loaded {len(texts)} emails: {labels.sum()} spam, {len(labels)-labels.sum()} ham.")


Loaded 2798 emails: 1397 spam, 1401 ham.


## Text preprocessing
We’ll use the default configuration (strip headers, lowercase, remove punctuation, replace URLs & numbers).

In [5]:
pre = EmailPreprocessor()
processed_texts = [pre.preprocess(t) for t in texts]
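
To make the effect of the default configuration concrete, the sketch below applies the same five steps to a made-up message. `clean_email` and the sample text are hypothetical; the function is a standalone stand-in that mirrors `EmailPreprocessor.preprocess` with all options enabled.

```python
import re

def clean_email(text):
    """Hypothetical standalone copy of the default preprocessing steps."""
    parts = text.split('\n\n', 1)                # headers end at the first blank line
    text = parts[1] if len(parts) > 1 else parts[0]
    text = text.lower()
    text = re.sub(r'https?://\S+', 'URL', text)  # links -> placeholder
    text = re.sub(r'\d+', 'NUMBER', text)        # digit runs -> placeholder
    text = re.sub(r'[^\w\s]', ' ', text)         # punctuation -> space
    return ' '.join(text.split())                # collapse whitespace

# Invented sample: two header lines, a blank line, then the body
raw = "From: a@b.example\nSubject: Hi\n\nVisit https://example.com NOW, only 99 dollars!"
cleaned = clean_email(raw)
print(cleaned)  # visit URL now only NUMBER dollars
```

The placeholders are inserted in upper case after lowercasing, but `CountVectorizer` lowercases its input by default, so they reach the model as the tokens `url` and `number`.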


## Feature extraction  
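We keep the 1,000 most frequent tokens, drop English stop words, and hold out 30% of the emails as a stratified test set. Note that the vectorizer is fitted on the full corpus before the split, so the vocabulary is selected with the test emails included.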

In [6]:
vectorizer = CountVectorizer(max_features=1000, stop_words='english')
X = vectorizer.fit_transform(processed_texts)
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.3, stratify=labels, random_state=42)
print(f"Train/test sizes: {X_train.shape[0]}/{X_test.shape[0]}")


Train/test sizes: 1958/840


## Model training & evaluation
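We train Naive Bayes, Logistic Regression, and a linear-kernel SVM, then report precision and recall for the spam class (label 1). The best model is the one with the highest precision + recall.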

In [7]:
def train_and_report(X_tr, X_te, y_tr, y_te):
    models = {
        'Naive Bayes': MultinomialNB(),
        'Logistic Regression': LogisticRegression(max_iter=1000),
        'SVM': SVC(kernel='linear', probability=True)
    }
    results = {}
    for name, model in models.items():
        model.fit(X_tr, y_tr)
        y_pred = model.predict(X_te)
        p, r = precision_score(y_te, y_pred), recall_score(y_te, y_pred)
        print(f"--- {name} ---")
        print(f"Precision: {p:.3f}, Recall: {r:.3f}")
        print(classification_report(y_te, y_pred, target_names=['Ham', 'Spam']))
        results[name] = {'model': model, 'precision': p, 'recall': r}
    return results

results = train_and_report(X_train, X_test, y_train, y_test)
best_name = max(results, key=lambda k: results[k]['precision']+results[k]['recall'])
print(f"**Best model:** {best_name}\n")


--- Naive Bayes ---
Precision: 0.987, Recall: 0.726
              precision    recall  f1-score   support

         Ham       0.78      0.99      0.88       421
        Spam       0.99      0.73      0.84       419

    accuracy                           0.86       840
   macro avg       0.89      0.86      0.86       840
weighted avg       0.89      0.86      0.86       840

--- Logistic Regression ---
Precision: 0.993, Recall: 0.981
              precision    recall  f1-score   support

         Ham       0.98      0.99      0.99       421
        Spam       0.99      0.98      0.99       419

    accuracy                           0.99       840
   macro avg       0.99      0.99      0.99       840
weighted avg       0.99      0.99      0.99       840

--- SVM ---
Precision: 0.978, Recall: 0.976
              precision    recall  f1-score   support

         Ham       0.98      0.98      0.98       421
        Spam       0.98      0.98      0.98       419

    accuracy              
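
The relationship between the two headline numbers and the per-class report can be checked by hand. The sketch below uses confusion-matrix counts back-calculated from the Naive Bayes figures above (419 spam and 421 ham test emails, spam precision 0.987 and recall 0.726). The counts themselves are an inference, not values printed by the notebook.

```python
# Counts inferred from the reported Naive Bayes metrics (an assumption, not notebook output)
tp = 304            # spam correctly flagged
fn = 419 - tp       # spam missed (115)
fp = 4              # ham wrongly flagged
tn = 421 - fp       # ham correctly passed (417)

spam_precision = tp / (tp + fp)
spam_recall = tp / (tp + fn)
ham_precision = tn / (tn + fn)
ham_recall = tn / (tn + fp)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"Spam: P={spam_precision:.3f}, R={spam_recall:.3f}")  # Spam: P=0.987, R=0.726
print(f"Ham:  P={ham_precision:.2f}, R={ham_recall:.2f}")    # Ham:  P=0.78, R=0.99
print(f"Accuracy={accuracy:.2f}")                            # Accuracy=0.86
```

Read this way, Naive Bayes rarely mislabels ham as spam but lets roughly a quarter of the spam through, which is why its recall trails the other two models.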

## Demonstration
We apply the best model to a new, hand-written email. The sample contains no blank line, so header stripping leaves it unchanged.

In [8]:
sample = "Subject: Win money now! Click http://spam.link"
proc = pre.preprocess(sample)
vec = vectorizer.transform([proc])
pred = results[best_name]['model'].predict(vec)[0]
prob = results[best_name]['model'].predict_proba(vec)[0]
print(f"Prediction: {'SPAM' if pred else 'HAM'} (Ham={prob[0]:.2f}, Spam={prob[1]:.2f})")


Prediction: SPAM (Ham=0.29, Spam=0.71)


## Hyperparameter experimentation
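We compare four preprocessing configurations: the default, and three variants that each switch off one option (header stripping, lowercasing, or punctuation removal). Each is scored by precision + recall using Naive Bayes only.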

In [9]:
def hyperparam_experiment(texts, labels):
    configs = [
        {'strip_headers': True,  'lowercase': True,  'remove_punct': True,  'replace_urls': True,  'replace_nums': True},
        {'strip_headers': False, 'lowercase': True,  'remove_punct': True,  'replace_urls': True,  'replace_nums': True},
        {'strip_headers': True,  'lowercase': False, 'remove_punct': True,  'replace_urls': True,  'replace_nums': True},
        {'strip_headers': True,  'lowercase': True,  'remove_punct': False, 'replace_urls': True,  'replace_nums': True},
    ]
    best, best_cfg = 0, None
    for i, cfg in enumerate(configs, 1):
        pre = EmailPreprocessor(**cfg)
        proc = [pre.preprocess(t) for t in texts]
        X = CountVectorizer(max_features=1000, stop_words='english').fit_transform(proc)
        Xtr, Xte, ytr, yte = train_test_split(X, labels, test_size=0.3,
                                              random_state=42, stratify=labels)
        model = MultinomialNB().fit(Xtr, ytr)
        yp = model.predict(Xte)
        p, r = precision_score(yte, yp), recall_score(yte, yp)
        print(f"Config {i}: P={p:.3f}, R={r:.3f}, Sum={p+r:.3f}")
        if p+r > best:
            best, best_cfg = p+r, cfg
    print(f"\n**Best config:** {best_cfg} (Sum={best:.3f})")

hyperparam_experiment(texts, labels)


Config 1: P=0.987, R=0.726, Sum=1.713
Config 2: P=0.994, R=0.842, Sum=1.837
Config 3: P=0.987, R=0.726, Sum=1.713
Config 4: P=0.987, R=0.726, Sum=1.713

**Best config:** {'strip_headers': False, 'lowercase': True, 'remove_punct': True, 'replace_urls': True, 'replace_nums': True} (Sum=1.837)
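
Configs 1, 3, and 4 score identically. A likely explanation is that `CountVectorizer` lowercases its input and tokenizes on word characters by default, so it removes the difference between these configurations before the classifier sees the data. The sketch below checks this on two made-up strings using the vectorizer's own analyzer.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Analyzer built with the same options the notebook passes to CountVectorizer
analyze = CountVectorizer(max_features=1000, stop_words='english').build_analyzer()

# Made-up strings: one raw, one already lowercased and stripped of punctuation
raw_tokens = analyze("CLICK here!!! Cheap, CHEAP pills...")
clean_tokens = analyze("click here cheap cheap pills")

print(raw_tokens == clean_tokens)  # True
```

Of the three options that were switched off, header stripping is the only one whose effect survives the vectorizer, which is consistent with it being the only one that moves the scores.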


## Conclusions  
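- **Best classifier:** Logistic Regression (Precision=0.993, Recall=0.981), trained on the default preprocessing with headers stripped  
- **Best preprocessing (evaluated with Naive Bayes only):** headers _kept_ (`strip_headers=False`), lowercase _on_, punctuation removed, URLs & numbers replaced. Keeping the headers raised Naive Bayes recall from 0.726 to 0.842; switching off lowercasing or punctuation removal left the scores unchanged.  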
  
