# Spam Filter  
**Data Science I, Assignment B**  
**Student:** Fabian Augschöll  
**Date:** June 2025  

**Abstract**  
In this notebook we build and evaluate three spam‑classification models (Naive Bayes, Logistic Regression, SVM) on the Apache SpamAssassin corpus. We’ll also experiment with different text‑preprocessing pipelines to maximize combined precision + recall.

## Objectives  
1. Load “hard_ham” and “spam_2” emails for training (“easy_ham_2” and “spam” are held back for the final demonstration)  
2. Build a flexible preprocessing pipeline  
3. Vectorize text with a bag‑of‑words model  
4. Train & evaluate three classifiers  
5. Experiment with preprocessing hyperparameters  
6. Demonstrate the best model on a fresh sample  

## Dataset Overview
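We use the Apache SpamAssassin public corpus, which contains emails labeled as "ham" (legitimate) or "spam" (unwanted). Four of its folders appear in this notebook:

- `hard_ham`: Ham messages that resemble spam in vocabulary or structure. Used for training.  
- `spam_2`: Classic spam messages from diverse sources. Used for training.  
- `easy_ham_2`: Straightforward, non‐suspicious ham messages. Held back for the final demonstration.  
- `spam`: A second folder of spam messages. Held back for the final demonstration.  

After loading and shuffling the two training folders, we split them 70/30 (stratified by label) into training and test sets, so that each classifier is evaluated on emails it was not fitted on.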

## Imports & helper functions

## Preprocessing Pipeline
Raw email text often contains headers, HTML markup, URLs, numbers, and punctuation—elements that can both aid and hinder classification. We implement a flexible `EmailPreprocessor` class that supports the following steps:

1. **Header Stripping**: Removes the message headers (e.g., `From`, `Subject`) to prevent overfitting to specific senders.  
2. **Lowercasing**: Standardizes the case to reduce feature dimensionality.  
3. **URL Replacement**: Substitutes URLs with a placeholder token (`URL`) to capture the presence of links without over‑specificity.  
4. **Number Replacement**: Maps numeric sequences to a token (`NUMBER`) to detect offers or price references generically.  
5. **Punctuation Removal**: Eliminates punctuation to focus on word tokens.  

This modular design allows us to toggle each step during hyperparameter experiments.
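
As a point of comparison for the header-stripping step, headers can also be separated from the body with Python's standard `email` package rather than by splitting at the first blank line. The sketch below is an alternative, not the approach used in this notebook; `extract_body` and the sample message are hypothetical names chosen for illustration.

```python
from email import message_from_string

def extract_body(raw_email):
    """Return the body of a raw email, dropping all headers."""
    msg = message_from_string(raw_email)
    if msg.is_multipart():
        # Keep only the plain-text parts of a multipart message
        parts = [part.get_payload() for part in msg.walk()
                 if part.get_content_type() == 'text/plain']
        return '\n'.join(parts)
    # Single-part message: the payload is the body text as-is
    return msg.get_payload()

sample = "From: sender@example.com\nSubject: Hello\n\nThis is the body.\n"
print(extract_body(sample))
```

A parser-based approach also gives access to individual headers (for example `msg['Subject']`), which could be useful if some header fields should be kept as features.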

In [7]:
import os, re
os.chdir('..')  # assumes the notebook sits one level below the folder that contains `data/`
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.metrics import classification_report, precision_score, recall_score

def load_emails(directory, label):
    """Load every file in `directory` as text and pair it with `label` (0=ham, 1=spam)."""
    emails = []
    for fname in os.listdir(directory):
        path = os.path.join(directory, fname)
        try:
            # errors='ignore' silently drops bytes that are not valid UTF-8
            with open(path, 'r', encoding='utf-8', errors='ignore') as f:
                emails.append((f.read(), label))
        except OSError:
            # Skip entries that cannot be opened (e.g. subdirectories)
            continue
    return emails

class EmailPreprocessor:
    """Flexible email cleaner: strip headers, lowercase, URL/NUM replacement, punctuation removal."""
    def __init__(self, strip_headers=True, lowercase=True,
                 remove_punct=True, replace_urls=True, replace_nums=True):
        self.strip_headers, self.lowercase = strip_headers, lowercase
        self.remove_punct, self.replace_urls = remove_punct, replace_urls
        self.replace_nums = replace_nums

    def preprocess(self, text):
        if self.strip_headers:
            parts = text.split('\n\n', 1)
            text = parts[1] if len(parts) > 1 else parts[0]
        if self.lowercase:
            text = text.lower()
        if self.replace_urls:
            text = re.sub(r'http[s]?://\S+', 'URL', text)
        if self.replace_nums:
            text = re.sub(r'\d+', 'NUMBER', text)
        if self.remove_punct:
            text = re.sub(r'[^\w\s]', ' ', text)
        return ' '.join(text.split())

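To make the effect of each step concrete, the same regular expressions can be applied to a made-up message. This is a condensed, standalone sketch with every option switched on; `clean_email` and the sample text are hypothetical and exist only for this illustration.

```python
import re

def clean_email(text):
    """Apply the five steps in the same order as EmailPreprocessor.preprocess."""
    parts = text.split('\n\n', 1)                    # 1. drop headers
    text = parts[1] if len(parts) > 1 else parts[0]
    text = text.lower()                              # 2. lowercase
    text = re.sub(r'http[s]?://\S+', 'URL', text)    # 3. URLs -> URL
    text = re.sub(r'\d+', 'NUMBER', text)            # 4. digit runs -> NUMBER
    text = re.sub(r'[^\w\s]', ' ', text)             # 5. punctuation -> space
    return ' '.join(text.split())                    # collapse whitespace

sample = "From: a@b.com\nSubject: Hi\n\nWIN $1000 now!!! Visit http://spam.example/win?id=7"
print(clean_email(sample))  # win NUMBER now visit URL
```

Because lowercasing runs before the replacements, the `URL` and `NUMBER` placeholders stay uppercase in the preprocessed string.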

## Data loading
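We load the `hard_ham` and `spam_2` folders, shuffle the combined list with a fixed seed, then separate texts from labels. The resulting dataset is imbalanced: spam outnumbers ham by more than five to one.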

In [8]:
def prepare_data(base_path='data'):
    # Training data: hard ham + spam_2 (easy_ham_2 is left out and reused later as unseen data)
    hard_ham = load_emails(os.path.join(base_path, 'hard_ham'), 0)
    spam = load_emails(os.path.join(base_path, 'spam_2'), 1)
    
    # Combine all data
    data = hard_ham + spam
    np.random.seed(42)
    np.random.shuffle(data)
    
    # Separate into features and labels
    texts, labels = zip(*data)
    return list(texts), np.array(labels)

# Load the combined dataset
texts, labels = prepare_data('data')
print(f"Loaded {len(texts)} emails: {labels.sum()} spam, {len(labels)-labels.sum()} ham.")


Loaded 1648 emails: 1397 spam, 251 ham.


## Preprocessing Pipeline
We apply the `EmailPreprocessor` described above, with every step enabled (the defaults), to all loaded emails.

In [9]:
pre = EmailPreprocessor()
processed_texts = [pre.preprocess(t) for t in texts]


## Feature Extraction
We convert preprocessed text into a numerical representation using a Bag‑of‑Words model via `CountVectorizer`. Key settings include:

- **Vocabulary Size**: Top 1,000 most frequent tokens to balance expressiveness and tractability.  
- **Stop‑Word Filtering**: Excludes common English words (e.g., "the", "and") to focus on informative terms.  

This yields a sparse matrix of token counts for each email, which serves as input to our classifiers.
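
The counting idea behind a bag-of-words model can be sketched in plain Python. This is only an illustration of the concept, not how `CountVectorizer` is implemented: it tokenizes on whitespace, skips stop-word filtering, returns dense lists instead of a sparse matrix, and breaks frequency ties alphabetically (an assumption made here so the output is deterministic). `bag_of_words` and the toy documents are hypothetical.

```python
from collections import Counter

def bag_of_words(docs, max_features):
    """Count tokens per document over the `max_features` most frequent tokens."""
    tokenized = [doc.split() for doc in docs]
    totals = Counter(tok for toks in tokenized for tok in toks)
    # Most frequent first; ties broken alphabetically for a deterministic result
    ranked = sorted(totals, key=lambda tok: (-totals[tok], tok))
    vocab = sorted(ranked[:max_features])
    rows = [[Counter(toks)[tok] for tok in vocab] for toks in tokenized]
    return vocab, rows

docs = ["free money now", "meeting at noon", "free free offer now"]
vocab, rows = bag_of_words(docs, max_features=2)
print(vocab)  # ['free', 'now']
print(rows)   # [[1, 1], [0, 0], [2, 1]]
```

The second document maps to an all-zero row because none of its tokens made it into the two-word vocabulary, which is the kind of information loss a `max_features` cap trades for tractability.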

In [10]:
vectorizer = CountVectorizer(max_features=1000, stop_words='english')
X = vectorizer.fit_transform(processed_texts)
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.3, stratify=labels, random_state=42)
print(f"Train/test sizes: {X_train.shape[0]}/{X_test.shape[0]}")


Train/test sizes: 1153/495
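
Note that the cell above fits the vectorizer on all emails before splitting, so the vocabulary is chosen with knowledge of the test emails. A common way to avoid this is to split the raw texts first and wrap the vectorizer and classifier in a scikit-learn `Pipeline`. The sketch below shows that pattern on a toy corpus; the texts, labels, and variable names are hypothetical, and the numbers reported in this notebook were not produced this way.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy stand-ins for the preprocessed emails and their labels (1 = spam, 0 = ham)
toy_texts = [
    "win free money now", "free prize claim today",
    "cheap pills free shipping", "claim your free money",
    "meeting agenda attached", "lunch tomorrow at noon",
    "project report draft attached", "see you at the meeting",
]
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Split the raw texts first ...
txt_train, txt_test, lab_train, lab_test = train_test_split(
    toy_texts, toy_labels, test_size=0.25, stratify=toy_labels, random_state=42)

# ... then let the pipeline fit the vectorizer on the training texts only
pipe = make_pipeline(
    CountVectorizer(max_features=1000, stop_words='english'),
    MultinomialNB())
pipe.fit(txt_train, lab_train)
preds = pipe.predict(txt_test)

# The learned vocabulary contains training tokens only
vocab = pipe.named_steps['countvectorizer'].vocabulary_
print(sorted(vocab))
```

Calling `pipe.predict` on new texts applies the already-fitted vocabulary, so the same object can be reused for any later batch of emails.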


## Model Training & Evaluation
We train three widely‑used classifiers:

1. **Multinomial Naive Bayes**: A probabilistic approach suited for count data.  
2. **Logistic Regression**: A discriminative model that estimates class probabilities.  
3. **Support Vector Machine (linear kernel)**: A margin‑maximizing classifier that often excels in high‑dimensional text spaces.  

For each model, we report:

- **Precision**: Proportion of predicted spam that is actually spam.  
- **Recall**: Proportion of true spam that is correctly detected.  
- **F1‑Score**: Harmonic mean of precision and recall (via `classification_report`).  

These metrics enable us to compare trade‑offs between false positives and false negatives.
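
The three metrics follow directly from the counts of true positives, false positives, and false negatives for the spam class. The sketch below computes them by hand; the counts are hypothetical and are not taken from the results in this notebook.

```python
def spam_metrics(tp, fp, fn):
    """Precision, recall and F1 for the positive (spam) class."""
    precision = tp / (tp + fp)   # share of predicted spam that is spam
    recall = tp / (tp + fn)      # share of actual spam that was caught
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical counts: 90 spam caught, 10 ham flagged as spam, 5 spam missed
p, r, f1 = spam_metrics(tp=90, fp=10, fn=5)
print(f"P={p:.3f}, R={r:.3f}, F1={f1:.3f}")  # P=0.900, R=0.947, F1=0.923
```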

In [11]:
def train_and_report(X_tr, X_te, y_tr, y_te):
    models = {
        'Naive Bayes': MultinomialNB(),
        'Logistic Regression': LogisticRegression(max_iter=1000),
        'SVM': SVC(kernel='linear', probability=True)
    }
    results = {}
    for name, model in models.items():
        model.fit(X_tr, y_tr)
        y_pred = model.predict(X_te)
        p, r = precision_score(y_te, y_pred), recall_score(y_te, y_pred)
        print(f"--- {name} ---")
        print(f"Precision: {p:.3f}, Recall: {r:.3f}")
        print(classification_report(y_te, y_pred, target_names=['Ham', 'Spam']))
        results[name] = {'model': model, 'precision': p, 'recall': r}
    return results

results = train_and_report(X_train, X_test, y_train, y_test)
best_name = max(results, key=lambda k: results[k]['precision']+results[k]['recall'])
print(f"**Best model:** {best_name}\n")


--- Naive Bayes ---
Precision: 0.955, Recall: 0.905
              precision    recall  f1-score   support

         Ham       0.59      0.76      0.66        75
        Spam       0.95      0.90      0.93       420

    accuracy                           0.88       495
   macro avg       0.77      0.83      0.80       495
weighted avg       0.90      0.88      0.89       495



STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT

Increase the number of iterations to improve the convergence (max_iter=1000).
You might also want to scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


--- Logistic Regression ---
Precision: 0.972, Recall: 0.990
              precision    recall  f1-score   support

         Ham       0.94      0.84      0.89        75
        Spam       0.97      0.99      0.98       420

    accuracy                           0.97       495
   macro avg       0.96      0.92      0.93       495
weighted avg       0.97      0.97      0.97       495

--- SVM ---
Precision: 0.974, Recall: 0.983
              precision    recall  f1-score   support

         Ham       0.90      0.85      0.88        75
        Spam       0.97      0.98      0.98       420

    accuracy                           0.96       495
   macro avg       0.94      0.92      0.93       495
weighted avg       0.96      0.96      0.96       495

**Best model:** Logistic Regression



## Model Selection
We select the model with the highest sum of spam‑class precision and recall on the test split. Here that is Logistic Regression (0.972 + 0.990 = 1.962), narrowly ahead of the SVM (0.974 + 0.983 = 1.957). The selected model is then used for a demonstration on previously unseen data.
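
The selection rule can be checked against the precision and recall values printed above. The sketch below also computes F1 from the same pairs as a cross-check; using F1 as a second criterion is an addition for illustration, not part of the notebook's selection logic, and the `reported` dictionary simply restates the printed figures.

```python
# Spam-class (precision, recall) as printed in the evaluation output
reported = {
    'Naive Bayes': (0.955, 0.905),
    'Logistic Regression': (0.972, 0.990),
    'SVM': (0.974, 0.983),
}

def f1(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)

best_by_sum = max(reported, key=lambda k: sum(reported[k]))
best_by_f1 = max(reported, key=lambda k: f1(*reported[k]))

for name, (p, r) in reported.items():
    print(f"{name}: sum={p + r:.3f}, F1={f1(p, r):.3f}")
print(best_by_sum, best_by_f1)
```

Both criteria pick Logistic Regression for these values, although the margin over the SVM is small.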


## Demonstration on Fresh Data
Using the selected model, we run inference on a separate batch of emails: the `easy_ham_2` and `spam` folders, neither of which was used for training. This mimics real‑world deployment, where the filter encounters new message patterns.


In [12]:
# Load unseen evaluation data (easy_ham_2 + spam)
easy_ham = load_emails('data/easy_ham_2', 0)
extra_spam = load_emails('data/spam', 1)
test_data = easy_ham + extra_spam
np.random.seed(42)
np.random.shuffle(test_data)
test_texts, test_labels = zip(*test_data)

# Preprocess with the same preprocessor settings used for training
test_processed = [pre.preprocess(t) for t in test_texts]

# Vectorize with the already-fitted vectorizer (transform only, no refit)
X_test_real = vectorizer.transform(test_processed)

# Predict using best model
best_model = results[best_name]['model']
y_real_pred = best_model.predict(X_test_real)
y_real_prob = best_model.predict_proba(X_test_real)

# Report performance
print(f"Evaluation on easy_ham_2 + spam folders ({len(test_labels)} emails):")
print(classification_report(test_labels, y_real_pred, target_names=["Ham", "Spam"]))

Evaluation on easy_ham_2 + spam folders (1902 emails):
              precision    recall  f1-score   support

         Ham       0.94      0.20      0.33      1401
        Spam       0.30      0.97      0.46       501

    accuracy                           0.40      1902
   macro avg       0.62      0.58      0.40      1902
weighted avg       0.77      0.40      0.37      1902
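
Accuracy on the fresh emails (0.40) is far below the 0.97 measured on the test split. One factor that may contribute is a shift in class balance: the training data is mostly spam, while the fresh data is mostly ham. The sketch below only does the arithmetic on the counts printed in this notebook; it is not a diagnosis, and the different kind of ham (easy vs. hard) may matter as well.

```python
# Counts taken from the printed outputs in this notebook
train_counts = {'ham': 251, 'spam': 1397}   # "Loaded 1648 emails" line
fresh_counts = {'ham': 1401, 'spam': 501}   # support column of the fresh-data report

def ham_share(counts):
    """Fraction of emails that are ham."""
    return counts['ham'] / (counts['ham'] + counts['spam'])

print(f"Ham share in training data: {ham_share(train_counts):.3f}")  # 0.152
print(f"Ham share in fresh data:    {ham_share(fresh_counts):.3f}")  # 0.737
```

For reference, a trivial filter that labels every fresh email as ham would reach an accuracy equal to the fresh ham share (about 0.74), which is higher than the 0.40 obtained here.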



## Hyperparameter Experiments
To refine our preprocessing choices, we systematically vary key options:

- Toggling header stripping on/off  
- Enabling/disabling lowercasing  
- Removing/preserving punctuation  

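For each configuration, we retrain a Naive Bayes classifier and score it by the sum of spam‑class precision and recall. Each configuration changes exactly one option relative to the all‑enabled baseline (Config 1), so any difference in score can be attributed to that option. URL and number replacement stay enabled throughout.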

In [13]:
def hyperparam_experiment(texts, labels):
    configs = [
        {'strip_headers': True,  'lowercase': True,  'remove_punct': True,  'replace_urls': True,  'replace_nums': True},
        {'strip_headers': False, 'lowercase': True,  'remove_punct': True,  'replace_urls': True,  'replace_nums': True},
        {'strip_headers': True,  'lowercase': False, 'remove_punct': True,  'replace_urls': True,  'replace_nums': True},
        {'strip_headers': True,  'lowercase': True,  'remove_punct': False, 'replace_urls': True,  'replace_nums': True},
    ]
    best, best_cfg = 0, None
    for i, cfg in enumerate(configs, 1):
        pre = EmailPreprocessor(**cfg)
        proc = [pre.preprocess(t) for t in texts]
        X = CountVectorizer(max_features=1000, stop_words='english').fit_transform(proc)
        Xtr, Xte, ytr, yte = train_test_split(X, labels, test_size=0.3,
                                              random_state=42, stratify=labels)
        model = MultinomialNB().fit(Xtr, ytr)
        yp = model.predict(Xte)
        p, r = precision_score(yte, yp), recall_score(yte, yp)
        print(f"Config {i}: P={p:.3f}, R={r:.3f}, Sum={p+r:.3f}")
        if p+r > best:
            best, best_cfg = p+r, cfg
    print(f"\n**Best config:** {best_cfg} (Sum={best:.3f})")

hyperparam_experiment(texts, labels)


Config 1: P=0.955, R=0.905, Sum=1.860
Config 2: P=0.955, R=0.957, Sum=1.912
Config 3: P=0.955, R=0.905, Sum=1.860
Config 4: P=0.955, R=0.905, Sum=1.860

**Best config:** {'strip_headers': False, 'lowercase': True, 'remove_punct': True, 'replace_urls': True, 'replace_nums': True} (Sum=1.912)
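
A possible explanation for configs 3 and 4 matching config 1: by default `CountVectorizer` lowercases its input and tokenizes with the pattern `(?u)\b\w\w+\b`, which already ignores punctuation, so doing these steps beforehand changes little. The sketch below emulates that default tokenization with `re`; it is not scikit-learn's implementation, and `default_tokens` and the sample string are hypothetical.

```python
import re

# Same pattern as CountVectorizer's documented default `token_pattern`
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def default_tokens(text):
    """Lowercase, then keep runs of two or more word characters."""
    return TOKEN_PATTERN.findall(text.lower())

raw = "FREE!!! Click here: win NUMBER dollars, today."
without_punct = re.sub(r"[^\w\s]", " ", raw)   # the remove_punct step
lowered = raw.lower()                          # the lowercase step

print(default_tokens(raw))
print(default_tokens(raw) == default_tokens(without_punct) == default_tokens(lowered))
```

If the effect of these options is to be measured, one option would be to construct the vectorizer with `lowercase=False` and a custom `token_pattern` so that it does not repeat the preprocessing.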


## Conclusions
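- **Best Classifier**: _Logistic Regression_ achieves the highest precision + recall on the held‑out test split (0.972 + 0.990), narrowly ahead of the linear SVM. On the unseen `easy_ham_2` + `spam` folders, however, accuracy falls to 0.40 and ham recall to 0.20, so most legitimate mail is flagged as spam. The model does not yet generalize beyond the folders it was trained on.  
- **Optimal Preprocessing**: For Naive Bayes, keeping the headers (no header stripping) gives the best score (sum 1.912 vs. 1.860), combined with lowercasing, URL/number replacement, and punctuation removal. Disabling lowercasing or punctuation removal left the scores unchanged in this experiment.  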