# DX 704 Week 9 Project

This week's project will build an email spam classifier based on the Enron email data set.
You will perform your own feature extraction and use naive Bayes to estimate the probability that a particular email is spam.
Finally, you will review the tradeoffs from different thresholds for automatically sending emails to the junk folder.

The full project description and a template notebook are available on GitHub: [Project 9 Materials](https://github.com/bu-cds-dx704/dx704-project-09).


## Example Code

You may find it helpful to refer to these GitHub repositories of Jupyter notebooks for example code.

* https://github.com/bu-cds-omds/dx601-examples
* https://github.com/bu-cds-omds/dx602-examples
* https://github.com/bu-cds-omds/dx603-examples
* https://github.com/bu-cds-omds/dx704-examples

Any calculations demonstrated in code examples or videos may be found in these notebooks, and you are allowed to copy this example code in your homework answers.

## Part 1: Download Data Set

We will be using the Enron spam data set as prepared in this GitHub repository.

https://github.com/MWiechmann/enron_spam_data

You may need to download this differently depending on your environment.
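
If `wget` is not available in your environment, one possible alternative is a small standard-library download helper. This is only a sketch: the `download_if_missing` name and its skip-if-present behavior are assumptions made for illustration, and the URL is the one used in the cell below.

```python
import os
import urllib.request

DATA_URL = "https://github.com/MWiechmann/enron_spam_data/raw/refs/heads/master/enron_spam_data.zip"

def download_if_missing(url: str, path: str) -> bool:
    """Download url to path unless path already exists.

    Returns True if a download happened, False if the file was already there.
    (Hypothetical helper, not part of the project template.)
    """
    if os.path.exists(path):
        return False
    urllib.request.urlretrieve(url, path)
    return True

# Usage (requires network access):
# download_if_missing(DATA_URL, "enron_spam_data.zip")
```

Skipping the download when the file is already present also avoids accumulating duplicate copies on repeated runs.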

In [1]:
# Core imports used throughout the notebook
import re
import json
import math
from collections import Counter, defaultdict

import numpy as np
import pandas as pd


In [2]:
!wget https://github.com/MWiechmann/enron_spam_data/raw/refs/heads/master/enron_spam_data.zip

--2025-10-31 22:29:17--  https://github.com/MWiechmann/enron_spam_data/raw/refs/heads/master/enron_spam_data.zip
Resolving github.com (github.com)... 20.207.73.82
Connecting to github.com (github.com)|20.207.73.82|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/MWiechmann/enron_spam_data/refs/heads/master/enron_spam_data.zip [following]
--2025-10-31 22:29:18--  https://raw.githubusercontent.com/MWiechmann/enron_spam_data/refs/heads/master/enron_spam_data.zip
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.111.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 15642124 (15M) [application/zip]
Saving to: ‘enron_spam_data.zip.2’


2025-10-31 22:29:19 (23.7 MB/s) - ‘enron_spam_data.zip.2’ saved [15642124/15642124]



In [3]:
import pandas as pd

In [4]:
# pandas can read the zip file directly
enron_spam_data = pd.read_csv("enron_spam_data.zip")
enron_spam_data

Unnamed: 0,Message ID,Subject,Message,Spam/Ham,Date
0,0,christmas tree farm pictures,,ham,1999-12-10
1,1,"vastar resources , inc .","gary , production from the high island larger ...",ham,1999-12-13
2,2,calpine daily gas nomination,- calpine daily gas nomination 1 . doc,ham,1999-12-14
3,3,re : issue,fyi - see note below - already done .\nstella\...,ham,1999-12-14
4,4,meter 7268 nov allocation,fyi .\n- - - - - - - - - - - - - - - - - - - -...,ham,1999-12-14
...,...,...,...,...,...
33711,33711,= ? iso - 8859 - 1 ? q ? good _ news _ c = eda...,"hello , welcome to gigapharm onlinne shop .\np...",spam,2005-07-29
33712,33712,all prescript medicines are on special . to be...,i got it earlier than expected and it was wrap...,spam,2005-07-29
33713,33713,the next generation online pharmacy .,are you ready to rock on ? let the man in you ...,spam,2005-07-30
33714,33714,bloow in 5 - 10 times the time,learn how to last 5 - 10 times longer in\nbed ...,spam,2005-07-30


In [5]:
(enron_spam_data["Spam/Ham"] == "spam").mean()

np.float64(0.5092834262664611)

## Part 2: Design a Feature Extractor

Design a feature extractor for this data set and write out two files of features based on the text.
Don't forget that both the Subject and Message columns are relevant sources of text data.
For each email, you should count the number of repetitions of each feature present.
The auto-grader will assume that you are using a multinomial distribution in the following problems.
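
As a minimal sketch of what counting repetitions could look like, the snippet below tokenizes a made-up subject and message and counts each token. The `toy_features` helper and its regular expression are illustrative assumptions, not a required design.

```python
import re
from collections import Counter

def toy_features(subject: str, message: str) -> Counter:
    """Count lowercase alphanumeric tokens across both text fields (illustrative only)."""
    text = f"{subject} {message}".lower()
    return Counter(re.findall(r"[a-z0-9]+", text))

feats = toy_features("Free offer", "Claim your free offer now, free of charge")
print(feats["free"], feats["offer"], feats["now"])  # 3 2 1
```

Because each value is a repetition count, a dictionary like this fits the multinomial assumption directly.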

In [6]:
# YOUR CHANGES HERE
import pandas as pd
import re, json
from collections import Counter

# Load dataset (as in Part 1)
# !wget https://github.com/MWiechmann/enron_spam_data/raw/refs/heads/master/enron_spam_data.zip
df = pd.read_csv("enron_spam_data.zip")  # columns: Message ID, Subject, Message, Spam/Ham, Date

# ---------- Tokenization & Normalization ----------

def normalize_text(s: str) -> str:
    if not isinstance(s, str):
        return ""
    s = s.lower()
    # canonicalize urls and numbers to reduce vocab explosion
    s = re.sub(r"https?://\S+|www\.\S+", " <url> ", s)
    s = re.sub(r"\d+", " <num> ", s)
    return s

def tokenize(s: str):
    # keep <url> and <num> as tokens, and capture alnum/underscore tokens
    return re.findall(r"<url>|<num>|[a-z0-9_]+", s)

# ---------- Feature Extractor ----------
# Multinomial-friendly integer counts; subject lightly up-weighted

def extract_features(subject: str, message: str) -> Counter:
    subj = normalize_text(subject)
    body = normalize_text(message)

    toks_subj = tokenize(subj)
    toks_body = tokenize(body)

    # upweight subject: duplicate once (simple integer upweight)
    toks = toks_body + toks_subj + toks_subj

    feats = Counter()

    # Unigrams (prefix to avoid collisions)
    for t in toks:
        feats[f"uni::{t}"] += 1

    # Light bigrams (only from body to keep vocab modest)
    for i in range(len(toks_body) - 1):
        b = f"{toks_body[i]} {toks_body[i+1]}"
        feats[f"bi::{b}"] += 1

    # Structural hints often correlated with spam
    num_urls = sum(1 for t in toks if t == "<url>")
    feats["meta::num_urls"] = num_urls
    feats["meta::has_url"] = 1 if num_urls > 0 else 0
    feats["meta::len"] = max(1, len(toks_body))          # length proxy (int)
    feats["meta::subject_len"] = max(1, len(toks_subj))  # subject length (int)

    return feats

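# ---------- Train/Test Split ----------
# Test if Message ID % 30 == 0, else Train

import zlib

def is_test_row(message_id) -> bool:
    try:
        return int(message_id) % 30 == 0
    except (TypeError, ValueError):
        # fallback (rare) if non-integer IDs appear; crc32 is stable across runs,
        # unlike built-in hash() on strings, which is randomized per process
        return (zlib.crc32(str(message_id).encode("utf-8")) % 30) == 0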

rows = []
for _, r in df.iterrows():
    feats = extract_features(r.get("Subject", ""), r.get("Message", ""))
    rows.append({
        "Message ID": r["Message ID"],
        # compact JSON; values are integers
        "features_json": json.dumps({k:int(v) for k,v in feats.items()}, separators=(',', ':'))
    })

feat_df = pd.DataFrame(rows)
train_out = feat_df[~feat_df["Message ID"].apply(is_test_row)][["Message ID", "features_json"]]
test_out  = feat_df[ feat_df["Message ID"].apply(is_test_row)][["Message ID", "features_json"]]

train_out.to_csv("train-features.tsv", sep="\t", index=False)
test_out.to_csv("test-features.tsv",  sep="\t", index=False)

print("Saved train-features.tsv", train_out.shape)
print("Saved test-features.tsv",  test_out.shape)



Saved train-features.tsv (32592, 2)
Saved test-features.tsv (1124, 2)


Assign a row to the test data set if `Message ID % 30 == 0` and assign it to the training data set otherwise.
Write two files, "train-features.tsv" and "test-features.tsv" with two columns, Message ID and features_json.
The features_json column should contain a JSON dictionary where the keys are your feature names and the values are integer feature values.
This will give us a sparse feature representation.
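
The split rule and the sparse JSON encoding can be sketched on a single made-up row as follows. The `split_name` helper and the example feature counts are hypothetical and shown only to illustrate the format.

```python
import json

def split_name(message_id: int) -> str:
    """Return which file a row belongs to under the Message ID % 30 rule."""
    return "test" if message_id % 30 == 0 else "train"

# Hypothetical feature counts for one email; only features that occur are stored.
features = {"free": 3, "offer": 2}
features_json = json.dumps(features, sort_keys=True)

print(split_name(60), split_name(61))  # test train
print(features_json)                   # {"free": 3, "offer": 2}
```

Storing only the features that appear keeps each row small even when the full vocabulary is very large.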


In [7]:
# YOUR CHANGES HERE

import pandas as pd, json, numpy as np

def load_tsv(path):
    df = pd.read_csv(path, sep="\t", dtype={"Message ID": object, "features_json": str})
    assert list(df.columns) == ["Message ID","features_json"], f"Bad columns in {path}: {df.columns}"
    return df

train = load_tsv("train-features.tsv")
test  = load_tsv("test-features.tsv")

# 1) Message ID must be convertible to int and split rule must hold
def to_int_mid(x):
    try:
        return int(x)
    except (TypeError, ValueError) as exc:
        raise AssertionError(f"Message ID not int-like: {x!r}") from exc

train["MID"] = train["Message ID"].map(to_int_mid)
test["MID"]  = test["Message ID"].map(to_int_mid)

assert ((test["MID"] % 30) == 0).all(), "Some test rows are not MID % 30 == 0"
assert ((train["MID"] % 30) != 0).all(), "Some train rows violate MID % 30 != 0"

# 2) features_json must be valid JSON dict with integer values
def check_json(row):
    d = json.loads(row)
    assert isinstance(d, dict), "features_json must be a JSON object"
    for k,v in d.items():
        assert isinstance(k, str), "feature names must be strings"
        assert isinstance(v, (int, np.integer)), f"value for {k} must be integer, got {type(v)}"
    return len(d)

train_nonzero = train["features_json"].map(check_json)
test_nonzero  = test["features_json"].map(check_json)

# 3) No empty dicts (every email should yield at least one feature)
assert (train_nonzero > 0).all(), "Empty feature dict in train"
assert (test_nonzero  > 0).all(), "Empty feature dict in test"

# 4) Quick sparsity + sanity peek
print("Train rows:", len(train), "Test rows:", len(test))
print("Avg features/email — train:", round(train_nonzero.mean(),2), "test:", round(test_nonzero.mean(),2))

# 5) Optional: spot-check a couple of JSONs
print("Sample train JSON:", train.loc[train.index[0], "features_json"][:200], "...")
print("Sample test  JSON:",  test.loc[test.index[0],  "features_json"][:200], "...")
print("✅ Part 2 format & split look good.")


Train rows: 32592 Test rows: 1124
Avg features/email — train: 310.4 test: 291.44
Sample train JSON: {"uni::gary":2,"uni::production":3,"uni::from":3,"uni::the":8,"uni::high":2,"uni::island":2,"uni::larger":2,"uni::block":2,"uni::a":6,"uni::<num>":149,"uni::commenced":1,"uni::on":7,"uni::saturday":1, ...
Sample test  JSON: {"uni::christmas":2,"uni::tree":2,"uni::farm":2,"uni::pictures":2,"meta::num_urls":0,"meta::has_url":0,"meta::len":1,"meta::subject_len":4} ...
✅ Part 2 format & split look good.


Submit "train-features.tsv" and "test-features.tsv" in Gradescope.

Hint: these features will be graded based on the test accuracy of a logistic regression based on the training features.
This is to make sure that your feature set is not degenerate; you do not need to compute this regression yourself.
You can separately assess your feature quality based on your results in part 6.

## Part 3: Compute Conditional Probabilities

Based on your training data, compute appropriate conditional probabilities for use with naïve Bayes.
Use additive smoothing with $\alpha=1$ to avoid zeros.
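
For a multinomial model, additive smoothing is commonly written as $P(f \mid c) = \frac{\text{count}(f, c) + \alpha}{N_c + \alpha V}$, where $N_c$ is the total feature count in class $c$ and $V$ is the vocabulary size. The sketch below applies that formula to made-up counts; the `smoothed_probs` helper and the toy vocabulary are assumptions for illustration.

```python
from collections import Counter

def smoothed_probs(counts: Counter, vocab: set, alpha: float = 1.0) -> dict:
    """Additively smoothed multinomial probabilities for one class (illustrative only)."""
    total = sum(counts[f] for f in vocab)   # N_c: total feature count in this class
    denom = total + alpha * len(vocab)      # N_c + alpha * V
    return {f: (counts[f] + alpha) / denom for f in vocab}

vocab = {"free", "meeting", "offer"}
spam_counts = Counter({"free": 3, "offer": 1})  # "meeting" never seen in spam
probs = smoothed_probs(spam_counts, vocab)

print(probs["meeting"])  # nonzero (1/7) even though the raw count is 0
```

The smoothed probabilities for a class still sum to 1 over the vocabulary, which is a quick check worth running on your own table.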


In [8]:
# YOUR CHANGES HERE

import pandas as pd, json, numpy as np
from collections import defaultdict, Counter

ALPHA = 1.0  # Laplace smoothing α=1

# 1) Load features (train only) and labels
train_feat = pd.read_csv("train-features.tsv", sep="\t", dtype={"Message ID": int, "features_json": str})
raw = pd.read_csv("enron_spam_data.zip")  # from Part 1
labels = raw.loc[:, ["Message ID", "Spam/Ham"]].drop_duplicates()

train = train_feat.merge(labels, on="Message ID", how="left")
assert train["Spam/Ham"].notna().all(), "Some training Message IDs missing labels—double-check IDs & dataset."

# 2) Build per-class token counts (multinomial)
#    We'll accumulate counts for each feature, separately for ham and spam
class_feature_counts = {"ham": Counter(), "spam": Counter()}
class_token_totals  = {"ham": 0, "spam": 0}
vocab = set()

for _, row in train.iterrows():
    y = row["Spam/Ham"].strip().lower()  # 'ham' or 'spam'
    feats = json.loads(row["features_json"])
    # ensure integer counts
    feats = {str(k): int(v) for k, v in feats.items() if int(v) != 0}
    vocab.update(feats.keys())
    class_feature_counts[y].update(feats)
    class_token_totals[y] += sum(feats.values())

V = len(vocab)
N_ham  = class_token_totals["ham"]
N_spam = class_token_totals["spam"]

print(f"Vocab size: {V:,} | Ham tokens: {N_ham:,} | Spam tokens: {N_spam:,}")

# 3) Compute smoothed conditional probabilities for every feature in vocab
#    P(f|class) = (count(f,class) + α) / (total_tokens_class + α * V)
den_ham  = N_ham  + ALPHA * V
den_spam = N_spam + ALPHA * V

records = []
get_hc = class_feature_counts["ham"].get
get_sc = class_feature_counts["spam"].get

for f in vocab:
    c_ham  = get_hc(f, 0)
    c_spam = get_sc(f, 0)
    p_ham  = (c_ham  + ALPHA) / den_ham
    p_spam = (c_spam + ALPHA) / den_spam
    records.append((f, p_ham, p_spam))

feature_probs = pd.DataFrame(records, columns=["feature","ham_probability","spam_probability"])\
                    .sort_values("feature").reset_index(drop=True)

# 4) Save file
feature_probs.to_csv("feature-probabilities.tsv", sep="\t", index=False)
feature_probs.head(10)


Vocab size: 1,443,095 | Ham tokens: 13,266,275 | Spam tokens: 10,514,414


Unnamed: 0,feature,ham_probability,spam_probability
0,bi::<num> <num>,0.01042254,0.005332549
1,bi::<num> _,2.875718e-05,4.373821e-05
2,bi::<num> a,0.0001035394,0.0001933513
3,bi::<num> aa,6.798388e-07,6.523098e-06
4,bi::<num> aaa,2.719355e-07,8.362946e-08
5,bi::<num> aabda,1.359678e-07,8.362946e-08
6,bi::<num> aads,6.798388e-08,3.345178e-07
7,bi::<num> aakkyl,6.798388e-08,1.672589e-07
8,bi::<num> aalman,6.798388e-08,1.672589e-07
9,bi::<num> aambique,6.798388e-08,1.672589e-07


Save the conditional probabilities in a file "feature-probabilities.tsv" with columns feature, ham_probability and spam_probability.

In [9]:
# YOUR CHANGES HERE

# feature_probs must have these exact columns:
# ["feature", "ham_probability", "spam_probability"]

assert list(feature_probs.columns) == ["feature","ham_probability","spam_probability"]

# Save as TSV (no index)
feature_probs.to_csv("feature-probabilities.tsv", sep="\t", index=False)

# Quick sanity check
check = pd.read_csv("feature-probabilities.tsv", sep="\t")
print(check.shape, check.columns.tolist())
print(check.head(5))


(1443095, 3) ['feature', 'ham_probability', 'spam_probability']
           feature  ham_probability  spam_probability
0  bi::<num> <num>     1.042254e-02      5.332549e-03
1      bi::<num> _     2.875718e-05      4.373821e-05
2      bi::<num> a     1.035394e-04      1.933513e-04
3     bi::<num> aa     6.798388e-07      6.523098e-06
4    bi::<num> aaa     2.719355e-07      8.362946e-08


Submit "feature-probabilities.tsv" in Gradescope.

## Part 4: Implement a Naïve Bayes Classifier

Implement a naïve Bayes classifier based on your previous feature probabilities.
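
One way to turn feature probabilities into a class posterior is to sum log probabilities and normalize with the log-sum-exp trick, which avoids underflow when an email has many features. The sketch below uses made-up priors and likelihoods; the `posterior_spam` helper is hypothetical and assumes every feature in the email appears in the probability table.

```python
import math

def posterior_spam(features: dict, log_prior: dict, log_likelihood: dict) -> float:
    """Return P(spam | features) under a two-class multinomial naive Bayes model."""
    scores = {}
    for label in ("ham", "spam"):
        score = log_prior[label]
        for feature, count in features.items():
            # Each repetition of a feature contributes one log-likelihood term.
            score += count * log_likelihood[label][feature]
        scores[label] = score
    # Subtract the maximum before exponentiating so the largest term is exp(0) = 1.
    m = max(scores.values())
    exp_scores = {label: math.exp(s - m) for label, s in scores.items()}
    return exp_scores["spam"] / sum(exp_scores.values())

# Made-up numbers for illustration only.
log_prior = {"ham": math.log(0.5), "spam": math.log(0.5)}
log_likelihood = {
    "ham": {"free": math.log(0.1), "meeting": math.log(0.4)},
    "spam": {"free": math.log(0.4), "meeting": math.log(0.1)},
}
p = posterior_spam({"free": 2}, log_prior, log_likelihood)
print(round(p, 4))  # 0.9412
```

Here the unnormalized scores are $0.5 \cdot 0.4^2 = 0.08$ for spam and $0.5 \cdot 0.1^2 = 0.005$ for ham, giving $0.08 / 0.085 \approx 0.9412$.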

In [10]:
# YOUR CHANGES HERE

# === Part 4: Multinomial Naïve Bayes classifier for TRAIN set ===
import json, math
import pandas as pd
from collections import defaultdict

# 1) Load inputs
train_feats = pd.read_csv("train-features.tsv", sep="\t")          # [Message ID, features_json]
feat_probs  = pd.read_csv("feature-probabilities.tsv", sep="\t")   # [feature, ham_probability, spam_probability]

# We need labels to get class priors from TRAIN split only
enron = pd.read_csv("enron_spam_data.zip")
labels = enron[["Message ID","Spam/Ham"]].copy()
labels["is_spam"] = (labels["Spam/Ham"].str.lower() == "spam").astype(int)
labels_train = labels[labels["Message ID"] % 30 != 0].copy()

# 2) Class priors P(class)
n_spam = labels_train["is_spam"].sum()
n_ham  = len(labels_train) - n_spam
p_spam = n_spam / len(labels_train)
p_ham  = n_ham  / len(labels_train)

log_p_spam = math.log(p_spam) if p_spam > 0 else -1e12
log_p_ham  = math.log(p_ham)  if p_ham  > 0 else -1e12

print(f"Priors — P(spam)={p_spam:.4f}, P(ham)={p_ham:.4f}")

# 3) Fast lookups for P(f|class); use safe floors for truly unseen keys
p_f_ham  = dict(zip(feat_probs["feature"], feat_probs["ham_probability"]))
p_f_spam = dict(zip(feat_probs["feature"], feat_probs["spam_probability"]))

# Floors: small but nonzero (your Part 3 used Laplace; these only catch features that somehow
# appear in features_json but not in the saved table due to edge cases)
floor_ham  = max(min((v for v in p_f_ham.values()  if v > 0), default=1e-12) * 0.1, 1e-12)
floor_spam = max(min((v for v in p_f_spam.values() if v > 0), default=1e-12) * 0.1, 1e-12)

log_p_f_ham  = defaultdict(lambda: math.log(floor_ham),
                           {f: math.log(max(p, 1e-300)) for f,p in p_f_ham.items()})
log_p_f_spam = defaultdict(lambda: math.log(floor_spam),
                           {f: math.log(max(p, 1e-300)) for f,p in p_f_spam.items()})

# 4) Scoring: log P(class) + sum_f count(f) * log P(f|class), then normalize
def score_features_json(features_json: str):
    feats = json.loads(features_json)
    ll_spam = log_p_spam
    ll_ham  = log_p_ham
    for f, c in feats.items():
        if not c:
            continue
        ll_spam += c * log_p_f_spam[f]
        ll_ham  += c * log_p_f_ham[f]
    # log-sum-exp to get probabilities
    m = max(ll_spam, ll_ham)
    ex_spam = math.exp(ll_spam - m)
    ex_ham  = math.exp(ll_ham  - m)
    denom = ex_spam + ex_ham
    return ex_ham/denom, ex_spam/denom  # (ham, spam)

# 5) Score every TRAIN row and save
pred_rows = []
for mid, fj in zip(train_feats["Message ID"], train_feats["features_json"]):
    p_ham_i, p_spam_i = score_features_json(fj)
    pred_rows.append((mid, p_ham_i, p_spam_i))

train_preds = pd.DataFrame(pred_rows, columns=["Message ID","ham","spam"])
train_preds.to_csv("train-predictions.tsv", sep="\t", index=False)

print("Saved train-predictions.tsv", train_preds.shape)
display(train_preds.head())


Priors — P(spam)=0.5093, P(ham)=0.4907
Saved train-predictions.tsv (32592, 3)


Unnamed: 0,Message ID,ham,spam
0,1,1.0,0.0
1,2,1.0,3.066832e-26
2,3,1.0,0.0
3,4,1.0,0.0
4,5,1.0,1.731193e-98


Save your prediction probabilities to "train-predictions.tsv" with columns Message ID, ham and spam.

In [11]:
# YOUR CHANGES HERE

...

Ellipsis

Submit "train-predictions.tsv" in Gradescope.

## Part 5: Predict Spam Probability for Test Data

Use your previous classifier to predict spam probability for the test data.

In [12]:
# YOUR CHANGES HERE

# === Part 5: Predict spam probability for TEST set ===
import json, math
import pandas as pd
from collections import defaultdict

# 1) Load inputs
test_feats  = pd.read_csv("test-features.tsv", sep="\t")             # [Message ID, features_json]
feat_probs  = pd.read_csv("feature-probabilities.tsv", sep="\t")     # [feature, ham_probability, spam_probability]

# Priors from TRAIN split only (same as Part 4 to keep consistency)
enron = pd.read_csv("enron_spam_data.zip")
labels = enron[["Message ID","Spam/Ham"]].copy()
labels["is_spam"] = (labels["Spam/Ham"].str.lower() == "spam").astype(int)
labels_train = labels[labels["Message ID"] % 30 != 0].copy()

n_spam = labels_train["is_spam"].sum()
n_ham  = len(labels_train) - n_spam
p_spam = n_spam / len(labels_train)
p_ham  = n_ham  / len(labels_train)

log_p_spam = math.log(p_spam) if p_spam > 0 else -1e12
log_p_ham  = math.log(p_ham)  if p_ham  > 0 else -1e12

# 2) Fast lookups for P(f|class) from Part 3 output
p_f_ham  = dict(zip(feat_probs["feature"], feat_probs["ham_probability"]))
p_f_spam = dict(zip(feat_probs["feature"], feat_probs["spam_probability"]))

# Floors for any rare/unseen features
floor_ham  = max(min((v for v in p_f_ham.values()  if v > 0), default=1e-12) * 0.1, 1e-12)
floor_spam = max(min((v for v in p_f_spam.values() if v > 0), default=1e-12) * 0.1, 1e-12)

log_p_f_ham  = defaultdict(lambda: math.log(floor_ham),
                           {f: math.log(max(p, 1e-300)) for f,p in p_f_ham.items()})
log_p_f_spam = defaultdict(lambda: math.log(floor_spam),
                           {f: math.log(max(p, 1e-300)) for f,p in p_f_spam.items()})

def score_features_json(features_json: str):
    feats = json.loads(features_json)
    ll_spam = log_p_spam
    ll_ham  = log_p_ham
    for f, c in feats.items():
        if not c:
            continue
        ll_spam += c * log_p_f_spam[f]
        ll_ham  += c * log_p_f_ham[f]
    m = max(ll_spam, ll_ham)
    ex_spam = math.exp(ll_spam - m)
    ex_ham  = math.exp(ll_ham  - m)
    denom = ex_spam + ex_ham
    return ex_ham/denom, ex_spam/denom  # (ham, spam)

# 3) Score TEST and save
pred_rows = []
for mid, fj in zip(test_feats["Message ID"], test_feats["features_json"]):
    p_ham_i, p_spam_i = score_features_json(fj)
    pred_rows.append((mid, p_ham_i, p_spam_i))

test_preds = pd.DataFrame(pred_rows, columns=["Message ID","ham","spam"])
test_preds.to_csv("test-predictions.tsv", sep="\t", index=False)

print("Saved test-predictions.tsv", test_preds.shape)
display(test_preds.head())


Saved test-predictions.tsv (1124, 3)


Unnamed: 0,Message ID,ham,spam
0,0,0.000905,0.9990951
1,30,1.0,6.142316000000001e-222
2,60,1.0,1.771037e-26
3,90,1.0,3.4647520000000004e-69
4,120,1.0,0.0


Save your prediction probabilities in "test-predictions.tsv" with the same columns as "train-predictions.tsv".

In [13]:
# YOUR CHANGES HERE

...

Ellipsis

Submit "test-predictions.tsv" in Gradescope.

## Part 6: Construct ROC Curve

For every probability threshold from 0.01 to 0.99 in increments of 0.01, compute the false and true positive rates from the test data using the spam class for positives.
That is, if the predicted spam probability is greater than or equal to the threshold, predict spam.
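
To make the thresholding rule concrete, the sketch below computes one (false positive rate, true positive rate) pair on four made-up predictions. The `roc_point` helper and the toy data are assumptions for illustration, with 1 meaning spam and 0 meaning ham.

```python
def roc_point(y_true, p_spam, threshold):
    """Return (FPR, TPR) when predicting spam for probabilities >= threshold."""
    tp = sum(1 for y, p in zip(y_true, p_spam) if p >= threshold and y == 1)
    fp = sum(1 for y, p in zip(y_true, p_spam) if p >= threshold and y == 0)
    pos = sum(y_true)            # number of actual spam emails
    neg = len(y_true) - pos      # number of actual ham emails
    fpr = fp / neg if neg else 0.0
    tpr = tp / pos if pos else 0.0
    return fpr, tpr

y_true = [1, 1, 0, 0]             # toy labels
p_spam = [0.9, 0.4, 0.6, 0.1]     # toy predicted spam probabilities
thresholds = [round(0.01 * k, 2) for k in range(1, 100)]  # 0.01, 0.02, ..., 0.99

print(roc_point(y_true, p_spam, 0.5))  # (0.5, 0.5)
```

Raising the threshold can only move emails from predicted spam to predicted ham, so both rates are non-increasing as the threshold grows.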

In [14]:
# YOUR CHANGES HERE

# === Part 6: ROC curve from test predictions ===
import numpy as np
import pandas as pd

# 1) Load test labels and predictions
enron = pd.read_csv("enron_spam_data.zip")  # has "Spam/Ham" and "Message ID"
test_labels = (
    enron[enron["Message ID"] % 30 == 0]
    .loc[:, ["Message ID", "Spam/Ham"]]
    .assign(is_spam=lambda df: (df["Spam/Ham"].str.lower() == "spam").astype(int))
)

test_preds = pd.read_csv("test-predictions.tsv", sep="\t")  # Message ID, ham, spam

# 2) Merge to align rows
df = test_labels.merge(test_preds, on="Message ID", how="inner")

y_true = df["is_spam"].to_numpy()
p_spam = df["spam"].to_numpy()

# 3) Compute FPR/TPR for thresholds 0.01..0.99
rows = []
thresholds = np.round(np.arange(0.01, 1.00, 0.01), 2)

P = (y_true == 1).sum()  # positives (spam)
N = (y_true == 0).sum()  # negatives (ham)

for t in thresholds:
    y_hat = (p_spam >= t).astype(int)
    TP = int(((y_hat == 1) & (y_true == 1)).sum())
    FP = int(((y_hat == 1) & (y_true == 0)).sum())
    TN = int(((y_hat == 0) & (y_true == 0)).sum())
    FN = int(((y_hat == 0) & (y_true == 1)).sum())

    tpr = TP / P if P else 0.0  # True Positive Rate (Recall)
    fpr = FP / N if N else 0.0  # False Positive Rate

    rows.append((float(t), fpr, tpr))

roc = pd.DataFrame(rows, columns=["threshold", "false_positive_rate", "true_positive_rate"])

# 4) Save
roc.to_csv("roc.tsv", sep="\t", index=False)
print("Saved roc.tsv", roc.shape)
roc.head()


Saved roc.tsv (99, 3)


Unnamed: 0,threshold,false_positive_rate,true_positive_rate
0,0.01,0.012681,0.996503
1,0.02,0.012681,0.994755
2,0.03,0.012681,0.994755
3,0.04,0.012681,0.994755
4,0.05,0.012681,0.994755


Save this data in a file "roc.tsv" with columns threshold, false_positive_rate and true_positive_rate.

In [15]:
# YOUR CHANGES HERE

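# Optional sanity check (trapezoidal AUC).
# Caveat: these 99 points omit the (0, 0) and (1, 1) endpoints and are ordered by
# increasing threshold (non-increasing FPR), so this integral covers only the FPR
# range actually sampled and is not the full area under the ROC curve.
auc = np.trapz(roc["true_positive_rate"].to_numpy(), roc["false_positive_rate"].to_numpy())
print("Approx AUC (trapezoid over FPR–TPR):", round(float(auc), 4))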


Approx AUC (trapezoid over FPR–TPR): 0.0




Submit "roc.tsv" in Gradescope.
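
If you want an area estimate from the sampled ROC points, one option is to add the (0, 0) and (1, 1) endpoints and sort by false positive rate before applying the trapezoid rule. The sketch below does this in plain Python on made-up points; the `auc_with_endpoints` helper is hypothetical and not part of the graded output.

```python
def auc_with_endpoints(fpr, tpr):
    """Trapezoidal area under ROC points after adding (0, 0) and (1, 1)."""
    # Deduplicate, add the two endpoints, and sort by increasing FPR (then TPR).
    points = sorted(set(zip(fpr, tpr)) | {(0.0, 0.0), (1.0, 1.0)})
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

# Toy ROC points, listed in order of increasing threshold (so FPR is decreasing).
toy_fpr = [0.5, 0.0]
toy_tpr = [1.0, 0.5]
print(auc_with_endpoints(toy_fpr, toy_tpr))  # 0.875
```

Without the endpoints, the trapezoid rule only measures the area between the smallest and largest sampled false positive rates, which can be close to zero when those rates barely change across thresholds.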

## Part 7: Signup for Gemini API Key

Create a free Gemini API key at https://aistudio.google.com/app/api-keys.
You will need to do this with a personal Google account - it will not work with your BU Google account.
This will not incur any charges unless you configure billing information for the key.

You will be asked to start a Gemini free trial for week 11.
This will not incur any charges unless you exceed expected usage by an order of magnitude.


No submission needed.

## Part 8: Code

Please submit a Jupyter notebook that can reproduce all your calculations and recreate the previously submitted files.
You do not need to provide code for data collection if you did that manually.

In [16]:
# === Week 9: Rebuild all artifacts for Gradescope ===
import re, json, math
from collections import Counter, defaultdict
import numpy as np
import pandas as pd

# -------- Part 1: Data --------
!wget -q -O enron_spam_data.zip https://github.com/MWiechmann/enron_spam_data/raw/refs/heads/master/enron_spam_data.zip
df_raw = pd.read_csv("enron_spam_data.zip")  # has columns: Message ID, Subject, Message, Spam/Ham, Date

# -------- Helpers --------
def normalize_text(s: str) -> str:
    if not isinstance(s, str):
        return ""
    s = s.lower()
    s = re.sub(r"https?://\S+|www\.\S+", " <url> ", s)
    s = re.sub(r"\d+", " <num> ", s)
    return s

def tokenize(s: str):
    return re.findall(r"<url>|<num>|[a-z0-9_]+", s)

def extract_features(subject: str, message: str) -> Counter:
    subj = normalize_text(subject)
    body = normalize_text(message)
    toks_subj = tokenize(subj)
    toks_body = tokenize(body)
    toks = toks_body + toks_subj + toks_subj  # upweight subject
    feats = Counter()
    for t in toks:
        feats[f"uni::{t}"] += 1
    for i in range(len(toks_body) - 1):
        feats[f"bi::{toks_body[i]} {toks_body[i+1]}"] += 1
    num_urls = sum(1 for t in toks if t == "<url>")
    feats["meta::num_urls"] = num_urls
    feats["meta::has_url"] = 1 if num_urls > 0 else 0
    feats["meta::len"] = max(1, len(toks_body))
    feats["meta::subject_len"] = max(1, len(toks_subj))
    return feats

def is_test(message_id) -> bool:
    return int(message_id) % 30 == 0

# -------- Part 2: train/test features --------
rows = []
for _, r in df_raw.iterrows():
    feats = extract_features(r.get("Subject",""), r.get("Message",""))
    rows.append({
        "Message ID": int(r["Message ID"]),
        "features_json": json.dumps({k:int(v) for k,v in feats.items()}, separators=(',',':'))
    })
feat_df = pd.DataFrame(rows, columns=["Message ID","features_json"])
train_df = feat_df[~feat_df["Message ID"].apply(is_test)]
test_df  = feat_df[ feat_df["Message ID"].apply(is_test)]
train_df.to_csv("train-features.tsv", sep="\t", index=False)
test_df.to_csv("test-features.tsv",  sep="\t", index=False)

# -------- Part 3: conditional probabilities (Laplace α=1) --------
train_labeled = train_df.merge(df_raw[["Message ID","Spam/Ham"]], on="Message ID", how="left")
ALPHA = 1.0
counts = {"ham": Counter(), "spam": Counter()}
totals = {"ham": 0, "spam": 0}
vocab = set()
for _, row in train_labeled.iterrows():
    y = row["Spam/Ham"].strip().lower()
    feats = json.loads(row["features_json"])
    feats = {str(k): int(v) for k, v in feats.items() if int(v) != 0}
    vocab.update(feats.keys())
    counts[y].update(feats)
    totals[y] += sum(feats.values())
V = len(vocab)
den_ham  = totals["ham"]  + ALPHA * V
den_spam = totals["spam"] + ALPHA * V

records = []
get_hc = counts["ham"].get
get_sc = counts["spam"].get
for f in vocab:
    p_ham  = (get_hc(f,0) + ALPHA) / den_ham
    p_spam = (get_sc(f,0) + ALPHA) / den_spam
    records.append((f, p_ham, p_spam))
feature_probs = pd.DataFrame(records, columns=["feature","ham_probability","spam_probability"]).sort_values("feature")
feature_probs.to_csv("feature-probabilities.tsv", sep="\t", index=False)

# -------- Parts 4 & 5: Naive Bayes predictions (train & test) --------
labels = df_raw[["Message ID","Spam/Ham"]].copy()
labels["is_spam"] = (labels["Spam/Ham"].str.lower() == "spam").astype(int)
labels_train = labels[labels["Message ID"] % 30 != 0]
p_spam = labels_train["is_spam"].mean()
p_ham  = 1 - p_spam
log_p_spam = math.log(max(p_spam, 1e-300))
log_p_ham  = math.log(max(p_ham,  1e-300))

p_f_ham  = dict(zip(feature_probs["feature"], feature_probs["ham_probability"]))
p_f_spam = dict(zip(feature_probs["feature"], feature_probs["spam_probability"]))
floor_ham  = max(min((v for v in p_f_ham.values()  if v>0), default=1e-12)*0.1, 1e-12)
floor_spam = max(min((v for v in p_f_spam.values() if v>0), default=1e-12)*0.1, 1e-12)
log_p_f_ham  = defaultdict(lambda: math.log(floor_ham),  {f: math.log(max(p,1e-300)) for f,p in p_f_ham.items()})
log_p_f_spam = defaultdict(lambda: math.log(floor_spam), {f: math.log(max(p,1e-300)) for f,p in p_f_spam.items()})

def score(features_json: str):
    feats = json.loads(features_json)
    ll_s = log_p_spam
    ll_h = log_p_ham
    for f, c in feats.items():
        if c:
            ll_s += c * log_p_f_spam[f]
            ll_h += c * log_p_f_ham[f]
    m = max(ll_s, ll_h)
    e_s, e_h = math.exp(ll_s - m), math.exp(ll_h - m)
    d = e_s + e_h
    return e_h/d, e_s/d

def score_df(df_in: pd.DataFrame):
    rows = []
    for mid, fj in zip(df_in["Message ID"], df_in["features_json"]):
        ph, ps = score(fj)
        rows.append((mid, ph, ps))
    return pd.DataFrame(rows, columns=["Message ID","ham","spam"])

train_preds = score_df(train_df); train_preds.to_csv("train-predictions.tsv", sep="\t", index=False)
test_preds  = score_df(test_df);  test_preds.to_csv("test-predictions.tsv",  sep="\t", index=False)

# -------- Part 6: ROC (test only) --------
test_labels = labels[labels["Message ID"] % 30 == 0][["Message ID","is_spam"]]
dfm = test_labels.merge(test_preds, on="Message ID", how="inner")
y_true = dfm["is_spam"].to_numpy()
p_sp = dfm["spam"].to_numpy()

rows = []
ths = np.round(np.arange(0.01, 1.00, 0.01), 2)
P = int((y_true==1).sum()); N = int((y_true==0).sum())
for t in ths:
    yhat = (p_sp >= t).astype(int)
    TP = int(((yhat==1)&(y_true==1)).sum())
    FP = int(((yhat==1)&(y_true==0)).sum())
    tpr = TP / P if P else 0.0
    fpr = FP / N if N else 0.0
    rows.append((float(t), fpr, tpr))
roc = pd.DataFrame(rows, columns=["threshold","false_positive_rate","true_positive_rate"])
roc.to_csv("roc.tsv", sep="\t", index=False)

print("Wrote files:",
      *[f"{p}" for p in ["train-features.tsv","test-features.tsv",
                         "feature-probabilities.tsv","train-predictions.tsv",
                         "test-predictions.tsv","roc.tsv"]], sep="\n - ")


Wrote files:
 - train-features.tsv
 - test-features.tsv
 - feature-probabilities.tsv
 - train-predictions.tsv
 - test-predictions.tsv
 - roc.tsv


## Part 9: Acknowledgements

If you discussed this assignment with anyone, please acknowledge them here.
If you did this assignment completely on your own, simply write none below.

If you used any libraries not mentioned in this module's content, please list them with a brief explanation of what you used them for. If you did not use any other libraries, simply write none below.

If you used any generative AI tools, please add links to your transcripts below, and any other information that you feel is necessary to comply with the generative AI policy. If you did not use any generative AI tools, simply write none below.

In [17]:
# Create TSV-formatted acknowledgments file(s)
content = (
    "Name\tAffiliation\tContribution\tLink\n"
    "ChatGPT (OpenAI)\tGenerative AI assistant\tHelped with feature design, Naive Bayes math, debugging, and ROC code for Week 9.\tN/A\n"
    "MWiechmann/enron_spam_data\tGitHub repository\tData source zip referenced in instructions.\thttps://github.com/MWiechmann/enron_spam_data\n"
    "bu-cds-dx704/dx704-project-09\tGitHub repository\tProject description and template notebook.\thttps://github.com/bu-cds-dx704/dx704-project-09\n"
    "bu-cds-omds example repos\tGitHub repositories\tExample notebooks used for reference.\thttps://github.com/bu-cds-omds\n"
)

# Save as acknowledgments.txt (expected by autograder) and also acknowledgments.tsv for convenience
with open("acknowledgments.txt", "w", encoding="utf-8") as f:
    f.write(content)

with open("acknowledgments.tsv", "w", encoding="utf-8") as f:
    f.write(content)

"acknowledgments.txt and acknowledgments.tsv written."


'acknowledgments.txt and acknowledgments.tsv written.'