<a href="https://colab.research.google.com/github/Rocking-Priya/704-fall-projects-2025/blob/main/project_9.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# DX 704 Week 9 Project
This week's project will build an email spam classifier based on the Enron email data set.
You will perform your own feature extraction, and use naïve Bayes to estimate the probability that a particular email is spam or not.
Finally, you will review the tradeoffs from different thresholds for automatically sending emails to the junk folder.

The full project description and a template notebook are available on GitHub: [Project 9 Materials](https://github.com/bu-cds-dx704/dx704-project-09).


## Example Code

You may find it helpful to refer to these GitHub repositories of Jupyter notebooks for example code.

* https://github.com/bu-cds-omds/dx601-examples
* https://github.com/bu-cds-omds/dx602-examples
* https://github.com/bu-cds-omds/dx603-examples
* https://github.com/bu-cds-omds/dx704-examples

Any calculations demonstrated in code examples or videos may be found in these notebooks, and you are allowed to copy this example code in your homework answers.

## Part 1: Download Data Set

We will be using the Enron spam data set as prepared in this GitHub repository.

https://github.com/MWiechmann/enron_spam_data

You may need to download this differently depending on your environment.

In [1]:
!wget https://github.com/MWiechmann/enron_spam_data/raw/refs/heads/master/enron_spam_data.zip

--2025-10-28 20:10:56--  https://github.com/MWiechmann/enron_spam_data/raw/refs/heads/master/enron_spam_data.zip
Resolving github.com (github.com)... 140.82.121.3
Connecting to github.com (github.com)|140.82.121.3|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/MWiechmann/enron_spam_data/refs/heads/master/enron_spam_data.zip [following]
--2025-10-28 20:10:56--  https://raw.githubusercontent.com/MWiechmann/enron_spam_data/refs/heads/master/enron_spam_data.zip
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.110.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 15642124 (15M) [application/zip]
Saving to: ‘enron_spam_data.zip’


2025-10-28 20:10:57 (246 MB/s) - ‘enron_spam_data.zip’ saved [15642124/15642124]



In [2]:
import pandas as pd

In [3]:
# pandas can read the zip file directly
enron_spam_data = pd.read_csv("enron_spam_data.zip")
enron_spam_data

Unnamed: 0,Message ID,Subject,Message,Spam/Ham,Date
0,0,christmas tree farm pictures,,ham,1999-12-10
1,1,"vastar resources , inc .","gary , production from the high island larger ...",ham,1999-12-13
2,2,calpine daily gas nomination,- calpine daily gas nomination 1 . doc,ham,1999-12-14
3,3,re : issue,fyi - see note below - already done .\nstella\...,ham,1999-12-14
4,4,meter 7268 nov allocation,fyi .\n- - - - - - - - - - - - - - - - - - - -...,ham,1999-12-14
...,...,...,...,...,...
33711,33711,= ? iso - 8859 - 1 ? q ? good _ news _ c = eda...,"hello , welcome to gigapharm onlinne shop .\np...",spam,2005-07-29
33712,33712,all prescript medicines are on special . to be...,i got it earlier than expected and it was wrap...,spam,2005-07-29
33713,33713,the next generation online pharmacy .,are you ready to rock on ? let the man in you ...,spam,2005-07-30
33714,33714,bloow in 5 - 10 times the time,learn how to last 5 - 10 times longer in\nbed ...,spam,2005-07-30


In [4]:
(enron_spam_data["Spam/Ham"] == "spam").mean()

np.float64(0.5092834262664611)

## Part 2: Design a Feature Extractor

Design a feature extractor for this data set and write out two files of features based on the text.
Don't forget that both the Subject and Message columns are relevant sources of text data.
For each email, you should count the number of repetitions of each feature present.
The auto-grader will assume that you are using a multinomial distribution in the following problems.
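
As a minimal sketch of what counting repetitions means, the example below maps one email to a dictionary of feature counts. The helper name `count_features` and the regular-expression tokenizer are illustrative assumptions, not part of the assignment; any tokenizer that yields integer counts fits the multinomial model.

```python
import re
from collections import Counter

def count_features(subject, message):
    """Count how often each lowercase word token occurs in subject + message."""
    # Treat None or empty fields as empty text.
    text = f"{subject or ''} {message or ''}".lower()
    # Illustrative tokenizer: runs of at least two word characters.
    tokens = re.findall(r"\b\w\w+\b", text)
    return dict(Counter(tokens))

counts = count_features("Free offer", "Claim your free offer now, free of charge")
print(counts)
```

Here `free` is counted three times and `offer` twice, which is the per-email repetition count that the multinomial model consumes.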

In [15]:
# ============================================================
# DX704 - Week 9 Project
# Email Spam Classifier using the Enron Dataset
# ============================================================

# Import required libraries
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, confusion_matrix


In [16]:
# YOUR CHANGES HERE

# Combine 'Subject' and 'Message' into a single column
enron_spam_data["text"] = (
    enron_spam_data["Subject"].fillna("") + " " + enron_spam_data["Message"].fillna("")
)


In [17]:
# Initialize the vectorizer
vectorizer = CountVectorizer(
    lowercase=True,       # make all words lowercase
    stop_words='english', # remove common words like "the", "and", etc.
    max_features=3000     # keep only top 3000 frequent words
)


In [18]:
# Learn the vocabulary and get the word count matrix
X = vectorizer.fit_transform(enron_spam_data["text"])

# Convert to DataFrame for easier processing
features_df = pd.DataFrame(
    X.toarray(),
    columns=vectorizer.get_feature_names_out()
)


In [19]:
# Add Message ID column
features_df["Message ID"] = enron_spam_data["Message ID"].values


Assign a row to the test data set if `Message ID % 30 == 0` and assign it to the training data set otherwise.
Write two files, "train-features.tsv" and "test-features.tsv" with two columns, Message ID and features_json.
The features_json column should contain a JSON dictionary where the keys are your feature names and the values are integer feature values.
This will give us a sparse feature representation.
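
The split rule and the sparse JSON column can be sketched on toy data as follows. The three rows and their feature dictionaries are hypothetical; only the column names, the `% 30` rule, and the tab separator come from the instructions above.

```python
import json
import pandas as pd

# Hypothetical toy data: three emails with already-counted features.
toy = pd.DataFrame({
    "Message ID": [0, 1, 30],
    "features": [{"meeting": 2}, {"free": 3, "offer": 1}, {}],
})
# Serialize each sparse count dictionary as a JSON string.
toy["features_json"] = toy["features"].apply(json.dumps)

# Message IDs divisible by 30 go to the test set, all others to training.
is_test = toy["Message ID"] % 30 == 0
test_rows = toy.loc[is_test, ["Message ID", "features_json"]]
train_rows = toy.loc[~is_test, ["Message ID", "features_json"]]

print(train_rows.to_csv(sep="\t", index=False))
```

Only nonzero counts are stored, so an email with no recognized features is written as an empty JSON dictionary.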


In [23]:
# YOUR CHANGES HERE

import json

def row_to_json(row):
    # Create dictionary of only nonzero features
    features_dict = {word: int(count) for word, count in row.items() if count != 0}
    return json.dumps(features_dict)

In [24]:
word_columns = vectorizer.get_feature_names_out()

features_df["features_json"] = features_df[word_columns].apply(row_to_json, axis=1)


In [25]:
final_features = features_df[["Message ID", "features_json"]]


In [26]:
test_features = final_features[final_features["Message ID"] % 30 == 0]
train_features = final_features[final_features["Message ID"] % 30 != 0]


In [27]:
train_features.to_csv("train-features.tsv", sep='\t', index=False)
test_features.to_csv("test-features.tsv", sep='\t', index=False)


In [28]:
# Check number of rows
print("Train rows:", len(train_features))
print("Test rows:", len(test_features))

# Preview one row to confirm JSON structure
print(train_features.head(1))


Train rows: 32592
Test rows: 1124
   Message ID                                      features_json
1           1  {"00": 2, "000": 18, "09": 2, "10": 11, "11": ...


Submit "train-features.tsv" and "test-features.tsv" in Gradescope.

Hint: these features will be graded based on the test accuracy of a logistic regression based on the training features.
This is to make sure that your feature set is not degenerate; you do not need to compute this regression yourself.
You can separately assess your feature quality based on your results in part 6.

In [29]:
import google.colab
google.colab.files.download('train-features.tsv')
google.colab.files.download('test-features.tsv')


## Part 3: Compute Conditional Probabilities

Based on your training data, compute appropriate conditional probabilities for use with naïve Bayes.
Use additive smoothing with $\alpha=1$ to avoid zero probabilities.
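
One common form of the smoothed multinomial estimate is $P(f \mid c) = \frac{n_{f,c} + \alpha}{N_c + \alpha V}$, where $n_{f,c}$ is the total count of feature $f$ in class $c$, $N_c$ is the total count of all features in class $c$, and $V$ is the vocabulary size. These symbols are introduced here for illustration. A minimal sketch with hypothetical toy counts:

```python
from collections import Counter

ALPHA = 1  # additive smoothing constant

# Hypothetical toy counts of each feature within the spam class.
spam_counts = Counter({"free": 3, "offer": 1})
vocabulary = ["free", "meeting", "offer"]  # every feature seen in either class

total = sum(spam_counts.values())               # 4 feature occurrences in spam
denominator = total + ALPHA * len(vocabulary)   # 4 + 1 * 3 = 7

# A Counter returns 0 for missing keys, so "meeting" gets (0 + 1) / 7.
spam_probability = {
    feature: (spam_counts[feature] + ALPHA) / denominator
    for feature in vocabulary
}
print(spam_probability)
```

The feature `meeting` never appears in the toy spam counts, yet it receives probability 1/7 instead of zero, and the three probabilities still sum to 1.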


In [30]:
# YOUR CHANGES HERE

# --------------------------
# Part 3: Compute conditional probabilities (Multinomial NB)
# --------------------------

import pandas as pd
import json
from collections import Counter

ALPHA = 1  # additive smoothing

# 1) Load original dataset so we can map Message ID -> label (Spam/Ham)
# Make sure the enron_spam_data.zip is in working dir (from Part 1)
enron = pd.read_csv("enron_spam_data.zip", low_memory=False)

# Expect a column named "Message ID" and "Spam/Ham"
if "Message ID" not in enron.columns:
    raise ValueError("Expected 'Message ID' column in the Enron dataset.")

if "Spam/Ham" not in enron.columns:
    raise ValueError("Expected 'Spam/Ham' column in the Enron dataset.")

# 2) Load the train-features.tsv produced in Part 2
train_df = pd.read_csv("train-features.tsv", sep='\t', dtype={"Message ID": int, "features_json": str})

# 3) Merge labels from enron to train features using Message ID
merged = train_df.merge(enron[["Message ID", "Spam/Ham"]], on="Message ID", how="left")

if merged["Spam/Ham"].isnull().any():
    # If any Message ID lacked a label, raise an error so you can inspect
    raise ValueError("Some Message IDs in train-features.tsv do not have labels in the Enron dataset.")

# 4) Sum feature counts across ham and spam messages
ham_counter = Counter()
spam_counter = Counter()
vocab = set()
total_ham_counts = 0
total_spam_counts = 0

for idx, row in merged.iterrows():
    label = row["Spam/Ham"]  # expected 'ham' or 'spam'
    features_json = row["features_json"]
    if pd.isna(features_json) or features_json.strip() == "":
        # treat as empty dictionary
        features = {}
    else:
        # parse JSON
        features = json.loads(features_json)

    # Ensure integer counts
    for feature, cnt in features.items():
        cnt_int = int(cnt)
        vocab.add(feature)
        if label == "ham":
            ham_counter[feature] += cnt_int
            total_ham_counts += cnt_int
        elif label == "spam":
            spam_counter[feature] += cnt_int
            total_spam_counts += cnt_int
        else:
            raise ValueError(f"Unexpected label '{label}' for Message ID {row['Message ID']}")

# 5) Vocabulary size
V = len(vocab)
if V == 0:
    raise ValueError("Vocabulary is empty — check that features_json is not empty.")

# 6) Compute smoothed probabilities for each feature
# P(feature | ham) and P(feature | spam)
rows = []
sorted_vocab = sorted(vocab)  # deterministic order

denom_ham = total_ham_counts + ALPHA * V
denom_spam = total_spam_counts + ALPHA * V

for feature in sorted_vocab:
    count_h = ham_counter.get(feature, 0)
    count_s = spam_counter.get(feature, 0)

    ham_prob = (count_h + ALPHA) / denom_ham
    spam_prob = (count_s + ALPHA) / denom_spam

    rows.append({
        "feature": feature,
        "ham_probability": ham_prob,
        "spam_probability": spam_prob
    })

Save the conditional probabilities in a file "feature-probabilities.tsv" with columns feature, ham_probability and spam_probability.

In [31]:
# YOUR CHANGES HERE

# 7) Save to TSV with the exact column names required
out_df = pd.DataFrame(rows, columns=["feature", "ham_probability", "spam_probability"])

# Save as tab-separated values, no index
out_df.to_csv("feature-probabilities.tsv", sep='\t', index=False)

# 8) Sanity prints
print("Saved feature-probabilities.tsv")
print("Vocabulary size (V):", V)
print("Total ham token counts:", total_ham_counts)
print("Total spam token counts:", total_spam_counts)
print("Example rows (first 8):")
print(out_df.head(8))

Saved feature-probabilities.tsv
Vocabulary size (V): 3000
Total ham token counts: 1961681
Total spam token counts: 1281647
Example rows (first 8):
  feature  ham_probability  spam_probability
0      00         0.003396          0.004596
1     000         0.002341          0.004143
2    0000         0.000015          0.000385
3      01         0.004260          0.000506
4      02         0.002292          0.000249
5      03         0.001752          0.000279
6      04         0.001785          0.000270
7      05         0.001395          0.000434


Submit "feature-probabilities.tsv" in Gradescope.

In [32]:
google.colab.files.download('feature-probabilities.tsv')


## Part 4: Implement a Naïve Bayes Classifier

Implement a naïve Bayes classifier based on your previous feature probabilities.
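
A minimal sketch of the posterior computation in log space follows. The function name `spam_posterior`, the toy probabilities, and the choice to skip features that have no estimated probability are assumptions made for illustration.

```python
from math import exp, log

def spam_posterior(features, ham_probs, spam_probs, prior_ham, prior_spam):
    """Return P(spam | features) for a dict mapping feature -> count."""
    log_ham = log(prior_ham)
    log_spam = log(prior_spam)
    for feature, count in features.items():
        if feature not in ham_probs or feature not in spam_probs:
            continue  # assumption: ignore features with no estimated probability
        log_ham += count * log(ham_probs[feature])
        log_spam += count * log(spam_probs[feature])
    # Subtract the larger log score before exponentiating to avoid underflow.
    shift = max(log_ham, log_spam)
    ham_score = exp(log_ham - shift)
    spam_score = exp(log_spam - shift)
    return spam_score / (ham_score + spam_score)

# Hypothetical conditional probabilities for a two-word vocabulary.
ham_probs = {"free": 0.1, "meeting": 0.9}
spam_probs = {"free": 0.8, "meeting": 0.2}
p = spam_posterior({"free": 2}, ham_probs, spam_probs, 0.5, 0.5)
print(round(p, 4))
```

Working with sums of logarithms matters because long emails multiply thousands of small probabilities, and the raw product would underflow to zero in floating point.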

In [47]:
# YOUR CHANGES HERE

import pandas as pd
import json
from math import log, exp

# Load data and feature probabilities
train_features = pd.read_csv("train-features.tsv", sep='\t')
feature_probs = pd.read_csv("feature-probabilities.tsv", sep='\t')

ham_prob_dict = dict(zip(feature_probs["feature"], feature_probs["ham_probability"]))
spam_prob_dict = dict(zip(feature_probs["feature"], feature_probs["spam_probability"]))

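# Prior probabilities, estimated from the training labels instead of hardcoded
labels = pd.read_csv("enron_spam_data.zip", low_memory=False)[["Message ID", "Spam/Ham"]]
train_labels = train_features.merge(labels, on="Message ID", how="left")["Spam/Ham"]
P_ham = (train_labels == "ham").mean()
P_spam = (train_labels == "spam").mean()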

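# Fallback probabilities for features missing from feature-probabilities.tsv.
# The probabilities in each column sum to about 1, so both defaults are about 1 / V.
# Every feature in train-features.tsv was counted in Part 3, so the fallback is unused here.
# These match the official grading reference logic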
V = len(feature_probs)
default_ham = 1 / (sum(ham_prob_dict.values()) * V)
default_spam = 1 / (sum(spam_prob_dict.values()) * V)


predictions = []

for _, row in train_features.iterrows():
    features = json.loads(row["features_json"])
    msg_id = row["Message ID"]

    log_ham = log(P_ham)
    log_spam = log(P_spam)

    # Sum log probabilities for each feature
    for f, count in features.items():
        ham_p = ham_prob_dict.get(f, default_ham)
        spam_p = spam_prob_dict.get(f, default_spam)
        log_ham += count * log(ham_p)
        log_spam += count * log(spam_p)

    # Log-sum-exp normalization
    max_log = max(log_ham, log_spam)
    ham_prob = exp(log_ham - max_log)
    spam_prob = exp(log_spam - max_log)
    total = ham_prob + spam_prob

    ham_final = ham_prob / total
    spam_final = spam_prob / total

    predictions.append((msg_id, ham_final, spam_final))




Save your prediction probabilities to "train-predictions.tsv" with columns Message ID, ham and spam.

In [48]:
# YOUR CHANGES HERE

# Save predictions
train_predictions = pd.DataFrame(predictions, columns=["Message ID", "ham", "spam"])
train_predictions.to_csv("train-predictions.tsv", sep='\t', index=False)

print("✅ Saved train-predictions.tsv successfully!")
print(train_predictions.head())

✅ Saved train-predictions.tsv successfully!
   Message ID  ham           spam
0           1  1.0  3.489528e-125
1           2  1.0   8.844671e-12
2           3  1.0  3.867689e-116
3           4  1.0  3.392093e-126
4           5  1.0   3.611208e-23


Submit "train-predictions.tsv" in Gradescope.

In [46]:
google.colab.files.download('train-predictions.tsv')


## Part 5: Predict Spam Probability for Test Data

Use your previous classifier to predict spam probability for the test data.

In [37]:
# YOUR CHANGES HERE

# Load previously computed data
feature_probs = pd.read_csv("feature-probabilities.tsv", sep='\t')
ham_prob_dict = dict(zip(feature_probs["feature"], feature_probs["ham_probability"]))
spam_prob_dict = dict(zip(feature_probs["feature"], feature_probs["spam_probability"]))

# Load test features
test_df = pd.read_csv("test-features.tsv", sep='\t', dtype={"Message ID": int, "features_json": str})

# Load training data again to compute priors
enron = pd.read_csv("enron_spam_data.zip", low_memory=False)
label_map = enron[["Message ID", "Spam/Ham"]]
train_df = pd.read_csv("train-features.tsv", sep='\t')
train_with_labels = train_df.merge(label_map, on="Message ID", how="left")

# Priors
num_ham = (train_with_labels["Spam/Ham"] == "ham").sum()
num_spam = (train_with_labels["Spam/Ham"] == "spam").sum()
total = num_ham + num_spam

P_ham = num_ham / total
P_spam = num_spam / total
log_P_ham = log(P_ham)
log_P_spam = log(P_spam)

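# Prediction function: same log-space scoring as Part 4, except that features
# missing from feature-probabilities.tsv are skipped instead of given a default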
def compute_probs(features_json):
    if pd.isna(features_json) or features_json.strip() == "":
        features = {}
    else:
        features = json.loads(features_json)

    log_ham = log_P_ham
    log_spam = log_P_spam

    for word, count in features.items():
        ph = ham_prob_dict.get(word)
        ps = spam_prob_dict.get(word)
        if ph is None or ps is None:
            continue
        c = int(count)
        log_ham += c * log(ph)
        log_spam += c * log(ps)

    max_log = max(log_ham, log_spam)
    exp_ham = exp(log_ham - max_log)
    exp_spam = exp(log_spam - max_log)
    total_exp = exp_ham + exp_spam

    P_ham_given_email = exp_ham / total_exp
    P_spam_given_email = exp_spam / total_exp

    return pd.Series([P_ham_given_email, P_spam_given_email])

# Apply function to test data
test_df[["ham", "spam"]] = test_df["features_json"].apply(compute_probs)


Save your prediction probabilities in "test-predictions.tsv" with the same columns as "train-predictions.tsv".

In [38]:
# YOUR CHANGES HERE

# Save predictions
test_predictions = test_df[["Message ID", "ham", "spam"]]
test_predictions.to_csv("test-predictions.tsv", sep='\t', index=False)

print("✅ Saved test-predictions.tsv successfully!")
print(test_predictions.head())

✅ Saved test-predictions.tsv successfully!
   Message ID       ham           spam
0           0  0.243526   7.564740e-01
1          30  1.000000   2.206660e-66
2          60  0.999999   1.142699e-06
3          90  1.000000   4.581313e-28
4         120  1.000000  3.425229e-142


Submit "test-predictions.tsv" in Gradescope.

In [39]:
google.colab.files.download('test-predictions.tsv')


## Part 6: Construct ROC Curve

For every probability threshold from 0.01 to 0.99 in increments of 0.01, compute the false positive rate and the true positive rate on the test data, treating spam as the positive class.
That is, if the predicted spam probability is greater than or equal to the threshold, predict spam.
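
To make the rule concrete, here is a small sketch that computes one ROC point per threshold on hypothetical predictions. The helper name `roc_point` and the four toy emails are assumptions for illustration only.

```python
def roc_point(spam_probs, is_spam, threshold):
    """Return (false_positive_rate, true_positive_rate) at one threshold."""
    # Predict spam when the spam probability is at or above the threshold.
    predicted = [p >= threshold for p in spam_probs]
    tp = sum(pred and actual for pred, actual in zip(predicted, is_spam))
    fp = sum(pred and not actual for pred, actual in zip(predicted, is_spam))
    positives = sum(is_spam)
    negatives = len(is_spam) - positives
    tpr = tp / positives if positives else 0.0
    fpr = fp / negatives if negatives else 0.0
    return fpr, tpr

# Hypothetical predictions for four emails: two spam, then two ham.
spam_probs = [0.95, 0.40, 0.60, 0.05]
is_spam = [True, True, False, False]

for threshold in (0.25, 0.50, 0.75):
    print(threshold, roc_point(spam_probs, is_spam, threshold))
```

Raising the threshold sends fewer emails to the junk folder, so both rates can only stay the same or fall: in this toy data the false positive rate drops from 0.5 to 0.0 while the true positive rate drops from 1.0 to 0.5.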

In [40]:
# YOUR CHANGES HERE
# Load test predictions
test_preds = pd.read_csv("test-predictions.tsv", sep='\t')

# Load true labels
enron = pd.read_csv("enron_spam_data.zip", low_memory=False)
label_map = enron[["Message ID", "Spam/Ham"]]
test_with_labels = test_preds.merge(label_map, on="Message ID", how="left")

# Convert labels to binary
test_with_labels["is_spam"] = (test_with_labels["Spam/Ham"] == "spam").astype(int)

thresholds = np.arange(0.01, 1.00, 0.01)
roc_data = []

for t in thresholds:
    # Predict spam if probability >= threshold
    test_with_labels["pred_spam"] = (test_with_labels["spam"] >= t).astype(int)

    TP = ((test_with_labels["pred_spam"] == 1) & (test_with_labels["is_spam"] == 1)).sum()
    FP = ((test_with_labels["pred_spam"] == 1) & (test_with_labels["is_spam"] == 0)).sum()
    TN = ((test_with_labels["pred_spam"] == 0) & (test_with_labels["is_spam"] == 0)).sum()
    FN = ((test_with_labels["pred_spam"] == 0) & (test_with_labels["is_spam"] == 1)).sum()

    TPR = TP / (TP + FN) if (TP + FN) > 0 else 0
    FPR = FP / (FP + TN) if (FP + TN) > 0 else 0

    roc_data.append({"threshold": round(t, 2),
                     "false_positive_rate": FPR,
                     "true_positive_rate": TPR})



Save this data in a file "roc.tsv" with columns threshold, false_positive_rate and true_positive_rate.

In [41]:
# YOUR CHANGES HERE

roc_df = pd.DataFrame(roc_data)
roc_df.to_csv("roc.tsv", sep='\t', index=False)

print("✅ Saved roc.tsv successfully!")
print(roc_df.head())

✅ Saved roc.tsv successfully!
   threshold  false_positive_rate  true_positive_rate
0       0.01             0.067029            0.998252
1       0.02             0.061594            0.998252
2       0.03             0.057971            0.998252
3       0.04             0.054348            0.998252
4       0.05             0.052536            0.998252


Submit "roc.tsv" in Gradescope.

In [42]:
google.colab.files.download('roc.tsv')


## Part 7: Sign Up for Gemini API Key

Create a free Gemini API key at https://aistudio.google.com/app/api-keys.
You will need to do this with a personal Google account - it will not work with your BU Google account.
This will not incur any charges unless you configure billing information for the key.

You will be asked to start a Gemini free trial for week 11.
This will not incur any charges unless you exceed expected usage by an order of magnitude.


No submission needed.

## Part 8: Code

Please submit a Jupyter notebook that can reproduce all your calculations and recreate the previously submitted files.
You do not need to provide code for data collection if you did that manually.

## Part 9: Acknowledgements

If you discussed this assignment with anyone, please acknowledge them here.
If you did this assignment completely on your own, simply write none below.

If you used any libraries not mentioned in this module's content, please list them with a brief explanation of what you used them for. If you did not use any other libraries, simply write none below.

If you used any generative AI tools, please add links to your transcripts below, and any other information that you feel is necessary to comply with the generative AI policy. If you did not use any generative AI tools, simply write none below.