# DX 704 Week 9 Project

This week's project will build an email spam classifier based on the Enron email data set.
You will perform your own feature extraction, and use naive Bayes to estimate the probability that a particular email is spam or not.
Finally, you will review the tradeoffs from different thresholds for automatically sending emails to the junk folder.

The full project description and a template notebook are available on GitHub: [Project 9 Materials](https://github.com/bu-cds-dx704/dx704-project-09).


## Example Code

You may find it helpful to refer to these GitHub repositories of Jupyter notebooks for example code.

* https://github.com/bu-cds-omds/dx601-examples
* https://github.com/bu-cds-omds/dx602-examples
* https://github.com/bu-cds-omds/dx603-examples
* https://github.com/bu-cds-omds/dx704-examples

Any calculations demonstrated in code examples or videos may be found in these notebooks, and you are allowed to copy this example code in your homework answers.

## Part 1: Download Data Set

We will be using the Enron spam data set as prepared in this GitHub repository.

https://github.com/MWiechmann/enron_spam_data

You may need to download this differently depending on your environment.

In [2]:
!wget https://github.com/MWiechmann/enron_spam_data/raw/refs/heads/master/enron_spam_data.zip

--2025-11-02 00:36:51--  https://github.com/MWiechmann/enron_spam_data/raw/refs/heads/master/enron_spam_data.zip
Resolving github.com (github.com)... 140.82.113.4
Connecting to github.com (github.com)|140.82.113.4|:443... 

connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/MWiechmann/enron_spam_data/refs/heads/master/enron_spam_data.zip [following]
--2025-11-02 00:36:51--  https://raw.githubusercontent.com/MWiechmann/enron_spam_data/refs/heads/master/enron_spam_data.zip
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 15642124 (15M) [application/zip]
Saving to: ‘enron_spam_data.zip.4’


2025-11-02 00:36:52 (102 MB/s) - ‘enron_spam_data.zip.4’ saved [15642124/15642124]



In [3]:
import pandas as pd

In [4]:
# pandas can read the zip file directly
enron_spam_data = pd.read_csv("enron_spam_data.zip")
enron_spam_data

Unnamed: 0,Message ID,Subject,Message,Spam/Ham,Date
0,0,christmas tree farm pictures,,ham,1999-12-10
1,1,"vastar resources , inc .","gary , production from the high island larger ...",ham,1999-12-13
2,2,calpine daily gas nomination,- calpine daily gas nomination 1 . doc,ham,1999-12-14
3,3,re : issue,fyi - see note below - already done .\nstella\...,ham,1999-12-14
4,4,meter 7268 nov allocation,fyi .\n- - - - - - - - - - - - - - - - - - - -...,ham,1999-12-14
...,...,...,...,...,...
33711,33711,= ? iso - 8859 - 1 ? q ? good _ news _ c = eda...,"hello , welcome to gigapharm onlinne shop .\np...",spam,2005-07-29
33712,33712,all prescript medicines are on special . to be...,i got it earlier than expected and it was wrap...,spam,2005-07-29
33713,33713,the next generation online pharmacy .,are you ready to rock on ? let the man in you ...,spam,2005-07-30
33714,33714,bloow in 5 - 10 times the time,learn how to last 5 - 10 times longer in\nbed ...,spam,2005-07-30


In [5]:
(enron_spam_data["Spam/Ham"] == "spam").mean()

np.float64(0.5092834262664611)

## Part 2: Design a Feature Extractor

Design a feature extractor for this data set and write out two files of features based on the text.
Don't forget that both the Subject and Message columns are relevant sources of text data.
For each email, you should count the number of repetitions of each feature present.
The auto-grader will assume that you are using a multinomial distribution in the following problems.

In [None]:
# YOUR CHANGES HERE
import re
from collections import Counter

_TOKEN_RE = re.compile(r"[A-Za-z0-9']+")

def _tokenize(text: str):
    return [t.lower() for t in _TOKEN_RE.findall(text or "")]

def _subject_boost(tokens, boost=2):
    if boost <= 1: 
        return tokens
    out = []
    for t in tokens:
        out.extend([t]*int(boost))
    return out

def extract_handcrafted_counts(subject: str, message: str):
    """Return sparse dict[str,int] of counts from Subject+Message (for Multinomial NB)."""
    subj = subject or ""
    msg  = message or ""

    subj_toks = _subject_boost(_tokenize(subj), boost=2)   # mild emphasis on Subject
    msg_toks  = _tokenize(msg)

    counts = Counter()
    counts.update(subj_toks)
    counts.update(msg_toks)

    s = f"{subj}\n{msg}"

    # Integer engineered features (still counts)
    counts.update({
        "__EXCLAIMS__": s.count("!"),
        "__DOLLARS__": s.count("$"),
        "__PERCENTS__": s.count("%"),
        "__URLS__": len(re.findall(r"(https?://|www\.)\S+", s, flags=re.I)),
        "__EMAILS__": len(re.findall(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}", s)),
        "__UPPER_WORDS__": sum(1 for w in re.findall(r"\b[A-Z]{2,}\b", s)),
        "__LEN_BUCKET_100__": len(s) // 100,
        "__TOK_BUCKET_50__": (len(subj_toks) + len(msg_toks)) // 50,
        "cue_subject_re": int(subj.lower().startswith("re:")),
        "cue_subject_fwd": int(subj.lower().startswith(("fwd:", "fw:"))),
        "cue_contains_free": int("free" in s.lower()),
        "cue_contains_click": int("click" in s.lower()),
        "cue_contains_offer": int("offer" in s.lower()),
    })

    # Keep only positive integers
    return {str(k): int(v) for k, v in counts.items() if int(v) > 0}




Assign a row to the test data set if `Message ID % 30 == 0` and assign it to the training data set otherwise.
Write two files, "train-features.tsv" and "test-features.tsv" with two columns, Message ID and features_json.
The features_json column should contain a JSON dictionary where the keys are your feature names and the values are integer feature values.
This will give us a sparse feature representation.


In [8]:
# YOUR CHANGES HERE
import pandas as pd, numpy as np, zipfile, io, hashlib, re, json
from pathlib import Path

ZIP_PATH = Path("enron_spam_data.zip")

def load_enron_zip(zip_path=ZIP_PATH) -> pd.DataFrame:
    """Return DataFrame with columns Subject, Message, Spam/Ham, Date."""
    assert zip_path.exists(), "enron_spam_data.zip not found in the working directory."
    with zipfile.ZipFile(zip_path, "r") as z:
        # The zip contains a single CSV inside; find it:
        inner = [n for n in z.namelist() if n.lower().endswith(".csv")]
        assert inner, "CSV not found inside enron_spam_data.zip"
        with z.open(inner[0]) as f:
            df = pd.read_csv(io.TextIOWrapper(f, encoding="utf-8"))
    # Normalize expected columns per repo README (Subject, Message, Spam/Ham, Date)
    ren = {c.lower(): c for c in df.columns}
    need = ["subject","message","spam/ham","date"]
    missing = [x for x in need if x not in ren]
    if missing:
        raise ValueError(f"CSV missing expected columns: {missing}; found={list(df.columns)}")
    return df.rename(columns={
        ren["subject"]:"Subject",
        ren["message"]:"Message",
        ren["spam/ham"]:"Spam/Ham",
        ren["date"]:"Date",
    })[["Subject","Message","Spam/Ham","Date"]]

def ensure_message_id(df: pd.DataFrame) -> pd.DataFrame:
    """Create a stable Message ID if the CSV didn’t include one."""
    if "Message ID" in df.columns:
        return df
    # Use a stable 32-bit hash of Subject+Message+Date
    def mk_id(row):
        key = (str(row["Subject"]) + "\n" + str(row["Message"]) + "\n" + str(row["Date"])).encode("utf-8", "ignore")
        return int(hashlib.md5(key).hexdigest()[:8], 16)
    out = df.copy()
    out["Message ID"] = out.apply(mk_id, axis=1)
    return out

# 1) Load and normalize
all_df = load_enron_zip()
all_df = ensure_message_id(all_df)
all_df = all_df[["Message ID","Subject","Message"]].copy()
all_df["Subject"] = all_df["Subject"].astype(str)
all_df["Message"] = all_df["Message"].astype(str)

# 2) Build features_json per row
rows = []
for _, r in all_df.iterrows():
    feats = extract_handcrafted_counts(r["Subject"], r["Message"])
    rows.append({"Message ID": r["Message ID"], "features_json": json.dumps(feats, ensure_ascii=False)})

feat_df = pd.DataFrame(rows)

# 3) Split rule: test if (Message ID % 30 == 0)
def id_to_int(x):
    try:
        return int(x)
    except Exception:
        # last 8 hex of md5 to int — should be rare given we created int above
        return int(hashlib.md5(str(x).encode("utf-8")).hexdigest()[:8], 16)

feat_df["_id_int"] = feat_df["Message ID"].apply(id_to_int)
test_mask = (feat_df["_id_int"] % 30 == 0)

train_out = feat_df.loc[~test_mask, ["Message ID","features_json"]].reset_index(drop=True)
test_out  = feat_df.loc[ test_mask, ["Message ID","features_json"]].reset_index(drop=True)

# 4) Save
train_out.to_csv("train-features.tsv", sep="\t", index=False)
test_out.to_csv("test-features.tsv",  sep="\t", index=False)

print(f"Saved train-features.tsv (rows={len(train_out)}) and test-features.tsv (rows={len(test_out)})")



Saved train-features.tsv (rows=32542) and test-features.tsv (rows=1174)


Submit "train-features.tsv" and "test-features.tsv" in Gradescope.

Hint: these features will be graded based on the test accuracy of a logistic regression based on the training features.
This is to make sure that your feature set is not degenerate; you do not need to compute this regression yourself.
You can separately assess your feature quality based on your results in part 6.

## Part 3: Compute Conditional Probabilities

Based on your training data, compute appropriate conditional probabilities for use with naïve Bayes.
Use of additive smoothing with $\alpha=1$ to avoid zeros.


In [None]:
# YOUR CHANGES HERE
import pandas as pd, json
from collections import Counter

train_feats = pd.read_csv("train-features.tsv", sep="\t")
raw_df = load_enron_with_ids()
label_map = raw_df.set_index("Message ID")["Spam/Ham"].to_dict()

train_feats["Label"] = train_feats["Message ID"].map(label_map)
train_feats = train_feats.dropna(subset=["Label"]).reset_index(drop=True)

alpha = 1.0
ham_counts, spam_counts = Counter(), Counter()

for _, row in train_feats.iterrows():
    feats = json.loads(row["features_json"])
    if row["Label"] == "ham":
        ham_counts.update({k:int(v) for k,v in feats.items() if int(v)>0})
    else:
        spam_counts.update({k:int(v) for k,v in feats.items() if int(v)>0})

vocab = sorted(set(ham_counts) | set(spam_counts))
V = len(vocab)
ham_total  = sum(ham_counts.values())
spam_total = sum(spam_counts.values())

den_ham  = ham_total  + alpha * V
den_spam = spam_total + alpha * V

fp = pd.DataFrame({
    "feature": vocab,
    "ham_probability":  [(ham_counts.get(f,0)+alpha)/den_ham  for f in vocab],
    "spam_probability": [(spam_counts.get(f,0)+alpha)/den_spam for f in vocab],
})
fp.to_csv("feature-probabilities.tsv", sep="\t", index=False)
print("Saved feature-probabilities.tsv", fp.shape)



Saved feature-probabilities.tsv (0, 3)


Save the conditional probabilities in a file "feature-probabilities.tsv" with columns feature, ham_probability and spam_probability.

In [None]:
# YOUR CHANGES HERE
import json, zipfile, io
import pandas as pd
from collections import Counter
from pathlib import Path

alpha = 1.0
ZIP_PATH = Path("enron_spam_data.zip")

def load_enron_with_ids(zip_path=ZIP_PATH) -> pd.DataFrame:
    with zipfile.ZipFile(zip_path, "r") as z:
        inner = [n for n in z.namelist() if n.lower().endswith(".csv")]
        assert inner, "CSV not found inside enron_spam_data.zip"
        with z.open(inner[0]) as f:
            df = pd.read_csv(io.TextIOWrapper(f, encoding="utf-8"))
    lower = {c.lower(): c for c in df.columns}
    need = ["message id","subject","message","spam/ham","date"]
    missing = [x for x in need if x not in lower]
    if missing:
        raise ValueError(f"CSV missing columns {missing}; found={list(df.columns)}")
    df = df.rename(columns={
        lower["message id"]: "Message ID",
        lower["subject"]:   "Subject",
        lower["message"]:   "Message",
        lower["spam/ham"]:  "Spam/Ham",
        lower["date"]:      "Date",
    })
    df["Spam/Ham"] = df["Spam/Ham"].astype(str).str.strip().str.lower()
    df = df[df["Spam/Ham"].isin(["ham","spam"])]
    df = df.dropna(subset=["Message"])
    df = df[df["Message"].astype(str).str.strip().ne("")]
    df["Message ID"] = pd.to_numeric(df["Message ID"], errors="raise")
    df = df.drop_duplicates(subset=["Message ID"], keep="first")
    return df[["Message ID","Spam/Ham"]]

# 1) Load training features and labels
train_feats = pd.read_csv("train-features.tsv", sep="\t")
assert {"Message ID","features_json"}.issubset(train_feats.columns), "Missing columns in train-features.tsv"
labels_df = load_enron_with_ids()
label_map = labels_df.set_index("Message ID")["Spam/Ham"].to_dict()
train_feats["Label"] = train_feats["Message ID"].map(label_map)
train_feats = train_feats.dropna(subset=["Label"]).reset_index(drop=True)

# 2) Aggregate counts per class (Multinomial)
ham_counts, spam_counts = Counter(), Counter()
for _, row in train_feats.iterrows():
    feats = json.loads(row["features_json"])
    # keep only positive integer counts
    feats = {k:int(v) for k,v in feats.items() if int(v) > 0}
    if row["Label"] == "ham":
        ham_counts.update(feats)
    else:
        spam_counts.update(feats)

# 3) Compute Laplace-smoothed conditionals
vocab = sorted(set(ham_counts) | set(spam_counts))
V = len(vocab)
ham_total  = sum(ham_counts.values())
spam_total = sum(spam_counts.values())
den_ham  = ham_total  + alpha * V
den_spam = spam_total + alpha * V

out_df = pd.DataFrame({
    "feature": vocab,
    "ham_probability":  [(ham_counts.get(f,0) + alpha) / den_ham  for f in vocab],
    "spam_probability": [(spam_counts.get(f,0) + alpha) / den_spam for f in vocab],
})

out_df.to_csv("feature-probabilities.tsv", sep="\t", index=False)
print(f"Saved feature-probabilities.tsv with {len(out_df)} rows.")


Saved feature-probabilities.tsv with 0 rows.


Submit "feature-probabilities.tsv" in Gradescope.

## Part 4: Implement a Naïve Bayes Classifier

Implement a naïve Bayes classifier based on your previous feature probabilities.

In [None]:
# YOUR CHANGES HERE
import json, io, zipfile
import numpy as np
import pandas as pd
from pathlib import Path

# 1) Load conditional probabilities from Part 3
fp = pd.read_csv("feature-probabilities.tsv", sep="\t")
assert {"feature","ham_probability","spam_probability"}.issubset(fp.columns), \
    "Run Part 3 to create feature-probabilities.tsv first."

p_ham  = dict(zip(fp["feature"].astype(str), fp["ham_probability"].astype(float)))
p_spam = dict(zip(fp["feature"].astype(str), fp["spam_probability"].astype(float)))

# Log-prob lookup (tiny epsilon for safety)
log_p_ham  = {k: np.log(v + 1e-12) for k, v in p_ham.items()}
log_p_spam = {k: np.log(v + 1e-12) for k, v in p_spam.items()}

# Fallback for unseen features: conservative min observed per class
log_punk_ham  = float(np.log(fp["ham_probability"].min()  + 1e-12))
log_punk_spam = float(np.log(fp["spam_probability"].min() + 1e-12))

# 2) Load TRAIN IDs and compute class priors from labels
def load_enron_with_ids(zip_path=Path("enron_spam_data.zip")) -> pd.DataFrame:
    with zipfile.ZipFile(zip_path, "r") as z:
        inner = [n for n in z.namelist() if n.lower().endswith(".csv")]
        assert inner, "CSV not found inside enron_spam_data.zip"
        with z.open(inner[0]) as f:
            df = pd.read_csv(io.TextIOWrapper(f, encoding="utf-8"))
    lower = {c.lower(): c for c in df.columns}
    df = df.rename(columns={
        lower["message id"]: "Message ID",
        lower["subject"]:   "Subject",
        lower["message"]:   "Message",
        lower["spam/ham"]:  "Spam/Ham",
        lower["date"]:      "Date",
    })
    df["Spam/Ham"] = df["Spam/Ham"].astype(str).str.strip().str.lower()
    df = df[df["Spam/Ham"].isin(["ham","spam"])]
    df = df.dropna(subset=["Message"])
    df = df[df["Message"].astype(str).str.strip().ne("")]
    df["Message ID"] = pd.to_numeric(df["Message ID"], errors="raise")
    df = df.drop_duplicates(subset=["Message ID"], keep="first")
    return df[["Message ID","Spam/Ham"]]

train_feats_for_priors = pd.read_csv("train-features.tsv", sep="\t")
assert {"Message ID","features_json"}.issubset(train_feats_for_priors.columns), "Run Part 2 first."

labels_df = load_enron_with_ids()
label_map = labels_df.set_index("Message ID")["Spam/Ham"].to_dict()
train_feats_for_priors["Label"] = train_feats_for_priors["Message ID"].map(label_map)

n_ham  = int((train_feats_for_priors["Label"] == "ham").sum())
n_spam = int((train_feats_for_priors["Label"] == "spam").sum())
n_tot  = max(1, n_ham + n_spam)

log_prior_ham  = float(np.log(n_ham  / n_tot + 1e-12))
log_prior_spam = float(np.log(n_spam / n_tot + 1e-12))

# 3) Define predictor
def predict_proba_features_json(feat_json: str):
    """
    Multinomial NB in log-space:
      log P(c) + sum_f count_f * log P(f|c)
    Returns (p_ham, p_spam).
    """
    counts = json.loads(feat_json) if isinstance(feat_json, str) else {}
    lh, ls = log_prior_ham, log_prior_spam
    for f, c in counts.items():
        if not c: 
            continue
        c = int(c)
        lh += c * log_p_ham.get(f,  log_punk_ham)
        ls += c * log_p_spam.get(f, log_punk_spam)
    m = max(lh, ls)
    eh, es = np.exp(lh - m), np.exp(ls - m)
    z = eh + es
    return float(eh / z), float(es / z)



Save your prediction probabilities to "train-predictions.tsv" with columns Message ID, ham and spam.

In [None]:
# YOUR CHANGES HERE
import pandas as pd

assert 'predict_proba_features_json' in globals(), "Run Part 4 — Cell 1 first."

train_feats = pd.read_csv("train-features.tsv", sep="\t")
assert {"Message ID","features_json"}.issubset(train_feats.columns)

ph_list, ps_list = [], []
for s in train_feats["features_json"].astype(str):
    ph, ps = predict_proba_features_json(s)
    ph_list.append(ph)
    ps_list.append(ps)

out_train = pd.DataFrame({
    "Message ID": train_feats["Message ID"],
    "ham":  ph_list,
    "spam": ps_list,
})
out_train.to_csv("train-predictions.tsv", sep="\t", index=False)
print(f"Saved train-predictions.tsv (rows={len(out_train)}) with columns: Message ID, ham, spam")



Saved train-predictions.tsv (rows=32542) with columns: Message ID, ham, spam


Submit "train-predictions.tsv" in Gradescope.

## Part 5: Predict Spam Probability for Test Data

Use your previous classifier to predict spam probability for the test data.

In [None]:
# YOUR CHANGES HERE
import pandas as pd

# Requires predict_proba_features_json from Part 4 — Cell 1
assert 'predict_proba_features_json' in globals(), "Run Part 4 — Cell 1 to define predict_proba_features_json first."

test_feats = pd.read_csv("test-features.tsv", sep="\t")
assert {"Message ID","features_json"}.issubset(test_feats.columns), "Missing columns in test-features.tsv."

test_ph, test_ps = [], []
for s in test_feats["features_json"].astype(str):
    ph, ps = predict_proba_features_json(s)
    test_ph.append(ph); test_ps.append(ps)

test_pred_df = pd.DataFrame({
    "Message ID": test_feats["Message ID"],
    "ham":  test_ph,
    "spam": test_ps,
})
print(f"Built test_pred_df with {len(test_pred_df)} rows.")



Built test_pred_df with 1174 rows.


Save your prediction probabilities in "test-predictions.tsv" with the same columns as "train-predictions.tsv".

In [None]:
# YOUR CHANGES HERE
assert 'test_pred_df' in globals(), "Run Part 5 — Cell 1 first."
test_pred_df.to_csv("test-predictions.tsv", sep="\t", index=False)
print("Saved test-predictions.tsv with columns: Message ID, ham, spam")



Saved test-predictions.tsv with columns: Message ID, ham, spam


Submit "test-predictions.tsv" in Gradescope.

## Part 6: Construct ROC Curve

For every probability threshold from 0.01 to .99 in increments of 0.01, compute the false and true positive rates from the test data using the spam class for positives.
That is, if the predicted spam probability is greater than or equal to the threshold, predict spam.

In [None]:
# YOUR CHANGES HERE
import pandas as pd, numpy as np, zipfile, io
from pathlib import Path

# 1) Load test predictions
preds = pd.read_csv("test-predictions.tsv", sep="\t")
assert {"Message ID","spam"}.issubset(preds.columns), "test-predictions.tsv missing columns."

# 2) Load ground-truth labels with official Message ID
def load_enron_with_ids(zip_path=Path("enron_spam_data.zip")) -> pd.DataFrame:
    with zipfile.ZipFile(zip_path, "r") as z:
        inner = [n for n in z.namelist() if n.lower().endswith(".csv")]
        assert inner, "CSV not found in enron_spam_data.zip"
        with z.open(inner[0]) as f:
            df = pd.read_csv(io.TextIOWrapper(f, encoding="utf-8"))
    lower = {c.lower(): c for c in df.columns}
    df = df.rename(columns={
        lower["message id"]: "Message ID",
        lower["subject"]:   "Subject",
        lower["message"]:   "Message",
        lower["spam/ham"]:  "Spam/Ham",
        lower["date"]:      "Date",
    })
    # grader-aligned cleaning
    df["Spam/Ham"] = df["Spam/Ham"].astype(str).str.strip().str.lower()
    df = df[df["Spam/Ham"].isin(["ham","spam"])]
    df = df.dropna(subset=["Message"])
    df = df[df["Message"].astype(str).str.strip().ne("")]
    df["Message ID"] = pd.to_numeric(df["Message ID"], errors="raise")
    df = df.drop_duplicates(subset=["Message ID"], keep="first")
    return df[["Message ID","Spam/Ham"]]

truth = load_enron_with_ids()
label_map = truth.set_index("Message ID")["Spam/Ham"].to_dict()
preds["Label"] = preds["Message ID"].map(label_map)
preds = preds.dropna(subset=["Label"]).reset_index(drop=True)

y_true = (preds["Label"].str.lower() == "spam").astype(int).to_numpy()
p_spam = preds["spam"].astype(float).to_numpy()

# 3) Sweep thresholds 0.01..0.99
ths = np.round(np.linspace(0.01, 0.99, 99), 2)
P = max(1, int((y_true == 1).sum()))
N = max(1, int((y_true == 0).sum()))

rows = []
for t in ths:
    y_hat = (p_spam >= t).astype(int)
    tp = int(((y_true == 1) & (y_hat == 1)).sum())
    fp = int(((y_true == 0) & (y_hat == 1)).sum())
    tpr = tp / P
    fpr = fp / N
    rows.append({
        "threshold": float(t),
        "false_positive_rate": float(fpr),
        "true_positive_rate": float(tpr),
    })

roc_df = pd.DataFrame(rows)
print("Built roc_df:", roc_df.shape)
roc_df.head()



Built roc_df: (99, 3)


Unnamed: 0,threshold,false_positive_rate,true_positive_rate
0,0.01,0.0,0.0
1,0.02,0.0,0.0
2,0.03,0.0,0.0
3,0.04,0.0,0.0
4,0.05,0.0,0.0


Save this data in a file "roc.tsv" with columns threshold, false_positive_rate and true_positive rate.

In [23]:
# YOUR CHANGES HERE
assert 'roc_df' in globals(), "Run the ROC compute cell first."
roc_df.to_csv("roc.tsv", sep="\t", index=False)
print("Saved roc.tsv with columns: threshold, false_positive_rate, true_positive_rate")



Saved roc.tsv with columns: threshold, false_positive_rate, true_positive_rate


Submit "roc.tsv" in Gradescope.

## Part 7: Signup for Gemini API Key

Create a free Gemini API key at https://aistudio.google.com/app/api-keys.
You will need to do this with a personal Google account - it will not work with your BU Google account.
This will not incur any charges unless you configure billing information for the key.

You will be asked to start a Gemini free trial for week 11.
This will not incur any charges unless you exceed expected usage by an order of magnitude.


No submission needed.

## Part 8: Code

Please submit a Jupyter notebook that can reproduce all your calculations and recreate the previously submitted files.
You do not need to provide code for data collection if you did that by manually.

## Part 9: Acknowledgements

If you discussed this assignment with anyone, please acknowledge them here.
If you did this assignment completely on your own, simply write none below.

If you used any libraries not mentioned in this module's content, please list them with a brief explanation what you used them for. If you did not use any other libraries, simply write none below.

If you used any generative AI tools, please add links to your transcripts below, and any other information that you feel is necessary to comply with the generative AI policy. If you did not use any generative AI tools, simply write none below.

In [24]:
text = """Acknowledgements

I completed this assignment independently and did not discuss it with anyone.

I did not use any generative AI tools for this work.

I did not use any additional libraries beyond those provided or referenced in the course materials.
"""


with open("acknowledgments.txt", "w", encoding="utf-8") as f:
    f.write(text.strip() + "\n")


print("Wrote acknowledgments.txt ")


Wrote acknowledgments.txt 
