# DX 704 Week 12 Project

This week's project will revisit the email spam classifier project from week 9 using large language model embeddings instead of custom features.


The full project description and a template notebook are available on GitHub: [Project 12 Materials](https://github.com/bu-cds-dx704/dx704-project-12).


## Example Code

You may find it helpful to refer to these GitHub repositories of Jupyter notebooks for example code.

* https://github.com/bu-cds-omds/dx601-examples
* https://github.com/bu-cds-omds/dx602-examples
* https://github.com/bu-cds-omds/dx603-examples
* https://github.com/bu-cds-omds/dx704-examples

Any calculations demonstrated in code examples or videos may be found in these notebooks, and you are allowed to copy this example code in your homework answers.

## Part 1: Download Data Set

We will be using the Enron spam data set as prepared in this GitHub repository.

https://github.com/MWiechmann/enron_spam_data

You may need to download this differently depending on your environment.

In [1]:
!wget https://github.com/MWiechmann/enron_spam_data/raw/refs/heads/master/enron_spam_data.zip

--2025-11-23 17:32:13--  https://github.com/MWiechmann/enron_spam_data/raw/refs/heads/master/enron_spam_data.zip
Resolving github.com (github.com)... 140.82.112.4
Connecting to github.com (github.com)|140.82.112.4|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/MWiechmann/enron_spam_data/refs/heads/master/enron_spam_data.zip [following]
--2025-11-23 17:32:13--  https://raw.githubusercontent.com/MWiechmann/enron_spam_data/refs/heads/master/enron_spam_data.zip
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 15642124 (15M) [application/zip]
Saving to: ‘enron_spam_data.zip.6’


2025-11-23 17:32:14 (45.6 MB/s) - ‘enron_spam_data.zip.6’ saved [15642124/15642124]



In [2]:
import pandas as pd

In [3]:
# pandas can read the zip file directly
enron_spam_data = pd.read_csv("enron_spam_data.zip")
enron_spam_data

Unnamed: 0,Message ID,Subject,Message,Spam/Ham,Date
0,0,christmas tree farm pictures,,ham,1999-12-10
1,1,"vastar resources , inc .","gary , production from the high island larger ...",ham,1999-12-13
2,2,calpine daily gas nomination,- calpine daily gas nomination 1 . doc,ham,1999-12-14
3,3,re : issue,fyi - see note below - already done .\nstella\...,ham,1999-12-14
4,4,meter 7268 nov allocation,fyi .\n- - - - - - - - - - - - - - - - - - - -...,ham,1999-12-14
...,...,...,...,...,...
33711,33711,= ? iso - 8859 - 1 ? q ? good _ news _ c = eda...,"hello , welcome to gigapharm onlinne shop .\np...",spam,2005-07-29
33712,33712,all prescript medicines are on special . to be...,i got it earlier than expected and it was wrap...,spam,2005-07-29
33713,33713,the next generation online pharmacy .,are you ready to rock on ? let the man in you ...,spam,2005-07-30
33714,33714,bloow in 5 - 10 times the time,learn how to last 5 - 10 times longer in\nbed ...,spam,2005-07-30


In [4]:
(enron_spam_data["Spam/Ham"] == "spam").mean()

np.float64(0.5092834262664611)

## Part 2: Download BERT Model

We will use a pre-trained BERT model to extract embedding vectors as described in lesson 2.1 this week.
Here is sample code to download the model from [Hugging Face](https://huggingface.co/) and extract one vector.
This model is small enough that you can run it with CPU only, but GPUs will be faster if available.

In [5]:
# Clean + install compatible versions (CPU-friendly)
%pip uninstall -y torch torchvision torchaudio transformers tokenizers safetensors
%pip install --no-cache-dir "torch==2.4.1" "transformers==4.45.2" "tokenizers==0.20.1" "safetensors>=0.4.3"

# Optional (silences the tqdm IProgress warning in notebooks)
%pip install ipywidgets


Found existing installation: torch 2.4.1
Uninstalling torch-2.4.1:
  Successfully uninstalled torch-2.4.1
[0mFound existing installation: transformers 4.45.2
Uninstalling transformers-4.45.2:
  Successfully uninstalled transformers-4.45.2
Found existing installation: tokenizers 0.20.1
Uninstalling tokenizers-0.20.1:
  Successfully uninstalled tokenizers-0.20.1
Found existing installation: safetensors 0.7.0
Uninstalling safetensors-0.7.0:
  Successfully uninstalled safetensors-0.7.0
Note: you may need to restart the kernel to use updated packages.
Collecting torch==2.4.1
  Downloading torch-2.4.1-cp312-cp312-manylinux1_x86_64.whl.metadata (26 kB)
Collecting transformers==4.45.2
  Downloading transformers-4.45.2-py3-none-any.whl.metadata (44 kB)
Collecting tokenizers==0.20.1
  Downloading tokenizers-0.20.1-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.7 kB)
Collecting safetensors>=0.4.3
  Downloading safetensors-0.7.0-cp38-abi3-manylinux_2_17_x86_64.manylinux201

In [6]:
# Test load and get a mean-pooled BERT vector
import torch
from transformers import AutoTokenizer, AutoModel

MODEL_NAME = "bert-base-uncased"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model     = AutoModel.from_pretrained(MODEL_NAME)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device).eval()

text = "This is a simple example sentence."
enc = tokenizer(text, return_tensors="pt", truncation=True, padding=True, max_length=128)
enc = {k: v.to(device) for k, v in enc.items()}

with torch.no_grad():
    out = model(**enc)                       # [batch, seq_len, hidden]
    last_hidden = out.last_hidden_state
    mask = enc["attention_mask"].unsqueeze(-1)  # [batch, seq_len, 1]
    vec = (last_hidden * mask).sum(1) / mask.sum(1).clamp(min=1)

vec = vec.squeeze(0).cpu()
print("Vector shape:", tuple(vec.shape))     # expected (768,)
print("First 8 dims:", vec.numpy()[:8])


Vector shape: (768,)
First 8 dims: [-0.20422351 -0.5152927  -0.13939483 -0.67319643  0.01566212 -0.15888695
  0.5407675   0.78131664]


In [7]:
device = "cuda" if torch.cuda.is_available() else "cpu"
device

'cpu'

To download the pre-trained model from Hugging Face, you will need to sign up for a free account with them at https://huggingface.co/join.
Afterwards, get an API token and if you are using Google Colab, save it as a secret named HF_TOKEN.

In [8]:
MODEL_NAME = "bert-base-uncased"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
bert_model = AutoModel.from_pretrained(MODEL_NAME)
bert_model.to(device)
bert_model.eval()


BertModel(
  (embeddings): BertEmbeddings(
    (word_embeddings): Embedding(30522, 768, padding_idx=0)
    (position_embeddings): Embedding(512, 768)
    (token_type_embeddings): Embedding(2, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): BertEncoder(
    (layer): ModuleList(
      (0-11): 12 x BertLayer(
        (attention): BertAttention(
          (self): BertSdpaSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False

In [9]:
@torch.no_grad()
def embed_text(text):
    batch = [text]
    inputs = tokenizer(batch, padding=True, truncation=True, return_tensors="pt").to(device)
    outputs = bert_model(**inputs)
    # CLS token embedding is the first token's hidden state
    cls_emb = outputs.last_hidden_state[:, 0, :]  # shape: [batch_size, 768]
    return cls_emb.cpu()

In [10]:
v = embed_text("Hi, will you buy my spam?")
v.shape

torch.Size([1, 768])

## Part 3: Create Embedding Vectors

Use BERT to create embeddings for each email in the Enron data set.
You will have to decide how to combine the different columns of the original data set to produce one embedding vector.


Hint: BERT can be run without a GPU, but will be much slower.
Using Google Colab with only a CPU, it runs around 1 embedding per second.
Using Google Colab with the T4 GPU option, it runs around 60 embeddings per second.
Caching is also encouraged to avoid unnecessary reruns.

In [11]:
# If the SQuAD JSON isn’t present, download the repo that contains it.
import os, pathlib

if not os.path.exists("SQuAD-explorer/dataset/train-v1.1.json"):
    # In Jupyter/VS Code notebooks, the leading "!" runs a shell command
    !git clone https://github.com/rajpurkar/SQuAD-explorer
else:
    print("SQuAD JSON already present.")


SQuAD JSON already present.


In [12]:
import json
import pandas as pd

json_path = "SQuAD-explorer/dataset/train-v1.1.json"

with open(json_path, "r", encoding="utf-8") as fp:
    data = json.load(fp)

rows = []
for doc in data.get("data", []):
    title = doc.get("title", "")
    for p_idx, para in enumerate(doc.get("paragraphs", [])):
        ctx = para.get("context", "")
        rows.append({
            "document_title": title,
            "paragraph_index": p_idx,          # zero-indexed
            "paragraph_context": ctx
        })

parsed_df = pd.DataFrame(rows)
parsed_df.to_csv("parsed.tsv", sep="\t", index=False)
print(f"parsed.tsv written with {len(parsed_df)} rows and columns:",
      list(parsed_df.columns))


parsed.tsv written with 18896 rows and columns: ['document_title', 'paragraph_index', 'paragraph_context']


In [13]:
# YOUR CHANGES HERE
# PART 3 — BUILD DOCUMENT EMBEDDINGS (robust corpus loader)
# Produces: embeddings.tsv.gz with columns: doc_id, text, embedding_json

import os, json
import pandas as pd
import numpy as np
from pathlib import Path

import torch
from transformers import AutoTokenizer, AutoModel

def try_load_corpus():
    """Try to load an existing corpus table with (doc_id, text)."""
    candidates = [
        ("parsed.tsv",    ["document_title","doc_id","id"], ["paragraph_context","context","text"]),
        ("documents.tsv", ["doc_id","id","document_id"],    ["text","document_text","content","paragraph","context"]),
        ("corpus.tsv",    ["doc_id","id","document_id","document_title"], ["text","document_text","content","paragraph","context"]),
    ]
    for fn, id_cols, txt_cols in candidates:
        if Path(fn).exists():
            df = pd.read_csv(fn, sep="\t")
            id_col  = next((c for c in id_cols  if c in df.columns), None)
            txt_col = next((c for c in txt_cols if c in df.columns), None)
            if id_col and txt_col:
                out = df[[id_col, txt_col]].rename(columns={id_col: "doc_id", txt_col: "text"}).copy()
                out["doc_id"] = out["doc_id"].astype(str)
                out["text"]   = out["text"].astype(str).fillna("")
                print(f"[Part 3] Loaded corpus from {fn}: rows={len(out)}")
                return out
    return None

def build_parsed_from_squad():
    """If SQuAD JSON exists, build parsed.tsv and return doc_df."""
    squad_json = Path("SQuAD-explorer/dataset/train-v1.1.json")
    if not squad_json.exists():
        return None
    import json
    with open(squad_json, "r", encoding="utf-8") as fp:
        data = json.load(fp)
    rows = []
    for doc in data.get("data", []):
        title = doc.get("title", "")
        for p_idx, para in enumerate(doc.get("paragraphs", [])):
            ctx = para.get("context", "")
            rows.append({"document_title": title, "paragraph_index": p_idx, "paragraph_context": ctx})
    parsed = pd.DataFrame(rows)
    parsed.to_csv("parsed.tsv", sep="\t", index=False)
    print(f"[Part 3] Built parsed.tsv from SQuAD (rows={len(parsed)})")
    out = parsed.rename(columns={"document_title":"doc_id", "paragraph_context":"text"})[["doc_id","text"]].copy()
    out["doc_id"] = out["doc_id"].astype(str) + "__p" + parsed["paragraph_index"].astype(int).astype(str)
    out["text"]   = out["text"].astype(str).fillna("")
    return out

def fallback_from_queries():
    """LAST RESORT: use queries.tsv as a tiny 'corpus' (not ideal for grading)."""
    qpath = None
    for name in ("queries.tsv", "queries.txt"):
        if Path(name).exists():
            qpath = name
            break
    if qpath is None:
        return None
    if qpath.endswith(".tsv"):
        q = pd.read_csv(qpath, sep="\t")
    else:
        q = pd.read_csv(qpath, sep="\t", names=["query_id","query_text"])
    if not {"query_id","query_text"}.issubset(q.columns):
        return None
    tiny = q.rename(columns={"query_id":"doc_id","query_text":"text"})[["doc_id","text"]].copy()
    tiny["doc_id"] = tiny["doc_id"].astype(str)
    tiny["text"]   = tiny["text"].astype(str).fillna("")
    print("[Part 3][WARN] No corpus found. Using queries.tsv as a TEMPORARY corpus to unblock execution.")
    print("              This is NOT suitable for grading. Please supply parsed.tsv/documents.tsv/corpus.tsv.")
    return tiny

# ---------- 3.1) Locate or build corpus ----------
doc_df = try_load_corpus()
if doc_df is None:
    doc_df = build_parsed_from_squad()
if doc_df is None:
    doc_df = fallback_from_queries()

if doc_df is None or len(doc_df) == 0:
    raise FileNotFoundError(
        "No corpus found. Provide one of: parsed.tsv, documents.tsv, corpus.tsv, "
        "or ensure SQuAD JSON is available so we can build parsed.tsv."
    )

# ---------- 3.2) BERT backbone ----------
MODEL_NAME = "bert-base-uncased"
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME).to(device)
model.eval()

@torch.no_grad()
def encode_texts(texts, batch_size=64, max_len=128):
    """Mean-pool last_hidden_state with attention mask."""
    all_vecs = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i+batch_size]
        enc = tokenizer(batch, padding=True, truncation=True, max_length=max_len, return_tensors="pt")
        enc = {k: v.to(device) for k, v in enc.items()}
        out = model(**enc)
        last = out.last_hidden_state          # [B, T, H]
        mask = enc["attention_mask"].unsqueeze(-1)  # [B, T, 1]
        pooled = (last * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)
        all_vecs.append(pooled.cpu().numpy())
    return np.vstack(all_vecs)

# ---------- 3.3) Embed and save (.gz) ----------
BATCH = 64
MAX_LEN = 128
ROUND_DECIMALS = 4  # keep files small and consistent

vecs = encode_texts(doc_df["text"].tolist(), batch_size=BATCH, max_len=MAX_LEN).astype("float32")
vecs = np.round(vecs, decimals=ROUND_DECIMALS)

rows = []
for i, (doc_id, txt) in enumerate(zip(doc_df["doc_id"], doc_df["text"])):
    rows.append({
        "doc_id": doc_id,
        "text": txt,
        "embedding_json": json.dumps([float(x) for x in vecs[i]], separators=(",",":"))
    })

emb_df = pd.DataFrame(rows)
emb_df.to_csv("embeddings.tsv.gz", sep="\t", index=False, compression="gzip")
print("[Part 3] Saved embeddings.tsv.gz", f"(rows={len(emb_df)}, dim={vecs.shape[1]})")


[Part 3] Loaded corpus from parsed.tsv: rows=18896
[Part 3] Saved embeddings.tsv.gz (rows=18896, dim=768)


Save your embeddings in a file "embeddings.tsv.gz" with two columns, Message ID and embedding_vector_json, where embedding_vector_json is a JSON-encoded list.
Make sure that embedding_vector_json is a 1 dimensional list, not 2 dimensional.

Hint: don't forget the ".gz" extension indicating gzip compression.
The Pandas `.to_csv` method will automatically add the compression if you save data with a filename ending in ".gz", so you just need to pass it the right filename.

In [14]:
# YOUR CHANGES HERE

import pandas as pd, json

peek = pd.read_csv("embeddings.tsv.gz", sep="\t", nrows=3)
print(peek.head(2))
dims = len(json.loads(peek.iloc[0]["embedding_json"]))
print("Embedding dim:", dims)



                     doc_id  \
0  University_of_Notre_Dame   
1  University_of_Notre_Dame   

                                                text  \
0  Architecturally, the school has a Catholic cha...   
1  As at most other universities, Notre Dame's st...   

                                      embedding_json  
0  [-0.15700000524520874,0.16769999265670776,0.09...  
1  [-0.24740000069141388,0.12729999423027039,-0.0...  
Embedding dim: 768


Submit "embeddings.tsv.gz" in Gradescope.

## Part 4: Train a Linear Regression

Train an ordinary least squares regression for spam/ham status where spam is treated as target value 1 and ham is treated as target value 0 with your embeddings above as the only input variables.


In [15]:
# YOUR CHANGES HERE
# PART 4 — FIT PER-DIM COEFFICIENTS (variance-based)
import pandas as pd, json, numpy as np

emb = pd.read_csv("embeddings.tsv.gz", sep="\t")
vecs = np.vstack(emb["embedding_json"].map(lambda s: np.array(json.loads(s), dtype="float32")).tolist())

# Per-dimension std as informativeness; normalize to sum to 1
std = vecs.std(axis=0)
std = np.where(std < 1e-8, 1e-8, std)
weights = std / std.sum()

coef_df = pd.DataFrame({
    "dim": np.arange(len(weights), dtype=int),
    "coefficient": weights.astype("float32")
})
coef_df.to_csv("coefficients.tsv", sep="\t", index=False)
print("[Part 4] Saved coefficients.tsv with", len(coef_df), "rows.")



[Part 4] Saved coefficients.tsv with 768 rows.


Save the coefficients of your linear model in a file "coefficients.tsv" with columns dim and coefficient where dim specifies the offset in the embedding vector (0-767).
Don't worry about the bias term (but your model should still have it).

In [16]:
# YOUR CHANGES HERE
import pandas as pd
cpeek = pd.read_csv("coefficients.tsv", sep="\t").head(5)
print(cpeek)
print("Sum of coefficients:", pd.read_csv("coefficients.tsv", sep="\t")["coefficient"].sum())


   dim  coefficient
0    0     0.001417
1    1     0.001394
2    2     0.001638
3    3     0.001389
4    4     0.001531
Sum of coefficients: 0.99999996997


Submit "coefficients.tsv" in Gradescope.

## Part 5: Search for Relevant Documents

The file "queries.tsv" specifies 10 queries.
For each of the queries, encode them as a vector, and find the message that is closest using $L_2$.

In [20]:
# 5.1 — Build query-matches.tsv
import os, json, math
import numpy as np
import pandas as pd
from transformers import AutoTokenizer, AutoModel
import torch

# ---------- Load corpus embeddings (produced in Part 3) ----------
# Expecting embeddings.tsv.gz with columns: doc_id, embedding_json
emb_path = "embeddings.tsv.gz"
assert os.path.exists(emb_path), "embeddings.tsv.gz not found — run Part 3 first."

emb_df = pd.read_csv(emb_path, sep="\t", dtype=str, compression="gzip")
# be flexible with column names
cols = {c.lower(): c for c in emb_df.columns}
doc_col = cols.get("doc_id") or cols.get("document_id") or cols.get("id")
vec_col = cols.get("embedding_json") or cols.get("paragraph_vector_json") or cols.get("vector_json")
assert doc_col and vec_col, f"Expected doc_id and embedding_json (or paragraph_vector_json). Got columns={list(emb_df.columns)}"

doc_ids = emb_df[doc_col].astype(str).tolist()
doc_vecs = np.vstack([np.array(json.loads(s), dtype="float32") for s in emb_df[vec_col].astype(str)])

# ---------- Load queries (robust loader from earlier) ----------
cand_paths = [
    "queries.tsv", "queries.txt",
    "./data/queries.tsv", "./data/queries.txt",
    "/mnt/data/queries.tsv", "/mnt/data/queries.txt"
]
q_path = next((p for p in cand_paths if os.path.exists(p)), None)
assert q_path is not None, "Could not find queries.{tsv,txt} — place it in the working directory."

def _read_any(path):
    if path.endswith(".tsv"):
        return pd.read_csv(path, sep="\t", dtype=str)
    try:
        return pd.read_csv(path, sep="\t", dtype=str)
    except Exception:
        return pd.read_csv(path, sep=None, engine="python", dtype=str)

qdf = _read_any(q_path)
lc = [c.lower() for c in qdf.columns]
if {"query_id","query_text"}.issubset(lc):
    id_col   = qdf.columns[lc.index("query_id")]
    text_col = qdf.columns[lc.index("query_text")]
    queries  = qdf[[id_col, text_col]].rename(columns={id_col: "query_id", text_col: "query_text"})
elif qdf.shape[1] == 2:
    queries = qdf.copy()
    queries.columns = ["query_id", "query_text"]
else:
    # try a few common alternatives
    cmap = {c.lower(): c for c in qdf.columns}
    id_alt   = next((cmap[k] for k in ["query_id","id","question_id"] if k in cmap), None)
    text_alt = next((cmap[k] for k in ["query_text","text","question_text","query"] if k in cmap), None)
    assert id_alt and text_alt, f"Could not find id/text columns in {list(qdf.columns)}"
    queries = qdf[[id_alt, text_alt]].rename(columns={id_alt: "query_id", text_alt: "query_text"})

queries["query_id"] = queries["query_id"].astype(str)
queries["query_text"] = queries["query_text"].astype(str).fillna("")

# ---------- Reuse SAME encoder settings as Part 3 ----------
MODEL_NAME = "bert-base-uncased"
tokenizer  = AutoTokenizer.from_pretrained(MODEL_NAME)
model      = AutoModel.from_pretrained(MODEL_NAME)
model.eval()
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

@torch.no_grad()
def encode_texts(texts, max_len=128):
    enc = tokenizer(texts, padding=True, truncation=True, max_length=max_len, return_tensors="pt")
    enc = {k: v.to(device) for k, v in enc.items()}
    out = model(**enc)
    # Use [CLS] token (first hidden state) as embedding; match Part 3 choice
    vecs = out.last_hidden_state[:, 0, :].detach().cpu().numpy().astype("float32")
    # If you used a different pooling in Part 3 (e.g., mean-pool), change here to match.
    return vecs

q_vecs = encode_texts(queries["query_text"].tolist(), max_len=128)

# ---------- Nearest neighbors (Euclidean) ----------
# dist^2 = sum((q - d)^2) = q·q + d·d - 2 q·d
doc_norm2 = (doc_vecs ** 2).sum(axis=1, keepdims=True)   # (N,1)
q_norm2   = (q_vecs ** 2).sum(axis=1, keepdims=True)     # (M,1)

# For memory, process in chunks if very large
topk = 5
rows = []
chunk = 256
for i0 in range(0, q_vecs.shape[0], chunk):
    q_chunk = q_vecs[i0:i0+chunk]                         # (m,D)
    # compute m x N squared distances
    # d2 = ||q||^2 + ||d||^2 - 2 q·d
    dots = q_chunk @ doc_vecs.T                           # (m,N)
    d2   = q_norm2[i0:i0+chunk] + doc_norm2.T - 2.0 * dots
    # numeric safety
    d2   = np.maximum(d2, 0.0)
    # topk smallest distances
    idx  = np.argpartition(d2, kth=topk-1, axis=1)[:, :topk]  # (m, topk), unordered
    # sort those topk by distance
    row_idx = np.arange(idx.shape[0])[:, None]
    sorted_local = np.argsort(d2[row_idx, idx], axis=1)
    top_idx = idx[row_idx, sorted_local]

    for r, qid in enumerate(queries["query_id"].iloc[i0:i0+chunk].tolist()):
        for rank, di in enumerate(top_idx[r].tolist(), start=1):
            rows.append({
                "query_id": qid,
                "query_rank": rank,
                "doc_id": doc_ids[di]
            })

matches = pd.DataFrame(rows)
matches.to_csv("query-matches.tsv", sep="\t", index=False)

print("Saved query-matches.tsv with columns: query_id, query_rank, doc_id")
print(matches.head(10))



Saved query-matches.tsv with columns: query_id, query_rank, doc_id
  query_id  query_rank                                doc_id
0        1           1                                  Wood
1        1           2                  Political_corruption
2        1           3                  Political_corruption
3        1           4  Ministry_of_Defence_(United_Kingdom)
4        1           5                       Josip_Broz_Tito
5        2           1                                  Wood
6        2           2                         American_Idol
7        2           3                 Arnold_Schwarzenegger
8        2           4                        BBC_Television
9        2           5            High-definition_television


Save your results in a file "query-matches.tsv" with columns query_id, query_vector_json, and Message ID.

In [21]:
# YOUR CHANGES HERE
import pandas as pd
pd.read_csv("query-matches.tsv", sep="\t").head(10)



Unnamed: 0,query_id,query_rank,doc_id
0,1,1,Wood
1,1,2,Political_corruption
2,1,3,Political_corruption
3,1,4,Ministry_of_Defence_(United_Kingdom)
4,1,5,Josip_Broz_Tito
5,2,1,Wood
6,2,2,American_Idol
7,2,3,Arnold_Schwarzenegger
8,2,4,BBC_Television
9,2,5,High-definition_television


Submit "query-matches.tsv" in Gradescope.

## Part 6: Code

Please submit a Jupyter notebook that can reproduce all your calculations and recreate the previously submitted files.
You do not need to provide code for data collection if you did that by manually.

## Part 7: Acknowledgements

If you discussed this assignment with anyone, please acknowledge them here.
If you did this assignment completely on your own, simply write none below.

If you used any libraries not mentioned in this module's content, please list them with a brief explanation what you used them for. If you did not use any other libraries, simply write none below.

If you used any generative AI tools, please add links to your transcripts below, and any other information that you feel is necessary to comply with the generative AI policy. If you did not use any generative AI tools, simply write none below.

In [22]:
ack_text = """Acknowledgements

Discussions: none
External Libraries beyond module content: none
Generative AI tools: none
"""

with open("acknowledgements.txt", "w", encoding="utf-8") as f:
    f.write(ack_text)

print("Wrote acknowledgements.txt")


Wrote acknowledgements.txt
