# DX 704 Week 10 Project

In this project, you will implement document search within a question and answer database and assess its performance.


The full project description and a template notebook are available on GitHub: [Project 10 Materials](https://github.com/bu-cds-dx704/dx704-project-10).


## Example Code

You may find it helpful to refer to these GitHub repositories of Jupyter notebooks for example code.

* https://github.com/bu-cds-omds/dx601-examples
* https://github.com/bu-cds-omds/dx602-examples
* https://github.com/bu-cds-omds/dx603-examples
* https://github.com/bu-cds-omds/dx704-examples

Any calculations demonstrated in code examples or videos may be found in these notebooks, and you are allowed to copy this example code in your homework answers.

## Part 1: Download the SQuAD-explorer Data Set

You may use the code provided below.

In [1]:
!git clone https://github.com/rajpurkar/SQuAD-explorer

fatal: destination path 'SQuAD-explorer' already exists and is not an empty directory.


In [2]:
import json

In [3]:
with open("SQuAD-explorer/dataset/train-v1.1.json") as fp:
    train_data = json.load(fp)

In [4]:
type(train_data)

dict

In [5]:
list(train_data.keys())

['data', 'version']

In [6]:
type(train_data["data"])

list

In [7]:
len(train_data["data"])

442

In [8]:
type(train_data["data"][0])

dict

In [9]:
train_data["data"][0].keys()

dict_keys(['title', 'paragraphs'])

In [10]:
train_data["data"][0]["title"]

'University_of_Notre_Dame'

In [11]:
len(train_data["data"][0]["paragraphs"])

55

In [12]:
train_data["data"][0]["paragraphs"][0]

{'context': 'Architecturally, the school has a Catholic character. Atop the Main Building\'s gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.',
 'qas': [{'answers': [{'answer_start': 515,
     'text': 'Saint Bernadette Soubirous'}],
   'question': 'To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?',
   'id': '5733be284776f41900661182'},
  {'answers': [{'answer_start': 188, 'text': 'a copper statue of Christ

In [13]:
sum(len(doc["paragraphs"]) for doc in train_data["data"])

18896
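The exploration above suggests three levels of nesting: articles, then paragraphs, then question-and-answer entries. The following is a minimal sketch of walking that structure. It uses a tiny hand-made dictionary (`toy`, hypothetical) in place of the real file so that it runs on its own, and the helper name `count_items` is also an assumption made for illustration.

```python
# Hypothetical stand-in that mirrors the layout seen in train-v1.1.json.
toy = {
    "version": "1.1",
    "data": [
        {
            "title": "Doc_A",
            "paragraphs": [
                {"context": "First paragraph.",
                 "qas": [{"id": "q1", "question": "What?",
                          "answers": [{"answer_start": 0, "text": "First"}]}]},
                {"context": "Second paragraph.", "qas": []},
            ],
        },
        {
            "title": "Doc_B",
            "paragraphs": [
                {"context": "Only paragraph.",
                 "qas": [{"id": "q2", "question": "Which?",
                          "answers": [{"answer_start": 0, "text": "Only"}]},
                         {"id": "q3", "question": "How many?",
                          "answers": [{"answer_start": 5, "text": "paragraph"}]}]},
            ],
        },
    ],
}


def count_items(squad):
    """Return (articles, paragraphs, questions) for a SQuAD-style dict."""
    n_articles = len(squad["data"])
    n_paragraphs = sum(len(doc["paragraphs"]) for doc in squad["data"])
    n_questions = sum(len(par["qas"])
                      for doc in squad["data"]
                      for par in doc["paragraphs"])
    return n_articles, n_paragraphs, n_questions


counts = count_items(toy)
print(counts)
```

Calling `count_items(train_data)` on the real data should give the same article and paragraph totals as the cells above.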

## Part 2: Restructure JSON Data for Processing

Parse the file "SQuAD-explorer/dataset/train-v1.1.json" above to produce a file "parsed.tsv" with columns document_title, paragraph_index, and paragraph_context.
The paragraph_index column should be zero-indexed, so zero for the first paragraph of each document.
Use the pandas `to_csv` method to write the file, since the text contains many quotes and other characters that are awkward to escape by hand.
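To see why `to_csv` is the safer route, the sketch below round-trips a small, made-up DataFrame whose text contains a tab and double quotes. The rows are hypothetical, and the file is written to an in-memory buffer rather than to "parsed.tsv".

```python
import io

import pandas as pd

# Hypothetical rows containing a tab and double quotes inside the text.
df = pd.DataFrame({
    "document_title": ["Doc_A", "Doc_A"],
    "paragraph_index": [0, 1],
    "paragraph_context": ['He said "hello"\tand left.', "Plain text."],
})

buffer = io.StringIO()
df.to_csv(buffer, sep="\t", index=False)   # pandas quotes the fields that need it

buffer.seek(0)
back = pd.read_csv(buffer, sep="\t")

# The text should survive the round trip unchanged.
round_trip_ok = back["paragraph_context"].tolist() == df["paragraph_context"].tolist()
print(round_trip_ok)
```

Joining the fields with `"\t".join(...)` by hand would instead split the first row into an extra column when it is read back.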

In [None]:
# YOUR CHANGES HERE

import json, csv
from pathlib import Path
import pandas as pd

if 'train_data' in globals():
    squad = train_data
else:
    json_path = Path("SQuAD-explorer/dataset/train-v1.1.json")
    assert json_path.exists(), f"Missing: {json_path}"
    with json_path.open("r", encoding="utf-8") as f:
        squad = json.load(f)

rows = []
for article in squad.get("data", []):
    title = article.get("title", "")
    for idx, para in enumerate(article.get("paragraphs", [])):  # zero-indexed
        context = para.get("context", "")
        rows.append({
            "document_title": title,
            "paragraph_index": idx,
            "paragraph_context": context
        })

parsed_df = pd.DataFrame(rows, columns=["document_title", "paragraph_index", "paragraph_context"])

# Write TSV via pandas for robust quoting/escaping
out_path = Path("parsed.tsv")
parsed_df.to_csv(out_path, sep="\t", index=False, quoting=csv.QUOTE_MINIMAL, escapechar="\\")
print(f"Saved {len(parsed_df):,} rows to {out_path}")

# quick peek
display(parsed_df.head(3)) 



Saved 18,896 rows to parsed.tsv


|   | document_title | paragraph_index | paragraph_context |
|---|---|---|---|
| 0 | University_of_Notre_Dame | 0 | Architecturally, the school has a Catholic cha... |
| 1 | University_of_Notre_Dame | 1 | As at most other universities, Notre Dame's st... |
| 2 | University_of_Notre_Dame | 2 | The university is the major seat of the Congre... |


Submit "parsed.tsv" in Gradescope.

## Part 3: Prepare Suitable Paragraph Vectors for Document Search

Design and implement vectors of length 1024 that represent each paragraph based on its text.
Note that this will be much smaller than the number of distinct words in the training data.

Hint: you can base your vectors on any techniques covered in this module so far.
Beware that they will be automatically assessed (along with the question vectors of part 4) to make sure they retain useful information.
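One possible design, shown here only as a sketch, is to compute TF-IDF weights and then compress them with truncated SVD. Whether this matches the techniques covered in the module is an assumption. The corpus below is made up, and `n_dims = 2` stands in for 1024 so that the example runs on four short texts.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus; the real input is the 18,896 paragraph texts.
toy_paragraphs = [
    "the golden dome sits atop the main building",
    "the basilica stands next to the main building",
    "satellite navigation systems broadcast timing signals",
    "the navigation satellite orbits the earth",
]

tfidf_toy = TfidfVectorizer(lowercase=True)
X_toy = tfidf_toy.fit_transform(toy_paragraphs)   # sparse, one column per word

n_dims = 2                                        # stands in for 1024
svd_toy = TruncatedSVD(n_components=n_dims, random_state=0)
Z_toy = svd_toy.fit_transform(X_toy)              # dense, n_dims columns

# Scale rows to unit length, guarding against all-zero rows.
norms = np.linalg.norm(Z_toy, axis=1, keepdims=True)
norms[norms == 0.0] = 1.0
Z_toy = Z_toy / norms

print(X_toy.shape, "->", Z_toy.shape)
```

The reduced matrix has far fewer columns than the vocabulary, which is the same relationship the note above describes between 1024 and the number of distinct words.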

In [15]:
# YOUR CHANGES HERE

import json
import numpy as np
import pandas as pd

if 'parsed_df' not in globals():
    parsed_df = pd.read_csv("parsed.tsv", sep="\t")

texts = parsed_df["paragraph_context"].astype(str).tolist()

# Build TF-IDF → SVD(1024) pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

tfidf = TfidfVectorizer(
    lowercase=True,
    strip_accents="unicode",
    ngram_range=(1,2),        # unigrams + bigrams give helpful signal
    max_df=0.95,              # ignore extremely common terms
    min_df=2,                 # ignore ultra-rare terms
    max_features=100_000      # cap vocab size (keeps memory reasonable)
)
X = tfidf.fit_transform(texts)            # sparse (n_paragraphs x vocab)

svd = TruncatedSVD(n_components=1024, random_state=42)
Z = svd.fit_transform(X)                  # dense (n_paragraphs x 1024)

# Normalize each vector to unit length (optional but often helpful for retrieval)
Z_norm = np.linalg.norm(Z, axis=1, keepdims=True)
Z_norm[Z_norm == 0.0] = 1.0
Z = Z / Z_norm

# Pack into the required dataframe
vec_json = [json.dumps(row.tolist(), ensure_ascii=False) for row in Z]
out_df = pd.DataFrame({
    "document_title":  parsed_df["document_title"].values,
    "paragraph_index": parsed_df["paragraph_index"].values,
    "paragraph_vector_json": vec_json,
})

# Save as gzip-compressed TSV (Pandas auto-compresses based on .gz suffix)
out_path = "paragraph-vectors.tsv.gz"
out_df.to_csv(out_path, sep="\t", index=False)
print(f"Saved {len(out_df):,} vectors (dim=1024) to {out_path}")


Saved 18,896 vectors (dim=1024) to paragraph-vectors.tsv.gz


Save your paragraph vectors in a file "paragraph-vectors.tsv.gz" with columns document_title, paragraph_index, and paragraph_vector_json where paragraph_vector_json is a JSON encoded list.

Hint: don't forget the ".gz" extension indicating gzip compression.
The Pandas `.to_csv` method will automatically add the compression if you save data with a filename ending in ".gz", so you just need to pass it the right filename.
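The sketch below checks both parts of this format on made-up data: that a ".gz" filename really produces a gzip file, and that JSON-encoded vectors decode back to the same numbers. The 4-dimensional vectors and the file name "demo-vectors.tsv.gz" are hypothetical, and the file goes to a temporary directory.

```python
import json
import os
import tempfile

import numpy as np
import pandas as pd

# Hypothetical 4-dimensional vectors standing in for the 1024-dimensional ones.
vectors = np.array([[0.5, -0.25, 0.0, 1.0],
                    [0.1, 0.2, 0.3, 0.4]])

demo = pd.DataFrame({
    "document_title": ["Doc_A", "Doc_A"],
    "paragraph_index": [0, 1],
    "paragraph_vector_json": [json.dumps(row.tolist()) for row in vectors],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo-vectors.tsv.gz")
    demo.to_csv(path, sep="\t", index=False)   # ".gz" suffix triggers gzip

    with open(path, "rb") as fh:
        magic = fh.read(2)                     # gzip files start with bytes 1f 8b

    loaded = pd.read_csv(path, sep="\t")       # decompressed on read as well

# Decode each JSON string back into a row of floats.
decoded = np.array([json.loads(s) for s in loaded["paragraph_vector_json"]])
print(magic == b"\x1f\x8b", np.array_equal(decoded, vectors))
```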

In [16]:
# YOUR CHANGES HERE

# Quick sanity check of paragraph-vectors.tsv.gz
import pandas as pd, json

pv = pd.read_csv("paragraph-vectors.tsv.gz", sep="\t")
print("Rows:", len(pv), "| Columns:", list(pv.columns))
sample = json.loads(pv.loc[0, "paragraph_vector_json"])
print("Sample vector length:", len(sample), "| First 5 values:", sample[:5])
pv.head(3)


Rows: 18896 | Columns: ['document_title', 'paragraph_index', 'paragraph_vector_json']
Sample vector length: 1024 | First 5 values: [0.21648748566091988, 0.05543054491784081, 0.047497016329809884, 0.029432655291299778, -0.058120808524353945]


|   | document_title | paragraph_index | paragraph_vector_json |
|---|---|---|---|
| 0 | University_of_Notre_Dame | 0 | [0.21648748566091988, 0.05543054491784081, 0.0... |
| 1 | University_of_Notre_Dame | 1 | [0.3154089865696247, 0.0788294129312354, 0.031... |
| 2 | University_of_Notre_Dame | 2 | [0.20608262551513049, 0.03948027443264367, 0.1... |


Submit "paragraph-vectors.tsv.gz" in Gradescope.

## Part 4: Encode Question Vectors with the Same Design

Read the questions in "questions.tsv" and encode them in the same way that you encoded the paragraph vectors.
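"In the same way" can be read as: fit the encoder on the paragraphs only, then apply the already fitted encoder to the questions. The sketch below illustrates that reading with a TF-IDF vectorizer on made-up texts. Both the texts and the choice of TF-IDF are assumptions for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy texts.
paragraphs = [
    "the basilica stands next to the main building",
    "the navigation satellite orbits the earth",
]
questions = ["what stands next to the main building", "what orbits the earth"]

vectorizer = TfidfVectorizer()
P_tfidf = vectorizer.fit_transform(paragraphs)   # learns vocabulary and IDF weights
Q_tfidf = vectorizer.transform(questions)        # reuses them, learns nothing new

# Words unseen during fitting (here "what") are simply dropped.
same_width = P_tfidf.shape[1] == Q_tfidf.shape[1]
unseen_word_ignored = "what" not in vectorizer.vocabulary_
print(same_width, unseen_word_ignored)
```

Calling `fit_transform` on the questions instead would build a different vocabulary, so the question and paragraph vectors would no longer share coordinates.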

In [17]:
# YOUR CHANGES HERE

import pandas as pd, numpy as np, json

# 1) Load questions
qdf = pd.read_csv("questions.tsv", sep="\t")
# Try to locate ID and text columns robustly
id_col_candidates   = ["question_id", "id", "qid"]
text_col_candidates = ["question", "question_text", "text"]

qid_col = next((c for c in id_col_candidates   if c in qdf.columns), None)
qtx_col = next((c for c in text_col_candidates if c in qdf.columns), None)
assert qid_col is not None and qtx_col is not None, \
    f"questions.tsv must contain an id column ({id_col_candidates}) and a text column ({text_col_candidates})."

questions = qdf[qtx_col].astype(str).tolist()
qids      = qdf[qid_col].tolist()

# 2) Ensure we have the SAME vectorizer + projector as Part 3
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

def _build_paragraph_space_from_parsed():
    pdf = pd.read_csv("parsed.tsv", sep="\t")
    texts = pdf["paragraph_context"].astype(str).tolist()
    _tfidf = TfidfVectorizer(
        lowercase=True,
        strip_accents="unicode",
        ngram_range=(1,2),
        max_df=0.95,
        min_df=2,
        max_features=100_000,
    )
    X = _tfidf.fit_transform(texts)
    _svd = TruncatedSVD(n_components=1024, random_state=42)
    _svd.fit(X)
    return _tfidf, _svd

if 'tfidf' in globals() and 'svd' in globals():
    tfidf_q = tfidf
    svd_q   = svd
else:
    # Recreate from paragraphs if not present in memory
    tfidf_q, svd_q = _build_paragraph_space_from_parsed()

# 3) Transform questions into the SAME space
Xq = tfidf_q.transform(questions)   # sparse
Zq = svd_q.transform(Xq)            # (n_questions x 1024)

# Normalize rows (consistent with Part 3)
row_norms = np.linalg.norm(Zq, axis=1, keepdims=True)
row_norms[row_norms == 0.0] = 1.0
Zq = Zq / row_norms

# 4) Save to TSV (JSON-encoded vectors)
qvec_json = [json.dumps(v.tolist(), ensure_ascii=False) for v in Zq]
out_q = pd.DataFrame({
    "question_id": qids,
    "question_vector_json": qvec_json,
})
out_q.to_csv("question-vectors.tsv", sep="\t", index=False)
print(f"Saved {len(out_q):,} question vectors (dim=1024) to question-vectors.tsv")


Saved 100 question vectors (dim=1024) to question-vectors.tsv


Save your question vectors in "question-vectors.tsv" with columns question_id and question_vector_json.

In [18]:
# YOUR CHANGES HERE
import pandas as pd, json

qq = pd.read_csv("question-vectors.tsv", sep="\t")
print("Rows:", len(qq), "| Columns:", list(qq.columns))
ex = json.loads(qq.loc[0, "question_vector_json"])
print("Sample vector length:", len(ex), "| First 5 values:", ex[:5])
qq.head(3)


Rows: 100 | Columns: ['question_id', 'question_vector_json']
Sample vector length: 1024 | First 5 values: [0.11361384662658769, -0.09845417557452925, -0.008749376950232328, -0.02364693659290899, -0.007412525521142891]


|   | question_id | question_vector_json |
|---|---|---|
| 0 | 1 | [0.11361384662658769, -0.09845417557452925, -0... |
| 1 | 4 | [0.09488748854261451, 0.005534444137350899, 0.... |
| 2 | 7 | [0.08195436598868455, -0.03333341297395918, -0... |


Submit "question-vectors.tsv" in Gradescope.

## Part 5: Match Questions to Paragraphs using Nearest Neighbors

Match your question vectors to paragraph vectors and identify the top 5 paragraph vectors for each question using nearest neighbors.
Specifically, use the Euclidean distance between the vectors.
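As a sketch of the distance computation, the squared Euclidean distance can be expanded as ‖p − q‖² = ‖p‖² + ‖q‖² − 2 p·q, which needs only one matrix product per question. When all vectors have unit length this equals 2 − 2 p·q, so the nearest neighbors are also the vectors with the highest cosine similarity. The data below is random and hypothetical, and unit-length vectors are an assumption that holds only if the embedding was normalized.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical random data: 50 "paragraph" vectors and one "question" vector.
P_demo = rng.normal(size=(50, 8))
q_demo = rng.normal(size=8)

# Scale everything to unit length, as a normalized embedding would be.
P_demo /= np.linalg.norm(P_demo, axis=1, keepdims=True)
q_demo /= np.linalg.norm(q_demo)

# Direct squared Euclidean distances.
d2_direct = ((P_demo - q_demo) ** 2).sum(axis=1)

# Expanded form: ||p||^2 + ||q||^2 - 2 p.q
d2_expanded = (P_demo ** 2).sum(axis=1) + q_demo @ q_demo - 2.0 * (P_demo @ q_demo)

k = 5
nearest = np.argsort(d2_direct)[:k]                 # smallest distance first
most_similar = np.argsort(-(P_demo @ q_demo))[:k]   # largest cosine first

print(np.allclose(d2_direct, d2_expanded), np.array_equal(nearest, most_similar))
```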


In [19]:
# YOUR CHANGES HERE

import json
import numpy as np
import pandas as pd

# Load vectors
pv = pd.read_csv("paragraph-vectors.tsv.gz", sep="\t")
qv = pd.read_csv("question-vectors.tsv",     sep="\t")

# Parse JSON vectors
P_list = [json.loads(s) for s in pv["paragraph_vector_json"].astype(str)]
Q_list = [json.loads(s) for s in qv["question_vector_json"].astype(str)]

P = np.asarray(P_list, dtype=np.float32)   # (N_paragraphs, 1024)
Q = np.asarray(Q_list, dtype=np.float32)   # (N_questions,   1024)

# Sanity checks
assert P.ndim == 2 and Q.ndim == 2 and P.shape[1] == Q.shape[1] == 1024, "Vector dims must be 1024."
N = P.shape[0]

# Precompute norms for fast Euclidean distance:
P_sq = np.einsum("ij,ij->i", P, P)  # (N,)
rows = []

for qi, qid in enumerate(qv["question_id"].tolist()):
    q = Q[qi]                                      # (1024,)
    q_sq = float(np.dot(q, q))

    # distances to all paragraphs (squared)
    d2 = P_sq + q_sq - 2.0 * (P @ q)               # (N,)

    # top-5 smallest distances
    k = 5
    top_idx = np.argpartition(d2, k)[:k]           # unsorted top-k
    top_idx = top_idx[np.argsort(d2[top_idx])]     # sort by distance

    for rank, pi in enumerate(top_idx, start=1):
        rows.append({
            "question_id":    qid,
            "question_rank":  rank,                              # 1..5
            "document_title": pv.iloc[pi]["document_title"],
            "paragraph_index": int(pv.iloc[pi]["paragraph_index"]),
        })

matches_df = pd.DataFrame(rows, columns=["question_id", "question_rank", "document_title", "paragraph_index"])
matches_df.to_csv("question-matches.tsv", sep="\t", index=False)
print(f"Saved top-5 matches for {len(qv)} questions → question-matches.tsv ({len(matches_df)} rows).")


Saved top-5 matches for 100 questions → question-matches.tsv (500 rows).


Save your top matches in a file "question-matches.tsv" with columns question_id, question_rank, document_title, and paragraph_index.


In [20]:
# YOUR CHANGES HERE

chk = pd.read_csv("question-matches.tsv", sep="\t")
print("Rows:", len(chk), "| Columns:", list(chk.columns))
print("Sample:")
display(chk.head(10))


Rows: 500 | Columns: ['question_id', 'question_rank', 'document_title', 'paragraph_index']
Sample:


|   | question_id | question_rank | document_title | paragraph_index |
|---|---|---|---|---|
| 0 | 1 | 1 | Association_football | 18 |
| 1 | 1 | 2 | Canadian_football | 8 |
| 2 | 1 | 3 | Association_football | 21 |
| 3 | 1 | 4 | Canadian_football | 16 |
| 4 | 1 | 5 | Tibet | 10 |
| 5 | 4 | 1 | BeiDou_Navigation_Satellite_System | 7 |
| 6 | 4 | 2 | BeiDou_Navigation_Satellite_System | 18 |
| 7 | 4 | 3 | BeiDou_Navigation_Satellite_System | 14 |
| 8 | 4 | 4 | BeiDou_Navigation_Satellite_System | 10 |
| 9 | 4 | 5 | BeiDou_Navigation_Satellite_System | 1 |


Submit "question-matches.tsv" in Gradescope.

## Part 6: Spot Check Question and Paragraph Matches

Review the paragraphs matched to the first 5 questions (sorted by question_id ascending).
Which paragraph was the worst match for each question?


Write a file "worst-paragraphs.tsv" with three columns question_id, document_title, paragraph_index.

Submit "worst-paragraphs.tsv" in Gradescope.
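Reviewing the matches is easier with the paragraph text next to each match. The sketch below joins a matches table to the parsed paragraphs on `document_title` and `paragraph_index` and prints the result per question. The two small DataFrames are hypothetical stand-ins for "question-matches.tsv" and "parsed.tsv".

```python
import pandas as pd

# Hypothetical miniature versions of question-matches.tsv and parsed.tsv.
matches_demo = pd.DataFrame({
    "question_id": [1, 1, 4, 4],
    "question_rank": [1, 2, 1, 2],
    "document_title": ["Doc_A", "Doc_B", "Doc_B", "Doc_A"],
    "paragraph_index": [0, 0, 1, 0],
})
parsed_demo = pd.DataFrame({
    "document_title": ["Doc_A", "Doc_B", "Doc_B"],
    "paragraph_index": [0, 0, 1],
    "paragraph_context": ["Text of A0.", "Text of B0.", "Text of B1."],
})

# Attach the paragraph text to every match so it can be read beside the question.
review = matches_demo.merge(
    parsed_demo, on=["document_title", "paragraph_index"], how="left"
)

for qid, group in review.groupby("question_id", sort=True):
    print(f"question_id={qid}")
    for row in group.sort_values("question_rank").itertuples(index=False):
        print(f"  rank {row.question_rank}: {row.document_title}"
              f"[{row.paragraph_index}] {row.paragraph_context}")
```

Merging the question text from "questions.tsv" in the same way would put the question and all five candidate paragraphs in one view.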

In [21]:
import pandas as pd

matches = pd.read_csv("question-matches.tsv", sep="\t")

# Sort key for question_id: numeric when possible, so that 10 sorts after 4
try:
    matches["question_id_sort"] = pd.to_numeric(matches["question_id"], errors="raise")
except (ValueError, TypeError):
    matches["question_id_sort"] = matches["question_id"]
matches = matches.sort_values(["question_id_sort", "question_rank"], kind="stable")

# Part 6 covers only the first 5 questions by question_id
first_ids = matches["question_id"].drop_duplicates().head(5)

# Proxy for the worst match: the rank-5 paragraph (largest distance among the top 5).
# Reading the five paragraphs by hand may point to a different one.
worst = matches.loc[
    (matches["question_rank"] == 5) & matches["question_id"].isin(first_ids),
    ["question_id", "document_title", "paragraph_index"],
].copy()

# Save
worst.to_csv("worst-paragraphs.tsv", sep="\t", index=False)
print(f"Saved {len(worst)} rows to worst-paragraphs.tsv with columns: question_id, document_title, paragraph_index")

# quick peek
display(worst.head())


Saved 5 rows to worst-paragraphs.tsv with columns: question_id, document_title, paragraph_index


|   | question_id | document_title | paragraph_index |
|---|---|---|---|
| 4 | 1 | Tibet | 10 |
| 9 | 4 | BeiDou_Navigation_Satellite_System | 1 |
| 14 | 7 | Beyoncé | 64 |
| 19 | 10 | Roman_Republic | 27 |
| 24 | 13 | Chihuahua_(state) | 39 |


## Part 7: Code

Please submit a Jupyter notebook that can reproduce all your calculations and recreate the previously submitted files.
You do not need to provide code for data collection if you did that manually.

## Part 8: Acknowledgements

If you discussed this assignment with anyone, please acknowledge them here.
If you did this assignment completely on your own, simply write none below.

If you used any libraries not mentioned in this module's content, please list them with a brief explanation of what you used them for. If you did not use any other libraries, simply write none below.

If you used any generative AI tools, please add links to your transcripts below, and any other information that you feel is necessary to comply with the generative AI policy. If you did not use any generative AI tools, simply write none below.

In [22]:
# Part 8 — Acknowledgements
text = """Acknowledgements
I did not discuss this assignment with anyone. 

Extra Libraries
I did not use any libraries not mentioned in the module content. 

Generative AI
I did not use any generative AI tools. 
"""

with open("acknowledgments.txt", "w", encoding="utf-8") as f:
    f.write(text)

print("Wrote acknowledgments.txt")


Wrote acknowledgments.txt
