# Lab Exercise: Optimizing Embedding Search Systems

**Duration:** 2 hours  
**Model:** `nomic-ai/nomic-embed-text-v1.5` (768 dimensions, trained with Matryoshka)

---

## Dataset: BEIR SciFact

SciFact contains scientific claims paired with research paper abstracts for fact-checking. Queries are short claims like "Vitamin D deficiency causes depression", while documents are formal research abstracts. That mismatch makes the dataset a good test bed for HyDE, which is meant to bridge the vocabulary gap between queries and documents. You'll use 1000 documents and 50 queries with their relevance judgments (qrels).

### Setup Code
```python
!pip install beir

from beir import util
from beir.datasets.data_loader import GenericDataLoader

# Download SciFact
data_path = "scifact"
url = "https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/{}.zip".format(data_path)
out_dir = "./datasets"
data_path = util.download_and_unzip(url, out_dir)

# Load data
corpus, queries, qrels = GenericDataLoader(data_path).load(split="test")

# Extract subset
corpus_ids = list(corpus.keys())[:1000]
corpus_texts = [corpus[doc_id]['title'] + " " + corpus[doc_id]['text'] for doc_id in corpus_ids]

# Use queries that have at least one relevant document in our subset
corpus_id_set = set(corpus_ids)
query_ids = [
    qid for qid in queries
    if any(doc_id in corpus_id_set for doc_id in qrels.get(qid, {}))
][:50]
query_texts = {qid: queries[qid] for qid in query_ids}

print(f"Loaded {len(corpus_texts)} documents and {len(query_texts)} queries")
print("Relevance judgments available for evaluation")
```

---

## Part 1: Baseline Implementation (40 minutes)

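Load the `nomic-ai/nomic-embed-text-v1.5` model and encode all 1000 documents as float32 at the full 768 dimensions. Build an HNSW index with M=16, ef_construction=200, and cosine distance using the hnswlib library. (The code cells at the end of this document use FAISS's `IndexHNSWFlat` with inner product over L2-normalized vectors instead, which ranks results the same way as cosine similarity.)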

Implement a simple HyDE function that converts queries into hypothetical documents (e.g., "Scientific research shows that [query]. Studies have found evidence..."). For each test query, generate embeddings with HyDE, search the index for top-10 results, and collect the retrieved document IDs.
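A minimal sketch of the template-based HyDE step described above. The function name `make_hyde_text` and the exact template wording are illustrative assumptions; a full HyDE setup would typically have a language model write the hypothetical abstract rather than fill in a fixed template.

```python
def make_hyde_text(query: str) -> str:
    """Wrap a claim in abstract-like phrasing (template stand-in for HyDE)."""
    # Drop surrounding whitespace and a trailing period so the template
    # does not end up with a doubled full stop.
    claim = query.strip().rstrip(".")
    return (
        f"Scientific research shows that {claim}. "
        "Studies have found evidence supporting this finding."
    )


example = make_hyde_text("Vitamin D deficiency causes depression")
print(example)
```

The returned string is what you would pass to `model.encode` in place of the raw query.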

Evaluate your results using the qrels (relevance judgments). Calculate NDCG@10, Recall@10, and MAP (Mean Average Precision) by comparing your retrieved documents against the ground truth relevant documents for each query. Also measure average search latency and index storage size in MB. These are your baseline numbers.
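One possible way to compute Recall@10, NDCG@10, and average latency is sketched below. The helper names (`recall_at_k`, `ndcg_at_k`, `mean_latency_ms`) are assumptions made for illustration, and NDCG uses linear gain here; with binary judgments, linear and exponential gain give the same result.

```python
import math
import time


def recall_at_k(retrieved, relevant, k=10):
    """Fraction of the relevant documents that appear in the top-k results."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / len(relevant)


def ndcg_at_k(retrieved, relevance, k=10):
    """NDCG@k, where `relevance` maps doc_id -> graded relevance (0 if absent)."""
    dcg = sum(
        relevance.get(doc_id, 0) / math.log2(rank + 1)
        for rank, doc_id in enumerate(retrieved[:k], start=1)
    )
    # Ideal ranking: all positive gains sorted from highest to lowest
    ideal_gains = sorted((g for g in relevance.values() if g > 0), reverse=True)[:k]
    idcg = sum(g / math.log2(rank + 1) for rank, g in enumerate(ideal_gains, start=1))
    return dcg / idcg if idcg > 0 else 0.0


def mean_latency_ms(search_fn, items):
    """Average wall-clock time of search_fn(item), in milliseconds."""
    start = time.perf_counter()
    for item in items:
        search_fn(item)
    return 1000 * (time.perf_counter() - start) / max(len(items), 1)


# Toy check with hand-made IDs (not SciFact data)
toy_relevance = {"d1": 1, "d3": 1}
toy_retrieved = ["d1", "d2", "d3"]
print("Recall@10:", recall_at_k(toy_retrieved, set(toy_relevance)))
print("NDCG@10:", ndcg_at_k(toy_retrieved, toy_relevance))
```

With BEIR data, `relevance` would be `qrels[qid]` (a dict of document ID to score) and `retrieved` the list of document IDs returned by the index for that query.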

---

## Part 2: Float16 Precision Reduction (10 minutes)

Convert your baseline float32 embeddings to float16 using `.astype(np.float16)`. Calculate and report the storage reduction percentage.

Rebuild the index from the float16 embeddings, rerun the evaluation, and compare the new metrics to your baseline to quantify any quality degradation. Is a 50% storage reduction worth the quality loss you measured?
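The storage arithmetic can be checked on the raw arrays before touching the index. The sketch below uses random unit vectors as a stand-in for the real 1000 x 768 document embeddings (an assumption made so that it runs without the model) and reports the size reduction together with the worst-case rounding error introduced by the cast.

```python
import numpy as np

rng = np.random.default_rng(0)

# Random unit vectors standing in for the real document embeddings
emb_f32 = rng.standard_normal((1000, 768)).astype(np.float32)
emb_f32 /= np.linalg.norm(emb_f32, axis=1, keepdims=True)

# Same conversion as in the exercise
emb_f16 = emb_f32.astype(np.float16)

# float16 uses 2 bytes per value instead of 4
reduction_pct = 100 * (1 - emb_f16.nbytes / emb_f32.nbytes)

# Largest per-component difference after rounding to float16
max_abs_err = float(np.max(np.abs(emb_f32 - emb_f16.astype(np.float32))))

print(f"float32: {emb_f32.nbytes / 1024**2:.2f} MB")
print(f"float16: {emb_f16.nbytes / 1024**2:.2f} MB")
print(f"storage reduction: {reduction_pct:.1f}%")
print(f"max absolute rounding error: {max_abs_err:.2e}")
```

Note that this measures the embedding arrays only; an HNSW index also stores graph links, so the on-disk index shrinks by less than the vectors themselves.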

---


## Tips


Store results in a list of dictionaries or a pandas DataFrame as you test each configuration so you can easily build your final comparison table. Focus on understanding the tradeoffs rather than on a perfect implementation: the goal is to gain intuition about these optimization techniques.
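A small sketch of the list-of-dictionaries approach, using only the standard library. The helper names `record` and `format_table` and the metric keys are hypothetical, and the `None` values are placeholders to be replaced with your own measurements.

```python
def record(results, config, **metrics):
    """Append one configuration's metrics to the running results list."""
    results.append({"config": config, **metrics})


def format_table(results):
    """Render the results as a plain-text table, one row per configuration."""
    columns = list(results[0].keys())
    lines = [" | ".join(columns)]
    for row in results:
        lines.append(" | ".join(str(row.get(col, "")) for col in columns))
    return "\n".join(lines)


results = []
# Placeholder values only; fill these in from your own runs
record(results, "f32_768", map=None, ndcg_at_10=None, size_mb=None)
record(results, "f16_768", map=None, ndcg_at_10=None, size_mb=None)
print(format_table(results))
```

The same list can be passed directly to `pandas.DataFrame(results)` if you prefer a DataFrame.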

In [None]:
import os
import numpy as np
import faiss
from beir import util
from beir.datasets.data_loader import GenericDataLoader
from sentence_transformers import SentenceTransformer

In [None]:
data_path = "scifact"
url = "https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/{}.zip".format(data_path)
out_dir = "./datasets"
data_path = util.download_and_unzip(url, out_dir)

# load data
corpus, queries, qrels = GenericDataLoader(data_path).load(split="test")

# extract subset
corpus_ids = list(corpus.keys())[:1000]
corpus_texts = [corpus[doc_id]['title'] + " " + corpus[doc_id]['text'] for doc_id in corpus_ids]

# use queries that have at least one relevant document in our subset
corpus_id_set = set(corpus_ids)
query_ids = [
    qid for qid in queries
    if any(doc_id in corpus_id_set for doc_id in qrels.get(qid, {}))
][:50]
query_texts = {qid: queries[qid] for qid in query_ids}

print(f"Loaded {len(corpus_texts)} documents and {len(query_texts)} queries")
print("Relevance judgments available for evaluation")

In [None]:
model_name = "nomic-ai/nomic-embed-text-v1.5"
model = SentenceTransformer(model_name, trust_remote_code=True)

doc_emb = model.encode(corpus_texts, convert_to_numpy=True, show_progress_bar=True)
doc_emb = doc_emb.astype(np.float32)

norms = np.linalg.norm(doc_emb, axis=1, keepdims=True)
norms[norms == 0] = 1.0
doc_emb = doc_emb / norms

In [None]:
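def average_precision(rels):
    """Average precision over a ranked list of binary relevance flags.

    Normalizes by the number of relevant documents found in the list,
    not by the total number of relevant documents for the query.
    """
    hits = 0
    precision_sum = 0.0
    for rank, rel in enumerate(rels, start=1):
        if rel == 1:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    if hits == 0:
        return 0.0
    return precision_sum / hits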

In [None]:
dim = doc_emb.shape[1]
M = 16

index = faiss.IndexHNSWFlat(dim, M, faiss.METRIC_INNER_PRODUCT)
index.hnsw.efConstruction = 200
index.add(doc_emb)
index.hnsw.efSearch = 50

os.makedirs("indexes", exist_ok=True)
faiss.write_index(index, "indexes/base_f32_768.index")
base_size_mb = os.path.getsize("indexes/base_f32_768.index") / (1024 * 1024)
print("index size MB:", base_size_mb)


In [None]:
avg = []

for qid, qtext in query_texts.items():
    hyde_text = "Scientific research describes the following claim. " + qtext + ". Studies report observations and evidence."

    q_vec = model.encode([hyde_text], convert_to_numpy=True).astype(np.float32)
    q_vec = q_vec / np.linalg.norm(q_vec, axis=1, keepdims=True) # normalize

    scores, idxs = index.search(q_vec, 10)
    idxs = idxs[0].tolist()

    search_doc_ids = [corpus_ids[i] for i in idxs] # documents retrieved via HyDE
    relevant_docs = set(qrels[qid]) # ground truth docs

    rels = [1 if d in relevant_docs else 0 for d in search_doc_ids]

    avg_val = average_precision(rels)
    avg.append(avg_val)

mean_avg = float(np.mean(avg))
print("MAP (HyDE, F32 768d):", mean_avg)

In [None]:
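doc_emb_f16 = doc_emb.astype(np.float16)

# FAISS indexes operate on float32 input, so cast the rounded values back up.
# This index captures the quality impact of float16 rounding only;
# IndexHNSWFlat still stores float32 vectors internally.
index_f16 = faiss.IndexHNSWFlat(dim, M, faiss.METRIC_INNER_PRODUCT)
index_f16.hnsw.efConstruction = 200
index_f16.add(doc_emb_f16.astype(np.float32))
index_f16.hnsw.efSearch = 50

# The file on disk is therefore about the same size as the float32 baseline
os.makedirs("indexes", exist_ok=True)
faiss.write_index(index_f16, "indexes/base_f16_768.index")

# Report storage from the raw embedding arrays, where the float16 saving shows up
f32_emb_mb = doc_emb.nbytes / (1024 * 1024)
f16_emb_mb = doc_emb_f16.nbytes / (1024 * 1024)
print("float32 embeddings MB:", f32_emb_mb)
print("float16 embeddings MB:", f16_emb_mb)
print("Storage reduction percent:", 100 * (1 - f16_emb_mb / f32_emb_mb))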

In [None]:
avg_f16 = []

for qid, qtext in query_texts.items():
    hyde_text = "Scientific research describes the following claim. " + qtext + ". Studies report observations and evidence."

    q_vec = model.encode([hyde_text], convert_to_numpy=True).astype(np.float32)
    q_vec = q_vec / np.linalg.norm(q_vec, axis=1, keepdims=True) # normalize

    scores, idxs = index_f16.search(q_vec, 10) # search the float16-derived index, not the baseline
    idxs = idxs[0].tolist()

    search_doc_ids = [corpus_ids[i] for i in idxs] # documents retrieved via HyDE
    relevant_docs = set(qrels[qid]) # ground truth docs

    rels = [1 if d in relevant_docs else 0 for d in search_doc_ids]

    avg_val = average_precision(rels)
    avg_f16.append(avg_val)

mean_avg_f16 = float(np.mean(avg_f16))
print("MAP (HyDE, F16 768d):", mean_avg_f16)