# Exercise 4.2 — Semantic Search with Transformer Embeddings (Advanced)

This notebook builds a practical semantic search engine:
- Create a passage index of dense embeddings using a pre-trained Transformer.
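- Run fast nearest-neighbour search with FAISS for top-k candidates (an exact `IndexFlatIP` index here; approximate ANN indices are an optional extension).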
- Evaluate retrieval quality on a realistic dataset with Recall@k and MRR.
- Optionally rerank top candidates with a cross-encoder for higher precision.

### 1) Environment Setup

In this section, we install and import the libraries needed to:
- load and preprocess a text dataset,
- generate embeddings using Transformer models,
- perform high-speed similarity search with FAISS,
- and evaluate retrieval quality using standard metrics.

#### Libraries
1. sentence-transformers: high-level API built on top of Hugging Face Transformers
   for creating sentence and document embeddings.
2. datasets: Hugging Face's dataset library for easy access to public NLP datasets.
3. faiss-cpu: efficient vector similarity search library, covering both exact and approximate nearest-neighbour (ANN) indices.
4. scikit-learn: provides ranking metrics such as `ndcg_score` and additional evaluation tools.
5. torch & numpy: core numerical libraries for tensors and arrays.
6. pandas: dataframe manipulation for analysis and previewing.

In [1]:
!pip install -q sentence-transformers datasets faiss-cpu scikit-learn

import os, math, random, numpy as np, pandas as pd, faiss, torch
from datasets import load_dataset
from sentence_transformers import SentenceTransformer, CrossEncoder, util
from sklearn.metrics import ndcg_score

# Set device and random seeds for reproducibility
DEVICE = 'cuda' if torch.cuda.is_available() else 'cpu'
SEED = 42
random.seed(SEED); np.random.seed(SEED); torch.manual_seed(SEED)

print(f"✅ Environment ready | Using device: {DEVICE}")


✅ Environment ready | Using device: cpu


### 2) Dataset: GLUE QQP for Retrieval

We will use the **GLUE QQP (Quora Question Pairs)** dataset.  
It contains pairs of questions from Quora, labelled as:
- **1** → duplicate (same meaning)
- **0** → non-duplicate (different meaning)

We can reinterpret this dataset for **semantic search**:
- Treat one side (e.g., `question2`) as our **document corpus**.
- Treat the other side (`question1`) as **queries**.
- A query’s relevant document is the corresponding `question2` when the label = 1.

This is an ideal dataset for demonstrating how a search system retrieves semantically similar questions.

In this step, we:
1. Load and sample the dataset for speed.
2. Prepare the corpus, queries, and relevance labels.
3. Remove duplicates and create data structures for indexing and evaluation.


In [2]:
# Load the QQP dataset
ds = load_dataset("glue", "qqp")

# Both splits are loaded, but in this notebook the corpus (document side)
# AND the queries are drawn from the sampled validation split (see below).
# train_df is sampled here but not used later in the notebook.

# Keep only relevant columns and drop empty rows
train_df = pd.DataFrame(ds["train"])[["question1", "question2", "label"]].dropna()
valid_df = pd.DataFrame(ds["validation"])[["question1", "question2", "label"]].dropna()

# We use a small subset to make it feasible for Colab execution
MAX_TRAIN = 30000
MAX_VAL = 5000
train_df = train_df.sample(min(MAX_TRAIN, len(train_df)), random_state=SEED).reset_index(drop=True)
valid_df = valid_df.sample(min(MAX_VAL, len(valid_df)), random_state=SEED).reset_index(drop=True)

# ---- Building the corpus ----
# Use the question2 column as our "document corpus".
# We remove duplicates so that each unique question appears only once.
corpus_texts = pd.Index(valid_df["question2"].tolist()).unique().tolist()
corpus_id_by_text = {t: i for i, t in enumerate(corpus_texts)}
print(f"📚 Corpus size: {len(corpus_texts)} unique documents")

# ---- Building queries and relevance sets ----
# Each question1 becomes a query.
# If label == 1, then its paired question2 is a relevant document.
queries, relevant_ids = [], []
for q1, q2, lab in valid_df.itertuples(index=False):
    if q1 and q2:
        queries.append(q1)
        if lab == 1 and q2 in corpus_id_by_text:
            relevant_ids.append({corpus_id_by_text[q2]})
        else:
            relevant_ids.append(set())

# Filter out queries that have no relevant documents
filtered = [(q, rel) for q, rel in zip(queries, relevant_ids) if len(rel) > 0]
queries, relevant_ids = zip(*filtered) if filtered else ([], [])
queries = list(queries)
relevant_ids = list(relevant_ids)
print(f"🔍 Prepared {len(queries)} queries with at least one relevant document.")


📚 Corpus size: 4948 unique documents
🔍 Prepared 1835 queries with at least one relevant document.


### 3) Embedding Model
Use a compact, performant sentence embedding model. Default: `all-MiniLM-L6-v2`.

- Produces dense vectors suitable for cosine similarity.
- Fast enough for classroom settings.

We L2-normalise embeddings so inner product equals cosine similarity in FAISS.
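The equivalence between inner product and cosine similarity on unit vectors can be checked numerically. The following is a minimal sketch on random vectors (illustrative data, not the notebook's embeddings); the dimension 384 is chosen only to match the model's output size.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=384)
b = rng.normal(size=384)

# Cosine similarity computed directly from the raw vectors
cosine = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# L2-normalise first, then take a plain inner product
a_unit = a / np.linalg.norm(a)
b_unit = b / np.linalg.norm(b)
inner = float(a_unit @ b_unit)

print("cosine:", cosine, "| inner product of unit vectors:", inner)
print("equal up to float error:", np.isclose(cosine, inner))
```

This is why passing `normalize_embeddings=True` at encoding time lets an inner-product index return cosine scores without any extra work at query time.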


In [3]:
EMBEDDING_MODEL = "sentence-transformers/all-MiniLM-L6-v2"
model = SentenceTransformer(EMBEDDING_MODEL, device=DEVICE)

# Encode corpus
batch_size = 256
corpus_embeddings = model.encode(
    corpus_texts, batch_size=batch_size, convert_to_numpy=True, show_progress_bar=True, normalize_embeddings=True
)
print("Corpus embeddings:", corpus_embeddings.shape, "|| L2-norm first vec:", np.linalg.norm(corpus_embeddings[0]))


Corpus embeddings: (4948, 384) || L2-norm first vec: 1.0


### 4) Build FAISS Index
We use an **Inner Product** index with normalised vectors, equivalent to cosine similarity.

- `IndexFlatIP` for accuracy and simplicity.
- Optional IVF or HNSW indices could be plugged in later for larger corpora.
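To make concrete what an exact inner-product search computes, here is a rough NumPy sketch of the same idea: score every corpus vector against each query and keep the k best. The function name `flat_ip_search` and the toy vectors are hypothetical, introduced only for illustration; this is not how FAISS is implemented internally, only a small model of the result a flat index returns.

```python
import numpy as np

def flat_ip_search(corpus, queries, k):
    """Exact top-k by inner product, scanning every corpus vector."""
    sims = queries @ corpus.T                         # (n_queries, n_corpus)
    top = np.argsort(-sims, axis=1)[:, :k]            # ids sorted by descending score
    top_scores = np.take_along_axis(sims, top, axis=1)
    return top_scores, top

# Toy unit vectors in 2-D (illustrative data, not from the notebook)
toy_corpus = np.array([[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]], dtype=np.float32)
toy_queries = np.array([[0.8, 0.6]], dtype=np.float32)

toy_scores, toy_ids = flat_ip_search(toy_corpus, toy_queries, k=2)
print(toy_ids)     # nearest corpus ids per query
print(toy_scores)  # matching inner products (cosine, since vectors are unit length)
```

The cost grows linearly with corpus size, which is acceptable for a few thousand vectors and is the reason approximate indices become attractive for much larger corpora.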


In [4]:
d = corpus_embeddings.shape[1]
index = faiss.IndexFlatIP(d)            # cosine if vectors are unit-normalised
index.add(corpus_embeddings)
print("FAISS index size:", index.ntotal)


FAISS index size: 4948


### 5) Encode Queries and Search
- Encode each query to an embedding.
- Search top-k nearest neighbours by inner product.
- Return ranked results and similarity scores.


In [5]:
TOP_K = 10
query_embeddings = model.encode(
    queries, batch_size=128, convert_to_numpy=True, show_progress_bar=True, normalize_embeddings=True
)

# Search
scores, indices = index.search(query_embeddings, TOP_K)  # scores: (n_queries, k), indices: (n_queries, k)
print("Search done:", scores.shape, indices.shape)


Search done: (1835, 10) (1835, 10)


### 6) Retrieval Evaluation
We compute **Recall@k** and **MRR@k** on queries that have at least one labelled duplicate.

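- **Recall@k**: fraction of queries where at least one relevant document appears in top-k. Each query here has exactly one labelled duplicate, so this hit rate coincides with standard recall.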
- **MRR@k**: mean reciprocal rank of the first relevant result within top-k.
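A small worked example may help fix the two definitions. The sketch below uses three hand-made rankings (illustrative data, not notebook output) and helper names prefixed with `toy_` so they do not clash with anything defined in the notebook.

```python
# Each row is a ranked list of doc ids returned for one query
toy_rankings = [
    [7, 3, 9],   # relevant doc 3 appears at rank 2
    [4, 1, 5],   # relevant doc 4 appears at rank 1
    [2, 8, 6],   # relevant doc 0 is not retrieved at all
]
toy_relevant = [{3}, {4}, {0}]

def toy_recall_at_k(rankings, relevant, k):
    # Count queries with at least one relevant id in the top-k
    hits = sum(1 for ranked, rel in zip(rankings, relevant) if rel & set(ranked[:k]))
    return hits / len(rankings)

def toy_mrr_at_k(rankings, relevant, k):
    # Average 1/rank of the first relevant id; 0 if none is in the top-k
    total = 0.0
    for ranked, rel in zip(rankings, relevant):
        for rank, doc_id in enumerate(ranked[:k], start=1):
            if doc_id in rel:
                total += 1.0 / rank
                break
    return total / len(rankings)

print(toy_recall_at_k(toy_rankings, toy_relevant, 1))  # 1 of 3 queries hit at rank 1
print(toy_recall_at_k(toy_rankings, toy_relevant, 3))  # 2 of 3 queries hit within top-3
print(toy_mrr_at_k(toy_rankings, toy_relevant, 3))     # (1/2 + 1 + 0) / 3 = 0.5
```

Note that at k=1 the two metrics are necessarily equal, because a hit at rank 1 contributes exactly 1 to both.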


In [6]:
def recall_at_k(indices, relevant_sets, k):
    hits = 0
    for i, rel in enumerate(relevant_sets):
        topk = set(indices[i, :k])
        if len(rel.intersection(topk)) > 0:
            hits += 1
    return hits / len(relevant_sets) if relevant_sets else 0.0

def mrr_at_k(indices, relevant_sets, k):
    rr = []
    for i, rel in enumerate(relevant_sets):
        found_rr = 0.0
        for rank, idx in enumerate(indices[i, :k], start=1):
            if idx in rel:
                found_rr = 1.0 / rank
                break
        rr.append(found_rr)
    return float(np.mean(rr)) if rr else 0.0

for k in [1, 5, 10]:
    rec = recall_at_k(indices, relevant_ids, k)
    mrr = mrr_at_k(indices, relevant_ids, k)
    print(f"k={k:>2} | Recall@k={rec:.3f} | MRR@k={mrr:.3f}")


k= 1 | Recall@k=0.772 | MRR@k=0.772
k= 5 | Recall@k=0.956 | MRR@k=0.846
k=10 | Recall@k=0.988 | MRR@k=0.850


### 7) Inspect Results
Let us look at a few queries and their top matches. This helps you judge whether the system retrieves **semantically** related questions, not just keyword overlaps.


In [7]:
def preview(i, k=5):
    print(f"\nQuery [{i}]: {queries[i]}")
    for rank in range(k):
        idx = indices[i, rank]
        print(f"  {rank+1:>2}. score={scores[i, rank]:.3f} | doc[{idx}] = {corpus_texts[idx]}")

# Show 3 random previews
for i in random.sample(range(len(queries)), k=3):
    preview(i, k=5)



Query [1309]: Where can I hire a call girl in Bangalore?
   1. score=0.885 | doc[3480] = Where can I get call girls in Bangalore?
   2. score=0.500 | doc[1713] = How can I have casual sex with a girl in India?
   3. score=0.475 | doc[1920] = What is it like to work at Citrix in Bangalore?
   4. score=0.471 | doc[2273] = I am 27 with 8 years experience in Office Administration completed only +2 from Saudi Arabia planning to do diploma so will I get a job in Bangalore?
   5. score=0.467 | doc[4253] = How do I get a job in Dubai from India?

Query [228]: How do I stop my American Staffy/Kelpie mix from humping my furniture?
   1. score=1.000 | doc[1771] = How do I stop my American Staffy/Kelpie mix from humping my furniture?
   2. score=0.805 | doc[495] = How do I stop my Miniature Pinscher/Chihuahua mix to stop humping my furniture?
   3. score=0.745 | doc[2376] = How do you stop your Boxer/Pitbull mix from humping your furniture?
   4. score=0.406 | doc[4693] = I purchased a big embroi

### 8) Optional: Cross-Encoder Reranking for Precision@k
A **cross-encoder** scores a query–candidate pair jointly, often improving top-rank precision.

Workflow:
1. Retrieve top-N candidates with FAISS (fast).
2. Rerank those N using a cross-encoder (slower but more accurate).
3. Measure metrics again on reranked results.

> This section is optional. Enable it if you have time and a GPU.
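The retrieve-then-rerank workflow can be sketched independently of any particular model. In the sketch below, `retrieve_then_rerank` and `token_overlap` are hypothetical names: `token_overlap` is a deliberately crude stand-in for a cross-encoder (Jaccard overlap of tokens), and the 2-D vectors are toy data, so only the control flow is meant to carry over.

```python
import numpy as np

def retrieve_then_rerank(query, q_vec, corpus_vecs, corpus_docs, pair_scorer, top_n, top_k):
    # Stage 1: cheap candidate retrieval by inner product over the whole corpus
    sims = corpus_vecs @ q_vec
    cand = np.argsort(-sims)[:top_n]
    # Stage 2: run the expensive pairwise scorer on the candidates only
    pair_scores = np.array([pair_scorer(query, corpus_docs[j]) for j in cand])
    order = np.argsort(-pair_scores)[:top_k]
    return cand[order], pair_scores[order]

def token_overlap(query, doc):
    """Hypothetical stand-in for a cross-encoder: Jaccard overlap of lowercase tokens."""
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / len(q | d)

demo_docs = ["how do I learn python", "best pizza in rome", "learn python fast"]
demo_vecs = np.array([[0.9, 0.1], [0.0, 1.0], [0.8, 0.2]])  # toy vectors, not real embeddings
demo_q_vec = np.array([1.0, 0.0])

demo_ids, demo_scores = retrieve_then_rerank(
    "learn python fast", demo_q_vec, demo_vecs, demo_docs, token_overlap, top_n=2, top_k=2
)
print(demo_ids, demo_scores)
```

Here the first stage ranks doc 0 ahead of doc 2, and the second stage swaps them. The reranker can only reorder what the first stage returned, so `top_n` bounds the recall of the final list.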


In [8]:
ENABLE_RERANK = True
TOP_N_RERANK = 50
TOP_K_FINAL = 10

if ENABLE_RERANK:
    reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2", device=DEVICE)

    # Build pairs for the first M queries to limit runtime
    M = min(1000, len(queries))  # adjust as needed
    reranked_indices = np.zeros((M, TOP_K_FINAL), dtype=int)

    for i in range(M):
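        # Take candidates from FAISS for query i.
        # Note: `indices` was produced with TOP_K = 10 columns, so this slice yields
        # at most 10 candidates even though TOP_N_RERANK = 50. To rerank a deeper
        # pool, first re-run index.search(query_embeddings[:M], TOP_N_RERANK).
        cand_ids = indices[i, :TOP_N_RERANK]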
        pairs = [(queries[i], corpus_texts[j]) for j in cand_ids]
        ce_scores = reranker.predict(pairs)

        # Sort by cross-encoder score and take top K
        ordering = np.argsort(-ce_scores)[:TOP_K_FINAL]
        reranked_indices[i, :TOP_K_FINAL] = cand_ids[ordering]

    # Evaluate on the M subset
    sub_rels = relevant_ids[:M]
    print("\nReranked metrics on subset:")
    for k in [1, 5, 10]:
        rec = recall_at_k(reranked_indices, sub_rels, k)
        mrr = mrr_at_k(reranked_indices, sub_rels, k)
        print(f"k={k:>2} | Recall@k={rec:.3f} | MRR@k={mrr:.3f}")



Reranked metrics on subset:
k= 1 | Recall@k=0.775 | MRR@k=0.775
k= 5 | Recall@k=0.958 | MRR@k=0.847
k=10 | Recall@k=0.990 | MRR@k=0.852


### 9) Interactive Search
Run ad-hoc queries against the index and inspect the ranked results.  
This gives an intuitive sense of how well semantic retrieval works on unseen text.


In [9]:
def search(query, k=5, rerank=False):
    q_emb = model.encode([query], normalize_embeddings=True)
    faiss_scores, faiss_idx = index.search(q_emb, max(k, TOP_N_RERANK if rerank else k))
    cand_ids = faiss_idx[0]

    if rerank and 'reranker' in globals():
        pairs = [(query, corpus_texts[j]) for j in cand_ids]
        ce_scores = reranker.predict(pairs)
        ordering = np.argsort(-ce_scores)[:k]
        final_ids = cand_ids[ordering]
        final_scores = ce_scores[ordering]
    else:
        final_ids = cand_ids[:k]
        final_scores = faiss_scores[0][:k]

    print(f"\nQuery: {query}\n")
    for r, (idx_, sc) in enumerate(zip(final_ids, final_scores), start=1):
        print(f"{r}. score={float(sc):.3f} | {corpus_texts[idx_]}")

# Example interactive searches
search("how to improve machine learning model performance", k=5, rerank=False)
search("football match schedule and results", k=5, rerank=False)



Query: how to improve machine learning model performance

1. score=0.434 | How can I improve ranking for my website?
2. score=0.426 | I have high intelligence but I tend to make errors on mechanical tasks. How can I improve my performance as it is affecting how I am regarded at work?
3. score=0.392 | How do I tune a random forest?
4. score=0.382 | How should you start a career in Machine Learning?
5. score=0.358 | How can I improve my speaking?

Query: football match schedule and results

1. score=0.396 | What is the best football formation to win?
2. score=0.385 | Which is the best football club of 2015-2016?
3. score=0.374 | How can I be a football manager?
4. score=0.373 | What NFL teams went to the wildcard playoffs in 2016?
5. score=0.364 | When will India host Olympics?


### 10) Reflection
Briefly address the following in your notes:

- Which metric improved the most, and why might that be the case for this dataset?
- Did cross-encoder reranking help your top-k precision in practice?
- How would you adapt this pipeline for a **document collection** with long texts?
  - Hint: passage splitting, overlap windows, per-document aggregation (max or mean of passage scores).
- What considerations matter for deployment?
  - Index refresh cadence, latency budget, memory footprint, and model size.
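As a starting point for the long-document question, the hint about passage splitting and per-document aggregation could look roughly like the sketch below. The function names, the whitespace tokenisation, and the `window` and `stride` values are all assumptions made for illustration; a real pipeline would more likely split on model tokens or sentences.

```python
from collections import defaultdict

def split_into_passages(text, window=5, stride=3):
    """Split on whitespace into overlapping windows of `window` tokens, stepping by `stride`."""
    tokens = text.split()
    passages = []
    for start in range(0, len(tokens), stride):
        passages.append(" ".join(tokens[start:start + window]))
        if start + window >= len(tokens):
            break  # the last window already reaches the end of the text
    return passages

def aggregate_max(passage_hits):
    """Collapse (doc_id, score) passage hits into one score per document, keeping the max."""
    best = defaultdict(lambda: float("-inf"))
    for doc_id, score in passage_hits:
        best[doc_id] = max(best[doc_id], score)
    return sorted(best.items(), key=lambda kv: kv[1], reverse=True)

demo_passages = split_into_passages("a b c d e f g h", window=5, stride=3)
demo_ranking = aggregate_max([("doc1", 0.4), ("doc2", 0.7), ("doc1", 0.9)])
print(demo_passages)
print(demo_ranking)
```

Max aggregation rewards a document for its single best passage; replacing `max` with a mean would instead favour documents that are relevant throughout.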
