# NB06: FAISS Retrieval + Semantic Search

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/RJuro/unistra-nlp2026/blob/main/notebooks/NB06_faiss_retrieval.ipynb)

**Duration:** ~70 minutes

> **GPU recommended** — go to **Runtime → Change runtime type → T4 GPU**. Section 6 loads the multilingual bge-m3 model (~2 GB) and re-encodes the corpus — this is much faster on GPU.

## Learning Goals

By the end of this notebook, you will be able to:

1. **Build a semantic search system** from scratch using dense embeddings
2. **Use FAISS for fast similarity search** over large document collections
3. **Understand the bi-encoder retrieval paradigm** — encode once, search many times
4. **Try multilingual search with bge-m3** — query in one language, retrieve in another

---

**Prerequisites:** NB02 (sentence embeddings). Familiarity with cosine similarity and vector representations of text.

In [1]:
!pip install faiss-cpu sentence-transformers datasets pandas numpy -q

import faiss
import numpy as np
import pandas as pd
from sentence_transformers import SentenceTransformer
from datasets import load_dataset
import time

print(f"FAISS version: {faiss.__version__}")
print("All imports successful.")

FAISS version: 1.13.2
All imports successful.


In [3]:
# ── GPU Check ─────────────────────────────────────────────────────────────
import torch

if torch.cuda.is_available():
    print(f"GPU available: {torch.cuda.get_device_name(0)}")
else:
    print("No GPU detected — running on CPU.")
    print("Sections 1-5 (MiniLM, 300 docs) are fine on CPU.")
    print("Section 6 (bge-m3, 2 GB model) will be slow without GPU.")
    print("To enable GPU: Runtime → Change runtime type → T4 GPU")

GPU available: Tesla T4


## 1. The Use Case: Searching Policy Documents

Social scientists, policy analysts, and researchers frequently need to find relevant documents in large collections — policy briefs, legal texts, academic papers, parliamentary debates, and more.

**The problem with keyword search:** Traditional keyword search (TF-IDF, BM25) scores documents by the terms they share with the query, so a document that expresses the same idea in different words is ranked low or not returned at all. This means:

- Searching for "climate change effects on wildlife" will **miss** a document titled "global warming impacts on animal populations" — even though they address the same topic.
- Searching for "vaccination efficacy" will **miss** documents about "how well immunization works."
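To make the vocabulary mismatch concrete, here is a minimal sketch that counts the content words two texts share. The `shared_words` helper and its tiny stopword list are illustrative assumptions, not part of any keyword-search library. Real systems such as BM25 add term weighting, but they still depend on overlapping terms.

```python
import re

# Tiny illustrative stopword list (an assumption for this sketch, not a standard list)
STOPWORDS = {"on", "of", "the", "and", "in", "a", "how", "well"}

def shared_words(query: str, document: str) -> set[str]:
    """Return the content words that appear in both texts (case-insensitive)."""
    def tokenize(text: str) -> set[str]:
        return set(re.findall(r"[a-z]+", text.lower())) - STOPWORDS
    return tokenize(query) & tokenize(document)

query = "climate change effects on wildlife"
paraphrase = "global warming impacts on animal populations"
near_copy = "effects of climate change on forests"

print(shared_words(query, paraphrase))         # set(): no content word in common
print(sorted(shared_words(query, near_copy)))  # ['change', 'climate', 'effects']
```

A purely lexical scorer has nothing to work with in the first case, even though the two texts describe the same topic.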

**The solution: Semantic search.** We encode documents and queries into dense vector representations (embeddings) that capture *meaning*, not just surface words. Similar meanings produce similar vectors, regardless of the exact words used.

In this notebook, we will build a complete semantic search engine:

1. Load a corpus of scientific abstracts
2. Encode all documents into embeddings
3. Build a FAISS index for fast retrieval
4. Search with natural-language queries
5. Evaluate retrieval quality
6. Enable cross-lingual search with a multilingual model

In [16]:
# Easiest/cleanest: skip `datasets` entirely for BEIR and read the raw JSONL(.gz) via huggingface_hub.

import json, gzip
import pandas as pd
from huggingface_hub import hf_hub_download

def load_beir_corpus(repo_id: str, n: int | None = None) -> pd.DataFrame:
    # try gz first, then plain jsonl
    for fname in ("corpus.jsonl.gz", "corpus.jsonl"):
        try:
            path = hf_hub_download(repo_id=repo_id, repo_type="dataset", filename=fname)
            break
        except Exception:
            path = None
    if path is None:
        raise RuntimeError(f"Couldn't find corpus.jsonl(.gz) in {repo_id}")

    def lines(p):
        if p.endswith(".gz"):
            with gzip.open(p, "rt", encoding="utf-8") as f:
                for line in f:
                    yield line
        else:
            with open(p, "r", encoding="utf-8") as f:
                for line in f:
                    yield line

    rows = []
    for i, line in enumerate(lines(path)):
        if n is not None and i >= n:
            break
        obj = json.loads(line)
        rows.append({
            "doc_id": obj.get("_id") or obj.get("doc_id") or obj.get("id"),
            "title": obj.get("title", "") or "",
            "text": obj.get("text", "") or "",
        })

    df = pd.DataFrame(rows)
    df["full_text"] = (
        df["title"].fillna("").astype(str).str.strip()
        + ". "
        + df["text"].fillna("").astype(str).str.strip()
    )
    return df

corpus_df = load_beir_corpus("BeIR/scifact", n=300)
print(len(corpus_df))
print(corpus_df.iloc[0]["full_text"][:300])

300
Microstructural development of human newborn cerebral white matter assessed in vivo by diffusion tensor magnetic resonance imaging.. Alterations of the architecture of cerebral white matter in the developing human brain can affect cortical development and result in functional disabilities. A line sc


## 2. Encoding the Corpus

We use a **bi-encoder** approach to semantic search:

1. **Offline step (done once):** Encode all documents in the corpus into fixed-size embedding vectors and store them.
2. **Online step (done per query):** Encode the user's query into the same embedding space, then find the nearest document vectors.

This is extremely efficient because:
- The expensive corpus encoding happens **once** and can be cached.
- At query time, we only need to encode **one short query** and perform a vector lookup.
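The offline/online split can be sketched without any model. In the toy example below, `toy_encode` is a hypothetical stand-in for `model.encode()`: it builds a unit-length bag-of-words vector, so it illustrates only the workflow, not semantic matching. The `encode_calls` counter is added purely to show how often the encoder runs.

```python
import math
from collections import Counter

encode_calls = 0  # counts how often the (toy) encoder runs

def toy_encode(text: str) -> dict[str, float]:
    """Hypothetical stand-in for model.encode(): unit-length bag-of-words vector."""
    global encode_calls
    encode_calls += 1
    counts = Counter(text.lower().split())
    norm = math.sqrt(sum(c * c for c in counts.values())) or 1.0
    return {word: c / norm for word, c in counts.items()}

def dot(u: dict[str, float], v: dict[str, float]) -> float:
    """Inner product of two sparse vectors stored as dicts."""
    return sum(weight * v.get(word, 0.0) for word, weight in u.items())

docs = [
    "vaccines train the immune system",
    "air pollution harms the lungs",
    "gene mutations can cause cancer",
]

# Offline step: encode every document once and keep the vectors.
doc_vectors = [toy_encode(d) for d in docs]
calls_after_indexing = encode_calls

def toy_search(query: str, top_k: int = 1) -> list[tuple[float, str]]:
    """Online step: encode only the query, then score it against the stored vectors."""
    q = toy_encode(query)
    scored = sorted(((dot(q, dv), doc) for dv, doc in zip(doc_vectors, docs)), reverse=True)
    return scored[:top_k]

print(toy_search("air pollution"))
print(toy_search("gene mutations"))
print(calls_after_indexing, encode_calls)  # 3 5
```

Indexing costs one encoder call per document; each later query costs exactly one more call, no matter how large the corpus is.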

We start with `all-MiniLM-L6-v2`, a lightweight model (roughly 90 MB to download) that produces 384-dimensional embeddings. It is fast, works well for English, and makes a good baseline.

**Key detail:** We set `normalize_embeddings=True` so that all vectors have unit length. This means the inner product (dot product) between any two vectors equals their cosine similarity — which is exactly what we want for measuring semantic similarity.
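A small numeric check of this claim, using plain Python lists instead of real embeddings. The helper names (`dot`, `normalize`, `cosine_similarity`) are defined here for illustration only.

```python
import math

def dot(u: list[float], v: list[float]) -> float:
    return sum(a * b for a, b in zip(u, v))

def cosine_similarity(u: list[float], v: list[float]) -> float:
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

def normalize(u: list[float]) -> list[float]:
    """Scale a vector to unit length."""
    length = math.sqrt(dot(u, u))
    return [a / length for a in u]

a = [3.0, 4.0, 0.0]  # length 5
b = [1.0, 2.0, 2.0]  # length 3

print(dot(a, b))                                  # 11.0 (raw inner product, depends on length)
print(round(cosine_similarity(a, b), 4))          # 0.7333
print(round(dot(normalize(a), normalize(b)), 4))  # 0.7333 (same value after normalization)
```

Without normalization the inner product also reflects vector length, so a long vector could outrank a better-aligned short one.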

In [17]:
# Using a lightweight but effective model
model = SentenceTransformer('all-MiniLM-L6-v2')

print("Encoding corpus...")
start = time.time()
corpus_embeddings = model.encode(
    corpus_df['full_text'].tolist(),
    show_progress_bar=True,
    batch_size=64,
    normalize_embeddings=True  # Important for cosine similarity with FAISS
)
print(f"Encoded {len(corpus_embeddings)} documents in {time.time()-start:.1f}s")
print(f"Embedding shape: {corpus_embeddings.shape}")

Encoding corpus...
Encoded 300 documents in 1.8s
Embedding shape: (300, 384)


## 3. Building a FAISS Index

**FAISS** (Facebook AI Similarity Search) is an open-source library developed by Meta AI for efficient similarity search over dense vectors. It is one of the most widely used tools for this task, and its approximate and compressed index types scale to billions of vectors.

FAISS offers many index types. We use **`IndexFlatIP`** (Flat Index with Inner Product):

| Index type | Description | Speed | Accuracy |
|---|---|---|---|
| `IndexFlatIP` | Exact inner product search (brute force) | Slower for huge corpora | 100% exact |
| `IndexFlatL2` | Exact L2 distance search | Slower for huge corpora | 100% exact |
| `IndexIVFFlat` | Approximate search with inverted file | Fast | Very good |
| `IndexHNSWFlat` | Approximate search with graph structure | Very fast | Very good |

For our 300-document corpus, `IndexFlatIP` is a good fit: it gives exact results and is fast enough. For millions of documents, you would switch to an approximate index such as `IndexIVFFlat` or `IndexHNSWFlat`.
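The inverted-file idea behind `IndexIVFFlat` can be sketched in plain Python. This is a simplified illustration, not FAISS's implementation: the centroids are hand-picked here (FAISS learns them with k-means when you call `train`), and the helper names are hypothetical. The `nprobe` argument mirrors the FAISS parameter of the same name, which sets how many clusters are scanned per query.

```python
def inner(u, v):
    return sum(a * b for a, b in zip(u, v))

def build_inverted_lists(vectors, centroids):
    """Assign each vector to the centroid with the highest inner product."""
    lists = {c: [] for c in range(len(centroids))}
    for vec_id, vec in enumerate(vectors):
        nearest = max(range(len(centroids)), key=lambda c: inner(vec, centroids[c]))
        lists[nearest].append(vec_id)
    return lists

def ivf_search(query, vectors, centroids, lists, nprobe, top_k):
    """Scan only the nprobe closest clusters instead of the whole collection."""
    probed = sorted(range(len(centroids)),
                    key=lambda c: inner(query, centroids[c]), reverse=True)[:nprobe]
    candidates = [vec_id for c in probed for vec_id in lists[c]]
    return sorted(candidates, key=lambda i: inner(query, vectors[i]), reverse=True)[:top_k]

vectors = [(1.0, 0.0), (0.8, 0.6), (0.6, 0.8), (0.0, 1.0), (-0.6, 0.8), (-1.0, 0.0)]
centroids = [(1.0, 0.0), (0.0, 1.0), (-1.0, 0.0)]  # hand-picked for the sketch
lists = build_inverted_lists(vectors, centroids)
query = (0.75, 0.66)

print(lists)                                                           # {0: [0, 1], 1: [2, 3, 4], 2: [5]}
print(ivf_search(query, vectors, centroids, lists, nprobe=1, top_k=2)) # [1, 0]
print(ivf_search(query, vectors, centroids, lists, nprobe=3, top_k=2)) # [1, 2]
```

With `nprobe=1` the search misses vector 2, which sits just across a cluster boundary. Scanning all clusters recovers the exact answer at the cost of scoring every vector, which is the speed/accuracy trade-off in the table.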

**Why Inner Product?** Because we normalized our embeddings to unit length, the inner product between two vectors equals their cosine similarity. Higher score = more similar.
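The two exact index types in the table are closely related when vectors have unit length. As a short derivation (standard vector algebra, not something specific to FAISS), for unit-length vectors $q$ and $d$:

$$\lVert q - d \rVert^2 = \lVert q \rVert^2 + \lVert d \rVert^2 - 2\,(q \cdot d) = 2 - 2\,(q \cdot d)$$

So for normalized embeddings, ranking by largest inner product and ranking by smallest L2 distance give the same order. Only the reported numbers differ: `IndexFlatIP` returns similarities (higher is better), while `IndexFlatL2` returns squared distances (lower is better).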

In [18]:
import faiss

# Create a FAISS index
dimension = corpus_embeddings.shape[1]  # 384 for MiniLM
index = faiss.IndexFlatIP(dimension)  # Inner Product = cosine sim (normalized)

# Add vectors to the index
index.add(corpus_embeddings.astype('float32'))

print(f"FAISS index built: {index.ntotal} vectors, {dimension} dimensions")

FAISS index built: 300 vectors, 384 dimensions


## 4. Searching!

Now for the fun part. We define a `search()` function that:

1. Takes a natural-language query string
2. Encodes it into an embedding with the same model
3. Searches the FAISS index for the `top_k` most similar documents
4. Returns a clean DataFrame with ranks, scores, titles, and text previews

Let's test it with several queries and see what comes back.

In [19]:
def search(query: str, top_k: int = 5) -> pd.DataFrame:
    """Search the corpus for documents matching the query."""
    # Encode the query
    query_embedding = model.encode([query], normalize_embeddings=True).astype('float32')

    # Search FAISS index
    scores, indices = index.search(query_embedding, top_k)

    # Format results
    results = []
    for score, idx in zip(scores[0], indices[0]):
        if idx == -1:  # FAISS pads with -1 when top_k exceeds the index size
            continue
        results.append({
            'rank': len(results) + 1,
            'score': float(score),
            'title': corpus_df.iloc[idx]['title'],
            'text': corpus_df.iloc[idx]['full_text'][:200] + '...'
        })
    return pd.DataFrame(results)

# Test queries
queries = [
    "effects of climate change on biodiversity",
    "how do vaccines work",
    "machine learning for medical diagnosis",
    "genetic factors in cancer risk",
    "air pollution and respiratory disease"
]

for q in queries:
    print(f"\n{'='*60}")
    print(f"Query: {q}")
    print(f"{'='*60}")
    results = search(q, top_k=3)
    for _, row in results.iterrows():
        print(f"  [{row['rank']}] (score: {row['score']:.3f}) {row['title']}")


Query: effects of climate change on biodiversity
  [1] (score: 0.308) Genetic Tests for Ecological and Allopatric Speciation in Anoles on an Island Archipelago
  [2] (score: 0.298) Effective population size and patterns of molecular evolution and variation
  [3] (score: 0.243) The First Myriapod Genome Sequence Reveals Conservative Arthropod Gene Content and Genome Organisation in the Centipede Strigamia maritima 

Query: how do vaccines work
  [1] (score: 0.374) The descent of memory T-cell subsets
  [2] (score: 0.355) An essential role for interferon gamma in resistance to Mycobacterium tuberculosis infection
  [3] (score: 0.346) Lymph node T cell responses predict the efficacy of live attenuated SIV vaccines

Query: machine learning for medical diagnosis
  [1] (score: 0.366) Simplifying likelihood ratios
  [2] (score: 0.357) Adverse drug events: database construction and in silico prediction.
  [3] (score: 0.339) Robustness of Random Forest-based gene selection methods

Query: gene

## 5. Evaluating Retrieval Quality

How do we know if our search engine is any good? We need **retrieval evaluation metrics**.

The most intuitive metric is **Precision@k**: of the top *k* results returned, how many are actually relevant?

$$\text{Precision@k} = \frac{\text{Number of relevant documents in top } k}{k}$$

For example, if we retrieve 5 documents and 3 are relevant, Precision@5 = 3/5 = 60%.
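The formula translates directly into a few lines of Python. This is a minimal sketch: the document IDs and the relevance set below are made up for illustration, and `precision_at_k` is a helper defined here, not a library function.

```python
def precision_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """Fraction of the top-k retrieved IDs that are in the relevant set."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for doc_id in retrieved_ids[:k] if doc_id in relevant_ids)
    return hits / k

retrieved = ["d7", "d2", "d9", "d4", "d1"]  # hypothetical ranked search output
relevant = {"d2", "d4", "d1", "d8"}         # hypothetical relevance judgments

print(precision_at_k(retrieved, relevant, k=5))  # 0.6
print(precision_at_k(retrieved, relevant, k=1))  # 0.0
```

Note that the relevant document `d8` was never retrieved, and Precision@k does not penalize that. It only looks at what is in the top *k*.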

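**The hard part:** Determining relevance usually requires human judgment. For this demo, we use a simple proxy: a retrieved document counts as relevant if its text contains any keyword related to the query topic. The proxy is crude. It matches substrings (so the term "air" also matches "repair"), and the code below checks only the 200-character preview returned by `search()`, not the full abstract. Treat the numbers as a rough signal only.

In a real evaluation, you would use a benchmark with human-annotated relevance judgments (such as BEIR, MTEB, or TREC). SciFact itself is distributed in BEIR with such judgments.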
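One pitfall of a keyword proxy is that Python's `in` operator matches substrings, which can inflate relevance counts. The sketch below contrasts that behavior with a stricter word-boundary variant. Both helper functions are illustrative and defined here. The stricter one still accepts stems such as "oncog" as prefixes.

```python
import re

def substring_match(text: str, terms: list[str]) -> bool:
    """Plain `term in text` check: also matches inside longer words."""
    text = text.lower()
    return any(term in text for term in terms)

def word_prefix_match(text: str, terms: list[str]) -> bool:
    """Stricter variant: a term must start at a word boundary."""
    text = text.lower()
    return any(re.search(r"\b" + re.escape(term), text) for term in terms)

doc = "DNA repair pathways in impaired cells"
terms = ["pollution", "air"]

print(substring_match(doc, terms))    # True: 'air' occurs inside 'repair' and 'impaired'
print(word_prefix_match(doc, terms))  # False
print(word_prefix_match("Activation of oncogenes", ["oncog"]))  # True: prefix match still works
```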

In [20]:
# Define queries with expected relevant terms
eval_queries = [
    {"query": "vaccine effectiveness against viral infections",
     "relevant_terms": ["vaccine", "immunization", "viral", "antibod"]},
    {"query": "genetic mutations and cancer development",
     "relevant_terms": ["genetic", "mutation", "cancer", "tumor", "oncog"]},
    {"query": "impact of air pollution on health",
     "relevant_terms": ["pollution", "air", "respiratory", "particulate"]},
]

# Simple keyword-based relevance proxy
def is_relevant(doc_text, relevant_terms):
    doc_lower = doc_text.lower()
    return any(term in doc_lower for term in relevant_terms)

for eq in eval_queries:
    results = search(eq['query'], top_k=5)
    relevant = sum(is_relevant(row['text'], eq['relevant_terms']) for _, row in results.iterrows())
    precision = relevant / len(results)
    print(f"Query: '{eq['query'][:50]}...'")
    print(f"  Precision@5: {precision:.0%} ({relevant}/5 relevant)")

Query: 'vaccine effectiveness against viral infections...'
  Precision@5: 20% (1/5 relevant)
Query: 'genetic mutations and cancer development...'
  Precision@5: 80% (4/5 relevant)
Query: 'impact of air pollution on health...'
  Precision@5: 60% (3/5 relevant)


## 6. Multilingual Search with bge-m3

So far, we have used an English-only model. But what if your corpus is in English and your users search in French, German, or Spanish?

**bge-m3** (BAAI General Embedding — Multi-lingual, Multi-granularity, Multi-functionality) is a state-of-the-art multilingual embedding model that:

- Supports **100+ languages**
- Produces **1024-dimensional** embeddings
- Enables **cross-lingual retrieval**: query in one language, retrieve in another

This is incredibly useful for:
- Multilingual policy analysis (e.g., EU documents in 24 languages)
- Comparative political science across countries
- Searching English academic literature with non-English queries

The key insight: bge-m3 maps semantically equivalent sentences from different languages to **nearby points** in the same embedding space. So "vaccination" (EN), "vaccination" (FR), "Impfung" (DE), and "vacunación" (ES) all end up close together.

**Note:** bge-m3 is larger (~2 GB) and slower than MiniLM. Loading and encoding will take a bit longer.

In [21]:
# Load multilingual model (larger but much more powerful)
print("Loading multilingual model (this may take a minute)...")
ml_model = SentenceTransformer('BAAI/bge-m3')

# Re-encode corpus with multilingual model
ml_embeddings = ml_model.encode(
    corpus_df['full_text'].tolist(),
    show_progress_bar=True,
    batch_size=32,
    normalize_embeddings=True
)

# Build new FAISS index
ml_dimension = ml_embeddings.shape[1]
ml_index = faiss.IndexFlatIP(ml_dimension)
ml_index.add(ml_embeddings.astype('float32'))

def search_multilingual(query: str, top_k: int = 5):
    """Search using multilingual embeddings."""
    q_emb = ml_model.encode([query], normalize_embeddings=True).astype('float32')
    scores, indices = ml_index.search(q_emb, top_k)
    results = []
    for score, idx in zip(scores[0], indices[0]):
        if idx == -1:  # FAISS pads with -1 when top_k exceeds the index size
            continue
        results.append({
            'rank': len(results) + 1,
            'score': float(score),
            'title': corpus_df.iloc[idx]['title'],
        })
    return pd.DataFrame(results)

print(f"\nMultilingual FAISS index built: {ml_index.ntotal} vectors, {ml_dimension} dimensions")

Loading multilingual model (this may take a minute)...



Multilingual FAISS index built: 300 vectors, 1024 dimensions


In [22]:
# Same meaning, different languages
cross_lingual_queries = [
    ("English", "effects of vaccination on immune response"),
    ("French", "effets de la vaccination sur la réponse immunitaire"),
    ("German", "Auswirkungen der Impfung auf die Immunantwort"),
    ("Spanish", "efectos de la vacunación en la respuesta inmune"),
]

print("Cross-lingual retrieval test:")
print("="*60)
for lang, query in cross_lingual_queries:
    results = search_multilingual(query, top_k=3)
    print(f"\n[{lang}] '{query[:50]}...'")
    for _, row in results.iterrows():
        print(f"  [{row['rank']}] ({row['score']:.3f}) {row['title'][:60]}")

Cross-lingual retrieval test:

[English] 'effects of vaccination on immune response...'
  [1] (0.529) Lymph node T cell responses predict the efficacy of live att
  [2] (0.487) Antioxidants attenuate the plasma cytokine response to exerc
  [3] (0.486) The descent of memory T-cell subsets

[French] 'effets de la vaccination sur la réponse immunitair...'
  [1] (0.516) Lymph node T cell responses predict the efficacy of live att
  [2] (0.489) The descent of memory T-cell subsets
  [3] (0.487) Antioxidants attenuate the plasma cytokine response to exerc

[German] 'Auswirkungen der Impfung auf die Immunantwort...'
  [1] (0.501) Innate lymphoid cells mediate influenza-induced airway hyper
  [2] (0.500) Neutrophil extracellular traps enriched in oxidized mitochon
  [3] (0.494) Lymph node T cell responses predict the efficacy of live att

[Spanish] 'efectos de la vacunación en la respuesta inmune...'
  [1] (0.493) Lymph node T cell responses predict the efficacy of live att
  [2] (0.477) The d

## From FAISS to a Vector Database: ChromaDB

FAISS is excellent for understanding how vector search works under the hood — but in production, you often want a **vector database** that handles persistence, metadata filtering, and API convenience for you.

**ChromaDB** is a lightweight, open-source vector database that is perfect for prototyping and small-to-medium scale applications. Here is how it compares to raw FAISS:

| Feature | FAISS | ChromaDB |
|---------|-------|----------|
| **Persistence** | Manual (save/load index files) | Built-in (auto-saves to disk) |
| **Metadata** | Not supported (vectors only) | Filter by any metadata field |
| **Embedding** | BYO (encode externally) | Built-in default model (`all-MiniLM-L6-v2` via ONNX), or bring your own |
| **API** | Low-level NumPy arrays | High-level Python API |
| **Scale** | Billions of vectors | Millions of vectors |
| **Best for** | Research, max performance | Prototyping, applications, RAG |

Let's build the same search engine using ChromaDB — notice how much simpler the code is.

In [23]:
!pip install chromadb -q

import chromadb

# Create an in-memory ChromaDB client (use PersistentClient for disk storage)
chroma_client = chromadb.Client()

# Create a collection — ChromaDB handles embedding automatically!
collection = chroma_client.create_collection(
    name="scifact_abstracts",
    metadata={"hnsw:space": "cosine"}  # Use cosine similarity
)

# Add documents (ChromaDB embeds them using its default model)
collection.add(
    documents=corpus_df['full_text'].tolist(),
    ids=[str(i) for i in range(len(corpus_df))],
    metadatas=[{"title": t} for t in corpus_df['title'].tolist()]
)

In [24]:
# Query — just pass a string, ChromaDB handles the rest
results = collection.query(
    query_texts=["effects of vaccination on immune response"],
    n_results=5
)

print("ChromaDB search results:")
for i, (doc, meta, dist) in enumerate(zip(
    results['documents'][0], results['metadatas'][0], results['distances'][0]
)):
    print(f"  [{i+1}] (distance: {dist:.3f}) {meta['title']}")

ChromaDB search results:
  [1] (distance: 0.538) The descent of memory T-cell subsets
  [2] (distance: 0.557) An essential role for interferon gamma in resistance to Mycobacterium tuberculosis infection
  [3] (distance: 0.600) Long-term immune deficiency after allogeneic stem cell transplantation: B-cell deficiency is associated with late infections.
  [4] (distance: 0.605) Transgenic Interleukin 10 Prevents Induction of Experimental Autoimmune Encephalomyelitis 
  [5] (distance: 0.606) Lymph node T cell responses predict the efficacy of live attenuated SIV vaccines
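Note that ChromaDB reports **distances** (lower is better), whereas our FAISS `IndexFlatIP` searches reported **similarities** (higher is better). With the `cosine` space configured above, the distance is 1 minus the cosine similarity, so the two scales are easy to convert. A minimal sketch, with a helper name chosen here for illustration:

```python
def cosine_distance_to_similarity(distance: float) -> float:
    """Convert a cosine distance (1 - cosine similarity) back to a similarity."""
    return 1.0 - distance

chroma_distances = [0.538, 0.557, 0.600, 0.605, 0.606]  # values printed above
similarities = [round(cosine_distance_to_similarity(d), 3) for d in chroma_distances]
print(similarities)  # [0.462, 0.443, 0.4, 0.395, 0.394]
```

Keep the direction in mind when comparing outputs from different tools: sorting ascending by distance is the same as sorting descending by similarity.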


### When to Use Which?

- **Use FAISS** when you need to understand vector search internals, need maximum performance at scale (billions of vectors), or want full control over the index type and parameters.
- **Use ChromaDB** when you are building applications, prototyping RAG pipelines, or need metadata filtering and persistence without managing index files manually.
- **Use a managed service** (Pinecone, Weaviate, Qdrant) when you need production-grade infrastructure with replication, auth, and monitoring.

In practice, many teams **start with ChromaDB** for rapid prototyping, then move to a managed service as their needs grow. The concepts you learned with FAISS (embeddings, similarity search, index types) transfer directly.

## 7. Exercise: Build Your Own Search Engine

Now it's your turn! Build a semantic search engine over a different corpus.

**Suggestions:**
- Load a different dataset (e.g., Wikipedia snippets, news articles, or your own research papers)
- Define at least 3 meaningful search queries relevant to the corpus
- Evaluate Precision@5 for each query using keyword-based or manual relevance judgments
- Compare results between the English model (`all-MiniLM-L6-v2`) and the multilingual model (`BAAI/bge-m3`)

**Bonus:** Try indexing with `IndexIVFFlat` instead of `IndexFlatIP` and compare speed/accuracy.

In [None]:
# YOUR CODE HERE

# Step 1: Load a corpus
# e.g., dataset = load_dataset("wikimedia/wikipedia", "20231101.simple", split="train[:500]")


# Step 2: Encode the corpus with a sentence-transformer model


# Step 3: Build a FAISS index


# Step 4: Define search queries and test them


# Step 5: Evaluate Precision@5 with relevance terms


## 8. Summary & Takeaways

In this notebook, we built a complete semantic search system. Here are the key takeaways:

| Concept | What we learned |
|---|---|
| **Semantic search** | Dense embeddings capture meaning, not just keywords — enabling retrieval of semantically similar documents even when they use different words. |
| **FAISS** | Facebook AI Similarity Search provides fast, scalable nearest-neighbor search over dense vectors. `IndexFlatIP` gives exact results; approximate indices scale to billions of vectors. |
| **Bi-encoder paradigm** | Encode the corpus once (offline), then encode queries at search time (online). This separation makes retrieval extremely fast. |
| **Multilingual retrieval** | Models like bge-m3 map text from 100+ languages into a shared embedding space, enabling cross-lingual search — query in French, retrieve English documents. |
| **Evaluation** | Precision@k measures how many of the top-k retrieved documents are relevant. Real evaluation requires human-annotated relevance judgments. |

### Limitations of bi-encoder retrieval

Bi-encoders are fast but imperfect. Because query and document are encoded **independently**, the model cannot attend to fine-grained interactions between them. This means:

- Subtle semantic distinctions may be missed
- The top-1 result is not always the best, although a relevant document often appears somewhere in the top 10-20

### What's next?

In **NB07**, we will address these limitations by adding a **cross-encoder reranker** on top of the bi-encoder retriever. The cross-encoder processes each (query, document) pair jointly, enabling much more precise relevance scoring. The typical pipeline:

1. **Retrieve** the top 50-100 candidates with a bi-encoder (fast but approximate)
2. **Rerank** those candidates with a cross-encoder (slow but precise)

This two-stage approach gives you both speed and accuracy.
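As a preview, the two-stage pipeline can be sketched with placeholder scoring functions. Everything below is hypothetical: the `FAST` and `PRECISE` score tables stand in for a bi-encoder and a cross-encoder, and `retrieve_then_rerank` is a helper written for this illustration rather than a library function.

```python
def retrieve_then_rerank(query, doc_ids, fast_score, precise_score, n_candidates=3, top_k=2):
    """Two-stage search: cheap scoring over everything, expensive scoring over a shortlist."""
    # Stage 1: rank all documents with the cheap scorer and keep a shortlist.
    candidates = sorted(doc_ids, key=lambda d: fast_score(query, d), reverse=True)[:n_candidates]
    # Stage 2: re-score only the shortlist with the expensive scorer.
    return sorted(candidates, key=lambda d: precise_score(query, d), reverse=True)[:top_k]

# Hypothetical scores standing in for a bi-encoder (FAST) and a cross-encoder (PRECISE).
FAST = {"d1": 0.52, "d2": 0.49, "d3": 0.48, "d4": 0.10}
PRECISE = {"d1": 0.30, "d2": 0.91, "d3": 0.75, "d4": 0.99}

result = retrieve_then_rerank(
    "example query",
    list(FAST),
    fast_score=lambda q, d: FAST[d],
    precise_score=lambda q, d: PRECISE[d],
)
print(result)  # ['d2', 'd3']
```

The reranker promotes `d2` and `d3` over the first-stage favorite `d1`, but it never sees `d4`, because the first stage did not shortlist it. The reranker can only reorder what the retriever returns, which is why the candidate pool is usually kept generous (50-100 documents).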