# NB06: FAISS Retrieval + Semantic Search

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/RJuro/unistra-nlp2026/blob/main/notebooks/NB06_faiss_retrieval.ipynb)

**Duration:** ~70 minutes

> **GPU optional** — CPU is fine for the 300-doc demo. A GPU simply speeds up embedding in Sections 2 and 6.

## Learning Goals

By the end of this notebook, you will be able to:

1. **Build a semantic search system** from scratch using dense embeddings
2. **Use FAISS for fast similarity search** over document collections
3. **Understand the bi-encoder retrieval paradigm** — encode once, search many times
4. **Use compact E5 models** (`intfloat/e5-small`, `intfloat/multilingual-e5-small`) with the correct `query:` / `passage:` format

---

**Prerequisites:** NB02 (sentence embeddings). Familiarity with cosine similarity and vector representations of text.


In [None]:
!pip install faiss-cpu sentence-transformers datasets pandas numpy -q

import faiss
import numpy as np
import pandas as pd
from sentence_transformers import SentenceTransformer
from datasets import load_dataset
import time

print(f"FAISS version: {faiss.__version__}")
print("All imports successful.")

In [None]:
# ── GPU Check ─────────────────────────────────────────────────────────────
import torch

if torch.cuda.is_available():
    print(f"GPU available: {torch.cuda.get_device_name(0)}")
else:
    print("No GPU detected — running on CPU.")
    print("This notebook still runs fine on CPU for 300 documents.")
    print("To enable GPU: Runtime -> Change runtime type -> T4 GPU")


## 1. The Use Case: Searching Policy Documents

Social scientists, policy analysts, and researchers frequently need to find relevant documents in large collections — policy briefs, legal texts, academic papers, parliamentary debates, and more.

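**The problem with keyword search:** Traditional keyword search (TF-IDF, BM25) matches documents on the words they share with the query. A query about "global warming" can miss a document that only talks about "rising temperatures", even though the topic is the same.

**The solution: Semantic search.** We encode documents and queries into dense vectors that capture meaning, so texts with similar meaning end up close together even when they use different words.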
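
To make the keyword limitation concrete, here is a minimal sketch of exact-word matching. The helpers `tokenize` and `keyword_overlap` and the example strings are illustrative assumptions, not part of the SciFact corpus or of any library:

```python
import re


def tokenize(text: str) -> set[str]:
    """Lowercase the text and return its set of word tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))


def keyword_overlap(query: str, document: str) -> set[str]:
    """Return the words that the query and the document share."""
    return tokenize(query) & tokenize(document)


query = "global warming effects"
doc_a = "Rising temperatures alter the climate and threaten coastal cities."
doc_b = "Warming up before exercise reduces injuries."

# doc_a is on topic but shares no words with the query.
print(keyword_overlap(query, doc_a))
# doc_b is off topic but shares the word "warming".
print(keyword_overlap(query, doc_b))
```

A purely lexical matcher would rank `doc_b` above `doc_a` here, which is the failure mode dense embeddings are meant to reduce.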

In this notebook, we will:

1. Load a corpus of scientific abstracts
2. Encode all documents into embeddings
3. Build a FAISS index for fast retrieval
4. Search with natural-language queries
5. Evaluate retrieval quality
6. Enable cross-lingual search with a compact multilingual model

### Why E5 small models here?

Compact E5 models from `intfloat` are a good fit for this notebook:
- `intfloat/e5-small` is compact and strong for retrieval tasks
- `intfloat/multilingual-e5-small` gives cross-lingual retrieval without the heavy bge-m3 footprint
- Both require the prompt format:
  - documents: `"passage: ..."`
  - queries: `"query: ..."`


In [None]:
# Robust SciFact loader (handles schema/split differences)
from datasets import load_dataset


def load_scifact_corpus(max_docs=300):
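    # BEIR-style repos on the Hub usually keep the documents in a separate
    # "corpus" config; fall back to the default config if that is unavailable.
    try:
        ds = load_dataset("mteb/scifact", "corpus")
    except Exception:
        ds = load_dataset("mteb/scifact")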

    if isinstance(ds, dict) or hasattr(ds, "keys"):
        if "corpus" in ds:
            split = ds["corpus"]
            split_name = "corpus"
        elif "train" in ds:
            split = ds["train"]
            split_name = "train"
        else:
            first_key = list(ds.keys())[0]
            split = ds[first_key]
            split_name = first_key
    else:
        split = ds
        split_name = "(single split)"

    raw_df = split.to_pandas()
    print(f"Loaded split: {split_name}")
    print(f"Available columns: {list(raw_df.columns)}")

    df = raw_df.copy()

    # Ensure doc_id exists
    if "doc_id" not in df.columns:
        if "_id" in df.columns:
            df["doc_id"] = df["_id"].astype(str)
        else:
            df["doc_id"] = df.index.astype(str)

    # Resolve title/text columns robustly
    title_candidates = ["title", "document_title", "paper_title"]
    text_candidates = ["text", "abstract", "contents", "content", "document"]

    title_col = next((c for c in title_candidates if c in df.columns), None)
    text_col = next((c for c in text_candidates if c in df.columns), None)

    if text_col is None:
        # Fallback: first object column that is not known metadata
        object_cols = [c for c in df.columns if df[c].dtype == object and c not in {"doc_id", title_col}]
        if object_cols:
            text_col = object_cols[0]

    if title_col is None:
        df["title"] = ""
    else:
        df["title"] = df[title_col].fillna("").astype(str)

    if text_col is None:
        raise ValueError(f"Could not identify a text column from: {list(df.columns)}")
    df["text"] = df[text_col].fillna("").astype(str)

    df = df[["doc_id", "title", "text"]].copy()
    df["full_text"] = (df["title"].str.strip() + ". " + df["text"].str.strip()).str.strip(" .")
    df = df[df["full_text"].str.len() > 20].reset_index(drop=True)

    if len(df) > max_docs:
        df = df.head(max_docs).reset_index(drop=True)

    return df


corpus_df = load_scifact_corpus(max_docs=300)

print(f"\nCorpus size: {len(corpus_df)} documents")
print("\nExample document:")
print(corpus_df.iloc[0]["full_text"][:300])


## 2. Encoding the Corpus

We use a **bi-encoder** approach:

1. Encode corpus documents once (offline)
2. Encode each query at runtime (online)
3. Compare vectors in FAISS

For retrieval, we use `intfloat/e5-small`.

Important: E5 models expect explicit prefixes:
- corpus docs: `passage: ...`
- queries: `query: ...`

Without these prefixes, retrieval quality drops noticeably.
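
The offline/online split can be sketched without downloading a model. The example below is only an illustration under stated assumptions: `toy_encode` is a made-up stand-in (a hashed bag-of-words, not a semantic encoder), `toy_search` is a hypothetical helper, and the E5 prefixes are left out because the toy encoder would treat them as ordinary words.

```python
import math
import re


def toy_encode(text: str, dim: int = 64) -> list[float]:
    """Stand-in encoder: hashed bag-of-words, L2-normalized. NOT semantic."""
    vec = [0.0] * dim
    for token in re.findall(r"[a-z]+", text.lower()):
        # Deterministic bucket per token (sum of character codes modulo dim).
        vec[sum(ord(ch) for ch in token) % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


corpus = [
    "vaccines train the immune system",
    "air pollution harms the lungs",
    "gene mutations can cause cancer",
]

# Offline step: encode every document once and keep the vectors.
corpus_vectors = [toy_encode(doc) for doc in corpus]


def toy_search(query: str, top_k: int = 2) -> list[tuple[float, str]]:
    """Online step: encode only the query, then score it against stored vectors."""
    q = toy_encode(query)
    scored = [
        (sum(a * b for a, b in zip(q, d)), doc)
        for d, doc in zip(corpus_vectors, corpus)
    ]
    return sorted(scored, reverse=True)[:top_k]


for score, doc in toy_search("pollution and the lungs"):
    print(f"{score:.3f}  {doc}")
```

However many queries arrive, the corpus vectors are computed only once; each query costs one encoder call plus a scan over stored vectors. The real pipeline in this notebook follows the same shape, with the E5 model in place of `toy_encode` and FAISS in place of the Python loop.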


In [None]:
# English retrieval model (compact, strong)
MODEL_NAME = "intfloat/e5-small"
model = SentenceTransformer(MODEL_NAME)


def format_passage(text: str) -> str:
    return f"passage: {text.strip()}"


def format_query(text: str) -> str:
    return f"query: {text.strip()}"


corpus_inputs = [format_passage(t) for t in corpus_df["full_text"].tolist()]

print(f"Encoding corpus with {MODEL_NAME}...")
start = time.time()
corpus_embeddings = model.encode(
    corpus_inputs,
    show_progress_bar=True,
    batch_size=64,
    normalize_embeddings=True,
)
print(f"Encoded {len(corpus_embeddings)} documents in {time.time()-start:.1f}s")
print(f"Embedding shape: {corpus_embeddings.shape}")


## 3. Building a FAISS Index

**FAISS** (Facebook AI Similarity Search) is an open-source library developed by Meta AI for efficient similarity search over dense vectors. It is widely used for this task, and its approximate indexes scale to billions of vectors.

FAISS offers many index types. We use **`IndexFlatIP`** (Flat Index with Inner Product):

| Index type | Description | Speed | Accuracy |
|---|---|---|---|
| `IndexFlatIP` | Exact inner product search (brute force) | Slower for huge corpora | 100% exact |
| `IndexFlatL2` | Exact L2 distance search | Slower for huge corpora | 100% exact |
| `IndexIVFFlat` | Approximate search with inverted file | Fast | Very good |
| `IndexHNSWFlat` | Approximate search with a graph structure | Very fast | Very good |

For our 300-document corpus, `IndexFlatIP` is a good fit: it returns exact results and is fast at this size. For millions of documents, you would typically switch to an approximate index such as `IndexIVFFlat` or `IndexHNSWFlat`.
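
The idea behind an inverted-file index can be illustrated without FAISS. The sketch below is a toy, pure-Python illustration and not FAISS's implementation: it assumes random unit vectors, picks centroids by sampling instead of running k-means, and every name in it (`ivf_search`, `n_probe`, `N_LISTS`, and so on) is made up for this example.

```python
import random

random.seed(0)
DIM, N_VECTORS, N_LISTS = 8, 200, 10


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def unit(v):
    norm = sum(x * x for x in v) ** 0.5
    return [x / norm for x in v]


vectors = [unit([random.gauss(0, 1) for _ in range(DIM)]) for _ in range(N_VECTORS)]

# "Training": choose centroids. Real IVF indexes cluster the data; sampling is a shortcut.
centroids = random.sample(vectors, N_LISTS)

# Build the inverted lists: each vector id goes into the list of its nearest centroid.
inverted_lists = {c: [] for c in range(N_LISTS)}
for vec_id, vec in enumerate(vectors):
    nearest = max(range(N_LISTS), key=lambda c: dot(vec, centroids[c]))
    inverted_lists[nearest].append(vec_id)


def exact_search(query, top_k):
    """Brute force: score every vector (what a flat index does)."""
    ranked = sorted(range(N_VECTORS), key=lambda i: dot(query, vectors[i]), reverse=True)
    return ranked[:top_k]


def ivf_search(query, top_k, n_probe):
    """Score only the vectors stored in the n_probe closest lists."""
    probed = sorted(range(N_LISTS), key=lambda c: dot(query, centroids[c]), reverse=True)[:n_probe]
    candidates = [i for c in probed for i in inverted_lists[c]]
    ranked = sorted(candidates, key=lambda i: dot(query, vectors[i]), reverse=True)
    return ranked[:top_k], len(candidates)


query = unit([random.gauss(0, 1) for _ in range(DIM)])
truth = set(exact_search(query, top_k=5))
for n_probe in (1, 3, N_LISTS):
    found, n_scored = ivf_search(query, top_k=5, n_probe=n_probe)
    recall = len(truth & set(found)) / len(truth)
    print(f"n_probe={n_probe:>2}: scored {n_scored:>3}/{N_VECTORS} vectors, recall@5={recall:.2f}")
```

Probing more lists scores more vectors and moves the result toward the exact answer; probing all lists reproduces brute-force search. That is the speed/accuracy trade-off summarized in the table.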

**Why Inner Product?** Because we normalized our embeddings to unit length, the inner product between two vectors equals their cosine similarity. Higher score = more similar.
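
This follows from the definition of cosine similarity, which divides the inner product by the two vector lengths; when both lengths are 1, the division changes nothing. A small check with hand-picked vectors (the helper names `dot`, `cosine`, and `normalize` are just for this sketch):

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


a, b = [3.0, 4.0], [4.0, 3.0]

raw_ip = dot(a, b)                          # depends on vector length
cos_sim = cosine(a, b)                      # length-independent
unit_ip = dot(normalize(a), normalize(b))   # inner product of unit vectors

print(f"raw inner product:        {raw_ip:.2f}")
print(f"cosine similarity:        {cos_sim:.2f}")
print(f"inner product after norm: {unit_ip:.2f}")
```

The raw inner product differs from the cosine similarity, while the inner product of the normalized vectors matches it. This is why `normalize_embeddings=True` and `IndexFlatIP` belong together.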

In [None]:
import faiss

# Create a FAISS index
# e5-small produces 384-dimensional vectors
dimension = corpus_embeddings.shape[1]
index = faiss.IndexFlatIP(dimension)  # Inner Product == cosine similarity for normalized vectors

# Add vectors to the index
index.add(corpus_embeddings.astype("float32"))

print(f"FAISS index built: {index.ntotal} vectors, {dimension} dimensions")


## 4. Searching!

Now for the fun part. We define a `search()` function that:

1. Takes a natural-language query string
2. Encodes it into an embedding with the same model
3. Searches the FAISS index for the `top_k` most similar documents
4. Returns a clean DataFrame with ranks, scores, titles, and text previews

Let's test it with several queries and see what comes back.

In [None]:
def search(query: str, top_k: int = 5) -> pd.DataFrame:
    """Search the corpus for documents matching the query."""
    query_embedding = model.encode(
        [format_query(query)],
        normalize_embeddings=True,
    ).astype("float32")

    scores, indices = index.search(query_embedding, top_k)

    results = []
    for score, idx in zip(scores[0], indices[0]):
        if idx == -1:  # FAISS pads with -1 when it finds fewer than top_k hits
            continue
        results.append({
            "rank": len(results) + 1,
            "score": float(score),
            "title": corpus_df.iloc[idx]["title"],
            "text": corpus_df.iloc[idx]["full_text"][:200] + "...",
        })
    return pd.DataFrame(results)


# Test queries
queries = [
    "effects of climate change on biodiversity",
    "how do vaccines work",
    "machine learning for medical diagnosis",
    "genetic factors in cancer risk",
    "air pollution and respiratory disease",
]

for q in queries:
    print(f"\n{'='*60}")
    print(f"Query: {q}")
    print(f"{'='*60}")
    results = search(q, top_k=3)
    for _, row in results.iterrows():
        print(f"  [{row['rank']}] (score: {row['score']:.3f}) {row['title']}")


## 5. Evaluating Retrieval Quality

How do we know if our search engine is any good? We need **retrieval evaluation metrics**.

The most intuitive metric is **Precision@k**: of the top *k* results returned, how many are actually relevant?

$$\text{Precision@k} = \frac{\text{Number of relevant documents in top } k}{k}$$

For example, if we retrieve 5 documents and 3 are relevant, Precision@5 = 3/5 = 60%.

**The hard part:** Determining relevance usually requires human judgment. For this demo, we will use a simple proxy — checking whether retrieved documents contain keywords related to the query topic. This is imperfect, but gives a rough signal.

In a real evaluation, you would use a benchmark dataset with human-annotated relevance judgments (like BEIR, MTEB, or TREC).
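
With annotated judgments, Precision@k reduces to a set lookup. A minimal sketch, assuming a ranked list of document ids and a set of ids judged relevant for one query (the ids below are hypothetical, and `precision_at_k` is a helper written for this example):

```python
def precision_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the top-k retrieved ids that appear in the relevant set."""
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = list(retrieved_ids)[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant_ids)
    return hits / k


# Hypothetical ids for one query: a ranked result list and a judged-relevant set.
retrieved = ["d7", "d2", "d9", "d4", "d1"]
relevant = {"d2", "d4", "d7", "d8"}

print(precision_at_k(retrieved, relevant, k=5))  # 3 of 5 are relevant -> 0.6
print(precision_at_k(retrieved, relevant, k=1))  # top hit is relevant -> 1.0
```

Dividing by `k` rather than by the number of returned results matches the formula above and penalizes a system that returns fewer than `k` documents.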

In [None]:
# Define queries with expected relevant terms
eval_queries = [
    {"query": "vaccine effectiveness against viral infections", 
     "relevant_terms": ["vaccine", "immunization", "viral", "antibod"]},
    {"query": "genetic mutations and cancer development",
     "relevant_terms": ["genetic", "mutation", "cancer", "tumor", "oncog"]},
    {"query": "impact of air pollution on health",
     "relevant_terms": ["pollution", "air", "respiratory", "particulate"]},
]

# Simple keyword-based relevance proxy.
# Caveats: this is substring matching (so "air" also matches "repair"), and it
# only sees the ~200-character preview that search() returns in the "text" column.
def is_relevant(doc_text, relevant_terms):
    doc_lower = doc_text.lower()
    return any(term in doc_lower for term in relevant_terms)

K = 5
for eq in eval_queries:
    results = search(eq["query"], top_k=K)
    relevant = sum(is_relevant(row["text"], eq["relevant_terms"]) for _, row in results.iterrows())
    precision = relevant / K  # divide by k, as in the Precision@k formula
    print(f"Query: '{eq['query']}'")
    print(f"  Precision@{K}: {precision:.0%} ({relevant}/{K} relevant)")

## 6. Multilingual Search with `intfloat/multilingual-e5-small`

Now we switch to a compact multilingual retrieval model.

Compared to `bge-m3`, this model is much lighter while still enabling cross-lingual retrieval for teaching-scale demos.

As with E5-small, we must keep prefixes:
- documents: `passage: ...`
- queries: `query: ...`


In [None]:
# Compact multilingual retrieval model
ML_MODEL_NAME = "intfloat/multilingual-e5-small"
print(f"Loading {ML_MODEL_NAME} ...")
ml_model = SentenceTransformer(ML_MODEL_NAME)


def format_passage_ml(text: str) -> str:
    return f"passage: {text.strip()}"


def format_query_ml(text: str) -> str:
    return f"query: {text.strip()}"


ml_inputs = [format_passage_ml(t) for t in corpus_df["full_text"].tolist()]
ml_embeddings = ml_model.encode(
    ml_inputs,
    show_progress_bar=True,
    batch_size=64,
    normalize_embeddings=True,
)

ml_dimension = ml_embeddings.shape[1]
ml_index = faiss.IndexFlatIP(ml_dimension)
ml_index.add(ml_embeddings.astype("float32"))


def search_multilingual(query: str, top_k: int = 5):
    """Search using multilingual E5 embeddings."""
    q_emb = ml_model.encode([format_query_ml(query)], normalize_embeddings=True).astype("float32")
    scores, indices = ml_index.search(q_emb, top_k)
    results = []
    for score, idx in zip(scores[0], indices[0]):
        if idx == -1:  # FAISS pads with -1 when it finds fewer than top_k hits
            continue
        results.append({
            "rank": len(results) + 1,
            "score": float(score),
            "title": corpus_df.iloc[idx]["title"],
        })
    return pd.DataFrame(results)


print(f"\nMultilingual FAISS index built: {ml_index.ntotal} vectors, {ml_dimension} dimensions")


In [None]:
# Same meaning, different languages
cross_lingual_queries = [
    ("English", "effects of vaccination on immune response"),
    ("French", "effets de la vaccination sur la réponse immunitaire"),
    ("German", "Auswirkungen der Impfung auf die Immunantwort"),
    ("Spanish", "efectos de la vacunación en la respuesta inmune"),
]

print("Cross-lingual retrieval test:")
print("="*60)
for lang, query in cross_lingual_queries:
    results = search_multilingual(query, top_k=3)
    print(f"\n[{lang}] '{query}'")
    for _, row in results.iterrows():
        print(f"  [{row['rank']}] ({row['score']:.3f}) {row['title'][:60]}")

## From FAISS to a Vector Database: ChromaDB

FAISS is excellent for understanding how vector search works under the hood — but in production, you often want a **vector database** that handles persistence, metadata filtering, and API convenience for you.

**ChromaDB** is a lightweight, open-source vector database that is perfect for prototyping and small-to-medium scale applications. Here is how it compares to raw FAISS:

| Feature | FAISS | ChromaDB |
|---------|-------|----------|
| **Persistence** | Manual (save/load index files) | Built-in when using `PersistentClient` |
| **Metadata** | Not supported (vectors only) | Filter by any metadata field |
| **Embedding** | Bring your own (encode externally) | Built-in default embedding function (custom ones can be plugged in) |
| **API** | Low-level NumPy arrays | High-level Python API |
| **Scale** | Billions of vectors | Millions of vectors |
| **Best for** | Research, max performance | Prototyping, applications, RAG |

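Let's build a comparable search engine using ChromaDB and notice how much less code is needed. Because ChromaDB embeds the text with its own default model rather than E5, the rankings will not match the FAISS results exactly.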

In [None]:
!pip install chromadb -q

import chromadb

# Create an in-memory ChromaDB client (use PersistentClient for disk storage)
chroma_client = chromadb.Client()

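# Create a collection; ChromaDB embeds documents with its default embedding function.
# get_or_create_collection lets you re-run this cell without a "collection exists" error.
collection = chroma_client.get_or_create_collection(
    name="scifact_abstracts",
    metadata={"hnsw:space": "cosine"}  # Use cosine distance
)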

# Add documents (ChromaDB embeds them using its default model)
collection.add(
    documents=corpus_df['full_text'].tolist(),
    ids=[str(i) for i in range(len(corpus_df))],
    metadatas=[{"title": t} for t in corpus_df['title'].tolist()]
)

# Query — just pass a string, ChromaDB handles the rest
results = collection.query(
    query_texts=["effects of vaccination on immune response"],
    n_results=5
)

# With the "cosine" space, ChromaDB reports a distance (1 - cosine similarity): lower is closer.
print("ChromaDB search results:")
for i, (doc, meta, dist) in enumerate(zip(
    results['documents'][0], results['metadatas'][0], results['distances'][0]
)):
    print(f"  [{i+1}] (distance: {dist:.3f}) {meta['title']}")

### When to Use Which?

- **Use FAISS** when you need to understand vector search internals, need maximum performance at scale (billions of vectors), or want full control over the index type and parameters.
- **Use ChromaDB** when you are building applications, prototyping RAG pipelines, or need metadata filtering and persistence without managing index files manually.
- **Use a managed service** (Pinecone, Weaviate, Qdrant) when you need production-grade infrastructure with replication, auth, and monitoring.

In practice, a common path is to prototype with ChromaDB and move to a managed service as requirements grow. The concepts you learned with FAISS (embeddings, similarity search, index types) transfer directly.

## 7. Exercise: Build Your Own Search Engine

Now it's your turn. Build a semantic search engine over a different corpus.

**Suggestions:**
- Load a different dataset (Wikipedia snippets, news, or your own documents)
- Define at least 3 meaningful search queries
- Evaluate Precision@5 for each query
- Compare:
  - `intfloat/e5-small` (English-focused)
  - `intfloat/multilingual-e5-small` (cross-lingual)

**Bonus:** Try `IndexIVFFlat` instead of `IndexFlatIP` and compare speed/accuracy.


In [None]:
# YOUR CODE HERE

# Step 1: Load a corpus
# e.g., dataset = load_dataset("wikimedia/wikipedia", "20231101.simple", split="train[:500]")


# Step 2: Encode the corpus with a sentence-transformer model


# Step 3: Build a FAISS index


# Step 4: Define search queries and test them


# Step 5: Evaluate Precision@5 with relevance terms


## 8. Summary & Takeaways

In this notebook, we built a complete semantic search system with FAISS and compact E5 models.

| Concept | What we learned |
|---|---|
| **Semantic search** | Dense embeddings capture meaning, not just keywords. |
| **FAISS** | `IndexFlatIP` gives exact nearest-neighbor retrieval over normalized vectors. |
| **Bi-encoder paradigm** | Encode corpus once, then encode each query and search fast. |
| **E5 prompt format** | For E5 models, `passage:` for docs and `query:` for queries is critical for quality. |
| **Multilingual retrieval** | `intfloat/multilingual-e5-small` enables cross-lingual search without a heavy model. |

### Notes for practice

- If your corpus is English-only and you need speed, `intfloat/e5-small` is a strong default.
- If users query in multiple languages, switch to `intfloat/multilingual-e5-small`.
- For large production corpora, move from `IndexFlatIP` to approximate indices (`IVF`, `HNSW`) and add reranking.

In **NB07**, we add reranking on top of retrieval for better precision.
