Before you turn this problem in, make sure everything runs as expected. First, **restart the kernel** (in the menubar, select Kernel $\rightarrow$ Restart) and then **run all cells** (in the menubar, select Cell $\rightarrow$ Run All).

Make sure you fill in any place that says `YOUR CODE HERE` or "YOUR ANSWER HERE", as well as your name and collaborators below:

In [None]:
NAME = ""
COLLABORATORS = ""

---

# Information Retrieval and Text Mining Course
## Tutorial 10 — Conversational Search: The Basics

**Author:** Jan Scholtes

**Edition 2025-2026**

Department of Advanced Computer Sciences — Maastricht University

Welcome to Tutorial 10 on **Conversational Search: The Basics**. This is the first of three tutorials on Conversational Search:

1. **The Basics** (this tutorial) — dialogue structure, query understanding, hybrid search, re-ranking, evaluation metrics
2. **RAG — Retrieval-Augmented Generation** (Tutorial 11) — fusion methods, hallucination detection
3. **Agentic Approaches** (Tutorial 12) — agents, memory, tools, multi-agent orchestration

In this tutorial we explore how conversational AI transforms traditional search from keyword matching into intent-driven, multi-turn dialogue systems. The topics covered are:

1. **Properties of Human Conversations** — turns, grounding, adjacency pairs, mixed initiative, sub-dialogues.
2. **From Keywords to Conversations** — the evolution from keyword-based → intent-based → conversational search.
3. **Query Understanding & Reformulation** — NER, query expansion, coreference resolution, context tracking.
4. **BM25: The Classic Retrieval Model** — term frequency, inverse document frequency, document-length normalization.
5. **Dense Retrieval with Sentence Transformers** — encoding queries and documents as embeddings.
6. **Hybrid Search: BM25 + Dense Retrieval** — combining lexical and semantic search with score fusion.
7. **Neural Re-Ranking** — using cross-encoders to re-score retrieved documents.
8. **Evaluation Metrics for Conversational AI** — semantic similarity, answer relevancy, context relevancy, faithfulness, RAGAS.
9. **Commercial Systems** — how Bing, Google, and Perplexity implement conversational search.

At the end you will find the **Exercises** section with graded assignments.

> **Note:** This course is about Information Retrieval, Text Mining, and Conversational Search — not about programming skills. The code cells below show you *how* these methods work in practice using Python libraries. Focus on understanding the **concepts** and **results**.

## Library Installation

We install all required packages in a single cell. Run this cell once at the beginning of your session.

In [None]:
# Install required packages
import subprocess, sys

packages = [
    "sentence-transformers",
    "rank_bm25",
    "spacy",
    "faiss-cpu",
    "scikit-learn",
]
for pkg in packages:
    subprocess.check_call([sys.executable, "-m", "pip", "install", "-q", pkg])

# Download spaCy English model (if not already installed)
subprocess.check_call([sys.executable, "-m", "spacy", "download", "en_core_web_sm", "-q"])

print("All packages installed successfully.")

## Library Imports

All imports are grouped here so the notebook is easy to set up and run.

In [None]:
# Core Python
import warnings
warnings.filterwarnings("ignore")

import numpy as np
from collections import Counter

# NLP
import spacy

# BM25
from rank_bm25 import BM25Okapi

# Sentence Transformers (dense retrieval)
from sentence_transformers import SentenceTransformer, CrossEncoder

# FAISS for vector search
import faiss

# Scikit-learn utilities
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import MinMaxScaler

# Load spaCy model
nlp = spacy.load("en_core_web_sm")

print("All libraries imported successfully.")

## Sample Corpus

We create a small corpus of text passages about Information Retrieval topics. This corpus will be used throughout the tutorial to demonstrate search techniques. In a real system, these would be passages retrieved from a large document collection.

In [None]:
# Sample corpus: short passages about IR topics
corpus = [
    "BM25 is a bag-of-words retrieval function that ranks documents based on term frequency and inverse document frequency. It is widely used in search engines like Elasticsearch and Apache Solr.",
    "Dense retrieval uses neural network embeddings to encode queries and documents into a shared vector space. Models like DPR and ColBERT achieve state-of-the-art performance on passage retrieval benchmarks.",
    "TF-IDF stands for Term Frequency-Inverse Document Frequency. It is a statistical measure that evaluates the importance of a word in a document relative to a collection of documents.",
    "PageRank is an algorithm used by Google Search to rank web pages. It works by counting the number and quality of links to a page to determine a rough estimate of the website's importance.",
    "BERT (Bidirectional Encoder Representations from Transformers) is a language model developed by Google. It uses masked language modeling and next sentence prediction for pre-training.",
    "Retrieval-Augmented Generation (RAG) combines a retrieval system with a language model. The retriever finds relevant documents, and the generator produces an answer grounded in those documents.",
    "Conversational search systems maintain context across multiple turns of dialogue. They use coreference resolution and query reformulation to understand follow-up questions.",
    "The RAGAS framework evaluates RAG pipelines using metrics like faithfulness, answer relevancy, and context relevancy. It helps detect hallucinations in generated answers.",
    "ColBERT uses late interaction between query and document token embeddings for efficient retrieval. Each query token attends to all document tokens via MaxSim operations.",
    "Query expansion improves recall by adding synonyms or related terms to the original query. Techniques include using word embeddings (Word2Vec, GloVe) and knowledge graphs.",
    "Hybrid search combines BM25 keyword matching with dense vector retrieval. Score fusion merges both ranking signals for improved precision and recall.",
    "Cross-encoder re-ranking models take a query-document pair as input and output a relevance score. They are more accurate than bi-encoders but too slow for first-stage retrieval.",
    "Grounding in conversational AI means establishing common understanding between the user and the system. The system confirms what it understood before proceeding.",
    "Mixed-initiative dialogue allows both the user and the system to take the lead in a conversation. The system can ask clarifying questions when the query is ambiguous.",
    "Perplexity AI is a conversational search engine that combines real-time web retrieval with LLM-based answer synthesis. Every claim is linked to a verifiable source.",
]

# Labels for reference
corpus_labels = [
    "BM25", "Dense Retrieval", "TF-IDF", "PageRank", "BERT",
    "RAG", "Conversational Search", "RAGAS", "ColBERT", "Query Expansion",
    "Hybrid Search", "Cross-Encoder Re-ranking", "Grounding", "Mixed Initiative", "Perplexity AI",
]

print(f"Corpus loaded: {len(corpus)} passages")
for i, (label, text) in enumerate(zip(corpus_labels, corpus)):
    print(f"  [{i:2d}] {label}: {text[:80]}...")

---
# 1. Properties of Human Conversations

Before building conversational search systems, we must understand what makes human conversations work. Natural dialogue is remarkably complex:

## Key Properties

| Property | Description | Example |
|---|---|---|
| **Turns** | Each contribution is a "turn"; participants alternate | User asks → System answers → User follows up |
| **Grounding** | Establishing common understanding; confirming what was heard | System: "Okay, searching for flights to Paris..." |
| **Adjacency Pairs** | Local structure between speech acts (Sacks et al., 1974) | Question → Answer, Proposal → Accept/Reject |
| **Sub-dialogues** | Embedded corrections or clarifications within the main dialogue | User: "No, I meant *New York*, not *New Haven*" |
| **Mixed Initiative** | Both parties can lead the conversation (Walker & Whittaker, 1990) | System: "Did you mean the city or the state?" |
| **Knowledge Inference** | Inferring implicit constraints from context | "meeting on Monday" → implies travel on Sunday |

## Why This Matters for Search

Traditional search is **single-turn**: the user types a query, gets results, and the interaction ends. Conversational search must additionally handle:

- **Multi-turn context** — remembering what was discussed earlier
- **Ambiguity resolution** — asking clarifying questions
- **Coreference** — understanding that "it", "they", "those" refer to earlier entities
- **Conversational memory** — maintaining both short-term and long-term context

## The Grounding Problem

Grounding (Clark, 1996) rests on the **Principle of Closure**: agents performing an action need evidence that they have succeeded. Good conversational systems therefore acknowledge what they understood before acting on it:

| | System Response |
|---|---|
| **Bad** ❌ | "Here are your results." (no acknowledgment) |
| **Good** ✅ | "Okay, searching for Italian restaurants in New York near you..." |

Let's demonstrate some of these properties with NLP tools:

In [None]:
# Demonstration: NER and query understanding with spaCy

# Simulated multi-turn conversation
turns = [
    "Find me Italian restaurants in New York.",
    "Which ones have outdoor seating?",       # Coreference: "ones" → Italian restaurants in NY
    "What about their prices?",               # Coreference: "their" → the ones with outdoor seating
    "Book the cheapest one for Friday.",       # Coreference: "one" → cheapest restaurant
]

print("=== Multi-Turn Conversation Analysis ===\n")
for i, turn in enumerate(turns, 1):
    doc = nlp(turn)
    entities = [(ent.text, ent.label_) for ent in doc.ents]
    
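    # Detect potential coreference markers: third-person pronouns plus "one"/"ones".
    # First-person pronouns ("me") and question words ("what") need no antecedent.
    pronouns = [token.text for token in doc
                if (token.pos_ == "PRON" and "3" in token.morph.get("Person"))
                or token.text.lower() in ("ones", "one")]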
    
    print(f"Turn {i}: \"{turn}\"")
    print(f"  Entities: {entities if entities else 'None detected'}")
    print(f"  Pronouns/References: {pronouns if pronouns else 'None'}")
    if pronouns:
        print(f"  ⚠ Coreference resolution needed — '{', '.join(pronouns)}' refers to context from earlier turns")
    print()

print("--- Key Insight ---")
print("Turns 2-4 contain references ('ones', 'their', 'one') that only make sense")
print("when resolved against the context from Turn 1. A single-turn search engine")
print("would fail completely on these follow-up queries.")

**Observation:** Turns 2–4 contain pronouns and references that require **coreference resolution** — linking "ones", "their", and "one" back to the Italian restaurants from Turn 1. Without this, a search engine would not understand the follow-up queries.

This demonstrates why conversational search needs more than just keyword matching: it needs **dialogue management**, **context tracking**, and **grounding**.

---
# 2. From Keyword-Based to Conversational Search

The evolution of search can be understood as three stages:

## Stage 1: Keyword-Based Search (Traditional)

- **Mechanism:** Matching exact words/phrases using TF-IDF, BM25, inverted index
- **Ranking:** Term frequency, document frequency, PageRank
- **Limitation:** No understanding of intent — "apple" could mean the fruit or the company
- **Challenges:** Polysemy, synonyms, no context retention

## Stage 2: Intent-Based Search

- **Mechanism:** Identifying the *underlying intent* behind a query
- **Technologies:** NER, dependency parsing, query expansion (Word2Vec, GloVe, BERT), knowledge graphs
- **Advantage:** Better synonym/paraphrase handling, semantic understanding
- **Example:** Google's BERT update (2019) — understanding queries like "can you get medicine for someone pharmacy" (the word "for" changes meaning)

## Stage 3: Conversational Search

- **Mechanism:** Multi-turn dialogue with natural query refinement
- **Technologies:** LLM integration, RAG, conversational memory, coreference resolution
- **Key difference:** Synthesizes answers instead of returning links
- **Challenges:** Hallucination, bias, scalability, privacy

| Feature | Keyword Search | Intent Search | Conversational Search |
|---|---|---|---|
| **Query format** | Keywords, Boolean | Natural language | Multi-turn dialogue |
| **Understanding** | Lexical matching | Semantic intent | Full context + history |
| **Response** | Ranked links | Relevant results + snippets | Synthesized answer |
| **Memory** | None | Session-limited | Multi-turn + long-term |
| **Proactivity** | None | Limited suggestions | Clarifying questions |

Let's now implement each search strategy to see the differences in practice.

---
# 3. Query Understanding & Reformulation

Before a search engine retrieves documents, it must **understand** the query. In conversational search, this involves several steps:

## A. NER & POS Tagging
Identify entities (people, places, organizations) and parts of speech to understand *what* the user is asking about.

## B. Query Clarification
Detect ambiguous queries and ask follow-up questions to clarify intent.

## C. Query Expansion
Improve recall by adding semantically related terms:
- **Word Embeddings** (Word2Vec, GloVe, FastText) — find similar words in vector space
- **Thesauri** — lexical expansion with synonyms
- **Knowledge Graphs** (DBpedia, Google Knowledge Graph) — e.g., "heart attack" → "myocardial infarction"
- **Neural Reformulation** — dense retrieval models (DPR, ColBERT)

## D. Context Tracking in Multi-Turn Conversations
- **Coreference resolution** — "Italian ones" → "Italian restaurants in New York"
- **Session-based search** — maintaining state across turns
- **Conversational memory** — short-term (context window) + long-term (preference vectors in FAISS)
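
Session-based context tracking can be sketched with the simplest possible strategy: prepend the most recent turns to the new utterance before retrieval. The class name `ConversationContext` and its `max_turns` parameter are hypothetical, and plain concatenation is only a stand-in for real coreference resolution or a learned query rewriter.

```python
from collections import deque


class ConversationContext:
    """Keep the last few user turns and build a context-aware query string."""

    def __init__(self, max_turns=3):
        # Short-term memory: older turns fall out automatically.
        self.history = deque(maxlen=max_turns)

    def contextualize(self, utterance):
        # Naive reformulation: earlier turns + current utterance.
        query = " ".join(list(self.history) + [utterance])
        self.history.append(utterance)
        return query


ctx = ConversationContext(max_turns=2)
first_query = ctx.contextualize("Find me Italian restaurants in New York.")
second_query = ctx.contextualize("Which ones have outdoor seating?")
print(second_query)
```

The second query now carries the words "Italian restaurants" and "New York", so even a keyword retriever has something to match, although the word "ones" itself remains unresolved.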

Let's demonstrate query expansion using sentence embeddings:

In [None]:
# Query Expansion using Sentence Embeddings
# We find terms semantically related to the query using cosine similarity

# Load a sentence transformer model
model = SentenceTransformer("all-MiniLM-L6-v2")

# Original query
query = "heart attack symptoms"

# Candidate expansion terms (in practice, these would come from a thesaurus or knowledge graph)
expansion_candidates = [
    "myocardial infarction",      # Medical synonym
    "chest pain",                  # Symptom
    "shortness of breath",         # Symptom
    "cardiovascular disease",      # Related broader term
    "aspirin treatment",           # Related treatment
    "broken bone",                 # Unrelated
    "software engineering",        # Completely unrelated
    "cardiac arrest",              # Related but different
    "stroke symptoms",             # Related condition
    "high blood pressure",         # Risk factor
]

# Encode query and candidates
query_emb = model.encode([query])
candidate_embs = model.encode(expansion_candidates)

# Compute cosine similarities
similarities = cosine_similarity(query_emb, candidate_embs)[0]

# Rank by similarity
ranked = sorted(zip(expansion_candidates, similarities), key=lambda x: x[1], reverse=True)

print(f"Query: \"{query}\"\n")
print(f"{'Expansion Term':<30} {'Similarity':>10}")
print("-" * 42)
for term, sim in ranked:
    marker = " ✓ expand" if sim > 0.4 else ""
    print(f"{term:<30} {sim:>10.4f}{marker}")

print(f"\n--- Key Insight ---")
print(f"Terms like 'myocardial infarction' and 'chest pain' score high because")
print(f"they are semantically related. 'software engineering' scores low.")
print(f"A threshold (e.g., > 0.4) selects useful expansion terms.")

---
# 4. BM25: The Classic Retrieval Model

**BM25** (Best Matching 25) is one of the most widely used retrieval functions in production search engines; it is the default similarity in Elasticsearch and Apache Solr. It scores a document $D$ against a query $Q$ as:

$$\text{BM25}(D, Q) = \sum_{i=1}^{n} \text{IDF}(q_i) \cdot \frac{f(q_i, D) \cdot (k_1 + 1)}{f(q_i, D) + k_1 \cdot \left(1 - b + b \cdot \frac{|D|}{\text{avgdl}}\right)}$$

Where:
- $f(q_i, D)$ = frequency of term $q_i$ in document $D$
- $|D|$ = document length
- $\text{avgdl}$ = average document length in the collection
- $k_1$ = term frequency saturation parameter (typically 1.2–2.0)
- $b$ = document length normalization parameter (typically 0.75)
- $\text{IDF}(q_i) = \log\frac{N - n(q_i) + 0.5}{n(q_i) + 0.5}$ where $N$ = total documents, $n(q_i)$ = documents containing $q_i$
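
To make the formula concrete, here is a minimal pure-Python sketch that applies it term by term. The function name `bm25_scores`, the toy documents, and the choice $k_1 = 1.5$ are assumptions for illustration. Library implementations may smooth or floor the IDF, so their scores need not match this sketch exactly.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, tokenized_docs, k1=1.5, b=0.75):
    """Score every document against the query with the BM25 formula above."""
    n_docs = len(tokenized_docs)
    avgdl = sum(len(d) for d in tokenized_docs) / n_docs
    # n(q_i): number of documents that contain each term
    doc_freq = Counter(term for d in tokenized_docs for term in set(d))

    scores = []
    for doc in tokenized_docs:
        tf = Counter(doc)
        score = 0.0
        for term in query_tokens:
            n_q = doc_freq.get(term, 0)
            idf = math.log((n_docs - n_q + 0.5) / (n_q + 0.5))
            f = tf.get(term, 0)
            denom = f + k1 * (1 - b + b * len(doc) / avgdl)
            score += idf * f * (k1 + 1) / denom
        scores.append(score)
    return scores


toy_docs = [
    "bm25 ranks documents by term frequency".split(),
    "dense retrieval uses neural embeddings".split(),
    "pagerank ranks web pages by links".split(),
    "query expansion adds related terms".split(),
]
toy_scores = bm25_scores(["bm25", "ranks"], toy_docs)
print([round(s, 3) for s in toy_scores])
```

Only the first document scores above zero. The term "ranks" appears in exactly half of the toy documents, so its IDF under this formula is $\log 1 = 0$ and it contributes nothing; for terms in more than half of the documents the IDF becomes negative.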

### Strengths
- Fast and efficient (inverted index lookup)
- Well-understood and reliable
- No training required

### Weaknesses
- Pure lexical matching — can't handle synonyms
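- Bag-of-words — ignores word order and semantics
- No notion of relatedness: the query "apple fruit" scores zero against a document containing only "pear banana orange", even though all of them are fruits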

Let's implement BM25 search on our corpus:

In [None]:
# BM25 Search
# Tokenize corpus (simple whitespace + lowercase)
tokenized_corpus = [doc.lower().split() for doc in corpus]

# Build BM25 index
bm25 = BM25Okapi(tokenized_corpus)

# Search queries
queries = [
    "How does BM25 ranking work?",
    "neural network embeddings for search",
    "what is retrieval augmented generation",
    "conversational dialogue systems",        # Tests synonym handling
]

print("=== BM25 Search Results ===\n")
for query in queries:
    tokenized_query = query.lower().split()
    scores = bm25.get_scores(tokenized_query)
    
    # Get top 3
    top_indices = np.argsort(scores)[::-1][:3]
    
    print(f"Query: \"{query}\"")
    for rank, idx in enumerate(top_indices, 1):
        print(f"  #{rank} [{corpus_labels[idx]}] score={scores[idx]:.3f}")
        print(f"       {corpus[idx][:100]}...")
    print()

**Observation:** BM25 works well when query terms directly match document terms (e.g., "BM25" or "retrieval augmented generation"). But notice that "conversational dialogue systems" may not find the best matches because BM25 relies on exact word overlap — it can't understand that "dialogue" and "conversation" are related.
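
One detail worth checking is the tokenizer. The whitespace split used in the cell above keeps punctuation attached to words, so a query token such as `work?` can never match `work` in a document. A minimal sketch of a punctuation-aware alternative follows; the helper name `simple_tokenize` is hypothetical, and the regular expression is one reasonable choice rather than a standard.

```python
import re


def simple_tokenize(text):
    """Lowercase the text and keep runs of letters/digits, dropping punctuation."""
    return re.findall(r"[a-z0-9]+", text.lower())


whitespace_tokens = "How does BM25 ranking work?".lower().split()
regex_tokens = simple_tokenize("How does BM25 ranking work?")
hyphen_tokens = simple_tokenize("Retrieval-Augmented Generation (RAG)")

print(whitespace_tokens)
print(regex_tokens)
print(hyphen_tokens)
```

The trade-off is that hyphenated names are split as well, so "TF-IDF" becomes the two tokens `tf` and `idf`. Whichever tokenizer is chosen, it must be applied identically to the corpus and to the queries.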

---
# 5. Dense Retrieval with Sentence Transformers

**Dense retrieval** encodes queries and documents into a shared vector space using neural networks. Instead of matching keywords, it measures **semantic similarity** between embeddings.

### How It Works

1. **Encode** all documents into dense vectors using a pre-trained model (e.g., all-MiniLM-L6-v2)
2. **Encode** the query into the same vector space
3. **Compute** cosine similarity between the query vector and all document vectors
4. **Rank** by similarity score

### Models

| Model | Type | Speed | Accuracy |
|---|---|---|---|
| **DPR** (Dense Passage Retrieval) | Bi-encoder | Fast | Good |
| **ANCE** | Bi-encoder with hard negatives | Fast | Better |
| **ColBERT** | Late interaction (token-level) | Medium | Very good |
| **Cross-Encoder** | Full attention on pair | Slow | Best |

### Bi-Encoder vs Cross-Encoder

- **Bi-encoder:** Encodes query and document *separately* → fast (can pre-compute doc embeddings), but less accurate
- **Cross-encoder:** Encodes query-document *pair together* → more accurate, but too slow for first-stage retrieval (used for re-ranking)
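
Late interaction, as used by ColBERT, sits between these two extremes: query and document tokens are encoded separately, but compared token by token at scoring time. The sketch below shows only the MaxSim scoring step on made-up 2-dimensional token vectors; the function names and vectors are assumptions for illustration, not ColBERT's actual code or embeddings.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def maxsim_score(query_vecs, doc_vecs):
    """Sum, over query tokens, of the best dot product with any document token."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


# Toy token embeddings (hypothetical values)
toy_query_vecs = [[1.0, 0.0], [0.0, 1.0]]
toy_doc_a = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]
toy_doc_b = [[0.9, 0.1], [0.7, 0.3]]

score_a = maxsim_score(toy_query_vecs, toy_doc_a)
score_b = maxsim_score(toy_query_vecs, toy_doc_b)
print(score_a, score_b)
```

Document A scores higher because it contains a good match for both query tokens, whereas document B matches only the first one well.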

Let's compare dense retrieval to BM25 on the same queries:

In [None]:
# Dense Retrieval with Sentence Transformers (Bi-Encoder)

# Encode all corpus documents (typically done offline)
corpus_embeddings = model.encode(corpus, convert_to_numpy=True, show_progress_bar=False)

print(f"Corpus encoded: {corpus_embeddings.shape}  (documents × embedding dimensions)")

# Search same queries as BM25 for comparison
print("\n=== Dense Retrieval Results ===\n")
for query in queries:
    query_embedding = model.encode([query], convert_to_numpy=True)
    
    # Cosine similarity
    similarities = cosine_similarity(query_embedding, corpus_embeddings)[0]
    
    # Get top 3
    top_indices = np.argsort(similarities)[::-1][:3]
    
    print(f"Query: \"{query}\"")
    for rank, idx in enumerate(top_indices, 1):
        print(f"  #{rank} [{corpus_labels[idx]}] score={similarities[idx]:.4f}")
        print(f"       {corpus[idx][:100]}...")
    print()

**Observation:** Dense retrieval handles semantically related queries much better than BM25. For example, "conversational dialogue systems" now correctly retrieves passages about conversational search, even when the exact words don't match. The model understands that "dialogue" and "conversation" are semantically related.

However, dense retrieval can sometimes miss results where specific technical terms matter (where BM25 excels). This is why **hybrid search** combines both.

---
# 6. Hybrid Search: BM25 + Dense Retrieval

**Hybrid search** combines the strengths of both approaches:

| Approach | Strength | Weakness |
|---|---|---|
| **BM25** | Exact keyword matching, fast | Misses synonyms and paraphrases |
| **Dense Retrieval** | Semantic understanding | May miss exact technical terms |
| **Hybrid** | Best of both | Slightly more complex |

### Score Fusion

The key challenge is combining scores from two different systems. Common approaches:

1. **Linear combination:** $\text{score}_{\text{hybrid}} = \alpha \cdot \text{score}_{\text{BM25}} + (1 - \alpha) \cdot \text{score}_{\text{dense}}$
2. **Reciprocal Rank Fusion (RRF):** $\text{RRF}(d) = \sum_{r \in R} \frac{1}{k + \text{rank}_r(d)}$ where $k$ is a constant (typically 60)

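For the linear combination, both scores must be **normalized** to the same scale before fusion, because BM25 scores are unbounded while cosine similarities lie in $[-1, 1]$. RRF uses only rank positions, so it needs no normalization.

Hybrid retrieval of this kind is reported to underlie modern search systems such as **Bing**, **Google AI Overview**, and **Perplexity AI**.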
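
Reciprocal Rank Fusion can be sketched in a few lines of plain Python. The function name `reciprocal_rank_fusion` and the two toy rankings are assumptions for illustration; rank positions start at 1, and a document missing from one list simply receives no contribution from that list.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into one list of (id, score)."""
    fused = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)


bm25_ranking = ["d1", "d3", "d2"]   # hypothetical BM25 order
dense_ranking = ["d3", "d2", "d1"]  # hypothetical dense order
fused_ranking = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
print([doc_id for doc_id, _ in fused_ranking])
```

In this toy case `d3` wins because it is ranked first by one system and second by the other, which beats `d1` (first and third).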

In [None]:
# Hybrid Search: BM25 + Dense Retrieval with Score Fusion

def hybrid_search(query, corpus, bm25, model, corpus_embeddings, alpha=0.5, top_k=5):
    """
    Combine BM25 and dense retrieval scores using linear interpolation.
    
    Parameters:
        alpha: weight for BM25 (1-alpha for dense). 0.5 = equal weight.
    """
    # BM25 scores
    tokenized_query = query.lower().split()
    bm25_scores = bm25.get_scores(tokenized_query)
    
    # Dense retrieval scores (cosine similarity)
    query_embedding = model.encode([query], convert_to_numpy=True)
    dense_scores = cosine_similarity(query_embedding, corpus_embeddings)[0]
    
    # Normalize both to [0, 1] range
    scaler = MinMaxScaler()
    bm25_norm = scaler.fit_transform(bm25_scores.reshape(-1, 1)).flatten()
    dense_norm = scaler.fit_transform(dense_scores.reshape(-1, 1)).flatten()
    
    # Linear fusion
    hybrid_scores = alpha * bm25_norm + (1 - alpha) * dense_norm
    
    # Get top-k
    top_indices = np.argsort(hybrid_scores)[::-1][:top_k]
    
    return [(idx, hybrid_scores[idx], bm25_norm[idx], dense_norm[idx]) for idx in top_indices]

# Compare all three approaches
test_queries = [
    "How does BM25 ranking work?",
    "neural embeddings for document search",
    "conversational dialogue with context",
    "what is ColBERT late interaction",
]

for query in test_queries:
    print(f"Query: \"{query}\"\n")
    
    # BM25 only
    bm25_scores = bm25.get_scores(query.lower().split())
    bm25_top = np.argsort(bm25_scores)[::-1][0]
    
    # Dense only
    q_emb = model.encode([query], convert_to_numpy=True)
    dense_scores = cosine_similarity(q_emb, corpus_embeddings)[0]
    dense_top = np.argsort(dense_scores)[::-1][0]
    
    # Hybrid
    hybrid_results = hybrid_search(query, corpus, bm25, model, corpus_embeddings, alpha=0.5)
    hybrid_top = hybrid_results[0][0]
    
    print(f"  BM25 top-1:   [{corpus_labels[bm25_top]}]")
    print(f"  Dense top-1:  [{corpus_labels[dense_top]}]")
    print(f"  Hybrid top-1: [{corpus_labels[hybrid_top]}]")
    
    print(f"\n  Hybrid Top-3:")
    for idx, h_score, b_score, d_score in hybrid_results[:3]:
        print(f"    [{corpus_labels[idx]:25s}] hybrid={h_score:.3f}  (BM25={b_score:.3f}, Dense={d_score:.3f})")
    print()

**Observation:** The hybrid approach leverages both BM25's exact keyword matching and dense retrieval's semantic understanding. When both agree on the top result, we can be more confident in it. When they disagree, the fused score favors documents that rank reasonably well under both signals. Hybrid retrieval of this kind is commonly described as the first stage of modern search systems such as Bing (with its Prometheus model) and Perplexity AI.

---
# 7. Neural Re-Ranking

Even after hybrid search, the initial retrieval may not produce the optimal ranking. **Re-ranking** uses a more powerful (but slower) model to re-score the top results:

### The Two-Stage Pipeline

```
Query → [Stage 1: Fast Retrieval (BM25 + Dense)] → Top-K candidates
      → [Stage 2: Cross-Encoder Re-Ranking] → Final ranked list
```

### Why Re-Ranking?

- **Bi-encoders** (Stage 1) encode query and document *separately* — fast but lose fine-grained interaction
- **Cross-encoders** (Stage 2) process the query-document *pair together* — capturing word-level interactions
- Cross-encoders are too slow for the entire corpus, but perfect for re-scoring a small candidate set

### Re-Ranking in Practice

| System | Re-Ranking Model |
|---|---|
| **Bing** | Prometheus (proprietary) |
| **Google** | DeepRank, LambdaMART |
| **Perplexity** | T5, GPT-4 cross-encoders |

Let's apply cross-encoder re-ranking to our hybrid search results:

In [None]:
# Neural Re-Ranking with a Cross-Encoder
cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "How do modern search engines combine keyword and semantic search?"

# Stage 1: Get top-10 candidates from hybrid search
hybrid_candidates = hybrid_search(query, corpus, bm25, model, corpus_embeddings, alpha=0.5, top_k=10)

print(f"Query: \"{query}\"\n")
print("=== Stage 1: Hybrid Search Top-10 ===")
candidate_indices = []
for rank, (idx, h_score, b_score, d_score) in enumerate(hybrid_candidates, 1):
    candidate_indices.append(idx)
    print(f"  #{rank:2d} [{corpus_labels[idx]:25s}] hybrid={h_score:.3f}")

# Stage 2: Re-rank with cross-encoder
print("\n=== Stage 2: Cross-Encoder Re-Ranking ===")
pairs = [(query, corpus[idx]) for idx in candidate_indices]
ce_scores = cross_encoder.predict(pairs)

# Sort by cross-encoder score
reranked = sorted(zip(candidate_indices, ce_scores), key=lambda x: x[1], reverse=True)

print(f"\n{'Rank':<6} {'Topic':<28} {'CE Score':>10} {'Movement':>10}")
print("-" * 56)
for new_rank, (idx, ce_score) in enumerate(reranked, 1):
    old_rank = candidate_indices.index(idx) + 1
    movement = old_rank - new_rank
    arrow = "↑" + str(abs(movement)) if movement > 0 else "↓" + str(abs(movement)) if movement < 0 else "—"
    print(f"  #{new_rank:<4d} [{corpus_labels[idx]:25s}] {ce_score:>9.4f}  {arrow:>8}")

**Observation:** The cross-encoder re-ranks documents by jointly analyzing each query-document pair with full self-attention. This captures fine-grained relevance that bi-encoders miss. Notice how some documents move up or down — the cross-encoder understands *which* passages actually answer the question, not just which ones are topically related.

This two-stage pipeline (fast retrieval → precise re-ranking) is the standard architecture in modern conversational search systems.

---
# 8. Evaluation Metrics for Conversational AI in Search

Evaluating conversational search systems requires metrics beyond traditional precision/recall. We need to measure both **retrieval quality** and **generation quality**.

## Traditional IR Metrics (Recap from earlier lectures)

| Metric | What It Measures |
|---|---|
| **Precision@K** | Fraction of top-K results that are relevant |
| **Recall** | Fraction of all relevant results that are retrieved |
| **MRR** (Mean Reciprocal Rank) | Mean over queries of $\frac{1}{\text{rank of first relevant result}}$ |
| **MAP** (Mean Average Precision) | Mean over queries of average precision, i.e. Precision@k averaged over the ranks k at which relevant results appear |
| **NDCG** (Normalized Discounted Cumulative Gain) | Rank-weighted quality measure |
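
As a quick recap, the metrics in the table can be computed for a single query as sketched below. The function names and the toy ranking are assumptions for illustration. MRR and MAP are then the means of `reciprocal_rank` and `average_precision` over all queries, and the NDCG variant shown uses the common $\log_2(\text{rank} + 1)$ discount.

```python
import math


def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k results that are relevant."""
    return sum(1 for d in ranked[:k] if d in relevant) / k


def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant result, or 0 if none is retrieved."""
    for rank, d in enumerate(ranked, start=1):
        if d in relevant:
            return 1.0 / rank
    return 0.0


def average_precision(ranked, relevant):
    """Precision@k averaged over the ranks k that hold a relevant result."""
    hits, total = 0, 0.0
    for rank, d in enumerate(ranked, start=1):
        if d in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def ndcg_at_k(ranked, gains, k):
    """DCG of the ranking divided by the DCG of the ideal ordering."""
    dcg = sum(gains.get(d, 0) / math.log2(rank + 1)
              for rank, d in enumerate(ranked[:k], start=1))
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(rank + 1) for rank, g in enumerate(ideal, start=1))
    return dcg / idcg if idcg > 0 else 0.0


toy_ranking = ["d2", "d5", "d1", "d4"]   # hypothetical system output
toy_relevant = {"d1", "d2"}              # hypothetical relevance judgments
toy_gains = {"d1": 1, "d2": 1}           # binary gains for NDCG

p_at_2 = precision_at_k(toy_ranking, toy_relevant, 2)
rr = reciprocal_rank(toy_ranking, toy_relevant)
ap = average_precision(toy_ranking, toy_relevant)
ndcg = ndcg_at_k(toy_ranking, toy_gains, 4)
print(p_at_2, rr, round(ap, 4), round(ndcg, 4))
```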

## New Metrics for Conversational AI / RAG

These metrics are critical for evaluating systems that *generate* answers rather than just returning links:

| Metric | Formula / Description |
|---|---|
| **Semantic Similarity** | $\text{sim}(a, g) = \cos(E_a, E_g)$ — cosine similarity between answer and ground-truth embeddings |
| **Answer Relevancy** | How relevant is the generated answer to the input question? Penalizes incomplete or off-topic answers |
| **Context Relevancy** | $\frac{\lvert S \rvert}{\text{Total sentences in context}}$ where $S$ = set of relevant sentences in the retrieved context |
| **Faithfulness** | $\frac{\text{Claims supported by context}}{\text{Total claims}}$ — does the answer only state things supported by the retrieved documents? |
| **Answer Correctness** | 1–5 scale examining accuracy against reference answer |

## The RAGAS Framework

**RAGAS** (Retrieval-Augmented Generation Assessment) is a framework for evaluating RAG pipelines. It combines all the above metrics and can also generate evaluation datasets. See: [docs.ragas.io](https://docs.ragas.io/)

Let's implement some of these metrics:

In [None]:
# Implementing Evaluation Metrics for Conversational AI

# --- Semantic Similarity ---
def semantic_similarity(answer: str, ground_truth: str, model) -> float:
    """Cosine similarity between answer and ground truth embeddings."""
    embs = model.encode([answer, ground_truth])
    return float(cosine_similarity([embs[0]], [embs[1]])[0][0])

# --- Context Relevancy ---
def context_relevancy(question: str, context_passages: list, model, threshold=0.3) -> float:
    """Fraction of context passages that are relevant to the question."""
    q_emb = model.encode([question])
    c_embs = model.encode(context_passages)
    sims = cosine_similarity(q_emb, c_embs)[0]
    relevant = sum(1 for s in sims if s > threshold)
    return relevant / len(context_passages)

# --- Faithfulness (simplified) ---
def faithfulness_score(answer_claims: list, context: str, model, threshold=0.5) -> float:
    """Fraction of answer claims that can be found in the context."""
    context_emb = model.encode([context])
    claim_embs = model.encode(answer_claims)
    sims = cosine_similarity(claim_embs, context_emb).flatten()
    supported = sum(1 for s in sims if s > threshold)
    return supported / len(answer_claims)

# --- Demonstration ---
question = "What is BM25 and how does it rank documents?"
ground_truth = "BM25 is a retrieval function that ranks documents based on term frequency, inverse document frequency, and document length normalization."
generated_answer = "BM25 is a bag-of-words retrieval function widely used in search engines. It ranks documents using term frequency and inverse document frequency."
hallucinated_answer = "BM25 uses deep learning transformers to understand query semantics and generates answers using GPT-4."

context = corpus[0]  # The BM25 passage
answer_claims_good = [
    "BM25 is a bag-of-words retrieval function",
    "BM25 is widely used in search engines",
    "BM25 ranks documents using term frequency",
]
answer_claims_bad = [
    "BM25 uses deep learning transformers",
    "BM25 understands query semantics",
    "BM25 generates answers using GPT-4",
]

print("=== Evaluation Metrics Demo ===\n")
print(f"Question: \"{question}\"\n")

# Semantic Similarity
sim_good = semantic_similarity(generated_answer, ground_truth, model)
sim_bad = semantic_similarity(hallucinated_answer, ground_truth, model)
print(f"Semantic Similarity:")
print(f"  Good answer:         {sim_good:.4f}")
print(f"  Hallucinated answer: {sim_bad:.4f}")

# Faithfulness
faith_good = faithfulness_score(answer_claims_good, context, model)
faith_bad = faithfulness_score(answer_claims_bad, context, model)
print(f"\nFaithfulness (claims supported by context):")
print(f"  Good answer:         {faith_good:.2f} ({round(faith_good * len(answer_claims_good))}/{len(answer_claims_good)} claims)")
print(f"  Hallucinated answer: {faith_bad:.2f} ({round(faith_bad * len(answer_claims_bad))}/{len(answer_claims_bad)} claims)")

# Context Relevancy
retrieved_context = [corpus[0], corpus[2], corpus[3], corpus[11]]  # BM25, TF-IDF, PageRank, Cross-Encoder
ctx_rel = context_relevancy(question, retrieved_context, model)
print(f"\nContext Relevancy (relevant passages / retrieved):")
print(f"  {ctx_rel:.2f}")
q_emb = model.encode([question])  # encode the question once, outside the loop
for passage, label in zip(retrieved_context, ["BM25", "TF-IDF", "PageRank", "Cross-Encoder"]):
    p_emb = model.encode([passage])
    sim = cosine_similarity(q_emb, p_emb)[0][0]
    relevant = "✓ relevant" if sim > 0.3 else "✗ irrelevant"
    print(f"    [{label}] sim={sim:.3f} — {relevant}")

**Observation:** On this example, the evaluation metrics should separate the grounded answer from the hallucinated one:

- **Semantic similarity** should score the factually correct answer higher than the hallucinated one
- **Faithfulness** catches that the hallucinated claims about "deep learning transformers" and "GPT-4" are NOT supported by the BM25 context passage
- **Context relevancy** identifies which retrieved passages actually help answer the question

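These metrics are simplified, embedding-based versions of the ones in the **RAGAS framework**, which is used to evaluate modern RAG-based conversational search systems and typically relies on an LLM judge rather than a fixed similarity threshold.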
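
To make the faithfulness computation concrete, the sketch below reports a verdict per claim instead of a single ratio. It is a minimal illustration, not the notebook's implementation: `bow_cosine` is a hypothetical bag-of-words stand-in for the sentence-embedding similarity, `toy_context` and `toy_claims` are invented examples, and the threshold of 0.3 is an assumed value that would need tuning for a real embedding model.

```python
import math
import re
from collections import Counter

def bow_cosine(a: str, b: str) -> float:
    """Cosine similarity between bag-of-words count vectors (toy stand-in for embeddings)."""
    va = Counter(re.findall(r"[a-z0-9]+", a.lower()))
    vb = Counter(re.findall(r"[a-z0-9]+", b.lower()))
    dot = sum(va[t] * vb[t] for t in va)
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0

def claim_report(claims, context, threshold=0.3):
    """Return (faithfulness, [(claim, similarity, supported), ...])."""
    verdicts = []
    for claim in claims:
        sim = bow_cosine(claim, context)
        verdicts.append((claim, sim, sim > threshold))
    supported = sum(1 for _, _, ok in verdicts if ok)
    # An empty claim list is scored 0.0 instead of raising ZeroDivisionError
    score = supported / len(claims) if claims else 0.0
    return score, verdicts

# Invented example data (not taken from the notebook's corpus)
toy_context = ("BM25 is a bag-of-words retrieval function that ranks documents "
               "using term frequency and inverse document frequency.")
toy_claims = [
    "BM25 ranks documents using term frequency",
    "BM25 generates answers using GPT-4",
]

score, verdicts = claim_report(toy_claims, toy_context)
for claim, sim, ok in verdicts:
    print(f"{'supported' if ok else 'unsupported':11s} sim={sim:.3f}  {claim}")
print(f"faithfulness = {score:.2f}")
```

The lexical stand-in still gives the hallucinated claim a nonzero similarity, because it shares the words "BM25" and "using" with the context. This is why the choice of threshold matters.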

---
# 9. Commercial Conversational Search Systems

Three major systems illustrate how these techniques combine in production:

## Bing + ChatGPT (Microsoft)

| Stage | Implementation |
|---|---|
| **Preprocessing** | NER, POS tagging, intent classification, query expansion via GPT-4, multi-turn context retention |
| **Retrieval & Ranking** | **Prometheus model**: BM25 + neural retrieval (ColBERT, DPR) + LLM-based answer generation |
| **Post-processing** | Fact verification with citations, confidence scoring, RLHF from user feedback (thumbs up/down) |

## Google AI Overview (Gemini)

| Stage | Implementation |
|---|---|
| **Preprocessing** | Query expansion with T5-based models, Knowledge Graph integration |
| **Retrieval & Ranking** | ColBERT + DPR, Hybrid BM25 + Neural, RAG with live search, learning-to-rank models (LambdaMART, DeepRank) |
| **Post-processing** | **AIS** (Attributable to Identified Sources — refuses to answer if unverifiable), citation tracking, real-time RL from engagement |

## Perplexity AI

| Stage | Implementation |
|---|---|
| **Preprocessing** | NER, intent classification (T5, GPT), query expansion (Word2Vec, FastText, Transformers), spell checking |
| **Retrieval & Ranking** | Real-time web search (not static index!), Hybrid BM25 + DPR/ColBERT, cross-encoders (T5, GPT-4), citation-aware ranking |
| **Post-processing** | AIS, TruthfulQA & RAGAS fact-checking, score thresholding (0.85), every claim linked to source, multiple LLMs (Claude, GPT-4, Mistral) + RAG |

## Comparison

| Feature | Bing | Google | Perplexity |
|---|---|---|---|
| **Retrieval** | Prometheus (BM25+Neural) | Hybrid BM25+Neural+KG | Real-time web + Hybrid |
| **LLM** | GPT-4 | Gemini | Multiple (GPT-4, Claude, Mistral) |
| **Fact-checking** | Citations + RLHF | AIS + Fact-Check API | AIS + RAGAS + TruthfulQA |
| **Unique feature** | Deep Windows integration | Knowledge Graph + AIS | Real-time sources, multi-LLM |

**Key takeaway:** All three systems use the same fundamental building blocks we've explored in this tutorial — hybrid search, neural re-ranking, and evaluation metrics — combined with proprietary models and infrastructure.

---
# 10. Building a Simple Conversational Search Pipeline

Let's tie everything together by building a simple conversational search pipeline that demonstrates the key concepts from this tutorial. This pipeline uses FAISS as a vector store and processes multi-turn queries:

In [None]:
# Complete Conversational Search Pipeline

class SimpleConversationalSearch:
    """A simple conversational search engine demonstrating the core concepts."""
    
    def __init__(self, corpus, model_name="all-MiniLM-L6-v2"):
        self.corpus = corpus
        self.model = SentenceTransformer(model_name)
        self.cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
        
        # Build BM25 index
        self.tokenized_corpus = [doc.lower().split() for doc in corpus]
        self.bm25 = BM25Okapi(self.tokenized_corpus)
        
        # Build FAISS index for dense retrieval
        self.corpus_embeddings = self.model.encode(corpus, convert_to_numpy=True)
        dim = self.corpus_embeddings.shape[1]
        self.index = faiss.IndexFlatIP(dim)
        faiss.normalize_L2(self.corpus_embeddings)
        self.index.add(self.corpus_embeddings)
        
        # Conversation history for context tracking
        self.history = []
    
    def search(self, query, top_k=5, alpha=0.5, rerank=True):
        """Execute hybrid search with optional re-ranking."""
        # Track conversation
        self.history.append(query)
        
        # Context-aware query: append recent history for context
        context_query = query
        if len(self.history) > 1:
            context_query = " ".join(self.history[-3:])  # Last 3 turns, including the current one
        
        # BM25 scores
        bm25_scores = self.bm25.get_scores(context_query.lower().split())
        
        # Dense retrieval scores
        q_emb = self.model.encode([context_query], convert_to_numpy=True)
        faiss.normalize_L2(q_emb)
        dense_scores, dense_ids = self.index.search(q_emb, len(self.corpus))
        
        # Reconstruct dense scores array (FAISS returns sorted)
        dense_score_map = np.zeros(len(self.corpus))
        for i, idx in enumerate(dense_ids[0]):
            dense_score_map[idx] = dense_scores[0][i]
        
        # Normalize and fuse
        scaler = MinMaxScaler()
        bm25_norm = scaler.fit_transform(bm25_scores.reshape(-1, 1)).flatten()
        dense_norm = scaler.fit_transform(dense_score_map.reshape(-1, 1)).flatten()
        hybrid_scores = alpha * bm25_norm + (1 - alpha) * dense_norm
        
        # Get top-k candidates
        top_indices = np.argsort(hybrid_scores)[::-1][:top_k]
        
        if rerank:
            # Cross-encoder re-ranking: re-scores only the top_k hybrid candidates,
            # using the current turn (not the history-expanded query)
            pairs = [(query, self.corpus[idx]) for idx in top_indices]
            ce_scores = self.cross_encoder.predict(pairs)
            reranked = sorted(zip(top_indices, ce_scores), key=lambda x: x[1], reverse=True)
            return reranked
        
        return [(idx, hybrid_scores[idx]) for idx in top_indices]

# Build the search engine
search_engine = SimpleConversationalSearch(corpus)

# Simulate a multi-turn conversation
conversation = [
    "What retrieval methods are used in modern search engines?",
    "How do you combine keyword and semantic search?",
    "What about re-ranking the results?",
    "How do you evaluate the quality of the answers?",
]

print("=== Simulated Conversational Search Session ===\n")
for turn_num, query in enumerate(conversation, 1):
    print(f"User (Turn {turn_num}): \"{query}\"")
    results = search_engine.search(query, top_k=3)
    
    print(f"System: Here are the most relevant results:")
    for rank, (idx, score) in enumerate(results, 1):
        print(f"  #{rank} [{corpus_labels[idx]}] (score: {score:.4f})")
        print(f"     {corpus[idx][:120]}...")
    
    # Grounding: confirm understanding
    top_topic = corpus_labels[results[0][0]]
    print(f"  → (Grounding: The system identified '{top_topic}' as most relevant to your question)")
    print()

**Observation:** This pipeline brings together the main retrieval concepts from this tutorial:

1. **Hybrid search** (BM25 + dense retrieval with score fusion)
2. **Neural re-ranking** (cross-encoder re-scores the top candidates)
3. **Context tracking** (the last three turns are concatenated into the retrieval query so that follow-up questions keep their context)
4. **Grounding** (the system acknowledges what it found)
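
The score fusion in step 1 can be traced by hand on a small example. The sketch below is a minimal pure-Python version of min-max normalization followed by the alpha-weighted sum used in `search`. The document names and raw scores are invented for illustration, and `min_max`, `fuse`, and `rank_docs` are hypothetical helper names, not part of the pipeline above.

```python
def min_max(scores):
    """Rescale scores to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]

def fuse(bm25_scores, dense_scores, alpha=0.5):
    """Weighted sum of normalized scores: alpha weights BM25, (1 - alpha) weights dense."""
    b, d = min_max(bm25_scores), min_max(dense_scores)
    return [alpha * x + (1 - alpha) * y for x, y in zip(b, d)]

docs = ["doc_a", "doc_b", "doc_c"]
bm25_raw = [8.0, 2.0, 0.0]       # unbounded BM25 scores (made-up values)
dense_raw = [0.30, 0.70, 0.50]   # cosine similarities (made-up values)

def rank_docs(alpha):
    """Return document names ordered by descending hybrid score."""
    hybrid = fuse(bm25_raw, dense_raw, alpha)
    order = sorted(range(len(docs)), key=lambda i: hybrid[i], reverse=True)
    return [docs[i] for i in order]

for alpha in (0.8, 0.5, 0.2):
    print(f"alpha={alpha}: {rank_docs(alpha)}")
```

With these made-up scores, `doc_a` ranks first at `alpha=0.8`, while `doc_b` takes over at `alpha=0.5` and `alpha=0.2`. This shows how `alpha` shifts the balance between lexical and semantic evidence.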

In a production system, this would be extended with:
- Full coreference resolution for pronouns in follow-up turns
- RAG-based answer generation (covered in Tutorial 11)
- Agent-based orchestration with tools and guardrails (covered in Tutorial 12)
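
As a small step toward the coreference handling mentioned in the first bullet, the sketch below adds history only when the current turn looks context-dependent, rather than always concatenating recent turns. This is a heuristic illustration under stated assumptions, not real coreference resolution: the `CONTEXT_CUES` word list and the `needs_context` and `build_context_query` helpers are hypothetical, and a production system would more likely use a trained resolver or an LLM-based query rewriter.

```python
import re

# Hypothetical cue words suggesting that a turn depends on earlier turns.
CONTEXT_CUES = {"it", "its", "they", "them", "their", "that", "this", "those", "these"}

def needs_context(query: str) -> bool:
    """True if the query contains a pronoun-like cue word (crude heuristic)."""
    tokens = set(re.findall(r"[a-z]+", query.lower()))
    return bool(tokens & CONTEXT_CUES)

def build_context_query(history, query, window=2):
    """Prepend up to `window` previous turns only when the query looks context-dependent."""
    if history and needs_context(query):
        return " ".join(history[-window:] + [query])
    return query

history = []
for turn in [
    "What is hybrid search?",
    "How do you tune its alpha parameter?",
    "Explain cross-encoder re-ranking.",
]:
    retrieval_query = build_context_query(history, turn)
    print(f"{turn!r} -> {retrieval_query!r}")
    history.append(turn)
```

Only the second turn is expanded here, because it contains "its". Self-contained turns are passed through unchanged, which avoids diluting the query with unrelated history.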

---
# Summary

| Concept | What We Learned |
|---|---|
| **Human conversation properties** | Turns, grounding, adjacency pairs, mixed initiative, coreference — search must handle all of these |
| **Search evolution** | Keyword (BM25) → Intent (BERT embeddings) → Conversational (multi-turn + RAG) |
| **Query understanding** | NER, query expansion (embedding similarity), coreference resolution, context tracking |
| **BM25** | Fast, exact keyword matching; a standard lexical baseline in many search engines |
| **Dense retrieval** | Semantic understanding via embeddings; handles synonyms and paraphrases |
| **Hybrid search** | Combines BM25 + dense retrieval with score fusion; often more robust than either method alone |
| **Re-ranking** | Cross-encoders re-score top candidates with full query-document attention |
| **Evaluation metrics** | Semantic similarity, faithfulness, context relevancy, answer relevancy — RAGAS framework |
| **Commercial systems** | Bing, Google, Perplexity all use hybrid search + re-ranking + fact-checking |

### What's Next?

- **Tutorial 11:** RAG (Retrieval-Augmented Generation) — fusion methods, hallucination detection with RAGAS
- **Tutorial 12:** Agentic Search — agents with memory, tools, multi-agent orchestration

---
# Exercises

The following exercises are graded. Please provide your answers in the designated cells below.

## Exercise 1 — Hybrid Search vs BM25 (5 points)

Compare **BM25** and **Hybrid Search** (BM25 + Dense Retrieval) as retrieval strategies for a conversational search system. In your answer, address:

1. What are the specific strengths and weaknesses of BM25 alone? Give an example query where BM25 would fail.
2. How does dense retrieval complement BM25's weaknesses? What does score fusion achieve?
3. In what scenario might a pure BM25 approach actually outperform hybrid search?

Write your answer in the cell below (minimum 150 words).

YOUR ANSWER HERE


## Exercise 2 — Evaluation Metrics for Conversational Search (5 points)

A conversational search system generates the following answer to the question *"What is BM25?"*:

> "BM25 is a deep learning algorithm that uses transformer attention to rank documents. It was developed by Google and is the basis for PageRank."

Using the metrics from Section 8, evaluate this answer. Address:

1. What would the **faithfulness** score be if the context contains the correct BM25 definition? Identify each claim and whether it is supported.
2. What would the **semantic similarity** score likely be compared to the ground truth "BM25 is a bag-of-words retrieval function based on term frequency and inverse document frequency"? Would it be high or low, and why?
3. How would the **RAGAS framework** combine these signals to flag this answer as problematic?

Write your answer in the cell below (minimum 150 words).

YOUR ANSWER HERE


## Exercise 3 — Implementing Re-Ranking Evaluation (10 points)

Write code that evaluates the impact of **cross-encoder re-ranking** on retrieval quality. Your code should:

1. Use the `corpus` defined earlier in this notebook and the `eval_queries` dictionary given below
2. For each query, compute the **top-3 results** using:
   - (a) BM25 only
   - (b) Hybrid search (BM25 + Dense, alpha=0.5)
   - (c) Hybrid search + cross-encoder re-ranking
3. For each method, compute the **Mean Reciprocal Rank (MRR)** against the expected best passage for each query
4. Store the results in three variables: `mrr_bm25`, `mrr_hybrid`, `mrr_reranked` (each a float)

Use these query-answer pairs (the expected best passage index in the corpus):
```python
eval_queries = {
    "How does BM25 work?": 0,                        # BM25 passage
    "neural embeddings for retrieval": 1,             # Dense Retrieval passage
    "combining keyword and semantic search": 10,      # Hybrid Search passage
    "cross-encoder models for search": 11,            # Cross-Encoder Re-ranking passage
    "evaluating RAG pipelines": 7,                    # RAGAS passage
}
```
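
For reference, MRR over a query set $Q$ is conventionally defined as follows, where $\mathrm{rank}_q$ is the 1-based position of the expected passage in the result list for query $q$:

$$
\mathrm{MRR} = \frac{1}{|Q|} \sum_{q \in Q} \frac{1}{\mathrm{rank}_q}
$$

Because only the top 3 results are considered here, a common convention (assumed in this note, not stated by the exercise) is to score a query as 0 when its expected passage does not appear in the top 3. For example, if the expected passages for three queries appear at rank 1, at rank 3, and outside the top 3, then $\mathrm{MRR} = (1 + 1/3 + 0)/3 \approx 0.44$.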

YOUR ANSWER HERE

In [None]:
# YOUR CODE HERE
raise NotImplementedError("Replace this line with your solution")

In [None]:
# Autograder test cell — do not modify
assert 'mrr_bm25' in dir(), "You need to define 'mrr_bm25'"
assert 'mrr_hybrid' in dir(), "You need to define 'mrr_hybrid'"
assert 'mrr_reranked' in dir(), "You need to define 'mrr_reranked'"
assert isinstance(mrr_bm25, float), "mrr_bm25 should be a float"
assert isinstance(mrr_hybrid, float), "mrr_hybrid should be a float"
assert isinstance(mrr_reranked, float), "mrr_reranked should be a float"
assert 0 <= mrr_bm25 <= 1, "mrr_bm25 should be between 0 and 1"
assert 0 <= mrr_hybrid <= 1, "mrr_hybrid should be between 0 and 1"
assert 0 <= mrr_reranked <= 1, "mrr_reranked should be between 0 and 1"
print(f"MRR (BM25 only):             {mrr_bm25:.4f}")
print(f"MRR (Hybrid):                {mrr_hybrid:.4f}")
print(f"MRR (Hybrid + Re-Ranking):   {mrr_reranked:.4f}")
print(f"\nRe-ranking improvement over BM25: {(mrr_reranked - mrr_bm25) / max(mrr_bm25, 0.001) * 100:.1f}%")
print("All auto-graded tests passed!")