# Module 14: Retrieval

**Goal:** Build a retrieval system that finds relevant documents using semantic search.

**Prerequisites:** Module 12 (Embeddings)

**Expected Runtime:** ~25 minutes

**Outputs:**
- Keyword vs semantic search comparison
- Retrieval evaluation metrics
- Hybrid search implementation

---

## Setup

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from collections import Counter
import warnings
warnings.filterwarnings('ignore')

np.random.seed(42)
plt.rcParams['figure.figsize'] = (12, 5)

## Part 1: Create Document Corpus

Sample support articles for retrieval.

In [None]:
# Sample support knowledge base
documents = [
    {"id": 1, "title": "Refund Policy", 
     "content": "We offer full refunds within 30 days of purchase. To request a refund, go to Order History and select the item."},
    {"id": 2, "title": "How to Cancel Subscription", 
     "content": "Cancel your subscription from Account Settings. Click Manage Subscription then Cancel. You'll retain access until the end of your billing period."},
    {"id": 3, "title": "Password Reset Guide", 
     "content": "Reset your password by clicking Forgot Password on the login page. We'll send a reset link to your email."},
    {"id": 4, "title": "Change Login Credentials", 
     "content": "Update your email or password in Account Settings. For security, you'll need to confirm your current password."},
    {"id": 5, "title": "Shipping and Delivery", 
     "content": "Standard shipping takes 5-7 business days. Express shipping delivers in 2-3 days. Track your order in Order History."},
    {"id": 6, "title": "Return an Item", 
     "content": "Start a return from Order History. Select the item, choose Return, and print the prepaid shipping label. Money back within 5-10 days."},
    {"id": 7, "title": "Two-Factor Authentication", 
     "content": "Enable 2FA in Security Settings for extra protection. Use an authenticator app or SMS verification."},
    {"id": 8, "title": "Payment Methods", 
     "content": "We accept credit cards, debit cards, and PayPal. Add or update payment methods in Billing Settings."},
    {"id": 9, "title": "Contact Support", 
     "content": "Reach our team via live chat, email, or phone. Business hours: Monday-Friday 9AM-6PM EST."},
    {"id": 10, "title": "Membership Benefits", 
     "content": "Premium members get free shipping, exclusive discounts, and early access to sales. Upgrade in Subscription Settings."},
]

# Create test queries with ground truth
test_queries = [
    {"query": "how to get my money back", "relevant": [1, 6]},
    {"query": "reset password", "relevant": [3, 4]},
    {"query": "end my subscription", "relevant": [2]},
    {"query": "shipping time", "relevant": [5]},
    {"query": "secure my account", "relevant": [7, 4]},
]

df = pd.DataFrame(documents)
df['text'] = df['title'] + ' ' + df['content']

print(f"Corpus: {len(df)} documents")
df[['id', 'title']].head(10)

## Part 2: Keyword Search (BM25-style)

Simple TF-IDF based keyword matching.

In [None]:
# Build TF-IDF index for keyword search
tfidf = TfidfVectorizer(stop_words='english', ngram_range=(1, 2))
tfidf_matrix = tfidf.fit_transform(df['text'])

def keyword_search(query, k=5):
    """Search using TF-IDF similarity (proxy for BM25)."""
    query_vec = tfidf.transform([query])
    scores = cosine_similarity(query_vec, tfidf_matrix).flatten()
    top_k_idx = scores.argsort()[::-1][:k]
    return [(df.iloc[i]['id'], scores[i]) for i in top_k_idx]

# Test keyword search
query = "reset password"
results = keyword_search(query, k=5)

print(f"Query: '{query}'\n")
print("Keyword Search Results:")
for doc_id, score in results:
    title = df[df['id'] == doc_id]['title'].values[0]
    print(f"  [{score:.3f}] {doc_id}. {title}")
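
The heading says "BM25-style", but the cell above scores with TF-IDF cosine similarity. For comparison, here is a minimal sketch of Okapi BM25 scoring. It assumes whitespace tokenization, a tiny made-up corpus, and the commonly used defaults `k1=1.5` and `b=0.75`; the function name `bm25_scores` is illustrative and does not come from any library.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with Okapi BM25 (illustrative sketch)."""
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    # Document frequency: number of documents that contain each term
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in term_freq:
                continue
            # Smoothed IDF variant that stays non-negative
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            tf = term_freq[term]
            # Longer-than-average documents are penalized via b
            length_norm = 1 - b + b * len(tokens) / avg_len
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append(score)
    return scores

# Hypothetical mini-corpus, not the knowledge base defined earlier
toy_docs = [
    "reset your password from the login page",
    "update your email or password in account settings",
    "standard shipping takes five to seven business days",
]
toy_scores = bm25_scores("reset password", toy_docs)
for doc, score in zip(toy_docs, toy_scores):
    print(f"[{score:.3f}] {doc}")
```

Unlike plain TF-IDF, BM25 saturates the contribution of repeated terms (controlled by `k1`) and normalizes for document length (controlled by `b`).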

## Part 3: Semantic Search (Embeddings)

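This demo approximates semantic embeddings by reducing the TF-IDF matrix with truncated SVD (latent semantic analysis). In production, use a dedicated embedding model such as sentence-transformers.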

In [None]:
# Semantic embeddings - two options:

# OPTION 1: Production-quality with sentence-transformers
# Install: pip install sentence-transformers
# from sentence_transformers import SentenceTransformer
# model = SentenceTransformer('all-MiniLM-L6-v2')  # ~80MB, 384 dims
# semantic_embeddings = model.encode(df['text'].tolist())

# OPTION 2: Demo-friendly with TF-IDF + SVD (no extra install)
from sklearn.decomposition import TruncatedSVD

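# Note: a 10-document corpus has rank <= 10, so at most 10 of the
# requested 50 components can carry information.
svd = TruncatedSVD(n_components=50, random_state=42)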
semantic_embeddings = svd.fit_transform(tfidf_matrix)

print("Using TF-IDF + SVD for semantic embeddings (demo)")
print("For production, install sentence-transformers:")
print("  pip install sentence-transformers")
print("  model = SentenceTransformer('all-MiniLM-L6-v2')")
print("  embeddings = model.encode(texts)")

def semantic_search(query, k=5):
    """Search using semantic embeddings."""
    query_vec = tfidf.transform([query])
    query_emb = svd.transform(query_vec)
    scores = cosine_similarity(query_emb, semantic_embeddings).flatten()
    top_k_idx = scores.argsort()[::-1][:k]
    return [(df.iloc[i]['id'], scores[i]) for i in top_k_idx]

# Test semantic search
query = "how to get my money back"  # Uses "money back" instead of "refund"
print(f"Query: '{query}'\n")

print("Keyword Search:")
for doc_id, score in keyword_search(query, k=3):
    title = df[df['id'] == doc_id]['title'].values[0]
    print(f"  [{score:.3f}] {doc_id}. {title}")

print("\nSemantic Search:")
for doc_id, score in semantic_search(query, k=3):
    title = df[df['id'] == doc_id]['title'].values[0]
    print(f"  [{score:.3f}] {doc_id}. {title}")

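print("\n💡 With sentence-transformers embeddings, semantic search can find 'Refund Policy' even without the word 'refund' in the query; the TF-IDF + SVD demo still depends largely on shared vocabulary.")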

## Part 4: Hybrid Search

Combine keyword and semantic scores to get the best of both.

In [None]:
def hybrid_search(query, k=5, alpha=0.5):
    """Combine keyword and semantic search.
    
    alpha: weight for keyword (1-alpha for semantic)
    """
    # Get keyword scores
    query_vec = tfidf.transform([query])
    keyword_scores = cosine_similarity(query_vec, tfidf_matrix).flatten()
    
    # Get semantic scores
    query_emb = svd.transform(query_vec)
    semantic_scores = cosine_similarity(query_emb, semantic_embeddings).flatten()
    
    # Scale each score array so its maximum is 1 (skipped when no score is positive)
    if keyword_scores.max() > 0:
        keyword_scores = keyword_scores / keyword_scores.max()
    if semantic_scores.max() > 0:
        semantic_scores = semantic_scores / semantic_scores.max()
    
    # Combine
    combined = alpha * keyword_scores + (1 - alpha) * semantic_scores
    top_k_idx = combined.argsort()[::-1][:k]
    
    return [(df.iloc[i]['id'], combined[i], keyword_scores[i], semantic_scores[i]) 
            for i in top_k_idx]

# Compare all methods
query = "end my membership"  # Neither "cancel" nor "subscription" appears in the query

print(f"Query: '{query}'\n")
print("="*60)

for method, results in [
    ("Keyword", keyword_search(query, k=3)),
    ("Semantic", semantic_search(query, k=3)),
    ("Hybrid", [(r[0], r[1]) for r in hybrid_search(query, k=3, alpha=0.3)])
]:
    print(f"\n{method} Search:")
    for doc_id, score in results:
        title = df[df['id'] == doc_id]['title'].values[0]
        print(f"  [{score:.3f}] {doc_id}. {title}")
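
To make the effect of `alpha` concrete, the sketch below blends made-up scores for three hypothetical documents (A, B, C) using the same max-normalization as `hybrid_search`. The scores and the helper name `blend` are assumptions for illustration, not output from the corpus above.

```python
import numpy as np

def blend(keyword_scores, semantic_scores, alpha):
    """Max-normalize each score array, then take the alpha-weighted sum."""
    kw = np.asarray(keyword_scores, dtype=float)
    sem = np.asarray(semantic_scores, dtype=float)
    if kw.max() > 0:
        kw = kw / kw.max()
    if sem.max() > 0:
        sem = sem / sem.max()
    return alpha * kw + (1 - alpha) * sem

# Hypothetical scores: doc A wins on keywords, doc C wins on meaning
toy_keyword = [0.8, 0.4, 0.0]
toy_semantic = [0.1, 0.3, 0.5]
doc_names = ["A", "B", "C"]

rankings = {}
for alpha in (0.0, 0.3, 0.5, 1.0):
    combined = blend(toy_keyword, toy_semantic, alpha)
    order = [doc_names[i] for i in np.argsort(-combined)]
    rankings[alpha] = order
    print(f"alpha={alpha:.1f}: {order} scores={np.round(combined, 3)}")
```

With these toy numbers the ranking flips between `alpha=0.3` and `alpha=0.5`, which is why `alpha` is worth tuning on the test queries rather than fixing in advance.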

## Part 5: Evaluation Metrics

Measure retrieval quality with Precision, Recall, and MRR.

In [None]:
def precision_at_k(retrieved, relevant, k):
    """What fraction of top K are relevant?"""
    top_k = [r[0] for r in retrieved[:k]]
    hits = len(set(top_k) & set(relevant))
    return hits / k

def recall_at_k(retrieved, relevant, k):
    """What fraction of relevant are in top K?"""
    top_k = [r[0] for r in retrieved[:k]]
    hits = len(set(top_k) & set(relevant))
    return hits / len(relevant) if relevant else 0

def mrr(retrieved, relevant):
    """Reciprocal rank of the first relevant result; averaged over queries, this gives MRR."""
    for i, (doc_id, _) in enumerate(retrieved):
        if doc_id in relevant:
            return 1.0 / (i + 1)
    return 0

# Evaluate all methods on test queries
results_table = []

for test in test_queries:
    query = test['query']
    relevant = test['relevant']
    k = 5
    
    for method_name, search_fn in [
        ('Keyword', keyword_search),
        ('Semantic', semantic_search),
        ('Hybrid', lambda q, k: [(r[0], r[1]) for r in hybrid_search(q, k, alpha=0.3)])
    ]:
        retrieved = search_fn(query, k)
        
        results_table.append({
            'Query': query[:25] + '...' if len(query) > 25 else query,
            'Method': method_name,
            'P@5': precision_at_k(retrieved, relevant, k),
            'R@5': recall_at_k(retrieved, relevant, k),
            'MRR': mrr(retrieved, relevant)
        })

results_df = pd.DataFrame(results_table)
print("=== Evaluation Results ===")
print(results_df.to_string(index=False))
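
The three metrics are easiest to check by hand on a tiny ranking. The sketch below restates the definitions on plain lists of document IDs; the ranking and the helper names (`precision_k`, `recall_k`, `reciprocal_rank`) are hypothetical and chosen so they do not shadow the notebook's own functions.

```python
def precision_k(ranked_ids, relevant_ids, k):
    """Fraction of the top-k results that are relevant."""
    return len(set(ranked_ids[:k]) & set(relevant_ids)) / k

def recall_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant documents that appear in the top k."""
    if not relevant_ids:
        return 0.0
    return len(set(ranked_ids[:k]) & set(relevant_ids)) / len(relevant_ids)

def reciprocal_rank(ranked_ids, relevant_ids):
    """1 / rank of the first relevant result, or 0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

# Hypothetical ranking: docs 1 and 6 are relevant and come back at ranks 4 and 2
toy_ranked = [9, 6, 3, 1, 5]
toy_relevant = [1, 6]

p5 = precision_k(toy_ranked, toy_relevant, k=5)   # 2 hits / 5 results
r5 = recall_k(toy_ranked, toy_relevant, k=5)      # 2 hits / 2 relevant
r3 = recall_k(toy_ranked, toy_relevant, k=3)      # 1 hit / 2 relevant
rr = reciprocal_rank(toy_ranked, toy_relevant)    # first hit at rank 2
print(f"P@5={p5:.2f}  R@5={r5:.2f}  R@3={r3:.2f}  RR={rr:.2f}")
# MRR is the mean of the reciprocal ranks over all test queries
```

Because every test query in this module has at most two relevant documents, P@5 can never exceed 0.4 here, so low precision values in the results table do not by themselves indicate poor retrieval.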

In [None]:
# Aggregate by method
summary = results_df.groupby('Method')[['P@5', 'R@5', 'MRR']].mean().round(3)
print("\n=== Average Metrics by Method ===")
print(summary)

# Visualize
fig, ax = plt.subplots(figsize=(10, 5))

x = np.arange(3)
width = 0.25

methods = ['Keyword', 'Semantic', 'Hybrid']
colors = ['#ef4444', '#22c55e', '#8b5cf6']

for i, method in enumerate(methods):
    values = summary.loc[method].values
    ax.bar(x + i * width, values, width, label=method, color=colors[i])

ax.set_ylabel('Score')
ax.set_title('Retrieval Method Comparison')
ax.set_xticks(x + width)
ax.set_xticklabels(['Precision@5', 'Recall@5', 'MRR'])
ax.legend()
ax.set_ylim(0, 1)
ax.grid(True, alpha=0.3, axis='y')

plt.tight_layout()
plt.show()

## Part 6: Chunking for RAG

How to split documents for retrieval-augmented generation.

In [None]:
# Example long document
long_doc = """
# Complete Guide to Account Management

## Password and Security
Your account security is our top priority. We recommend using a strong password 
with at least 12 characters, including uppercase, lowercase, numbers, and symbols.
Enable two-factor authentication for additional protection.

## Subscription Management
You can upgrade, downgrade, or cancel your subscription at any time from the 
Account Settings page. Premium members enjoy benefits like free shipping and 
exclusive discounts. Changes take effect at the start of your next billing cycle.

## Billing and Payments
We accept all major credit cards, debit cards, and PayPal. View your billing 
history and update payment methods in Billing Settings. For refunds, please 
contact support within 30 days of purchase.

## Contact Us
Our support team is available Monday through Friday, 9 AM to 6 PM EST. 
Reach us via live chat, email at support@example.com, or call 1-800-EXAMPLE.
"""

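import re

def chunk_by_paragraph(text, overlap_sentences=1):
    """Split text into paragraph chunks, each prefixed with the tail of the previous paragraph."""
    paragraphs = [p.strip() for p in text.split('\n\n') if p.strip()]
    chunks = []
    
    for i, para in enumerate(paragraphs):
        chunk = para
        
        # Add overlap: the last N sentences of the previous paragraph.
        # Split on whitespace after sentence punctuation so that sentences
        # ending at a line break are separated too.
        if i > 0 and overlap_sentences > 0:
            prev_sentences = re.split(r'(?<=[.!?])\s+', paragraphs[i-1])[-overlap_sentences:]
            # Drop the trailing period so the separator does not become "...."
            chunk = ' '.join(prev_sentences).rstrip('.') + '... ' + chunk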
        
        chunks.append({
            'chunk_id': i,
            'text': chunk,
            'length': len(chunk.split())
        })
    
    return chunks

chunks = chunk_by_paragraph(long_doc)

print(f"Document split into {len(chunks)} chunks:\n")
for chunk in chunks:
    print(f"Chunk {chunk['chunk_id']} ({chunk['length']} words):")
    print(f"  {chunk['text'][:80]}...")
    print()
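
Paragraph chunking produces uneven chunk sizes. A common alternative is a fixed-size sliding window over words, sketched below. The function name `chunk_by_words`, the default sizes, and the placeholder text are assumptions for illustration; real pipelines often count model tokens rather than whitespace-separated words.

```python
def chunk_by_words(text, chunk_size=40, overlap=10):
    """Split text into windows of `chunk_size` words that share `overlap` words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reaches the end of the text
    return chunks

sample_text = " ".join(f"w{i}" for i in range(100))  # 100 placeholder words
word_chunks = chunk_by_words(sample_text, chunk_size=40, overlap=10)
for i, c in enumerate(word_chunks):
    tokens = c.split()
    print(f"Chunk {i}: {len(tokens)} words, {tokens[0]}..{tokens[-1]}")
```

The overlap keeps a sentence that straddles a boundary retrievable from at least one chunk, at the cost of storing some text twice.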

## Part 7: TODO - Implement Reranking

Add a second-stage reranker for better precision.

In [None]:
# Implement a simple reranking function
# In production, use a cross-encoder model like sentence-transformers/ms-marco-MiniLM-L-6-v2

def rerank(query, candidates, top_k=3):
    """
    Rerank candidates based on query-document relevance.
    
    This simple implementation uses word overlap scoring.
    For production: use a cross-encoder that takes (query, doc) pairs.
    
    Args:
        query: search query
        candidates: list of (doc_id, initial_score) from retrieval
        top_k: number of results to return
    
    Returns:
        reranked list of (doc_id, score)
    """
    query_words = set(query.lower().split())
    scored = []
    
    for doc_id, initial_score in candidates:
        # Get document text
        doc_text = df[df['id'] == doc_id]['text'].values[0].lower()
        doc_words = set(doc_text.split())
        
        # Compute word overlap as simple relevance boost
        overlap = len(query_words & doc_words)
        
        # Combine initial score with overlap boost
        rerank_score = initial_score * 0.7 + (overlap / max(len(query_words), 1)) * 0.3
        scored.append((doc_id, rerank_score))
    
    # Sort by new score
    scored.sort(key=lambda x: x[1], reverse=True)
    return scored[:top_k]

# Test reranking
query = "how to secure my account"
initial_results = semantic_search(query, k=10)
reranked = rerank(query, initial_results, top_k=3)

print(f"Query: '{query}'")
print("\nInitial Top 5:")
for doc_id, score in initial_results[:5]:
    print(f"  [{score:.3f}] {df[df['id']==doc_id]['title'].values[0]}")

print("\nAfter Reranking:")
for doc_id, score in reranked:
    print(f"  [{score:.3f}] {df[df['id']==doc_id]['title'].values[0]}")

## Self-Check

Run the cell below to verify your retrieval system and evaluation metrics are correct.

In [None]:
# SELF-CHECK: Verify your retrieval system
results = semantic_search("password reset", k=3)
assert len(results) == 3, "semantic_search should return k results"
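# Cosine similarity lies in [-1, 1]; SVD embeddings can yield tiny negative values
# from floating-point error, so check the true range with a small tolerance.
assert all(-1 - 1e-6 <= score <= 1 + 1e-6 for _, score in results), "Cosine similarity scores should be in [-1, 1]"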
reranked_test = rerank("password reset", results, top_k=3)
assert len(reranked_test) == 3, "rerank should return top_k results"
print(f"✅ Self-check passed! Retrieved {len(results)} results, reranked to {len(reranked_test)}")

## Part 8: Stakeholder Summary

### TODO: Write a 3-bullet summary (~100 words) for the PM

Template:
- **Keyword vs Semantic:** Keyword search matches exact words; semantic search understands meaning (e.g., "money back" finds "refund").
- **Quality metrics:** We measure Precision@K (accuracy of top results), Recall@K (coverage), and MRR (how fast we find relevant docs).
- **Recommendation:** Use hybrid search (alpha=____) to combine keyword precision with semantic understanding.

### Your Summary:

*Write your explanation here...*

---

## Key Takeaways

1. **Keyword search** is fast and precise for exact matches
2. **Semantic search** understands meaning and synonyms
3. **Hybrid search** combines both for best results
4. **Evaluation metrics:** Precision@K, Recall@K, MRR
5. **Chunking matters** for RAG quality

### Next Steps
- Explore the interactive playground
- Complete the quiz
- Try sentence-transformers for production-quality semantic search