# Information Retrieval & Search - Traditional NLP
## Interview Preparation Notebook for Senior Applied AI Scientist (Retail Banking)

---

**Goal**: Demonstrate mastery of traditional search and retrieval techniques (TF-IDF, BM25) that form the foundation for modern RAG systems.

**Interview Signal**: This notebook shows you understand the fundamentals that underpin vector search and RAG - critical for anyone building AI systems that need to retrieve relevant information.

## 1. Business Context (Banking Lens)

### Why Information Retrieval Exists in Retail Banking

Banks accumulate vast knowledge bases that employees and customers need to search effectively.

| Use Case | Content | Users | Search Challenge |
|----------|---------|-------|------------------|
| **Policy Search** | 10,000+ internal policies | All employees | Find relevant policy quickly |
| **FAQ Matching** | Customer service FAQ | Call center agents | Match customer question to answer |
| **Document Discovery** | Contracts, agreements | Legal/compliance | Find all related documents |
| **Knowledge Base** | Product documentation | Customers, advisors | Self-service answers |
| **Regulatory Search** | Compliance regulations | Compliance officers | Find applicable rules |

### The Business Problem

> "Our call center agents spend 3 minutes searching for the right policy to answer customer questions. How do we reduce this to 10 seconds?"

**Without good search**: Agents guess keywords, browse hierarchies, call colleagues  
**With good search**: Type natural question, get ranked relevant policies instantly

### Real Banking Example

**Query**: "What is the daily ATM withdrawal limit for premium checking accounts?"

**Expected Result**: Policy document section on withdrawal limits, ranked by relevance

### This is Pre-RAG Foundation

Modern RAG systems combine:
1. **Retrieval** (this notebook) - Find relevant documents
2. **Generation** (LLM) - Generate answer from retrieved docs

Understanding traditional retrieval is essential for building and debugging RAG systems.

### Interview Framing

```
"Before jumping to vector search and embeddings, I always establish a BM25 baseline.
It's fast, interpretable, and surprisingly competitive. In banking use cases, BM25
can match or even outperform dense retrieval, because queries often contain specific
terms like 'Regulation E' or 'wire transfer' that benefit from exact matching."
```

## 2. Problem Definition

### Task Type: Ranking/Retrieval

| Aspect | Description |
|--------|-------------|
| **Input** | Query + Document corpus |
| **Output** | Ranked list of documents |
| **Core Challenge** | Define "relevance" mathematically |
| **Constraint** | Must be fast (sub-second for large corpora) |

### Sparse vs Dense Retrieval

| Approach | Representation | Similarity | Strengths |
|----------|---------------|------------|----------|
| **Sparse (BM25)** | Term frequencies | Lexical overlap | Exact match, fast, interpretable |
| **Dense (Embeddings)** | Dense vectors | Semantic similarity | Handles synonyms, paraphrase |
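
To make the "lexical overlap" column concrete, here is a minimal sketch. It assumes plain whitespace tokenization, and the helper `lexical_overlap` is hypothetical (it is not part of the notebook's pipeline). It illustrates why a purely sparse view can score a paraphrased query at zero even when the intent is the same.

```python
def lexical_overlap(query, doc):
    """Fraction of distinct query terms that also appear in the document."""
    q_terms = set(query.lower().split())
    d_terms = set(doc.lower().split())
    return len(q_terms & d_terms) / len(q_terms) if q_terms else 0.0

# Hypothetical document text for illustration
doc = "daily atm withdrawal limit for premium checking accounts"

# Same vocabulary as the document: every query term is found
print(lexical_overlap("ATM withdrawal limit", doc))

# Same intent, different words: no shared terms, so a sparse retriever sees no match
print(lexical_overlap("cash machine maximum", doc))
```

This gap is what dense embeddings (or query expansion with synonyms) are meant to close.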

### Why Traditional Retrieval Before Dense Methods

1. **Exact keyword matching**: "Regulation E" must match "Regulation E"
2. **Interpretable**: Can explain why document was retrieved (shared terms)
3. **Fast at scale**: Inverted indices only touch documents that contain a query term, which typically keeps retrieval in the millisecond range even over millions of docs
4. **No training required**: Works out of the box
5. **Strong baseline**: Often hard to beat significantly with dense methods
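
Point 3 can be illustrated with a toy inverted index. This is a minimal sketch under stated assumptions: the three snippets in `toy_docs` are hypothetical, tokenization is a plain whitespace split, and `candidates` uses AND semantics (the BM25 retriever later in this notebook uses OR semantics instead).

```python
from collections import defaultdict

# Toy corpus (hypothetical snippets, not the notebook's knowledge base)
toy_docs = {
    "D1": "regulation e limits customer liability for fraud",
    "D2": "wire transfer fees for domestic transfers",
    "D3": "regulation e covers electronic fund transfers",
}

# Build the inverted index: term -> set of doc ids that contain it
inverted = defaultdict(set)
for doc_id, text in toy_docs.items():
    for term in text.split():
        inverted[term].add(doc_id)

def candidates(query):
    """Return ids of documents that contain every query term."""
    postings = [inverted.get(term, set()) for term in query.lower().split()]
    return set.intersection(*postings) if postings else set()

print(sorted(candidates("Regulation E")))
print(sorted(candidates("wire transfer")))
```

Lookup cost depends on the length of the posting lists for the query terms rather than on the corpus size, which is where the speed advantage comes from.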

### Mathematical Foundation: TF-IDF

$$\text{TF-IDF}(t, d, D) = \text{TF}(t, d) \times \text{IDF}(t, D)$$

Where:
- $\text{TF}(t, d)$ = Term frequency of term $t$ in document $d$
- $\text{IDF}(t, D) = \log\frac{|D|}{|\{d \in D : t \in d\}|}$ = Inverse document frequency
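
The formula can be checked by hand on a tiny corpus. The sketch below assumes raw counts for TF (one common variant) and the unsmoothed IDF exactly as written above; the corpus and the helper names `tf`, `idf`, and `tf_idf` are hypothetical.

```python
import math
from collections import Counter

# Hypothetical three-document corpus, already tokenized
corpus = {
    "d1": ["atm", "withdrawal", "limit", "atm"],
    "d2": ["wire", "transfer", "limit"],
    "d3": ["overdraft", "fee"],
}

def tf(term, tokens):
    """Raw count of the term in one document."""
    return Counter(tokens)[term]

def idf(term, corpus):
    """log(|D| / df); returns 0.0 for a term that appears in no document."""
    df = sum(1 for tokens in corpus.values() if term in tokens)
    return math.log(len(corpus) / df) if df else 0.0

def tf_idf(term, doc_id, corpus):
    return tf(term, corpus[doc_id]) * idf(term, corpus)

print(tf_idf("atm", "d1", corpus))    # 2 * ln(3/1): frequent here, rare elsewhere
print(tf_idf("limit", "d1", corpus))  # 1 * ln(3/2): appears in two of three docs
```

With its default settings, scikit-learn's `TfidfVectorizer` uses a smoothed IDF and L2-normalizes each document vector, so its values will differ from this hand computation.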

### BM25 (Best Match 25)

$$\text{BM25}(q, d) = \sum_{t \in q} \text{IDF}(t) \cdot \frac{f(t, d) \cdot (k_1 + 1)}{f(t, d) + k_1 \cdot (1 - b + b \cdot \frac{|d|}{\text{avgdl}})}$$

Key improvements over TF-IDF:
- Term frequency saturation (diminishing returns)
- Document length normalization
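
Both effects can be seen by isolating the term-frequency fraction of the formula. This is a minimal sketch with the IDF factor left out; `bm25_tf_component` is a hypothetical helper name, and the defaults `k1=1.5`, `b=0.75` are simply the values used later in this notebook.

```python
def bm25_tf_component(tf, doc_len, avgdl, k1=1.5, b=0.75):
    """Term-frequency part of BM25 for a single term (IDF factored out)."""
    length_norm = 1 - b + b * (doc_len / avgdl)
    return tf * (k1 + 1) / (tf + k1 * length_norm)

# Saturation: each extra occurrence adds less, and the value stays below k1 + 1
for tf in (1, 2, 5, 20):
    print(tf, round(bm25_tf_component(tf, doc_len=100, avgdl=100), 3))

# Length normalization: same tf, but the longer document scores lower
short_doc = bm25_tf_component(3, doc_len=50, avgdl=100)
long_doc = bm25_tf_component(3, doc_len=200, avgdl=100)
print(round(short_doc, 3), round(long_doc, 3))
```

Setting `b=0` switches length normalization off entirely, while a larger `k1` delays saturation.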

## 3. Dataset

### Dataset: Banking Policy Documents (Simulated)

We'll create a simulated banking knowledge base with policy documents and FAQs.

In [None]:
# Install required packages
# !pip install scikit-learn nltk rank-bm25 numpy pandas matplotlib

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from collections import defaultdict, Counter
import re
import math
import warnings
warnings.filterwarnings('ignore')

# NLP libraries
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

# Sklearn for TF-IDF
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

nltk.download('punkt', quiet=True)
nltk.download('stopwords', quiet=True)
nltk.download('punkt_tab', quiet=True)

RANDOM_STATE = 42
np.random.seed(RANDOM_STATE)

print("Libraries loaded successfully")

In [None]:
# Create banking knowledge base
banking_documents = [
    {
        "id": "POL-001",
        "title": "ATM Withdrawal Limits Policy",
        "content": """Daily ATM withdrawal limits vary by account type. Standard checking accounts have a daily limit of $500. 
        Premium checking accounts have an increased daily limit of $1,000. Private banking clients can withdraw up to $2,500 per day. 
        Limits reset at midnight Eastern Time. Customers may request temporary limit increases by contacting customer service 
        at least 24 hours in advance. International ATM withdrawals are subject to additional daily limits of $300.""",
        "category": "Limits"
    },
    {
        "id": "POL-002",
        "title": "Wire Transfer Procedures",
        "content": """Domestic wire transfers can be initiated online, by phone, or in branch. The cutoff time for same-day 
        processing is 4:00 PM Eastern Time. International wire transfers require additional verification and have a 
        cutoff of 2:00 PM Eastern Time. Wire transfer fees are $25 for domestic and $45 for international transfers. 
        SWIFT code and IBAN are required for international transfers. Maximum wire transfer limit is $100,000 per day 
        without additional approval.""",
        "category": "Transfers"
    },
    {
        "id": "POL-003",
        "title": "Overdraft Protection Policy",
        "content": """Overdraft protection links your checking account to a savings account or line of credit. When your 
        checking balance is insufficient, funds are automatically transferred. There is a $10 transfer fee per 
        overdraft occurrence. Standard overdraft coverage charges $35 per item. Customers can opt out of overdraft 
        coverage for ATM and debit card transactions. Overdraft fees are limited to 4 per day maximum.""",
        "category": "Fees"
    },
    {
        "id": "POL-004",
        "title": "Mobile Check Deposit Policy",
        "content": """Mobile check deposit allows customers to deposit checks using the mobile app. Daily deposit limits 
        are $5,000 for standard accounts and $10,000 for premium accounts. Funds availability is typically next 
        business day for checks under $500 and 2 business days for larger amounts. Checks must be endorsed with 
        'For Mobile Deposit Only' and account number. Eligible checks include personal, business, and government checks.""",
        "category": "Deposits"
    },
    {
        "id": "POL-005",
        "title": "Account Closure Procedures",
        "content": """Customers may close their account at any time by visiting a branch or calling customer service. 
        All pending transactions must clear before closure. Outstanding loans or credit cards linked to the account 
        must be paid off or transferred. Early closure fee of $25 applies if account is closed within 90 days of opening. 
        Remaining balance will be mailed via check within 7-10 business days.""",
        "category": "Account Management"
    },
    {
        "id": "POL-006",
        "title": "Fraud Dispute Resolution",
        "content": """Customers should report suspected fraud immediately by calling 1-800-FRAUD or through online banking. 
        Provisional credit is typically issued within 10 business days for disputed transactions. Investigation may take 
        up to 45 days for domestic transactions and 90 days for international. Under Regulation E, customer liability is 
        limited to $50 if reported within 2 business days. Zero liability protection applies to credit card transactions.""",
        "category": "Fraud"
    },
    {
        "id": "POL-007",
        "title": "Interest Rate Disclosure",
        "content": """Savings account interest rates are variable and subject to change. Current APY for standard savings 
        is 0.50%. High-yield savings accounts earn 4.25% APY for balances over $10,000. Interest is compounded daily 
        and credited monthly. Certificate of Deposit rates are fixed for the term duration. Early withdrawal penalty 
        for CDs is 90 days of interest for terms under 1 year.""",
        "category": "Rates"
    },
    {
        "id": "POL-008",
        "title": "Zelle Payment Limits and Policies",
        "content": """Zelle enables instant person-to-person payments using email or phone number. Daily sending limit 
        is $2,500 and monthly limit is $10,000 for standard accounts. Premium accounts have $5,000 daily limits. 
        Zelle payments cannot be cancelled once sent. Recipient must be enrolled in Zelle to receive funds. 
        Business accounts have separate limits and terms.""",
        "category": "Transfers"
    },
    {
        "id": "POL-009",
        "title": "Safe Deposit Box Policy",
        "content": """Safe deposit boxes are available at select branches. Annual rental fees range from $50 for 
        small boxes to $300 for large boxes. Two keys are provided; replacement keys cost $25 each. Access is 
        available during regular banking hours. Contents are not insured by the bank; customers should obtain 
        separate insurance. Drilling fee of $150 applies if keys are lost.""",
        "category": "Services"
    },
    {
        "id": "POL-010",
        "title": "Foreign Currency Exchange",
        "content": """Foreign currency exchange is available at main branches. Exchange rates are updated daily and 
        include a spread. No commission for amounts under $1,000. Orders over $5,000 require 2 business days notice. 
        Available currencies include EUR, GBP, JPY, CAD, and 40 other currencies. Buyback rates are typically less 
        favorable than sell rates.""",
        "category": "Services"
    },
    {
        "id": "FAQ-001",
        "title": "How do I reset my online banking password?",
        "content": """To reset your online banking password, click 'Forgot Password' on the login page. Enter your 
        username and verify your identity using your registered phone number or email. You will receive a one-time 
        code to create a new password. Passwords must be 8-20 characters with at least one uppercase letter, one 
        number, and one special character.""",
        "category": "FAQ"
    },
    {
        "id": "FAQ-002",
        "title": "What is the routing number for wire transfers?",
        "content": """Our routing number for domestic wire transfers is 021000089. For ACH transfers and direct 
        deposits, use routing number 021000021. The routing number is also printed on the bottom left of your checks. 
        International wire transfers require our SWIFT code: BOFAUS3N. Always verify the routing number before 
        initiating transfers.""",
        "category": "FAQ"
    }
]

# Create DataFrame
docs_df = pd.DataFrame(banking_documents)
print(f"Knowledge base: {len(docs_df)} documents")
print(f"\nCategories: {docs_df['category'].unique()}")

In [None]:
# Sample queries for testing
test_queries = [
    "What is the ATM withdrawal limit for premium accounts?",
    "How do I send a wire transfer?",
    "What are the overdraft fees?",
    "How long does mobile deposit take?",
    "Report fraud on my account",
    "What is the savings account interest rate?",
    "Zelle payment limits",
    "routing number for direct deposit"
]

print("Test queries:")
for i, q in enumerate(test_queries, 1):
    print(f"  {i}. {q}")

## 4. Traditional NLP Pipeline

### 4.1 Text Preprocessing for Retrieval

In [None]:
class SearchPreprocessor:
    """
    Preprocessor for search/retrieval tasks.
    
    Key considerations:
    - Stemming helps match 'withdrawal' with 'withdrawals'
    - Remove stopwords to focus on content words
    - Lowercase for case-insensitive matching
    - BUT: Keep important banking terms intact
    """
    
    def __init__(self, use_stemming=True):
        self.use_stemming = use_stemming
        self.stemmer = PorterStemmer()
        self.stop_words = set(stopwords.words('english'))
        
        # Banking terms to preserve (don't stem)
        self.preserve_terms = {
            'atm', 'apy', 'ach', 'iban', 'swift', 'zelle', 'cd', 'cds',
            'regulation', 'fdic', 'overdraft'
        }
    
    def tokenize(self, text):
        """Tokenize and normalize text."""
        # Lowercase
        text = text.lower()
        
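        # Handle special patterns: "$1,000" -> "dollar1000"
        # (commas are dropped so the amount survives punctuation removal as one token)
        text = re.sub(r'\$([\d,]+)', lambda m: 'dollar' + m.group(1).replace(',', ''), text)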
        
        # Remove punctuation except hyphens in compound words
        text = re.sub(r'[^\w\s-]', ' ', text)
        
        # Tokenize
        tokens = word_tokenize(text)
        
        # Process tokens
        processed = []
        for token in tokens:
            # Skip stopwords
            if token in self.stop_words:
                continue
            
            # Skip very short tokens
            if len(token) < 2:
                continue
            
            # Preserve banking terms
            if token in self.preserve_terms:
                processed.append(token)
            elif self.use_stemming:
                processed.append(self.stemmer.stem(token))
            else:
                processed.append(token)
        
        return processed
    
    def preprocess(self, text):
        """Return preprocessed text as string."""
        return ' '.join(self.tokenize(text))

preprocessor = SearchPreprocessor(use_stemming=True)

# Demo
sample = "What is the daily ATM withdrawal limit for premium checking accounts?"
print(f"Original: {sample}")
print(f"Tokens: {preprocessor.tokenize(sample)}")
print(f"Processed: {preprocessor.preprocess(sample)}")

### 4.2 TF-IDF Retrieval

In [None]:
class TFIDFRetriever:
    """
    TF-IDF based document retriever.
    
    Uses cosine similarity between query and document TF-IDF vectors.
    """
    
    def __init__(self):
        self.vectorizer = TfidfVectorizer(
            stop_words='english',
            ngram_range=(1, 2),  # Unigrams and bigrams
            max_df=0.9,
            min_df=1
        )
        self.doc_vectors = None
        self.documents = None
    
    def index(self, documents):
        """Build TF-IDF index from documents."""
        self.documents = documents
        
        # Combine title and content for indexing
        texts = [f"{doc['title']} {doc['content']}" for doc in documents]
        
        # Build TF-IDF matrix
        self.doc_vectors = self.vectorizer.fit_transform(texts)
        
        print(f"Indexed {len(documents)} documents")
        print(f"Vocabulary size: {len(self.vectorizer.vocabulary_)}")
    
    def search(self, query, top_k=5):
        """Search for relevant documents."""
        # Transform query
        query_vector = self.vectorizer.transform([query])
        
        # Calculate cosine similarity
        similarities = cosine_similarity(query_vector, self.doc_vectors).flatten()
        
        # Get top-k indices
        top_indices = similarities.argsort()[::-1][:top_k]
        
        # Build results
        results = []
        for idx in top_indices:
            if similarities[idx] > 0:  # Only include if there's some match
                results.append({
                    'rank': len(results) + 1,
                    'doc_id': self.documents[idx]['id'],
                    'title': self.documents[idx]['title'],
                    'score': float(similarities[idx]),
                    'content_preview': self.documents[idx]['content'][:200] + '...'
                })
        
        return results

# Build TF-IDF index
tfidf_retriever = TFIDFRetriever()
tfidf_retriever.index(banking_documents)

In [None]:
# Test TF-IDF retrieval
query = "What is the ATM withdrawal limit for premium accounts?"

results = tfidf_retriever.search(query, top_k=3)

print(f"TF-IDF RETRIEVAL")
print(f"Query: {query}")
print("=" * 60)

for result in results:
    print(f"\nRank {result['rank']}: {result['title']}")
    print(f"  Score: {result['score']:.4f}")
    print(f"  Preview: {result['content_preview'][:100]}...")

### 4.3 BM25 Retrieval

BM25 improves on TF-IDF with:
- Term frequency saturation (diminishing returns)
- Document length normalization

In [None]:
class BM25Retriever:
    """
    BM25 (Okapi BM25) retriever - the gold standard for sparse retrieval.
    
    Parameters:
    - k1: Term frequency saturation (typical: 1.2-2.0)
    - b: Document length normalization (typical: 0.75)
    """
    
    def __init__(self, k1=1.5, b=0.75):
        self.k1 = k1
        self.b = b
        self.preprocessor = SearchPreprocessor(use_stemming=True)
        
        # Index structures
        self.documents = None
        self.doc_tokens = None
        self.doc_lengths = None
        self.avgdl = 0
        self.idf = {}
        self.inverted_index = defaultdict(list)
    
    def index(self, documents):
        """Build BM25 index."""
        self.documents = documents
        n_docs = len(documents)
        
        # Tokenize all documents
        self.doc_tokens = []
        self.doc_lengths = []
        doc_freqs = Counter()
        
        for doc in documents:
            text = f"{doc['title']} {doc['content']}"
            tokens = self.preprocessor.tokenize(text)
            self.doc_tokens.append(tokens)
            self.doc_lengths.append(len(tokens))
            
            # Count document frequencies
            unique_tokens = set(tokens)
            for token in unique_tokens:
                doc_freqs[token] += 1
        
        # Calculate average document length
        self.avgdl = sum(self.doc_lengths) / len(self.doc_lengths)
        
        # Calculate IDF for each term
        for term, df in doc_freqs.items():
            # BM25 IDF (Lucene-style variant: the +1 inside the log keeps IDF non-negative)
            self.idf[term] = math.log((n_docs - df + 0.5) / (df + 0.5) + 1)
        
        # Build inverted index with term frequencies
        for doc_idx, tokens in enumerate(self.doc_tokens):
            term_freqs = Counter(tokens)
            for term, freq in term_freqs.items():
                self.inverted_index[term].append((doc_idx, freq))
        
        print(f"Indexed {n_docs} documents")
        print(f"Vocabulary size: {len(self.idf)}")
        print(f"Average document length: {self.avgdl:.1f} tokens")
    
    def score(self, query_tokens, doc_idx):
        """Calculate BM25 score for a document."""
        doc_tokens = self.doc_tokens[doc_idx]
        doc_len = self.doc_lengths[doc_idx]
        term_freqs = Counter(doc_tokens)
        
        score = 0.0
        for term in query_tokens:
            if term not in self.idf:
                continue
            
            tf = term_freqs.get(term, 0)
            idf = self.idf[term]
            
            # BM25 scoring formula
            numerator = tf * (self.k1 + 1)
            denominator = tf + self.k1 * (1 - self.b + self.b * (doc_len / self.avgdl))
            score += idf * (numerator / denominator)
        
        return score
    
    def search(self, query, top_k=5):
        """Search for relevant documents."""
        query_tokens = self.preprocessor.tokenize(query)
        
        # Find candidate documents (those containing at least one query term)
        candidate_docs = set()
        for token in query_tokens:
            if token in self.inverted_index:
                for doc_idx, _ in self.inverted_index[token]:
                    candidate_docs.add(doc_idx)
        
        # Score candidates
        scores = []
        for doc_idx in candidate_docs:
            score = self.score(query_tokens, doc_idx)
            scores.append((doc_idx, score))
        
        # Sort by score
        scores.sort(key=lambda x: -x[1])
        
        # Build results
        results = []
        for rank, (doc_idx, score) in enumerate(scores[:top_k], 1):
            results.append({
                'rank': rank,
                'doc_id': self.documents[doc_idx]['id'],
                'title': self.documents[doc_idx]['title'],
                'score': score,
                'content_preview': self.documents[doc_idx]['content'][:200] + '...'
            })
        
        return results

# Build BM25 index
bm25_retriever = BM25Retriever(k1=1.5, b=0.75)
bm25_retriever.index(banking_documents)

In [None]:
# Test BM25 retrieval
query = "What is the ATM withdrawal limit for premium accounts?"

results = bm25_retriever.search(query, top_k=3)

print(f"BM25 RETRIEVAL")
print(f"Query: {query}")
print("=" * 60)

for result in results:
    print(f"\nRank {result['rank']}: {result['title']}")
    print(f"  Score: {result['score']:.4f}")
    print(f"  Preview: {result['content_preview'][:100]}...")

## 5. Model Training & Inference

In [None]:
# Compare TF-IDF vs BM25 on all test queries
def compare_retrievers(queries, tfidf_ret, bm25_ret, top_k=3):
    """Compare retrieval results from different methods."""
    
    results = []
    
    for query in queries:
        tfidf_results = tfidf_ret.search(query, top_k=top_k)
        bm25_results = bm25_ret.search(query, top_k=top_k)
        
        tfidf_docs = [r['doc_id'] for r in tfidf_results]
        bm25_docs = [r['doc_id'] for r in bm25_results]
        
        results.append({
            'query': query,
            'tfidf_top1': tfidf_docs[0] if tfidf_docs else None,
            'bm25_top1': bm25_docs[0] if bm25_docs else None,
            'agreement': tfidf_docs[0] == bm25_docs[0] if tfidf_docs and bm25_docs else False
        })
    
    return pd.DataFrame(results)

comparison = compare_retrievers(test_queries, tfidf_retriever, bm25_retriever)

print("RETRIEVER COMPARISON")
print("=" * 80)
print(comparison.to_string(index=False))

agreement_rate = comparison['agreement'].mean()
print(f"\nTop-1 Agreement Rate: {agreement_rate:.1%}")

In [None]:
# Production search interface
class ProductionSearchEngine:
    """
    Production-ready search engine with multiple retrieval methods.
    """
    
    def __init__(self, documents, default_method='bm25'):
        self.documents = {doc['id']: doc for doc in documents}
        self.default_method = default_method
        
        # Initialize retrievers
        self.tfidf = TFIDFRetriever()
        self.bm25 = BM25Retriever()
        
        # Build indices
        print("Building search indices...")
        self.tfidf.index(documents)
        self.bm25.index(documents)
        print("Search engine ready.")
    
    def search(self, query, top_k=5, method=None, min_score=0.0):
        """
        Search for documents matching query.
        
        Args:
            query: Search query string
            top_k: Number of results to return
            method: 'tfidf' or 'bm25' (default: bm25)
            min_score: Minimum relevance score to include
        """
        method = method or self.default_method
        
        # Input validation
        if not query or len(query.strip()) < 2:
            return {
                'status': 'error',
                'message': 'Query too short',
                'results': []
            }
        
        # Retrieve
        if method == 'tfidf':
            results = self.tfidf.search(query, top_k=top_k)
        else:
            results = self.bm25.search(query, top_k=top_k)
        
        # Filter by minimum score
        results = [r for r in results if r['score'] >= min_score]
        
        # Add full document content for top results
        for result in results:
            doc = self.documents[result['doc_id']]
            result['full_content'] = doc['content']
            result['category'] = doc['category']
        
        return {
            'status': 'success',
            'query': query,
            'method': method,
            'total_results': len(results),
            'results': results
        }
    
    def hybrid_search(self, query, top_k=5, tfidf_weight=0.3):
        """
        Combine TF-IDF and BM25 results with weighted scoring.
        """
        tfidf_results = self.tfidf.search(query, top_k=top_k*2)
        bm25_results = self.bm25.search(query, top_k=top_k*2)
        
        # Normalize scores
        def normalize(results):
            if not results:
                return {}
            max_score = max(r['score'] for r in results)
            return {r['doc_id']: r['score']/max_score if max_score > 0 else 0 
                   for r in results}
        
        tfidf_scores = normalize(tfidf_results)
        bm25_scores = normalize(bm25_results)
        
        # Combine scores
        all_docs = set(tfidf_scores.keys()) | set(bm25_scores.keys())
        combined = []
        
        for doc_id in all_docs:
            score = (tfidf_weight * tfidf_scores.get(doc_id, 0) + 
                    (1 - tfidf_weight) * bm25_scores.get(doc_id, 0))
            combined.append((doc_id, score))
        
        # Sort and return top-k
        combined.sort(key=lambda x: -x[1])
        
        results = []
        for rank, (doc_id, score) in enumerate(combined[:top_k], 1):
            doc = self.documents[doc_id]
            results.append({
                'rank': rank,
                'doc_id': doc_id,
                'title': doc['title'],
                'score': score,
                'content_preview': doc['content'][:200] + '...'
            })
        
        return {
            'status': 'success',
            'method': 'hybrid',
            'results': results
        }

# Initialize production search
search_engine = ProductionSearchEngine(banking_documents, default_method='bm25')

In [None]:
# Test production search
query = "How do I report fraud on my account?"

result = search_engine.search(query, top_k=3)

print("PRODUCTION SEARCH RESULT")
print("=" * 60)
print(f"Query: {result['query']}")
print(f"Method: {result['method']}")
print(f"Results found: {result['total_results']}")

for r in result['results']:
    print(f"\n{r['rank']}. [{r['doc_id']}] {r['title']}")
    print(f"   Category: {r['category']}")
    print(f"   Score: {r['score']:.4f}")
    print(f"   Preview: {r['content_preview'][:100]}...")

## 6. Evaluation Strategy

### Key Retrieval Metrics

| Metric | Formula | Interpretation |
|--------|---------|----------------|
| **Precision@K** | Relevant in top K / K | Quality of top results |
| **Recall@K** | Relevant in top K / Total relevant | Coverage |
| **MRR** | 1 / rank of first relevant | How quickly we find something relevant |
| **NDCG@K** | DCG@K / IDCG@K | Position-weighted relevance |
| **MAP** | Mean of Average Precision | Overall ranking quality |
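
MAP is the one metric in the table that needs a per-query intermediate step (Average Precision). A minimal sketch, assuming binary relevance; the function names and the two example rankings are hypothetical and are not outputs of the retrievers in this notebook.

```python
def average_precision(retrieved, relevant):
    """AP for one query: Precision@i summed over ranks i that hold a relevant
    doc, divided by the total number of relevant docs."""
    if not relevant:
        return 0.0
    hits, precision_sum = 0, 0.0
    for i, doc_id in enumerate(retrieved, 1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / i
    return precision_sum / len(relevant)

def mean_average_precision(runs):
    """MAP: mean AP over (retrieved, relevant) pairs, one pair per query."""
    if not runs:
        return 0.0
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

# Hypothetical rankings for two queries
runs = [
    (["POL-002", "POL-008", "FAQ-002"], {"POL-002", "FAQ-002"}),  # hits at ranks 1 and 3
    (["POL-003", "POL-001"], {"POL-001"}),                        # hit at rank 2
]
print(mean_average_precision(runs))
```

Unlike MRR, AP keeps rewarding the ranker for every additional relevant document, which matters for queries such as wire transfers where more than one document is relevant.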

In [None]:
def calculate_retrieval_metrics(results, relevant_docs, k=5):
    """
    Calculate standard retrieval metrics.
    
    Args:
        results: List of retrieved doc_ids in rank order
        relevant_docs: Set of relevant doc_ids (ground truth)
        k: Cutoff for @K metrics
    """
    results_at_k = results[:k]
    
    # Precision@K
    relevant_retrieved = sum(1 for doc in results_at_k if doc in relevant_docs)
    precision_at_k = relevant_retrieved / k if k > 0 else 0
    
    # Recall@K
    recall_at_k = relevant_retrieved / len(relevant_docs) if relevant_docs else 0
    
    # Reciprocal rank of the first relevant result; its mean over queries is MRR
    mrr = 0
    for i, doc in enumerate(results, 1):
        if doc in relevant_docs:
            mrr = 1 / i
            break
    
    # NDCG@K
    def dcg(relevances):
        return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances))
    
    relevances = [1 if doc in relevant_docs else 0 for doc in results_at_k]
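    # Ideal ranking: every relevant doc ranked first, capped at k
    # (built from the ground truth, not only from the docs we happened to retrieve)
    ideal_relevances = [1] * min(len(relevant_docs), k)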
    
    dcg_score = dcg(relevances)
    idcg_score = dcg(ideal_relevances)
    ndcg_at_k = dcg_score / idcg_score if idcg_score > 0 else 0
    
    return {
        f'Precision@{k}': precision_at_k,
        f'Recall@{k}': recall_at_k,
        'MRR': mrr,
        f'NDCG@{k}': ndcg_at_k
    }

# Example evaluation
# Simulated ground truth: which documents are relevant for each query
ground_truth = {
    "What is the ATM withdrawal limit for premium accounts?": {"POL-001"},
    "How do I send a wire transfer?": {"POL-002", "FAQ-002"},
    "What are the overdraft fees?": {"POL-003"},
    "How long does mobile deposit take?": {"POL-004"},
    "Report fraud on my account": {"POL-006"},
    "What is the savings account interest rate?": {"POL-007"},
    "Zelle payment limits": {"POL-008"},
    "routing number for direct deposit": {"FAQ-002"}
}

# Evaluate BM25
print("RETRIEVAL EVALUATION (BM25)")
print("=" * 60)

all_metrics = []
for query, relevant in ground_truth.items():
    results = bm25_retriever.search(query, top_k=5)
    retrieved_ids = [r['doc_id'] for r in results]
    
    metrics = calculate_retrieval_metrics(retrieved_ids, relevant, k=3)
    metrics['query'] = query[:40] + '...'
    all_metrics.append(metrics)

metrics_df = pd.DataFrame(all_metrics)
print(metrics_df.to_string(index=False))

print(f"\nAVERAGE METRICS:")
for col in ['Precision@3', 'Recall@3', 'MRR', 'NDCG@3']:
    print(f"  {col}: {metrics_df[col].mean():.4f}")

## 7. Production Readiness Checklist

```
INDEX MANAGEMENT
[ ] Inverted index persistence (save/load)
[ ] Incremental index updates (add/remove docs)
[ ] Index versioning and rollback
[ ] Index size monitoring

QUERY PROCESSING
[ ] Query spell correction
[ ] Query expansion (synonyms)
[ ] Query logging for analytics
[ ] Rate limiting

RETRIEVAL QUALITY
[ ] Relevance feedback loop
[ ] Click-through rate tracking
[ ] A/B testing framework
[ ] Regular evaluation against benchmarks

PERFORMANCE
[ ] Sub-100ms latency for 1M docs
[ ] Caching for frequent queries
[ ] Load balancing for high traffic
[ ] Timeout handling

BANKING-SPECIFIC
[ ] Access control (who can search what)
[ ] Audit trail of searches
[ ] Sensitive document handling
[ ] Regulatory compliance tagging

MONITORING
[ ] Zero-result rate tracking
[ ] Query latency percentiles
[ ] Index freshness monitoring
[ ] Error rate alerting
```
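
Three of the checklist items (caching for frequent queries, zero-result rate tracking, latency percentiles) can be prototyped with a thin wrapper. This is a sketch only: `MonitoredSearch` is a hypothetical class, `search_fn` is assumed to be any callable mapping a query string to a list of results, and lowercasing the query as the cache key is an assumption that may not suit every retriever.

```python
import math
import time
from functools import lru_cache

class MonitoredSearch:
    """Wrap a search function with caching, latency and zero-result tracking."""

    def __init__(self, search_fn, cache_size=1024):
        # Results are stored as tuples so cached entries cannot be mutated by callers
        self._cached = lru_cache(maxsize=cache_size)(lambda q: tuple(search_fn(q)))
        self.latencies_ms = []
        self.total = 0
        self.zero_results = 0

    def search(self, query):
        start = time.perf_counter()
        results = list(self._cached(query.strip().lower()))
        self.latencies_ms.append((time.perf_counter() - start) * 1000)
        self.total += 1
        if not results:
            self.zero_results += 1
        return results

    def zero_result_rate(self):
        return self.zero_results / self.total if self.total else 0.0

    def latency_percentile(self, p):
        """Nearest-rank percentile of recorded latencies, p in (0, 100]."""
        if not self.latencies_ms:
            return 0.0
        ordered = sorted(self.latencies_ms)
        rank = max(1, math.ceil(p / 100 * len(ordered)))
        return ordered[rank - 1]

# Stand-in search function for demonstration only
demo = MonitoredSearch(lambda q: ["POL-001"] if "atm" in q else [])
demo.search("ATM limit")
demo.search("mortgage rates")
print(demo.zero_result_rate(), demo.latency_percentile(95))
```

In a real deployment the cache would need to be cleared whenever the index changes, otherwise it serves stale results.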

## 8. Modern LLM-Based Approach

### Dense Retrieval and RAG

**Dense Retrieval** replaces sparse term matching with semantic embedding similarity:

```python
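from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer('all-MiniLM-L6-v2')

# Index: embed all documents (banking_documents is the knowledge base from Section 3)
doc_embeddings = model.encode([doc['content'] for doc in banking_documents])

# Search: embed the query and rank documents by cosine similarity
query = "What is the ATM withdrawal limit for premium accounts?"
query_embedding = model.encode(query)
similarities = cosine_similarity([query_embedding], doc_embeddings).flatten()
top_indices = similarities.argsort()[::-1][:3]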
```

**RAG (Retrieval-Augmented Generation)** combines retrieval with LLM generation:

```python
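# Step 1: Retrieve relevant documents
# (search_engine is the ProductionSearchEngine from Section 5; it adds 'full_content' to each result)
docs = search_engine.search(query, top_k=3)['results']

# Step 2: Create prompt with context
context = "\n".join([doc['full_content'] for doc in docs])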
prompt = f"""Based on the following policy documents, answer the question.

Documents:
{context}

Question: {query}

Answer:"""

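# Step 3: Generate answer with LLM
# (`llm` is a placeholder for whichever client you use; `generate` is not a specific library's API)
answer = llm.generate(prompt)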
```

### Hybrid Search

Modern systems combine sparse and dense:
```python
final_score = alpha * bm25_score + (1 - alpha) * dense_score
```
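
The weighted sum only behaves sensibly when both scores live on a comparable scale: BM25 scores are unbounded, while cosine similarities fall in [-1, 1]. The sketch below shows one possible way to handle this, using per-query min-max normalization, plus reciprocal rank fusion (RRF) as a rank-based alternative that avoids score scales altogether. The function names, the example scores, and the choices `alpha=0.7` and `k=60` are assumptions for illustration, not values taken from this notebook.

```python
def min_max(scores):
    """Rescale a {doc_id: score} dict to the [0, 1] range."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 0.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}

def hybrid_scores(bm25, dense, alpha=0.5):
    """Weighted sum of normalized scores; a doc missing from one list gets 0 there."""
    bm25_n, dense_n = min_max(bm25), min_max(dense)
    doc_ids = set(bm25) | set(dense)
    return {
        d: alpha * bm25_n.get(d, 0.0) + (1 - alpha) * dense_n.get(d, 0.0)
        for d in doc_ids
    }

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of doc ids using only rank positions."""
    fused = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k + rank)
    return fused

# Hypothetical scores for a single query.
bm25 = {"POL-001": 12.0, "POL-002": 4.0, "POL-003": 8.0}
dense = {"POL-001": 0.60, "POL-002": 0.80, "POL-004": 0.30}

weighted = hybrid_scores(bm25, dense, alpha=0.7)
print(sorted(weighted, key=weighted.get, reverse=True))

fused = reciprocal_rank_fusion([
    sorted(bm25, key=bm25.get, reverse=True),
    sorted(dense, key=dense.get, reverse=True),
])
print(sorted(fused, key=fused.get, reverse=True))
```

The weighted sum needs `alpha` tuned on labeled queries; RRF has fewer knobs, which can make it a convenient starting point when no labels exist yet.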

In [None]:
# Build a RAG prompt from retrieved documents
def create_rag_prompt(query, retrieved_docs, max_context_length=2000):
    """
    Create RAG prompt for question answering.
    
    Banking considerations:
    - Include document IDs for attribution
    - Instruct model to cite sources
    - Handle case when no relevant docs found
    """
    
    if not retrieved_docs:
        # Fallback message for the user, not a prompt: callers should skip the LLM call
        return ("I could not find any relevant documents to answer this question. "
                "Please contact customer service for assistance.")
    
    # Build context from retrieved docs
    context_parts = []
    total_length = 0
    
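    for doc in retrieved_docs:
        doc_text = f"[{doc['doc_id']}] {doc['title']}\n{doc['full_content']}"
        # The budget is measured in characters, not model tokens
        if total_length + len(doc_text) > max_context_length:
            if context_parts:
                break
            # The top-ranked doc alone exceeds the budget: truncate it
            # rather than send the model an empty context
            doc_text = doc_text[:max_context_length]
        context_parts.append(doc_text)
        total_length += len(doc_text)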
    
    context = "\n\n---\n\n".join(context_parts)
    
    prompt = f"""You are a banking customer service assistant. Answer the customer's question 
based ONLY on the information provided in the policy documents below.

Rules:
1. Only use information from the provided documents
2. Cite the document ID (e.g., [POL-001]) when referencing specific information
3. If the documents don't contain enough information, say so
4. Be concise and direct

Policy Documents:
{context}

Customer Question: {query}

Answer:"""
    
    return prompt

# Example
query = "What is the daily ATM limit for premium accounts?"
results = search_engine.search(query, top_k=2)['results']

prompt = create_rag_prompt(query, results)

print("RAG PROMPT EXAMPLE")
print("=" * 60)
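# Append an ellipsis only when the prompt was actually truncated
print(prompt if len(prompt) <= 1500 else prompt[:1500] + "...")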

## 9. Traditional vs LLM Decision Matrix

| Dimension | Sparse (BM25) | Dense (Embeddings) | Hybrid |
|-----------|---------------|-------------------|--------|
| **Exact Match** | Excellent | Poor | Good |
| **Semantic Match** | Poor | Excellent | Good |
| **Speed** | Very fast (typically <10ms) | Slower (typically 50-100ms) | Medium |
| **Index Size** | Small | Large (vectors) | Larger |
| **Out-of-vocabulary** | Fails (no matching term in the index) | Handles well (subword tokenization) | Handles well |
| **Interpretability** | High (term overlap) | Low (embedding space) | Medium |
| **Training Required** | No | Yes (or pretrained) | Partial |

### When to Use Each Approach

**Use BM25 (Sparse)**:
- Queries contain specific terms ("Regulation E", "wire transfer")
- Need interpretable results (why was this doc retrieved?)
- Index size is a concern
- Building a baseline quickly

**Use Dense Retrieval**:
- Semantic matching needed ("send money" = "transfer funds")
- User queries are natural language questions
- Multilingual requirements

**Use Hybrid**:
- Both exact-term and semantic matching must work well
- Production system where accuracy matters
- Diverse query types (some keyword, some semantic)

## 10. Interview Soundbites

### Ready-to-Say Statements

**On BM25:**
> "BM25 is my go-to baseline for any retrieval task. It's fast, requires no training, and surprisingly hard to beat. The key improvements over TF-IDF are term frequency saturation - seeing a word 10 times isn't 10x better than seeing it once - and document length normalization."

**On Sparse vs Dense:**
> "Dense retrieval excels at semantic matching, but sparse retrieval wins on exact terms. In banking, when someone searches 'Regulation E', they need documents containing those exact words. That's why I always start with BM25 and consider dense as an enhancement, not a replacement."

**On Hybrid Search:**
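> "In production, I combine BM25 and dense retrieval. BM25 handles keyword queries and specific terms; dense handles paraphrases and semantic similarity. The weighted combination typically outperforms either alone by roughly 10-15%, though the exact gain depends on the corpus and query mix and should be measured on a labeled evaluation set."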

**On Evaluation:**
> "MRR is my primary metric for single-answer queries like FAQ search - it tells me how quickly users find what they need. For comprehensive search where multiple docs are relevant, I use NDCG because it rewards getting relevant docs ranked higher."
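
To back this answer up at a whiteboard, both metrics fit in a few lines. The sketch below is a minimal version under stated assumptions: the function names and the example document ids are hypothetical, and NDCG uses the linear-gain form (relevance divided by log2 of rank + 1) rather than the exponential-gain variant.

```python
import math

def mean_reciprocal_rank(ranked_lists, relevant_sets):
    """Average of 1/rank of the first relevant doc per query (0 if none found)."""
    if not ranked_lists:
        return 0.0
    total = 0.0
    for ranked, relevant in zip(ranked_lists, relevant_sets):
        for rank, doc_id in enumerate(ranked, start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break
    return total / len(ranked_lists)

def ndcg_at_k(ranked, gains, k):
    """NDCG@k with linear gain; gains maps doc_id -> graded relevance."""
    dcg = sum(
        gains.get(doc_id, 0) / math.log2(pos + 1)
        for pos, doc_id in enumerate(ranked[:k], start=1)
    )
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(pos + 1) for pos, g in enumerate(ideal, start=1))
    return dcg / idcg if idcg else 0.0

# Three hypothetical FAQ queries: first relevant doc at rank 2, rank 1, and not found.
ranked_lists = [["POL-002", "POL-001"], ["POL-003", "POL-004"], ["POL-005"]]
relevant_sets = [{"POL-001"}, {"POL-003"}, {"POL-009"}]
print(f"MRR: {mean_reciprocal_rank(ranked_lists, relevant_sets):.3f}")

# One query with graded relevance: the best doc (B) is ranked second.
gains = {"A": 1, "B": 3, "C": 2}
print(f"NDCG@3: {ndcg_at_k(['A', 'B', 'C'], gains, k=3):.3f}")
```

MRR only looks at the first relevant hit, which is why it suits single-answer queries; NDCG credits every relevant document, discounted by position.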

**On RAG Architecture:**
> "RAG decouples retrieval from generation. The retriever finds relevant context, the LLM synthesizes an answer. This gives us the best of both: factual grounding from retrieval and fluent answers from generation. But the retriever quality is critical - garbage in, garbage out."

**On Production:**
> "Retrieval latency is critical for user experience. BM25 with inverted indices typically gives sub-10ms retrieval over millions of documents. Dense retrieval with vector search is typically slower, around 50-100ms, depending on the encoder, hardware, and index type. For real-time applications, I often use BM25 for initial candidate generation, then re-rank only those candidates with dense embeddings."
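
The two-stage pattern in that answer can be sketched as follows. This is an illustration only: `sparse_score` and `dense_score` are hypothetical callables standing in for a BM25 scorer and an embedding-similarity scorer, and the lookup tables replace real models so the example stays self-contained.

```python
def retrieve_then_rerank(query, doc_ids, sparse_score, dense_score,
                         n_candidates=100, top_k=10):
    """Stage 1: cheap sparse scoring over all docs. Stage 2: dense re-rank of survivors."""
    candidates = sorted(
        doc_ids, key=lambda d: sparse_score(query, d), reverse=True
    )[:n_candidates]
    reranked = sorted(
        candidates, key=lambda d: dense_score(query, d), reverse=True
    )
    return reranked[:top_k]

# Hypothetical precomputed scores standing in for real scorers.
SPARSE = {"A": 5.0, "B": 3.0, "C": 1.0, "D": 0.5}
DENSE = {"A": 0.20, "B": 0.90, "C": 0.95, "D": 0.99}

result = retrieve_then_rerank(
    "atm limit",
    list(SPARSE),
    sparse_score=lambda q, d: SPARSE[d],
    dense_score=lambda q, d: DENSE[d],
    n_candidates=2,
    top_k=2,
)
print(result)
```

Note that document D has the highest dense score but never reaches the re-ranker: the first stage caps recall, so `n_candidates` should be chosen generously relative to `top_k`.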

**On Failure Modes:**
> "The biggest failure mode is vocabulary mismatch. If users say 'send money' but docs say 'wire transfer', the query and the document share no terms, so BM25 gives the document a score of zero. Solutions: query expansion with synonyms, embedding-based retrieval, or training on user click data to learn the mapping."

---

### Common Interview Questions

**Q: What's the difference between TF-IDF and BM25?**
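> BM25 adds two improvements over plain TF-IDF: (1) term frequency saturation via the k1 parameter - each additional occurrence of a term adds less to the score, and (2) document length normalization via the b parameter - term frequency is scaled by document length relative to the corpus average, so long documents don't outrank short ones just by containing more words. These make BM25 more robust in practice.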
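
Both effects are visible in the term-frequency part of the BM25 formula on its own (the IDF factor is left out here). The sketch below uses the standard Okapi BM25 form; the defaults `k1=1.5` and `b=0.75` are commonly used values and an assumption here, not necessarily the settings used elsewhere in this notebook.

```python
def bm25_tf_component(tf, doc_len, avg_doc_len, k1=1.5, b=0.75):
    """Term-frequency factor of BM25 for one term in one document (IDF omitted)."""
    length_norm = 1 - b + b * (doc_len / avg_doc_len)
    return tf * (k1 + 1) / (tf + k1 * length_norm)

# Saturation: the factor grows with tf but can never reach k1 + 1 = 2.5.
for tf in (1, 2, 5, 10, 100):
    print(f"tf={tf:>3}  factor={bm25_tf_component(tf, 100, 100):.3f}")

# Length normalization: the same tf counts for less in a longer document.
print(f"average-length doc: {bm25_tf_component(1, 100, 100):.3f}")
print(f"double-length doc:  {bm25_tf_component(1, 200, 100):.3f}")
```

Setting `b=0` switches length normalization off entirely, and a larger `k1` delays saturation, which is a useful way to explain the two parameters in an interview.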

**Q: How do you handle synonyms in sparse retrieval?**
> Options: (1) Query expansion - add synonyms to query at search time, (2) Index expansion - add synonyms to documents at index time, (3) Stemming/lemmatization - normalize to common form, (4) Hybrid with dense retrieval which handles synonyms naturally.
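
Option (1), query-time expansion, is the easiest to demonstrate. The sketch below is a minimal version: the synonym table is a hypothetical hand-built example, and a real system would more likely draw on a curated banking thesaurus or on mappings mined from click logs.

```python
# Hypothetical synonym table; the entries are illustrative only.
SYNONYMS = {
    "send": ["transfer", "wire"],
    "money": ["funds"],
}

def expand_query(query, synonyms):
    """Lowercase and split the query, then append synonyms of each token once."""
    tokens = query.lower().split()
    expanded = list(tokens)
    for token in tokens:
        for synonym in synonyms.get(token, []):
            if synonym not in expanded:
                expanded.append(synonym)
    return expanded

print(expand_query("Send money abroad", SYNONYMS))
# ['send', 'money', 'abroad', 'transfer', 'wire', 'funds']
```

Expansion improves recall but can hurt precision, so expanded terms are often given a lower weight than the user's original terms.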

**Q: How does retrieval fit into RAG?**
> RAG stands for Retrieval-Augmented Generation. Retrieval finds relevant documents, which become context for the LLM. The LLM generates an answer grounded in that context. Retrieval quality directly impacts answer quality - poor retrieval means the LLM lacks the information to answer correctly.

In [None]:
print("""
╔══════════════════════════════════════════════════════════════════╗
║                    NOTEBOOK SUMMARY                               ║
╠══════════════════════════════════════════════════════════════════╣
║  Task: Information Retrieval / Search                            ║
║  Approach: Traditional NLP (TF-IDF, BM25)                        ║
║  Banking Use: Policy search, FAQ matching, document discovery    ║
║                                                                  ║
║  Key Takeaways:                                                  ║
║  1. BM25 is the gold standard baseline for sparse retrieval      ║
║  2. Term saturation + length normalization improve on TF-IDF     ║
║  3. Hybrid (sparse + dense) often best in production             ║
║  4. MRR/NDCG for evaluation, not just precision                  ║
║  5. This is the foundation for RAG systems                       ║
╚══════════════════════════════════════════════════════════════════╝
""")