# Information Retrieval with RAG - Modern Approaches
## Interview Preparation Notebook for Senior Applied AI Scientist (Retail Banking)

---

**Goal**: Demonstrate mastery of dense retrieval, hybrid search, and Retrieval-Augmented Generation (RAG) pipelines for knowledge-intensive banking applications.

**Interview Signal**: This notebook shows you understand the full RAG stack from embedding models to reranking to generation, with production considerations for banking.

## 1. Business Context (Banking Lens)

### Why RAG Now?

| Traditional IR Limitation | RAG Solution |
|--------------------------|---------------|
| Keyword mismatch ("loan" vs "credit") | Semantic understanding |
| Returns documents, not answers | Synthesized responses |
| Can't reason across documents | Multi-hop reasoning |
| No conversational context | Maintains dialogue state |

### Banking Use Cases

1. **Policy Q&A**: "What's our policy on foreign wire transfers over $10K?"
2. **Compliance Assistant**: Search regulations and explain applicability
3. **Customer Service Bot**: Answer product questions from knowledge base
4. **Internal Knowledge Search**: Find procedures across 10K+ internal documents
5. **Research Assistant**: Query earnings calls, SEC filings, analyst reports

### Why Not Just Use LLM Knowledge?

**Banking Requires**:
- **Recency**: LLM training data is stale; policies change weekly
- **Accuracy**: Can't hallucinate compliance rules
- **Auditability**: Must cite sources for regulators
- **Proprietary data**: Internal docs not in LLM training

## 2. Problem Definition

### RAG Architecture Overview

```
Query → [Retriever] → Top-K Documents → [Reranker] → Top-N Documents → [Generator] → Answer
         ↑                                                              ↑
    Embedding Model                                                  LLM + Context
```

### Retrieval Approaches

| Approach | How It Works | Strengths | Weaknesses |
|----------|-------------|-----------|------------|
| **Sparse (BM25)** | Term matching weighted by TF-IDF and document length | Fast, interpretable | Keyword mismatch |
| **Dense (DPR/SBERT)** | Semantic embedding similarity | Understands meaning | Misses exact matches |
| **Hybrid** | Combine sparse + dense | Best of both | More complex |
| **Learned Sparse (SPLADE)** | Neural term weighting | Interpretable + semantic | Newer, less tooling |

In [None]:
# Install required packages
# !pip install sentence-transformers faiss-cpu numpy pandas

In [None]:
import numpy as np
import pandas as pd
from collections import Counter
import math
import warnings
warnings.filterwarnings('ignore')

# Sample banking knowledge base
knowledge_base = [
    {
        "id": "policy_001",
        "title": "Wire Transfer Policy",
        "content": "International wire transfers exceeding $10,000 require additional verification. The customer must provide government-issued ID and state the purpose of the transfer. Transfers to high-risk countries require branch manager approval and a 24-hour hold period."
    },
    {
        "id": "policy_002", 
        "title": "Account Opening Requirements",
        "content": "New account opening requires two forms of identification: one government-issued photo ID and one proof of address dated within 60 days. For business accounts, articles of incorporation and EIN documentation are also required."
    },
    {
        "id": "policy_003",
        "title": "Overdraft Protection",
        "content": "Overdraft protection links checking to savings account. When checking balance is insufficient, funds are automatically transferred from savings. A $12 transfer fee applies per occurrence. Daily transfer limit is $1,000."
    },
    {
        "id": "policy_004",
        "title": "Fraud Reporting Procedures",
        "content": "Suspected fraud must be reported within 60 days of statement date. Customer liability is limited to $50 if reported within 2 business days, up to $500 if reported within 60 days, and unlimited thereafter. Immediately freeze the affected account."
    },
    {
        "id": "policy_005",
        "title": "Large Cash Transaction Reporting",
        "content": "Cash transactions over $10,000 require Currency Transaction Report (CTR) filing. Structuring transactions to avoid reporting is illegal. Suspicious patterns below threshold trigger Suspicious Activity Report (SAR)."
    },
    {
        "id": "faq_001",
        "title": "Interest Rate FAQ",
        "content": "Savings account interest is calculated daily and paid monthly. Current rate is 4.5% APY for balances over $10,000 and 3.2% APY for balances under $10,000. Rates are variable and may change."
    },
    {
        "id": "faq_002",
        "title": "Mobile Deposit Limits",
        "content": "Mobile check deposit limits are $5,000 per check and $10,000 per day for standard accounts. Premium accounts have $10,000 per check and $25,000 daily limits. Funds availability is typically next business day."
    },
    {
        "id": "reg_001",
        "title": "Regulation E - Electronic Transfers",
        "content": "Under Regulation E, consumers must report unauthorized electronic transfers within 60 days. Banks must investigate within 10 business days and provisionally credit the account. Final resolution required within 45 days."
    }
]

print(f"Knowledge base size: {len(knowledge_base)} documents")

## 3-4. Implementation

### 4.1 Dense Retrieval with Sentence Transformers

In [None]:
# Dense retrieval pseudocode
'''
from sentence_transformers import SentenceTransformer
import faiss

# Load embedding model
model = SentenceTransformer('all-MiniLM-L6-v2')  # 384-dim, fast
# Or: 'BAAI/bge-base-en-v1.5' for better quality

# Embed documents
doc_texts = [d['title'] + " " + d['content'] for d in knowledge_base]
doc_embeddings = model.encode(doc_texts, normalize_embeddings=True)

# Build FAISS index
dimension = doc_embeddings.shape[1]
index = faiss.IndexFlatIP(dimension)  # Inner product equals cosine similarity because embeddings are normalized
index.add(doc_embeddings)

# Query
query = "How do I report fraud on my account?"
query_embedding = model.encode([query], normalize_embeddings=True)

# Search
k = 3
scores, indices = index.search(query_embedding, k)

for i, (score, idx) in enumerate(zip(scores[0], indices[0])):
    print(f"{i+1}. [{score:.3f}] {knowledge_base[idx]['title']}")
'''

print("Dense retrieval pseudocode shown above.")
print("Key models: all-MiniLM-L6-v2 (fast), BGE (quality), E5 (versatile)")

In [None]:
# Simulated dense retrieval for demonstration
def simulate_dense_retrieval(query, documents, top_k=3):
    """
    Simulate semantic similarity with simple word overlap + synonyms.
    In production, use actual embeddings!
    """
    def tokenize(text):
        # Strip surrounding punctuation so "account?" matches "account"
        tokens = (w.strip(".,;:!?()[]\"'") for w in text.lower().split())
        return {t for t in tokens if t}

    # Simple synonym expansion
    synonyms = {
        'fraud': ['unauthorized', 'stolen', 'suspicious'],
        'wire': ['transfer', 'send', 'remittance'],
        'account': ['checking', 'savings', 'deposit'],
        'report': ['file', 'notify', 'alert'],
        'money': ['funds', 'cash', 'balance'],
        'interest': ['rate', 'apy', 'yield']
    }

    query_words = tokenize(query)
    expanded_query = query_words.copy()
    for word in query_words:
        if word in synonyms:
            expanded_query.update(synonyms[word])

    scores = []
    for doc in documents:
        doc_words = tokenize(doc['title'] + ' ' + doc['content'])

        # Fraction of expanded query terms found in the document
        # (query coverage rather than true Jaccard; +0.1 avoids division by zero)
        overlap = len(expanded_query & doc_words)
        score = overlap / (len(expanded_query) + 0.1)
        scores.append((score, doc))

    # Sort by score descending
    scores.sort(key=lambda x: -x[0])
    return scores[:top_k]

# Test queries
test_queries = [
    "How do I report fraud on my account?",
    "What are the wire transfer limits for international payments?",
    "What interest rate do I get on savings?"
]

print("SIMULATED DENSE RETRIEVAL")
print("=" * 60)
for query in test_queries:
    print(f"\nQuery: {query}")
    results = simulate_dense_retrieval(query, knowledge_base, top_k=2)
    for score, doc in results:
        print(f"  [{score:.3f}] {doc['title']}")
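
For intuition, the FAISS `IndexFlatIP` search in the dense retrieval pseudocode amounts to a matrix product over unit-length vectors followed by a top-k sort. The sketch below shows that with plain NumPy. The function name `cosine_top_k` and the three-dimensional toy vectors are illustrative assumptions, not output from a real embedding model.

```python
import numpy as np

def cosine_top_k(query_vec, doc_matrix, k=3):
    """Return (indices, scores) of the k rows most cosine-similar to query_vec."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_matrix / np.linalg.norm(doc_matrix, axis=1, keepdims=True)
    scores = d @ q                      # cosine similarity of every document
    top = np.argsort(-scores)[:k]       # highest similarity first
    return top, scores[top]

# Toy "embeddings" (hypothetical values for illustration only)
toy_doc_vectors = np.array([[1.0, 0.0, 0.0],
                            [0.7, 0.7, 0.0],
                            [0.0, 0.0, 1.0]])
toy_query_vector = np.array([1.0, 0.1, 0.0])

top_idx, top_sims = cosine_top_k(toy_query_vector, toy_doc_vectors, k=2)
print(top_idx, np.round(top_sims, 3))
```

An exhaustive product like this is fine for a few thousand chunks; approximate indexes become relevant only at much larger scale.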

### 4.2 Hybrid Search (BM25 + Dense)

In [None]:
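def bm25_score(query, document, corpus_texts, avg_doc_length, k1=1.5, b=0.75):
    """Calculate BM25 score for a single document.

    corpus_texts: raw text of every document in the collection (used for IDF).
    """
    def tokenize(text):
        # Whitespace tokens with surrounding punctuation stripped ("deposits?" -> "deposits")
        tokens = (w.strip(".,;:!?()[]\"'") for w in text.lower().split())
        return [t for t in tokens if t]

    query_terms = tokenize(query)
    doc_terms = tokenize(document)
    doc_length = len(doc_terms)
    term_freqs = Counter(doc_terms)

    # Token sets per document, so document frequency counts whole terms, not substrings
    corpus_token_sets = [set(tokenize(text)) for text in corpus_texts]

    score = 0
    N = len(corpus_texts)

    for term in query_terms:
        if term in term_freqs:
            tf = term_freqs[term]
            # IDF with +1 inside the log so it never goes negative
            df = sum(1 for tokens in corpus_token_sets if term in tokens)
            idf = math.log((N - df + 0.5) / (df + 0.5) + 1)

            # BM25 term score
            numerator = tf * (k1 + 1)
            denominator = tf + k1 * (1 - b + b * doc_length / avg_doc_length)
            score += idf * numerator / denominator

    return score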

def hybrid_search(query, documents, alpha=0.5, top_k=3):
    """
    Combine BM25 (sparse) and simulated dense scores.
    alpha: weight for dense (1-alpha for sparse)
    """
    # Prepare documents
    doc_texts = [d['title'] + ' ' + d['content'] for d in documents]
    avg_length = np.mean([len(t.split()) for t in doc_texts])
    
    # Get sparse scores (BM25)
    sparse_scores = []
    for doc, text in zip(documents, doc_texts):
        score = bm25_score(query, text, doc_texts, avg_length)
        sparse_scores.append(score)
    
    # Get dense scores (simulated)
    dense_results = simulate_dense_retrieval(query, documents, top_k=len(documents))
    dense_scores = {r[1]['id']: r[0] for r in dense_results}
    
    # Normalize scores to [0, 1]
    sparse_max = max(sparse_scores) if max(sparse_scores) > 0 else 1
    dense_max = max(dense_scores.values()) if max(dense_scores.values()) > 0 else 1
    
    # Combine
    combined = []
    for doc, sparse_score in zip(documents, sparse_scores):
        norm_sparse = sparse_score / sparse_max
        norm_dense = dense_scores.get(doc['id'], 0) / dense_max
        hybrid_score = alpha * norm_dense + (1 - alpha) * norm_sparse
        combined.append((hybrid_score, doc))
    
    combined.sort(key=lambda x: -x[0])
    return combined[:top_k]

# Test hybrid search
print("HYBRID SEARCH (BM25 + Dense)")
print("=" * 60)
query = "What are the rules for large cash deposits?"
print(f"Query: {query}\n")

for alpha in [0.0, 0.5, 1.0]:
    print(f"Alpha={alpha} ({'BM25 only' if alpha==0 else 'Dense only' if alpha==1 else 'Hybrid'}):")
    results = hybrid_search(query, knowledge_base, alpha=alpha, top_k=2)
    for score, doc in results:
        print(f"  [{score:.3f}] {doc['title']}")
    print()
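
The weighted blend in `hybrid_search` depends on BM25 and dense scores being on comparable scales after normalization. A commonly used alternative, shown here as a hedged sketch and not as this notebook's method, is Reciprocal Rank Fusion (RRF), which combines rank positions instead of raw scores. The function name, the constant `k=60` (a conventional default), and the two example rankings are assumptions for illustration.

```python
def reciprocal_rank_fusion(rankings, k=60, top_k=3):
    """Fuse several ranked lists of doc IDs into one list of (doc_id, fused_score)."""
    fused = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank); high ranks in several lists add up
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(fused.items(), key=lambda item: -item[1])[:top_k]

# Hypothetical rankings from a sparse and a dense retriever
bm25_ranking = ["policy_005", "faq_002", "policy_001"]
dense_ranking = ["policy_005", "policy_003", "faq_002"]

fused_ranking = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
for doc_id, fused_score in fused_ranking:
    print(f"{doc_id}: {fused_score:.4f}")
```

RRF needs no score normalization and no `alpha` to tune, at the cost of ignoring how large the score gaps between documents are.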

### 4.3 Reranking

In [None]:
# Reranking pseudocode
'''
from sentence_transformers import CrossEncoder

# Load cross-encoder reranker
reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')
# Or: 'BAAI/bge-reranker-base' for better quality

# Initial retrieval (fast, recall-focused)
candidates = hybrid_search(query, knowledge_base, top_k=20)

# Rerank (slow, precision-focused)
pairs = [(query, doc['content']) for _, doc in candidates]
rerank_scores = reranker.predict(pairs)

# Sort by reranker score only; the key avoids comparing dicts when two scores tie
reranked = sorted(zip(rerank_scores, candidates), key=lambda pair: pair[0], reverse=True)
top_docs = [doc for _, (_, doc) in reranked[:5]]
'''

print("Reranking improves precision after initial recall-focused retrieval.")
print("")
print("Two-stage retrieval pattern:")
print("1. Retrieve top-100 with fast bi-encoder")
print("2. Rerank to top-5 with slow cross-encoder")
print("")
print("Cross-encoders are 10-100x slower but much more accurate.")
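
To make the two-stage pattern runnable without downloading a model, the sketch below separates the reranking logic from the scorer. `rerank` and `toy_overlap_scorer` are hypothetical names, and the word-overlap scorer is only a stand-in for a cross-encoder score.

```python
def rerank(query, candidates, score_fn, top_n=5):
    """Re-score (first_stage_score, doc) candidates with score_fn and keep the best top_n."""
    scored = [(score_fn(query, doc["content"]), doc) for _, doc in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)  # sort on score only
    return scored[:top_n]

def toy_overlap_scorer(query, text):
    """Stand-in scorer: fraction of query words that appear in the text."""
    q_words = set(query.lower().split())
    return len(q_words & set(text.lower().split())) / max(len(q_words), 1)

# Hypothetical first-stage output where the wrong document ranks first
first_stage = [
    (0.9, {"title": "Interest Rate FAQ", "content": "savings interest rate"}),
    (0.8, {"title": "Fraud Reporting Procedures", "content": "report suspected fraud"}),
]

reranked_demo = rerank("report fraud", first_stage, toy_overlap_scorer, top_n=1)
print(reranked_demo[0][1]["title"])
```

Keeping the scorer pluggable makes it easy to compare rerankers on the same candidate set during evaluation.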

### 4.4 Full RAG Pipeline

In [None]:
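def create_rag_prompt(query, retrieved_docs, max_context_length=2000):
    """
    Create RAG prompt with retrieved context.

    max_context_length is a character budget (not tokens); a document that
    would exceed the remaining budget is skipped.
    """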
    
    # Build context from retrieved documents
    context_parts = []
    total_length = 0
    
    for i, doc in enumerate(retrieved_docs):
        doc_text = f"[Document {i+1}: {doc['title']}]\n{doc['content']}"
        if total_length + len(doc_text) < max_context_length:
            context_parts.append(doc_text)
            total_length += len(doc_text)
    
    context = "\n\n".join(context_parts)
    
    prompt = f"""You are a helpful banking assistant. Answer the customer's question using ONLY the information provided in the documents below.

IMPORTANT RULES:
1. Only use information from the provided documents
2. If the documents don't contain the answer, say "I don't have information about that in my knowledge base"
3. Cite the document title when providing information
4. Be concise and direct
5. For compliance questions, always recommend consulting the compliance team

DOCUMENTS:
{context}

CUSTOMER QUESTION: {query}

ANSWER:"""
    
    return prompt

# Example RAG pipeline
query = "I think someone made unauthorized transactions on my account. What should I do?"

# Step 1: Retrieve
retrieved = hybrid_search(query, knowledge_base, alpha=0.5, top_k=3)
retrieved_docs = [doc for _, doc in retrieved]

# Step 2: Generate prompt
prompt = create_rag_prompt(query, retrieved_docs)

print("RAG PROMPT")
print("=" * 60)
print(prompt)

In [None]:
# Simulated RAG response
def simulate_rag_response(query, retrieved_docs):
    """
    Simulate what an LLM would generate based on retrieved docs.
    In production, this calls the actual LLM API.
    """
    # Find the fraud policy among the retrieved docs. Matching on the title keeps
    # the hard-coded answer below from being attributed to a different source.
    for doc in retrieved_docs:
        if 'fraud' in doc['title'].lower():
            return f"""Based on our Fraud Reporting Procedures:

1. **Report immediately**: You should report suspected fraud within 2 business days to limit your liability to $50.

2. **Time limits**: 
   - Within 2 business days: max $50 liability
   - Within 60 days: max $500 liability
   - After 60 days: unlimited liability

3. **Next steps**: Contact us immediately to freeze the affected account.

[Source: {doc['title']}]

I recommend calling our fraud hotline at 1-800-XXX-XXXX right away."""
    
    return "I don't have specific information about that in my knowledge base. Please contact customer service."

# Test
print("SIMULATED RAG RESPONSE")
print("=" * 60)
print(f"Query: {query}\n")
response = simulate_rag_response(query, retrieved_docs)
print(response)

### 4.5 Advanced: Chunking Strategies

In [None]:
def chunk_document(text, chunk_size=500, overlap=100):
    """
    Split document into overlapping chunks.

    Args:
        text: Document text
        chunk_size: Target characters per chunk
        overlap: Characters to overlap between chunks (must be < chunk_size)
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")

    chunks = []
    start = 0

    while start < len(text):
        end = min(start + chunk_size, len(text))

        # Try to break at a sentence boundary within the last 100 chars
        if end < len(text):
            for i in range(min(100, end - start)):
                if text[end - i] in '.!?':
                    end = end - i + 1
                    break

        chunk = text[start:end].strip()
        if chunk:
            chunks.append({
                'text': chunk,
                'start': start,
                'end': end
            })

        if end >= len(text):
            break  # Last chunk emitted; do not re-emit the tail as extra chunks

        # Step back by the overlap, but always move forward
        start = max(end - overlap, start + 1)

    return chunks

# Example: chunking a longer document
long_doc = """Wire Transfer Policy and Procedures

Section 1: Domestic Transfers
Domestic wire transfers can be initiated through online banking or at any branch. 
Standard processing time is same-day for requests before 4 PM ET. Fees are $25 for 
online transfers and $35 for branch-initiated transfers.

Section 2: International Transfers  
International wire transfers exceeding $10,000 require additional verification. The 
customer must provide government-issued ID and state the purpose of the transfer. 
Transfers to high-risk countries require branch manager approval and a 24-hour hold.

Section 3: Compliance Requirements
All wire transfers are subject to OFAC screening. Transfers flagged by our compliance 
system require manual review. Currency Transaction Reports (CTR) are filed for cash 
transactions exceeding $10,000."""

chunks = chunk_document(long_doc, chunk_size=300, overlap=50)

print("CHUNKING EXAMPLE")
print("=" * 60)
print(f"Original document: {len(long_doc)} characters")
print(f"Number of chunks: {len(chunks)}\n")

for i, chunk in enumerate(chunks):
    print(f"Chunk {i+1} [{chunk['start']}:{chunk['end']}]:")
    print(f"  {chunk['text'][:100]}...")
    print()
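
Fixed-size chunking can split a section mid-thought. As a hedged alternative, the sketch below chunks at section headers and attaches the document title and section heading to every chunk. The `Section N: ...` header pattern is an assumption that happens to match `long_doc`; real policy documents would need their own boundary rules, and `chunk_by_section` is a hypothetical helper.

```python
import re

def chunk_by_section(text, doc_title):
    """Split text at 'Section N: ...' headers and keep title/section metadata per chunk."""
    header = re.compile(r"^Section \d+: .+$", flags=re.MULTILINE)
    matches = list(header.finditer(text))
    chunks = []
    for i, match in enumerate(matches):
        # A section body runs until the next header, or to the end of the text
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        chunks.append({
            "doc_title": doc_title,
            "section": match.group(0).strip(),
            "text": text[match.end():end].strip(),
        })
    return chunks

sample_policy = """Wire Transfer Policy

Section 1: Domestic Transfers
Fees are $25 for online transfers.

Section 2: International Transfers
Transfers exceeding $10,000 require additional verification."""

section_chunks = chunk_by_section(sample_policy, "Wire Transfer Policy")
for c in section_chunks:
    print(f"{c['doc_title']} / {c['section']}: {c['text']}")
```

Prepending `doc_title` and `section` to the chunk text before embedding is one way to keep that context available to the retriever.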

## 5-6. Evaluation

### Retrieval Metrics
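- **Recall@K**: Fraction of all relevant docs that appear in the top-K results
- **MRR**: Mean Reciprocal Rank, the average over queries of 1/rank of the first relevant doc
- **NDCG**: Normalized Discounted Cumulative Gain, which rewards placing relevant docs higher in the ranking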

### End-to-End RAG Metrics
- **Answer Correctness**: Does answer match ground truth?
- **Faithfulness**: Is answer grounded in retrieved docs?
- **Relevance**: Are retrieved docs relevant to query?

In [None]:
def evaluate_retrieval(retrieved_ids, relevant_ids, k=5):
    """
    Evaluate retrieval quality.
    
    Args:
        retrieved_ids: List of retrieved document IDs (ranked)
        relevant_ids: Set of relevant document IDs
        k: Cutoff for metrics
    """
    retrieved_at_k = retrieved_ids[:k]
    
    # Recall@K
    recall = len(set(retrieved_at_k) & relevant_ids) / len(relevant_ids) if relevant_ids else 0
    
    # Precision@K
    precision = len(set(retrieved_at_k) & relevant_ids) / k
    
    # Reciprocal rank of the first relevant doc; average it over queries to get MRR
    mrr = 0
    for i, doc_id in enumerate(retrieved_ids):
        if doc_id in relevant_ids:
            mrr = 1 / (i + 1)
            break
    
    # NDCG@K
    dcg = sum(1 / np.log2(i + 2) for i, doc_id in enumerate(retrieved_at_k) if doc_id in relevant_ids)
    idcg = sum(1 / np.log2(i + 2) for i in range(min(len(relevant_ids), k)))
    ndcg = dcg / idcg if idcg > 0 else 0
    
    return {
        'recall@k': recall,
        'precision@k': precision,
        'mrr': mrr,
        'ndcg@k': ndcg
    }

# Example evaluation
retrieved = ['policy_004', 'reg_001', 'policy_001', 'faq_001']  # Our retrieval
relevant = {'policy_004', 'reg_001'}  # Ground truth for fraud query

metrics = evaluate_retrieval(retrieved, relevant, k=3)

print("RETRIEVAL EVALUATION")
print("=" * 40)
print(f"Query: 'How do I report fraud?'")
print(f"Retrieved: {retrieved[:3]}")
print(f"Relevant: {relevant}")
print(f"")
for metric, value in metrics.items():
    print(f"{metric}: {value:.3f}")
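
The example above scores a single query, so its `mrr` value is really one reciprocal rank. A minimal sketch of averaging over a query set follows; the three runs and their relevance labels are hypothetical.

```python
def reciprocal_rank(retrieved_ids, relevant_ids):
    """1/rank of the first relevant document, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(runs):
    """runs: list of (retrieved_ids, relevant_ids) pairs, one per query."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(ret, rel) for ret, rel in runs) / len(runs)

# Hypothetical evaluation set
eval_runs = [
    (["policy_004", "reg_001", "policy_001"], {"policy_004", "reg_001"}),  # first hit at rank 1
    (["faq_002", "faq_001", "policy_003"], {"faq_001"}),                   # first hit at rank 2
    (["policy_002", "policy_003"], {"policy_005"}),                        # no hit
]

mrr_value = mean_reciprocal_rank(eval_runs)
print(f"MRR over {len(eval_runs)} queries: {mrr_value:.3f}")
```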

In [None]:
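def check_faithfulness(answer, source_docs):
    """
    Check if numeric claims in the answer are grounded in source documents.
    Simplified version - in production use NLI models.
    """
    import re

    # Most specific alternatives first; otherwise the bare-number branch
    # matches "60" before the "60 days" branch is ever tried.
    pattern = r'\$\d+(?:,\d{3})*(?:\.\d+)?|\d+ (?:days?|hours?)|\d+(?:,\d{3})*(?:\.\d+)?%?'

    # Combine source content and extract its numeric tokens
    source_text = ' '.join(d['content'].lower() for d in source_docs)
    source_numbers = set(re.findall(pattern, source_text))

    # Extract numbers from answer
    answer_numbers = re.findall(pattern, answer.lower())

    # Compare whole tokens, not substrings, so "$5" is not "found" inside "$50"
    grounded_numbers = []
    hallucinated_numbers = []

    for num in answer_numbers:
        if num in source_numbers or num.replace('$', '') in source_numbers:
            grounded_numbers.append(num)
        else:
            hallucinated_numbers.append(num)

    return {
        'grounded': grounded_numbers,
        'potentially_hallucinated': hallucinated_numbers,
        # With no numeric claims there is nothing for this check to contradict
        'faithfulness_score': len(grounded_numbers) / len(answer_numbers) if answer_numbers else 1.0
    }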

# Test
answer = """Report fraud within 2 business days to limit liability to $50. 
After 60 days, you could face $500 or unlimited liability."""

source = [knowledge_base[3]]  # Fraud policy

result = check_faithfulness(answer, source)
print("FAITHFULNESS CHECK")
print("=" * 40)
print(f"Grounded claims: {result['grounded']}")
print(f"Potentially hallucinated: {result['potentially_hallucinated']}")
print(f"Faithfulness score: {result['faithfulness_score']:.2%}")

## 7. Production Readiness Checklist

```
INDEXING PIPELINE
[ ] Document chunking strategy (size, overlap, semantic boundaries)
[ ] Metadata extraction (titles, dates, categories)
[ ] Incremental index updates (not full rebuild)
[ ] Version control for document corpus
[ ] PII masking before embedding

RETRIEVAL
[ ] Hybrid search (sparse + dense)
[ ] Reranking for precision
[ ] Query expansion/rewriting
[ ] Relevance threshold (don't return irrelevant docs)
[ ] Latency monitoring (p50, p99)

GENERATION
[ ] Prompt versioning
[ ] Output validation (format, length)
[ ] Faithfulness checking
[ ] Source citation in response
[ ] Fallback for low-confidence answers

BANKING-SPECIFIC
[ ] Compliance review of knowledge base
[ ] Audit trail for queries and responses
[ ] Human escalation path
[ ] Disclaimer on AI-generated content
[ ] Regular accuracy audits
```
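
As one illustration of the "incremental index updates" item, the sketch below fingerprints each document and reports only the IDs that need (re-)embedding. `content_hash`, `plan_index_update`, and both toy corpora are hypothetical. What to do with IDs that disappear from the corpus (delete them, or keep them for audit) is a policy decision the sketch leaves open.

```python
import hashlib

def content_hash(doc):
    """Stable fingerprint of the text that would be embedded."""
    payload = (doc["title"] + "\n" + doc["content"]).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()

def plan_index_update(documents, indexed_hashes):
    """Compare the current corpus with the {doc_id: hash} map saved at last indexing."""
    current = {d["id"]: content_hash(d) for d in documents}
    to_embed = [doc_id for doc_id, h in current.items() if indexed_hashes.get(doc_id) != h]
    missing = [doc_id for doc_id in indexed_hashes if doc_id not in current]
    return to_embed, missing, current

corpus_v1 = [
    {"id": "policy_003", "title": "Overdraft Protection", "content": "A $12 transfer fee applies."},
    {"id": "faq_002", "title": "Mobile Deposit Limits", "content": "$5,000 per check."},
]
_, _, indexed = plan_index_update(corpus_v1, {})

corpus_v2 = [
    # Hypothetical edit to an existing document
    {"id": "policy_003", "title": "Overdraft Protection", "content": "A $15 transfer fee applies."},
    {"id": "faq_002", "title": "Mobile Deposit Limits", "content": "$5,000 per check."},
    # Hypothetical new document
    {"id": "faq_003", "title": "New FAQ", "content": "Placeholder text."},
]
to_embed, missing, indexed = plan_index_update(corpus_v2, indexed)
print("Re-embed:", to_embed, "| No longer in corpus:", missing)
```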

## 8. Traditional vs RAG Comparison

| Dimension | BM25 | Dense Retrieval | Hybrid + RAG |
|-----------|------|-----------------|---------------|
| **Semantic Understanding** | None | High | High |
| **Exact Match** | Excellent | Poor | Good |
| **Answer Synthesis** | None (returns docs) | None | Yes |
| **Latency (rough)** | <50ms | 100-200ms | 500ms-2s |
| **Cost/query (rough)** | ~$0 | $0.001 | $0.01-0.05 |
| **Explainability** | High | Medium | Medium |
| **Hallucination Risk** | None | None | Medium |
| **Setup Complexity** | Low | Medium | High |

## 9. Advanced Techniques

### Query Expansion
```python
# Use LLM to expand query
prompt = """Generate 3 alternative phrasings for this search query:
Query: "wire transfer limit"
Alternatives:"""

# Example of the kind of output to expect:
# 1. maximum wire transfer amount
# 2. how much can I wire transfer
# 3. wire transaction threshold
```
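
Once the alternative phrasings exist, each one is searched and the results are merged. The sketch below keeps each document's best score across phrasings. It assumes `search_fn` returns `(score, doc)` pairs like `hybrid_search` does, and that scores are comparable across queries, which is not guaranteed in general. `search_with_expansions`, `toy_index`, and `toy_search` are hypothetical.

```python
def search_with_expansions(queries, search_fn, top_k=3):
    """Run search_fn for every phrasing and keep each document's best score."""
    best = {}
    for q in queries:
        for score, doc in search_fn(q):
            if doc["id"] not in best or score > best[doc["id"]][0]:
                best[doc["id"]] = (score, doc)
    return sorted(best.values(), key=lambda pair: -pair[0])[:top_k]

# Toy search results keyed by query string (hypothetical scores)
toy_index = {
    "wire transfer limit": [(0.4, {"id": "policy_001"})],
    "maximum wire transfer amount": [(0.7, {"id": "policy_001"}), (0.5, {"id": "faq_002"})],
}

def toy_search(q):
    return toy_index.get(q, [])

merged = search_with_expansions(list(toy_index), toy_search)
print([(score, doc["id"]) for score, doc in merged])
```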

### Multi-hop Retrieval
```python
# For complex questions requiring multiple docs
# Step 1: Retrieve docs for sub-question 1
# Step 2: Use findings to formulate sub-question 2
# Step 3: Retrieve more docs
# Step 4: Synthesize across all retrieved docs
```
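
The four steps above can be sketched as a small loop. In this illustration `search_fn` and `next_query_fn` are hypothetical callables. In a real system `next_query_fn` would typically be an LLM call that reads the documents found so far; here it is a hand-written rule so the example runs on its own.

```python
def multi_hop_retrieve(question, search_fn, next_query_fn, max_hops=2):
    """Retrieve iteratively; next_query_fn returns a follow-up query or None to stop."""
    collected, seen = [], set()
    query = question
    for _ in range(max_hops):
        for doc in search_fn(query):
            if doc["id"] not in seen:       # de-duplicate across hops
                seen.add(doc["id"])
                collected.append(doc)
        query = next_query_fn(question, collected)
        if query is None:
            break
    return collected

hop_corpus = {
    "wire": {"id": "policy_001",
             "content": "International wire transfers exceeding $10,000 require additional verification."},
    "ctr": {"id": "policy_005",
            "content": "Cash transactions over $10,000 require Currency Transaction Report (CTR) filing."},
}

def keyword_search(query):
    """Toy retriever: return docs whose keyword appears in the query."""
    return [doc for keyword, doc in hop_corpus.items() if keyword in query.lower()]

def rule_based_follow_up(question, docs):
    """Stand-in for an LLM deciding what to look up next."""
    ids = {d["id"] for d in docs}
    if "policy_001" in ids and "policy_005" not in ids:
        return "ctr reporting threshold"
    return None

hops = multi_hop_retrieve("What checks apply to a large wire?", keyword_search, rule_based_follow_up)
print([d["id"] for d in hops])
```

Capping `max_hops` bounds both latency and the amount of context passed to the generator.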

### ColBERT (Late Interaction)
```python
# Token-level matching instead of document-level
# Better accuracy than bi-encoder, faster than cross-encoder
# Good for production reranking
```
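
The token-level matching in ColBERT is usually described as a MaxSim score: for each query token, take its best-matching document token, then sum those maxima. The NumPy sketch below shows only that scoring step. The two-dimensional unit vectors are hand-made stand-ins for per-token embeddings, and `maxsim_score` is a hypothetical helper, not a ColBERT library call.

```python
import numpy as np

def maxsim_score(query_tokens, doc_tokens):
    """Late-interaction score for L2-normalized token matrices of shape (Q, dim) and (D, dim)."""
    sim = query_tokens @ doc_tokens.T        # (Q, D) token-to-token similarities
    return float(sim.max(axis=1).sum())      # best doc token per query token, summed

# Hand-made unit vectors standing in for token embeddings
query_tok = np.array([[1.0, 0.0],
                      [0.0, 1.0]])
doc_a_tok = np.array([[1.0, 0.0],
                      [0.0, 1.0],
                      [0.6, 0.8]])
doc_b_tok = np.array([[0.6, 0.8]])

print(maxsim_score(query_tok, doc_a_tok), maxsim_score(query_tok, doc_b_tok))
```

Because document token matrices can be precomputed offline, only the query needs encoding at search time, which is where the speed advantage over a cross-encoder comes from.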

### Hypothetical Document Embedding (HyDE)
```python
# Generate hypothetical answer first, then retrieve
query = "What is the wire transfer limit?"
hypothetical = LLM("Answer this banking question: " + query)
# Embed hypothetical answer and search with it
results = retrieve(embed(hypothetical))
```
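
A runnable toy version of the HyDE flow is sketched below. Every piece is a stand-in: `toy_embed` is a bag-of-words vector, not a real embedding model, and `fake_llm` returns a canned draft answer where a real system would call an LLM. The point is only the order of operations: draft an answer, embed the draft, and search with that embedding.

```python
import numpy as np

def toy_embed(text, vocab):
    """Bag-of-words count vector, L2-normalized; a stand-in for a real embedding model."""
    words = text.lower().split()
    vec = np.array([words.count(w) for w in vocab], dtype=float)
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec

def hyde_search(query, docs, generate_fn, vocab, k=1):
    """Search with the embedding of a generated hypothetical answer."""
    hypothetical = generate_fn(query)           # step 1: draft an answer (may be wrong)
    h_vec = toy_embed(hypothetical, vocab)      # step 2: embed the draft, not the query
    scores = [float(toy_embed(d, vocab) @ h_vec) for d in docs]
    order = np.argsort(scores)[::-1][:k]        # step 3: nearest documents to the draft
    return [(docs[i], scores[i]) for i in order]

hyde_docs = [
    "international wire transfers exceeding the threshold require additional verification",
    "savings account interest is calculated daily",
]
hyde_vocab = sorted({w for d in hyde_docs for w in d.split()})

def fake_llm(question):
    """Canned draft answer standing in for an LLM call."""
    return "wire transfers above a threshold require verification"

hits = hyde_search("What is the wire transfer limit?", hyde_docs, fake_llm, hyde_vocab)
print(hits[0][0])
```

The draft answer shares more vocabulary with the policy text than the short question does, which is the effect HyDE relies on. Only retrieved documents, never the draft itself, should reach the final prompt.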

## 10. Interview Soundbites

**On Hybrid Search:**
> "I always use hybrid search in production - BM25 for exact matches (policy numbers, account codes) and dense for semantic (customer paraphrasing). The combination usually beats either retriever alone on recall, but the size of the gain depends on the corpus and query mix, so I measure it on a labeled query set rather than quote a fixed number."

**On Chunking:**
> "Chunking is the most underrated part of RAG. I chunk at semantic boundaries (section headers, paragraphs), not fixed character counts. And I always preserve metadata - a chunk without its document title loses context."

**On Reranking:**
> "Two-stage retrieval is standard now: retrieve top-100 with fast bi-encoder, rerank to top-5 with cross-encoder. The cross-encoder is roughly 10-100x slower, but because it reads the query and document together it orders the candidates much more precisely than bi-encoder similarity alone."

**On Faithfulness:**
> "For banking RAG, faithfulness isn't optional. I run NLI-based fact checking on every response, verify numbers match source documents, and always include citations. A hallucinated compliance answer is worse than no answer."

**On Evaluation:**
> "I evaluate retrieval and generation separately. You can have perfect retrieval and terrible generation, or vice versa. MRR for retrieval, faithfulness scores for generation, and end-to-end answer correctness for the full pipeline."

**On When NOT to Use RAG:**
> "RAG adds latency and complexity. If the answer is always in one document type with consistent structure, just retrieve the document - don't generate. RAG shines when you need to synthesize across documents or explain in natural language."

---

**Q: How do you handle documents that update frequently?**
> "Incremental indexing, not full rebuilds. I track document hashes and only re-embed changed docs. For time-sensitive content like policies, I add effective dates to metadata and filter at query time. Old versions stay indexed for audit purposes."

**Q: How do you prevent hallucination in RAG?**
> "Three defenses: (1) Strong retrieval - if the right docs aren't retrieved, hallucination is inevitable. (2) Grounding instructions in the prompt - 'only use information from the documents'. (3) Post-generation validation - check that claims can be traced to sources. If validation fails, return 'I don't have that information' instead."
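
The three defenses can be wired together as a simple gate. This is a minimal sketch under stated assumptions: `answer_with_guardrails`, the `min_score` value of 0.3, and the toy generator and validator are all hypothetical, and a real validator would check claims, not just citations.

```python
FALLBACK = "I don't have that information in my knowledge base."

def answer_with_guardrails(scored_docs, generate_fn, validate_fn, min_score=0.3):
    """scored_docs: list of (score, doc). Returns a validated answer or FALLBACK."""
    kept = [doc for score, doc in scored_docs if score >= min_score]
    if not kept:
        return FALLBACK                 # defense 1: nothing relevant was retrieved
    answer = generate_fn(kept)          # defense 2 lives in the prompt used by generate_fn
    if not validate_fn(answer, kept):
        return FALLBACK                 # defense 3: answer could not be traced to sources
    return answer

def toy_generate(docs):
    """Stand-in for the LLM call; cites the first document."""
    return f"Report within 60 days. [Source: {docs[0]['title']}]"

def cites_a_retrieved_title(answer, docs):
    """Toy validator: the answer must name at least one retrieved document."""
    return any(doc["title"] in answer for doc in docs)

strong_hits = [(0.8, {"title": "Fraud Reporting Procedures"})]
weak_hits = [(0.1, {"title": "Interest Rate FAQ"})]

print(answer_with_guardrails(strong_hits, toy_generate, cites_a_retrieved_title))
print(answer_with_guardrails(weak_hits, toy_generate, cites_a_retrieved_title))
```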

In [None]:
print("""
╔══════════════════════════════════════════════════════════════════╗
║                    NOTEBOOK SUMMARY                               ║
╠══════════════════════════════════════════════════════════════════╣
║  Task: Information Retrieval with RAG                            ║
║  Approaches: Dense retrieval, Hybrid search, Full RAG            ║
║  Banking Use: Policy Q&A, compliance search, knowledge base      ║
║                                                                  ║
║  Key Takeaways:                                                  ║
║  1. Hybrid search (BM25 + dense) beats either alone              ║
║  2. Reranking with cross-encoder improves precision              ║
║  3. Chunking strategy matters more than embedding model          ║
║  4. Faithfulness checking is mandatory for banking               ║
║  5. Always cite sources in generated responses                   ║
╚══════════════════════════════════════════════════════════════════╝
""")