# Part 3: "The Librarian" – Advanced Hybrid RAG System

**Goal:** Build an Advanced Hybrid RAG Pipeline that retrieves financial entities
(e.g., "Form 10-K", "$37B") that pure semantic search often misses.

**Pipeline:**
```
Question → Dense (Weaviate nearVector) + BM25 (Weaviate keyword)
         → Reciprocal Rank Fusion (RRF)
         → Cross-Encoder Reranking
         → Answer Generation (OpenAI / Intern fine-tuned / Base model)
```

**Steps in this notebook:**
0. Setup: imports and environment
1. Load configuration and locate the PDF
2. Build / load Weaviate index
3. Dense retrieval demo
4. BM25 retrieval demo
5. RRF fusion
6. Cross-encoder reranking
7. End-to-end `query_librarian()` demo
8. Entity-heavy queries
9. Generator comparison: OpenAI vs Intern fine-tuned vs Base model

## Setup: Imports and Environment

In [None]:
import os
import sys
import time
from pathlib import Path

# Add project root to path (same pattern as other notebooks)
project_root = Path.cwd().parent if Path.cwd().name == 'notebooks' else Path.cwd()
sys.path.insert(0, str(project_root))
os.chdir(project_root)

print(f"✓ Project root: {project_root}")

In [None]:
# Load OpenAI API key (Colab secrets or .env)
try:
    from google.colab import userdata
    os.environ['OPENAI_API_KEY'] = userdata.get('OPENAI_API_KEY')
    print("✓ OpenAI API key loaded from Colab secrets")
except Exception:
    from dotenv import load_dotenv
    load_dotenv(project_root / '.env')
    print("✓ Environment loaded from .env")

print(f"✓ OPENAI_API_KEY set: {'Yes' if os.environ.get('OPENAI_API_KEY') else 'No'}")

In [None]:
# Import project modules
from src.utils.config_loader import load_config
from src.ingestion.pdf_loader import load_pdf, clean_text
from src.ingestion.chunker import chunk_text

# RAG modules
from src.rag.weaviate_store import connect_weaviate, ensure_collection
from src.rag.index_builder import ensure_index_built, build_index
from src.rag.retrieval import dense_search, bm25_search
from src.rag.fusion import rrf_fusion
from src.rag.reranker import rerank
from src.rag.generation import generate_answer, build_rag_prompt
from src.rag.librarian_inference import query_librarian

from sentence_transformers import SentenceTransformer, CrossEncoder

print("✓ All imports successful")

## Step 1: Load Configuration

In [None]:
config_path = project_root / 'config' / 'config.yaml'
config = load_config(config_path)

rag_cfg = config['rag']

# Resolve PDF path
raw_data = project_root / config['environment']['paths']['raw_data']
doc_name = config['project']['document']
pdf_path = raw_data / doc_name

print("✓ Configuration loaded")
print(f"  PDF: {pdf_path} (exists: {pdf_path.exists()})")
print(f"  Vector DB: {rag_cfg['vector_db']['provider']} ({rag_cfg['vector_db']['mode']})")
print(f"  Embedding model: {rag_cfg['embeddings']['model']}")
print(f"  Reranker: {rag_cfg['refinement']['reranker']['model']}")
print(f"  RRF k: {rag_cfg['refinement']['rrf']['k']}")
print(f"  Retrieval top-k: {rag_cfg['retrieval']['top_k']}")
print(f"  Reranker top-k: {rag_cfg['refinement']['reranker']['top_k']}")
print(f"  Generator mode: {rag_cfg['inference'].get('generator_mode', 'openai')}")
print(f"  Answer LLM: {rag_cfg['inference']['answer_llm']['model']}")
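The lookups above imply a nested layout for `config/config.yaml`. As a minimal sketch, the dictionary below mirrors only the key paths this notebook reads. The values marked hypothetical are illustrative and not taken from the project, and `missing_keys` is a hypothetical helper, not part of `src.utils.config_loader`.

```python
# Illustrative shape of the loaded config; only the key paths come from this notebook.
example_config = {
    "project": {"document": "annual_report.pdf"},        # hypothetical file name
    "environment": {"paths": {"raw_data": "data/raw"}},  # hypothetical path
    "rag": {
        "vector_db": {"provider": "weaviate", "mode": "embedded"},
        "embeddings": {"model": "all-MiniLM-L6-v2"},
        "retrieval": {"top_k": 10},                        # hypothetical value
        "refinement": {
            "rrf": {"k": 60},
            # top_k below is a hypothetical value
            "reranker": {"model": "ms-marco-MiniLM-L-6-v2", "top_k": 5},
        },
        "inference": {
            "generator_mode": "openai",   # optional: the notebook falls back to 'openai'
            "answer_llm": {"model": "gpt-4o"},
        },
    },
}

# Key paths that the configuration cell reads without a fallback.
REQUIRED_PATHS = [
    "project.document",
    "environment.paths.raw_data",
    "rag.vector_db.provider",
    "rag.vector_db.mode",
    "rag.embeddings.model",
    "rag.retrieval.top_k",
    "rag.refinement.rrf.k",
    "rag.refinement.reranker.model",
    "rag.refinement.reranker.top_k",
    "rag.inference.answer_llm.model",
]


def missing_keys(cfg, required=REQUIRED_PATHS):
    """Return the dotted key paths that cannot be resolved in cfg."""
    missing = []
    for dotted in required:
        node = cfg
        for part in dotted.split("."):
            if not isinstance(node, dict) or part not in node:
                missing.append(dotted)
                break
            node = node[part]
    return missing


print(missing_keys(example_config))  # an empty list means every path resolves
```

Running a check like this right after `load_config` surfaces a misnamed key as a readable list instead of a `KeyError` several cells later.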

## Step 2: Build / Load Weaviate Index

This step loads the PDF, chunks it, embeds the chunks with SentenceTransformers,
and upserts them into Weaviate. The build is idempotent: if the index already
exists, it is not rebuilt.

In [None]:
# Load the embedding model (cached for reuse)
embedder = SentenceTransformer(rag_cfg['embeddings']['model'])
print(f"✓ Embedder loaded: {rag_cfg['embeddings']['model']}")
print(f"  Embedding dimension: {embedder.get_sentence_embedding_dimension()}")

In [None]:
# Connect to Weaviate and build/verify the index
client = connect_weaviate(config)
print(f"✓ Weaviate client connected ({rag_cfg['vector_db']['mode']} mode)")

count = ensure_index_built(
    config,
    client=client,
    embedder=embedder,
    force_rebuild=False,  # Set True to rebuild from scratch
    verbose=True,
)
print(f"\n✓ Weaviate index ready: {count} chunks indexed")
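To make the chunk, embed, upsert, and skip-if-built flow concrete, here is a minimal sketch that uses an in-memory dict in place of a Weaviate collection. Everything in it is an assumption for illustration: `chunk_words`, `build_if_empty`, and `toy_embed` are hypothetical names, and overlapping word windows are just one possible chunking strategy (the project's `chunk_text` and `ensure_index_built` may work differently).

```python
import uuid
from typing import Callable, Dict, List


def chunk_words(text: str, size: int = 50, overlap: int = 10) -> List[str]:
    """Split text into overlapping windows of `size` words (assumed strategy)."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


def build_if_empty(store: Dict[str, dict], text: str,
                   embed_fn: Callable[[str], List[float]],
                   force_rebuild: bool = False) -> int:
    """Populate `store` once; later calls return the existing count unchanged."""
    if store and not force_rebuild:
        return len(store)  # index already built: skip chunking and embedding
    store.clear()
    for i, chunk in enumerate(chunk_words(text)):
        # Deterministic ID: re-inserting the same chunk overwrites, never duplicates.
        chunk_id = str(uuid.uuid5(uuid.NAMESPACE_URL, f"chunk-{i}-{chunk}"))
        store[chunk_id] = {"content": chunk, "vector": embed_fn(chunk), "chunk_id": i}
    return len(store)


embed_calls = []


def toy_embed(chunk: str) -> List[float]:
    """Placeholder embedder that records how often it is called."""
    embed_calls.append(chunk)
    return [float(len(chunk)), float(chunk.count(" "))]


store: Dict[str, dict] = {}
text = " ".join(f"w{i}" for i in range(120))
first = build_if_empty(store, text, toy_embed)
second = build_if_empty(store, text, toy_embed)  # no new embedding work
print(first, second, len(embed_calls))
```

The second call returns immediately because the store is non-empty, which is the behaviour `force_rebuild=False` asks for in the cell above.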

## Step 3: Dense Retrieval Demo

Dense (vector) retrieval encodes the question and finds chunks with the
closest embedding vectors. Good for semantic/paraphrase queries but can
miss exact entity names.

In [None]:
test_query = "What was the total revenue reported in the annual report?"

dense_results = dense_search(
    test_query,
    client=client,
    embedder=embedder,
    config=config,
    top_n=10,
)

print(f"🔵 Dense search: '{test_query}'")
print(f"   Retrieved {len(dense_results)} results\n")
for i, r in enumerate(dense_results[:5], 1):
    print(f"  [{i}] chunk {r['meta']['chunk_id']} "
          f"(score: {r['score']:.4f}) "
          f"pp. {r['meta']['page_start']}-{r['meta']['page_end']}")
    print(f"      {r['content'][:150]}...\n")
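Under the hood, dense retrieval is a nearest-neighbour search over embedding vectors. The following sketch shows the idea with brute-force cosine similarity over hand-written 3-dimensional vectors. The vectors, the chunk names, and the helpers `cosine` and `dense_top_n` are hypothetical; in this notebook the search itself is delegated to Weaviate.

```python
import math
from typing import Dict, List, Tuple


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def dense_top_n(query_vec: List[float], index: Dict[str, List[float]],
                top_n: int = 2) -> List[Tuple[str, float]]:
    """Score every stored vector against the query and keep the best top_n."""
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in index.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_n]


# Toy "embeddings": each axis loosely stands for a topic.
toy_index = {
    "revenue_chunk": [0.9, 0.1, 0.0],
    "risk_chunk": [0.1, 0.9, 0.0],
    "hr_chunk": [0.0, 0.2, 0.9],
}
query_vec = [1.0, 0.2, 0.0]  # a query that is mostly about the first topic

hits = dense_top_n(query_vec, toy_index)
for doc_id, score in hits:
    print(f"{doc_id}: {score:.3f}")
```

Because similarity depends only on vector direction, a chunk can rank highly without sharing a single word with the query, which is also why an exact string such as a form number can be missed.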

## Step 4: BM25 Retrieval Demo

BM25 (keyword/sparse) retrieval matches exact terms. Essential for
entity-heavy financial queries like "Form 10-K" or "$37B".

In [None]:
bm25_results = bm25_search(
    test_query,
    client=client,
    config=config,
    top_n=10,
)

print(f"🟠 BM25 search: '{test_query}'")
print(f"   Retrieved {len(bm25_results)} results\n")
for i, r in enumerate(bm25_results[:5], 1):
    print(f"  [{i}] chunk {r['meta']['chunk_id']} "
          f"(score: {r['score']:.4f}) "
          f"pp. {r['meta']['page_start']}-{r['meta']['page_end']}")
    print(f"      {r['content'][:150]}...\n")
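For intuition about why keyword search catches exact entities, here is a small, self-contained sketch of Okapi BM25 scoring. It is not Weaviate's implementation: the tokenizer (which keeps `$`, `-`, and `.` inside tokens), the parameter values `k1=1.2` and `b=0.75`, and the toy documents are all assumptions chosen for illustration.

```python
import math
import re
from collections import Counter
from typing import List


def tokenize(text: str) -> List[str]:
    """Lowercase and split, keeping tokens such as '$37b' and '10-k' intact (assumed rule)."""
    return re.findall(r"[\w$]+(?:[.\-][\w$]+)*", text.lower())


def bm25_scores(query: str, docs: List[str], k1: float = 1.2, b: float = 0.75) -> List[float]:
    """Return one Okapi BM25 score per document for the given query."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs if n_docs else 0.0
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for toks in tokenized for term in set(toks))
    query_terms = set(tokenize(query))

    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores


toy_docs = [
    "The company filed its Form 10-K with the SEC.",
    "Revenue grew to $37B in fiscal 2024.",
    "Employees value a flexible working culture.",
]
for query in ("Form 10-K", "$37B revenue"):
    print(query, [round(s, 3) for s in bm25_scores(query, toy_docs)])
```

Only documents that contain a query term literally receive a non-zero score, so the result depends heavily on how the tokenizer treats symbols like `$` and `-`.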

In [None]:
# Compare: which chunks appear in dense but not BM25 and vice versa?
dense_ids = {r['id'] for r in dense_results}
bm25_ids = {r['id'] for r in bm25_results}

print(f"Dense-only chunks: {len(dense_ids - bm25_ids)}")
print(f"BM25-only chunks:  {len(bm25_ids - dense_ids)}")
print(f"Overlap:           {len(dense_ids & bm25_ids)}")
print(f"\n→ Chunks found by only one retriever are why hybrid search matters!")

## Step 5: Reciprocal Rank Fusion (RRF)

RRF combines the two ranked lists into one. Formula:

$$\text{RRF}(d) = \sum_{i} \frac{1}{k + \text{rank}_i(d)}$$

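Here $\text{rank}_i(d)$ is the position of document $d$ in ranked list $i$, a
list that does not contain $d$ contributes nothing to the sum, and $k$ is a
smoothing constant read from `rag.refinement.rrf.k`. Docs that rank high in
**both** lists get the highest fused scores.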

In [None]:
rrf_k = int(config['rag']['refinement']['rrf']['k'])

fused_results = rrf_fusion(dense_results, bm25_results, k=rrf_k)

print(f"🔀 RRF Fusion (k={rrf_k}): {len(fused_results)} unique candidates\n")
for i, r in enumerate(fused_results[:8], 1):
    print(f"  [{i}] chunk {r['meta']['chunk_id']} "
          f"RRF={r['rrf_score']:.4f} "
          f"(dense_rank={r.get('dense_rank', '-')}, bm25_rank={r.get('bm25_rank', '-')})")
    print(f"      {r['content'][:120]}...\n")
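The formula can be checked by hand on two tiny rankings. This is a minimal sketch, assuming 1-based ranks and plain document IDs; the function name `rrf` is hypothetical and the project's `rrf_fusion` may carry extra fields such as `dense_rank` and `bm25_rank`.

```python
from collections import defaultdict
from typing import List, Sequence, Tuple


def rrf(rankings: Sequence[Sequence[str]], k: int = 60) -> List[Tuple[str, float]]:
    """Fuse ranked lists of document IDs with Reciprocal Rank Fusion."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):  # 1-based ranks (assumption)
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


dense_ranking = ["a", "b", "c"]
bm25_ranking = ["b", "d", "a"]

fused = rrf([dense_ranking, bm25_ranking], k=60)
for doc_id, score in fused:
    print(doc_id, round(score, 5))
# Fused order: b, a, d, c
```

Document `b` wins with 1/62 + 1/61, narrowly ahead of `a` with 1/61 + 1/63, while `d` and `c` appear in only one list and score roughly half as much. No raw retrieval scores are involved, only ranks.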

## Step 6: Cross-Encoder Reranking

The cross-encoder scores each (query, document) pair **jointly**, unlike the
bi-encoder, which embeds the query and the document separately. This is more
accurate but too slow to run over the full corpus, so we apply it only to the
top fused candidates.

In [None]:
# Load cross-encoder reranker
reranker_model = CrossEncoder(rag_cfg['refinement']['reranker']['model'])
rerank_top_k = int(rag_cfg['refinement']['reranker']['top_k'])

print(f"‚úì Reranker loaded: {rag_cfg['refinement']['reranker']['model']}")

reranked_results = rerank(
    test_query,
    fused_results[:20],  # rerank top-20 fused candidates
    cross_encoder=reranker_model,
    top_k=rerank_top_k,
)

print(f"\n🏆 Reranked to top-{len(reranked_results)}:\n")
for i, r in enumerate(reranked_results, 1):
    print(f"  [{i}] chunk {r['meta']['chunk_id']} "
          f"rerank_score={r['rerank_score']:.3f} "
          f"(RRF={r.get('rrf_score', 0):.4f})")
    print(f"      {r['content'][:150]}...\n")
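The reranking step reduces to "score every (query, text) pair, sort, truncate". The sketch below shows that shape with the scorer passed in as a callable, so it runs without downloading a model. `rerank_candidates` and `overlap_scorer` are hypothetical names, and the word-overlap scorer is only a toy stand-in for a cross-encoder.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# Any callable that maps a list of (query, text) pairs to one score per pair.
ScoreFn = Callable[[List[Tuple[str, str]]], Sequence[float]]


def rerank_candidates(query: str, candidates: List[Dict],
                      score_pairs: ScoreFn, top_k: int = 3) -> List[Dict]:
    """Score each candidate jointly with the query and keep the top_k best."""
    pairs = [(query, c["content"]) for c in candidates]
    scores = score_pairs(pairs)
    # Copy each candidate so the input list is left untouched.
    scored = [dict(c, rerank_score=float(s)) for c, s in zip(candidates, scores)]
    scored.sort(key=lambda c: c["rerank_score"], reverse=True)
    return scored[:top_k]


def overlap_scorer(pairs: List[Tuple[str, str]]) -> List[float]:
    """Toy stand-in for a cross-encoder: fraction of query words found in the text."""
    out = []
    for query, text in pairs:
        query_words = set(query.lower().split())
        text_words = set(text.lower().split())
        out.append(len(query_words & text_words) / max(len(query_words), 1))
    return out


toy_candidates = [
    {"id": "c1", "content": "risk factors include competition"},
    {"id": "c2", "content": "total revenue was reported in the annual report"},
    {"id": "c3", "content": "revenue grew"},
]
top = rerank_candidates("total revenue annual report", toy_candidates, overlap_scorer, top_k=2)
print([(c["id"], c["rerank_score"]) for c in top])
```

With the model loaded in the cell above, `reranker_model.predict` could be passed as `score_pairs`, since `CrossEncoder.predict` accepts a list of text pairs and returns one score per pair.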

## Step 7: End-to-End `query_librarian()` Demo

The `query_librarian(question)` function runs the full pipeline:
Dense + BM25 → RRF → Rerank → Generate answer.

It returns the answer plus source chunks and pipeline statistics.

In [None]:
# Close the manually-opened client – query_librarian manages its own
client.close()
print("✓ Manual Weaviate client closed (query_librarian handles its own)")

In [None]:
result = query_librarian(
    "What was the total revenue reported in the annual report?",
    config_path=str(config_path),
    generator_mode="openai",
    verbose=True,
)

print("\n" + "=" * 70)
print(f"Answer ({result['generator_mode']}):")
print("=" * 70)
print(result['answer'])
print("\nPipeline stats:")
for k, v in result['stats'].items():
    print(f"  {k}: {v}")
print("\nTop sources:")
for i, src in enumerate(result['sources'], 1):
    print(f"  [{i}] chunk {src['chunk_id']} "
          f"(pp. {src['page_start']}-{src['page_end']}) "
          f"rerank={src['scores'].get('rerank', 'N/A')}")
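As a rough picture of how an end-to-end function of this kind can be wired, the sketch below chains five pluggable stages and times each one. It is not the project's `query_librarian`: `run_pipeline` and the stub stages are hypothetical, and only the `total_ms` and `reranked_k` stat names are taken from this notebook (the per-stage `*_ms` keys are invented for illustration).

```python
import time
from typing import Any, Callable, Dict, List


def run_pipeline(question: str,
                 dense_fn: Callable[[str], List[Any]],
                 bm25_fn: Callable[[str], List[Any]],
                 fuse_fn: Callable[[List[Any], List[Any]], List[Any]],
                 rerank_fn: Callable[[str, List[Any]], List[Any]],
                 generate_fn: Callable[[str, List[Any]], str]) -> Dict[str, Any]:
    """Run dense + BM25 -> fusion -> rerank -> generation and collect timings."""
    stats: Dict[str, Any] = {}
    t_start = time.perf_counter()

    def timed(name: str, fn: Callable, *args):
        # Record the wall-clock duration of one stage in milliseconds.
        t0 = time.perf_counter()
        out = fn(*args)
        stats[f"{name}_ms"] = (time.perf_counter() - t0) * 1000
        return out

    dense = timed("dense", dense_fn, question)
    sparse = timed("bm25", bm25_fn, question)
    fused = timed("fusion", fuse_fn, dense, sparse)
    top = timed("rerank", rerank_fn, question, fused)
    answer = timed("generate", generate_fn, question, top)

    stats["reranked_k"] = len(top)
    stats["total_ms"] = (time.perf_counter() - t_start) * 1000
    return {"answer": answer, "sources": top, "stats": stats}


# Stub stages so the sketch runs without Weaviate, models, or an API key.
result_demo = run_pipeline(
    "What was the total revenue?",
    dense_fn=lambda q: ["c1", "c2"],
    bm25_fn=lambda q: ["c2", "c3"],
    fuse_fn=lambda a, b: list(dict.fromkeys(a + b)),  # placeholder merge, not real RRF
    rerank_fn=lambda q, docs: docs[:2],
    generate_fn=lambda q, docs: f"Answer based on {len(docs)} chunks",
)
print(result_demo["answer"])
print(sorted(result_demo["stats"]))
```

Passing the stages in as callables keeps each one testable in isolation and makes it easy to swap the generator, which is what the `generator_mode` argument does at a higher level.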

## Step 8: Entity-Heavy Queries (Why Hybrid Search Matters)

Financial documents contain specific entities that pure semantic search
often misses. These queries test the hybrid retrieval advantage.

In [None]:
entity_queries = [
    "What information is disclosed in the Form 10-K filing?",
    "What are the key financial metrics for fiscal year 2024?",
    "What are the main business segments and their revenue contributions?",
    "What risk factors are highlighted in the annual report?",
    "What does the report say about stock-based compensation?",
]

print("📊 Testing entity-heavy queries with query_librarian()\n")
for q in entity_queries:
    print(f"Q: {q}")
    r = query_librarian(q, config_path=str(config_path), generator_mode="openai", verbose=False)
    print(f"A: {r['answer'][:300]}{'...' if len(r['answer']) > 300 else ''}")
    print(f"   [{r['stats']['total_ms']:.0f}ms | {r['stats']['reranked_k']} sources]\n")
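One simple way to check whether a retriever actually surfaced an entity is to look for the literal string in the returned chunks. The helper below is a hypothetical sketch that assumes each result is a dict with a `content` field, as in the retrieval cells earlier; the toy results are invented.

```python
from typing import Dict, List, Optional


def first_hit_rank(results: List[Dict], entity: str) -> Optional[int]:
    """Return the 1-based rank of the first result containing `entity`, else None."""
    needle = entity.lower()
    for rank, result in enumerate(results, start=1):
        if needle in result["content"].lower():
            return rank
    return None


# Invented results: the dense list is topically related but lacks the exact string.
toy_dense = [
    {"content": "Annual filings summarize performance."},
    {"content": "Disclosures follow SEC rules."},
]
toy_bm25 = [
    {"content": "The Form 10-K was filed in February."},
    {"content": "Annual filings summarize performance."},
]

print(first_hit_rank(toy_dense, "Form 10-K"))  # None
print(first_hit_rank(toy_bm25, "Form 10-K"))   # 1
```

Applied to `dense_results`, `bm25_results`, and `fused_results` for the same query, a check like this gives a quick per-entity view of which retriever contributed the match.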

## Step 9: Generator Comparison – OpenAI vs Intern (Fine-tuned) vs Base Model

**Experiment:** Run the **same retrieved context** through three different
answer generators to compare how much fine-tuning impacts contextual
understanding.

| Generator | Description |
|-----------|-------------|
| **OpenAI (gpt-4o)** | Cloud API baseline – strong general model |
| **Intern fine-tuned** | Llama-3-8B + LoRA from Part 2 – domain-adapted |
| **Base model** | Llama-3-8B-Instruct without LoRA – control group |

> **Note:** Running the Intern and Base generators requires a GPU (e.g., a Colab T4).
> On a CPU-only machine, this section shows OpenAI results only.

In [None]:
import torch

gpu_available = torch.cuda.is_available()
print(f"GPU available: {gpu_available}")
if gpu_available:
    print(f"GPU: {torch.cuda.get_device_name(0)}")

# Decide which generators to compare
generators_to_test = ["openai"]
if gpu_available:
    generators_to_test.extend(["intern_finetuned", "intern_base"])
    print("\n✓ Will compare: OpenAI vs Intern (fine-tuned) vs Base model")
else:
    print("\n⚠ No GPU – comparing OpenAI only. Run in Colab for full comparison.")

In [None]:
comparison_questions = [
    "What was the total revenue reported in the annual report?",
    "What are the main risk factors mentioned in the filing?",
    "What is the company's strategy for growth in the coming years?",
]

print("🔬 Generator Comparison Experiment")
print("=" * 70)

for q_idx, question in enumerate(comparison_questions, 1):
    print(f"\n{'='*70}")
    print(f"Q{q_idx}: {question}")
    print("=" * 70)

    for mode in generators_to_test:
        print(f"\n  🤖 Generator: {mode}")
        try:
            t0 = time.time()
            result = query_librarian(
                question,
                config_path=str(config_path),
                generator_mode=mode,
                verbose=False,
            )
            elapsed = (time.time() - t0) * 1000
            print(f"  Answer: {result['answer'][:400]}")
            print(f"  ⏱ {elapsed:.0f}ms total")
        except Exception as e:
            print(f"  ❌ Error: {e}")

print("\n" + "=" * 70)
print("✅ Generator comparison complete")
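The loop above calls the full pipeline once per generator. If the goal is to guarantee that every generator sees exactly the same context, one option is to retrieve once and hand an identical prompt to each model. The sketch below illustrates that design with stub generators; `build_prompt`, `compare_generators`, and the stubs are hypothetical, and the prompt wording is not the project's `build_rag_prompt`.

```python
import time
from typing import Callable, Dict, List


def build_prompt(question: str, chunks: List[str]) -> str:
    """Assemble a numbered-context prompt (illustrative wording)."""
    context = "\n\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1))
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


def compare_generators(question: str, chunks: List[str],
                       generators: Dict[str, Callable[[str], str]]) -> List[Dict]:
    """Run every generator on one shared prompt and record answer, error, and latency."""
    prompt = build_prompt(question, chunks)  # built once, shared by all generators
    rows = []
    for name, generate in generators.items():
        t0 = time.perf_counter()
        try:
            answer, error = generate(prompt), None
        except Exception as exc:  # keep going if one backend is unavailable
            answer, error = None, str(exc)
        rows.append({
            "generator": name,
            "answer": answer,
            "error": error,
            "ms": (time.perf_counter() - t0) * 1000,
        })
    return rows


def failing_generator(prompt: str) -> str:
    """Stub that mimics a backend which cannot run (e.g. no GPU)."""
    raise RuntimeError("no GPU available")


stub_generators = {
    "openai": lambda prompt: "Stub answer A",
    "intern_finetuned": lambda prompt: "Stub answer B",
    "intern_base": failing_generator,
}
rows = compare_generators(
    "What was the total revenue?",
    ["Chunk one text.", "Chunk two text."],
    stub_generators,
)
for row in rows:
    print(row["generator"], "->", row["answer"] or f"error: {row['error']}")
```

Holding the prompt fixed isolates the generator as the only variable, so differences in the answers can be attributed to the model rather than to retrieval.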

## Summary

**What we built:**

| Component | Implementation |
|-----------|---------------|
| **Vector Database** | Weaviate (embedded mode) |
| **Dense Retrieval** | SentenceTransformer `all-MiniLM-L6-v2` → Weaviate `near_vector` |
| **Sparse Retrieval** | Weaviate built-in BM25 |
| **Fusion** | Reciprocal Rank Fusion (RRF, k=60) |
| **Reranking** | CrossEncoder `ms-marco-MiniLM-L-6-v2` |
| **Generation** | OpenAI gpt-4o / Intern fine-tuned / Base model |
| **Entrypoint** | `query_librarian(question)` → answer + sources + stats |

**Key Insights:**
- Hybrid search (dense + BM25) catches entities that pure vector search misses
- RRF elegantly combines rankings without needing score calibration
- Cross-encoder reranking provides high-precision final ordering
- Fine-tuning the generator on domain data can improve answer quality

**Artifacts:**
- `src/rag/` – reusable RAG modules
- `.weaviate/` – embedded Weaviate persistence
- `config/config.yaml` – all parameters under `rag.*`