# How to Run (Chapter 5 Setup)

Follow these steps before running any cells:

- macOS/Linux
  1) Open Terminal
  2) cd to this repository root (27July2025)
  3) Run: `bash chapter5/setup/setup.sh`
  4) Activate: `source .venv-ch5/bin/activate`

- Windows (PowerShell)
  1) Open PowerShell
  2) cd to this repository root (27July2025)
  3) Run: `powershell -ExecutionPolicy Bypass -File chapter5/setup/setup.ps1`
  4) Activate: `.\.venv-ch5\Scripts\Activate.ps1`

- Google Colab
  1) Run the "Colab detection and Drive mount + path selection" cell. It detects Colab and mounts Google Drive
  2) Paths are set under: `/content/drive/MyDrive/data-strategy-book/27July2025/`
  3) No virtual environment needed in Colab; dependencies install via pip cells as needed

Environment selection:
- Local notebooks should be run inside `.venv-ch5` for Chapter 5
- `.env` is loaded from `chapter5/setup/.env` if present; copy `.env.example` and fill keys as needed

After setup, proceed with the Colab/path cell to confirm DB/DATA/TRACE locations.


In [1]:
# Reranker providers (commented): enable one, add API key in chapter5/setup/.env, and set cfg['enable_rerank']=True

# --- Cohere (hosted cross-encoder) ---
# import os
# import cohere
# COHERE_API_KEY = os.getenv("COHERE_API_KEY")
# co = cohere.Client(COHERE_API_KEY) if COHERE_API_KEY else None
# def cohere_rerank(query: str, docs: list[str], top_n: int = 5):
#     if not co:
#         print("[rerank] COHERE_API_KEY missing; skipping")
#         return docs[:top_n]
#     resp = co.rerank(query=query, documents=docs, top_n=top_n, model="rerank-english-v3.0")
#     # Return documents ordered by relevance
#     return [docs[r.index] for r in resp.results]

# --- Voyage (hosted cross-encoder) ---
# import os
# import voyageai as vo
# VOYAGE_API_KEY = os.getenv("VOYAGE_API_KEY")
# vc = vo.Client(api_key=VOYAGE_API_KEY) if VOYAGE_API_KEY else None
# def voyage_rerank(query: str, docs: list[str], top_n: int = 5):
#     if not vc:
#         print("[rerank] VOYAGE_API_KEY missing; skipping")
#         return docs[:top_n]
#     res = vc.rerank(query=query, documents=docs, model="rerank-2-lite", top_k=top_n)
#     # Each result carries the index of the document in the input list, sorted by relevance desc
#     return [docs[r.index] for r in res.results]

# --- Local (open-source; slower on CPU) ---
# from rerankers import Reranker
# _local_rr = Reranker("cross-encoder/ms-marco-MiniLM-L-6-v2", model_type="cross-encoder")
# def local_rerank(query: str, docs: list[str], top_n: int = 5):
#     ranked = _local_rr.rank(query=query, docs=docs)  # RankedResults, ordered by score
#     return [r.text for r in ranked.top_k(top_n)]

# Integration suggestion in demo cell:
# if cfg['enable_rerank']:
#     docs = [h['text'] for h in raw_hits]
#     # choose one provider
#     # docs = cohere_rerank(q, docs, cfg['rerank_k'])
#     # docs = voyage_rerank(q, docs, cfg['rerank_k'])
#     # docs = local_rerank(q, docs, cfg['rerank_k'])
#     # Map docs back to hits by text or add IDs to avoid ambiguity
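The last comment in the cell above notes that mapping reranked texts back to hits by text can be ambiguous, since two chunks may share identical text. A minimal sketch of one way to avoid that is to rerank by position and carry the original hit dicts along. The names `rerank_hits`, `score_fn`, and `overlap_score` are hypothetical and not part of the chapter's modules; `score_fn` stands in for whichever provider you enable.

```python
from typing import Callable

def rerank_hits(query: str, hits: list[dict], score_fn: Callable[[str, str], float],
                top_n: int = 5) -> list[dict]:
    """Reorder hit dicts by a relevance score, keeping ids and metadata attached."""
    scored = [(score_fn(query, h["text"]), i) for i, h in enumerate(hits)]
    # Sort by score descending; the index breaks ties so the original order is kept.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [hits[i] for _, i in scored[:top_n]]

def overlap_score(query: str, text: str) -> float:
    """Toy scorer (assumption): fraction of query words that appear in the text."""
    q_words = set(query.lower().split())
    return len(q_words & set(text.lower().split())) / max(len(q_words), 1)

# Illustrative sample hits; "b" and "c" share the same text but have distinct ids.
hits = [
    {"id": "a", "text": "lunch menu for friday"},
    {"id": "b", "text": "follow-up meeting on monday with the design team"},
    {"id": "c", "text": "follow-up meeting on monday with the design team"},
]
reranked = rerank_hits("when is the follow-up meeting", hits, overlap_score, top_n=2)
print([h["id"] for h in reranked])
```

Because the function returns the original dicts, duplicate texts keep their own ids and metadata, so no lookup by text is needed afterwards.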

# Advanced RAG Techniques — Companion Notebook

This notebook accompanies Chapter 5. It layers advanced capabilities on top of Chapter 4 without changing chunking/embeddings.


# How to Run (Chapter 5 Setup)

This notebook builds on Chapter 4. Run these steps before executing cells.

- macOS/Linux
  1) Open Terminal
  2) cd to the repo root: `data-strategy-book/27July2025`
  3) Run setup: `bash chapter5/setup/setup.sh`
  4) Activate env: `source .venv-ch5/bin/activate`
  5) Start Jupyter: `jupyter lab` (or `jupyter notebook`)
  6) Select the Python kernel that points to `.venv-ch5` if prompted

- Windows (PowerShell)
  1) Open PowerShell
  2) cd to the repo root: `data-strategy-book/27July2025`
  3) Run setup: `powershell -ExecutionPolicy Bypass -File chapter5/setup/setup.ps1`
  4) Activate env: `.\.venv-ch5\Scripts\Activate.ps1`
  5) Start Jupyter: `jupyter lab` (or `jupyter notebook`)
  6) Select the Python kernel that points to `.venv-ch5` if prompted

- Google Colab
  1) Run the "Colab detection and Drive mount + path selection" cell; it detects Colab and mounts Google Drive
  2) Paths are set to `/content/drive/MyDrive/data-strategy-book/27July2025/`
  3) No virtualenv needed; pip installs happen inline as required

Environment selection and keys
- Local runs should use the `.venv-ch5` environment for Chapter 5
- Copy `chapter5/setup/.env.example` to `chapter5/setup/.env` and fill keys if needed
- You can verify setup by running: `python chapter5/setup/validate_setup.py`

Paths used by this notebook
- DB: ch5_db (under repo root or in Drive on Colab)
- Traces: chapter5/traces/advanced_rag.jsonl
- Data: chapter5/data/

After setup, proceed with the “Colab detection and Drive mount + path selection” cell to confirm DB/DATA/TRACE locations.

In [2]:
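# Colab detection and Drive mount + path selection
import sys
from pathlib import Path

def in_colab():
    """Return True when running inside Google Colab."""
    try:
        import google.colab  # type: ignore
        return True
    except ImportError:
        return False

def find_repo_root(start: Path) -> Path:
    """Walk upward from `start` to the first directory that contains chapter5/setup."""
    for cand in (start, *start.parents):
        if (cand / 'chapter5' / 'setup').is_dir():
            return cand
    return start  # fall back to the working directory

if in_colab():
    from google.colab import drive  # type: ignore
    drive.mount('/content/drive', force_remount=False)
    BASE = '/content/drive/MyDrive/data-strategy-book/27July2025'
else:
    # A fixed Path.cwd().parents[2] depends on where Jupyter was launched; search upward instead
    BASE = str(find_repo_root(Path.cwd()))

# Make `import chapter5...` resolvable from later cells
if BASE not in sys.path:
    sys.path.insert(0, BASE)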

DB_PATH = str(Path(BASE) / 'ch5_db')
TRACES_PATH = str(Path(BASE) / 'chapter5' / 'traces' / 'advanced_rag.jsonl')
DATA_PATH = str(Path(BASE) / 'chapter5' / 'data')
print('Paths:')
print(' BASE  =', BASE)
print(' DB    =', DB_PATH)
print(' TRACES=', TRACES_PATH)
print(' DATA  =', DATA_PATH)


Paths:
 BASE  = /Users/relhousieny/code/personal/books/data-strategy-book
 DB    = /Users/relhousieny/code/personal/books/data-strategy-book/ch5_db
 TRACES= /Users/relhousieny/code/personal/books/data-strategy-book/chapter5/traces/advanced_rag.jsonl
 DATA  = /Users/relhousieny/code/personal/books/data-strategy-book/chapter5/data


In [3]:
# Kernel fix + silent installs + env load (aligned with Ch1/Ch4 style)
import sys, subprocess, os
def pip_install(pkg):
    try:
        subprocess.run([sys.executable, '-m', 'pip', 'install', pkg, '-q'], check=True)
    except Exception as e:
        print('Install failed:', pkg, e)

# Minimal deps expected for this notebook; setup scripts already pin versions
for p in ['python-dotenv']:
    pip_install(p)

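from pathlib import Path
from dotenv import load_dotenv

# Prefer the chapter's .env under the repo root (BASE comes from the path cell above)
env_guess = Path(BASE) / 'chapter5' / 'setup' / '.env'
if env_guess.exists():
    load_dotenv(env_guess)
else:
    load_dotenv()  # fall back to python-dotenv's default search
print('Loaded .env (if present)')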


Loaded .env (if present)


In [4]:
# Imports + config toggles
from datetime import datetime
from pathlib import Path
import json

# Local modules
from chapter5.code.retrieval.advanced import (
    Hit, retrieve, blend_hybrid_scores, apply_rerank, self_query_extract, multi_query_expand,
    route, apply_recency_boost, assemble_context
)
from chapter5.code.agents.iterative import agent_loop
from chapter5.code.prompts.profiles import build_prompt
from chapter5.code.safety.filters import pre_filters, post_validators
from chapter5.code.observability.trace import TraceRecord, write_trace_jsonl, now_ms
from chapter5.code.multimodal.pdf_utils import extract_captions_near_match

# Feature toggles
cfg = {
  'enable_hybrid': True,
  'enable_rerank': False,
  'rerank_model': 'stub',
  'rerank_k': 5,
  'enable_multi_query': False,
  'enable_self_query': False,
  'route_policy': 'auto',
  'recency_boost': True,
  'prompt_profile': 'grounded_baseline',
  'safety_pre': True,
  'safety_post': True,
  'trace_enabled': True,
}
cfg


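ModuleNotFoundError: No module named 'chapter5'

> Import note: This error usually means the repository root (the folder that contains `chapter5/`) is not on `sys.path`. Check that `BASE` printed by the path cell points at that folder, make sure it is on `sys.path` (for example `sys.path.insert(0, BASE)`), and re-run this cell.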

> Rerank note: To enable reranking, set `cfg['enable_rerank']=True` and uncomment one provider in the "Reranker providers" cell above (Cohere, Voyage, or Local).
    The hosted providers also need `COHERE_API_KEY` or `VOYAGE_API_KEY` in `chapter5/setup/.env`.

### Getting a reranker API key (quick)

- **Cohere Rerank**: Create an account → get API key at https://dashboard.cohere.com/api-keys → set `COHERE_API_KEY` in `chapter5/setup/.env`.
- **VoyageAI Rerank**: Sign up → get key at https://dashboard.voyageai.com/ → set `VOYAGE_API_KEY` in `.env`.
- **Jina AI Rerankers**: Create key at https://cloud.jina.ai/ (if using Jina’s hosted rerankers) → set provider-specific key in `.env`.

> Tip: Restart the kernel after editing `.env` so the notebook picks up the new key.


## Connect to Chroma (reuse or create Chapter 5 DB)
We will use a local persistent DB. If Chapter 4's DB exists, you can also point to it.


In [None]:
import chromadb
client = chromadb.PersistentClient(path=DB_PATH)
COLL = 'chapter5_demo'
# Reuse the collection if it already exists, otherwise create it with cosine distance
collection = client.get_or_create_collection(name=COLL, metadata={'hnsw:space': 'cosine'})
print('Collection:', collection.name)


## Data: tiny samples (personal, enterprise, policies)
We add a few records with primitive metadata to keep demos fast and meaningful.


In [None]:
from datetime import datetime  # needed for created_at below
from pathlib import Path
import uuid

def read_text(p):
    try:
        return Path(p).read_text(encoding='utf-8')
    except Exception:
        return ''

docs = []
base = Path(DATA_PATH)
samples = [
    base/'personal'/'note1.md',
    base/'personal'/'note2.md',
    base/'enterprise'/'policy_hr.md',
    base/'policies'/'policy_hate.md',
]
now_iso = datetime.utcnow().isoformat()
for sp in samples:
    if sp.exists():
        docs.append({
            'id': sp.stem + '_' + str(uuid.uuid4())[:8],
            'text': read_text(sp),
            'metadata': {
                'source': sp.parent.name,
                'doc_type': sp.suffix.replace('.', '') or 'txt',
                'created_at': now_iso,
                'path': str(sp)
            }
        })

if docs:
    collection.add(
        ids=[d['id'] for d in docs],
        documents=[d['text'] for d in docs],
        metadatas=[d['metadata'] for d in docs],
    )
print('Seeded docs:', len(docs))
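The seeding cell above appends a random UUID fragment to each id, so re-running it adds the same files again under new ids. One possible alternative, sketched here under the assumption that one record per file is what you want, is to derive the id from the file path so repeated runs produce the same id. `stable_doc_id` is a hypothetical helper, not part of the chapter's code.

```python
import hashlib
from pathlib import Path

def stable_doc_id(path: Path) -> str:
    """Build an id that is identical across runs for the same file path."""
    digest = hashlib.sha1(path.as_posix().encode("utf-8")).hexdigest()
    return f"{path.stem}_{digest[:8]}"

# Illustrative paths that mirror the sample layout used above.
id_first = stable_doc_id(Path("chapter5/data/personal/note1.md"))
id_again = stable_doc_id(Path("chapter5/data/personal/note1.md"))
id_other = stable_doc_id(Path("chapter5/data/personal/note2.md"))
print(id_first == id_again, id_first == id_other)
```

With ids like these, swapping `collection.add` for `collection.upsert` in the cell above would overwrite the existing record for a file instead of adding a second copy.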


## Advanced Retrieval demos (hybrid/rerank/routing/multi/self-query)


In [None]:
# Retrieval wiring to Chroma for this notebook
def chroma_search(query: str, n: int = 5):
    res = collection.query(query_texts=[query], n_results=n, include=['metadatas', 'documents', 'distances'])
    hits = []
    for i, doc in enumerate(res.get('documents', [[]])[0]):
        md = res['metadatas'][0][i] if res.get('metadatas') else {}
        score = 1.0 / (1.0 + res['distances'][0][i]) if res.get('distances') else 0.0
        hits.append({'id': md.get('path', f'doc_{i}'), 'text': doc, 'metadata': md, 'score': score})
    return hits

q = 'When is the follow-up and who attends?'
raw_hits = chroma_search(q, n=8)
print('Top raw hits:', len(raw_hits))

# Hybrid/rerank/multi/self-query stubs (using local functions)
from types import SimpleNamespace

diag = {'routed': 'auto'}
if cfg['enable_rerank']:
    # Wrap each dict hit so its fields are available as attributes
    raw_hits = apply_rerank([SimpleNamespace(**h) for h in raw_hits], cfg['rerank_model'], cfg['rerank_k'])
print('Diagnostics:', diag)
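The cell above only exercises the rerank toggle. To illustrate what hybrid blending generally involves, here is a minimal sketch that min-max normalizes dense and lexical scores and combines them with a weight. This is an assumption about the general technique, not a description of how `blend_hybrid_scores` is implemented; `min_max`, `blend_scores`, and `alpha` are hypothetical names, and the scores are made-up sample values.

```python
def min_max(scores: dict[str, float]) -> dict[str, float]:
    """Rescale scores to [0, 1]; if all scores are equal, every value maps to 0."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    span = hi - lo
    return {k: ((v - lo) / span if span else 0.0) for k, v in scores.items()}

def blend_scores(dense: dict[str, float], lexical: dict[str, float],
                 alpha: float = 0.5) -> list[tuple[str, float]]:
    """Weighted sum of normalized scores; ids missing from one side count as 0 there."""
    d, l = min_max(dense), min_max(lexical)
    blended = {i: alpha * d.get(i, 0.0) + (1 - alpha) * l.get(i, 0.0)
               for i in set(d) | set(l)}
    # Highest blended score first; the id breaks ties deterministically.
    return sorted(blended.items(), key=lambda kv: (-kv[1], kv[0]))

dense_scores = {"a": 0.9, "b": 0.5, "c": 0.1}       # e.g. vector similarity
lexical_scores = {"b": 12.0, "c": 3.0, "d": 6.0}    # e.g. keyword match scores
ranked = blend_scores(dense_scores, lexical_scores, alpha=0.5)
print([doc_id for doc_id, _ in ranked])
```

Normalizing first matters because the two score scales are unrelated; without it, the larger lexical values would dominate the sum regardless of `alpha`.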


## Iterative / Agentic loop


In [None]:
def retriever_for_agent(sub_q: str):
    return {'hits': chroma_search(sub_q, n=5), 'diagnostics': {'routed': 'auto'}}
out = agent_loop('Summarize meeting action items with attendees', max_hops=2, options={'retriever': retriever_for_agent})
out


## Multi-source / ACL-aware retrieval


In [None]:
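from types import SimpleNamespace

hits = chroma_search('PTO policy for contractors', n=5)
# source_filters asks assemble_context to restrict the context to enterprise-sourced hits
ctx = assemble_context([SimpleNamespace(**h) for h in hits], max_chars=800, source_filters={'source': 'enterprise'})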
print(ctx[:400])
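The call above passes `source_filters={'source': 'enterprise'}` to `assemble_context`. The general idea behind ACL-aware retrieval is to drop hits the caller should not see before any text reaches the prompt. The sketch below shows that idea on plain dicts; `filter_hits` and `forbidden_sources` are hypothetical names and this is not necessarily how `assemble_context` applies its filters.

```python
def filter_hits(hits, source_filters=None, forbidden_sources=()):
    """Keep hits whose metadata matches every filter and whose source is not forbidden."""
    source_filters = source_filters or {}
    kept = []
    for hit in hits:
        md = hit.get("metadata") or {}
        if md.get("source") in forbidden_sources:
            continue  # the caller is not allowed to see this source
        if all(md.get(key) == value for key, value in source_filters.items()):
            kept.append(hit)
    return kept

# Made-up sample hits for illustration.
sample_hits = [
    {"id": "policy_hr", "text": "Contractor PTO rules", "metadata": {"source": "enterprise"}},
    {"id": "note1", "text": "Personal reminder", "metadata": {"source": "personal"}},
    {"id": "untagged", "text": "No metadata here", "metadata": {}},
]
enterprise_only = filter_hits(sample_hits, source_filters={"source": "enterprise"})
without_personal = filter_hits(sample_hits, forbidden_sources={"personal"})
print([h["id"] for h in enterprise_only], [h["id"] for h in without_personal])
```

Chroma's `query` also accepts a `where` metadata filter, which pushes the same restriction into the search itself instead of filtering afterwards.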


## Pragmatic multi-modal (figure/caption)


> PDF demo: The setup script creates `chapter5/data/pdfs/sample_report_with_figure.pdf`. The cell below uses the first PDF it finds under `chapter5/data/pdfs/`.

In [None]:
# If a PDF sample exists under chapter5/data/pdfs, extract nearby snippets for a keyword
pdf_dir = Path(DATA_PATH) / 'pdfs'
pdfs = list(pdf_dir.glob('*.pdf'))
if pdfs:
    caps = extract_captions_near_match(str(pdfs[0]), 'figure')
    print(caps[:1])
else:
    print('No PDF sample found; skipping demo.')


## Freshness + Cache Augmented Generation (CAG)


In [None]:
# Recency boost demo (uses the created_at metadata field)
hits = chroma_search('time off policy', n=8)
from chapter5.code.retrieval.advanced import Hit as _Hit
hits_cls = [_Hit(id=h['id'], text=h['text'], score=h['score'], metadata=h['metadata']) for h in hits]
boosted = apply_recency_boost(hits_cls)
[(h.id, round(h.score, 3)) for h in boosted[:5]]
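For intuition about what a recency boost can do, here is a minimal sketch that multiplies each score by a factor that decays with document age. The half-life form, the `weight` parameter, and the name `recency_boost` are assumptions for illustration; `apply_recency_boost` in the chapter's module may use a different formula.

```python
from datetime import datetime

def recency_boost(hits, now, half_life_days=30.0, weight=0.2):
    """Scale each score by 1 + weight * 0.5 ** (age_days / half_life_days), then re-sort."""
    boosted = []
    for h in hits:
        created = datetime.fromisoformat(h["metadata"]["created_at"])
        age_days = max((now - created).total_seconds() / 86400.0, 0.0)
        factor = 1.0 + weight * 0.5 ** (age_days / half_life_days)
        boosted.append({**h, "score": h["score"] * factor})
    return sorted(boosted, key=lambda h: h["score"], reverse=True)

# Made-up hits: the older document starts with the higher raw score.
now = datetime(2025, 7, 27)
sample = [
    {"id": "old", "score": 0.80, "metadata": {"created_at": "2024-07-27T00:00:00"}},
    {"id": "new", "score": 0.75, "metadata": {"created_at": "2025-07-20T00:00:00"}},
]
ranked_by_recency = recency_boost(sample, now)
print([(h["id"], round(h["score"], 3)) for h in ranked_by_recency])
```

A fresh document gets close to the full `weight` as a bonus, while a year-old one gets almost none, so the newer hit can overtake a slightly better raw match.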


## Prompt profiles + safety checks


In [None]:
from types import SimpleNamespace

question = 'Do contractors accrue PTO? Please cite.'
hits = chroma_search('contractors accrue PTO', n=5)
ctx = assemble_context([SimpleNamespace(**h) for h in hits], max_chars=1000)
if cfg['safety_pre']:
    ctx = pre_filters(ctx, user_ctx={'forbidden_sources': []})
prompt = build_prompt(cfg['prompt_profile'], ctx, question)
# Here we would call an LLM; we keep a stubbed answer for the skeleton
answer = 'Contractors do not accrue PTO. [source=enterprise; id=policy_hr]'
res = post_validators(answer, policy={'require_citation': True})
res
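The stubbed answer above carries a citation in the form `[source=...; id=...]`, and the validator is called with `require_citation`. A minimal sketch of such a check, assuming that bracket format is the only citation style, could look like this. `check_citations` and `CITATION_RE` are hypothetical names; `post_validators` may do more or work differently.

```python
import re

# Matches citations shaped like [source=enterprise; id=policy_hr]
CITATION_RE = re.compile(r"\[source=(?P<source>[^;\]]+);\s*id=(?P<id>[^\]]+)\]")

def check_citations(answer: str, require_citation: bool = True) -> dict:
    """Extract bracketed citations and report whether the citation policy is met."""
    citations = [m.groupdict() for m in CITATION_RE.finditer(answer)]
    ok = bool(citations) or not require_citation
    return {"ok": ok, "citations": citations}

cited = check_citations("Contractors do not accrue PTO. [source=enterprise; id=policy_hr]")
uncited = check_citations("Contractors do not accrue PTO.")
print(cited["ok"], cited["citations"], uncited["ok"])
```

Returning the parsed citations alongside the pass/fail flag makes it possible to verify later that each cited id was actually among the retrieved hits.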


## Observability traces


In [None]:
rec = TraceRecord(
    query_id='demo-001', timestamp=now_ms(), retriever='chroma', routed='auto',
    k_before=8, k_after=5, rerank_used=cfg['enable_rerank'], latency_ms=12.3, cost_usd=0.0,
    profile=cfg['prompt_profile'], hops=1, citations=1
)
write_trace_jsonl(TRACES_PATH, rec)
print('Trace written to', TRACES_PATH)
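The trace file uses the JSONL layout: one JSON object per line, appended per query. The sketch below shows that pattern with a small stand-in dataclass, writing to a temporary directory so it does not touch `TRACES_PATH`. `MiniTrace`, `append_jsonl`, and `read_jsonl` are hypothetical names and carry only a subset of the fields used above; the real `TraceRecord` and `write_trace_jsonl` live in the chapter's observability module.

```python
import json
import tempfile
from dataclasses import dataclass, asdict
from pathlib import Path

@dataclass
class MiniTrace:
    query_id: str
    timestamp: int
    retriever: str
    latency_ms: float

def append_jsonl(path, record) -> None:
    """Append one record as a single JSON line, creating parent folders if needed."""
    p = Path(path)
    p.parent.mkdir(parents=True, exist_ok=True)
    with p.open("a", encoding="utf-8") as f:
        f.write(json.dumps(asdict(record)) + "\n")

def read_jsonl(path) -> list[dict]:
    """Load every non-empty line back into a dict."""
    with Path(path).open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

with tempfile.TemporaryDirectory() as tmp:
    trace_file = Path(tmp) / "traces" / "demo.jsonl"
    append_jsonl(trace_file, MiniTrace("demo-001", 1, "chroma", 12.3))
    append_jsonl(trace_file, MiniTrace("demo-002", 2, "chroma", 9.8))
    rows = read_jsonl(trace_file)
print(len(rows), rows[0]["query_id"])
```

Appending line by line keeps earlier traces intact across runs and lets you load the file later for per-query latency or cost analysis.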


## Mini end-to-end demos
- Multi-step question (agent loop)
- Hybrid + rerank (ordering change)
- Recency-sensitive (newer outranks)
- PDF figure/caption question (if sample PDF present)
