# Agent CFO — Performance Optimization & Design

---
This is the starter notebook for your project. Follow the required structure below.


You will design and optimize an Agent CFO assistant for a listed company. The assistant should answer finance/operations questions using RAG (Retrieval-Augmented Generation) + agentic reasoning, with response time (latency) as the primary metric.

Your system must:
*   Ingest the company’s public filings.
*   Retrieve relevant passages efficiently.
*   Compute ratios/trends via tool calls (calculator, table parsing).
*   Produce answers with valid citations to the correct page/table.


In [1]:
import os
os.environ["GEMINI_API_KEY"] = "AIzaSyBKaJ1EXo5qvIcLVjbWaSQeT_pL5VA6XhU"  # replace with your key

## 1. Config & Secrets

Fill in your API keys in secrets. **Do not hardcode keys** in cells.

In [2]:
import os

# Example:
# os.environ['GEMINI_API_KEY'] = 'your-key-here'
# os.environ['OPENAI_API_KEY'] = 'your-key-here'

COMPANY_NAME = "DBS Bank"


## 2. Data Download (Dropbox)

*   Annual Reports: last 3–5 years.
*   Quarterly Results Packs & MD&A (Management Discussion & Analysis).
*   Investor Presentations and Press Releases.
*   These files must be submitted later as a deliverable in the Dropbox data pack.
*   Upload them under `/content/data/`.

Scope limit: each team will ingest minimally 15 PDF files total.


## 3. System Requirements

**Retrieval & RAG**
*   Use a vector index (e.g., FAISS, LlamaIndex) + a keyword filter (BM25/ElasticSearch).
*   Citations must include: report name, year, page number, section/table.

**Agentic Reasoning**
*   Support at least 3 tool types: calculator, table extraction, multi-document compare.
*   Reasoning must follow a plan-then-act pattern (not a single unstructured call).

**Instrumentation**
*   Log timings for: T_ingest, T_retrieve, T_rerank, T_reason, T_generate, T_total.
*   Log: tokens used, cache hits, tools invoked.
*   Record p50/p95 latencies.

In [7]:
"""
Stage1.py — Ingestion Pipeline

Builds a Knowledge Base (KB) + Vector Store with metadata.
Outputs:
  - data/kb_chunks.parquet      # canonical KB with metadata per chunk
  - data/kb_texts.npy           # chunk texts (parallel array)
  - data/kb_index.faiss         # FAISS index of embeddings
  - data/kb_meta.json           # small meta: embedding dim, model, version

Environment (optional):
  OPENAI_API_KEY    — for text-embedding-3-large or 3-small
  GEMINI_API_KEY    — for gemini-embedding text-002 (if you prefer)

You can also use local SentenceTransformers if installed.
"""
from __future__ import annotations
import os, re, json, math, uuid, pathlib, warnings
from dataclasses import dataclass, asdict
from typing import List, Dict, Any, Optional, Iterable, Tuple

import pandas as pd
import numpy as np

# --- optional deps ---
try:
    import faiss  # type: ignore
    _HAVE_FAISS = True
except Exception:
    _HAVE_FAISS = False

try:
    from rank_bm25 import BM25Okapi  # lightweight BM25 for hybrid
    _HAVE_BM25 = True
except Exception:
    _HAVE_BM25 = False

# PDF text extraction (pypdf) — optional
try:
    from pypdf import PdfReader  # minimal + reliable
    _HAVE_PDF = True
except Exception:
    _HAVE_PDF = False

# Embeddings backends (we'll load lazily in Provider)


DATA_DIR = os.environ.get("AGENT_CFO_DATA_DIR", "All")
OUT_DIR = os.environ.get("AGENT_CFO_OUT_DIR", "data")
EMBED_BACKEND = os.environ.get("AGENT_CFO_EMBED_BACKEND", "auto")  # 'auto', 'openai', 'gemini', 'st'
CHUNK_TOKENS = 450  # ~sentence-y chunks; we chunk by chars but aim for this size
CHUNK_OVERLAP = 80

pathlib.Path(OUT_DIR).mkdir(parents=True, exist_ok=True)

# -----------------------------
# Utilities
# -----------------------------

_YEAR_PAT = re.compile(r"\b(20\d{2})\b")
_Q_PAT = re.compile(r"([1-4])Q(\d{2})", re.I)  # e.g., 3Q24 (relaxed, allows underscores etc.)
_FY_PAT = re.compile(r"\bFY\s?(20\d{2})\b", re.I)

# Additional period patterns found in page headers
_QY_PAT_1 = re.compile(r"\b([1-4])\s*Q\s*(20\d{2}|\d{2})\b", re.I)   # e.g., 1 Q 2025, 2Q24
_QY_PAT_2 = re.compile(r"\bQ\s*([1-4])\s*(20\d{2}|\d{2})\b", re.I)     # e.g., Q3 2024
_QY_PAT_3 = re.compile(r"\b([1-4])Q\s*(20\d{2}|\d{2})\b", re.I)        # e.g., 3Q 2024
_FY_PAT_2 = re.compile(r"\bF[Yy]\s*(20\d{2})\b")


def infer_period_from_text(text: str) -> Tuple[Optional[int], Optional[int]]:
    """Try to infer (year, quarter) from page text (headers/footers).
    Rules:
    - Prefer explicit quarter-year patterns (1Q25, Q3 2024, 3Q 2024).
    - Accept FY headers (FY2024) as (2024, None).
    - Ignore lone years to avoid picking up copyright years (e.g., © 2023).
    """
    if not text:
        return (None, None)
    s = text[:500]  # scan a bit more of the header area
    # 1) Explicit quarter-year first
    for pat in (_QY_PAT_1, _QY_PAT_2, _QY_PAT_3):
        m = pat.search(s)
        if m:
            q = int(m.group(1))
            yy = int(m.group(2))
            y = 2000 + yy if yy < 100 else yy
            return (y, q)
    # 2) FY header
    m = _FY_PAT_2.search(s)
    if m:
        return (int(m.group(1)), None)
    # 3) Ignore bare years (too noisy: copyright, footers, etc.)
    return (None, None)
# -----------------------------
# Lightweight table extractor (keywords windows)
# -----------------------------

_KEY_TABLE_SPECS = [
    (re.compile(r"net\s+interest\s+margin|\bnim\b", re.I), "NIM table"),
    (re.compile(r"operating\s+expenses|\bopex\b|staff\s+costs", re.I), "Opex table"),
    (re.compile(r"cost[- ]?to[- ]?income|\bcti\b|efficiency\s+ratio", re.I), "CTI table"),
]

def extract_key_tables_from_page(text: str, window_lines: int = 18) -> List[Tuple[str, str]]:
    """Find small windows around key table keywords and return blocks.
    Returns list of (section_hint, block_text).
    """
    if not text:
        return []
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    out: List[Tuple[str, str]] = []
    for i, ln in enumerate(lines):
        for pat, label in _KEY_TABLE_SPECS:
            if pat.search(ln):
                start = max(0, i - 2)
                end = min(len(lines), i + window_lines)
                block = "\n".join(lines[start:end])
                out.append((label, block))
                break
    return out

SECTION_LABELS = {
    r"key ratios|highlights|summary": "highlights/summary",
    r"net interest margin|nim\b": "Net interest margin (NIM)",
    r"cost[- ]?to[- ]?income|cti|efficiency ratio": "Cost-to-income (CTI)",
    r"operating expenses|opex|expenses": "Operating expenses (Opex)",
    r"income statement|statement of (comprehensive )?income": "Income statement",
    r"balance sheet|statement of financial position": "Balance sheet",
    r"management discussion|md&a": "MD&A",
}

_TABULAR_EXTS = {'.csv', '.xls', '.xlsx'}

def _is_pdf(path: str) -> bool:
    return str(path).lower().endswith('.pdf')

def _is_tabular(path: str) -> bool:
    return any(str(path).lower().endswith(ext) for ext in _TABULAR_EXTS)


def infer_period_from_filename(fname: str) -> Tuple[Optional[int], Optional[int]]:
    """Infer (year, quarter) from common file naming conventions.
    Examples: DBS_3Q24_CFO_Presentation.pdf -> (2024, 3)
              dbs-annual-report-2023.pdf    -> (2023, None)
    """
    base = fname.upper()
    m = _Q_PAT.search(base)
    if m:
        q = int(m.group(1))
        yy = int(m.group(2))
        year = 2000 + yy if yy < 100 else yy
        return (year, q)
    m = _YEAR_PAT.search(base)
    if m:
        return (int(m.group(1)), None)
    m = _FY_PAT.search(base)
    if m:
        return (int(m.group(1)), None)
    return (None, None)


def clean_section_hint(text: str) -> Optional[str]:
    # naive regex scan to tag common sections; optional
    for pat, label in SECTION_LABELS.items():
        if re.search(pat, text, flags=re.IGNORECASE):
            return label
    return None


# -----------------------------
# Chunking
# -----------------------------

def _split_text(text: str, chunk_size_chars: int = 1800, overlap_chars: int = 320) -> List[str]:
    text = (text or "").strip()
    if not text:
        return []
    out = []
    i = 0
    n = len(text)
    while i < n:
        j = min(n, i + chunk_size_chars)
        out.append(text[i:j])
        if j == n:
            break
        i = max(i + chunk_size_chars - overlap_chars, j)  # ensure progress
    return out


# -----------------------------
# PDF parsing
# -----------------------------

def extract_pdf_pages(pdf_path: str) -> List[Tuple[int, str]]:
    """Return list of (page_number_1based, text)."""
    if not _HAVE_PDF:
        raise RuntimeError("pypdf not installed. pip install pypdf")
    reader = PdfReader(pdf_path)
    out = []
    for i, page in enumerate(reader.pages, start=1):
        try:
            txt = page.extract_text() or ""
        except Exception:
            txt = ""
        out.append((i, txt))
    return out


# -----------------------------
# Tabular (CSV/Excel) parsing
# -----------------------------

def _df_to_blocks(df: pd.DataFrame, rows_per_block: int = 40) -> List[str]:
    """Split a DataFrame into row blocks and render each as a compact CSV string.
    Keeps headers on each block for standalone readability.
    """
    if df is None or df.empty:
        return []
    # Drop all-empty columns
    df = df.dropna(axis=1, how='all')
    # Convert everything to string to prevent pyarrow dtype issues downstream
    df = df.astype(str)
    blocks = []
    n = len(df)
    for i in range(0, n, rows_per_block):
        part = df.iloc[i:i+rows_per_block]
        csv_str = part.to_csv(index=False)
        blocks.append(csv_str)
    return blocks


def extract_tabular_chunks(path: str) -> List[Tuple[str, Optional[str]]]:
    """Return a list of (block_text, sheet_name) for CSV/Excel files.
    For CSV → one sheet named 'CSV'. For Excel → one per sheet.
    """
    out: List[Tuple[str, Optional[str]]] = []
    lower = path.lower()
    try:
        if lower.endswith('.csv'):
            df = pd.read_csv(path, low_memory=False)
            for block in _df_to_blocks(df):
                out.append((block, 'CSV'))
        else:
            # Excel: iterate sheets safely
            xl = pd.ExcelFile(path)
            for sheet in xl.sheet_names:
                try:
                    df = xl.parse(sheet)
                except Exception:
                    continue
                for block in _df_to_blocks(df):
                    out.append((block, sheet))
    except Exception:
        # If any parsing error, skip gracefully
        return []
    return out


# -----------------------------
# Embedding providers
# -----------------------------
class EmbeddingProvider:
    name: str = ""
    dim: int = 0
    def embed_batch(self, texts: List[str]) -> np.ndarray:
        raise NotImplementedError


class OpenAIProvider(EmbeddingProvider):
    def __init__(self, model: str = "text-embedding-3-small"):
        from openai import OpenAI  # requires OPENAI_API_KEY
        self.client = OpenAI()
        self.model = model
        # dims: 3-small=1536, 3-large=3072
        self.dim = 1536 if "small" in model else 3072
        self.name = f"openai:{model}"
    def embed_batch(self, texts: List[str]) -> np.ndarray:
        if not texts:
            return np.zeros((0, self.dim), dtype=np.float32)
        resp = self.client.embeddings.create(model=self.model, input=texts)
        vecs = [d.embedding for d in resp.data]
        return np.asarray(vecs, dtype=np.float32)


class STProvider(EmbeddingProvider):
    def __init__(self, model: str = "sentence-transformers/all-MiniLM-L6-v2"):
        from sentence_transformers import SentenceTransformer  # optional
        self.model_name = model
        self.model = SentenceTransformer(model)
        self.dim = self.model.get_sentence_embedding_dimension()
        self.name = f"st:{model}"
    def embed_batch(self, texts: List[str]) -> np.ndarray:
        if not texts:
            return np.zeros((0, self.dim), dtype=np.float32)
        vecs = self.model.encode(texts, batch_size=64, show_progress_bar=False, convert_to_numpy=True, normalize_embeddings=True)
        return vecs.astype(np.float32)


def pick_provider(backend: str = EMBED_BACKEND) -> EmbeddingProvider:
    """Pick embedding provider based on argument or environment variable.
    backend can be 'auto', 'openai', 'gemini', or 'st'.
    Auto-detect priority: OpenAI → Gemini → SentenceTransformers."""
    backend = (backend or 'auto').lower()

    # --- Explicit backend ---
    if backend == 'openai':
        return OpenAIProvider('text-embedding-3-small')
    elif backend == 'st' or backend == 'sentence-transformers':
        return STProvider('sentence-transformers/all-MiniLM-L6-v2')
    elif backend == 'gemini':
        try:
            from google import generativeai as genai
            key = os.environ.get('GEMINI_API_KEY')
            if not key:
                raise RuntimeError('GEMINI_API_KEY not set')
            genai.configure(api_key=key)
            class GeminiProvider(EmbeddingProvider):
                def __init__(self):
                    self.name = 'gemini:embedding-001'
                    self.dim = 0  # default size unknown initially
                def embed_batch(self, texts: List[str]) -> np.ndarray:
                    vecs = []
                    for t in texts:
                        resp = genai.embed_content(model='models/embedding-001', content=t)
                        emb = resp.get('embedding') if isinstance(resp, dict) else getattr(resp, 'embedding', None)
                        if emb is None:
                            raise RuntimeError('Gemini embed_content returned no embedding')
                        vecs.append(emb)
                    arr = np.asarray(vecs, dtype=np.float32)
                    if self.dim == 0 and arr.size:
                        self.dim = int(arr.shape[1])
                    return arr
            return GeminiProvider()
        except Exception as e:
            warnings.warn(f'Gemini provider init failed: {e}')

    # --- Auto detection ---
    if os.environ.get('OPENAI_API_KEY'):
        try:
            return OpenAIProvider('text-embedding-3-small')
        except Exception as e:
            warnings.warn(f'OpenAI provider init failed: {e}')
    if os.environ.get('GEMINI_API_KEY'):
        try:
            from google import generativeai as genai
            genai.configure(api_key=os.environ['GEMINI_API_KEY'])
            class GeminiProvider(EmbeddingProvider):
                def __init__(self):
                    self.name = 'gemini:embedding-001'
                    self.dim = 0
                def embed_batch(self, texts: List[str]) -> np.ndarray:
                    vecs = []
                    for t in texts:
                        resp = genai.embed_content(model='models/embedding-001', content=t)
                        emb = resp.get('embedding') if isinstance(resp, dict) else getattr(resp, 'embedding', None)
                        if emb is None:
                            raise RuntimeError('Gemini embed_content returned no embedding')
                        vecs.append(emb)
                    arr = np.asarray(vecs, dtype=np.float32)
                    if self.dim == 0 and arr.size:
                        self.dim = int(arr.shape[1])
                    return arr
            return GeminiProvider()
        except Exception as e:
            warnings.warn(f'Gemini provider init failed: {e}')
    try:
        return STProvider('sentence-transformers/all-MiniLM-L6-v2')
    except Exception as e:
        raise SystemExit(f'No embedding backend available. Install sentence-transformers or set an API key. {e}')


# -----------------------------
# Safe Parquet save with dtype sanitization
# -----------------------------

def _sanitize_and_save_parquet(df: pd.DataFrame, path: str) -> None:
    """Sanitize dtypes and save to Parquet, with fallbacks.
    - Forces primitive/nullable dtypes that are parquet-friendly
    - Tries pyarrow → fastparquet → CSV fallback
    """
    d = df.copy()
    # Standardize dtypes
    if 'doc_id' in d:
        d['doc_id'] = d['doc_id'].astype('string')
    if 'file' in d:
        d['file'] = d['file'].astype('string')
    if 'section_hint' in d:
        d['section_hint'] = d['section_hint'].astype('string')
    if 'page' in d:
        d['page'] = pd.to_numeric(d['page'], errors='coerce').fillna(0).astype('int32')
    if 'year' in d:
        # nullable small int for compactness
        d['year'] = pd.to_numeric(d['year'], errors='coerce').astype('Int16')
    if 'quarter' in d:
        d['quarter'] = pd.to_numeric(d['quarter'], errors='coerce').astype('Int8')

    # Try engines in order
    errors = []
    for engine in ('pyarrow', 'fastparquet'):
        try:
            d.to_parquet(path, engine=engine, index=False)
            return
        except Exception as e:
            errors.append(f"{engine}: {e}")
    # Final CSV fallback
    csv_path = os.path.splitext(path)[0] + '.csv'
    d.to_csv(csv_path, index=False)
    raise RuntimeError(
        "Failed to save Parquet with both pyarrow and fastparquet. "
        f"Wrote CSV fallback at {csv_path}. Errors: {' | '.join(errors)}"
    )


# -----------------------------
# Main ingest
# -----------------------------
@dataclass
class Chunk:
    doc_id: str
    file: str
    page: int
    year: Optional[int]
    quarter: Optional[int]
    section_hint: Optional[str]
    text: str


def walk_pdfs(root: str) -> List[str]:
    # Kept for backward compatibility (returns only PDFs)
    files = []
    for p in pathlib.Path(root).rglob("*.pdf"):
        files.append(str(p))
    return sorted(files)


def walk_all_docs(root: str) -> List[str]:
    """Return PDFs + CSV + Excel paths under root."""
    paths: List[str] = []
    for p in pathlib.Path(root).rglob("*"):
        if not p.is_file():
            continue
        s = str(p)
        if _is_pdf(s) or _is_tabular(s):
            paths.append(s)
    return sorted(paths)


def build_kb() -> Dict[str, Any]:
    docs = walk_all_docs(DATA_DIR)
    print(f"[Stage1] Scanning folder: {DATA_DIR} → found {len(docs)} document(s)")
    if not docs:
        raise SystemExit(f"No PDFs, CSVs or Excels found under {DATA_DIR}. Place files there.")

    rows: List[Dict[str, Any]] = []
    texts: List[str] = []

    for path in docs:
        fname = os.path.basename(path)
        print(f"[Stage1] Processing: {fname}")
        year, quarter = infer_period_from_filename(fname)
        if _is_pdf(path):
            pages = extract_pdf_pages(path)
            print(f"          → Pages detected: {len(pages)}")
            for page_num, page_text in pages:
                if not page_text.strip():
                    continue
                section_hint = clean_section_hint(page_text[:500])
                for chunk_text in _split_text(page_text):
                    doc_id = str(uuid.uuid4())
                    rows.append({
                        "doc_id": doc_id,
                        "file": fname,
                        "page": page_num,
                        "year": year,
                        "quarter": quarter,
                        "section_hint": section_hint,
                    })
                    texts.append(chunk_text)
            # Second pass: infer period from page header text if missing, and extract key tables
            # Re-iterate pages to attach refined (year, quarter) per page and table windows
            for page_num, page_text in pages:
                if not page_text.strip():
                    continue
                # Infer per-page period (only trust explicit QY or FY)
                y2, q2 = infer_period_from_text(page_text)
                # Start from filename-derived period
                y_eff, q_eff = year, quarter
                # If we detected a quarter-year on the page, use both
                if q2 is not None:
                    q_eff = q2
                    if y2 is not None:
                        y_eff = y2
                else:
                    # No quarter found on page; only allow FY to override year
                    if y2 is not None and q_eff is None:
                        # Only replace year if we don't already have a quarter from filename
                        y_eff = y2
                # Extract small windows for key tables (NIM/Opex/CTI)
                for label, block in extract_key_tables_from_page(page_text):
                    doc_id = str(uuid.uuid4())
                    rows.append({
                        "doc_id": doc_id,
                        "file": fname,
                        "page": page_num,
                        "year": y_eff,
                        "quarter": q_eff,
                        "section_hint": label,
                    })
                    texts.append(block)
        elif _is_tabular(path):
            blocks = extract_tabular_chunks(path)
            print(f"          → Table blocks: {len(blocks)}")
            # Use page=1 for tabular sources; include sheet name in section_hint
            for block_text, sheet in blocks:
                hint_from_name = clean_section_hint(fname) or "table"
                section_hint = f"{hint_from_name} / {sheet}" if sheet else hint_from_name
                doc_id = str(uuid.uuid4())
                rows.append({
                    "doc_id": doc_id,
                    "file": fname,
                    "page": 1,
                    "year": year,
                    "quarter": quarter,
                    "section_hint": section_hint,
                })
                texts.append(block_text)
        else:
            print(f"          → Skipped (unsupported type)")
        print(f"[Stage1] Done: {fname}")

    print(f"[Stage1] Total raw chunks prepared: {len(texts)}")

    kb = pd.DataFrame(rows)
    print(f"[Stage1] Metadata rows: {len(kb)}")

    texts_np = np.array(texts, dtype=object)

    # embed
    provider = pick_provider(EMBED_BACKEND)
    print(f"[Stage1] Embedding provider selected: {getattr(provider, 'name', type(provider).__name__)} (backend={EMBED_BACKEND})")
    try:
        vecs = provider.embed_batch(list(texts_np))
    except Exception as e:
        warn_msg = str(e)
        print(f"[Stage1] ⚠️ Provider failed: {getattr(provider, 'name', type(provider).__name__)} → {warn_msg}")
        print("[Stage1] → Falling back to SentenceTransformers (all-MiniLM-L6-v2)...")
        fallback = STProvider('sentence-transformers/all-MiniLM-L6-v2')
        provider = fallback
        vecs = provider.embed_batch(list(texts_np))
    print(f"[Stage1] Embedded {vecs.shape[0]} chunks (dim={vecs.shape[1]})")

    if not _HAVE_FAISS:
        raise SystemExit("faiss is not installed. pip install faiss-cpu")

    # build index (L2 on normalized vectors works as cosine)
    index = faiss.IndexFlatIP(vecs.shape[1])
    # ensure normalized
    norms = np.linalg.norm(vecs, axis=1, keepdims=True) + 1e-12
    vecs_norm = (vecs / norms).astype(np.float32)
    index.add(vecs_norm)
    print(f"[Stage1] FAISS index size: {index.ntotal}")

    # save artifacts
    kb_path = os.path.join(OUT_DIR, "kb_chunks.parquet")
    text_path = os.path.join(OUT_DIR, "kb_texts.npy")
    index_path = os.path.join(OUT_DIR, "kb_index.faiss")
    meta_path = os.path.join(OUT_DIR, "kb_meta.json")

    # Save KB with robust parquet saver
    _sanitize_and_save_parquet(kb, kb_path)
    np.save(text_path, texts_np)
    faiss.write_index(index, index_path)
    with open(meta_path, "w") as f:
        json.dump({"embedding_provider": provider.name, "dim": int(vecs.shape[1])}, f)

    print(f"Saved KB rows: {len(kb)} → {kb_path}")
    print(f"Saved texts:    {texts_np.shape} → {text_path}")
    print(f"Saved index:    {index.ntotal} vecs → {index_path}")
    print(f"Saved meta:     {meta_path}")

    # --- Post-build coverage report ---
    try:
        qm = (~kb['quarter'].isna()).mean()
        ym = (~kb['year'].isna()).mean()
        print(f"[Stage1] Coverage → year filled: {ym:.1%}, quarter filled: {qm:.1%}")
        # spot-check mismatches between filename and stored metadata
        import re
        pat = re.compile(r"([1-4])Q(\d{2})", re.I)
        mismatches = 0
        for i,r in kb.iterrows():
            m = pat.search(str(r['file']))
            if not m:
                continue
            qf = int(m.group(1)); yf = 2000 + int(m.group(2))
            y_ok = (pd.isna(r['year'])) or (int(r['year']) == yf)
            q_ok = (pd.isna(r['quarter'])) or (int(r['quarter']) == qf)
            if not (y_ok and q_ok):
                mismatches += 1
                if mismatches <= 5:
                    print(f"  ↳ mismatch: {r['file']} p.{r['page']} stored=({r['year']},{r['quarter']}) expected=({yf},{qf})")
        if mismatches:
            print(f"[Stage1] Mismatch count (sampled): {mismatches}")
    except Exception as _:
        pass

    return {"kb": kb_path, "texts": text_path, "index": index_path, "meta": meta_path}


if __name__ == "__main__":
    build_kb()


[Stage1] Scanning folder: All → found 29 document(s)
[Stage1] Processing: 1Q24_CEO_presentation.pdf
          → Pages detected: 6
[Stage1] Done: 1Q24_CEO_presentation.pdf
[Stage1] Processing: 1Q24_CFO_presentation.pdf
          → Pages detected: 17
[Stage1] Done: 1Q24_CFO_presentation.pdf
[Stage1] Processing: 1Q24_trading_update.pdf
          → Pages detected: 6
[Stage1] Done: 1Q24_trading_update.pdf
[Stage1] Processing: 1Q25_CEO_presentation.pdf
          → Pages detected: 6
[Stage1] Done: 1Q25_CEO_presentation.pdf
[Stage1] Processing: 1Q25_CFO_presentation.pdf
          → Pages detected: 18
[Stage1] Done: 1Q25_CFO_presentation.pdf
[Stage1] Processing: 1Q25_trading_update.pdf
          → Pages detected: 7
[Stage1] Done: 1Q25_trading_update.pdf
[Stage1] Processing: 2Q24_CEO_presentation.pdf
          → Pages detected: 4
[Stage1] Done: 2Q24_CEO_presentation.pdf
[Stage1] Processing: 2Q24_CFO_presentation.pdf
          → Pages detected: 30
[Stage1] Done: 2Q24_CFO_presentation.pdf
[Stage1]

E0000 00:00:1759848352.583810 36343173 alts_credentials.cc:93] ALTS creds ignored. Not running on GCP and untrusted ALTS is not enabled.


[Stage1] ⚠️ Provider failed: gemini:embedding-001 → 429 You exceeded your current quota, please check your plan and billing details. For more information on this error, head to: https://ai.google.dev/gemini-api/docs/rate-limits.
* Quota exceeded for metric: generativelanguage.googleapis.com/embed_content_free_tier_requests, limit: 0
* Quota exceeded for metric: generativelanguage.googleapis.com/embed_content_free_tier_requests, limit: 0
* Quota exceeded for metric: generativelanguage.googleapis.com/embed_content_free_tier_requests, limit: 0
* Quota exceeded for metric: generativelanguage.googleapis.com/embed_content_free_tier_requests, limit: 0 [violations {
  quota_metric: "generativelanguage.googleapis.com/embed_content_free_tier_requests"
  quota_id: "EmbedContentRequestsPerMinutePerProjectPerModel-FreeTier"
}
violations {
  quota_metric: "generativelanguage.googleapis.com/embed_content_free_tier_requests"
  quota_id: "EmbedContentRequestsPerMinutePerUserPerProjectPerModel-FreeTier"

In [8]:
import pandas as pd, re

df = pd.read_parquet("data/kb_chunks.parquet")
print("Rows:", len(df))
print("Missing year %:", df['year'].isna().mean())
print("Missing quarter %:", df['quarter'].isna().mean())

# Compare filename-derived expectation vs stored metadata
qpat = re.compile(r"\b([1-4])Q(\d{2})\b", re.I)
def yq_from_name(fn):
    m = qpat.search(fn.upper())
    if m:
        q = int(m.group(1)); yy = int(m.group(2)); y = 2000 + yy
        return y, q
    return None, None

mismatch = []
for i, r in df.iterrows():
    y2, q2 = yq_from_name(str(r.file))
    if q2 is not None:   # only check quartered docs
        y_ok = (pd.isna(r.year) and y2 is None) or (not pd.isna(r.year) and int(r.year)==y2)
        q_ok = (pd.isna(r.quarter) and q2 is None) or (not pd.isna(r.quarter) and int(r.quarter)==q2)
        if not (y_ok and q_ok):
            mismatch.append((r.file, r.page, r.year, r.quarter, y2, q2))
            if len(mismatch) > 20: break

print("Sample mismatches (file, page, stored_year, stored_q, expected_year, expected_q):")
for x in mismatch[:20]:
    print(x)

Rows: 3231
Missing year %: 0.0
Missing quarter %: 0.8170844939647168
Sample mismatches (file, page, stored_year, stored_q, expected_year, expected_q):


## 4. Baseline Pipeline

**Baseline (starting point)**
*   Naive chunking.
*   Single-pass vector search.
*   One LLM call, no caching.

In [9]:
_Q_PAT_FN = re.compile(r"([1-4])Q(\d{2})", re.I)

def _infer_yq_from_filename(fname: str) -> tuple[Optional[int], Optional[int]]:
    if not fname:
        return (None, None)
    s = str(fname).upper()
    m = _Q_PAT_FN.search(s)
    if m:
        q = int(m.group(1)); yy = int(m.group(2)); y = 2000 + yy
        return (y, q)
    m = re.search(r"(20\d{2})", s)
    if m:
        return (int(m.group(1)), None)
    return (None, None)
"""
Stage2.py — Baseline Retrieval + Generation (RAG)

Consumes Stage1 artifacts:
  data/kb_chunks.parquet
  data/kb_texts.npy
  data/kb_index.faiss

Retrieval:
  - Hybrid (Vector + BM25 if available)
  - Period-aware filter for phrases like "last N years/quarters"
Generation:
  - One LLM call (Gemini/OpenAI placeholder); returns answer + citations
"""
from __future__ import annotations
import os, re, json, math
from typing import List, Dict, Any, Optional

import numpy as np
import pandas as pd

# Timing / logging (simple)
import time, contextlib

@contextlib.contextmanager
def timeblock(row: dict, key: str):
    t0 = time.perf_counter()
    try:
        yield
    finally:
        row[key] = round((time.perf_counter() - t0) * 1000.0, 2)

class _Instr:
    def __init__(self):
        self.rows = []
    def log(self, row):
        self.rows.append(row)
    def df(self):
        cols = ['Query','T_retrieve','T_rerank','T_reason','T_generate','T_total','Tokens','Tools']
        df = pd.DataFrame(self.rows)
        for c in cols:
            if c not in df:
                df[c] = None
        return df[cols]

instr = _Instr()


VERBOSE = bool(int(os.environ.get("AGENT_CFO_VERBOSE", "1")))  # default ON; set 0 to silence

# --- Hardcoded LLM selection (instead of environment variables) ---
LLM_BACKEND = "gemini"  # choose from "gemini" or "openai"
GEMINI_MODEL_NAME = "models/gemini-2.5-flash"
OPENAI_MODEL_NAME = "gpt-4o-mini"

# --- Retrieval toggles ---
USE_VECTOR = True   # set False to force BM25-only retrieval

# --- Lazy, notebook-friendly globals (set by init_stage2) ---
OUT_DIR = None
KB_PARQUET = None
KB_TEXTS = None
KB_INDEX = None
KB_META = None

kb: Optional[pd.DataFrame] = None
texts: Optional[np.ndarray] = None
index = None
bm25 = None
_HAVE_FAISS = False
_HAVE_BM25 = False
_INITIALIZED = False

class _EmbedLoader:
    def __init__(self):
        self.impl = None
        self.dim = None
        self.name = None
        if KB_META and os.path.exists(KB_META):
            with open(KB_META) as f:
                meta = json.load(f)
                self.name = meta.get("embedding_provider")
                self.dim = meta.get("dim")
    def embed(self, texts: List[str]) -> np.ndarray:
        if self.impl is None:
            preferred = (self.name or '').lower()
            # 1) If KB was built with Sentence-Transformers
            if 'sentence-transformers' in preferred or preferred.startswith('st'):
                from sentence_transformers import SentenceTransformer
                model = "sentence-transformers/all-MiniLM-L6-v2"
                st = SentenceTransformer(model)
                self.impl = ("st", model)
                self.dim = st.get_sentence_embedding_dimension()
                def _fn(batch):
                    vecs = st.encode(batch, batch_size=64, show_progress_bar=False, convert_to_numpy=True, normalize_embeddings=True)
                    return vecs.astype(np.float32)
                self.fn = _fn
            # 2) If KB was built with OpenAI
            elif preferred.startswith('openai'):
                from openai import OpenAI
                if not os.environ.get("OPENAI_API_KEY"):
                    raise RuntimeError("KB was built with OpenAI embeddings but OPENAI_API_KEY is not set.")
                self.client = OpenAI()
                model = "text-embedding-3-small"
                self.impl = ("openai", model)
                self.dim = 1536
                def _fn(batch):
                    resp = self.client.embeddings.create(model=model, input=batch)
                    vecs = [d.embedding for d in resp.data]
                    return np.asarray(vecs, dtype=np.float32)
                self.fn = _fn
            # 3) If KB was built with Gemini
            elif preferred.startswith('gemini'):
                try:
                    from google import generativeai as genai
                except Exception as e:
                    raise RuntimeError("KB was built with Gemini embeddings but google-generativeai is not installed. `pip install google-generativeai`.") from e
                if not os.environ.get("GEMINI_API_KEY"):
                    raise RuntimeError("KB was built with Gemini embeddings but GEMINI_API_KEY is not set.")
                genai.configure(api_key=os.environ.get("GEMINI_API_KEY"))
                self.impl = ("gemini", "models/embedding-001")
                self.dim = 768 if (self.dim is None) else self.dim
                def _fn(batch):
                    vecs = []
                    for t in batch:
                        resp = genai.embed_content(model='models/embedding-001', content=t)
                        emb = resp.get('embedding') if isinstance(resp, dict) else getattr(resp, 'embedding', None)
                        if emb is None:
                            raise RuntimeError('Gemini embed_content returned no embedding')
                        vecs.append(emb)
                    return np.asarray(vecs, dtype=np.float32)
                self.fn = _fn
            # 4) Fallback auto-detect (prefer ST so it works offline)
            else:
                if os.environ.get("OPENAI_API_KEY"):
                    from openai import OpenAI
                    self.client = OpenAI()
                    model = "text-embedding-3-small"
                    self.impl = ("openai", model)
                    self.dim = 1536
                    def _fn(batch):
                        resp = self.client.embeddings.create(model=model, input=batch)
                        vecs = [d.embedding for d in resp.data]
                        return np.asarray(vecs, dtype=np.float32)
                    self.fn = _fn
                elif os.environ.get("GEMINI_API_KEY"):
                    from google import generativeai as genai
                    genai.configure(api_key=os.environ.get("GEMINI_API_KEY"))
                    self.impl = ("gemini", "models/embedding-001")
                    self.dim = 768 if (self.dim is None) else self.dim
                    def _fn(batch):
                        vecs = []
                        for t in batch:
                            resp = genai.embed_content(model='models/embedding-001', content=t)
                            emb = resp.get('embedding') if isinstance(resp, dict) else getattr(resp, 'embedding', None)
                            if emb is None:
                                raise RuntimeError('Gemini embed_content returned no embedding')
                            vecs.append(emb)
                        return np.asarray(vecs, dtype=np.float32)
                    self.fn = _fn
                else:
                    from sentence_transformers import SentenceTransformer
                    model = "sentence-transformers/all-MiniLM-L6-v2"
                    st = SentenceTransformer(model)
                    self.impl = ("st", model)
                    self.dim = st.get_sentence_embedding_dimension()
                    def _fn(batch):
                        vecs = st.encode(batch, batch_size=64, show_progress_bar=False, convert_to_numpy=True, normalize_embeddings=True)
                        return vecs.astype(np.float32)
                    self.fn = _fn
        return self.fn(texts)

EMB = None  # will be initialized inside init_stage2() after KB_META is known

def init_stage2(out_dir: str = "data") -> None:
    """Initialize Stage 2 in a Jupyter-friendly way.
    Loads KB artifacts, FAISS, and BM25. Call this once per notebook kernel.
    """
    import os
    global OUT_DIR, KB_PARQUET, KB_TEXTS, KB_INDEX, KB_META
    global kb, texts, index, bm25, _HAVE_FAISS, _HAVE_BM25, _INITIALIZED

    OUT_DIR = out_dir
    KB_PARQUET = os.path.join(OUT_DIR, "kb_chunks.parquet")
    KB_TEXTS   = os.path.join(OUT_DIR, "kb_texts.npy")
    KB_INDEX   = os.path.join(OUT_DIR, "kb_index.faiss")
    KB_META    = os.path.join(OUT_DIR, "kb_meta.json")

    if VERBOSE:
        print(f"[Stage2] init → OUT_DIR={OUT_DIR}")

    if not (os.path.exists(KB_PARQUET) and os.path.exists(KB_TEXTS) and os.path.exists(KB_INDEX)):
        raise RuntimeError(f"KB artifacts not found under '{OUT_DIR}'. Run Stage1.build_kb() first.")

    # Load KB tables
    kb = _load_kb_table(KB_PARQUET)
    texts = np.load(KB_TEXTS, allow_pickle=True)

    # (Optional but helpful) Print embedding provider from KB meta if available
    if KB_META and os.path.exists(KB_META):
        try:
            meta = json.load(open(KB_META))
            if VERBOSE:
                print(f"[Stage2] KB embedding provider={meta.get('embedding_provider')} dim={meta.get('dim')}")
        except Exception:
            pass

    if VERBOSE:
        print(f"[Stage2] KB rows={len(kb)}, texts={len(texts)}")

    # FAISS
    try:
        import faiss  # type: ignore
        _HAVE_FAISS = True
        idx = faiss.read_index(KB_INDEX)
    except Exception as e:
        _HAVE_FAISS = False
        idx = None
    globals()['index'] = idx

    if VERBOSE:
        print(f"[Stage2] FAISS loaded={bool(idx)}")

    # BM25 (optional)
    try:
        from rank_bm25 import BM25Okapi
        tokenized = [str(t).lower().split() for t in texts]
        bm25 = BM25Okapi(tokenized)
        _HAVE_BM25 = True
    except Exception:
        bm25 = None
        _HAVE_BM25 = False
    globals()['bm25'] = bm25

    if VERBOSE:
        print(f"[Stage2] BM25 enabled={_HAVE_BM25}")

    # Initialize query embedder **after** KB_META is known so it matches the store
    globals()['EMB'] = _EmbedLoader()
    if VERBOSE:
        try:
            impl = getattr(EMB, 'impl', None)
            print(f"[Stage2] Query embedder ready: {impl if impl else 'lazy-init'}")
        except Exception:
            pass

    # Mark initialized
    _INITIALIZED = True

def _ensure_init():
    if not globals().get('_INITIALIZED', False):
        raise RuntimeError("Stage2 is not initialized. Call init_stage2(out_dir='data') first in your notebook.")

# -----------------------------
# Robust KB loader (parquet → fastparquet → csv)
# -----------------------------

def _load_kb_table(parquet_path: str) -> pd.DataFrame:
    """Load the KB table with fallbacks.
    1) pandas.read_parquet (default engine)
    2) pandas.read_parquet(engine='fastparquet')
    3) CSV fallback at same basename (kb_chunks.csv)
    """
    try:
        return pd.read_parquet(parquet_path)
    except Exception as e1:
        try:
            return pd.read_parquet(parquet_path, engine='fastparquet')
        except Exception as e2:
            csv_path = os.path.splitext(parquet_path)[0] + '.csv'
            if os.path.exists(csv_path):
                df = pd.read_csv(csv_path)
                # Ensure required columns exist
                for c in ['doc_id','file','page','year','quarter','section_hint']:
                    if c not in df.columns:
                        df[c] = np.nan
                # Coerce numeric cols
                if 'page' in df: df['page'] = pd.to_numeric(df['page'], errors='coerce').fillna(0).astype(int)
                if 'year' in df: df['year'] = pd.to_numeric(df['year'], errors='coerce')
                if 'quarter' in df: df['quarter'] = pd.to_numeric(df['quarter'], errors='coerce')
                return df
            raise RuntimeError(
                "Failed to read KB Parquet with both engines and no CSV fallback. "
                f"Errors: pyarrow={e1} | fastparquet={e2}"
            )

# -----------------------------
# Helper: period filters
# -----------------------------

def _detect_last_n_years(q: str) -> Optional[int]:
    ql = q.lower()
    for pat in ["last three years", "last 3 years", "past three years", "past 3 years"]:
        if pat in ql:
            return 3
    return None

def _detect_last_n_quarters(q: str) -> Optional[int]:
    ql = q.lower()
    for pat in ["last five quarters", "last 5 quarters", "past five quarters", "past 5 quarters"]:
        if pat in ql:
            return 5
    return None


def _period_filter(hits: List[Dict[str, Any]], want_years: Optional[int], want_quarters: Optional[int]) -> List[Dict[str, Any]]:
    if not hits:
        return hits
    df = pd.DataFrame(hits)
    if want_quarters:
        df = df.sort_values(["year", "quarter"], ascending=[False, False])
        df = df[df["quarter"].notna()]
        seen = set(); keep_idx = []
        for i, r in df.iterrows():
            key = (int(r.year), int(r.quarter))
            if key in seen: continue
            keep_idx.append(i); seen.add(key)
            if len(keep_idx) >= want_quarters: break
        if VERBOSE:
            print(f"[Stage2] period filter (quarters) → kept={[(int(hits[i]['year']), int(hits[i]['quarter'])) for i in keep_idx]}")
        return [hits[i] for i in keep_idx] if keep_idx else hits
    if want_years:
        df = df.sort_values(["year"], ascending=[False])
        df = df[df["year"].notna()]
        seen = set(); keep_idx = []
        for i, r in df.iterrows():
            y = int(r.year)
            if y in seen: continue
            keep_idx.append(i); seen.add(y)
            if len(keep_idx) >= want_years: break
        if VERBOSE:
            print(f"[Stage2] period filter (years) → kept={[(int(hits[i]['year'])) for i in keep_idx]}")
        return [hits[i] for i in keep_idx] if keep_idx else hits
    return hits

# -----------------------------
# Hybrid retrieval
# -----------------------------

def hybrid_search(query: str, top_k=12, alpha=0.6) -> List[Dict[str, Any]]:
    _ensure_init()
    """Return list of hit dicts with metadata.
    alpha weights vector vs BM25: score = alpha*vec + (1-alpha)*bm25
    """
    row = {"Query": query, "Tools": ["retriever"]}
    with timeblock(row, "T_total"):
        with timeblock(row, "T_retrieve"):
            vec_scores = None
            if USE_VECTOR and _HAVE_FAISS and index is not None and EMB is not None:
                try:
                    qv = EMB.embed([query])
                    # Validate dimensionality against KB meta if available
                    try:
                        meta_dim = int(EMB.dim) if EMB.dim is not None else None
                    except Exception:
                        meta_dim = None
                    if meta_dim is not None and qv.shape[1] != meta_dim:
                        raise RuntimeError(f"Embedding dimension mismatch: query={qv.shape[1]} vs KB={meta_dim}. Rebuild Stage1 with the same provider or align Stage2 to use the same embedding backend.")
                    qv = qv / (np.linalg.norm(qv, axis=1, keepdims=True) + 1e-12)
                    sims, ids = index.search(qv.astype(np.float32), top_k)
                    vec_scores = {int(ix): float(s) for ix, s in zip(ids[0], sims[0]) if ix != -1}
                except Exception as e:
                    if VERBOSE:
                        print(f"[Stage2] Vector search disabled for this query → {type(e).__name__}: {e}")
                    vec_scores = None  # continue with BM25-only
            bm25_scores = None
            if _HAVE_BM25 and bm25 is not None:
                scores = bm25.get_scores(query.lower().split())
                top_idx = np.argsort(scores)[-top_k:][::-1]
                bm25_scores = {int(i): float(scores[i]) for i in top_idx}
        with timeblock(row, "T_rerank"):
            fused = {}
            if vec_scores:
                for i,s in vec_scores.items():
                    fused[i] = fused.get(i, 0.0) + alpha*s
            if bm25_scores:
                m = max(bm25_scores.values()) or 1.0
                for i,s in bm25_scores.items():
                    fused[i] = fused.get(i, 0.0) + (1-alpha)*(s/m)
            if not fused:
                hits = []
            else:
                top = sorted(fused.items(), key=lambda x: x[1], reverse=True)[:top_k]
                hits = []
                for i,score in top:
                    meta = kb.iloc[i]
                    y = int(meta.year) if not pd.isna(meta.year) else None
                    q = int(meta.quarter) if not pd.isna(meta.quarter) else None
                    if (y is None) or (q is None):
                        y2, q2 = _infer_yq_from_filename(meta.file)
                        if y is None:
                            y = y2
                        if q is None:
                            q = q2
                    hits.append({
                        "doc_id": meta.doc_id,
                        "file": meta.file,
                        "page": int(meta.page),
                        "year": y,
                        "quarter": q,
                        "section_hint": meta.section_hint if isinstance(meta.section_hint, str) else None,
                        "preview": str(texts[i])[:800],
                        "score": float(score),
                    })
    instr.log(row)
    if VERBOSE:
        kept = [(h.get('year'), h.get('quarter'), h.get('file')) for h in hits[:5]]
        print(f"[Stage2] retrieved top={len(hits)} sample={kept}")
    return hits


def format_citation(hit: dict) -> str:
    parts = [hit.get("file","?")]
    if hit.get("year"):
        if hit.get("quarter"):
            parts.append(f"{hit['quarter']}Q{str(hit['year'])[2:]}")
        else:
            parts.append(str(hit["year"]))
    parts.append(f"p.{hit.get('page','?')}")
    sec = hit.get("section_hint")
    if sec:
        parts.append(sec)
    return " — ".join(parts)


def _context_from_hits(hits: List[Dict[str,Any]], top_ctx=3, max_chars=1200) -> str:
    _ensure_init()
    blocks = []
    for h in hits[:top_ctx]:
        text = str(texts[kb.index[kb.doc_id == h["doc_id"]][0]]) if (kb.doc_id == h["doc_id"]).any() else h.get("preview","")
        if len(text) > max_chars:
            text = text[:max_chars] + " ..."
        blocks.append(f"[{format_citation(h)}]\n{text}")
    return "\n\n".join(blocks)

# -----------------------------
# LLM call helper
# -----------------------------

def _call_llm(prompt: str) -> str:
    backend = LLM_BACKEND.lower()
    if backend == "gemini":
        try:
            from google import generativeai as genai
        except Exception as e:
            raise RuntimeError("Selected backend 'gemini' but google-generativeai is not installed. `pip install google-generativeai`.") from e
        api_key = os.environ.get("GEMINI_API_KEY")
        if not api_key:
            raise RuntimeError("Selected backend 'gemini' but GEMINI_API_KEY is not set.")
        model_name = GEMINI_MODEL_NAME
        try:
            genai.configure(api_key=api_key)
            model = genai.GenerativeModel(model_name)
            resp = model.generate_content(prompt)
            text = getattr(resp, 'text', None) if resp is not None else None
            if not text:
                text = str(resp)
            if VERBOSE:
                print(f"[Stage2] LLM=Gemini ({model_name})")
            return text
        except Exception as e:
            raise RuntimeError(f"Gemini generation failed: {e}") from e
    elif backend == "openai":
        try:
            from openai import OpenAI
        except Exception as e:
            raise RuntimeError("Selected backend 'openai' but the OpenAI SDK is not installed. `pip install openai`.") from e
        api_key = os.environ.get("OPENAI_API_KEY")
        if not api_key:
            raise RuntimeError("Selected backend 'openai' but OPENAI_API_KEY is not set.")
        try:
            client = OpenAI()
            model = OPENAI_MODEL_NAME
            resp = client.chat.completions.create(
                model=model,
                messages=[{"role":"system","content":"You are Agent CFO."},{"role":"user","content": prompt}],
                temperature=0.2,
            )
            text = resp.choices[0].message.content
            if VERBOSE:
                print(f"[Stage2] LLM=OpenAI ({model})")
            return text
        except Exception as e:
            raise RuntimeError(f"OpenAI generation failed: {e}") from e
    else:
        raise RuntimeError("Invalid LLM_BACKEND setting; choose 'gemini' or 'openai'.")

# -----------------------------
# Generation (one call)
# -----------------------------

def answer_with_llm(query: str, top_k_retrieval=12, top_ctx=3) -> Dict[str, Any]:
    _ensure_init()
    want_years = _detect_last_n_years(query)
    want_quarters = _detect_last_n_quarters(query)

    hits = hybrid_search(query, top_k=top_k_retrieval, alpha=0.6)
    hits = _period_filter(hits, want_years, want_quarters)

    context = _context_from_hits(hits, top_ctx=top_ctx)

    system_task = (
        "You are Agent CFO. Answer the user's finance/operations question using ONLY the provided context. "
        "When you state any figures, also provide citations in the format: "
        "[Report, Year/Quarter, p.X, Section/Table]. Keep the answer concise and factual."
    )
    user_prompt = (
        f"Question:\n{query}\n\n"
        f"Context passages (use for citations):\n{context}\n\n"
        "Instructions:\n"
        "1) If a value cannot be supported by the context, say so.\n"
        "2) Include citations inline like: (DBS 3Q24 CFO Presentation — p.14 — Cost/Income table).\n"
        "3) End with a short one-line takeaway."
    )
    prompt = f"{system_task}\n\n{user_prompt}"

    row = {"Query": f"[generate] {query}", "Tools": ["retriever","generator"], "Tokens": 0}

    # Placeholder for your LLM call; swap in Gemini/OpenAI
    with timeblock(row, "T_total"), timeblock(row, "T_generate"):
        text = _call_llm(prompt)
        row["Tokens"] = int(len(prompt)//4)

    instr.log(row)

    explicit_citations = "\n".join(f"- {format_citation(h)}" for h in hits[:top_ctx])
    final_answer = text.strip() + "\n\nCitations:\n" + explicit_citations

    return {"answer": final_answer, "hits": hits[:top_ctx], "raw_model_text": text}

def get_logs() -> pd.DataFrame:
    """Return the instrumentation DataFrame for display in notebooks."""
    return instr.df()

def is_initialized() -> bool:
    return bool(globals().get('_INITIALIZED', False))

# Benchmark queries as required
BENCHMARK_QUERIES = [
    "Report the Gross Margin (or Net Interest Margin, if a bank) over the last 5 quarters, with values.",
    "Show Operating Expenses for the last 3 fiscal years, year-on-year comparison.",
    "Calculate the Operating Efficiency Ratio (Opex ÷ Operating Income) for the last 3 fiscal years, showing the working.",
]


def run_benchmark(top_k_retrieval=12, top_ctx=3) -> List[Dict[str, Any]]:
    out = []
    for q in BENCHMARK_QUERIES:
        out.append({"query": q, **answer_with_llm(q, top_k_retrieval=top_k_retrieval, top_ctx=top_ctx)})
    return out


if __name__ == "__main__":
    od = os.environ.get("AGENT_CFO_OUT_DIR", "data")
    init_stage2(od)
    if VERBOSE:
        print("[Stage2] Ready. Use answer_with_llm(query) to generate.")
    if os.environ.get("RUN_DEMO", "0") == "1":
        for r in run_benchmark():
            print("\nQ:", r["query"], "\n")
            print(r["answer"])

[Stage2] init → OUT_DIR=data
[Stage2] KB embedding provider=st:sentence-transformers/all-MiniLM-L6-v2 dim=384
[Stage2] KB rows=3231, texts=3231
[Stage2] FAISS loaded=True
[Stage2] BM25 enabled=True
[Stage2] Query embedder ready: lazy-init
[Stage2] Ready. Use answer_with_llm(query) to generate.


In [None]:
import json, importlib, Stage2
Stage2 = importlib.reload(Stage2)
Stage2.init_stage2("data")

print("→ KB meta:", json.load(open("data/kb_meta.json")))  # should say st:... and dim ~384
print("→ Query embedder impl (before first use):", Stage2.EMB.impl)

In [None]:
import importlib, Stage2
Stage2 = importlib.reload(Stage2)

print("LLM_BACKEND =", Stage2.LLM_BACKEND)
print("GEMINI_MODEL_NAME =", Stage2.GEMINI_MODEL_NAME)
print("OPENAI_MODEL_NAME =", Stage2.OPENAI_MODEL_NAME)

In [None]:
import importlib, Stage2
Stage2 = importlib.reload(Stage2)

# choose one generator; embeddings stay ST (per kb_meta)
# If using Gemini:
# os.environ["GEMINI_API_KEY"] = "..."   # set this in your process
# Stage2.LLM_BACKEND = "gemini"

# Or OpenAI:
# os.environ["OPENAI_API_KEY"] = "..."
# Stage2.LLM_BACKEND = "openai"

Stage2.init_stage2("data")
out = Stage2.answer_with_llm("Report the Net Interest Margin (NIM) over the last 5 quarters, with values.")
print(out["answer"])

### Just to check available models

In [13]:
import google.generativeai as genai
import os

# Best practice: store your key as an environment variable
# Or replace "YOUR_API_KEY" with your actual key string for a quick test
genai.configure(api_key=os.environ.get("GEMINI_API_KEY", "YOUR_API_KEY"))

print("Available Models:\n")

# List all models and check which ones support the 'generateContent' method
for model in genai.list_models():
  if 'generateContent' in model.supported_generation_methods:
    print(f"- {model.name}")

Available Models:

- models/gemini-2.5-pro-preview-03-25
- models/gemini-2.5-flash-preview-05-20
- models/gemini-2.5-flash
- models/gemini-2.5-flash-lite-preview-06-17
- models/gemini-2.5-pro-preview-05-06
- models/gemini-2.5-pro-preview-06-05
- models/gemini-2.5-pro
- models/gemini-2.0-flash-exp
- models/gemini-2.0-flash
- models/gemini-2.0-flash-001
- models/gemini-2.0-flash-exp-image-generation
- models/gemini-2.0-flash-lite-001
- models/gemini-2.0-flash-lite
- models/gemini-2.0-flash-preview-image-generation
- models/gemini-2.0-flash-lite-preview-02-05
- models/gemini-2.0-flash-lite-preview
- models/gemini-2.0-pro-exp
- models/gemini-2.0-pro-exp-02-05
- models/gemini-exp-1206
- models/gemini-2.0-flash-thinking-exp-01-21
- models/gemini-2.0-flash-thinking-exp
- models/gemini-2.0-flash-thinking-exp-1219
- models/gemini-2.5-flash-preview-tts
- models/gemini-2.5-pro-preview-tts
- models/learnlm-2.0-flash-experimental
- models/gemma-3-1b-it
- models/gemma-3-4b-it
- models/gemma-3-12b-it

E0000 00:00:1759844543.896133 36142634 alts_credentials.cc:93] ALTS creds ignored. Not running on GCP and untrusted ALTS is not enabled.


## 5. Benchmark Runner

Run these 3 standardized queries. Produce JSON then prose answers with citations. These are the standardized queries.

*   Net Interest Margin (NIM) trend over last 5 quarters, values and 1–2 lines of explanation.
    *   Expected: quarterly financial highlights.
*   Operating Expenses (Opex) YoY for last 3 years; top 3 drivers from MD&A.
    *   Expected: Opex table + MD&A commentary.
*   Cost-to-Income Ratio (CTI) for last 3 years; show working + implications.
    *   Expected: Operating Income & Opex lines.


In [10]:
"""
Stage3.py — Benchmark Runner (Stage 3)

Runs the 3 standardized queries, times them, saves JSON, and prints prose answers with citations.

Artifacts written to OUT_DIR (default: data/):
  - bench_results.json      # structured results
  - bench_report.md         # human-readable answers with citations
"""
from __future__ import annotations
import os, json, time
from typing import List, Dict, Any

import pandas as pd

# Import Stage 2 API
from Stage2 import init_stage2, answer_with_llm

OUT_DIR = os.environ.get("AGENT_CFO_OUT_DIR", "data")

# --- Standardized queries (exact spec) ---
QUERIES: List[str] = [
    # 1) NIM trend over last 5 quarters
    "Report the Net Interest Margin (NIM) over the last 5 quarters, with values, and add 1–2 lines of explanation.",
    # 2) Opex YoY with top 3 drivers
    "Show Operating Expenses (Opex) for the last 3 fiscal years, year-on-year comparison, and summarize the top 3 Opex drivers from the MD&A.",
    # 3) CTI ratio for last 3 years with working & implications
    "Calculate the Cost-to-Income Ratio (CTI) for the last 3 fiscal years; show your working and give 1–2 lines of implications.",
]


def _format_hits(hits: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    out = []
    for h in hits:
        out.append({
            "file": h.get("file"),
            "year": h.get("year"),
            "quarter": h.get("quarter"),
            "page": h.get("page"),
            "section_hint": h.get("section_hint"),
        })
    return out


def run_benchmark(top_k_retrieval: int = 12, top_ctx: int = 3, out_dir: str = OUT_DIR, print_prose: bool = False) -> Dict[str, Any]:
    os.makedirs(out_dir, exist_ok=True)
    init_stage2(out_dir)

    rows = []
    results: List[Dict[str, Any]] = []

    for q in QUERIES:
        t0 = time.perf_counter()
        out = answer_with_llm(q, top_k_retrieval=top_k_retrieval, top_ctx=top_ctx)
        lat_ms = round((time.perf_counter() - t0) * 1000.0, 2)

        if print_prose:
            print(f"\n=== Question ===\n{q}")
            print("\n--- Answer ---\n")
            print(out["answer"].strip())
            if out.get("hits"):
                print("\n--- Citations (top ctx) ---")
                for h in _format_hits(out.get("hits", [])):
                    y = f" {h['year']}" if h.get('year') is not None else ""
                    qtr = f" {h['quarter']}Q{str(h['year'])[2:]}" if h.get('quarter') else ""
                    sec = f" — {h['section_hint']}" if h.get('section_hint') else ""
                    print(f"- {h['file']}{y}{qtr} — p.{h['page']}{sec}")
            print(f"\n(latency: {lat_ms} ms)")

        results.append({
            "query": q,
            "answer": out["answer"],
            "hits": _format_hits(out.get("hits", [])),
            "latency_ms": lat_ms,
        })
        rows.append({"Query": q, "Latency_ms": lat_ms})

    # Save JSON
    json_path = os.path.join(out_dir, "bench_results.json")
    with open(json_path, "w") as f:
        json.dump({"results": results}, f, indent=2)

    # Save simple markdown report
    md_lines = ["# Agent CFO — Benchmark Report\n"]
    for i, r in enumerate(results, start=1):
        md_lines.append(f"\n## Q{i}. {r['query']}")
        md_lines.append("\n**Answer**\n\n" + r["answer"].strip())
        if r.get("hits"):
            md_lines.append("\n**Citations (top ctx)**")
            for h in r["hits"]:
                y = f" {h['year']}" if h.get('year') is not None else ""
                qtr = f" {h['quarter']}Q{str(h['year'])[2:]}" if h.get('quarter') else ""
                sec = f" — {h['section_hint']}" if h.get('section_hint') else ""
                md_lines.append(f"- {h['file']}{y}{qtr} — p.{h['page']}{sec}")
    md_path = os.path.join(out_dir, "bench_report.md")
    with open(md_path, "w") as f:
        f.write("\n".join(md_lines) + "\n")

    df = pd.DataFrame(rows)
    if print_prose and not df.empty:
        p50 = float(df['Latency_ms'].quantile(0.5))
        p95 = float(df['Latency_ms'].quantile(0.95))
        print(f"\n=== Benchmark Summary ===\nSaved JSON: {json_path}\nSaved report: {md_path}\nLatency p50: {p50:.1f} ms, p95: {p95:.1f} ms")

    # Return a compact summary (and a DataFrame for notebook display if desired)
    return {"json_path": json_path, "md_path": md_path, "summary": df}


if __name__ == "__main__":
    run_benchmark(print_prose=True)

[Stage2] init → OUT_DIR=data
[Stage2] KB embedding provider=st:sentence-transformers/all-MiniLM-L6-v2 dim=384
[Stage2] KB rows=3231, texts=3231
[Stage2] FAISS loaded=True
[Stage2] BM25 enabled=True
[Stage2] Query embedder ready: lazy-init
[Stage2] retrieved top=12 sample=[(2025, 1, '1Q25_CFO_presentation.pdf'), (2024, 1, '3Q24_CFO_presentation.pdf'), (2024, 1, '1Q25_CFO_presentation.pdf'), (2024, 3, '3Q24_CFO_presentation.pdf'), (2020, None, 'dbs-annual-report-2020.pdf')]
[Stage2] period filter (quarters) → kept=[(2025, 1), (2024, 4), (2024, 3), (2024, 2), (2024, 1)]


E0000 00:00:1759848435.369582 36343173 alts_credentials.cc:93] ALTS creds ignored. Not running on GCP and untrusted ALTS is not enabled.


[Stage2] LLM=Gemini (models/gemini-2.5-flash)

=== Question ===
Report the Net Interest Margin (NIM) over the last 5 quarters, with values, and add 1–2 lines of explanation.

--- Answer ---

The Net Interest Margin (NIM) over the last five quarters is as follows:

*   1Q25: 2.12% [1Q25_CFO_presentation.pdf, 1Q25, p.5, Net interest margin (NIM)]
*   4Q24: 2.15% [1Q25_CFO_presentation.pdf, 1Q25, p.5, Net interest margin (NIM)]
*   3Q24: 2.11% [1Q25_CFO_presentation.pdf, 1Q25, p.5, Net interest margin (NIM)]
*   2Q24: 2.14% [1Q25_CFO_presentation.pdf, 1Q25, p.5, Net interest margin (NIM)]
*   1Q24: 2.14% [1Q25_CFO_presentation.pdf, 1Q25, p.5, Net interest margin (NIM)]

The Group NIM declined 3 basis points in 1Q25 compared to the previous quarter, primarily due to lower interest rates [1Q25_CFO_presentation.pdf, 1Q25, p.5, Net interest margin (NIM)].

Overall, NIM has remained within a narrow range over the past five quarters.

Citations:
- 1Q25_CFO_presentation.pdf — 1Q25 — p.5 — Net in

E0000 00:00:1759848447.886224 36343173 alts_credentials.cc:93] ALTS creds ignored. Not running on GCP and untrusted ALTS is not enabled.


[Stage2] LLM=Gemini (models/gemini-2.5-flash)

=== Question ===
Show Operating Expenses (Opex) for the last 3 fiscal years, year-on-year comparison, and summarize the top 3 Opex drivers from the MD&A.

--- Answer ---

Operating expenses for the last 3 fiscal years are not provided in the context.

The context states that expenses for Q1 2024 were SGD 2.08 billion [1Q24_trading_update.pdf, 1Q24, p.3, Cost-to-income (CTI)]. A year-on-year comparison cannot be performed due to the absence of full fiscal year figures.

The context mentions "non-recurring items" as a factor impacting quarterly expenses [1Q24_trading_update.pdf, 1Q24, p.3, Cost-to-income (CTI)] and "sign-on bonuses" amounting to SGD 574 thousand as payments made during the Financial Year 2020 [dbs-annual-report-2020.pdf, 2020, p.46, Other provisions]. However, the provided context does not summarize the top 3 operating expense drivers from the MD&A.

The available context is insufficient to provide Opex for the last three fi

E0000 00:00:1759848464.263146 36343173 alts_credentials.cc:93] ALTS creds ignored. Not running on GCP and untrusted ALTS is not enabled.


[Stage2] LLM=Gemini (models/gemini-2.5-flash)

=== Question ===
Calculate the Cost-to-Income Ratio (CTI) for the last 3 fiscal years; show your working and give 1–2 lines of implications.

--- Answer ---

I cannot calculate the Cost-to-Income Ratio (CTI) for the last three fiscal years as the provided context does not contain the necessary figures for Cost and Income, nor does it explicitly state the CTI for any specific year. The passage mentioning "Cost-to-income (CTI)" provides a qualitative discussion rather than numerical data [dbs-annual-report-2020.pdf, 2020, p.12, Cost-to-income (CTI)].

The required figures are not available in the provided context.

Citations:
- dbs-annual-report-2020.pdf — 2020 — p.193
- dbs-annual-report-2020.pdf — 2020 — p.194
- dbs-annual-report-2020.pdf — 2020 — p.12 — Cost-to-income (CTI)

--- Citations (top ctx) ---
- dbs-annual-report-2020.pdf 2020 — p.193
- dbs-annual-report-2020.pdf 2020 — p.194
- dbs-annual-report-2020.pdf 2020 — p.12 — Cost-to-inc

## 6. Instrumentation

Log timings: T_ingest, T_retrieve, T_rerank, T_reason, T_generate, T_total. Log tokens, cache hits, tools.

In [None]:
# Example instrumentation schema
import pandas as pd
logs = pd.DataFrame(columns=['Query','T_ingest','T_retrieve','T_rerank','T_reason','T_generate','T_total','Tokens','CacheHits','Tools'])
logs

## 7. Optimizations

**Required Optimizations**

Each team must implement at least:
*   2 retrieval optimizations (e.g., hybrid BM25+vector, smaller embeddings, dynamic k).
*   1 caching optimization (query cache or ratio cache).
*   1 agentic optimization (plan pruning, parallel sub-queries).
*   1 system optimization (async I/O, batch embedding, memory-mapped vectors).

In [None]:
# TODO: Implement optimizations


## 8. Results & Plots

Show baseline vs optimized. Include latency plots (p50/p95) and accuracy tables.

In [None]:
# TODO: Generate plots with matplotlib
