# AIRMAN — Technical Assignment Solution

This notebook is organized **exactly** in this order:

1. **Level 1 (Compulsory)** — PDF ingestion, chunking with LangChain text splitter, vector index, strict grounded answering with citations, minimal API.
2. **Level 2 (Optional — Option 1)** — Hybrid retrieval (**BM25 + Vector**) + **Cross-Encoder reranker**.
3. **Question set + Evaluation** — 50 questions + baseline (Level 1) vs hybrid (Level 2) comparison.

The system follows a strict grounding rule: whenever the retrieved text cannot support an answer, it returns exactly this sentence:

> **This information is not available in the provided document(s).**


In [None]:
# Install dependencies (run once in a fresh environment)
%pip -q install "langchain>=0.2.0" "langchain-community>=0.2.0" "langchain-text-splitters>=0.2.0" \
    "pypdf>=4.0.0" "faiss-cpu>=1.7.4" "sentence-transformers>=2.2.2" \
    "rank-bm25>=0.2.2" "fastapi>=0.110.0" "uvicorn>=0.27.0" "pydantic>=2.6.0"

In [None]:
import os
import re
import json
from dataclasses import dataclass
from pathlib import Path
from typing import List, Dict, Any, Tuple

import numpy as np
import faiss

from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

from sentence_transformers import SentenceTransformer, CrossEncoder
from rank_bm25 import BM25Okapi

from fastapi import FastAPI
from pydantic import BaseModel

In [None]:
DATA_DIR = Path("./data")
INDEX_DIR = Path("./index_store")

EMBED_MODEL_NAME = "sentence-transformers/all-MiniLM-L6-v2"
RERANK_MODEL_NAME = "cross-encoder/ms-marco-MiniLM-L-6-v2"

CHUNK_SIZE = 1000
CHUNK_OVERLAP = 150

TOP_K_VECTOR = 12
TOP_K_BM25 = 24
TOP_K_FINAL = 6

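# Minimum top rerank score required to answer. The cross-encoder score is assumed
# to be a raw relevance logit (not a probability); tune this on your own questions.
RERANK_MIN_SCORE = 4.0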

REFUSAL_TEXT = "This information is not available in the provided document(s)."

# Level 1 (Compulsory)

## Level 1 Overview (Compulsory)

What we implement and why:

- **PDF loading**: we read the provided aviation PDFs as the knowledge source.
- **Chunking (LangChain RecursiveCharacterTextSplitter)**: splits long pages into overlapping chunks so retrieval can match small sections precisely.
- **Embeddings + Vector index (FAISS)**: creates a fast semantic search index over chunks.
- **Grounded answering + citations**: returns an answer derived only from retrieved chunks and shows where it came from (source PDF + page).
- **Strict refusal**: if no retrieved sentence matches the query's keywords, we refuse with the exact required sentence. Level 1 applies no score threshold; score-based gating is added in Level 2.
- **Minimal API (FastAPI)**: basic endpoints for health, ingest, and ask.

Design choice: we keep generation lightweight and grounded. The notebook focuses on retrieval + safe answer composition rather than free-form LLM generation.


### Level 1 — Load PDFs

We load PDFs using `PyPDFLoader` so we get text per page with metadata (source, page).
This metadata is later used for citations.


In [None]:
def list_pdf_files(data_dir: Path) -> List[Path]:
    return sorted([p for p in data_dir.glob("*.pdf") if p.is_file()])


def load_pdfs(pdfs: List[Path]) -> List[Dict[str, Any]]:
    docs = []
    for pdf in pdfs:
        loader = PyPDFLoader(str(pdf))
        pages = loader.load()
        for p in pages:
            md = dict(p.metadata or {})
            md["source"] = pdf.name
            docs.append({"text": p.page_content, "metadata": md})
    return docs

### Level 1 — Chunking (LangChain text splitter)

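We use **LangChain `RecursiveCharacterTextSplitter`** because a single PDF page can be far longer than the passage that answers a question.
Chunks of at most `CHUNK_SIZE` (1000) characters keep retrieval targets small, and a `CHUNK_OVERLAP` of 150 characters reduces the chance that a sentence is cut off at a chunk boundary. Splitting runs per page, so a chunk never spans two pages.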
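
To make the role of overlap concrete, here is a minimal sketch of fixed-size windowing in plain Python. It is a simplification for illustration only: `sliding_chunks` is a hypothetical helper, and it does not reproduce how `RecursiveCharacterTextSplitter` chooses split points from its separator list.

```python
from typing import List


def sliding_chunks(text: str, size: int, overlap: int) -> List[str]:
    """Cut text into windows of `size` characters; neighbours share `overlap` characters."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


text = "abcdefghijklmnopqrstuvwxyz"
chunks = sliding_chunks(text, size=10, overlap=3)
print(chunks)
# ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

Each chunk starts with the last three characters of the previous one, so text near a boundary appears in both neighbours.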


In [None]:
def build_text_splitter() -> RecursiveCharacterTextSplitter:
    return RecursiveCharacterTextSplitter(
        chunk_size=CHUNK_SIZE,
        chunk_overlap=CHUNK_OVERLAP,
        separators=["\n\n", "\n", ". ", " ", ""],
    )


def normalize_whitespace(s: str) -> str:
    return re.sub(r"\s+", " ", (s or "")).strip()


def chunk_documents(docs: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    splitter = build_text_splitter()
    chunks = []
    for d in docs:
        text = (d["text"] or "").strip()
        if not text:
            continue
        parts = splitter.split_text(text)
        for i, part in enumerate(parts):
            md = dict(d["metadata"])
            md["chunk_in_page"] = i
            chunks.append({"text": normalize_whitespace(part), "metadata": md})
    return chunks

### Level 1 — Vector index (FAISS)

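We embed each chunk and store the vectors in a FAISS `IndexFlatIP` index. Because the embeddings are L2-normalized, the inner product equals cosine similarity, and the flat index compares the query against every vector, so the search is exact.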
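
The index in the cell below is built with `normalize_embeddings=True` and an inner-product index. The following sketch uses small hand-written vectors (not real embeddings) to check that the inner product of L2-normalized vectors matches cosine similarity. All helper names here are illustrative.

```python
import math
from typing import List


def dot(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(v: List[float]) -> List[float]:
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def cosine(a: List[float], b: List[float]) -> float:
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


query = [1.0, 2.0, 2.0]
docs = [[2.0, 4.0, 4.0], [2.0, -1.0, 0.0], [0.0, 3.0, 4.0]]

q_unit = l2_normalize(query)
ip_scores = [dot(q_unit, l2_normalize(d)) for d in docs]   # inner product on unit vectors
cos_scores = [cosine(query, d) for d in docs]              # cosine on the raw vectors

# Rank documents by inner product, highest first.
ranking = sorted(range(len(docs)), key=lambda i: ip_scores[i], reverse=True)
print(ranking)
```

The first document is parallel to the query and the second is orthogonal to it, so they land at the top and bottom of the ranking.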


In [None]:
@dataclass
class VectorIndex:
    embed_model_name: str
    dim: int
    faiss_index: Any
    texts: List[str]
    metadatas: List[Dict[str, Any]]

    @classmethod
    def build(cls, chunks: List[Dict[str, Any]], embed_model_name: str) -> "VectorIndex":
        model = SentenceTransformer(embed_model_name)
        texts = [c["text"] for c in chunks]
        metas = [c["metadata"] for c in chunks]

        emb = model.encode(texts, batch_size=64, show_progress_bar=True, normalize_embeddings=True)
        emb = np.asarray(emb, dtype="float32")

        dim = emb.shape[1]
        index = faiss.IndexFlatIP(dim)
        index.add(emb)

        return cls(embed_model_name=embed_model_name, dim=dim, faiss_index=index, texts=texts, metadatas=metas)

    def save(self, index_dir: Path) -> None:
        index_dir.mkdir(parents=True, exist_ok=True)
        faiss.write_index(self.faiss_index, str(index_dir / "faiss.index"))
        payload = {
            "embed_model_name": self.embed_model_name,
            "dim": self.dim,
            "texts": self.texts,
            "metadatas": self.metadatas,
        }
        (index_dir / "store.json").write_text(
            json.dumps(payload, ensure_ascii=False),
            encoding="utf-8"
        )

    @classmethod
    def load(cls, index_dir: Path) -> "VectorIndex":
        payload = json.loads((index_dir / "store.json").read_text(encoding="utf-8"))
        index = faiss.read_index(str(index_dir / "faiss.index"))
        return cls(
            embed_model_name=payload["embed_model_name"],
            dim=payload["dim"],
            faiss_index=index,
            texts=payload["texts"],
            metadatas=payload["metadatas"],
        )

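    def search(self, query: str, top_k: int) -> List[Dict[str, Any]]:
        # Load the embedding model once and reuse it; reloading it on every query is slow.
        model = getattr(self, "_model", None)
        if model is None:
            model = SentenceTransformer(self.embed_model_name)
            self._model = model
        q = model.encode([query], normalize_embeddings=True)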
        q = np.asarray(q, dtype="float32")

        scores, ids = self.faiss_index.search(q, top_k)
        results = []
        for score, idx in zip(scores[0].tolist(), ids[0].tolist()):
            if idx < 0:
                continue
            results.append(
                {
                    "id": int(idx),
                    "score": float(score),
                    "text": self.texts[idx],
                    "metadata": self.metadatas[idx],
                }
            )
        return results

### Level 1 — Grounded answering + citations + refusal

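We answer only from the retrieved chunks and cite their sources.
The answer is extractive: we return up to three retrieved sentences, chosen by how many query keywords they contain.
If no sentence matches any keyword, we return the exact refusal text.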
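
The sketch below shows the answer-or-refuse behaviour on a toy context. It is a simplified stand-in for `answer_from_context` in the next cell (it omits the fallback to short tokens), and the context string is invented for illustration rather than taken from the PDFs.

```python
import re

REFUSAL = "This information is not available in the provided document(s)."


def extractive_answer(question: str, context: str, max_sentences: int = 3) -> str:
    """Return the sentences with the most keyword hits, or the refusal text if none match."""
    keywords = [w for w in re.findall(r"[a-z0-9']+", question.lower()) if len(w) >= 4]
    sentences = re.split(r"(?<=[.?!])\s+", context.strip())
    scored = []
    for sentence in sentences:
        hits = sum(1 for k in keywords if k in sentence.lower())
        if hits > 0:
            scored.append((hits, sentence))
    if not scored:
        return REFUSAL
    scored.sort(key=lambda pair: pair[0], reverse=True)  # most keyword hits first
    return " ".join(sentence for _, sentence in scored[:max_sentences])


# Toy context, invented for this example.
context = (
    "The tropopause marks the boundary between the troposphere and the stratosphere. "
    "Cabin crew serve meals."
)
supported = extractive_answer("Where is the tropopause boundary?", context)
unsupported = extractive_answer("Runway lighting colours?", context)
print(supported)
print(unsupported)
```

The second question shares no keyword with the context, so the function returns the refusal sentence unchanged.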


In [None]:
def format_citation(md: Dict[str, Any]) -> str:
    doc = md.get("source", "unknown")
    page = md.get("page", None)
    if page is None:
        return f"{doc} | chunk"
    return f"{doc} | page {int(page) + 1}"


def simple_tokenize(text: str) -> List[str]:
    text = normalize_whitespace(text).lower()
    return re.findall(r"[a-z0-9']+", text)


def answer_from_context(question: str, chunks: List[Dict[str, Any]]) -> str:
    if not chunks:
        return REFUSAL_TEXT

    joined = normalize_whitespace(" ".join([c["text"] for c in chunks]))
    q = normalize_whitespace(question).lower()
    keywords = [w for w in simple_tokenize(q) if len(w) >= 4] or simple_tokenize(q)

    sentences = re.split(r"(?<=[\.\?\!])\s+", joined)
    scored = []
    for s in sentences:
        s_l = s.lower()
        hits = sum(1 for k in keywords if k in s_l)
        if hits > 0:
            scored.append((hits, s.strip()))
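    # Sort by hit count only; the sort is stable, so ties keep retrieval order.
    scored.sort(key=lambda pair: pair[0], reverse=True)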

    if not scored:
        return REFUSAL_TEXT

    best = [s for _, s in scored[:3]]
    ans = " ".join(best).strip()
    return ans if ans else REFUSAL_TEXT


def ask_level1(question: str, vindex: VectorIndex, top_k: int = 6, debug: bool = False) -> Dict[str, Any]:
    hits = vindex.search(question, top_k)
    answer_text = answer_from_context(question, hits)

    if answer_text == REFUSAL_TEXT:
        return {"answer": REFUSAL_TEXT, "citations": [], "chunks": hits if debug else []}

    citations = [format_citation(h["metadata"]) for h in hits]
    out = {"answer": answer_text, "citations": citations}

    if debug:
        out["chunks"] = [
            {
                "citation": format_citation(h["metadata"]),
                "score": h["score"],
                "text_snippet": h["text"][:300],
            }
            for h in hits
        ]
    return out

### Level 1 — Ingestion and persistence

We ingest PDFs, chunk them, embed chunks, and persist FAISS + metadata for reuse.


In [None]:
def ingest_level1(data_dir: Path, index_dir: Path) -> VectorIndex:
    pdfs = list_pdf_files(data_dir)
    if not pdfs:
        raise FileNotFoundError(f"No PDFs found in {data_dir.resolve()}")

    raw_docs = load_pdfs(pdfs)
    chunks = chunk_documents(raw_docs)

    vindex = VectorIndex.build(chunks, EMBED_MODEL_NAME)
    vindex.save(index_dir)

    meta = {
        "pdfs": [p.name for p in pdfs],
        "chunk_size": CHUNK_SIZE,
        "chunk_overlap": CHUNK_OVERLAP,
        "embed_model": EMBED_MODEL_NAME,
        "num_chunks": len(vindex.texts),
    }
    (index_dir / "meta_level1.json").write_text(
        json.dumps(meta, ensure_ascii=False, indent=2),
        encoding="utf-8"
    )
    return vindex


def load_level1(index_dir: Path) -> VectorIndex:
    return VectorIndex.load(index_dir)

### Level 1 — Minimal API

FastAPI endpoints to ingest and ask questions.
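
As an illustration of how a client might call `/ask`, the sketch below builds the POST request with the standard library. It assumes the server has been started with the commented-out `uvicorn.run` call at the end of the next cell and is reachable at `http://localhost:8000`. `build_ask_request` is a hypothetical helper, and the payload keys mirror the `AskRequest` fields.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # assumption: local server on the port from the Run comment


def build_ask_request(question: str, mode: str = "level1", debug: bool = False) -> urllib.request.Request:
    """Build (but do not send) a JSON POST request for the /ask endpoint."""
    payload = {"question": question, "mode": mode, "debug": debug}
    return urllib.request.Request(
        url=f"{BASE_URL}/ask",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_ask_request("In the ISA, what is the sea level standard temperature?", debug=True)
print(req.get_method(), req.full_url)

# To send it against a running server:
# with urllib.request.urlopen(req) as resp:
#     print(json.loads(resp.read().decode("utf-8")))
```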


In [None]:
app = FastAPI(title="AIRMAN RAG (Level 1 + Level 2)")

class IngestRequest(BaseModel):
    data_dir: str = "./data"
    index_dir: str = "./index_store"

class AskRequest(BaseModel):
    question: str
    debug: bool = False
    mode: str = "level1"


@app.get("/health")
def health():
    return {"status": "ok"}


@app.post("/ingest")
def ingest(req: IngestRequest):
    vindex = ingest_level1(Path(req.data_dir), Path(req.index_dir))
    return {"status": "ingested", "num_chunks": len(vindex.texts)}


@app.post("/ask")
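def ask_api(req: AskRequest):
    # Indexes are reloaded on every request to keep the example stateless.
    if req.mode == "level1":
        vindex = load_level1(INDEX_DIR)
        return ask_level1(req.question, vindex, top_k=TOP_K_FINAL, debug=req.debug)

    if req.mode == "level2":
        vindex, bm25 = load_level2(INDEX_DIR)
        return ask_level2(req.question, vindex, bm25, debug=req.debug)

    # Reject unknown modes instead of silently falling through to Level 2.
    from fastapi import HTTPException
    raise HTTPException(status_code=400, detail="mode must be 'level1' or 'level2'")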


# Run:
# import uvicorn
# uvicorn.run(app, host="0.0.0.0", port=8000)

# Level 2 (Optional) — Option 1

## Level 2 (Optional) — Option 1 Overview

Option 1 adds accuracy improvements on top of Level 1:

- **BM25 keyword retrieval**: improves recall for exact terms, acronyms, numbers (e.g., QNH, FL, hPa).
- **Hybrid candidate pool (BM25 + Vector)**: merges both retriever results to reduce misses.
- **Cross-Encoder reranker**: re-scores query-chunk pairs with a stronger model so the final top chunks are more precise.
- **Confidence gating**: if the top reranker score is below `RERANK_MIN_SCORE`, we refuse rather than risk an unsupported answer.

This is a common production-grade pattern: *retrieve broadly, rerank carefully, answer strictly from context*.


### Level 2 — BM25 retriever

BM25 is lexical retrieval. It is strong for exact strings, acronyms, and numbers.
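
For intuition about why exact terms score well, here is a compact BM25 scorer in plain Python. It is a sketch under stated assumptions: it uses the always-positive `log(1 + ...)` IDF variant and default `k1` and `b` values chosen for illustration, so its numbers will not match `rank_bm25`'s `BM25Okapi` exactly. The three documents are invented.

```python
import math
import re
from collections import Counter
from typing import List


def tokenize(text: str) -> List[str]:
    return re.findall(r"[a-z0-9']+", text.lower())


def bm25_scores(query: str, docs: List[str], k1: float = 1.5, b: float = 0.75) -> List[float]:
    """Score every document against the query with a basic BM25 formula."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avgdl = sum(len(toks) for toks in tokenized) / n_docs
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for toks in tokenized for term in set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        length_norm = k1 * (1 - b + b * len(toks) / avgdl)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores


# Toy documents, invented for this example.
docs = [
    "Set QNH 1013 hPa before departure.",
    "The aircraft climbed to cruise level.",
    "QNH is the altimeter setting; check QNH again on descent.",
]
scores = bm25_scores("QNH setting", docs)
best = max(range(len(docs)), key=lambda i: scores[i])
print(best, [round(s, 3) for s in scores])
```

A document with no query term scores exactly zero, which is why a purely lexical retriever needs the vector retriever alongside it for paraphrased questions.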


In [None]:
@dataclass
class BM25Index:
    bm25: Any
    texts: List[str]
    metadatas: List[Dict[str, Any]]

    @classmethod
    def build(cls, texts: List[str], metadatas: List[Dict[str, Any]]) -> "BM25Index":
        tokenized = [simple_tokenize(t) for t in texts]
        bm25 = BM25Okapi(tokenized)
        return cls(bm25=bm25, texts=texts, metadatas=metadatas)

    def search(self, query: str, top_k: int) -> List[Dict[str, Any]]:
        q = simple_tokenize(query)
        scores = self.bm25.get_scores(q)
        top_idx = np.argsort(scores)[::-1][:top_k]
        results = []
        for idx in top_idx.tolist():
            results.append(
                {
                    "id": int(idx),
                    "score": float(scores[idx]),
                    "text": self.texts[idx],
                    "metadata": self.metadatas[idx],
                }
            )
        return results

### Level 2 — Hybrid merge + reranking

We merge candidates from vector + BM25 and rerank with a cross-encoder for precision.
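
The sketch below walks through the merge-then-rerank flow on toy hits. `merge_by_id` is a condensed variant of `merge_candidates`, and `toy_overlap_score` is a hypothetical word-overlap scorer that stands in for the cross-encoder so the example runs without downloading a model. The chunk texts and scores are invented.

```python
from typing import Any, Callable, Dict, List


def merge_by_id(vec_hits: List[Dict[str, Any]], bm25_hits: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Union the two hit lists by chunk id and remember which retriever found each chunk."""
    merged: Dict[int, Dict[str, Any]] = {}
    for name, hits in (("vector", vec_hits), ("bm25", bm25_hits)):
        for h in hits:
            entry = merged.setdefault(h["id"], {"id": h["id"], "text": h["text"], "retrievers": []})
            entry["retrievers"].append(name)
            entry[f"{name}_score"] = h["score"]
    return list(merged.values())


def toy_overlap_score(query: str, text: str) -> float:
    """Stand-in for a cross-encoder: fraction of query words present in the text."""
    q_words, t_words = set(query.lower().split()), set(text.lower().split())
    return len(q_words & t_words) / max(1, len(q_words))


def rerank_with(score_fn: Callable[[str, str], float], query: str,
                candidates: List[Dict[str, Any]], top_k: int) -> List[Dict[str, Any]]:
    scored = [dict(c, rerank_score=score_fn(query, c["text"])) for c in candidates]
    scored.sort(key=lambda c: c["rerank_score"], reverse=True)
    return scored[:top_k]


vec_hits = [
    {"id": 3, "score": 0.71, "text": "standard pressure at sea level"},
    {"id": 7, "score": 0.64, "text": "QNH altimeter setting procedure"},
]
bm25_hits = [
    {"id": 7, "score": 5.2, "text": "QNH altimeter setting procedure"},
    {"id": 9, "score": 4.8, "text": "QNH is set before approach"},
]

pool = merge_by_id(vec_hits, bm25_hits)
top = rerank_with(toy_overlap_score, "QNH altimeter setting", pool, top_k=2)
print([c["id"] for c in pool], [c["id"] for c in top])
```

Chunk 7 is found by both retrievers but appears once in the pool, and the reranker decides the final order regardless of the original vector or BM25 scores, which are on different scales.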


In [None]:
def merge_candidates(vec_hits: List[Dict[str, Any]], bm25_hits: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    merged = {}
    for h in vec_hits:
        idx = h["id"]
        merged[idx] = {
            "id": idx,
            "text": h["text"],
            "metadata": h["metadata"],
            "vector_score": float(h["score"]),
            "bm25_score": None,
            "retrievers": ["vector"],
        }

    for h in bm25_hits:
        idx = h["id"]
        if idx not in merged:
            merged[idx] = {
                "id": idx,
                "text": h["text"],
                "metadata": h["metadata"],
                "vector_score": None,
                "bm25_score": float(h["score"]),
                "retrievers": ["bm25"],
            }
        else:
            merged[idx]["bm25_score"] = float(h["score"])
            merged[idx]["retrievers"] = sorted(list(set(merged[idx]["retrievers"] + ["bm25"])))

    return list(merged.values())


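_CROSS_ENCODERS: Dict[str, Any] = {}  # loaded rerankers, keyed by model name


def rerank(query: str, candidates: List[Dict[str, Any]], model_name: str, top_k: int) -> List[Dict[str, Any]]:
    if not candidates:
        return []
    # Load the cross-encoder once per model name; reloading it on every query is slow.
    if model_name not in _CROSS_ENCODERS:
        _CROSS_ENCODERS[model_name] = CrossEncoder(model_name)
    ce = _CROSS_ENCODERS[model_name]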
    pairs = [(query, c["text"]) for c in candidates]
    scores = ce.predict(pairs)

    for c, s in zip(candidates, scores.tolist()):
        c["rerank_score"] = float(s)

    candidates.sort(key=lambda x: x["rerank_score"], reverse=True)
    return candidates[:top_k]

### Level 2 — Ingest/load + ask

Level 2 reuses the Level 1 vector store. The BM25 index is not persisted; it is rebuilt in memory from the stored chunk texts on every load, and reranking happens at query time.
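
The refusal decision in `ask_level2` depends only on the best rerank score. A minimal sketch of that gate follows. `passes_confidence_gate` is a hypothetical helper, the scores are invented, and the candidate list is assumed to be sorted best-first as `rerank` returns it.

```python
from typing import Any, Dict, List


def passes_confidence_gate(reranked: List[Dict[str, Any]], min_score: float) -> bool:
    """True only when a top candidate exists and its rerank score reaches min_score."""
    return bool(reranked) and reranked[0]["rerank_score"] >= min_score


MIN_SCORE = 4.0  # same value as RERANK_MIN_SCORE in the configuration cell

strong = [{"rerank_score": 6.1}, {"rerank_score": 2.4}]
weak = [{"rerank_score": 1.7}]
decisions = [passes_confidence_gate(c, MIN_SCORE) for c in (strong, weak, [])]
print(decisions)  # [True, False, False]
```

Only the first case would go on to answer composition; the other two would return the refusal text.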


In [None]:
def ingest_level2(data_dir: Path, index_dir: Path) -> Tuple[VectorIndex, BM25Index]:
    vindex = ingest_level1(data_dir, index_dir)
    bm25 = BM25Index.build(vindex.texts, vindex.metadatas)

    meta = json.loads((index_dir / "meta_level1.json").read_text(encoding="utf-8"))
    meta["bm25_enabled"] = True
    meta["rerank_model"] = RERANK_MODEL_NAME
    (index_dir / "meta_level2.json").write_text(
        json.dumps(meta, ensure_ascii=False, indent=2),
        encoding="utf-8"
    )

    return vindex, bm25


def load_level2(index_dir: Path) -> Tuple[VectorIndex, BM25Index]:
    vindex = VectorIndex.load(index_dir)
    bm25 = BM25Index.build(vindex.texts, vindex.metadatas)
    return vindex, bm25


def ask_level2(question: str, vindex: VectorIndex, bm25: BM25Index, debug: bool = False) -> Dict[str, Any]:
    vec_hits = vindex.search(question, TOP_K_VECTOR)
    bm_hits = bm25.search(question, TOP_K_BM25)

    candidates = merge_candidates(vec_hits, bm_hits)
    reranked = rerank(question, candidates, RERANK_MODEL_NAME, TOP_K_FINAL)

    if not reranked:
        return {"answer": REFUSAL_TEXT, "citations": [], "chunks": reranked if debug else []}

    best_score = reranked[0].get("rerank_score", -1.0)
    if best_score < RERANK_MIN_SCORE:
        return {"answer": REFUSAL_TEXT, "citations": [], "chunks": reranked if debug else []}

    answer_text = answer_from_context(question, reranked)
    if answer_text == REFUSAL_TEXT:
        return {"answer": REFUSAL_TEXT, "citations": [], "chunks": reranked if debug else []}

    citations = [format_citation(c["metadata"]) for c in reranked]
    out = {"answer": answer_text, "citations": citations}

    if debug:
        out["chunks"] = [
            {
                "citation": format_citation(c["metadata"]),
                "rerank_score": c.get("rerank_score"),
                "retrievers": c.get("retrievers"),
                "vector_score": c.get("vector_score"),
                "bm25_score": c.get("bm25_score"),
                "text_snippet": c["text"][:300],
            }
            for c in reranked
        ]
    return out

# Question set + Evaluation (Level 1 + Level 2 comparison)

## Question Set + Evaluation

We include:

- **50 questions** (20 factual, 20 applied, 10 higher-order).
- **Comparison**:
  - Level 1 baseline = vector-only retrieval
  - Level 2 = hybrid retrieval with reranking
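- **Simple metrics**:
  - refusal rate and answer rate
  - retrieval hit rate (an answered question counts as a hit when its faithfulness is at least 0.34)
  - faithfulness (fraction of answer sentences found verbatim in the top retrieved text) and hallucination (1 minus faithfulness)
  - citations are recorded per question but not aggregated

These metrics demonstrate retrieval improvements and safe behavior; they do not establish correctness without human review. Two caveats apply: faithfulness is checked against the 300-character snippets of the top 3 chunks only, so it can undercount support, and refusals are scored as fully faithful, which raises the averages.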
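
To show what the faithfulness number means, the sketch below scores a toy answer against a toy context using the same idea as `faithfulness_score` in the evaluation cell: an answer sentence counts as supported only if it appears verbatim (case-insensitive) in the context. The helper names and both strings are invented for illustration.

```python
import re
from typing import List


def split_sentences(text: str) -> List[str]:
    flat = " ".join(text.split())  # collapse whitespace
    return [s.strip() for s in re.split(r"(?<=[.?!])\s+", flat) if s.strip()]


def faithfulness(answer: str, context: str, min_len: int = 12) -> float:
    """Fraction of sufficiently long answer sentences found verbatim in the context."""
    ctx = " ".join(context.split()).lower()
    considered = [s.lower() for s in split_sentences(answer) if len(s) >= min_len]
    if not considered:
        return 0.0
    return sum(1 for s in considered if s in ctx) / len(considered)


# Toy strings, invented for this example.
context = "The ISA sea level temperature is 15 degrees Celsius. Pressure falls with height."
answer = "The ISA sea level temperature is 15 degrees Celsius. Temperature always rises with height."

score = faithfulness(answer, context)
hallucination = 1.0 - score
is_hit = score >= 0.34
print(score, hallucination, is_hit)
```

One of the two answer sentences is present in the context, so the score is 0.5. Because matching is verbatim, a correct paraphrase would also count as unsupported, which is one reason the metric needs human review alongside it.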


In [None]:
QUESTIONS = [
    "What is the definition of meteorology?",
    "What is the definition of the atmosphere?",
    "What is the approximate composition of dry air by volume in the troposphere?",
    "In the ISA, what is the sea level standard temperature?",
    "In the ISA, what is the sea level standard pressure in hPa?",
    "In the ISA, what is the standard lapse rate below 11 km?",
    "What does the tropopause mark in terms of temperature change with height?",
    "What are two reasons for studying meteorology for aviation?",
    "What is the role of the Air Data Computer (ADC) in an aircraft?",
    "What are the two ADC system types mentioned and how do they differ at a high level?",
    "In FMC initialization, what is checked or input on the IDENT and POS INIT pages?",
    "What is the purpose of the CDU scratchpad in an FMC?",
    "What does the term 'Decision Point Procedure' relate to in fuel policy?",
    "How is an 'Isolated aerodrome' defined in fuel planning context?",
    "What is the difference between a QDR and a QDM in VOR terminology?",
    "What does RNAV stand for in the context of navigation systems?",
    "In the ISA deviation calculation, how is deviation computed from actual and ISA temperature?",
    "What does the document say about ozone hazards at high altitude?",
    "What does system redundancy mean in the air data system context?",
    "What does the document describe as the main purpose of flight planning?",
    "If an outside air temperature of -30°C is measured at FL200, what is the ISA temperature deviation?",
    "If the tropopause is reported at FL330, what can you infer about significant cloud tops relative to it?",
    "If the ADC on one side fails, what arrangements can allow the captain's instruments to be fed from the other side?",
    "During FMC pre-flight initialization, what sequence of pages would you expect to use after IDENT?",
    "If the navigation database is out of date, what action is described to activate the next cycle?",
    "Given a flight to an isolated aerodrome, what additional fuel requirement is described for turbine aircraft?",
    "In decision point procedure planning, what does contingency fuel between departure and decision point enable?",
    "If asked for the common emergency VHF frequency, how would you answer using only the document text?",
    "If a question is not supported by any retrieved chunk, what exact refusal must your system return?",
    "If BM25 returns many candidates but vector retrieval returns few, how can hybrid retrieval help?",
    "Why might chunk overlap reduce retrieval misses in dense technical text?",
    "How would you cite an answer when you only have chunk metadata for source and page?",
    "How would you decide to refuse answering when retrieval confidence is low?",
    "If a user asks about something outside aviation docs, how should the assistant respond?",
    "How would you debug a wrong answer: what retrieved chunks would you inspect first?",
    "If vector similarity is high but the chunk is off-topic, how does a cross-encoder reranker help?",
    "If you must show top 3 chunks in debug mode, what fields would you include in the response?",
    "If a user asks for a multi-step explanation, how can you keep it grounded in retrieved text?",
    "If you use FAISS with normalized embeddings and inner product, what similarity measure does it approximate?",
    "If pages are 0-indexed by the loader, how do you show human-friendly page numbers in citations?",
    "Compare vector-only retrieval vs hybrid retrieval: why can hybrid improve recall for rare acronyms and numbers?",
    "Describe a failure mode where BM25 helps but reranking is still necessary.",
    "When could a high rerank score still lead to a wrong answer, and how would you mitigate it?",
    "Explain how you would measure retrieval hit-rate without manual labeling, and what are its limitations.",
    "Propose a confidence thresholding approach using reranker scores and how it triggers refusal.",
    "Explain how chunk size affects citations quality and answer completeness in a textbook PDF.",
    "If two chunks disagree, how would you craft an answer that stays faithful and notes the condition?",
    "How would you extend the system to handle follow-up questions while keeping grounding strict?",
    "Explain why showing retrieved chunks in debug mode helps prevent hallucinations during development.",
    "Explain how you would generate and curate a 50-question set to cover factual, applied, and reasoning skills.",
]
len(QUESTIONS)

In [None]:
def sentence_split(text: str) -> List[str]:
    text = normalize_whitespace(text)
    if not text:
        return []
    parts = re.split(r"(?<=[\.\?\!])\s+", text)
    return [p.strip() for p in parts if p.strip()]


def faithfulness_score(answer: str, retrieved_text: str) -> float:
    ans_sents = sentence_split(answer)
    if not ans_sents:
        return 0.0
    ctx = normalize_whitespace(retrieved_text).lower()
    matched = 0
    considered = 0
    for s in ans_sents:
        s_norm = normalize_whitespace(s).lower()
        if len(s_norm) < 12:
            continue
        considered += 1
        if s_norm in ctx:
            matched += 1
    return matched / max(1, considered)


def retrieval_hit(answer: str, retrieved_text: str) -> bool:
    if not answer or answer == REFUSAL_TEXT:
        return False
    return faithfulness_score(answer, retrieved_text) >= 0.34


def get_top_chunks_text(chunks: List[Dict[str, Any]], top_n: int = 3) -> str:
    top = chunks[:top_n] if chunks else []
    return normalize_whitespace(" ".join([c.get("text_snippet", "") for c in top]))


def run_eval(vindex: VectorIndex, bm25: BM25Index, questions: List[str]) -> List[Dict[str, Any]]:
    rows = []
    for q in questions:
        out1 = ask_level1(q, vindex, top_k=TOP_K_FINAL, debug=True)
        out2 = ask_level2(q, vindex, bm25, debug=True)

        ch1 = out1.get("chunks", [])
        ch2 = out2.get("chunks", [])

        ctx1 = get_top_chunks_text(ch1, top_n=3)
        ctx2 = get_top_chunks_text(ch2, top_n=3)

        ans1 = out1.get("answer", "")
        ans2 = out2.get("answer", "")

        f1 = faithfulness_score(ans1, ctx1) if ans1 != REFUSAL_TEXT else 1.0
        f2 = faithfulness_score(ans2, ctx2) if ans2 != REFUSAL_TEXT else 1.0

        hit1 = retrieval_hit(ans1, ctx1)
        hit2 = retrieval_hit(ans2, ctx2)

        hall1 = 0.0 if ans1 == REFUSAL_TEXT else (1.0 - f1)
        hall2 = 0.0 if ans2 == REFUSAL_TEXT else (1.0 - f2)

        rows.append(
            {
                "question": q,
                "l1_answer": ans1,
                "l2_answer": ans2,
                "l1_refused": ans1 == REFUSAL_TEXT,
                "l2_refused": ans2 == REFUSAL_TEXT,
                "l1_hit": bool(hit1),
                "l2_hit": bool(hit2),
                "l1_faithfulness": float(round(f1, 3)),
                "l2_faithfulness": float(round(f2, 3)),
                "l1_hallucination": float(round(hall1, 3)),
                "l2_hallucination": float(round(hall2, 3)),
                "l1_citations": out1.get("citations", []),
                "l2_citations": out2.get("citations", []),
                "l1_top_chunks": ch1[:3],
                "l2_top_chunks": ch2[:3],
            }
        )
    return rows


def aggregate(rows: List[Dict[str, Any]]) -> Dict[str, Any]:
    n = len(rows)
    if n == 0:
        return {}

    def avg(key: str) -> float:
        return float(round(sum(r[key] for r in rows) / n, 3))

    l1_ref = sum(1 for r in rows if r["l1_refused"])
    l2_ref = sum(1 for r in rows if r["l2_refused"])

    l1_hit = sum(1 for r in rows if r["l1_hit"])
    l2_hit = sum(1 for r in rows if r["l2_hit"])

    return {
        "n_questions": n,
        "level1_refusal_rate": round(l1_ref / n, 3),
        "level2_refusal_rate": round(l2_ref / n, 3),
        "level1_answer_rate": round(1 - (l1_ref / n), 3),
        "level2_answer_rate": round(1 - (l2_ref / n), 3),
        "level1_retrieval_hit_rate": round(l1_hit / n, 3),
        "level2_retrieval_hit_rate": round(l2_hit / n, 3),
        "level1_faithfulness_avg": avg("l1_faithfulness"),
        "level2_faithfulness_avg": avg("l2_faithfulness"),
        "level1_hallucination_avg": avg("l1_hallucination"),
        "level2_hallucination_avg": avg("l2_hallucination"),
    }


def pick_best_worst(rows: List[Dict[str, Any]], level: str, k: int = 5):
    ans_key = "l1_answer" if level == "l1" else "l2_answer"
    faith_key = "l1_faithfulness" if level == "l1" else "l2_faithfulness"
    ref_key = "l1_refused" if level == "l1" else "l2_refused"
    top_key = "l1_top_chunks" if level == "l1" else "l2_top_chunks"

    answered = [r for r in rows if not r[ref_key]]
    answered_sorted = sorted(answered, key=lambda r: r[faith_key], reverse=True)

    best = answered_sorted[:k]
    worst = list(reversed(answered_sorted[-k:]))

    def explain(r):
        f = r[faith_key]
        a = r[ans_key]
        chunks = r[top_key]
        chunk_cites = [c.get("citation") for c in chunks if c.get("citation")]
        why = "Answer sentences are mostly present in retrieved text." if f >= 0.67 else "Answer is weakly supported by retrieved text; likely missing exact sentence match."
        return {
            "question": r["question"],
            "answer": a,
            "faithfulness": f,
            "top_chunk_citations": chunk_cites,
            "explanation": why,
        }

    return [explain(r) for r in best], [explain(r) for r in worst]


def write_report_full(index_dir: Path, rows: List[Dict[str, Any]], summary: Dict[str, Any]) -> Path:
    index_dir.mkdir(parents=True, exist_ok=True)

    meta_l1 = {}
    meta_l2 = {}
    if (index_dir / "meta_level1.json").exists():
        meta_l1 = json.loads((index_dir / "meta_level1.json").read_text(encoding="utf-8"))
    if (index_dir / "meta_level2.json").exists():
        meta_l2 = json.loads((index_dir / "meta_level2.json").read_text(encoding="utf-8"))

    best_l1, worst_l1 = pick_best_worst(rows, "l1", k=5)
    best_l2, worst_l2 = pick_best_worst(rows, "l2", k=5)

    lines = []
    lines.append("# AIRMAN Evaluation Report")
    lines.append("")
    lines.append("## Level 1 metadata")
    lines.append("```json")
    lines.append(json.dumps(meta_l1, indent=2, ensure_ascii=False))
    lines.append("```")
    lines.append("")
    lines.append("## Level 2 metadata")
    lines.append("```json")
    lines.append(json.dumps(meta_l2, indent=2, ensure_ascii=False))
    lines.append("```")
    lines.append("")
    lines.append("## Baseline vs Hybrid Metrics")
    lines.append("```json")
    lines.append(json.dumps(summary, indent=2, ensure_ascii=False))
    lines.append("```")
    lines.append("")
    lines.append("## Qualitative Analysis — 5 Best (Level 1)")
    lines.append("```json")
    lines.append(json.dumps(best_l1, indent=2, ensure_ascii=False))
    lines.append("```")
    lines.append("")
    lines.append("## Qualitative Analysis — 5 Worst (Level 1)")
    lines.append("```json")
    lines.append(json.dumps(worst_l1, indent=2, ensure_ascii=False))
    lines.append("```")
    lines.append("")
    lines.append("## Qualitative Analysis — 5 Best (Level 2)")
    lines.append("```json")
    lines.append(json.dumps(best_l2, indent=2, ensure_ascii=False))
    lines.append("```")
    lines.append("")
    lines.append("## Qualitative Analysis — 5 Worst (Level 2)")
    lines.append("```json")
    lines.append(json.dumps(worst_l2, indent=2, ensure_ascii=False))
    lines.append("```")
    lines.append("")
    lines.append("## Per-question comparison")
    lines.append("| # | Question | L1 refused | L1 hit | L1 faith | L1 halluc | L2 refused | L2 hit | L2 faith | L2 halluc |")
    lines.append("|---:|---|---:|---:|---:|---:|---:|---:|---:|---:|")
    for i, r in enumerate(rows, start=1):
        q = r["question"].replace("|", "\\|")
        lines.append(
            f"| {i} | {q} | {r['l1_refused']} | {r['l1_hit']} | {r['l1_faithfulness']} | {r['l1_hallucination']} | "
            f"{r['l2_refused']} | {r['l2_hit']} | {r['l2_faithfulness']} | {r['l2_hallucination']} |"
        )

    report_path = index_dir / "report.md"
    report_path.write_text("\n".join(lines), encoding="utf-8")
    return report_path


if INDEX_DIR.exists() and (INDEX_DIR / "faiss.index").exists():
    vindex = load_level1(INDEX_DIR)
else:
    vindex = ingest_level1(DATA_DIR, INDEX_DIR)

bm25 = BM25Index.build(vindex.texts, vindex.metadatas)
rows = run_eval(vindex, bm25, QUESTIONS)
summary = aggregate(rows)
report_path = write_report_full(INDEX_DIR, rows, summary)
summary, report_path