## This notebook contains the code to build the MVP of our RAG assistant. It includes:
- A search function that queries the ChromaDB collection for the chunks most relevant to a question
- A function that builds a context string from the search hits, which is used as input to the LLM
- A function that calls the LLM (gpt-4o-mini through the OpenAI API) with a prompt and returns the answer

The final part of the notebook calls the answer function with a sample question and prints the answer.
We also have a compare_papers function that takes a question and a list of paper IDs, retrieves relevant chunks
for each paper, builds a context, and prompts the LLM to compare the papers with respect to the question.

In [11]:
# !pip -q install pymupdf pandas tqdm tiktoken
# !pip -q install sentence-transformers chromadb
# !pip -q install --upgrade openai

from pathlib import Path
import pandas as pd

PROJECT_ROOT = Path.cwd().parent

In [2]:
from sentence_transformers import SentenceTransformer

# Use the BGE model from BAAI, which is a strong open-source embedding model 
# that converts text into vector embeddings. These embeddings can be used for
# tasks like semantic search, etc. We will use this model later to convert
# our text chunks into embeddings
embed_model_name = "BAAI/bge-base-en-v1.5"
model = SentenceTransformer(embed_model_name)

BertModel LOAD REPORT from: BAAI/bge-base-en-v1.5
Key                     | Status     |
------------------------+------------+
embeddings.position_ids | UNEXPECTED |

Notes:
- UNEXPECTED: can be ignored when loading from different task/architecture; not ok if you expect identical arch.


In [3]:
import chromadb
from chromadb.config import Settings

DATA_DIR = PROJECT_ROOT / "data"
CHROMA_DIR = DATA_DIR / "chroma_db"
client = chromadb.PersistentClient(path=str(CHROMA_DIR), settings=Settings(anonymized_telemetry=False))

collection = client.get_or_create_collection(
    name="hallucination_faithfulness_chunks",
    metadata={"embedding_model": embed_model_name}
)
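
The `distance` values that a query against this collection returns depend on the collection's distance space, which is not set explicitly here. As a rough reading aid, the sketch below assumes the collection uses Chroma's default squared-L2 space and that both the query and the stored embeddings are unit-normalized; neither assumption is confirmed by this notebook. Under those assumptions, squared L2 distance `d` and cosine similarity are related by `d = 2 - 2 * cos`. `l2_to_cosine` is a hypothetical helper, not part of Chroma.

```python
import math

def l2_to_cosine(squared_l2: float) -> float:
    """Convert a squared-L2 distance between two unit vectors to cosine similarity.
    Assumption: both vectors have length 1, otherwise the identity does not hold."""
    return 1.0 - squared_l2 / 2.0

def _unit(v):
    # Scale a vector to unit length.
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

# Toy vectors standing in for two embeddings.
a = _unit([1.0, 2.0, 3.0])
b = _unit([2.0, 1.0, 0.5])

squared_l2 = sum((x - y) ** 2 for x, y in zip(a, b))
cosine = sum(x * y for x, y in zip(a, b))

# Both numbers should agree up to floating-point error.
print(round(l2_to_cosine(squared_l2), 6), round(cosine, 6))
```

If the collection was created with a different space (for example cosine), the conversion does not apply and the distances should be read according to that space instead.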

In [4]:
def search(query, k=5, where=None, *, return_text_preview_chars=700):
    q_emb = model.encode([query], normalize_embeddings=True).tolist()

    res = collection.query(
        query_embeddings=q_emb,
        n_results=k,
        where=where
    )

    hits = []
    n = min(k, len(res["ids"][0]))

    for i in range(n):
        meta = res["metadatas"][0][i] or {}
        doc  = res["documents"][0][i] or ""
        dist = res["distances"][0][i]

        hits.append({
            "rank": i + 1,
            "distance": float(dist),
            "paper_id": meta.get("paper_id"),
            "year": meta.get("year"),
            "page": meta.get("page"),
            "title": meta.get("title", ""),
            "source_file": meta.get("source_file", ""),
            "text": doc,
            "text_preview": doc[:return_text_preview_chars].strip()
        })

    return hits

def build_context(hits, max_chars=6000):
    blocks = []
    total = 0
    for h in hits:
        block = f"[{h['paper_id']} p.{h['page']}] {h['title']}\n{h['text'].strip()}\n"
        if total + len(block) > max_chars:
            break
        blocks.append(block)
        total += len(block)
    return "\n\n".join(blocks)
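
`build_context` stops at the first block that would exceed `max_chars`, so one long chunk near the top can keep shorter, still relevant chunks out of the context. The sketch below is one possible alternative, not the notebook's method: it skips blocks that do not fit and also counts the separator against the budget. The name `build_context_skip_oversized` and the demo hits are made up for illustration.

```python
def build_context_skip_oversized(hits, max_chars=6000, sep="\n\n"):
    """Hypothetical variant of build_context: skip blocks that do not fit
    instead of stopping, and charge the separator to the character budget."""
    blocks = []
    total = 0
    for h in hits:
        block = f"[{h['paper_id']} p.{h['page']}] {h['title']}\n{h['text'].strip()}\n"
        # The separator is only paid for blocks after the first one.
        cost = len(block) + (len(sep) if blocks else 0)
        if total + cost > max_chars:
            continue  # try the next, possibly shorter, hit
        blocks.append(block)
        total += cost
    return sep.join(blocks)

# Illustrative hits: the second one is too long for the budget.
demo_hits = [
    {"paper_id": "A", "page": 1, "title": "First", "text": "short chunk"},
    {"paper_id": "B", "page": 2, "title": "Second", "text": "x" * 500},
    {"paper_id": "C", "page": 3, "title": "Third", "text": "another short chunk"},
]
ctx = build_context_skip_oversized(demo_hits, max_chars=120)
print(ctx)
```

With the original `break`, the context in this demo would contain only paper A; with `continue`, paper C is kept as well. The trade-off is that lower-ranked chunks can appear in the context while a higher-ranked one is left out.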

In [None]:
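from openai import OpenAI

# Read the OpenAI API key from a local file (keep key.txt out of version control).
with open("key.txt") as f:
    api_key = f.read().strip()

# Note: this rebinds `client`, which held the ChromaDB client above.
# `collection` was already created, so retrieval keeps working.
client = OpenAI(api_key=api_key)

# Smoke test: confirm the key and the model work before building the pipeline.
response = client.responses.create(
    model="gpt-4o-mini",
    input="Write a short bedtime story about a unicorn."
)

print(response.output_text)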

In [20]:
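GPT_MODEL = "gpt-4o-mini"

def call_llm(prompt: str, model: str = GPT_MODEL, temperature: float = 0.2, max_tokens: int = 700) -> str:
    """
    Send `prompt` to the OpenAI Responses API and return the stripped text output.
    Uses the OpenAI `client` created in the previous cell; defaults to gpt-4o-mini.
    """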

    response = client.responses.create(
        model=model,
        input=prompt,
        temperature=temperature,
        max_output_tokens=max_tokens
    )
    
    return response.output_text.strip()
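
`call_llm` makes a single network request, so any transient failure aborts the whole pipeline. A minimal retry sketch with exponential backoff is shown below. It is an assumption about how one might harden the call, not part of the notebook; `with_retries` and `flaky_llm` are hypothetical names, and catching every `Exception` is deliberately crude (real code would catch only the client library's transient errors).

```python
import time

def with_retries(fn, *args, attempts=3, base_delay=1.0, sleep=time.sleep, **kwargs):
    """Call fn(*args, **kwargs); on an exception wait base_delay * 2**i and retry.
    The last failure is re-raised once all attempts are used up."""
    for i in range(attempts):
        try:
            return fn(*args, **kwargs)
        except Exception:
            if i == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** i)

# Stand-in for call_llm that fails twice before succeeding.
calls = {"n": 0}

def flaky_llm(prompt):
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient error")
    return f"echo: {prompt}"

# Record the delays instead of actually sleeping, to keep the demo fast.
delays = []
result = with_retries(flaky_llm, "hello", sleep=delays.append)
print(result, delays)
```

In the notebook this would be used as `with_retries(call_llm, prompt)`.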

In [6]:
# ==============================
# Verification layer
# ==============================

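def verify_answer_support(answer_text, hits):
    """
    Report which retrieved paper_ids are mentioned in the answer.
    This is a substring check: it shows that a paper is cited,
    not that the cited claim is actually supported by the chunk.
    """
    cited = set()
    for h in hits:
        pid = h.get("paper_id")
        if pid and pid in answer_text:
            cited.add(pid)

    return {
        "cited_papers": sorted(cited),  # sorted for a stable order across runs
        "num_cited": len(cited),
        "num_hits": len(hits)
    }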


def self_check(question, answer, context):
    prompt = f"""
Given the QUESTION and CONTEXT,
does the ANSWER strictly rely only on the CONTEXT?

If any claim is unsupported, list it.

QUESTION:
{question}

CONTEXT:
{context}

ANSWER:
{answer}
"""
    return call_llm(prompt)
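
`verify_answer_support` only checks whether a `paper_id` string occurs somewhere in the answer. Since the answer prompt asks for inline citations of the form `[paper_id p.page]`, a stricter check can parse those citations and compare each (paper, page) pair with the retrieved hits. The sketch below is one way to do that; `check_citations`, the regular expression, and the demo data are illustrative assumptions, and the check still says nothing about whether the cited text supports the claim.

```python
import re

# Matches inline citations such as [2602.16154v1 p.9].
CITATION_RE = re.compile(r"\[([^\[\]\s]+)\s+p\.(\d+)\]")

def check_citations(answer_text, hits):
    """Split inline [paper_id p.page] citations into those that match a
    retrieved hit and those that do not."""
    # Pages are compared as strings because metadata may store them as ints.
    retrieved = {(str(h.get("paper_id")), str(h.get("page"))) for h in hits}
    cited = set(CITATION_RE.findall(answer_text))
    return {
        "supported": sorted(cited & retrieved),
        "unsupported": sorted(cited - retrieved),
    }

# Illustrative hits and answer; page 3 was never retrieved.
demo_hits = [
    {"paper_id": "2602.16154v1", "page": 1},
    {"paper_id": "2602.16154v1", "page": 9},
]
demo_answer = "Faithfulness is a multi-party property [2602.16154v1 p.1][2602.16154v1 p.3]."
citation_report = check_citations(demo_answer, demo_hits)
print(citation_report)
```

A non-empty `unsupported` list would flag a citation that points to a page the model never saw in its context.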

In [7]:
def make_citations(hits, top_n=5):
    # Build citation records (paper_id, page, title, source file, distance) from the top hits
    cits = []
    for h in hits[:top_n]:
        cits.append({
            "paper_id": h.get("paper_id"),
            "page": h.get("page"),
            "title": h.get("title"),
            "source_file": h.get("source_file"),
            "distance": h.get("distance"),
        })
    return cits


def answer(question, *, k=8, where=None, max_ctx_chars=8000):
    hits = search(question, k=k, where=where)
    context = build_context(hits, max_chars=max_ctx_chars)  

    prompt = f"""
You are a research assistant specialized in LLM hallucination & faithfulness.
Answer the QUESTION using ONLY the CONTEXT.
If the context is insufficient, say "I don't know from the provided papers."

You must cite sources inline like: [paper_id p.page]
At the end, output a bullet list "Sources" with unique citations.

QUESTION:
{question}

CONTEXT:
{context}
""".strip()

    llm_text = call_llm(prompt)

    verification = verify_answer_support(llm_text, hits)
    self_eval = self_check(question, llm_text, context)

    return {
        "question": question,
        "answer": llm_text,
        "hits": hits,
        "citations": make_citations(hits, top_n=5),
        "verification": verification,
        "self_evaluation": self_eval,
        "context_chars": len(context),
    }

## The same query is run twice to compare the results:

In [None]:
res = answer("How do papers define faithfulness or groundedness?", k=10)
print(res["answer"])
print(res["verification"])
print(res["self_evaluation"])

Papers define faithfulness as a multi-party property where a faithful explanation allows a listener model to reach the same conclusion as a speaker model, without access to the speaker's final answer. Faithfulness is framed in terms of reasoning executability, meaning a reasoning chain is considered faithful if it can be executed by a listener to recover the same conclusion. The degree of faithfulness is quantified by whether the listener reaches the same answer as the speaker, indicating a successful execution of the reasoning process [2602.16154v1 p.1][2602.16154v1 p.9].

Sources:
- [2602.16154v1 p.1]
- [2602.16154v1 p.9]
{'cited_papers': ['2602.16154v1'], 'num_cited': 1, 'num_hits': 10}
The answer provided does rely strictly on the context given, as it summarizes the definition of faithfulness as described in the context. The definition includes key elements such as the multi-party nature of faithfulness, the role of reasoning executability, and how faithfulness is quantified.


In [23]:
res = answer("How do papers define faithfulness or groundedness?", k=10)
print(res["answer"])
print(res["verification"])
print(res["self_evaluation"])

The papers define faithfulness as a multi-party property where a faithful explanation allows a listener model to reach the same conclusion as the speaker model without access to the speaker's final answer. Faithfulness is framed as a question of reasoning executability, meaning a reasoning chain is considered faithful if it can be executed by a listener to recover the same conclusion based on a provided reasoning prefix. This is distinct from correctness, which is objectively defined. The proposed method, Reasoning Execution by Multiple Listeners (REMUL), emphasizes training models for both faithfulness and correctness simultaneously, indicating that faithfulness is critical for monitorability, verifiability, and trust in models [2602.16154v1 p.1][2602.16154v1 p.9].

- **Sources**
  - [2602.16154v1 p.1]
  - [2602.16154v1 p.9]
{'cited_papers': ['2602.16154v1'], 'num_cited': 1, 'num_hits': 10}
The answer provided does rely strictly on the context given, as it accurately summarizes the de
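
The two runs above cite the same pages but word the answer differently. To make such a comparison repeatable rather than visual, one option is to measure text similarity and compare the citation sets. The sketch below uses only the standard library; `compare_runs` is a hypothetical helper, and the two strings are shortened stand-ins for the real answers.

```python
import re
from difflib import SequenceMatcher

def compare_runs(answer_a, answer_b):
    """Report text similarity and citation overlap between two answers."""
    cite = re.compile(r"\[[^\[\]]+ p\.\d+\]")
    cites_a = set(cite.findall(answer_a))
    cites_b = set(cite.findall(answer_b))
    return {
        # Ratio in [0, 1]; 1.0 means the two strings are identical.
        "text_similarity": SequenceMatcher(None, answer_a, answer_b).ratio(),
        "same_citations": cites_a == cites_b,
        "only_in_a": sorted(cites_a - cites_b),
        "only_in_b": sorted(cites_b - cites_a),
    }

# Shortened stand-ins for the two answers printed above.
run_1 = "Faithfulness is a multi-party property [2602.16154v1 p.1][2602.16154v1 p.9]."
run_2 = "The papers define faithfulness as a multi-party property [2602.16154v1 p.1][2602.16154v1 p.9]."
report = compare_runs(run_1, run_2)
print(report)
```

In the notebook, the inputs would be the `res["answer"]` values from the two calls.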

In [None]:
def compare_papers(question, paper_ids, *, k_per_paper=4):
    all_hits = []
    for pid in paper_ids:
        hits = search(question, k=k_per_paper, where={"paper_id": pid})
        all_hits.extend(hits)

    # sort by distance (closest first)
    all_hits = sorted(all_hits, key=lambda x: x["distance"])
    context = build_context(all_hits, max_chars=9000)

    prompt = f"""
Compare the papers with respect to the QUESTION.
Use ONLY the CONTEXT. Cite inline [paper_id p.page].
Output:
- Summary table (paper_id -> key points)
- Agreement / disagreement
- Practical takeaway

QUESTION:
{question}

CONTEXT:
{context}
""".strip()

    llm_text = call_llm(prompt)
    return {"question": question, "paper_ids": paper_ids, "answer": llm_text, "hits": all_hits}
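
In the sample outputs further down, the summary tables list only `2602.16154v1` even though two paper IDs were requested. One possible cause (not verified here) is that the global sort by distance, combined with the `max_chars` cut in `build_context`, lets the closer paper fill the context. A round-robin merge is one way to give every paper a place near the top before truncation. The sketch below is an alternative ordering, not the notebook's method; `interleave_by_paper` and the demo hits are hypothetical.

```python
from itertools import zip_longest

def interleave_by_paper(hits_per_paper):
    """Round-robin merge: the best hit of each paper first, then the
    second-best of each paper, and so on."""
    ranked = [sorted(hits, key=lambda h: h["distance"]) for hits in hits_per_paper]
    merged = []
    for tier in zip_longest(*ranked):
        # zip_longest pads shorter lists with None; drop the padding.
        merged.extend(h for h in tier if h is not None)
    return merged

# Illustrative hits: every chunk of paper A is closer than any chunk of paper B.
paper_a = [{"paper_id": "A", "distance": 0.10}, {"paper_id": "A", "distance": 0.12}]
paper_b = [{"paper_id": "B", "distance": 0.40}, {"paper_id": "B", "distance": 0.45}]
order = [h["paper_id"] for h in interleave_by_paper([paper_a, paper_b])]
print(order)
```

A global sort would order these demo hits A, A, B, B, whereas the round-robin merge yields A, B, A, B. To use it in `compare_papers`, the per-paper hit lists would be collected in a list of lists instead of being flattened with `extend`.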

## Run the same comparison of the two papers twice to compare the results

In [None]:
cmp_res = compare_papers(
    "How is faithfulness defined and measured?",
    paper_ids=["2602.16154v1", "2602.14529v1"]
)
print(cmp_res["answer"])

# 4m55s

### Summary Table

| Paper ID | Key Points |
| --- | --- |
| 2602.16154v1 p.3 | - Defines faithfulness as the degree to which a listener model reaches the same answer as the original speaker reasoning model.<br>- Uses multi-listener soft execution for training, where truncated reasoning chains are provided to multiple listeners to compute matching rewards. |
| 2602.16154v1 p.6 | - Evaluates faithfulness using AOC metrics across different methods and shows REMUL balances faithfulness and correctness.<br>- Faithfulness-only training improves AOC, while hint-optimized models perform poorly. |
| 2602.16154v1 p.1 | - Argues that faithfulness is a multi-party property enabling listeners to reach the same conclusion as speakers without access to answers.<br>- Proposes REMUL for balancing faithfulness and correctness. |

### Agreement / Disagreement

- **Agreement**: Both papers agree on defining faithfulness in terms of listener models reaching the same conclusions as speaker models.
- **Disa

In [None]:
cmp_res = compare_papers(
    "How is faithfulness defined and measured?",
    paper_ids=["2602.16154v1", "2602.14529v1"]
)
print(cmp_res["answer"])

# 1m16.4s

### Summary Table

| Paper ID | Key Points |
| --- | --- |
| 2602.16154v1 p.3 | Faithfulness is defined as the degree to which a listener model reaches the same answer as the original speaker reasoning model. The faithfulness reward is computed by comparing the answers of multiple listeners with the original answer. |
| 2602.16154v1 p.6 | REMUL trains models for both faithfulness and correctness simultaneously, using truncated CoT answering and adding mistakes to measure faithfulness. Faithfulness-only training improves AOC metrics, while hint-optimized models perform worse on these metrics. |
| 2602.16154v1 p.1 | Faithfulness is a multi-party property where the speaker's reasoning must be executable by a listener without access to the answer. REMUL uses soft execution and multiple listeners for training. |

### Agreement / Disagreement

- **Agreement**: Both papers agree that faithfulness involves ensuring the listener model reaches the same conclusion as the speaker, even when only p

In [None]:
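def retrieval_stats(hits):
    """Summarize how many distinct papers the retrieved hits cover."""
    unique_papers = len(set(h['paper_id'] for h in hits))
    return {
        "unique_papers": unique_papers,
        "total_hits": len(hits),
        # Guard against an empty hit list to avoid ZeroDivisionError.
        "diversity_ratio": unique_papers / len(hits) if hits else 0.0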
    }
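
`retrieval_stats` reports how many distinct papers appear in the hits, but not how the hits are distributed among them. A per-paper count can complement it, for example to see whether a single paper dominates the retrieval. The sketch below is an illustrative addition; `hits_per_paper` and the demo hits are hypothetical.

```python
from collections import Counter

def hits_per_paper(hits):
    """Count how many retrieved chunks come from each paper_id."""
    return Counter(h.get("paper_id") for h in hits)

# Illustrative hits: two chunks from paper A, one from paper B.
demo_hits = [{"paper_id": "A"}, {"paper_id": "A"}, {"paper_id": "B"}]
counts = hits_per_paper(demo_hits)
print(counts.most_common())
```

In the notebook this could be called on `res["hits"]` or `cmp_res["hits"]`.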