## This notebook contains the code to build the MVP of our RAG assistant. It includes:
- A search function that queries the ChromaDB collection for the chunks most relevant to a question
- A function that builds a context string from the search hits, which is then used as input to the LLM
- A function that calls the local Ollama LLM with a prompt and returns its answer

The notebook then calls the answer function with a sample question and prints the answer.
It also defines a compare_papers function that takes a question and a list of paper IDs, retrieves relevant chunks
for each paper, builds a context, and prompts the LLM to compare the papers with respect to the question.

In [None]:
# !pip -q install pymupdf pandas tqdm tiktoken

from pathlib import Path
import pandas as pd

PROJECT_ROOT = Path.cwd().parent

In [3]:
from sentence_transformers import SentenceTransformer

# Use BAAI's BGE model, a strong open-source embedding model that converts
# text into vector embeddings suitable for tasks such as semantic search.
# In this notebook it embeds the search queries, so it should be the same
# model that was used to embed the text chunks stored in the collection.
embed_model_name = "BAAI/bge-base-en-v1.5"
model = SentenceTransformer(embed_model_name)

Loading weights: 100%|██████████| 199/199 [00:00<00:00, 765.12it/s, Materializing param=pooler.dense.weight]
BertModel LOAD REPORT from: BAAI/bge-base-en-v1.5
Key                     | Status
------------------------+-----------
embeddings.position_ids | UNEXPECTED

Notes:
- UNEXPECTED: can be ignored when loading from different task/architecture; not ok if you expect identical arch.
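
The `search` function in this notebook encodes queries with `normalize_embeddings=True`. For unit-length vectors the dot product equals cosine similarity, and squared L2 distance is `2 - 2 * cosine`. The sketch below illustrates that relationship with toy 3-dimensional vectors (real `bge-base-en-v1.5` vectors have 768 dimensions). It assumes the collection uses squared L2 as its distance, which is believed to be Chroma's default when no distance space is configured; the helper names are hypothetical.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length (conceptually what normalize_embeddings=True does)."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Toy vectors standing in for a query embedding and a chunk embedding
query = l2_normalize([1.0, 2.0, 2.0])
chunk = l2_normalize([2.0, 1.0, 2.0])

cosine = dot(query, chunk)            # unit vectors: dot product == cosine similarity
distance = squared_l2(query, chunk)   # unit vectors: equals 2 - 2 * cosine
print(round(cosine, 4), round(distance, 4))
```

Under these assumptions a smaller `distance` in the search hits means a higher cosine similarity, so ranking by ascending distance and ranking by descending cosine give the same order.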


In [4]:
import chromadb
from chromadb.config import Settings

DATA_DIR = PROJECT_ROOT / "data"
CHROMA_DIR = DATA_DIR / "chroma_db"
client = chromadb.PersistentClient(path=str(CHROMA_DIR), settings=Settings(anonymized_telemetry=False))

collection = client.get_or_create_collection(
    name="hallucination_faithfulness_chunks",
    metadata={"embedding_model": embed_model_name}
)
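
The collection is tagged with the embedding model name, which makes it possible to confirm that queries are encoded with the same model as the stored chunks. A minimal sketch of such a check, assuming the stored tag is read from `collection.metadata`; `check_embedding_model` is a hypothetical helper and the metadata values below are made up for illustration.

```python
def check_embedding_model(stored_metadata, expected_name):
    """Return True if the collection was tagged with the expected embedding model.

    stored_metadata: dict or None, e.g. the value of `collection.metadata`.
    """
    stored = (stored_metadata or {}).get("embedding_model")
    if stored is None:
        # No tag stored: the model cannot be verified, so treat it as a mismatch.
        return False
    return stored == expected_name

# Hypothetical metadata values, for illustration only
expected = "BAAI/bge-base-en-v1.5"
print(check_embedding_model({"embedding_model": "BAAI/bge-base-en-v1.5"}, expected))
print(check_embedding_model({"embedding_model": "some-other-model"}, expected))
print(check_embedding_model(None, expected))
```

In the notebook this could be called as `check_embedding_model(collection.metadata, embed_model_name)` before running any query.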

In [5]:
def search(query, k=5, where=None, *, return_text_preview_chars=700):
    """Return up to k hits for `query` as dicts with rank, distance, metadata, and text."""
    q_emb = model.encode([query], normalize_embeddings=True).tolist()

    res = collection.query(
        query_embeddings=q_emb,
        n_results=k,
        where=where
    )

    hits = []
    n = min(k, len(res["ids"][0]))

    for i in range(n):
        meta = res["metadatas"][0][i] or {}
        doc  = res["documents"][0][i] or ""
        dist = res["distances"][0][i]

        hits.append({
            "rank": i + 1,
            "distance": float(dist),
            "paper_id": meta.get("paper_id"),
            "year": meta.get("year"),
            "page": meta.get("page"),
            "title": meta.get("title", ""),
            "source_file": meta.get("source_file", ""),
            "text": doc,
            "text_preview": doc[:return_text_preview_chars].strip()
        })

    return hits

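def build_context(hits, max_chars=6000):
    """Join hits into one context string, stopping before the first block that would exceed max_chars."""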
    blocks = []
    total = 0
    for h in hits:
        block = f"[{h['paper_id']} p.{h['page']}] {h['title']}\n{h['text'].strip()}\n"
        if total + len(block) > max_chars:
            break
        blocks.append(block)
        total += len(block)
    return "\n\n".join(blocks)
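
`build_context` stops at the first block that would exceed the character budget, so a shorter hit that comes later in the list is dropped even if it would still fit. The sketch below repeats the function so that it runs on its own and feeds it made-up hits (the IDs, titles, and text are placeholders) to show that behaviour.

```python
def build_context(hits, max_chars=6000):
    # Same logic as the notebook's build_context, repeated so the example runs standalone.
    blocks, total = [], 0
    for h in hits:
        block = f"[{h['paper_id']} p.{h['page']}] {h['title']}\n{h['text'].strip()}\n"
        if total + len(block) > max_chars:
            break  # stop at the first overflow; later hits are not considered
        blocks.append(block)
        total += len(block)
    return "\n\n".join(blocks)

# Made-up hits for illustration only
fake_hits = [
    {"paper_id": "A", "page": 1, "title": "First", "text": "x" * 50},
    {"paper_id": "B", "page": 2, "title": "Second", "text": "y" * 500},
    {"paper_id": "C", "page": 3, "title": "Third", "text": "z" * 20},
]

ctx = build_context(fake_hits, max_chars=200)
print(ctx)
```

With a 200-character budget only the block for `A` is kept: `B` overflows the budget and ends the loop, so `C` is never added. Replacing `break` with `continue` would be one way to keep filling the budget with smaller hits, at the cost of skipping a higher-ranked one.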

In [6]:
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"
OLLAMA_MODEL = "qwen2.5:7b-instruct"

def call_llm(prompt: str, model: str = OLLAMA_MODEL, temperature: float = 0.2, max_tokens: int = 700) -> str:
    """
    Call local Ollama model. No API key. Free (runs on your machine).
    """
    payload = {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {
            "temperature": temperature,
            "num_predict": max_tokens,
        }
    }
    r = requests.post(OLLAMA_URL, json=payload, timeout=300)
    r.raise_for_status()
    data = r.json()
    return data.get("response", "").strip()
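
`num_predict` caps the number of generated tokens, so a long answer can be cut off mid-sentence. A minimal sketch of a truncation check on the parsed JSON response, assuming it carries `done_reason` and `eval_count` fields as in recent Ollama versions; `looks_truncated` is a hypothetical helper and the response dicts below are invented for illustration.

```python
def looks_truncated(data, max_tokens):
    """Heuristic check on a parsed /api/generate response dict."""
    # Assumption: done_reason == "length" means generation stopped at the num_predict limit.
    if data.get("done_reason") == "length":
        return True
    # Fallback: eval_count is the number of generated tokens, when the field is present.
    eval_count = data.get("eval_count")
    return eval_count is not None and eval_count >= max_tokens

# Invented response dicts, for illustration only
print(looks_truncated({"response": "...", "done_reason": "length"}, 700))
print(looks_truncated({"response": "...", "done_reason": "stop", "eval_count": 120}, 700))
```

If the check fires, the caller could retry with a larger `max_tokens` or ask for a shorter output format.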

In [None]:
def make_citations(hits, top_n=5):
    # build compact citations: [paper_id p.page] file
    cits = []
    for h in hits[:top_n]:
        cits.append({
            "paper_id": h.get("paper_id"),
            "page": h.get("page"),
            "title": h.get("title"),
            "source_file": h.get("source_file"),
            "distance": h.get("distance"),
        })
    return cits


def answer(question, *, k=8, where=None, max_ctx_chars=8000):
    hits = search(question, k=k, where=where)
    context = build_context(hits, max_chars=max_ctx_chars)  # if your function uses a different parameter name, change it here

    prompt = f"""
You are a research assistant specialized in LLM hallucination & faithfulness.
Answer the QUESTION using ONLY the CONTEXT.
If the context is insufficient, say "I don't know from the provided papers."

You must cite sources inline like: [paper_id p.page]
At the end, output a bullet list "Sources" with unique citations.

QUESTION:
{question}

CONTEXT:
{context}
""".strip()

    llm_text = call_llm(prompt)

    return {
        "question": question,
        "answer": llm_text,
        "hits": hits,
        "citations": make_citations(hits, top_n=5),
        "context_chars": len(context),
    }
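
The prompt asks the model to cite sources inline as `[paper_id p.page]`, while `make_citations` simply lists the top hits. A minimal sketch of how the citations actually written by the model could be extracted and checked against the retrieved hits; `extract_citations` and `unsupported_citations` are hypothetical helpers, and the sample hit list is made up.

```python
import re

# Matches the inline citation format requested in the prompt: [paper_id p.page]
CITATION_RE = re.compile(r"\[([^\[\]\s]+)\s+p\.(\d+)\]")

def extract_citations(answer_text):
    """Return unique (paper_id, page) string pairs in order of first appearance."""
    seen, out = set(), []
    for pid, page in CITATION_RE.findall(answer_text):
        key = (pid, page)
        if key not in seen:
            seen.add(key)
            out.append(key)
    return out

def unsupported_citations(answer_text, hits):
    """Citations in the answer that do not match any retrieved hit."""
    # Compare as strings so int and str page metadata are treated the same.
    retrieved = {(str(h.get("paper_id")), str(h.get("page"))) for h in hits}
    return [c for c in extract_citations(answer_text) if c not in retrieved]

sample_answer = (
    "Faithfulness is a multi-party property. [2602.16154v1 p.1]\n"
    "It can be measured with listener models. [2602.16154v1 p.3]\n"
    "Sources:\n- [2602.16154v1 p.1]\n- [2602.16154v1 p.3]"
)
sample_hits = [{"paper_id": "2602.16154v1", "page": 3}]  # made-up hit list

print(extract_citations(sample_answer))
print(unsupported_citations(sample_answer, sample_hits))
```

A non-empty result from `unsupported_citations` would flag a page the model cited without having seen it in the context.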

## The same query is used twice to compare the results:

In [None]:
res = answer("How do papers define faithfulness or groundedness?", k=10)
print(res["answer"])

# 2m16s

Faithfulness or groundedness is defined as a multi-party property where a faithful explanation enables a listener model to come to the same conclusion as the speaker without access to the speaker’s answer. Specifically, faithfulness is framed in terms of reasoning executability: a reasoning chain is considered faithful if it can be executed by a similarly capable listener to recover the same conclusion, without access to the speaker's final answer.

Sources:
- [2602.16154v1 p.3]
- [2602.16154v1 p.9]


In [None]:
res = answer("How do papers define faithfulness or groundedness?", k=10)
print(res["answer"])

Papers define faithfulness or groundedness as follows:

- Faithfulness is a multi-party property where a faithful explanation enables a listener model to come to the same conclusion as the speaker without access to the speaker’s answer. [2602.16154v1 p.1]
- Faithfulness can be measured by training models to produce reasoning traces that other models can effectively execute across multiple truncation points, encouraging the generation of responses executable by listener models using soft execution. [2602.16154v1 p.3]

Sources:
- [2602.16154v1 p.1]
- [2602.16154v1 p.3]
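
The two runs above return different answers to the same question. One plausible cause is sampling: `call_llm` uses a temperature of 0.2 and no fixed seed. A minimal sketch of a more repeatable request payload, assuming Ollama's `seed` option behaves as documented; `build_payload` is a hypothetical helper, and identical output across runs is still not guaranteed.

```python
def build_payload(prompt, model="qwen2.5:7b-instruct", *, temperature=0.0, seed=42, max_tokens=700):
    """Build an /api/generate payload with greedy-style sampling and a fixed seed."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {
            "temperature": temperature,  # 0.0 removes most sampling randomness
            "seed": seed,                # fixed seed, assumed to make sampling repeatable
            "num_predict": max_tokens,
        },
    }

payload = build_payload("How do papers define faithfulness or groundedness?")
print(payload["options"])
```

This payload could replace the dict built inside `call_llm` when comparing runs, so that remaining differences are more likely to come from retrieval than from sampling.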


In [None]:
def compare_papers(question, paper_ids, *, k_per_paper=4):
    all_hits = []
    for pid in paper_ids:
        hits = search(question, k=k_per_paper, where={"paper_id": pid})
        all_hits.extend(hits)

    # sort by distance (ascending, closest chunks first)
    all_hits = sorted(all_hits, key=lambda x: x["distance"])
    context = build_context(all_hits, max_chars=9000)

    prompt = f"""
Compare the papers with respect to the QUESTION.
Use ONLY the CONTEXT. Cite inline [paper_id p.page].
Output:
- Summary table (paper_id -> key points)
- Agreement / disagreement
- Practical takeaway

QUESTION:
{question}

CONTEXT:
{context}
""".strip()

    llm_text = call_llm(prompt)
    return {"question": question, "paper_ids": paper_ids, "answer": llm_text, "hits": all_hits}
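
Because `compare_papers` sorts all hits by distance before `build_context` applies its character budget, a paper whose chunks are all farther from the query can be pushed out of the context entirely. A hedged alternative is to interleave the per-paper hit lists round-robin, so each paper contributes its best chunk before any paper contributes a second one. `interleave_by_paper` is a hypothetical helper, and the hits below are made up.

```python
from itertools import zip_longest

def interleave_by_paper(hits_per_paper):
    """Round-robin merge: best hit of each paper first, then second-best, and so on.

    hits_per_paper: list of hit lists, one per paper, each already ordered by rank.
    """
    merged = []
    for tier in zip_longest(*hits_per_paper):
        # zip_longest pads shorter lists with None; drop the padding.
        merged.extend(h for h in tier if h is not None)
    return merged

# Made-up hits: paper A is closer to the query than paper B on every chunk
hits_a = [{"paper_id": "A", "distance": d} for d in (0.1, 0.2, 0.3)]
hits_b = [{"paper_id": "B", "distance": d} for d in (0.5, 0.6)]

merged = interleave_by_paper([hits_a, hits_b])
print([h["paper_id"] for h in merged])
```

Sorting these hits by distance would place all three `A` chunks ahead of any `B` chunk, whereas the interleaved order alternates between the two papers.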

## Run the same two-paper comparison twice to compare the results

In [None]:
cmp_res = compare_papers(
    "How is faithfulness defined and measured?",
    paper_ids=["2602.16154v1", "2602.14529v1"]
)
print(cmp_res["answer"])

# 4m55s

### Summary Table

| Paper ID | Key Points |
| --- | --- |
| 2602.16154v1 p.3 | - Defines faithfulness as the degree to which a listener model reaches the same answer as the original speaker reasoning model.<br>- Uses multi-listener soft execution for training, where truncated reasoning chains are provided to multiple listeners to compute matching rewards. |
| 2602.16154v1 p.6 | - Evaluates faithfulness using AOC metrics across different methods and shows REMUL balances faithfulness and correctness.<br>- Faithfulness-only training improves AOC, while hint-optimized models perform poorly. |
| 2602.16154v1 p.1 | - Argues that faithfulness is a multi-party property enabling listeners to reach the same conclusion as speakers without access to answers.<br>- Proposes REMUL for balancing faithfulness and correctness. |

### Agreement / Disagreement

- **Agreement**: Both papers agree on defining faithfulness in terms of listener models reaching the same conclusions as speaker models.
- **Disa

In [None]:
cmp_res = compare_papers(
    "How is faithfulness defined and measured?",
    paper_ids=["2602.16154v1", "2602.14529v1"]
)
print(cmp_res["answer"])

# 1m16.4s

### Summary Table

| Paper ID | Key Points |
| --- | --- |
| 2602.16154v1 p.3 | Faithfulness is defined as the degree to which a listener model reaches the same answer as the original speaker reasoning model. The faithfulness reward is computed by comparing the answers of multiple listeners with the original answer. |
| 2602.16154v1 p.6 | REMUL trains models for both faithfulness and correctness simultaneously, using truncated CoT answering and adding mistakes to measure faithfulness. Faithfulness-only training improves AOC metrics, while hint-optimized models perform worse on these metrics. |
| 2602.16154v1 p.1 | Faithfulness is a multi-party property where the speaker's reasoning must be executable by a listener without access to the answer. REMUL uses soft execution and multiple listeners for training. |

### Agreement / Disagreement

- **Agreement**: Both papers agree that faithfulness involves ensuring the listener model reaches the same conclusion as the speaker, even when only p