
# MDP — Evaluate Coverage & Faithfulness with Pipeline Retrievers/Embeddings

This notebook evaluates the two previously generated datasets (**Coverage / Ground-truth Recall** and **Faithfulness / Context Consistency**) using the retrieval and embedding methods defined in your pipeline notebook:

- `rag_pipeline_biollm_hybrid_embeddings.ipynb` (imported via `import-ipynb`)
- Datasets from: `MDP_coverage_faith_builder.ipynb` (already created)

It will:
1. Load `coverage_dataset.csv`, `faithfulness_dataset.csv`, `chunk_index.csv`.
2. Import your pipeline notebook and build **multiple retrievers** with different embeddings.
3. Compute **Doc-Hit@K / Chunk-Hit@K / ContextRecall@K** for each retriever.
4. Compute **Faithfulness AUC/ACC/F1** using your pipeline's NLI/judge if available (fallback: token-recall proxy).
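
To make the three coverage metrics in step 3 concrete, here is a minimal sketch for a single question. It follows the definitions used later in this notebook: a doc hit is a retrieved chunk id that starts with the ground-truth doc id, a chunk hit is an exact chunk-id match, and context recall is the fraction of answer tokens (longer than three characters) found in the retrieved text. The ids, texts, and the helper name `coverage_metrics` are made up for illustration.

```python
import re

def token_set(s):
    """Lower-cased word tokens longer than 3 characters."""
    return {t.lower() for t in re.findall(r"\b\w+\b", str(s)) if len(t) > 3}

def coverage_metrics(retrieved_ids, chunk_text, gt_doc, gt_chunk, answer):
    """Return (doc_hit, chunk_hit, context_recall) for one question."""
    hit_doc = int(any(cid.startswith(gt_doc) for cid in retrieved_ids))
    hit_chunk = int(gt_chunk in retrieved_ids)
    context = "\n\n".join(chunk_text.get(cid, "") for cid in retrieved_ids)
    ans_tokens = token_set(answer)
    if not ans_tokens:
        return hit_doc, hit_chunk, 0.0
    recall = len(ans_tokens & token_set(context)) / len(ans_tokens)
    return hit_doc, hit_chunk, recall

# Hypothetical toy corpus and retrieval result
toy_chunks = {
    "DOC::a::CH0001": "The National Institutes of Health funds medical research.",
    "DOC::b::CH0007": "Rehabilitation helps people regain daily function.",
}
result = coverage_metrics(
    retrieved_ids=["DOC::a::CH0001", "DOC::b::CH0007"],
    chunk_text=toy_chunks,
    gt_doc="DOC::a",
    gt_chunk="DOC::a::CH0003",
    answer="National Institutes of Health",
)
print(result)  # (1, 0, 1.0)
```

The right document was retrieved but not the exact ground-truth chunk, yet every answer token appears in the retrieved context. This is why the three metrics can disagree for the same question.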


In [1]:

# ==== Config (edit paths & variants) ====

# Data produced earlier by MDP_coverage_faith_builder.ipynb
DATA_DIR = "/home/gulizhu/MDP/benchmark_data/coverage_faithfulness"
COVERAGE_CSV      = f"{DATA_DIR}/coverage_dataset.csv"
FAITHFUL_CSV      = f"{DATA_DIR}/faithfulness_dataset.csv"
CHUNK_INDEX_CSV   = f"{DATA_DIR}/chunk_index.csv"

# Pipeline notebook path (imported as a module)
PIPELINE_NOTEBOOK = "/mnt/data/rag_pipeline_biollm_hybrid_embeddings.ipynb"

# Choose which retriever/embedding variants to evaluate
EMBEDDING_VARIANTS = [
    {"name": "hybrid_bm25_bge_small", "use_bm25": True,  "embedding": "bge-small-en"},
    {"name": "bge_small_only",        "use_bm25": False, "embedding": "bge-small-en"},
    {"name": "bioclinicalbert",       "use_bm25": False, "embedding": "bioclinicalbert"},
]

TOP_K = 10
SEED = 42


In [2]:

# ==== Imports & setup ====
import os, re, json, random, hashlib, importlib
import pandas as pd
import numpy as np
from pathlib import Path
from tqdm import tqdm

random.seed(SEED)
np.random.seed(SEED)

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

try:
    from rank_bm25 import BM25Okapi
    HAS_BM25 = True
except Exception:
    HAS_BM25 = False

print("DATA_DIR:", DATA_DIR)
print("Pipeline notebook:", PIPELINE_NOTEBOOK)


DATA_DIR: /home/gulizhu/MDP/benchmark_data/coverage_faithfulness
Pipeline notebook: /mnt/data/rag_pipeline_biollm_hybrid_embeddings.ipynb


In [3]:

# ==== Load datasets ====
cov = pd.read_csv(COVERAGE_CSV)
faith = pd.read_csv(FAITHFUL_CSV)
chunk_df = pd.read_csv(CHUNK_INDEX_CSV)

print("coverage rows:", len(cov))
print("faithfulness rows:", len(faith))
print("chunks:", len(chunk_df))

display(cov.head(3))
display(faith.head(3))
display(chunk_df.head(3))


coverage rows: 458
faithfulness rows: 919
chunks: 139943


Unnamed: 0,qid,question,answer,gt_doc_id,gt_chunk_id,hit_doc@K,hit_chunk@K,context_recall@K
0,Q::b7b927d779,What is the purpose of the assistance mentioned?,Rehabilitation,WHO::def5effffe,WHO::def5effffe::CH0075,0,0,0.0
1,Q::a4aedf3d70,What does the acronym NIH stand for?,National Institutes of Health,WHO::ba091c3aa0,WHO::ba091c3aa0::CH0022,0,0,0.3333
2,Q::1a6c0caae5,What is the full name of the NIDCD?,National Institute on Deafness and Other Commu...,WHO::ba091c3aa0,WHO::ba091c3aa0::CH0022,0,0,0.1667


Unnamed: 0,qid,question,answer,label_faithful,evidence_chunk_id,faithfulness_score,note
0,Q::b7b927d779,What is the purpose of the assistance mentioned?,Rehabilitation,1,WHO::def5effffe::CH0075,1.0,
1,Q::b7b927d779,What is the purpose of the assistance mentioned?,Rehabilitation,0,WHO::1a5d8db1de::CH0035,0.0,top1_evidence_eval
2,Q::a4aedf3d70,What does the acronym NIH stand for?,National Institutes of Health,1,WHO::ba091c3aa0::CH0022,0.0,


Unnamed: 0,chunk_id,doc_id,source,title,url,chars,text
0,WHO::8d7aa84649::CH0000,WHO::8d7aa84649,WHO,Common goods for health,https://www.who.int/health-topics/common-goods...,1057,Common goods for health are population-based f...
1,WHO::8d7aa84649::CH0001,WHO::8d7aa84649,WHO,Common goods for health,https://www.who.int/health-topics/common-goods...,1020,gislation (ex. environmental regulations and g...
2,WHO::8d7aa84649::CH0002,WHO::8d7aa84649,WHO,Common goods for health,https://www.who.int/health-topics/common-goods...,950,"ge of legal instruments (such as laws, decrees..."
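
Before evaluating, it can be worth confirming that every ground-truth chunk id in the coverage set exists in the chunk index, since a missing id can never be retrieved and would silently lower Chunk-Hit@K. The following is a minimal sketch of such a check on made-up ids; the helper name `missing_ground_truth` is hypothetical.

```python
def missing_ground_truth(gt_chunk_ids, index_chunk_ids):
    """Return the sorted ground-truth chunk ids that are absent from the index."""
    known = set(index_chunk_ids)
    return sorted({cid for cid in gt_chunk_ids if cid not in known})

# Hypothetical ids for illustration only
index_ids = ["DOC::a::CH0000", "DOC::a::CH0001", "DOC::b::CH0000"]
gt_ids = ["DOC::a::CH0001", "DOC::c::CH0004", "DOC::a::CH0001"]

missing = missing_ground_truth(gt_ids, index_ids)
print(missing)  # ['DOC::c::CH0004']
```

With the frames loaded above, the equivalent call would be `missing_ground_truth(cov["gt_chunk_id"], chunk_df["chunk_id"])`.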


In [5]:

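# ==== Import the pipeline notebook as a module ====
import sys, subprocess
import importlib.util

# Install import-ipynb on first use (find_spec replaces the deprecated pkgutil.find_loader)
if importlib.util.find_spec("import_ipynb") is None:
    subprocess.check_call([sys.executable, "-m", "pip", "install", "-U", "import-ipynb"])

import import_ipynb  # enables `import <notebook_name>` for .ipynb files

# Note: the notebook is resolved by module name here; PIPELINE_NOTEBOOK above is
# only printed and is not consulted by this import. Importing executes the notebook.
try:
    import rag_pipeline_biollm_hybrid_embeddings as pipe
    PIPELINE_IMPORTED = True
except Exception as e:
    print("⚠️ Pipeline import failed:", e)
    PIPELINE_IMPORTED = False

PIPELINE_IMPORTED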


Textbook chunks: 4759
Knowledge base size: 6044
Index(['question'], dtype='object')
                                            question
0  What is the role of a pathologist in cancer di...
1  Which biomarkers are key in the analysis of br...
2  How does a pathologist prepare and analyze a t...
3  What are key features that a pathologist looks...
4  What is immunohistochemistry and how is it use...
Saved -> rag_results_with_embeddings.csv
Saved -> rag_results_retriever_embedding_matrix.csv


True

In [6]:

# ==== Build retrievers for each embedding variant ====

CORPUS_TEXTS = chunk_df["text"].tolist()
CORPUS_IDS   = chunk_df["chunk_id"].tolist()

class LocalHybridRetriever:
    """TF-IDF retriever with optional BM25 fusion, used when the pipeline builder is unavailable.

    Note: this fallback uses no neural embedding, so variants that differ
    only in their "embedding" name behave identically here.
    """

    def __init__(self, texts, ids, use_bm25=False):
        self.ids = ids
        self.texts = texts
        self.vectorizer = TfidfVectorizer(max_features=60000, ngram_range=(1, 2))
        self.tf = self.vectorizer.fit_transform(texts)
        # BM25 is enabled only when requested and rank_bm25 is installed
        self.use_bm25 = bool(use_bm25 and HAS_BM25)
        self.bm25 = BM25Okapi([t.split() for t in texts]) if self.use_bm25 else None

    @staticmethod
    def _min_max(a):
        """Rescale scores to [0, 1]; the epsilon avoids division by zero."""
        a = np.asarray(a, dtype=float)
        return (a - a.min()) / (a.max() - a.min() + 1e-9)

    def search(self, query: str, k: int = 10):
        tf_q = self.vectorizer.transform([query])
        scores = cosine_similarity(tf_q, self.tf).ravel()
        if self.bm25 is not None:
            bm = self.bm25.get_scores(query.split())
            # Equal-weight fusion of min-max normalized TF-IDF and BM25 scores
            scores = 0.5 * self._min_max(scores) + 0.5 * self._min_max(bm)
        order = np.argsort(-scores)[:k]
        return [(self.ids[i], float(scores[i])) for i in order]

def build_retriever_with_pipeline(embedding_name: str, use_bm25: bool):
    if PIPELINE_IMPORTED and hasattr(pipe, "build_retriever_from_corpus"):
        try:
            retr = pipe.build_retriever_from_corpus(
                texts=CORPUS_TEXTS, ids=CORPUS_IDS,
                embedding=embedding_name, use_bm25=use_bm25
            )
            return retr, "pipeline"
        except Exception as e:
            print(f"⚠️ pipeline build_retriever_from_corpus failed for {embedding_name}: {e}")

    return LocalHybridRetriever(CORPUS_TEXTS, CORPUS_IDS, use_bm25=use_bm25), "local_fallback"

RETRIEVERS = {}
for cfg in EMBEDDING_VARIANTS:
    retr, origin = build_retriever_with_pipeline(cfg["embedding"], cfg["use_bm25"])
    RETRIEVERS[cfg["name"]] = {"retriever": retr, "origin": origin, "cfg": cfg}

RETRIEVERS.keys()


dict_keys(['hybrid_bm25_bge_small', 'bge_small_only', 'bioclinicalbert'])
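
The fallback retriever above fuses TF-IDF cosine similarity with BM25 by min-max normalizing each score list and averaging them with equal weights. The sketch below reproduces that fusion on three made-up documents using only the standard library, to show why normalization matters: BM25 scores are unbounded, so without rescaling they would dominate the cosine scores. The function names and the `weight` parameter are illustrative, not part of the pipeline.

```python
def min_max(scores, eps=1e-9):
    """Rescale a list of scores to [0, 1]; eps avoids division by zero."""
    lo, hi = min(scores), max(scores)
    return [(s - lo) / (hi - lo + eps) for s in scores]

def fuse(tfidf_scores, bm25_scores, weight=0.5):
    """Weighted sum of min-max normalized score lists (weight applies to TF-IDF)."""
    a, b = min_max(tfidf_scores), min_max(bm25_scores)
    return [weight * x + (1 - weight) * y for x, y in zip(a, b)]

# Hypothetical scores for three documents
tfidf = [0.10, 0.40, 0.25]   # cosine similarities, bounded by 1
bm25 = [2.0, 8.0, 11.0]      # BM25 scores, unbounded

hybrid = fuse(tfidf, bm25)
ranking = sorted(range(len(hybrid)), key=lambda i: -hybrid[i])
print(ranking)  # [1, 2, 0]
```

Document 2 has the highest BM25 score, but document 1 ranks first after fusion because it leads on TF-IDF and is still strong on BM25.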

In [7]:

# ==== Evaluate Coverage metrics for each retriever ====

def token_set(s):
    return {t.lower() for t in re.findall(r"\b\w+\b", str(s)) if len(t)>3}

chunk_text = dict(zip(chunk_df["chunk_id"], chunk_df["text"]))

def eval_coverage_for_retriever(retr, K=10):
    rows = []
    for _, r in cov.iterrows():
        qid, q, ans = r["qid"], r["question"], r["answer"]
        gt_doc, gt_chunk = r["gt_doc_id"], r["gt_chunk_id"]
        top = retr.search(q, k=K)
        ids = [cid for cid,_ in top]
        hit_doc   = int(any(str(cid).startswith(gt_doc) for cid in ids))
        hit_chunk = int(gt_chunk in ids)
        ctx = " \n\n".join([chunk_text.get(cid, "") for cid in ids])
        A = token_set(ans); C = token_set(ctx)
        recall = len(A & C) / (len(A)+1e-9)
        rows.append({"qid": qid, "hit_doc@K": hit_doc, "hit_chunk@K": hit_chunk, "context_recall@K": recall})
    df = pd.DataFrame(rows)
    return {
        "doc_hit": float(df["hit_doc@K"].mean()),
        "chunk_hit": float(df["hit_chunk@K"].mean()),
        "ctx_recall": float(df["context_recall@K"].mean()),
        "detail": df,
    }

coverage_summary = []
coverage_details = {}

for name, obj in RETRIEVERS.items():
    print(f"Evaluating Coverage for: {name} (origin={obj['origin']})")
    res = eval_coverage_for_retriever(obj["retriever"], K=TOP_K)
    coverage_details[name] = res["detail"]
    coverage_summary.append({
        "retriever": name,
        "origin": obj["origin"],
        "K": TOP_K,
        "Doc-Hit@K": res["doc_hit"],
        "Chunk-Hit@K": res["chunk_hit"],
        "ContextRecall@K": res["ctx_recall"],
    })

coverage_table = pd.DataFrame(coverage_summary).sort_values(["Doc-Hit@K","Chunk-Hit@K","ContextRecall@K"], ascending=False)
display(coverage_table)


Evaluating Coverage for: hybrid_bm25_bge_small (origin=local_fallback)
Evaluating Coverage for: bge_small_only (origin=local_fallback)
Evaluating Coverage for: bioclinicalbert (origin=local_fallback)


Unnamed: 0,retriever,origin,K,Doc-Hit@K,Chunk-Hit@K,ContextRecall@K
0,hybrid_bm25_bge_small,local_fallback,10,0.353712,0.124454,0.362049
1,bge_small_only,local_fallback,10,0.347162,0.10262,0.312591
2,bioclinicalbert,local_fallback,10,0.347162,0.10262,0.312591
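
In the table above, every variant reports `origin=local_fallback`, and `bge_small_only` and `bioclinicalbert` have identical scores. That is what one would expect when the fallback is used, because `LocalHybridRetriever` takes only `texts`, `ids`, and `use_bm25` and never sees the embedding name. A small check like the following can flag such rows before the numbers are read as an embedding comparison. It is a sketch using the values copied from the table; the helper name `find_indistinguishable` is hypothetical.

```python
from collections import defaultdict

def find_indistinguishable(rows, metric_keys, ndigits=6):
    """Group retriever names whose rounded metric values are all equal."""
    groups = defaultdict(list)
    for row in rows:
        key = tuple(round(row[m], ndigits) for m in metric_keys)
        groups[key].append(row["retriever"])
    return [names for names in groups.values() if len(names) > 1]

METRICS = ["Doc-Hit@K", "Chunk-Hit@K", "ContextRecall@K"]

# Values copied from the coverage table above
summary = [
    {"retriever": "hybrid_bm25_bge_small", "origin": "local_fallback",
     "Doc-Hit@K": 0.353712, "Chunk-Hit@K": 0.124454, "ContextRecall@K": 0.362049},
    {"retriever": "bge_small_only", "origin": "local_fallback",
     "Doc-Hit@K": 0.347162, "Chunk-Hit@K": 0.10262, "ContextRecall@K": 0.312591},
    {"retriever": "bioclinicalbert", "origin": "local_fallback",
     "Doc-Hit@K": 0.347162, "Chunk-Hit@K": 0.10262, "ContextRecall@K": 0.312591},
]

duplicates = find_indistinguishable(summary, METRICS)
all_fallback = all(row["origin"] == "local_fallback" for row in summary)
print(duplicates)    # [['bge_small_only', 'bioclinicalbert']]
print(all_fallback)  # True
```

With the notebook's own objects, the same check could be run as `find_indistinguishable(coverage_table.to_dict("records"), METRICS)`.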


In [10]:

# ==== Faithfulness evaluation ====

def get_evidence_text_by_chunk_id(cid: str) -> str:
    return chunk_text.get(cid, "")

def token_recall(pred, evid):
    A = {t.lower() for t in re.findall(r"\b\w+\b", str(pred)) if len(t)>3}
    E = {t.lower() for t in re.findall(r"\b\w+\b", str(evid)) if len(t)>3}
    return len(A & E) / (len(A)+1e-9)

PIPELINE_HAS_JUDGE = PIPELINE_IMPORTED and hasattr(pipe, "nli_judge")

def judge_faithfulness(answer: str, evidence: str) -> float:
    if PIPELINE_HAS_JUDGE:
        try:
            return float(pipe.nli_judge(answer, evidence))
        except Exception:
            return float(token_recall(answer, evidence))
    else:
        return float(token_recall(answer, evidence))

# Keep only rows without a note; rows tagged e.g. "top1_evidence_eval" are excluded
gold = faith[faith["note"].isna()].copy()

scores = []
labels = []
for _, r in tqdm(gold.iterrows(), total=len(gold)):
    evid = get_evidence_text_by_chunk_id(str(r["evidence_chunk_id"]))
    s = judge_faithfulness(str(r["answer"]), evid)
    scores.append(float(s))
    labels.append(int(r["label_faithful"]))

gold["judge_score"] = scores

from sklearn.metrics import roc_auc_score, accuracy_score, f1_score
# AUC needs both classes present; otherwise report NaN
auc = roc_auc_score(labels, scores) if len(set(labels))>1 else float("nan")
# ACC and F1 binarize the judge score at a fixed 0.5 threshold
preds = [int(s >= 0.5) for s in scores]
acc = accuracy_score(labels, preds)
f1  = f1_score(labels, preds)

print({"AUC": auc, "ACC": acc, "F1": f1})
display(gold.head(10)[["qid","question","answer","label_faithful","evidence_chunk_id","judge_score"]])


100%|██████████| 461/461 [00:00<00:00, 10775.60it/s]

{'AUC': 0.784934497816594, 'ACC': 0.7245119305856833, 'F1': 0.8394437420986094}

Unnamed: 0,qid,question,answer,label_faithful,evidence_chunk_id,judge_score
0,Q::b7b927d779,What is the purpose of the assistance mentioned?,Rehabilitation,1,WHO::def5effffe::CH0075,1.0
2,Q::a4aedf3d70,What does the acronym NIH stand for?,National Institutes of Health,1,WHO::ba091c3aa0::CH0022,0.333333
4,Q::1a6c0caae5,What is the full name of the NIDCD?,National Institute on Deafness and Other Commu...,1,WHO::ba091c3aa0::CH0022,1.0
6,Q::4d04fecd09,What is the full name of the organization abbr...,Centers for Disease Control and Prevention,1,WHO::c746a8289b::CH0048,1.0
8,Q::be034ca29c,How long do these symptoms typically last?,A couple of days.,1,WHO::5d456f490d::CH0041,1.0
10,Q::6a276532ab,Who has a higher risk of getting cancer in the...,Someone who has had cancer in one testicle.,1,WHO::8c8bdba1fe::CH0003,0.666667
12,Q::bc30540a16,What is important to do regularly for the othe...,Check it regularly.,1,WHO::8c8bdba1fe::CH0003,1.0
14,Q::181a4da1f6,What is one key measure to prevent diarrhoea r...,Access to safe drinking-water.,1,WHO::82af9760db::CH0002,1.0
16,Q::f0aac8e0b2,What type of sanitation helps prevent diarrhoea?,Use of improved sanitation.,1,WHO::82af9760db::CH0002,1.0
18,Q::e7c93c3f5f,What is the subject of the passage?,Policy frameworks for good urban governance.,1,WHO::efe533a7d1::CH0025,1.0
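
The cell above reports accuracy and F1 at a fixed 0.5 cut-off, while AUC does not depend on a threshold. The sketch below uses made-up labels and scores to show the difference. It computes AUC as the fraction of (faithful, unfaithful) pairs in which the faithful example scores higher, counting ties as half, then evaluates accuracy at two thresholds. It uses only the standard library, and the function names are hypothetical.

```python
def pairwise_auc(labels, scores):
    """AUC as P(score of a positive > score of a negative), ties count 0.5."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        return float("nan")  # undefined with a single class
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def accuracy_at(labels, scores, threshold):
    """Accuracy after binarizing scores with score >= threshold."""
    preds = [int(s >= threshold) for s in scores]
    return sum(int(p == l) for p, l in zip(preds, labels)) / len(labels)

# Hypothetical labels (1 = faithful) and judge scores
toy_labels = [1, 1, 1, 0, 0]
toy_scores = [1.0, 0.4, 0.7, 0.2, 0.0]

auc_toy = pairwise_auc(toy_labels, toy_scores)
acc_05 = accuracy_at(toy_labels, toy_scores, 0.5)
acc_03 = accuracy_at(toy_labels, toy_scores, 0.3)
print(auc_toy, acc_05, acc_03)  # 1.0 0.8 1.0
```

Here the scorer ranks every faithful example above every unfaithful one (AUC of 1.0), yet the 0.5 threshold misclassifies the faithful example scored 0.4. When the judge is a token-recall proxy, a threshold tuned on held-out data may be more informative than the fixed 0.5.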


In [11]:

# ==== Save detailed outputs ====
OUT_DIR = Path(DATA_DIR) / "eval_outputs"
OUT_DIR.mkdir(parents=True, exist_ok=True)

coverage_table.to_csv(OUT_DIR / "coverage_summary_by_retriever.csv", index=False)
for name, df in coverage_details.items():
    df.to_csv(OUT_DIR / f"coverage_detail_{name}.csv", index=False)

gold.to_csv(OUT_DIR / "faithfulness_gold_with_scores.csv", index=False)

print("Saved to:", OUT_DIR)


Saved to: /home/gulizhu/MDP/benchmark_data/coverage_faithfulness/eval_outputs
