<a href="https://colab.research.google.com/github/RegNLP/ContextAware-Regulatory-GraphRAG-ObliQAMP/blob/main/3_Experiment_1_Baseline_Retriever_Performance.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Experiment 1: Baseline Retriever Performance**
Goal: Find the best initial retrieval method before any complex re-ranking. This is your foundational baseline.

Method:

* Disable Stage 2 (Graph Re-ranking) and Stage 3 (Cross-Encoder).
* For each of your 8 embedding sets (2 models x 4 context strategies, e.g., mpnet with passage_only, e5-large with parent):
    * Run A: Evaluate using only dense retrieval (the sentence transformer).
    * Run B: Evaluate using hybrid retrieval (dense + BM25, fused with Reciprocal Rank Fusion, RRF).
* Record the performance (nDCG@10, MAP@10, and Recall@10) for all 16 runs (8 embedding sets x 2 methods), plus a BM25-only run as a lexical reference point.

Outcome: You'll identify the best-performing embedding model and context strategy. You'll also know if combining it with BM25 provides a significant boost. The winner of this experiment becomes your "champion retriever" for the next stage.
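
To make Run B concrete, here is a minimal sketch of Reciprocal Rank Fusion on two toy rankings. The passage IDs are made up for illustration, `k=60` matches the default used in the notebook code, and each passage receives `1 / (k + rank)` per list, with ranks starting at 1.

```python
def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse ranked lists of IDs; each ID scores sum(1 / (k + rank)), rank starting at 1."""
    fused = {}
    for ranking in ranked_lists:
        for rank, uid in enumerate(ranking, start=1):
            fused[uid] = fused.get(uid, 0.0) + 1.0 / (k + rank)
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)

# Hypothetical top-3 lists from the dense retriever and from BM25.
dense_ranking = ["p2", "p1", "p3"]
bm25_ranking = ["p1", "p4", "p2"]

fused = reciprocal_rank_fusion([dense_ranking, bm25_ranking])
fused_ids = [uid for uid, _ in fused]
print(fused_ids)  # ['p1', 'p2', 'p4', 'p3']
```

Here `p1` ends up first because both retrievers place it in their top two, while `p3` and `p4`, each found by only one retriever, fall to the bottom. RRF uses ranks only, so the raw cosine and BM25 scores never need to be put on a common scale.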

**Experiment 2: Value of Graph Re-ranking**
Goal: Isolate and measure the benefit of using the knowledge graph structure.

Method:

* Use the champion retriever identified in Experiment 1 (e.g., hybrid retrieval with e5-large + parent_child context).
* Enable Stage 2 (Graph Re-ranking).
* Keep Stage 3 (Cross-Encoder) disabled.
* Run the pipeline and evaluate the results.

Outcome: By comparing these results to the champion's score from Experiment 1, you can quantify the exact performance change (hopefully an improvement) from the graph-based score bonuses. This tells you if your contextual linking strategy is effective.
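
The exact bonus scheme for Stage 2 is not shown in this notebook. As a rough sketch only, assume each candidate gets a fixed additive bonus for every graph neighbor that was also retrieved. The function name `graph_rerank`, the `neighbors` mapping, and the `bonus` value are all hypothetical.

```python
def graph_rerank(candidates, neighbors, bonus=0.01):
    """Add a fixed bonus to a candidate for each graph neighbor that was also retrieved.

    candidates: dict of passage ID -> first-stage score (e.g. an RRF score)
    neighbors:  dict of passage ID -> set of linked passage IDs (parent, citation, ...)
    bonus:      hypothetical additive bonus per retrieved neighbor
    """
    reranked = {}
    for uid, score in candidates.items():
        # Only neighbors that are themselves in the candidate list earn a bonus.
        linked = candidates.keys() & neighbors.get(uid, set())
        reranked[uid] = score + bonus * len(linked)
    return sorted(reranked.items(), key=lambda item: item[1], reverse=True)

# Toy first-stage scores and links, for illustration only.
candidates = {"p1": 0.0325, "p2": 0.0323, "p3": 0.0161}
neighbors = {"p1": {"p3"}, "p3": {"p1", "p2"}}

reranked_ids = [uid for uid, _ in graph_rerank(candidates, neighbors)]
print(reranked_ids)  # ['p1', 'p3', 'p2']
```

In this toy case `p3` jumps over `p2` purely because it is well connected. Whether such a jump helps or hurts depends on whether connected passages are actually relevant, which is what this experiment measures. The size of `bonus` relative to the gaps between first-stage scores controls how strongly the graph can override the retriever.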

**Experiment 3: Full Pipeline with Cross-Encoders**
Goal: Measure the impact of the final, high-precision re-ranking and find the best overall model combination.

Method:

* Use the best setup from Experiment 2 (Champion Retriever + Graph Re-ranking).
* Enable Stage 3 (Cross-Encoder Re-ranking).
* Run the complete, three-stage pipeline for each of your cross-encoder models (MiniLM, MPNet, etc.).

Outcome: This will give you the final performance for your complete system. You can compare the different cross-encoders to see which one provides the most significant boost, leading you to the definitive, best-performing pipeline for your task.
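
A minimal sketch of what Stage 3 does: re-score each (query, passage) pair and re-sort the candidates. The scorer is passed in as a callable so the example runs without downloading a model. `rerank` and `overlap_scorer` are hypothetical names, and the token-overlap scorer is only a stand-in for a real cross-encoder.

```python
def rerank(query, candidates, score_pairs, top_k=10):
    """Re-score (query, passage) pairs and return the top_k (passage_id, score) tuples.

    candidates:  list of (passage_id, passage_text) from the earlier stages
    score_pairs: callable taking a list of (query, text) pairs, returning one score per pair
    """
    pairs = [(query, text) for _, text in candidates]
    scores = score_pairs(pairs)
    ranked = sorted(zip(candidates, scores), key=lambda item: float(item[1]), reverse=True)
    return [(pid, float(score)) for (pid, _), score in ranked[:top_k]]

def overlap_scorer(pairs):
    """Stand-in scorer for illustration only: counts shared lowercase tokens."""
    return [len(set(q.lower().split()) & set(t.lower().split())) for q, t in pairs]

candidates = [
    ("p1", "Fees are set out in the schedule."),
    ("p2", "The firm must report breaches to the regulator."),
]
top = rerank("When must a firm report breaches?", candidates, overlap_scorer)
print(top)  # [('p2', 3.0), ('p1', 0.0)]
```

To use a real model, `score_pairs` could be the `predict` method of a `sentence_transformers.CrossEncoder`, for example one loaded from `cross-encoder/ms-marco-MiniLM-L-6-v2`. That checkpoint name is an assumption, since the notebook does not say which MiniLM or MPNet cross-encoders are used.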



In [66]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [74]:
# ==============================================================================
# Experiment 1: Baseline Retriever Evaluation - BM25, Dense, Hybrid
# ==============================================================================

import os
import json
import torch
import pickle
import numpy as np
import pandas as pd
import networkx as nx
import pytrec_eval
from tqdm import tqdm
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer, util

# --- Configuration ---
BASE_PATH = "/content/drive/MyDrive/Colab Notebooks/RIRAG-MultiPassage-NLLP/"
GRAPH_PATH = os.path.join(BASE_PATH, "graph.gpickle")
TEST_SET_PATH = os.path.join(BASE_PATH, "QADataset", "ObliQA_MultiPassage_test.json")
QREL_PATH = os.path.join(BASE_PATH, "qrels.trec")
EMBEDDINGS_FOLDER = os.path.join(BASE_PATH, "embeddings")
RESULTS_OUTPUT_PATH = os.path.join(BASE_PATH, "experiment_1_full_results.csv")

K = 100
device = "cuda" if torch.cuda.is_available() else "cpu"

# --- Embedding Models and Contexts ---
EMBEDDING_MODELS = {
    "mpnet": "sentence-transformers/all-mpnet-base-v2",
    "e5-large": "intfloat/e5-large-v2"
}
CONTEXT_CONFIGS = ["passage_only", "parent", "parent_child", "full_neighborhood"]

# --- Load Graph ---
print("Loading graph...")
with open(GRAPH_PATH, "rb") as f:
    G = pickle.load(f)

# --- Load Test Set ---
with open(TEST_SET_PATH, "r", encoding="utf-8") as f:
    test_data = json.load(f)
print(f"Loaded {len(test_data)} test questions.")

# --- Generate QREL file if needed ---
if not os.path.exists(QREL_PATH):
    print("QREL file not found. Generating...")
    with open(QREL_PATH, "w", encoding="utf-8") as f:
        for item in test_data:
            qid = item["QuestionID"]
            for passage in item["Passages"]:
                doc_id = passage["DocumentID"]
                passage_id = passage["PassageID"]
                uid = f"{doc_id}|||{passage_id}"
                f.write(f"{qid} 0 {uid} 1\n")
    print(f"✅ QREL saved to: {QREL_PATH}")
else:
    print("QREL file found. Skipping generation.")

# --- Load QRELs ---
qrel = {}
with open(QREL_PATH, "r", encoding="utf-8") as f:
    for line in f:
        parts = line.strip().split()
        qid, _, uid, rel = parts[0], parts[1], " ".join(parts[2:-1]), int(parts[-1])
        qrel.setdefault(qid, {})[uid] = rel
print(f"Loaded QRELs for {len(qrel)} queries.")

# --- Prepare BM25 ---
print("Preparing BM25 index...")
all_passage_uids = [n for n, d in G.nodes(data=True) if d.get("type") == "Passage"]
corpus_texts = [G.nodes[uid].get("text", "") for uid in all_passage_uids]
tokenized_corpus = [text.split() for text in corpus_texts]
bm25 = BM25Okapi(tokenized_corpus)
print("BM25 ready.")

# --- Evaluation Function ---
def evaluate_run(run_dict, qrel_dict, metrics={"recall_10", "map_cut_10", "ndcg_cut_10"}):
    evaluator = pytrec_eval.RelevanceEvaluator(qrel_dict, metrics)
    results = evaluator.evaluate(run_dict)
    agg = {metric: pytrec_eval.compute_aggregated_measure(metric, [r.get(metric, 0.0) for r in results.values()]) for metric in metrics}
    return agg

# --- Reciprocal Rank Fusion ---
def reciprocal_rank_fusion(ranked_lists, k=60):
    fused = {}
    for lst in ranked_lists:
        for rank, uid in enumerate(lst):
            fused[uid] = fused.get(uid, 0) + 1 / (k + rank + 1)
    return sorted(fused.items(), key=lambda x: x[1], reverse=True)

# --- Store All Results ---
all_results = []

# --- BM25 Retrieval ---
print("\n=== BM25 Retrieval ===")
bm25_run = {}
for q in tqdm(test_data, desc="BM25"):
    qid = q["QuestionID"]
    query = q["Question"]
    scores = bm25.get_scores(query.split())
    top_idxs = np.argsort(scores)[::-1][:K]
    bm25_run[qid] = {}
    for idx in top_idxs:
        uid = all_passage_uids[idx]
        node = G.nodes[uid]
        doc_id = node.get("document_id", "")
        passage_id = node.get("passage_id", "")
        if doc_id and passage_id:
            combined_uid = f"{doc_id}|||{passage_id}"
            bm25_run[qid][combined_uid] = float(scores[idx])
bm25_metrics = evaluate_run(bm25_run, qrel)
all_results.append({
    "Model": "BM25", "Context": "N/A", "Method": "Lexical Only",
    "Recall@10": bm25_metrics["recall_10"],
    "MAP@10": bm25_metrics["map_cut_10"],
    "nDCG@10": bm25_metrics["ndcg_cut_10"]
})

# --- Dense + Hybrid Retrieval ---
for model_key, model_path in EMBEDDING_MODELS.items():
    print(f"\n=== Dense Evaluation: {model_key} ===")
    query_encoder = SentenceTransformer(model_path, device=device)

    for context_key in CONTEXT_CONFIGS:
        print(f"→ Context: {context_key}")
        emb_path = os.path.join(EMBEDDINGS_FOLDER, model_key, context_key, "embeddings.pkl")
        id_path = os.path.join(EMBEDDINGS_FOLDER, model_key, context_key, "passage_ids.json")

        try:
            with open(emb_path, "rb") as f:
                passage_embeddings = pickle.load(f)
            with open(id_path, "r") as f:
                passage_ids = json.load(f)
        except FileNotFoundError:
            print(f"⚠️ Missing embeddings for {model_key} / {context_key}, skipping...")
            continue

        dense_run = {}
        hybrid_run = {}

        embeddings_tensor = torch.tensor(passage_embeddings).to(device)

        for q in tqdm(test_data, desc=f"{model_key}/{context_key}"):
            qid = q["QuestionID"]
            query = q["Question"]
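            # Note: e5 models are trained with "query: " / "passage: " input prefixes.
            # The raw question is encoded here, so check that the stored passage
            # embeddings were produced with a matching convention.
            query_emb = query_encoder.encode(query, convert_to_tensor=True, device=device)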
            cos_scores = util.pytorch_cos_sim(query_emb, embeddings_tensor)[0]
            top_results = torch.topk(cos_scores, k=min(K, len(passage_ids)))

            dense_run[qid] = {}
            dense_uids = []
            # Convert tensors to plain Python ints/floats before indexing and storing
            for idx, score in zip(top_results.indices.tolist(), top_results.values.tolist()):
                uid = passage_ids[idx]
                node = G.nodes[uid]
                doc_id = node.get("document_id", "")
                passage_id = node.get("passage_id", "")
                if doc_id and passage_id:
                    combined_uid = f"{doc_id}|||{passage_id}"
                    dense_run[qid][combined_uid] = float(score)
                    dense_uids.append(uid)

            # Hybrid with BM25
            tokenized_query = query.split()
            bm25_scores = bm25.get_scores(tokenized_query)
            top_bm25 = np.argsort(bm25_scores)[::-1][:K]
            bm25_uids = [all_passage_uids[i] for i in top_bm25]

            fused_uids_with_scores = reciprocal_rank_fusion([dense_uids, bm25_uids])
            hybrid_run[qid] = {}
            for rank, (uid, score) in enumerate(fused_uids_with_scores[:K]):
                node = G.nodes[uid]
                doc_id = node.get("document_id", "")
                passage_id = node.get("passage_id", "")
                if doc_id and passage_id:
                    combined_uid = f"{doc_id}|||{passage_id}"
                    hybrid_run[qid][combined_uid] = float(score)

        # Evaluate Dense
        dense_metrics = evaluate_run(dense_run, qrel)
        all_results.append({
            "Model": model_key, "Context": context_key, "Method": "Dense",
            "Recall@10": dense_metrics["recall_10"],
            "MAP@10": dense_metrics["map_cut_10"],
            "nDCG@10": dense_metrics["ndcg_cut_10"]
        })

        # Evaluate Hybrid
        hybrid_metrics = evaluate_run(hybrid_run, qrel)
        all_results.append({
            "Model": model_key, "Context": context_key, "Method": "Hybrid",
            "Recall@10": hybrid_metrics["recall_10"],
            "MAP@10": hybrid_metrics["map_cut_10"],
            "nDCG@10": hybrid_metrics["ndcg_cut_10"]
        })

# --- Save Results ---
df = pd.DataFrame(all_results)
df = df.sort_values(by=["Model", "Context", "Method"])
print("\n📊 Final Evaluation Results:")
print(df.to_string(index=False, float_format="%.4f"))

df.to_csv(RESULTS_OUTPUT_PATH, index=False)
print(f"\n✅ Results saved to: {RESULTS_OUTPUT_PATH}")


Loading graph...
Loaded 447 test questions.
QREL file not found. Generating...
✅ QREL saved to: /content/drive/MyDrive/Colab Notebooks/RIRAG-MultiPassage-NLLP/qrels.trec
Loaded QRELs for 447 queries.
Preparing BM25 index...
BM25 ready.

=== BM25 Retrieval ===


BM25: 100%|██████████| 447/447 [00:46<00:00,  9.61it/s]



=== Dense Evaluation: mpnet ===
→ Context: passage_only


mpnet/passage_only: 100%|██████████| 447/447 [00:54<00:00,  8.22it/s]


→ Context: parent


mpnet/parent: 100%|██████████| 447/447 [00:53<00:00,  8.40it/s]


→ Context: parent_child


mpnet/parent_child: 100%|██████████| 447/447 [00:53<00:00,  8.42it/s]


→ Context: full_neighborhood


mpnet/full_neighborhood: 100%|██████████| 447/447 [01:19<00:00,  5.63it/s]



=== Dense Evaluation: e5-large ===
→ Context: passage_only


e5-large/passage_only: 100%|██████████| 447/447 [00:59<00:00,  7.53it/s]


→ Context: parent


e5-large/parent: 100%|██████████| 447/447 [01:01<00:00,  7.30it/s]


→ Context: parent_child


e5-large/parent_child: 100%|██████████| 447/447 [01:00<00:00,  7.43it/s]


→ Context: full_neighborhood


e5-large/full_neighborhood: 100%|██████████| 447/447 [00:58<00:00,  7.62it/s]



📊 Final Evaluation Results:
   Model           Context       Method  Recall@10  MAP@10  nDCG@10
    BM25               N/A Lexical Only     0.3474  0.2611   0.3409
e5-large full_neighborhood        Dense     0.3759  0.2586   0.3371
e5-large full_neighborhood       Hybrid     0.4163  0.3052   0.3933
e5-large            parent        Dense     0.3852  0.2710   0.3499
e5-large            parent       Hybrid     0.4118  0.3043   0.3919
e5-large      parent_child        Dense     0.3703  0.2566   0.3343
e5-large      parent_child       Hybrid     0.4155  0.3069   0.3950
e5-large      passage_only        Dense     0.3911  0.2818   0.3613
e5-large      passage_only       Hybrid     0.4148  0.3053   0.3931
   mpnet full_neighborhood        Dense     0.3404  0.2311   0.3038
   mpnet full_neighborhood       Hybrid     0.4051  0.2872   0.3744
   mpnet            parent        Dense     0.3474  0.2454   0.3177
   mpnet            parent       Hybrid     0.4033  0.2909   0.3776
   mpnet      paren

In [75]:
# Sound alert: play a short audio clip when a long-running cell finishes
from IPython.display import Audio, display

def allDone():
    # Swap in any audio URL here
    # display(Audio(url='https://www.myinstants.com/media/sounds/anime-wow-sound-effect.mp3', autoplay=True))
    display(Audio(url='https://www.myinstants.com/media/sounds/money-soundfx.mp3', autoplay=True))

allDone()

**Observations from Experiment 1: Baseline Retriever Performance**
The results from the initial baseline experiment provide several clear and important insights into the effectiveness of different retrieval strategies for your dataset.

**Observation 1: Hybrid Retrieval is a Clear Winner**
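In every test case, the Hybrid method (dense semantic search fused with lexical BM25) outperformed both the Dense (semantic only) and the Lexical Only (BM25) methods. This holds across all three metrics: nDCG@10, Recall@10, and MAP@10. For example, with e5-large and parent_child context, nDCG@10 rises from 0.3343 (Dense) to 0.3950 (Hybrid), against 0.3409 for BM25 alone. No statistical significance test was run, so these are consistent gains in the mean scores rather than proven significant differences. Still, the pattern indicates that for your specific regulatory documents, combining keyword relevance with semantic meaning is the most effective initial retrieval strategy.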
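
Whether a gap between two methods is statistically significant can be checked with a paired test on per-query scores. The sketch below uses a paired randomization (sign-flip) test. The function name is hypothetical and the score arrays are synthetic. In the notebook, per-query values would come from the dictionary returned by `evaluator.evaluate(run_dict)`.

```python
import numpy as np

def paired_randomization_test(scores_a, scores_b, n_resamples=10_000, seed=0):
    """Two-sided paired sign-flip test on per-query score differences."""
    rng = np.random.default_rng(seed)
    diffs = np.asarray(scores_a, dtype=float) - np.asarray(scores_b, dtype=float)
    observed = abs(diffs.mean())
    # Under the null hypothesis, the sign of each per-query difference is arbitrary.
    signs = rng.choice([-1.0, 1.0], size=(n_resamples, diffs.size))
    resampled = np.abs((signs * diffs).mean(axis=1))
    # Add-one smoothing keeps the p-value strictly positive.
    return (np.sum(resampled >= observed) + 1) / (n_resamples + 1)

# Synthetic per-query nDCG@10 values for illustration only (not the notebook's results).
rng = np.random.default_rng(42)
dense_scores = rng.uniform(0.0, 1.0, size=200)
hybrid_scores = np.clip(dense_scores + rng.normal(0.05, 0.1, size=200), 0.0, 1.0)

p_value = paired_randomization_test(hybrid_scores, dense_scores)
print(f"p-value: {p_value:.4f}")
```

A small p-value (conventionally below 0.05) would support calling the improvement significant. With 16 runs compared against each other, a correction for multiple comparisons would also be worth considering.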

**Observation 2: e5-large is the Superior Embedding Model**
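Across all context strategies and retrieval methods, the e5-large model consistently scored higher than the mpnet model. With full_neighborhood context, for example, nDCG@10 is 0.3371 vs. 0.3038 for Dense and 0.3933 vs. 0.3744 for Hybrid. This suggests that e5-large captures the nuances of your legal and regulatory text better, leading to better recall and overall ranking quality. The gap is wider in Dense mode, where the embedding model carries the whole ranking.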

**Observation 3: Graph Context Gives a Small Benefit in Hybrid Mode (with a Trade-off)**
Comparing the context strategies for the top-performing e5-large model in Hybrid mode across all three metrics shows a trade-off, although the margins are small:

* Best Overall Ranking Quality (nDCG@10 & MAP@10): The parent_child context achieved the highest scores for both nDCG@10 (0.3950) and MAP@10 (0.3069), narrowly ahead of passage_only (0.3931 and 0.3053). Since both of these metrics reward placing relevant passages near the top of the ranked list, this configuration is the most precise.

* Best for Finding Documents (Recall@10): The full_neighborhood context achieved the highest recall (0.4163). This means it placed the largest share of relevant passages somewhere within the top 10 results, even if not perfectly ordered.

* Dense-Only Caveat: Without BM25, passage_only is the strongest e5-large context on all three metrics (nDCG@10 0.3613, MAP@10 0.2818, Recall@10 0.3911). The benefit of graph context therefore appears only in Hybrid mode.
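
To make the three metrics concrete, here is a small worked example for a single query with binary relevance. It follows the usual conventions: AP@k divides by the total number of relevant passages, and nDCG uses a log2 rank discount. It is a sketch rather than a reimplementation of pytrec_eval, so edge cases may be handled differently. The ranking and relevance set are made up.

```python
import math

def metrics_at_k(ranking, relevant, k=10):
    """Binary-relevance Recall@k, AP@k and nDCG@k for one query."""
    top = ranking[:k]
    hits = [1 if uid in relevant else 0 for uid in top]
    recall = sum(hits) / len(relevant)
    # AP@k: precision at each relevant rank, averaged over all relevant passages.
    precisions = [sum(hits[:i + 1]) / (i + 1) for i, hit in enumerate(hits) if hit]
    ap = sum(precisions) / len(relevant)
    # DCG discounts each hit by log2(rank + 1); IDCG is the best achievable DCG.
    dcg = sum(hit / math.log2(i + 2) for i, hit in enumerate(hits))
    idcg = sum(1 / math.log2(i + 2) for i in range(min(len(relevant), k)))
    return {"recall": recall, "ap": ap, "ndcg": dcg / idcg}

ranking = ["d1", "d2", "d3", "d4", "d5"]   # retrieved order
relevant = {"d1", "d4", "d9"}              # d9 was never retrieved
scores = metrics_at_k(ranking, relevant, k=5)
print({name: round(value, 4) for name, value in scores.items()})
# {'recall': 0.6667, 'ap': 0.5, 'ndcg': 0.6714}

# Moving d4 up to rank 2 improves AP and nDCG but leaves recall unchanged.
better = metrics_at_k(["d1", "d4", "d2", "d3", "d5"], relevant, k=5)
```

This is why two configurations can trade places across metrics: recall only counts how many relevant passages land in the top k, while MAP and nDCG also reward placing them higher.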

**Conclusion: Identifying the Champion and Runner-Up**
Based on a balanced view of all three metrics, we can pick a champion for the next stage of experiments.

* Champion Retriever: e5-large model with parent_child context, using the Hybrid method. It leads on two of the three reported metrics (nDCG@10 and MAP@10), making it the best-performing retriever for delivering highly relevant results at the top of the list.

* Strong Runner-Up: e5-large model with full_neighborhood context, using the Hybrid method. Its nDCG@10 is only 0.0017 lower (0.3933 vs. 0.3950), and it has the highest Recall@10 (0.4163), making it an excellent and very close alternative.

Since the performance of these two is extremely close, they both represent excellent starting points for our more advanced re-ranking experiments.

**Update After Experiment 2: The Limits of Heuristic Graph Re-ranking**
The results from the enriched graph re-ranking experiment are in, and they provide a crucial insight: the graph-based re-ranking strategies did not improve performance.

* Experiment 1 Champion (Hybrid Retriever): nDCG@10 of 0.3950

* Experiment 2 Best (Graph Re-ranked): nDCG@10 of 0.3400

This is a drop of 0.0550 in nDCG@10, about 14% relative to the champion, and it is a valuable finding. It suggests that the hybrid retrieval stage is already a strong baseline. The simple, additive graph bonuses (parent, citation, etc.) were not nuanced enough to improve upon it and, in some cases, likely introduced noise that harmed the ranking.

This is not a failure, but a sign that reaching the next level of precision requires a more powerful re-ranking method. This leads us directly to the next logical step in our plan: cross-encoder re-ranking in Experiment 3.