<a href="https://colab.research.google.com/github/RegNLP/ContextAware-Regulatory-GraphRAG-ObliQAMP/blob/main/04_Experiment_1_Baseline_Retrieval.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# ==============================================================================
# Baseline Retriever Evaluation: Pre-trained vs. Fine-Tuned vs. Advanced
#
# Purpose:
# 1. Evaluate all retriever models: pre-trained, standard fine-tuned, and
#    advanced fine-tuned (with hard negatives).
# 2. Apply instruction prefixes during inference for relevant models.
# 3. Compare BM25, Dense, and Hybrid retrieval methods to identify the
#    definitive champion retriever.
# 4. Save both a CSV summary and detailed JSON outputs for each run.
# ==============================================================================

# --- Essential Installations ---
!pip install -q -U sentence-transformers transformers datasets rank_bm25 pytrec_eval

import os
import json
import torch
import pickle
import numpy as np
import pandas as pd
import networkx as nx
import pytrec_eval
from tqdm import tqdm
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer, util

# --- Configuration ---
BASE_PATH = "/content/drive/MyDrive/Colab Notebooks/RIRAG-MultiPassage-NLLP/"
GRAPH_PATH = os.path.join(BASE_PATH, "graph.gpickle")
TEST_SET_PATH = os.path.join(BASE_PATH, "QADataset", "ObliQA_MultiPassage_test.json")
QREL_PATH = os.path.join(BASE_PATH, "qrels.trec")
EMBEDDINGS_FOLDER = os.path.join(BASE_PATH, "embeddings_full_comparison")

# --- Output Folders ---
RESULTS_CSV_OUTPUT_PATH = os.path.join(BASE_PATH, "experiment_1_final_retriever_comparison_results.csv")
RESULTS_JSON_OUTPUT_FOLDER = os.path.join(BASE_PATH, "experiment_1_retrieval_results_json")
os.makedirs(RESULTS_JSON_OUTPUT_FOLDER, exist_ok=True)

# --- Model Input Folders ---
FINETUNED_RETRIEVER_FOLDER = os.path.join(BASE_PATH, "fine_tuned_retrievers")
ADVANCED_FINETUNED_FOLDER = os.path.join(BASE_PATH, "fine_tuned_retrievers_advanced")

K = 100
device = "cuda" if torch.cuda.is_available() else "cpu"

# --- Models and Contexts ---
MODELS_TO_EVALUATE = {
    "e5-large-v2_FT_Advanced": os.path.join(ADVANCED_FINETUNED_FOLDER, "e5-large-v2"),
    "all-mpnet-base-v2_FT_Advanced": os.path.join(ADVANCED_FINETUNED_FOLDER, "all-mpnet-base-v2"),
    "bge-base-en-v1.5_FT_Advanced": os.path.join(ADVANCED_FINETUNED_FOLDER, "bge-base-en-v1.5"),
    "e5-large-v2_FT": os.path.join(FINETUNED_RETRIEVER_FOLDER, "e5-large-v2"),
    "all-mpnet-base-v2_FT": os.path.join(FINETUNED_RETRIEVER_FOLDER, "all-mpnet-base-v2"),
    "bge-base-en-v1.5_FT": os.path.join(FINETUNED_RETRIEVER_FOLDER, "bge-base-en-v1.5"),
    "e5-large-v2_Pretrained": "intfloat/e5-large-v2",
    "all-mpnet-base-v2_Pretrained": "sentence-transformers/all-mpnet-base-v2",
    "bge-base-en-v1.5_Pretrained": "BAAI/bge-base-en-v1.5",
}
CONTEXT_CONFIGS = ["passage_only", "parent", "parent_child", "parent_child_cites", "full_neighborhood"]

# --- Load Graph and Data ---
print("Loading graph and test data...")
with open(GRAPH_PATH, "rb") as f:
    G = pickle.load(f)
with open(TEST_SET_PATH, "r", encoding="utf-8") as f:
    test_data = json.load(f)
# Create a quick lookup for question text by QID
qid_to_question = {q["QuestionID"]: q["Question"] for q in test_data}
print(f"Loaded {len(test_data)} test questions.")

# --- Generate and Load QRELs ---
if not os.path.exists(QREL_PATH):
    print("QREL file not found. Generating...")
    with open(QREL_PATH, "w", encoding="utf-8") as f:
        for item in test_data:
            qid = item["QuestionID"]
            for passage in item["Passages"]:
                uid = f"{passage['DocumentID']}|||{passage['PassageID']}"
                f.write(f"{qid} 0 {uid} 1\n")
    print(f"✅ QREL saved to: {QREL_PATH}")
else:
    print("QREL file found.")

qrel = {}
with open(QREL_PATH, "r", encoding="utf-8") as f:
    for line in f:
        parts = line.strip().split()
        qid, _, uid, rel = parts[0], parts[1], " ".join(parts[2:-1]), int(parts[-1])
        qrel.setdefault(qid, {})[uid] = rel
print(f"Loaded QRELs for {len(qrel)} queries.")

# --- Prepare BM25 ---
print("Preparing BM25 index...")
all_passage_uids = [n for n, d in G.nodes(data=True) if d.get("type") == "Passage"]
# Create a map from combined UID to internal UID for text lookup
uid_map = {f"{G.nodes[uid].get('document_id')}|||{G.nodes[uid].get('passage_id')}": uid for uid in all_passage_uids}
corpus_texts = [G.nodes[uid].get("text", "") for uid in all_passage_uids]
tokenized_corpus = [text.split() for text in corpus_texts]
bm25 = BM25Okapi(tokenized_corpus)
print("BM25 ready.")

# --- Helper Functions ---
def add_instruction_to_query(query, model_key):
    if "e5" in model_key:
        return f"query: {query}"
    if "bge" in model_key:
        return f"Represent this sentence for searching relevant passages: {query}"
    return query

def evaluate_run(run_dict, qrel_dict, metrics={"recall_10", "map_cut_10", "ndcg_cut_10"}):
    evaluator = pytrec_eval.RelevanceEvaluator(qrel_dict, metrics)
    results = evaluator.evaluate(run_dict)
    agg = {metric: pytrec_eval.compute_aggregated_measure(metric, [r.get(metric, 0.0) for r in results.values()]) for metric in metrics}
    return agg

def reciprocal_rank_fusion(ranked_lists, k=60):
    fused = {}
    for lst in ranked_lists:
        for rank, uid in enumerate(lst):
            fused[uid] = fused.get(uid, 0) + 1 / (k + rank + 1)
    return sorted(fused.items(), key=lambda x: x[1], reverse=True)

def format_run_for_json(run_dict, qid_to_question_map, uid_to_internal_uid_map, graph, top_n=10):
    output_list = []
    for qid, passages in run_dict.items():
        sorted_passages = sorted(passages.items(), key=lambda item: item[1], reverse=True)

        retrieved_passages_text = []
        retrieved_scores = []
        retrieved_ids = []

        # Keep only the top_n results for the JSON output
        for combined_uid, score in sorted_passages[:top_n]:
            internal_uid = uid_to_internal_uid_map.get(combined_uid)
            if internal_uid is not None:
                retrieved_passages_text.append(graph.nodes[internal_uid].get("text", ""))
                retrieved_scores.append(score)
                retrieved_ids.append(internal_uid) # Using internal UID as the primary ID

        output_list.append({
            "QuestionID": qid,
            "Question": qid_to_question_map.get(qid, ""),
            "RetrievedPassages": retrieved_passages_text,
            "RetrievedScores": retrieved_scores,
            "RetrievedIDs": retrieved_ids
        })
    return output_list

# --- Main Evaluation ---
all_results = []

# 1. BM25 Baseline
print("\n=== BM25 Retrieval ===")
bm25_run = {}
for q in tqdm(test_data, desc="BM25"):
    qid, query = q["QuestionID"], q["Question"]
    scores = bm25.get_scores(query.split())
    top_idxs = np.argsort(scores)[::-1][:K]
    bm25_run[qid] = {f"{G.nodes[all_passage_uids[idx]].get('document_id')}|||{G.nodes[all_passage_uids[idx]].get('passage_id')}": float(scores[idx]) for idx in top_idxs}
bm25_metrics = evaluate_run(bm25_run, qrel)
all_results.append({"Model": "BM25", "Context": "N/A", "Method": "Lexical Only", **bm25_metrics})
# Save JSON output for BM25
bm25_json = format_run_for_json(bm25_run, qid_to_question, uid_map, G)
with open(os.path.join(RESULTS_JSON_OUTPUT_FOLDER, "BM25_results.json"), "w") as f:
    json.dump(bm25_json, f, indent=4)

# 2. Dense and Hybrid Models
for model_key, model_path in MODELS_TO_EVALUATE.items():
    print(f"\n=== Evaluating Model: {model_key} ===")
    query_encoder = SentenceTransformer(model_path, device=device)
    for context_key in CONTEXT_CONFIGS:
        print(f"→ Context: {context_key}")
        emb_path = os.path.join(EMBEDDINGS_FOLDER, model_key, context_key, "embeddings.pkl")
        id_path = os.path.join(EMBEDDINGS_FOLDER, model_key, context_key, "passage_ids.json")

        try:
            with open(emb_path, "rb") as f: passage_embeddings = pickle.load(f)
            with open(id_path, "r") as f: passage_ids = json.load(f)
        except FileNotFoundError:
            print(f"⚠️ Embeddings not found for {model_key}/{context_key}. Skipping.")
            continue

        dense_run, hybrid_run = {}, {}
        embeddings_tensor = torch.tensor(passage_embeddings).to(device)

        for q in tqdm(test_data, desc=f"{model_key}/{context_key}"):
            qid, query = q["QuestionID"], q["Question"]

            instructed_query = add_instruction_to_query(query, model_key)
            query_emb = query_encoder.encode(instructed_query, convert_to_tensor=True, device=device)

            cos_scores = util.pytorch_cos_sim(query_emb, embeddings_tensor)[0]
            top_results = torch.topk(cos_scores, k=min(K, len(passage_ids)))

            dense_run[qid], dense_uids = {}, []
            for idx, score in zip(top_results.indices, top_results.values):
                uid = passage_ids[idx]
                node = G.nodes[uid]
                combined_uid = f"{node.get('document_id')}|||{node.get('passage_id')}"
                dense_run[qid][combined_uid] = float(score.item())
                dense_uids.append(uid)

            bm25_scores = bm25.get_scores(query.split())
            top_bm25 = np.argsort(bm25_scores)[::-1][:K]
            bm25_uids = [all_passage_uids[i] for i in top_bm25]

            fused_uids = reciprocal_rank_fusion([dense_uids, bm25_uids])[:K]
            hybrid_run[qid] = {f"{G.nodes[uid].get('document_id')}|||{G.nodes[uid].get('passage_id')}": score for uid, score in fused_uids}

        # Evaluate and save results for Dense run
        dense_metrics = evaluate_run(dense_run, qrel)
        all_results.append({"Model": model_key, "Context": context_key, "Method": "Dense", **dense_metrics})
        dense_json = format_run_for_json(dense_run, qid_to_question, uid_map, G)
        with open(os.path.join(RESULTS_JSON_OUTPUT_FOLDER, f"{model_key}_{context_key}_Dense_results.json"), "w") as f:
            json.dump(dense_json, f, indent=4)

        # Evaluate and save results for Hybrid run
        hybrid_metrics = evaluate_run(hybrid_run, qrel)
        all_results.append({"Model": model_key, "Context": context_key, "Method": "Hybrid", **hybrid_metrics})
        hybrid_json = format_run_for_json(hybrid_run, qid_to_question, uid_map, G)
        with open(os.path.join(RESULTS_JSON_OUTPUT_FOLDER, f"{model_key}_{context_key}_Hybrid_results.json"), "w") as f:
            json.dump(hybrid_json, f, indent=4)

# --- Save and Display Final Results ---
df = pd.DataFrame(all_results)
df = df.rename(columns={"recall_10": "Recall@10", "map_cut_10": "MAP@10", "ndcg_cut_10": "nDCG@10"})
df = df.sort_values(by="nDCG@10", ascending=False)

print("\n📊 Final Evaluation Results:")
print(df.to_string(index=False, float_format="%.4f"))

df.to_csv(RESULTS_CSV_OUTPUT_PATH, index=False)
print(f"\n✅ CSV summary saved to: {RESULTS_CSV_OUTPUT_PATH}")
print(f"✅ Detailed JSON results saved to: {RESULTS_JSON_OUTPUT_FOLDER}")

print("\n--- 🏆 Best Performing Configuration ---")
print(df.iloc[0])


In [None]:
## Import sound alert dependencies
from IPython.display import Audio, display

def allDone():
  #display(Audio(url='https://www.myinstants.com/media/sounds/anime-wow-sound-effect.mp3', autoplay=True))
  display(Audio(url='https://www.myinstants.com/media/sounds/money-soundfx.mp3', autoplay=True))
## Insert whatever audio file you want above

allDone()

**Observations from Experiment 1: Baseline Retriever Performance**
The results from the initial baseline experiment provide several clear insights into the effectiveness of different retrieval strategies on this dataset.

**Observation 1: Hybrid Retrieval is a Clear Winner**
In every test case, the Hybrid method (dense semantic search fused with lexical BM25 via reciprocal rank fusion) outperformed both the Dense (semantic only) and the Lexical Only (BM25) methods. This holds across all three metrics: nDCG@10, Recall@10, and MAP@10. For these regulatory documents, combining keyword relevance with semantic similarity is therefore the most effective first-stage retrieval strategy.
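
To make the fusion step concrete, here is a minimal, self-contained sketch of reciprocal rank fusion on two toy rankings. It mirrors the `reciprocal_rank_fusion` helper in the code cell above (same `k=60`, same scores, written with 1-based ranks); the passage IDs are made up for illustration and do not come from the dataset.

```python
def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse ranked ID lists: each list contributes 1 / (k + rank), ranks starting at 1."""
    fused = {}
    for ranked in ranked_lists:
        for rank, uid in enumerate(ranked, start=1):
            fused[uid] = fused.get(uid, 0.0) + 1.0 / (k + rank)
    # Highest fused score first
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)


# Toy rankings with hypothetical passage IDs (not taken from the dataset)
dense_ranking = ["p3", "p1", "p7"]
bm25_ranking = ["p1", "p9", "p3"]

fused = reciprocal_rank_fusion([dense_ranking, bm25_ranking])
fused_ids = [uid for uid, _ in fused]
print(fused_ids)
```

In this toy case `p1` and `p3` rise to the top because both retrievers return them, while passages found by only one retriever (`p7`, `p9`) fall behind. Only ranks enter the formula, so the raw BM25 and cosine scores never need to be put on a common scale.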

**Observation 2: e5-large is the Superior Embedding Model**
Across all context strategies and retrieval methods, the e5-large model consistently produced better results than the mpnet model. This suggests that e5-large captures the nuances of legal and regulatory text more effectively, leading to better recall and overall ranking quality.

**Observation 3: Graph Context is Beneficial (with a Trade-off)**
When evaluating the top-performing e5-large model across all three key metrics, a clear picture emerges:

* Best Overall Ranking Quality (nDCG@10 & MAP@10): The parent_child context achieved the highest scores for both nDCG@10 (0.3950) and MAP@10 (0.3069). Since both of these metrics heavily reward placing correct documents at the very top of the ranked list, this configuration is the most precise.

* Best for Finding Documents (Recall@10): The full_neighborhood context achieved the highest recall (0.4163). This indicates it was the most effective at finding all relevant passages and placing them somewhere within the top 10 results, even if not perfectly ordered.
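
The trade-off between ranking quality and recall can be reproduced on a toy example. The sketch below is illustrative only: it uses hand-written binary-relevance versions of Recall@10 and nDCG@10 (the notebook itself relies on `pytrec_eval`), and the two runs and their passage IDs are invented, not taken from the experiment.

```python
import math


def recall_at_k(ranked_ids, relevant_ids, k=10):
    """Fraction of relevant passages that appear in the top k."""
    hits = sum(1 for uid in ranked_ids[:k] if uid in relevant_ids)
    return hits / len(relevant_ids)


def ndcg_at_k(ranked_ids, relevant_ids, k=10):
    """Binary-relevance nDCG with a log2(rank + 1) discount."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, uid in enumerate(ranked_ids[:k], start=1)
        if uid in relevant_ids
    )
    ideal_hits = min(len(relevant_ids), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg


relevant = {"r1", "r2"}
fillers = [f"x{i}" for i in range(1, 10)]  # nine non-relevant passages

# Hypothetical run A: one relevant passage at rank 1, the other one missed
precise_run = ["r1"] + fillers
# Hypothetical run B: both relevant passages found, but only at ranks 9 and 10
broad_run = fillers[:8] + ["r1", "r2"]

for name, run in [("precise", precise_run), ("broad", broad_run)]:
    print(name, round(recall_at_k(run, relevant), 4), round(ndcg_at_k(run, relevant), 4))
```

The "broad" run finds both relevant passages and so has the higher Recall@10, yet the "precise" run has the higher nDCG@10 because its single hit sits at rank 1. The same pattern distinguishes the full_neighborhood and parent_child contexts.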

**Conclusion: Identifying the Champion and Runner-Up**
Based on a balanced view of all three metrics, we can identify a clear champion for the next stage of experiments.

* Champion Retriever: e5-large model with parent_child context, using the Hybrid method. It is the winner on two of the three most important metrics (nDCG@10 and MAP@10), making it the best-performing retriever for delivering highly relevant results at the top of the list.

* Strong Runner-Up: e5-large model with full_neighborhood context, using the Hybrid method. While its ranking precision is a fraction lower, it excels at recall, making it an excellent and very close alternative.

Since the performance of these two is extremely close, they both represent excellent starting points for our more advanced re-ranking experiments.

**Update After Experiment 2: The Limits of Heuristic Graph Re-ranking**
The results from the enriched graph re-ranking experiment are in, and they provide a crucial insight: the graph-based re-ranking strategies did not improve performance.

* Experiment 1 Champion (Hybrid Retriever): nDCG@10 of 0.3950

* Experiment 2 Best (Graph Re-ranked): nDCG@10 of 0.3400

This performance drop is a valuable finding. It suggests that the initial hybrid retrieval stage is already very effective. The simple, additive graph bonuses (parent, citation, etc.) were not nuanced enough to improve upon this strong baseline and, in some cases, likely introduced noise that harmed the ranking.
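
As a rough illustration of how a flat, additive bonus can disturb an already good ranking, consider the sketch below. It is not the Experiment 2 implementation: the function name, the bonus value, the `anchor_top_n` parameter, and the toy scores and links are all hypothetical, chosen only to show the mechanism.

```python
def rerank_with_additive_bonus(candidates, neighbors, bonus=0.05, anchor_top_n=3):
    """Add a flat bonus to a candidate for each top-ranked passage it is linked to.

    candidates: list of (passage_id, score), sorted by score descending.
    neighbors:  dict mapping passage_id -> set of graph-linked passage_ids.
    All names and values here are illustrative assumptions.
    """
    anchors = [pid for pid, _ in candidates[:anchor_top_n]]
    rescored = []
    for pid, score in candidates:
        links = sum(
            1 for anchor in anchors
            if anchor != pid and pid in neighbors.get(anchor, set())
        )
        rescored.append((pid, score + bonus * links))
    return sorted(rescored, key=lambda item: item[1], reverse=True)


# Toy first-stage ranking and toy graph links (hypothetical values)
candidates = [("a", 0.90), ("b", 0.85), ("c", 0.80), ("d", 0.78)]
neighbors = {"a": {"d"}, "b": {"d"}}

reranked = rerank_with_additive_bonus(candidates, neighbors)
print([pid for pid, _ in reranked])
```

Here passage `d` jumps from fourth to second place purely because it is linked to two top-ranked passages, regardless of whether it answers the question. If such links are only loosely related to relevance, bonuses of this kind push correct passages down, which is one plausible reading of the drop in nDCG@10.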

This is not a failure, but a confirmation that to achieve the next level of precision, a more powerful re-ranking method is required. This leads us directly to the next logical step in our plan.