<a href="https://colab.research.google.com/github/RegNLP/ContextAware-Regulatory-GraphRAG-ObliQAMP/blob/main/04_Experiment_1_Baseline_Retrieval.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
from google.colab import drive
drive.mount('/content/drive')

This section evaluates multiple retriever models (pretrained, fine-tuned, and advanced fine-tuned) using lexical (BM25), dense, and hybrid retrieval strategies over a graph-structured regulatory dataset. The goal is to identify the most effective retriever configuration for multi-passage question answering.

**✅ Evaluation Steps**
* 1️⃣ Setup and Configuration
  * Install essential libraries: sentence-transformers, rank_bm25, pytrec_eval, etc.
  * Define paths for:
    * Document graph (graph.gpickle)
    * Test dataset (ObliQA_MultiPassage_test.json)
    * Precomputed embeddings
    * Output folders for JSON and CSV results

* 2️⃣ Load Data
  * Load the regulatory graph structure with passages as nodes.
  * Load the test question set (JSON), where each question is linked to relevant passages.
  * Generate or load QRELs in TREC format (mapping of question IDs to relevant passage IDs).

* 3️⃣ Prepare BM25 Retriever
  * Extract passage texts from graph nodes.
  * Tokenize each passage into words.
  * Use the BM25Okapi class to initialize a lexical BM25 retriever over the passage corpus.

* 4️⃣ Define Evaluation Helpers
  * add_instruction_to_query: Adds model-specific query prompts (e.g., "query: " for E5).
  * evaluate_run: Evaluates a retrieval run using Recall@10 and MAP@10 with pytrec_eval.
  * reciprocal_rank_fusion: Combines multiple rankings into one (used for hybrid retrieval).
  * format_run_for_json: Converts internal retrieval results into readable JSON format, including passage text, scores, and internal IDs.

* 5️⃣ Run BM25 Retrieval

  * For each question:
    * Use BM25 to score all passages.
    * Select the top-𝑘 (e.g., 100) passages by score.
    * Evaluate BM25 results using pytrec_eval.

  * Save:
    * Summary metrics (Recall@10, MAP@10)
    * Detailed retrieved passages (as JSON)

* 6️⃣ Dense and Hybrid Retrieval (Per Model and Context)

  * For each retriever model (e.g., e5-large-v2, all-mpnet-base-v2, bge-base-en-v1.5) and each graph context variant (e.g., passage_only, parent, parent_child, parent_child_cites, full_neighborhood), we evaluate both Dense and Hybrid retrieval strategies:

    * a. Load Embeddings
        
      * Load passage embeddings (embeddings.pkl) and corresponding passage IDs (passage_ids.json) for the current model-context configuration.
      
      * If missing, skip the current configuration.

    * b. Encode Queries
      
      * Add an instruction prefix to the query, if required by the model (e.g., E5 or BGE).

      * Encode the query into a dense embedding using the selected SentenceTransformer.

    * c. Dense Retrieval
      
      * Compute cosine similarity between the query embedding and all passage embeddings.

      * Select the top-𝑘 most similar passages.

      * Format and store the run (passage IDs and similarity scores).

    * d. Hybrid Retrieval (Reciprocal Rank Fusion)
      
      * Get top-𝑘 BM25 matches for the same query.

      * Combine the BM25 and Dense results using Reciprocal Rank Fusion (RRF):

          $$\text{RRF Score} = \sum_{r \in \text{retrievers}} \frac{1}{k + \text{rank}_r}$$

      * Select the top-𝑘 fused results as the Hybrid retrieval output.

    * e. Evaluate and Save

      * Evaluate both Dense and Hybrid results using:

        * Recall@10: Fraction of a query's relevant passages that appear in the top 10, averaged over all queries.

        * MAP@10: Mean over all queries of the average precision computed on the top 10 ranked passages.

      * Save:

        * Metrics to summary list

        * Retrieved passage details (texts, scores, passage IDs) to JSON files

* 7️⃣ Aggregate and Report Final Results

  After evaluating all model-context-method combinations:

    * a. Combine All Metrics

      * Collect all results (BM25, Dense, Hybrid) into a pandas DataFrame.

      * Each row represents a unique configuration:

        * Retriever model

        * Graph context setting

        * Retrieval method

        * Recall@10 and MAP@10 scores

    * b. Sort and Rank
      * Sort the DataFrame in descending order of Recall@10 and MAP@10 to rank retrievers.

      * The top row represents the best-performing configuration across all methods.

    * c. Save Final Outputs
      * Save the summary DataFrame as a CSV file (experiment_1_final_retriever_comparison_results.csv)

      * Print the complete table in the notebook for inspection.

      * Highlight and print the best-performing configuration.

      * All per-query results (passage texts, scores) are also saved as JSON files in the output folder.
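
The fusion in step 6d can be illustrated with a small, self-contained sketch. It mirrors the formula above with 1-based ranks and k = 60; the passage IDs `p1`, `p3`, `p7`, and `p9` are made up for illustration and do not come from the dataset.

```python
def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse ranked ID lists; ranks are 1-based, so a top hit adds 1 / (k + 1)."""
    fused = {}
    for ranking in ranked_lists:
        for rank, uid in enumerate(ranking, start=1):
            fused[uid] = fused.get(uid, 0.0) + 1.0 / (k + rank)
    # Highest fused score first
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)


# Hypothetical rankings from the two retrievers for one query
dense_ranking = ["p3", "p1", "p7"]
bm25_ranking = ["p1", "p9", "p3"]

fused = reciprocal_rank_fusion([dense_ranking, bm25_ranking])
for uid, score in fused:
    print(f"{uid}: {score:.6f}")
```

Passages that both retrievers rank highly accumulate two terms and rise to the top, while passages found by only one retriever keep a single, smaller term.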

In [None]:
# ==============================================================================
# Final Retriever Evaluation: Pre-trained vs. Fine-Tuned vs. Advanced
#
# Purpose:
# 1. Evaluate all retriever models: pre-trained, standard fine-tuned, and
#    advanced fine-tuned (with hard negatives).
# 2. Apply instruction prefixes during inference for relevant models.
# 3. Compare BM25, Dense, and Hybrid retrieval methods to identify the
#    definitive champion retriever.
# 4. Save both a CSV summary and detailed JSON outputs for each run.
# ==============================================================================

# --- Essential Installations ---
!pip install -q -U sentence-transformers transformers datasets rank_bm25 pytrec_eval

import os
import json
import torch
import pickle
import numpy as np
import pandas as pd
import networkx as nx
import pytrec_eval
from tqdm import tqdm
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer, util

# --- Configuration ---
BASE_PATH = "/content/drive/MyDrive/Colab Notebooks/RIRAG-MultiPassage-NLLP/"
GRAPH_PATH = os.path.join(BASE_PATH, "graph.gpickle")
TEST_SET_PATH = os.path.join(BASE_PATH, "QADataset", "ObliQA_MultiPassage_test.json")
QREL_PATH = os.path.join(BASE_PATH, "qrels.trec")
EMBEDDINGS_FOLDER = os.path.join(BASE_PATH, "embeddings_full_comparison")

# --- Output Folders ---
RESULTS_CSV_OUTPUT_PATH = os.path.join(BASE_PATH, "experiment_1_final_retriever_comparison_results.csv")
RESULTS_JSON_OUTPUT_FOLDER = os.path.join(BASE_PATH, "experiment_1_retrieval_results_json")
os.makedirs(RESULTS_JSON_OUTPUT_FOLDER, exist_ok=True)

# --- Model Input Folders ---
FINETUNED_RETRIEVER_FOLDER = os.path.join(BASE_PATH, "fine_tuned_retrievers")
ADVANCED_FINETUNED_FOLDER = os.path.join(BASE_PATH, "fine_tuned_retrievers_advanced")

K = 100
device = "cuda" if torch.cuda.is_available() else "cpu"

# --- Models and Contexts ---
MODELS_TO_EVALUATE = {
    "e5-large-v2_FT_Advanced": os.path.join(ADVANCED_FINETUNED_FOLDER, "e5-large-v2"),
    "all-mpnet-base-v2_FT_Advanced": os.path.join(ADVANCED_FINETUNED_FOLDER, "all-mpnet-base-v2"),
    "bge-base-en-v1.5_FT_Advanced": os.path.join(ADVANCED_FINETUNED_FOLDER, "bge-base-en-v1.5"),
    "e5-large-v2_FT": os.path.join(FINETUNED_RETRIEVER_FOLDER, "e5-large-v2"),
    "all-mpnet-base-v2_FT": os.path.join(FINETUNED_RETRIEVER_FOLDER, "all-mpnet-base-v2"),
    "bge-base-en-v1.5_FT": os.path.join(FINETUNED_RETRIEVER_FOLDER, "bge-base-en-v1.5"),
    "e5-large-v2_Pretrained": "intfloat/e5-large-v2",
    "all-mpnet-base-v2_Pretrained": "sentence-transformers/all-mpnet-base-v2",
    "bge-base-en-v1.5_Pretrained": "BAAI/bge-base-en-v1.5",
}
CONTEXT_CONFIGS = ["passage_only", "parent", "parent_child", "parent_child_cites", "full_neighborhood"]

# --- Load Graph and Data ---
print("Loading graph and test data...")
with open(GRAPH_PATH, "rb") as f:
    G = pickle.load(f)
with open(TEST_SET_PATH, "r", encoding="utf-8") as f:
    test_data = json.load(f)
# Create a quick lookup for question text by QID
qid_to_question = {q["QuestionID"]: q["Question"] for q in test_data}
print(f"Loaded {len(test_data)} test questions.")

# --- Generate and Load QRELs ---
if not os.path.exists(QREL_PATH):
    print("QREL file not found. Generating...")
    with open(QREL_PATH, "w", encoding="utf-8") as f:
        for item in test_data:
            qid = item["QuestionID"]
            for passage in item["Passages"]:
                uid = f"{passage['DocumentID']}|||{passage['PassageID']}"
                f.write(f"{qid} 0 {uid} 1\n")
    print(f"✅ QREL saved to: {QREL_PATH}")
else:
    print("QREL file found.")

qrel = {}
with open(QREL_PATH, "r", encoding="utf-8") as f:
    for line in f:
        parts = line.strip().split()
        qid, _, uid, rel = parts[0], parts[1], " ".join(parts[2:-1]), int(parts[-1])
        qrel.setdefault(qid, {})[uid] = rel
print(f"Loaded QRELs for {len(qrel)} queries.")

# --- Prepare BM25 ---
print("Preparing BM25 index...")
all_passage_uids = [n for n, d in G.nodes(data=True) if d.get("type") == "Passage"]
# Create a map from combined UID to internal UID for text lookup
uid_map = {f"{G.nodes[uid].get('document_id')}|||{G.nodes[uid].get('passage_id')}": uid for uid in all_passage_uids}
corpus_texts = [G.nodes[uid].get("text", "") for uid in all_passage_uids]
tokenized_corpus = [text.split() for text in corpus_texts]
bm25 = BM25Okapi(tokenized_corpus)
print("BM25 ready.")

# --- Helper Functions ---
def add_instruction_to_query(query, model_key):
    if "e5" in model_key:
        return f"query: {query}"
    if "bge" in model_key:
        return f"Represent this sentence for searching relevant passages: {query}"
    return query

def evaluate_run(run_dict, qrel_dict, metrics={"recall_10", "map_cut_10"}):
    evaluator = pytrec_eval.RelevanceEvaluator(qrel_dict, metrics)
    results = evaluator.evaluate(run_dict)
    agg = {metric: pytrec_eval.compute_aggregated_measure(metric, [r.get(metric, 0.0) for r in results.values()]) for metric in metrics}
    return agg

def reciprocal_rank_fusion(ranked_lists, k=60):
    fused = {}
    for lst in ranked_lists:
        for rank, uid in enumerate(lst):
            fused[uid] = fused.get(uid, 0) + 1 / (k + rank + 1)
    return sorted(fused.items(), key=lambda x: x[1], reverse=True)

def format_run_for_json(run_dict, qid_to_question_map, uid_to_internal_uid_map, graph, top_n=10):
    output_list = []
    for qid, passages in run_dict.items():
        sorted_passages = sorted(passages.items(), key=lambda item: item[1], reverse=True)

        retrieved_passages_text = []
        retrieved_scores = []
        retrieved_ids = []

        for combined_uid, score in sorted_passages[:top_n]:
            internal_uid = uid_to_internal_uid_map.get(combined_uid)
            if internal_uid:
                retrieved_passages_text.append(graph.nodes[internal_uid].get("text", ""))
                retrieved_scores.append(score)
                retrieved_ids.append(internal_uid)

        output_list.append({
            "QuestionID": qid,
            "Question": qid_to_question_map.get(qid, ""),
            "RetrievedPassages": retrieved_passages_text,
            "RetrievedScores": retrieved_scores,
            "RetrievedIDs": retrieved_ids
        })
    return output_list

# --- Main Evaluation ---
all_results = []

# 1. BM25 Baseline
print("\n=== BM25 Retrieval ===")
bm25_run = {}
for q in tqdm(test_data, desc="BM25"):
    qid, query = q["QuestionID"], q["Question"]
    scores = bm25.get_scores(query.split())
    top_idxs = np.argsort(scores)[::-1][:K]
    bm25_run[qid] = {f"{G.nodes[all_passage_uids[idx]].get('document_id')}|||{G.nodes[all_passage_uids[idx]].get('passage_id')}": float(scores[idx]) for idx in top_idxs}
bm25_metrics = evaluate_run(bm25_run, qrel)
all_results.append({"Model": "BM25", "Context": "N/A", "Method": "Lexical Only", **bm25_metrics})
bm25_json = format_run_for_json(bm25_run, qid_to_question, uid_map, G)
with open(os.path.join(RESULTS_JSON_OUTPUT_FOLDER, "BM25_results.json"), "w") as f:
    json.dump(bm25_json, f, indent=4)

# 2. Dense and Hybrid Models
for model_key, model_path in MODELS_TO_EVALUATE.items():
    print(f"\n=== Evaluating Model: {model_key} ===")
    query_encoder = SentenceTransformer(model_path, device=device)
    for context_key in CONTEXT_CONFIGS:
        print(f"→ Context: {context_key}")
        emb_path = os.path.join(EMBEDDINGS_FOLDER, model_key, context_key, "embeddings.pkl")
        id_path = os.path.join(EMBEDDINGS_FOLDER, model_key, context_key, "passage_ids.json")

        try:
            with open(emb_path, "rb") as f: passage_embeddings = pickle.load(f)
            with open(id_path, "r") as f: passage_ids = json.load(f)
        except FileNotFoundError:
            print(f"⚠️ Embeddings not found for {model_key}/{context_key}. Skipping.")
            continue

        dense_run, hybrid_run = {}, {}
        embeddings_tensor = torch.tensor(passage_embeddings).to(device)

        for q in tqdm(test_data, desc=f"{model_key}/{context_key}"):
            qid, query = q["QuestionID"], q["Question"]

            instructed_query = add_instruction_to_query(query, model_key)
            query_emb = query_encoder.encode(instructed_query, convert_to_tensor=True, device=device)

            cos_scores = util.pytorch_cos_sim(query_emb, embeddings_tensor)[0]
            top_results = torch.topk(cos_scores, k=min(K, len(passage_ids)))

            dense_run[qid], dense_uids = {}, []
            for idx, score in zip(top_results.indices, top_results.values):
                uid = passage_ids[idx]
                node = G.nodes[uid]
                combined_uid = f"{node.get('document_id')}|||{node.get('passage_id')}"
                dense_run[qid][combined_uid] = float(score.item())
                dense_uids.append(uid)

            bm25_scores = bm25.get_scores(query.split())
            top_bm25 = np.argsort(bm25_scores)[::-1][:K]
            bm25_uids = [all_passage_uids[i] for i in top_bm25]

            fused_uids = reciprocal_rank_fusion([dense_uids, bm25_uids])[:K]
            hybrid_run[qid] = {f"{G.nodes[uid].get('document_id')}|||{G.nodes[uid].get('passage_id')}": score for uid, score in fused_uids}

        dense_metrics = evaluate_run(dense_run, qrel)
        all_results.append({"Model": model_key, "Context": context_key, "Method": "Dense", **dense_metrics})
        dense_json = format_run_for_json(dense_run, qid_to_question, uid_map, G)
        with open(os.path.join(RESULTS_JSON_OUTPUT_FOLDER, f"{model_key}_{context_key}_Dense_results.json"), "w") as f:
            json.dump(dense_json, f, indent=4)

        hybrid_metrics = evaluate_run(hybrid_run, qrel)
        all_results.append({"Model": model_key, "Context": context_key, "Method": "Hybrid", **hybrid_metrics})
        hybrid_json = format_run_for_json(hybrid_run, qid_to_question, uid_map, G)
        with open(os.path.join(RESULTS_JSON_OUTPUT_FOLDER, f"{model_key}_{context_key}_Hybrid_results.json"), "w") as f:
            json.dump(hybrid_json, f, indent=4)

# --- Save and Display Final Results ---
df = pd.DataFrame(all_results)
df = df.rename(columns={"recall_10": "Recall@10", "map_cut_10": "MAP@10"})
df = df.sort_values(by=["Recall@10", "MAP@10"], ascending=False)

print("\n📊 Final Evaluation Results:")
print(df.to_string(index=False, float_format="%.4f"))

df.to_csv(RESULTS_CSV_OUTPUT_PATH, index=False)
print(f"\n✅ CSV summary saved to: {RESULTS_CSV_OUTPUT_PATH}")
print(f"✅ Detailed JSON results saved to: {RESULTS_JSON_OUTPUT_FOLDER}")

print("\n--- 🏆 Best Performing Configuration ---")
print(df.iloc[0])


📊 Final Evaluation Results:
                        Model            Context       Method  MAP@10  Recall@10
               e5-large-v2_FT parent_child_cites       Hybrid  0.3319     0.4497
               e5-large-v2_FT  full_neighborhood       Hybrid  0.3318     0.4497
               e5-large-v2_FT       parent_child       Hybrid  0.3316     0.4497
               e5-large-v2_FT       passage_only       Hybrid  0.3288     0.4465
               e5-large-v2_FT             parent       Hybrid  0.3304     0.4459
               e5-large-v2_FT       passage_only        Dense  0.3158     0.4445
               e5-large-v2_FT             parent        Dense  0.3064     0.4433
         all-mpnet-base-v2_FT parent_child_cites       Hybrid  0.3208     0.4403
         all-mpnet-base-v2_FT  full_neighborhood       Hybrid  0.3205     0.4392
         all-mpnet-base-v2_FT       parent_child       Hybrid  0.3195     0.4385
         all-mpnet-base-v2_FT             parent       Hybrid  0.3207     0.4381
               e5-large-v2_FT       parent_child        Dense  0.3003     0.4345
          bge-base-en-v1.5_FT parent_child_cites       Hybrid  0.3180     0.4324
          bge-base-en-v1.5_FT  full_neighborhood       Hybrid  0.3179     0.4324
               e5-large-v2_FT  full_neighborhood        Dense  0.2996     0.4323
               e5-large-v2_FT parent_child_cites        Dense  0.2990     0.4323
         all-mpnet-base-v2_FT       passage_only       Hybrid  0.3197     0.4323
          bge-base-en-v1.5_FT       parent_child       Hybrid  0.3181     0.4312
          bge-base-en-v1.5_FT             parent       Hybrid  0.3188     0.4286
          bge-base-en-v1.5_FT       passage_only       Hybrid  0.3152     0.4227
          bge-base-en-v1.5_FT             parent        Dense  0.2857     0.4207
          bge-base-en-v1.5_FT       passage_only        Dense  0.2919     0.4196
       e5-large-v2_Pretrained       passage_only       Hybrid  0.3087     0.4176
      e5-large-v2_FT_Advanced parent_child_cites       Hybrid  0.3097     0.4174
       e5-large-v2_Pretrained parent_child_cites       Hybrid  0.3097     0.4174
       e5-large-v2_Pretrained  full_neighborhood       Hybrid  0.3088     0.4174
      e5-large-v2_FT_Advanced  full_neighborhood       Hybrid  0.3086     0.4174
      e5-large-v2_FT_Advanced             parent       Hybrid  0.3085     0.4174
       e5-large-v2_Pretrained             parent       Hybrid  0.3084     0.4174
      e5-large-v2_FT_Advanced       parent_child       Hybrid  0.3084     0.4167
       e5-large-v2_Pretrained       parent_child       Hybrid  0.3084     0.4167
      e5-large-v2_FT_Advanced       passage_only       Hybrid  0.3091     0.4165
  bge-base-en-v1.5_Pretrained parent_child_cites       Hybrid  0.2996     0.4153
 bge-base-en-v1.5_FT_Advanced parent_child_cites       Hybrid  0.2995     0.4142
  bge-base-en-v1.5_Pretrained  full_neighborhood       Hybrid  0.2993     0.4138
 bge-base-en-v1.5_FT_Advanced  full_neighborhood       Hybrid  0.2991     0.4127
  bge-base-en-v1.5_Pretrained       passage_only       Hybrid  0.2979     0.4116
 bge-base-en-v1.5_FT_Advanced       passage_only       Hybrid  0.2971     0.4116
  bge-base-en-v1.5_Pretrained       parent_child       Hybrid  0.2998     0.4101
all-mpnet-base-v2_FT_Advanced       passage_only       Hybrid  0.2905     0.4094
 all-mpnet-base-v2_Pretrained       passage_only       Hybrid  0.2905     0.4094
         all-mpnet-base-v2_FT       passage_only        Dense  0.2877     0.4094
          bge-base-en-v1.5_FT  full_neighborhood        Dense  0.2747     0.4091
 bge-base-en-v1.5_FT_Advanced       parent_child       Hybrid  0.2996     0.4090
          bge-base-en-v1.5_FT parent_child_cites        Dense  0.2744     0.4080
all-mpnet-base-v2_FT_Advanced             parent       Hybrid  0.2908     0.4066
          bge-base-en-v1.5_FT       parent_child        Dense  0.2722     0.4065
  bge-base-en-v1.5_Pretrained             parent       Hybrid  0.2989     0.4056
all-mpnet-base-v2_FT_Advanced       parent_child       Hybrid  0.2874     0.4051
 all-mpnet-base-v2_Pretrained  full_neighborhood       Hybrid  0.2871     0.4051
all-mpnet-base-v2_FT_Advanced  full_neighborhood       Hybrid  0.2869     0.4051
 bge-base-en-v1.5_FT_Advanced             parent       Hybrid  0.2985     0.4045
 all-mpnet-base-v2_Pretrained       parent_child       Hybrid  0.2870     0.4040
 all-mpnet-base-v2_Pretrained             parent       Hybrid  0.2903     0.4033
 all-mpnet-base-v2_Pretrained parent_child_cites       Hybrid  0.2869     0.4029
all-mpnet-base-v2_FT_Advanced parent_child_cites       Hybrid  0.2867     0.4029
         all-mpnet-base-v2_FT             parent        Dense  0.2817     0.4009
         all-mpnet-base-v2_FT parent_child_cites        Dense  0.2825     0.3975
         all-mpnet-base-v2_FT  full_neighborhood        Dense  0.2819     0.3949
         all-mpnet-base-v2_FT       parent_child        Dense  0.2789     0.3938
      e5-large-v2_FT_Advanced       passage_only        Dense  0.2890     0.3930
       e5-large-v2_Pretrained       passage_only        Dense  0.2892     0.3928
      e5-large-v2_FT_Advanced             parent        Dense  0.2826     0.3886
       e5-large-v2_Pretrained             parent        Dense  0.2823     0.3886
      e5-large-v2_FT_Advanced       parent_child        Dense  0.2740     0.3866
       e5-large-v2_Pretrained       parent_child        Dense  0.2738     0.3866
      e5-large-v2_FT_Advanced parent_child_cites        Dense  0.2749     0.3855
       e5-large-v2_Pretrained parent_child_cites        Dense  0.2746     0.3855
       e5-large-v2_Pretrained  full_neighborhood        Dense  0.2746     0.3855
      e5-large-v2_FT_Advanced  full_neighborhood        Dense  0.2747     0.3844
  bge-base-en-v1.5_Pretrained       passage_only        Dense  0.2751     0.3795
 bge-base-en-v1.5_FT_Advanced       passage_only        Dense  0.2748     0.3787
  bge-base-en-v1.5_Pretrained             parent        Dense  0.2690     0.3742
 bge-base-en-v1.5_FT_Advanced             parent        Dense  0.2687     0.3735
 all-mpnet-base-v2_Pretrained       passage_only        Dense  0.2552     0.3708
all-mpnet-base-v2_FT_Advanced       passage_only        Dense  0.2549     0.3708
  bge-base-en-v1.5_Pretrained  full_neighborhood        Dense  0.2542     0.3668
 bge-base-en-v1.5_FT_Advanced  full_neighborhood        Dense  0.2539     0.3660
  bge-base-en-v1.5_Pretrained       parent_child        Dense  0.2516     0.3657
 bge-base-en-v1.5_FT_Advanced       parent_child        Dense  0.2514     0.3649
 bge-base-en-v1.5_FT_Advanced parent_child_cites        Dense  0.2541     0.3634
  bge-base-en-v1.5_Pretrained parent_child_cites        Dense  0.2541     0.3630
all-mpnet-base-v2_FT_Advanced             parent        Dense  0.2456     0.3485
                         BM25                N/A Lexical Only  0.2611     0.3474
 all-mpnet-base-v2_Pretrained             parent        Dense  0.2454     0.3474
 all-mpnet-base-v2_Pretrained parent_child_cites        Dense  0.2271     0.3387
all-mpnet-base-v2_FT_Advanced parent_child_cites        Dense  0.2268     0.3387
 all-mpnet-base-v2_Pretrained  full_neighborhood        Dense  0.2269     0.3383
all-mpnet-base-v2_FT_Advanced  full_neighborhood        Dense  0.2267     0.3383
all-mpnet-base-v2_FT_Advanced       parent_child        Dense  0.2255     0.3357
 all-mpnet-base-v2_Pretrained       parent_child        Dense  0.2254     0.3357

✅ CSV summary saved to: /content/drive/MyDrive/Colab Notebooks/RIRAG-MultiPassage-NLLP/experiment_1_final_retriever_comparison_results.csv
✅ Detailed JSON results saved to: /content/drive/MyDrive/Colab Notebooks/RIRAG-MultiPassage-NLLP/experiment_1_retrieval_results_json

--- 🏆 Best Performing Configuration ---
Model            e5-large-v2_FT
Context      parent_child_cites
Method                   Hybrid
MAP@10                 0.331936
Recall@10              0.449664
Name: 38, dtype: object

In [None]:
# Import the dependencies for the sound alert
from IPython.display import Audio, display

def allDone():
  #display(Audio(url='https://www.myinstants.com/media/sounds/anime-wow-sound-effect.mp3', autoplay=True))
  display(Audio(url='https://www.myinstants.com/media/sounds/money-soundfx.mp3', autoplay=True))
# Replace the URL above with any audio file you prefer

allDone()

Key Observations from the Final Retriever Evaluation
Standard Fine-Tuning (_FT) Provides the Largest Gain: This is the most significant finding. The standard fine-tuned models (e5-large-v2_FT, all-mpnet-base-v2_FT, bge-base-en-v1.5_FT) consistently outperform their _Pretrained counterparts.

e5-large-v2_FT (Recall@10: 0.4497) improves on e5-large-v2_Pretrained (Recall@10: 0.4176) by about 3.2 points, a relative gain of roughly 7.7%, when each model's best Hybrid configuration is compared.

This indicates that adapting the retriever to the regulatory domain is an effective strategy.
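
As a quick check of the size of this effect, the sketch below computes the absolute and relative Recall@10 gains of standard fine-tuning for all three model families. The numbers are copied from the results table, taking the best Hybrid row for each model; the choice to compare best rows rather than matching contexts is an assumption of this sketch.

```python
# Best Hybrid Recall@10 per model family, copied from the results table
best_hybrid_recall = {
    "e5-large-v2": {"FT": 0.4497, "Pretrained": 0.4176},
    "all-mpnet-base-v2": {"FT": 0.4403, "Pretrained": 0.4094},
    "bge-base-en-v1.5": {"FT": 0.4324, "Pretrained": 0.4153},
}

gains = {}
for model, scores in best_hybrid_recall.items():
    absolute = scores["FT"] - scores["Pretrained"]
    relative = absolute / scores["Pretrained"]
    gains[model] = (absolute, relative)
    print(f"{model}: +{absolute:.4f} absolute, +{relative:.1%} relative")
```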

Advanced Fine-Tuning (_FT_Advanced) Did Not Help: This is an important negative result for the paper. The advanced fine-tuning with hard negatives and instruction tuning did not improve performance over the standard fine-tuning. In fact, e5-large-v2_FT_Advanced scores almost identically to e5-large-v2_Pretrained; with parent_child_cites and Hybrid retrieval, for example, both reach Recall@10 0.4174 and MAP@10 0.3097. This suggests that, for this dataset, the simpler fine-tuning approach was more effective.

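Hybrid is Still King: For all the top-performing models, the Hybrid method is consistently better than the Dense-only method, confirming our earlier finding. For e5-large-v2_FT with the passage_only context, for example, Hybrid raises Recall@10 from 0.4445 to 0.4465 and MAP@10 from 0.3158 to 0.3288.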

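Context Matters, but Less Than Fine-Tuning: The various context strategies (parent_child_cites, full_neighborhood, etc.) all perform very similarly for the best models. For e5-large-v2_FT with Hybrid retrieval, Recall@10 ranges only from 0.4459 (parent) to 0.4497 (parent_child_cites) across the five contexts. The most important factor by far was the standard fine-tuning.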
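
To put the two effects side by side, the following sketch compares the spread of Recall@10 across contexts with the gain from fine-tuning, using the e5-large-v2 Hybrid rows from the results table. It is only an arithmetic illustration of the table, not an additional experiment.

```python
# Recall@10 for e5-large-v2_FT with Hybrid retrieval, copied from the results table
recall_by_context = {
    "parent_child_cites": 0.4497,
    "full_neighborhood": 0.4497,
    "parent_child": 0.4497,
    "passage_only": 0.4465,
    "parent": 0.4459,
}
# Best Hybrid row for e5-large-v2_Pretrained (passage_only)
pretrained_best = 0.4176

context_spread = max(recall_by_context.values()) - min(recall_by_context.values())
fine_tuning_gain = max(recall_by_context.values()) - pretrained_best

print(f"Spread across contexts: {context_spread:.4f}")
print(f"Gain from fine-tuning:  {fine_tuning_gain:.4f}")
```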

Conclusion: We Have a New Champion
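The top three configurations tie on Recall@10 and differ by at most 0.0003 in MAP@10, so the margin between contexts is narrow. With that caveat, the best-performing retriever is: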

🏆 New Champion Retriever: The e5-large-v2_FT (standard fine-tuned) model, using the parent_child_cites context and the Hybrid method.

This configuration achieved the highest MAP@10 (0.3319) and the highest Recall@10 (0.4497, tied with the parent_child and full_neighborhood contexts of the same model).
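
For readers interpreting these two numbers, here is a minimal pure-Python sketch of how Recall@10 and average precision at 10 are computed for a single query. It assumes the trec_eval convention of dividing by the total number of relevant passages; the notebook itself relies on pytrec_eval, and the function names and passage IDs below are hypothetical.

```python
def recall_at_k(ranked_ids, relevant_ids, k=10):
    """Fraction of the relevant passages that appear in the top k."""
    hits = sum(1 for pid in ranked_ids[:k] if pid in relevant_ids)
    return hits / len(relevant_ids)


def average_precision_at_k(ranked_ids, relevant_ids, k=10):
    """Sum of precision at each relevant rank in the top k, over all relevant passages."""
    hits = 0
    precision_sum = 0.0
    for rank, pid in enumerate(ranked_ids[:k], start=1):
        if pid in relevant_ids:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant_ids)


# Hypothetical query with three relevant passages, two of them retrieved
ranked = ["a", "x", "b", "y", "z"]
relevant = {"a", "b", "c"}

recall = recall_at_k(ranked, relevant)
ap = average_precision_at_k(ranked, relevant)
print(f"Recall@10 = {recall:.4f}, AP@10 = {ap:.4f}")
```

Averaging these per-query values over the whole test set gives the Recall@10 and MAP@10 figures reported in the table.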

