<a href="https://colab.research.google.com/github/RegNLP/ContextAware-Regulatory-GraphRAG-ObliQAMP/blob/main/06_Experiment_3_Cross_Encoder_Pipeline.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
from google.colab import drive
drive.mount('/content/drive')

⚙️ Experiment 3: Full Pipeline Evaluation with Cross-Encoder Re-ranking
This experiment identifies the best-performing end-to-end pipeline for multi-passage question answering over regulatory documents. It builds on the strongest hybrid retrievers from Experiment 1 and applies Cross-Encoder re-ranking to further optimize the top retrieved passages.

✅ Evaluation Steps
1️⃣ Setup and Configuration
* Install required packages: sentence-transformers, transformers, rank_bm25, pytrec_eval, etc.

* Define input/output paths for:

  * Graph structure (graph.gpickle)

  * Test dataset (ObliQA_MultiPassage_test.json)

  * Fine-tuned retrievers and cross-encoders

  * Precomputed embeddings

  * Output folders for CSV and JSON files

2️⃣ Load Static Components
* Load the regulatory graph and extract passage nodes.

* Load the test question set.

* Load gold standard QRELs in TREC format.

* Build the BM25 index over all passage texts.
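For reference, a TREC-format QREL file holds one judgment per line: query ID, an unused iteration field, passage ID, and a relevance label. The sketch below parses such lines into the nested dictionary that pytrec_eval expects. The sample IDs are hypothetical, written in the `document_id|||passage_id` form this notebook uses for run keys, and joining the middle fields to tolerate spaces in passage IDs mirrors the loader in the code cell rather than anything required by the TREC format.

```python
def parse_qrels(lines):
    """Parse TREC-style qrel lines into {query_id: {passage_id: relevance}}."""
    qrels = {}
    for line in lines:
        parts = line.split()
        if len(parts) < 4:
            continue  # skip blank or malformed lines
        qid = parts[0]
        # parts[1] is the unused "iteration" column.
        passage_id = " ".join(parts[2:-1])  # tolerate spaces inside the ID
        relevance = int(parts[-1])
        qrels.setdefault(qid, {})[passage_id] = relevance
    return qrels


# Hypothetical judgments, for illustration only.
sample_lines = [
    "q1 0 doc1|||p3 1",
    "q1 0 doc2|||p7 1",
    "q2 0 doc1|||p9 1",
]
qrels = parse_qrels(sample_lines)
print(qrels["q1"])
```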

3️⃣ Champion Retriever Selection
Use the two top hybrid retrievers from Experiment 1:

* e5-large-v2_FT_parent_child_cites

* e5-large-v2_FT_Advanced_parent

For each retriever, dense retrieval and BM25 each produce a top-100 list, and the two lists are merged with Reciprocal Rank Fusion (RRF).
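For reference, Reciprocal Rank Fusion scores each passage by summing reciprocal ranks across the input rankings. With 1-based ranks and the smoothing constant $k = 60$ (the default of this notebook's `reciprocal_rank_fusion` helper), the fused score is:

$$\mathrm{RRF}(d) = \sum_{r \in R} \frac{1}{k + \mathrm{rank}_r(d)}$$

Here $R$ is the set of rankings (dense and BM25), and a passage that is absent from a ranking contributes nothing for that ranking.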

4️⃣ Initial Retrieval
For each question:

* Encode the query with the retriever's embedding model, adding the "query: " prefix that E5 models expect.

* Compute cosine similarity against the precomputed passage embeddings.

* Retrieve the top-100 dense matches.

* Retrieve the top-100 BM25 matches.

* Fuse the two rankings using Reciprocal Rank Fusion (RRF).

* Keep the top-25 fused candidates for re-ranking.
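The fusion and truncation at the end of this step can be sketched on toy data. The passage IDs below are hypothetical, and `rrf_fuse` is a simplified stand-in for the notebook's helper that uses the same constant, k=60.

```python
def rrf_fuse(ranked_lists, k=60):
    """Fuse ranked ID lists with Reciprocal Rank Fusion (best first)."""
    fused = {}
    for ranking in ranked_lists:
        # Ranks are 1-based: the top item of each list contributes 1 / (k + 1).
        for rank, pid in enumerate(ranking, start=1):
            fused[pid] = fused.get(pid, 0.0) + 1.0 / (k + rank)
    return sorted(fused, key=fused.get, reverse=True)


# Hypothetical rankings from the two retrievers.
dense_ranking = ["p1", "p2", "p3"]
bm25_ranking = ["p3", "p1", "p4"]

fused = rrf_fuse([dense_ranking, bm25_ranking])
top_candidates = fused[:2]  # the notebook keeps the top 25; 2 here for brevity
print(fused)  # ['p1', 'p3', 'p2', 'p4']
```

Passages ranked by both retrievers (p1 and p3) rise above those seen by only one, which is the behavior the hybrid step relies on.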

5️⃣ Cross-Encoder Re-ranking
For each Cross-Encoder model (both fine-tuned and pretrained):

* Score each of the 25 candidate passages using the Cross-Encoder model.

* Keep top-20 passages based on relevance scores.
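How raw cross-encoder outputs become a ranking can be sketched without loading a model. The logits below are hypothetical. The conversion rule (positive-class probability for a two-logit head, raw logit for a single-logit head) follows the code cell in this notebook, and `relevance_scores` and `rerank` are illustrative helper names.

```python
import math


def relevance_scores(logits):
    """Convert per-pair logits to one relevance score per candidate."""
    scores = []
    for row in logits:
        if len(row) > 1:
            # Two-class head: softmax probability of class index 1.
            peak = max(row)
            exps = [math.exp(x - peak) for x in row]
            scores.append(exps[1] / sum(exps))
        else:
            # Single-logit head: use the raw logit as the score.
            scores.append(row[0])
    return scores


def rerank(candidate_ids, logits, k_final):
    """Sort candidates by score (highest first) and keep the top k_final."""
    scored = zip(candidate_ids, relevance_scores(logits))
    ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)
    return ranked[:k_final]


# Hypothetical single-logit outputs for three candidates.
candidates = ["p1", "p2", "p3"]
logits = [[-1.2], [3.4], [0.5]]
top = rerank(candidates, logits, k_final=2)
print([pid for pid, _ in top])  # ['p2', 'p3']
```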

Evaluate with:

* Recall@10: fraction of a question's relevant passages that appear in the top 10, averaged over questions (pytrec_eval's recall_10)

* MAP@10: mean average precision over the top 10 results (pytrec_eval's map_cut_10)
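As a worked example of the two metrics, the sketch below computes them for a single query in plain Python. It assumes the trec_eval conventions that pytrec_eval wraps: recall at 10 divides the relevant passages found in the top 10 by all relevant passages, and average precision at 10 sums the precision at each relevant rank within the top 10 and divides by the total number of relevant passages. The IDs are hypothetical, and the reported figures are these per-query values averaged over all questions.

```python
def recall_at_k(ranked_ids, relevant_ids, k=10):
    """Fraction of all relevant passages that appear in the top k."""
    hits = sum(1 for pid in ranked_ids[:k] if pid in relevant_ids)
    return hits / len(relevant_ids)


def average_precision_at_k(ranked_ids, relevant_ids, k=10):
    """Precision summed at each relevant rank in the top k, over all relevant."""
    hits, precision_sum = 0, 0.0
    for rank, pid in enumerate(ranked_ids[:k], start=1):
        if pid in relevant_ids:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant_ids)


# Hypothetical ranking and gold set for one question.
ranked = ["p4", "p1", "p9", "p2", "p7"]
relevant = {"p1", "p2", "p3"}

recall = recall_at_k(ranked, relevant)                     # 2 of 3 relevant found
avg_precision = average_precision_at_k(ranked, relevant)   # (1/2 + 2/4) / 3
print(recall, avg_precision)
```

Because a question here can have several gold passages, recall at 10 can be well below 1.0 even when the top 10 contains a correct passage.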

Cross-Encoder models evaluated:

* ✅ Fine-tuned: MiniLM_FT, MPNet_FT, MSMarco_FT, BERT_FT

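* 🧪 Pretrained: MiniLM (cross-encoder/ms-marco-MiniLM-L-6-v2), MSMarco (cross-encoder/ms-marco-TinyBERT-L-2-v2), BERT (bert-base-uncased, which has no trained relevance head, so its scores come from a randomly initialized classifier)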

6️⃣ Save Results
* Save each re-ranked run as a JSON file (top-10 passages, scores, and IDs per question).

* Aggregate metrics across all configurations into a summary CSV.

* Sort results by Recall@10, breaking ties by MAP@10, to identify the top pipeline.

🎯 Objective
To pinpoint the most effective retriever + re-ranker combination for regulatory QA, and to measure how much learned re-ranking adds on top of strong graph-aware retrievers.

In [None]:
# ==============================================================================
# Experiment 3: Final Pipeline Evaluation
#
# Purpose:
# 1. Use the two champion hybrid retrievers (standard fine-tuned and advanced
#    fine-tuned e5-large-v2) to generate initial candidate passages.
# 2. Re-rank these candidates using both fine-tuned and pre-trained
#    Cross-Encoder models.
# 3. Evaluate the final results to identify the definitive best-performing
#    end-to-end pipeline.
# 4. Save both a CSV summary and detailed JSON outputs for each run.
# ==============================================================================

# --- Essential Installations ---
!pip install -q -U sentence-transformers transformers datasets rank_bm25 pytrec_eval

import os
import json
import torch
import pickle
import numpy as np
import pandas as pd
import networkx as nx
import pytrec_eval
from tqdm import tqdm
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer, util
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# --- Configuration ---
BASE_PATH = "/content/drive/MyDrive/Colab Notebooks/RIRAG-MultiPassage-NLLP/"
GRAPH_PATH = os.path.join(BASE_PATH, "graph.gpickle")
TEST_SET_PATH = os.path.join(BASE_PATH, "QADataset", "ObliQA_MultiPassage_test.json")
QREL_PATH = os.path.join(BASE_PATH, "qrels.trec")

# --- Model & Embedding Input Folders ---
FINETUNED_RETRIEVER_FOLDER = os.path.join(BASE_PATH, "fine_tuned_retrievers")
ADVANCED_FINETUNED_FOLDER = os.path.join(BASE_PATH, "fine_tuned_retrievers_advanced")
EMBEDDINGS_FOLDER = os.path.join(BASE_PATH, "embeddings_full_comparison")
CROSS_ENCODER_FOLDER = os.path.join(BASE_PATH, "fine_tuned_cross_encoders")

# --- Output Paths ---
RESULTS_CSV_OUTPUT_PATH = os.path.join(BASE_PATH, "experiment_3_final_pipeline_results.csv")
RESULTS_JSON_OUTPUT_FOLDER = os.path.join(BASE_PATH, "experiment_3_retrieval_results_json")
os.makedirs(RESULTS_JSON_OUTPUT_FOLDER, exist_ok=True)


# --- Champion Retriever Configurations (from final retriever evaluation) ---
CHAMPION_CONFIGS = [
    {
        "name": "e5-large-v2_FT_parent_child_cites",
        "model_key": "e5-large-v2_FT",
        "model_path": os.path.join(FINETUNED_RETRIEVER_FOLDER, "e5-large-v2"),
        "context_key": "parent_child_cites"
    },
    {
        "name": "e5-large-v2_FT_Advanced_parent",
        "model_key": "e5-large-v2_FT_Advanced",
        "model_path": os.path.join(ADVANCED_FINETUNED_FOLDER, "e5-large-v2"),
        "context_key": "parent"
    }
]

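# --- Cross-Encoder Models to Evaluate ---
# NOTE: "bert-base-uncased" has no trained sequence-classification head, so
# AutoModelForSequenceClassification initializes one randomly. Treat
# BERT_Pretrained as an untrained baseline rather than a tuned re-ranker.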
CROSS_ENCODERS_TO_EVALUATE = {
    "MiniLM_FT": os.path.join(CROSS_ENCODER_FOLDER, "MiniLM_CrossEncoder"),
    "MPNet_FT": os.path.join(CROSS_ENCODER_FOLDER, "MPNet_CrossEncoder"),
    "MSMarco_FT": os.path.join(CROSS_ENCODER_FOLDER, "MSMarco_CrossEncoder"),
    "BERT_FT": os.path.join(CROSS_ENCODER_FOLDER, "BERT_CrossEncoder"),
    "MiniLM_Pretrained": "cross-encoder/ms-marco-MiniLM-L-6-v2",
    "MSMarco_Pretrained": "cross-encoder/ms-marco-TinyBERT-L-2-v2",
    "BERT_Pretrained": "bert-base-uncased"
}

# --- Experiment Parameters ---
K_INITIAL = 100
K_RERANK = 25
K_FINAL = 20

device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")

# --- Load Static Components ---
print("Loading static components...")
# The graph is a pickled NetworkX object (nx.read_gpickle was removed in NetworkX 3.0).
with open(GRAPH_PATH, "rb") as f:
    G = pickle.load(f)
with open(TEST_SET_PATH, "r", encoding="utf-8") as f:
    test_data = json.load(f)
qid_to_question = {q["QuestionID"]: q["Question"] for q in test_data}
print(f"Loaded {len(test_data)} test questions.")

# Load QRELs
qrel = {}
with open(QREL_PATH, "r", encoding="utf-8") as f:
    for line in f:
        parts = line.strip().split()
        qid, _, uid, rel = parts[0], parts[1], " ".join(parts[2:-1]), int(parts[-1])
        qrel.setdefault(qid, {})[uid] = rel
print(f"Loaded QRELs for {len(qrel)} queries.")

# Prepare BM25
print("Preparing BM25 index...")
all_passage_uids = [n for n, d in G.nodes(data=True) if d.get("type") == "Passage"]
uid_map = {f"{G.nodes[uid].get('document_id')}|||{G.nodes[uid].get('passage_id')}": uid for uid in all_passage_uids}
corpus_texts = [G.nodes[uid].get("text", "") for uid in all_passage_uids]
tokenized_corpus = [text.split() for text in corpus_texts]
bm25 = BM25Okapi(tokenized_corpus)
print("BM25 ready.")


# --- Helper Functions ---
def add_instruction_to_query(query, model_name):
    """Prepend the E5-style 'query: ' prefix when the model name contains 'e5' or 'bge'."""
    if "e5" in model_name or "bge" in model_name:
        return f"query: {query}"
    return query

def evaluate_run(run_dict, qrel_dict, metrics={"recall_10", "map_cut_10"}):
    """Score a run against the QRELs and return each metric aggregated over queries."""
    evaluator = pytrec_eval.RelevanceEvaluator(qrel_dict, metrics)
    results = evaluator.evaluate(run_dict)
    agg = {
        metric: pytrec_eval.compute_aggregated_measure(
            metric, [r.get(metric, 0.0) for r in results.values()]
        )
        for metric in metrics
    }
    return agg

def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse ranked UID lists with RRF and return UIDs sorted by fused score, best first."""
    fused = {}
    for lst in ranked_lists:
        # enumerate is 0-based, so (rank + 1) is the 1-based rank used by RRF.
        for rank, uid in enumerate(lst):
            fused[uid] = fused.get(uid, 0) + 1 / (k + rank + 1)
    return sorted(fused.keys(), key=lambda item: fused[item], reverse=True)

def format_run_for_json(run_dict, qid_to_question_map, uid_to_internal_uid_map, graph, top_n=10):
    output_list = []
    for qid, passages in run_dict.items():
        sorted_passages = sorted(passages.items(), key=lambda item: item[1], reverse=True)

        retrieved_passages_text, retrieved_scores, retrieved_ids = [], [], []
        for combined_uid, score in sorted_passages[:top_n]:
            internal_uid = uid_to_internal_uid_map.get(combined_uid)
            if internal_uid:
                retrieved_passages_text.append(graph.nodes[internal_uid].get("text", ""))
                retrieved_scores.append(score)
                retrieved_ids.append(internal_uid)

        output_list.append({
            "QuestionID": qid, "Question": qid_to_question_map.get(qid, ""),
            "RetrievedPassages": retrieved_passages_text, "RetrievedScores": retrieved_scores,
            "RetrievedIDs": retrieved_ids
        })
    return output_list

# --- Main Experiment Loop ---
all_results = []

for config in CHAMPION_CONFIGS:
    champion_name = config["name"]
    model_path = config["model_path"]
    model_key = config["model_key"]
    context_key = config["context_key"]

    print("\n" + "="*80)
    print(f"--- TESTING CHAMPION RETRIEVER: {champion_name} ---")
    print("="*80)

    # Load Embeddings & Query Encoder
    print(f"Loading embeddings from: {EMBEDDINGS_FOLDER}")
    emb_path = os.path.join(EMBEDDINGS_FOLDER, model_key, context_key, "embeddings.pkl")
    id_path = os.path.join(EMBEDDINGS_FOLDER, model_key, context_key, "passage_ids.json")
    try:
        with open(emb_path, "rb") as f: passage_embeddings = pickle.load(f)
        with open(id_path, "r") as f: passage_ids = json.load(f)
        embeddings_tensor = torch.tensor(passage_embeddings).to(device)
        query_encoder = SentenceTransformer(model_path, device=device)
        print("Champion retriever components loaded successfully.")
    except FileNotFoundError:
        print(f"FATAL ERROR: Embeddings not found at {emb_path}. Skipping this champion.")
        continue

    # Pre-calculate initial retrievals
    print("Pre-calculating initial hybrid retrievals...")
    initial_retrievals = {}
    for q in tqdm(test_data, desc=f"Initial Retrieval for {champion_name}"):
        qid, query = q["QuestionID"], q["Question"]
        instructed_query = add_instruction_to_query(query, champion_name)
        query_emb = query_encoder.encode(instructed_query, convert_to_tensor=True, device=device)
        cos_scores = util.pytorch_cos_sim(query_emb, embeddings_tensor)[0]
        top_dense = torch.topk(cos_scores, k=min(K_INITIAL, len(passage_ids)))
        # Convert the index tensor to plain ints before indexing the Python list.
        dense_uids = [passage_ids[idx] for idx in top_dense.indices.tolist()]

        bm25_scores = bm25.get_scores(query.split())
        top_bm25 = np.argsort(bm25_scores)[::-1][:K_INITIAL]
        bm25_uids = [all_passage_uids[i] for i in top_bm25]

        fused_uids = reciprocal_rank_fusion([dense_uids, bm25_uids])
        initial_retrievals[qid] = fused_uids[:K_RERANK]

    for ce_name, ce_path in CROSS_ENCODERS_TO_EVALUATE.items():
        print("\n" + "-"*80)
        print(f"--- Evaluating Cross-Encoder: {ce_name} (with {champion_name}) ---")
        print("-" * 80)

        try:
            ce_tokenizer = AutoTokenizer.from_pretrained(ce_path)
            ce_model = AutoModelForSequenceClassification.from_pretrained(ce_path).to(device)
            ce_model.eval()
        except Exception as e:
            print(f"⚠️  Could not load model from {ce_path}. Skipping. Error: {e}")
            continue

        final_run = {}
        for q in tqdm(test_data, desc=f"Re-ranking with {ce_name}"):
            qid, query = q["QuestionID"], q["Question"]
            candidates_uids = initial_retrievals[qid]

            ce_input_pairs = [[query, G.nodes[uid].get("text", "")] for uid in candidates_uids]

            reranked_candidates = []
            with torch.no_grad():
                inputs = ce_tokenizer(ce_input_pairs, padding=True, truncation=True, return_tensors="pt", max_length=512).to(device)
                logits = ce_model(**inputs).logits

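                if logits.shape[1] > 1:
                    # Two-class head: use the probability of the positive class.
                    scores = torch.softmax(logits, dim=1)[:, 1].cpu().numpy()
                else:
                    # Single-logit head: squeeze only the last dimension so that
                    # a single candidate still yields a 1-D array of scores.
                    scores = logits.squeeze(-1).cpu().numpy()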

            for i, uid in enumerate(candidates_uids):
                reranked_candidates.append({"internal_uid": uid, "score": scores[i]})

            reranked_candidates = sorted(reranked_candidates, key=lambda x: x["score"], reverse=True)

            final_run[qid] = {}
            for cand in reranked_candidates[:K_FINAL]:
                node = G.nodes[cand["internal_uid"]]
                combined_uid = f"{node.get('document_id')}|||{node.get('passage_id')}"
                final_run[qid][combined_uid] = float(cand["score"])

        ce_metrics = evaluate_run(final_run, qrel)
        all_results.append({
            "Retriever": champion_name, "Cross-Encoder": ce_name, **ce_metrics
        })

        json_output_path = os.path.join(RESULTS_JSON_OUTPUT_FOLDER, f"{champion_name}_{ce_name}_results.json")
        json_data = format_run_for_json(final_run, qid_to_question, uid_map, G)
        with open(json_output_path, 'w') as f:
            json.dump(json_data, f, indent=4)

# --- Save and Display Final Results ---
df = pd.DataFrame(all_results)
# Rename the pytrec_eval metric keys to readable column names
df = df.rename(columns={"recall_10": "Recall@10", "map_cut_10": "MAP@10"})
# Select and reorder columns for the final output
final_df = df[["Retriever", "Cross-Encoder", "Recall@10", "MAP@10"]]
# Sort by Recall@10 first, using MAP@10 to break ties
final_df = final_df.sort_values(by=["Recall@10", "MAP@10"], ascending=False)

print("\n📊 Final Evaluation Results:")
print(final_df.to_string(index=False, float_format="%.4f"))

final_df.to_csv(RESULTS_CSV_OUTPUT_PATH, index=False)
print(f"\n✅ CSV summary saved to: {RESULTS_CSV_OUTPUT_PATH}")
print(f"✅ Detailed JSON results saved to: {RESULTS_JSON_OUTPUT_FOLDER}")

print("\n--- 🏆 Best Performing Full Pipeline (by Recall@10) ---")
print(final_df.iloc[0])


📊 Final Evaluation Results:
                        Retriever      Cross-Encoder  Recall@10  MAP@10
e5-large-v2_FT_parent_child_cites            BERT_FT     0.4580  0.3180
e5-large-v2_FT_parent_child_cites         MSMarco_FT     0.4519  0.3345
e5-large-v2_FT_parent_child_cites           MPNet_FT     0.4516  0.3543
e5-large-v2_FT_parent_child_cites MSMarco_Pretrained     0.4507  0.3248
e5-large-v2_FT_parent_child_cites  MiniLM_Pretrained     0.4502  0.3378
e5-large-v2_FT_parent_child_cites          MiniLM_FT     0.4499  0.2762
   e5-large-v2_FT_Advanced_parent            BERT_FT     0.4434  0.3205
   e5-large-v2_FT_Advanced_parent           MPNet_FT     0.4427  0.3478
   e5-large-v2_FT_Advanced_parent         MSMarco_FT     0.4407  0.3276
   e5-large-v2_FT_Advanced_parent          MiniLM_FT     0.4388  0.3075
   e5-large-v2_FT_Advanced_parent  MiniLM_Pretrained     0.4304  0.3242
   e5-large-v2_FT_Advanced_parent MSMarco_Pretrained     0.4271  0.3102
e5-large-v2_FT_parent_child_cites    BERT_Pretrained     0.4151  0.2053
   e5-large-v2_FT_Advanced_parent    BERT_Pretrained     0.1580  0.0315
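
Because the summary is sorted by Recall@10 first, the top row is not necessarily the best configuration by MAP@10. A small sketch, using four rows copied from the table above, shows how the winner changes with the sort key. The dictionary keys and variable names are illustrative.

```python
# Four rows copied from the results table (Recall@10, MAP@10).
rows = [
    {"retriever": "e5-large-v2_FT_parent_child_cites", "cross_encoder": "BERT_FT",
     "recall_10": 0.4580, "map_10": 0.3180},
    {"retriever": "e5-large-v2_FT_parent_child_cites", "cross_encoder": "MSMarco_FT",
     "recall_10": 0.4519, "map_10": 0.3345},
    {"retriever": "e5-large-v2_FT_parent_child_cites", "cross_encoder": "MPNet_FT",
     "recall_10": 0.4516, "map_10": 0.3543},
    {"retriever": "e5-large-v2_FT_Advanced_parent", "cross_encoder": "MPNet_FT",
     "recall_10": 0.4427, "map_10": 0.3478},
]

# Primary key first, secondary key as the tie-breaker.
best_by_recall = max(rows, key=lambda r: (r["recall_10"], r["map_10"]))
best_by_map = max(rows, key=lambda r: (r["map_10"], r["recall_10"]))

print(best_by_recall["cross_encoder"])  # BERT_FT
print(best_by_map["cross_encoder"])     # MPNet_FT
```

Across the full table, the highest MAP@10 (0.3543) likewise belongs to MPNet_FT paired with e5-large-v2_FT_parent_child_cites, while BERT_FT with the same retriever leads on Recall@10 (0.4580).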

In [None]:
# Import sound alert dependencies
from IPython.display import Audio, display

def allDone():
  #display(Audio(url='https://www.myinstants.com/media/sounds/anime-wow-sound-effect.mp3', autoplay=True))
  display(Audio(url='https://www.myinstants.com/media/sounds/money-soundfx.mp3', autoplay=True))
## Insert whatever audio file you want above

allDone()