# Information Retrieval Evaluation Pipeline

This notebook implements a comprehensive evaluation pipeline for information retrieval systems using FAISS indices and BGE-M3 embeddings. The pipeline includes:
- Retrieval evaluation with multiple metrics
- Visualization of results
- Detailed performance analysis by categories
- Report generation

First, let's set up our environment and install required dependencies.

In [9]:
!pip install torch transformers sentence-transformers rank_bm25 nltk

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
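
This warning most likely appears because `!pip` forks the notebook process after the tokenizers library has already run in parallel. As the message itself suggests, one way to silence it is to set `TOKENIZERS_PARALLELISM` before any tokenizer is used. A minimal sketch, assuming `false` is the preferred setting here (`true` is also accepted):

```python
import os

# Must run before the first tokenizer call, ideally in the very first cell.
# "false" is an assumption; it trades tokenizer parallelism for fork safety.
os.environ["TOKENIZERS_PARALLELISM"] = "false"
```

Setting the variable in the environment that launches Jupyter should work as well.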




In [10]:
!pip install faiss-cpu FlagEmbedding tqdm numpy matplotlib seaborn

# from google.colab import drive
# drive.mount('/content/drive')

## Import Required Libraries

Let's import all the necessary Python libraries and configure the environment.

In [1]:

import multiprocessing
import faiss
import json
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from tqdm import tqdm
import os
from FlagEmbedding import BGEM3FlagModel
from app.utils.retrievers.evaluator import RetrievalEvaluator
from app.utils.retrievers.multi_step import QueryRewriterRetrieverSubcategory, Law2StepRetriever
from app.utils.retrievers.retriever import Retriever
from langchain_openai import ChatOpenAI



In [2]:
faiss.omp_set_num_threads(12)
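
The thread count above is hard-coded to 12. On a machine with fewer cores it may be preferable to cap the value at what is available. A small sketch, where `pick_num_threads` is a hypothetical helper and not part of FAISS:

```python
import os


def pick_num_threads(requested: int = 12) -> int:
    """Clamp the requested thread count to the range [1, available cores]."""
    available = os.cpu_count() or 1  # os.cpu_count() may return None
    return max(1, min(requested, available))


n_threads = pick_num_threads(12)
# Pass the result to the call used in the cell above:
# faiss.omp_set_num_threads(n_threads)
```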

## Define Data Classes and Helper Functions

The core evaluation class, `RetrievalEvaluator`, is imported from `app.utils.retrievers.evaluator` in the cell above, so it is not redefined in this notebook.
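
To illustrate the kind of metric such an evaluator reports, here is a minimal sketch of Recall@k with a small results container. The names `RecallMetrics`, `recall_at_k`, and `mean_recall` are hypothetical and are not taken from `RetrievalEvaluator`, whose internals are not shown in this notebook.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, Sequence, Tuple


@dataclass
class RecallMetrics:
    """Hypothetical container: mean Recall@k keyed by k."""
    recall_at_k: Dict[int, float] = field(default_factory=dict)


def recall_at_k(retrieved: Sequence[int], relevant: Sequence[int], k: int) -> float:
    """Fraction of the relevant ids that appear among the top-k retrieved ids."""
    relevant_set = set(relevant)
    if not relevant_set:
        return 0.0
    hits = len(relevant_set.intersection(retrieved[:k]))
    return hits / len(relevant_set)


def mean_recall(
    runs: Sequence[Tuple[Sequence[int], Sequence[int]]],
    k_values: Iterable[int],
) -> RecallMetrics:
    """Average Recall@k over (retrieved_ids, relevant_ids) pairs."""
    metrics = RecallMetrics()
    for k in k_values:
        scores = [recall_at_k(ret, rel, k) for ret, rel in runs]
        metrics.recall_at_k[k] = sum(scores) / len(scores) if scores else 0.0
    return metrics


# Toy ids for illustration only; these are not from the QA dataset.
toy_runs = [([5, 1, 9], [1, 2]), ([7, 3], [3])]
toy_metrics = mean_recall(toy_runs, [1, 3])
```

Under this definition a query with two relevant documents scores 0.5 when only one of them is in the top k. The project's evaluator may define recall differently.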

## Visualization Functions

Plots are generated by `save_results`, which is called on each evaluator in the pipeline below and writes one metrics figure per approach.
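
For readers who want a standalone plot, the sketch below draws one Recall@k curve per approach with matplotlib. Both `curve_points` and `plot_recall_curves` are hypothetical helpers. They do not reproduce the figure that `RetrievalEvaluator` saves, and the input format (`{approach: {k: recall}}`) is an assumption.

```python
from typing import Dict, List, Tuple


def curve_points(recall_by_k: Dict[int, float]) -> Tuple[List[int], List[float]]:
    """Sort a {k: recall} mapping into parallel x/y lists for plotting."""
    ks = sorted(recall_by_k)
    return ks, [recall_by_k[k] for k in ks]


def plot_recall_curves(results: Dict[str, Dict[int, float]], out_path: str) -> None:
    """Draw one Recall@k line per approach and save the figure to out_path."""
    # Imported here so curve_points stays usable without matplotlib installed.
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots(figsize=(8, 5))
    for name, recall_by_k in results.items():
        ks, values = curve_points(recall_by_k)
        ax.plot(ks, values, marker="o", label=name)
    ax.set_xscale("log")  # the k values used here span 1 to 100
    ax.set_xlabel("k")
    ax.set_ylabel("Recall@k")
    ax.set_title("Recall@k by approach")
    ax.legend()
    fig.tight_layout()
    fig.savefig(out_path, dpi=150)
    plt.close(fig)
```

A logarithmic x-axis is one reasonable choice for cut-offs spaced like `[1, 3, 5, 10, 20, 40, 80, 100]`; a linear axis would crowd the small values of k together.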

## Run Evaluation Pipeline

Now let's run the evaluation pipeline. The steps are:
1. Initialize the BGE-M3 model
2. Set up paths for the FAISS index, the documents, and the QA dataset
3. Create an evaluator instance for each retrieval approach
4. Run the evaluation
5. Generate visualizations and reports

In [3]:
# Configurations
retriever_model = BGEM3FlagModel('BAAI/bge-m3', use_fp16=True)
metric_type = 'ip'
metric_suffix = 'V2' if metric_type == 'l2' else ''

index_path = 'data/m3_legal_faiss_brief.index'
qa_dataset_path = 'data/evaluation_data/relational_qa_dataset.json'
documents_path = 'data/saudi_laws_scraped.json'
k_values = [1, 3, 5, 10, 20, 40, 80, 100]

# Load QA dataset
with open(qa_dataset_path, 'r', encoding='utf-8') as f:
    data = json.load(f)
    # We assume qa_pairs is a list of dicts, e.g.:
    # [{"query": "...", "relevant_ids": [123, 456]}, ...]
    qa_pairs = data.get('qa_pairs', [])
print(f"✓ Loaded {len(qa_pairs)} QA pairs")

print(f"Using {metric_type.upper()} metric with index: {index_path}")

retriever = Retriever(
    faiss_index_path=index_path,
    documents_path=documents_path,
    embeddings_model=retriever_model
)
# llm = ChatOpenAI(
#     openai_api_base="http://localhost:11434/v1",
#     api_key="good",
#     model="hf.co/unsloth/Qwen3-30B-A3B-Instruct-2507-GGUF:Q4_K_M",
#     temperature=0,
#     max_tokens=1000,
# )
# qw_retriever = QueryRewriterRetrieverSubcategory(
#         llm=llm,
#         faiss_index_path=index_path,
#         documents_path=documents_path,
#         embeddings_model=retriever_model,
# )

law_2_step_retriever = Law2StepRetriever(
    laws_faiss_index_path='data/m3_legal_faiss_laws.index',
    articles_faiss_index_path=index_path,
    documents_path=documents_path,
    embeddings_model=retriever_model,
)

evaluator_dense = RetrievalEvaluator(
    name="dense",
    approach_desc="Dense with IP Metric",
    model_name="BAAI/bge-m3",
    retrieve_function=retriever.dense,
    qa_pairs=qa_pairs,
    k_values=k_values,
    is_baseline=True,
)


evaluator_hybrid = RetrievalEvaluator(
    name="hybrid_dense_sparse",
    approach_desc="Hybrid (Dense + Sparse)",
    model_name="BAAI/bge-m3 + BM25",  # the retriever above embeds with BGE-M3, not bge-small
    retrieve_function=retriever.hybrid,
    qa_pairs=qa_pairs,
    k_values=k_values,
)

# evaluator_qw_retriever = RetrievalEvaluator(
#         name="",
#         approach_desc="Query Rewriter+ filtering + Hybrid Retriever",
#         model_name="Qwen3-bge-small + BM25",
#         retrieve_function =qw_retriever.retrieve,
#         qa_pairs=qa_pairs,
#         k_values=k_values,
# )

evaluator_law_2_step = RetrievalEvaluator(
    name="law_2_step_retriever",
    approach_desc="Law 2-Step + Hybrid Retriever",
    model_name="BAAI/bge-m3 + BM25",
    retrieve_function=law_2_step_retriever.retrieve,
    qa_pairs=qa_pairs,
    k_values=k_values,
)


Fetching 30 files: 100%|██████████| 30/30 [00:00<00:00, 111156.47it/s]


✓ Loaded 373 QA pairs
Using IP metric with index: data/m3_legal_faiss_brief.index
Initializing Retriever...
Loading documents and metadata...
✓ Loaded 16371 documents with metadata
✓ BM25 initialized successfully!
Initializing Retriever...
Loading documents and metadata...
✓ Loaded 16371 documents with metadata
✓ BM25 initialized successfully!
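
The loading cell assumes each QA pair is a dict with a `query` string and a `relevant_ids` list. A quick check before a multi-minute evaluation run can catch malformed entries early. The sketch below is a hypothetical helper built on that assumed schema; adjust the keys if the dataset uses different names.

```python
from typing import Any, Dict, List


def validate_qa_pairs(qa_pairs: List[Dict[str, Any]]) -> List[str]:
    """Return human-readable problems; an empty list means the data looks fine."""
    problems = []
    for i, pair in enumerate(qa_pairs):
        query = pair.get("query")
        relevant = pair.get("relevant_ids")
        if not isinstance(query, str) or not query.strip():
            problems.append(f"pair {i}: missing or empty 'query'")
        if not isinstance(relevant, list) or not relevant:
            problems.append(f"pair {i}: missing or empty 'relevant_ids'")
    return problems


# Toy input for illustration only; in the notebook, pass the loaded `qa_pairs`.
toy_pairs = [
    {"query": "example question", "relevant_ids": [123, 456]},
    {"query": "", "relevant_ids": []},
]
toy_problems = validate_qa_pairs(toy_pairs)
```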


In [4]:
# results = evaluator_qw_retriever.evaluate_all()

In [5]:
# evaluator_qw_retriever.save_results('.')
# from app.utils.retrievers.evaluator import compare_approaches

# compare_approaches([evaluator_qw_retriever], metric_type="ex")

In [6]:
from app.utils.retrievers.evaluator import compare_approaches
approaches = [evaluator_dense, evaluator_hybrid, evaluator_law_2_step]
EVAL_DIR = 'evaluation_pipeline/assets'
METRIC_TYPE = 'retrieval'
METRIC_SUFFIX = "relational_V0"
# 3. Run evaluations
print(f"\nStarting evaluation for {len(approaches)} approaches...")
for approach in approaches:
    approach.evaluate_all()


# 4. Save results (plot, report, details)
print("\nGenerating visualizations and saving results...")
for approach in approaches:
    approach.save_results(
        evaluation_dir=EVAL_DIR,
        metric_suffix=METRIC_SUFFIX,
        # METRIC_TYPE=METRIC_TYPE
    )

print("\n✓ Evaluation workflow completed successfully!")
# 5. Compare all approaches
compare_approaches(
    approaches,
    METRIC_TYPE
)


Starting evaluation for 3 approaches...
Evaluating 373 queries for 'dense'...


You're using a XLMRobertaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
100%|██████████| 373/373 [00:09<00:00, 38.55it/s]


Evaluating 373 queries for 'hybrid_dense_sparse'...


100%|██████████| 373/373 [00:10<00:00, 34.77it/s]


Evaluating 373 queries for 'law_2_step_retriever'...


100%|██████████| 373/373 [03:45<00:00,  1.66it/s]



Generating visualizations and saving results...
✓ Saved plot to evaluation_pipeline/assets/evaluation_metrics_dense_relational_V0.png
✓ Saved all results for 'dense' to evaluation_pipeline/assets
✓ Saved plot to evaluation_pipeline/assets/evaluation_metrics_hybrid_dense_sparse_relational_V0.png
✓ Saved all results for 'hybrid_dense_sparse' to evaluation_pipeline/assets
✓ Saved plot to evaluation_pipeline/assets/evaluation_metrics_law_2_step_retriever_relational_V0.png
✓ Saved all results for 'law_2_step_retriever' to evaluation_pipeline/assets

✓ Evaluation workflow completed successfully!
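
The comparison table that follows annotates each non-baseline cell with a percentage in parentheses, which appears to be the change relative to the baseline column. A minimal sketch of that arithmetic, assuming a plain relative difference (`relative_change` and `format_cell` are hypothetical helpers, not the code inside `compare_approaches`):

```python
def relative_change(baseline: float, value: float) -> float:
    """Percent change of `value` relative to `baseline`."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (value - baseline) / baseline * 100.0


def format_cell(baseline: float, value: float) -> str:
    """Render a metric the way the table does, e.g. '0.7500 (+50.0%)'."""
    return f"{value:.4f} ({relative_change(baseline, value):+.1f}%)"
```

Because the table prints metrics rounded to four decimals, recomputing a percentage from the printed values may differ from the printed percentage in the last digit.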

## METRIC COMPARISON: RETRIEVAL | BASELINE: Dense with IP Metric

| Metric        | Dense with IP Metric | Hybrid (Dense + Sparse) | Law 2-Step + Hybrid Retriever |
| -------------- | --------------------- | ------------------------ | ------------------------------ |
| Recall@1      | 0.0846               | 0.0892 (+5.5%)          | 0.0804 (-4.9%)                |
| Recall@3      | 0.1832          