## Embedding Model Evaluation

### 1 Comparison Strategy

To select the optimal embedding model for the EU AI Act RAG pipeline, we benchmarked three distinct architectures representing different trade-offs between speed, size, and semantic capability.

* **The Baseline: `all-MiniLM-L6-v2` (22M Params)**
  * *Role:* A common default for lightweight retrieval.
  * *Hypothesis:* While extremely fast, we expect MiniLM to struggle with the dense legal nuances of the AI Act, serving as a "performance floor" to validate the need for larger models.

* **The Modern Specialist: `snowflake-arctic-embed-m` (110M Params)**
  * *Role:* A mid-sized model specifically optimized for enterprise retrieval and RAG workloads.
  * *Hypothesis:* We expect this model to combine modern, retrieval-focused training with low inference latency.

* **The Heavyweight: `bge-m3` (567M Params)**
  * *Role:* One of the strongest open-source dense retrieval models.
  * *Hypothesis:* With multilingual support and long contexts (up to 8192 tokens), we expect BGE-M3 to offer the highest accuracy, but at a significantly higher computational cost.
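The short names used throughout the benchmark correspond to the Hugging Face model IDs printed in the run log. A minimal sketch of such a mapping is shown here; `MODEL_REGISTRY`, `resolve_model_id`, and `size_ratio` are hypothetical names chosen for illustration and are not taken from the project's `EmbeddingWithDB` class.

```python
# Hypothetical registry: short benchmark names -> Hugging Face model IDs.
# IDs and parameter counts come from this document; the structure is illustrative.
MODEL_REGISTRY = {
    "minilm": {"hf_id": "sentence-transformers/all-MiniLM-L6-v2", "params_m": 22},
    "snowflake-m": {"hf_id": "Snowflake/snowflake-arctic-embed-m", "params_m": 110},
    "bge-m3": {"hf_id": "BAAI/bge-m3", "params_m": 567},
}


def resolve_model_id(short_name: str) -> str:
    """Return the Hugging Face ID for a short benchmark name."""
    try:
        return MODEL_REGISTRY[short_name]["hf_id"]
    except KeyError:
        known = ", ".join(sorted(MODEL_REGISTRY))
        raise ValueError(f"Unknown model '{short_name}'. Known models: {known}") from None


def size_ratio(larger: str, smaller: str) -> float:
    """Ratio of parameter counts between two registered models."""
    return MODEL_REGISTRY[larger]["params_m"] / MODEL_REGISTRY[smaller]["params_m"]


if __name__ == "__main__":
    print(resolve_model_id("snowflake-m"))
    print(f"bge-m3 is {size_ratio('bge-m3', 'snowflake-m'):.1f}x the size of snowflake-m")
```

A resolved ID would typically be passed to `sentence_transformers.SentenceTransformer(...)` to load the model; whether the project does exactly this is an assumption.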



### 2 Benchmark Results

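We evaluated these models on our **Golden Test Set** (747 queries against 1,518 indexed chunks) using **Recall@K** (hit rate: the share of queries whose expected chunk appears in the top K results) and **MRR** (Mean Reciprocal Rank: the average of 1/rank of the first relevant result).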

| Model | Recall@1 | Recall@5 | Recall@10 | MRR |
| --- | --- | --- | --- | --- |
| **minilm** | 48.7% | 79.8% | 87.1% | 0.618 |
| **snowflake-m** | **70.1%** | 92.6% | 96.7% | **0.799** |
| **bge-m3** | 69.7% | **93.2%** | **97.5%** | 0.797 |
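The two metrics in the table can be sketched as follows. This is an illustrative implementation, not the project's `evaluate_retrieval` function: it assumes each query has exactly one relevant chunk, which is consistent with the logged Precision@K values equalling Recall@K divided by K. The function names and the toy data are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple


def recall_at_k(ranked_ids: Sequence[str], relevant_id: str, k: int) -> float:
    """1.0 if the relevant chunk is among the top-k results, else 0.0."""
    return 1.0 if relevant_id in ranked_ids[:k] else 0.0


def reciprocal_rank(ranked_ids: Sequence[str], relevant_id: str) -> float:
    """1/rank of the relevant chunk, or 0.0 if it was not retrieved."""
    for rank, chunk_id in enumerate(ranked_ids, start=1):
        if chunk_id == relevant_id:
            return 1.0 / rank
    return 0.0


def evaluate(
    results: List[Tuple[Sequence[str], str]], k_values: Sequence[int] = (1, 3)
) -> Dict[str, float]:
    """Average per-query scores over (ranked_ids, relevant_id) pairs."""
    n = len(results)
    metrics = {
        f"Recall@{k}": sum(recall_at_k(r, rel, k) for r, rel in results) / n
        for k in k_values
    }
    metrics["MRR"] = sum(reciprocal_rank(r, rel) for r, rel in results) / n
    return metrics


# Toy data: hit at rank 1, hit at rank 3, and a miss.
toy_results = [
    (["a", "b", "c"], "a"),
    (["x", "y", "z"], "z"),
    (["p", "q", "r"], "missing"),
]
toy_metrics = evaluate(toy_results)

if __name__ == "__main__":
    print(toy_metrics)
```

On the toy data, one of three queries hits at rank 1 and two of three hit within the top 3, so MRR averages 1, 1/3, and 0.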

### 3 Interpretation & Final Verdict

**1. MiniLM is Insufficient for Legal RAG**
The baseline suggests that this legal corpus needs a higher-capacity model. MiniLM achieved a **Recall@1 of only 48.7%**, meaning that for just over half of all queries the top result was not the expected chunk. Using this model would force the LLM to read deeper into the retrieval list (Top-10), increasing latency and the risk of "lost-in-the-middle" errors, where relevant context buried mid-prompt is overlooked.

**2. The Surprise Winner: Snowflake-Arctic**
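While BGE-M3 is roughly 5x larger than Snowflake (567M vs 110M parameters), **`snowflake-m` matched and marginally exceeded it on the most critical metric: MRR (0.799 vs 0.797).** A gap of 0.002 is too small to call a clear accuracy win, so the practical takeaway is that the smaller model gives up nothing in ranking quality.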

* **Precision Focus:** Snowflake achieved a **70.1% Recall@1**, meaning the correct legal clause appeared as the absolute first result more often than with BGE-M3 (69.7%).
* **Recall Parity:** BGE-M3 pulled ahead at Recall@5 (93.2% vs 92.6%) and Recall@10 (97.5% vs 96.7%), but both gaps are under one percentage point; the run log shows the two models tied at Recall@3 (87.8%).
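To put the Recall@1 gap in perspective, the percentages can be converted back into query counts. The sketch below uses only figures from this document (747 queries, 70.1% vs 69.7%); the helper names are hypothetical. The binomial standard error is a rough guide only, since both models were scored on the same queries and a paired test such as McNemar's would be the more rigorous check.

```python
import math

N_QUERIES = 747  # size of the golden test set, from the run log


def hits(recall: float, n: int = N_QUERIES) -> int:
    """Approximate number of queries answered correctly at a given recall."""
    return round(recall * n)


def standard_error(p: float, n: int = N_QUERIES) -> float:
    """Binomial standard error of a proportion p measured over n queries."""
    return math.sqrt(p * (1.0 - p) / n)


snowflake_hits = hits(0.701)
bge_hits = hits(0.697)
gap_in_queries = snowflake_hits - bge_hits
recall_se = standard_error(0.701)

if __name__ == "__main__":
    print(f"snowflake-m top-1 hits: {snowflake_hits}")
    print(f"bge-m3 top-1 hits:      {bge_hits}")
    print(f"gap: {gap_in_queries} queries, standard error ~ {recall_se:.3f}")
```

The Recall@1 difference amounts to about three queries out of 747, well inside one standard error (roughly 1.7 percentage points), which supports reading the two models as equivalent on top-1 accuracy.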

**3. Final Decision**
We will proceed with **`snowflake-arctic-embed-m`**.
It offers the highest ranking quality (MRR) on our dataset while being roughly 5x smaller than BGE-M3 and noticeably faster: in our benchmark run it embedded the 1,518 chunks in about 4 seconds versus about 14 seconds for BGE-M3. This keeps our pipeline responsive without sacrificing the precision needed for legal interpretation.
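The speed claim can be checked against the progress-bar rates in the run log. The sketch below hard-codes those logged rates; the dictionary and function names are hypothetical. These are single-run numbers on an Apple Silicon (MPS) device, and the query rate includes vector database lookup as well as query embedding, so treat the ratios as indicative only.

```python
# Throughput figures copied from the tqdm progress bars in the run log.
EMBED_BATCHES_PER_SEC = {"minilm": 26.26, "snowflake-m": 9.62, "bge-m3": 3.20}
QUERIES_PER_SEC = {"minilm": 114.27, "snowflake-m": 70.11, "bge-m3": 46.22}


def speedup(rates: dict, faster: str, slower: str) -> float:
    """How many times faster one model is than another for a given rate table."""
    return rates[faster] / rates[slower]


embed_speedup = speedup(EMBED_BATCHES_PER_SEC, "snowflake-m", "bge-m3")
query_speedup = speedup(QUERIES_PER_SEC, "snowflake-m", "bge-m3")

if __name__ == "__main__":
    print(f"Indexing: snowflake-m is {embed_speedup:.1f}x faster than bge-m3")
    print(f"Querying: snowflake-m is {query_speedup:.1f}x faster than bge-m3")
```

In this run, Snowflake indexes about three times faster and serves queries about 1.5 times faster than BGE-M3.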

In [1]:
import sys
from pathlib import Path

import pandas as pd

# Make the project root importable so the `src` package resolves
# when the notebook runs from a subdirectory.
project_root = Path.cwd().parent
sys.path.append(str(project_root))

from src.utils.helper import load_json
from src.retrieval.embedding_pipeline import EmbeddingWithDB
from src.retrieval.evaluate_retrieval import evaluate_retrieval

chunks_path = project_root / "data/json/ai_act_chunks_split.json"
test_set_path = project_root / "data/test_set/golden_test_set.json"
model_names = ["minilm", "snowflake-m", "bge-m3"]

if __name__ == "__main__":

    # 1. Load the chunked AI Act text and the golden query set
    chunks = load_json(str(chunks_path))
    golden_set = load_json(str(test_set_path))

    # 2. Initialize the embedding engine and its vector DB collection
    engine = EmbeddingWithDB(collection_name="benchmark_test")

    all_results = []

    # 3. Re-embed the corpus and evaluate once per model
    for model_name in model_names:
        print(f"TESTING MODEL: {model_name}")

        engine.load_model(model_name)

        # reset_db=True clears the collection so embeddings
        # from different models are never mixed.
        engine.embed_and_store(chunks, reset_db=True)

        metrics = evaluate_retrieval(engine, golden_set)

        metrics["Model"] = model_name
        all_results.append(metrics)

        print(f"Result: {metrics}")

    # 4. Build the leaderboard with the "Model" column first
    print("\n\nFINAL LEADERBOARD")
    df = pd.DataFrame(all_results)

    cols = ["Model"] + [c for c in df.columns if c != "Model"]
    df = df[cols]

    # DataFrame.to_markdown requires the optional `tabulate` package
    print(df.to_markdown(index=False))

Device Detected: MPS [Apple Silicon GPU (Metal Performance Shaders)]
Vector Database loaded. Collection 'benchmark_test' has 1518 chunks.
TESTING MODEL: minilm
Loading minilm (sentence-transformers/all-MiniLM-L6-v2)...
Loaded minilm
Resetting Database Collection...
Generating Embeddings for 1518 chunks...


Batches: 100%|██████████| 48/48 [00:01<00:00, 26.26it/s]


Stored 1518 chunks.
Evaluating 747 queries...


100%|██████████| 747/747 [00:06<00:00, 114.27it/s]


Result: {'Recall@1': 0.487, 'Precision@1': 0.487, 'NDCG@1': 0.487, 'Recall@3': 0.716, 'Precision@3': 0.239, 'NDCG@3': 0.622, 'Recall@5': 0.798, 'Precision@5': 0.16, 'NDCG@5': 0.655, 'Recall@10': 0.871, 'Precision@10': 0.087, 'NDCG@10': 0.679, 'MRR': 0.618, 'Model': 'minilm'}
TESTING MODEL: snowflake-m
Loading snowflake-m (Snowflake/snowflake-arctic-embed-m)...
Loaded snowflake-m
Resetting Database Collection...
Generating Embeddings for 1518 chunks...


Batches: 100%|██████████| 48/48 [00:04<00:00,  9.62it/s]


Stored 1518 chunks.
Evaluating 747 queries...


100%|██████████| 747/747 [00:10<00:00, 70.11it/s]


Result: {'Recall@1': 0.701, 'Precision@1': 0.701, 'NDCG@1': 0.701, 'Recall@3': 0.878, 'Precision@3': 0.293, 'NDCG@3': 0.807, 'Recall@5': 0.926, 'Precision@5': 0.185, 'NDCG@5': 0.827, 'Recall@10': 0.967, 'Precision@10': 0.097, 'NDCG@10': 0.84, 'MRR': 0.799, 'Model': 'snowflake-m'}
TESTING MODEL: bge-m3
Loading bge-m3 (BAAI/bge-m3)...
Loaded bge-m3
Resetting Database Collection...
Generating Embeddings for 1518 chunks...


Batches: 100%|██████████| 48/48 [00:14<00:00,  3.20it/s]


Stored 1518 chunks.
Evaluating 747 queries...


100%|██████████| 747/747 [00:16<00:00, 46.22it/s]

Result: {'Recall@1': 0.697, 'Precision@1': 0.697, 'NDCG@1': 0.697, 'Recall@3': 0.878, 'Precision@3': 0.293, 'NDCG@3': 0.805, 'Recall@5': 0.932, 'Precision@5': 0.186, 'NDCG@5': 0.827, 'Recall@10': 0.975, 'Precision@10': 0.097, 'NDCG@10': 0.841, 'MRR': 0.797, 'Model': 'bge-m3'}


FINAL LEADERBOARD
| Model       |   Recall@1 |   Precision@1 |   NDCG@1 |   Recall@3 |   Precision@3 |   NDCG@3 |   Recall@5 |   Precision@5 |   NDCG@5 |   Recall@10 |   Precision@10 |   NDCG@10 |   MRR |
|:------------|-----------:|--------------:|---------:|-----------:|--------------:|---------:|-----------:|--------------:|---------:|------------:|---------------:|----------:|------:|
| minilm      |      0.487 |         0.487 |    0.487 |      0.716 |         0.239 |    0.622 |      0.798 |         0.16  |    0.655 |       0.871 |          0.087 |     0.679 | 0.618 |
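| snowflake-m |      0.701 |         0.701 |    0.701 |      0.878 |         0.293 |    0.807 |      0.926 |         0.185 |    0.827 |       0.967 |          0.097 |     0.84  | 0.799 |
| bge-m3      |      0.697 |         0.697 |    0.697 |      0.878 |         0.293 |    0.805 |      0.932 |         0.186 |    0.827 |       0.975 |          0.097 |     0.841 | 0.797 |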


