# **Evaluation of baseline models (before fine-tuning)**

### *Two ways to evaluate the dataset:*

- Concept (tolerant)
- Def-only (strict)

#### *Concept:*

Goal: “Does the model group together all formulations of the same concept (cluster_id)?”
Example: for query = “ASE”, if the top-1 result is the context document for ASE rather than the ASE definition, it still counts as correct.

This therefore measures how well the model clusters semantically around a concept.
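As a minimal sketch of what "all formulations of the same concept" means in practice, the snippet below builds several query formulations from toy rows and tags each with its `cluster_id`. The rows, the `c_ase` identifier and the `build_queries` helper are hypothetical illustrations; only the field names (`cluster_id`, `pair_type`, `anchor`) and the "C'est quoi ... ?" template come from the notebook.

```python
# Hypothetical rows mirroring the dataset fields (anchor, cluster_id, pair_type).
rows = [
    {"cluster_id": "c_ase", "pair_type": "def", "anchor": "ASE"},
    {"cluster_id": "c_ase", "pair_type": "qa", "anchor": "Que signifie ASE ?"},
]


def build_queries(rows):
    """Return (query_text, cluster_id) pairs.

    Keyword anchors (def/context rows) yield two formulations:
    the bare keyword and a "C'est quoi ... ?" question.
    QA rows are used as-is.
    """
    queries = []
    for r in rows:
        if r["pair_type"] in {"def", "context"}:
            queries.append((r["anchor"], r["cluster_id"]))
            queries.append((f"C'est quoi {r['anchor']} ?", r["cluster_id"]))
        elif r["pair_type"] == "qa":
            queries.append((r["anchor"], r["cluster_id"]))
    return queries


queries = build_queries(rows)
for text, cid in queries:
    print(f"{cid}: {text}")
```

All three formulations point to the same cluster, so under the Concept evaluation any document of that cluster counts as a correct answer for each of them.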

#### *Def-only:*

Goal: “Does the model retrieve the right definition when I ask a question?”
This time, the only relevant document is the one with pair_type="def" in the same cluster_id.

Example: for query = “ASE”, if the top-1 result is the ASE context document, it counts as wrong;
the model must return the ASE definition.

This is the closer proxy for a clean RAG setup
(retrieving a stable, encyclopedic source rather than a usage sentence).
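The difference between the two evaluation modes can be sketched on a toy corpus. This is an illustration under assumptions, not the notebook's implementation: the `docs` list, the `relevant_docs` and `hit_at_1` helpers and the example ranking are hypothetical.

```python
# Hypothetical corpus: one definition and one context document per concept.
docs = [
    {"cluster_id": "ASE", "pair_type": "def"},       # index 0
    {"cluster_id": "ASE", "pair_type": "context"},   # index 1
    {"cluster_id": "DGFiP", "pair_type": "def"},     # index 2
]


def relevant_docs(cluster_id, mode):
    """Indices of relevant documents under 'concept' or 'def_only' mode."""
    return {
        i
        for i, d in enumerate(docs)
        if d["cluster_id"] == cluster_id
        and (mode == "concept" or d["pair_type"] == "def")
    }


def hit_at_1(ranked, relevant):
    """True when the first retrieved document is relevant."""
    return bool(ranked) and ranked[0] in relevant


# Assumed retrieval result for query "ASE": context first, definition second.
ranking = [1, 0, 2]

concept_ok = hit_at_1(ranking, relevant_docs("ASE", "concept"))
def_only_ok = hit_at_1(ranking, relevant_docs("ASE", "def_only"))
print(f"Concept hit@1: {concept_ok} | Def-only hit@1: {def_only_ok}")
```

The same ranking is a hit under Concept and a miss under Def-only, which is why Def-only scores can only be equal to or lower than Concept scores for a given model.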

#### **Evaluation on the test dataset (10% of the full dataset)**

In [1]:
import json
import time
import gc
import random
from pathlib import Path
from typing import List, Dict, Tuple, Optional, Set
from collections import defaultdict

import numpy as np
import pandas as pd
import torch

from sentence_transformers import SentenceTransformer
from transformers import AutoModel, AutoTokenizer, AutoConfig

from IPython.display import display

In [None]:
# CONFIG
INPUT_JSONL = Path("bercy_test_10.jsonl")
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

K_EVAL = 20
MRR_K = 10
NDCG_K = 10

BATCH_SIZE = 32
MAX_LENGTH = 512
BLOCK_SIZE = 128 if DEVICE == "cuda" else 32

SEED = 42
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(SEED)

LOCAL_DIR = Path("./final_models")

TASK = "Retrieve the definition or context of an administrative acronym or term."
INSTRUCTION_PREFIX = f"Instruct: {TASK}\nQuery: "
PASSAGE_PREFIX = "passage: "

# MODELS (embedding models)
CANDIDATS = [
    {"name": "Solon-Large", "id": "OrdalieTech/SOLON-embeddings-large-0.1"},
    {"name": "Solon-Large-FT-Config1", "id": str(LOCAL_DIR / "solon_large_finetuned_config1_merged")},
    {"name": "Solon-Large-FT-Config2", "id": str(LOCAL_DIR / "solon_large_finetuned_config2_merged")},
    {"name": "Solon-Large-FT-Config3", "id": str(LOCAL_DIR / "solon_large_finetuned_config3_merged")},
    {"name": "Solon-Large-FT-Config4", "id": str(LOCAL_DIR / "solon_large_finetuned_config4_merged")},
    {"name": "E5-Large", "id": "intfloat/multilingual-e5-large"},
    {"name": "E5-Large-instruct", "id": "intfloat/multilingual-e5-large-instruct"},
    {"name": "E5-Large-FT-Config1", "id": str(LOCAL_DIR / "e5_large_finetuned_config1_merged")},
    {"name": "E5-Large-FT-Config2", "id": str(LOCAL_DIR / "e5_large_finetuned_config2_merged")},
    {"name": "E5-Large-FT-Config3", "id": str(LOCAL_DIR / "e5_large_finetuned_config3_merged")},
    {"name": "E5-Large-FT-Config4", "id": str(LOCAL_DIR / "e5_large_finetuned_config4_merged")},
]

def model_prefixes(model_name: str) -> Tuple[str, str]:
    # 1) instruct model and its fine-tuned variants
    if ("E5-Large-instruct" in model_name) or ("E5-Large-FT" in model_name):
        return INSTRUCTION_PREFIX, ""   # instruct models take no document prefix

    # 2) non-instruct E5
    if model_name == "E5-Large":
        return "query: ", "passage: "

    return "", ""

def norm_space(s: str) -> str:
    return " ".join(str(s).strip().split())

def cuda_empty_cache():
    if DEVICE == "cuda":
        torch.cuda.empty_cache()

class SnowflakeWrapper:
    """
    Wrapper exposing a Snowflake model (HF transformers) through an encode() API close to SentenceTransformer's.
    Mean pooling + L2 normalization.
    """
    def __init__(self, model_name: str):
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        config = AutoConfig.from_pretrained(model_name, trust_remote_code=True)

        if hasattr(config, "use_memory_efficient_attention"):
            config.use_memory_efficient_attention = False
        if hasattr(config, "unpad_inputs"):
            config.unpad_inputs = False

        self.model = AutoModel.from_pretrained(
            model_name,
            config=config,
            add_pooling_layer=False,
            trust_remote_code=True
        ).to(DEVICE)
        self.model.eval()

    @staticmethod
    def _mean_pooling(last_hidden_state: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
        mask = attention_mask.unsqueeze(-1).type_as(last_hidden_state)
        summed = (last_hidden_state * mask).sum(dim=1)
        denom = mask.sum(dim=1).clamp(min=1e-9)
        return summed / denom

    def encode(
        self,
        sentences: List[str],
        prompt_name: Optional[str] = None,
        batch_size: int = 32,
        prefix_query: str = "query: ",
        prefix_doc: str = ""
    ) -> torch.Tensor:
        prefix = prefix_query if prompt_name == "query" else prefix_doc
        inputs = [prefix + s for s in sentences]

        all_embeddings = []
        for i in range(0, len(inputs), batch_size):
            batch_texts = inputs[i:i + batch_size]
            batch_tokens = self.tokenizer(
                batch_texts,
                padding=True,
                truncation=True,
                return_tensors="pt",
                max_length=MAX_LENGTH
            ).to(DEVICE)

            with torch.inference_mode():
                outputs = self.model(**batch_tokens)
                emb = self._mean_pooling(outputs[0], batch_tokens["attention_mask"])
                emb = torch.nn.functional.normalize(emb, p=2, dim=1)
                all_embeddings.append(emb)

        return torch.cat(all_embeddings, dim=0) if all_embeddings else torch.empty((0, 0), device=DEVICE)

# 1) LOAD JSONL
rows = []
with INPUT_JSONL.open("r", encoding="utf-8") as f:
    for line in f:
        line = line.strip()
        if not line:
            continue
        rows.append(json.loads(line))
print(f"Loaded rows: {len(rows)}")

# 2) BUILD DOCS + QUERIES
DOCS: List[str] = []
DOC_META: List[dict] = []
DOC_INDEX_BY_CLUSTER_AND_TYPE: Dict[Tuple[str, str], int] = {}  # (cluster_id, pair_type) -> doc_idx
DOC_INDICES_BY_CLUSTER: Dict[str, List[int]] = defaultdict(list)

QUERIES: List[str] = []
Q_META: List[dict] = []

# Docs: one doc for each (cluster_id, def) and (cluster_id, context)
for r in rows:
    cid = r.get("cluster_id")
    ptype = r.get("pair_type")
    pos = norm_space(r.get("positive", ""))
    if not cid or ptype not in {"def", "context"} or not pos:
        continue

    key = (cid, ptype)
    if key in DOC_INDEX_BY_CLUSTER_AND_TYPE:
        continue

    doc_idx = len(DOCS)
    DOCS.append(pos)
    DOC_META.append({"cluster_id": cid, "pair_type": ptype})
    DOC_INDEX_BY_CLUSTER_AND_TYPE[key] = doc_idx
    DOC_INDICES_BY_CLUSTER[cid].append(doc_idx)

print(f"Docs built (def+context): {len(DOCS)}")

# Queries:
# - from def/context: anchor keyword + "C'est quoi"
# - from qa: anchor (question)
for r in rows:
    cid = r.get("cluster_id")
    ptype = r.get("pair_type")
    anchor = norm_space(r.get("anchor", ""))

    if not cid or not anchor:
        continue

    if ptype in {"def", "context"}:
        QUERIES.append(anchor)
        Q_META.append({"cluster_id": cid, "kind": "anchor_keyword", "source_pair_type": ptype})

        QUERIES.append(f"C'est quoi {anchor} ?")
        Q_META.append({"cluster_id": cid, "kind": "anchor_cestquoi", "source_pair_type": ptype})

    elif ptype == "qa":
        QUERIES.append(anchor)
        Q_META.append({"cluster_id": cid, "kind": "qa_question", "source_pair_type": "qa"})

print(f"Queries built: {len(QUERIES)}")

# 3) GROUND TRUTH (2 versions)
GT_CONCEPT: List[Set[int]] = []
GT_DEFONLY: List[Set[int]] = []

missing_concept = 0
missing_def = 0

for qm in Q_META:
    cid = qm["cluster_id"]

    rel_concept = set(DOC_INDICES_BY_CLUSTER.get(cid, []))  # def + context
    if not rel_concept:
        missing_concept += 1
    GT_CONCEPT.append(rel_concept)

    def_idx = DOC_INDEX_BY_CLUSTER_AND_TYPE.get((cid, "def"))
    if def_idx is None:
        missing_def += 1
        GT_DEFONLY.append(set())
    else:
        GT_DEFONLY.append({def_idx})

print(f"Missing concept GT: {missing_concept}")
print(f"Missing def-only GT: {missing_def}")

# 4) RETRIEVAL TOPK (blockwise)
def compute_topk_blockwise(emb_q: torch.Tensor, emb_d: torch.Tensor, k_eval: int, block_size: int) -> torch.Tensor:
    """
    emb_q: [Nq, dim], emb_d: [Nd, dim]
    returns indices: [Nq, K]
    """
    Nq = emb_q.size(0)
    K = min(k_eval, emb_d.size(0))
    emb_d_t = emb_d.T

    out = []
    for i in range(0, Nq, block_size):
        q_block = emb_q[i:i + block_size]
        scores = torch.matmul(q_block, emb_d_t)  # cosine if normalized
        topk = torch.topk(scores, k=K, dim=1).indices
        out.append(topk.detach().cpu())
        del scores, topk, q_block
        cuda_empty_cache()

    return torch.cat(out, dim=0)

def compute_metrics_from_topk(
    topk_indices: torch.Tensor,
    ground_truth: List[Set[int]],
    mrr_k: int,
    ndcg_k: int
) -> Dict[str, float]:
    Nq, K = topk_indices.shape
    k5 = min(5, K)
    k10 = min(10, K)
    k20 = min(20, K)
    mrr_k = min(mrr_k, K)
    ndcg_k = min(ndcg_k, K)

    r1 = r5 = r10 = r20 = 0
    mrr = 0.0
    ndcg = 0.0

    for i in range(Nq):
        relevant = ground_truth[i]
        ranked = topk_indices[i].tolist()

        if relevant:
            if ranked[0] in relevant:
                r1 += 1
            if any(idx in relevant for idx in ranked[:k5]):
                r5 += 1
            if any(idx in relevant for idx in ranked[:k10]):
                r10 += 1
            if any(idx in relevant for idx in ranked[:k20]):
                r20 += 1

            rr = 0.0
            for rank_pos, doc_idx in enumerate(ranked[:mrr_k], start=1):
                if doc_idx in relevant:
                    rr = 1.0 / rank_pos
                    break
            mrr += rr

            dcg = 0.0
            for rank_pos, doc_idx in enumerate(ranked[:ndcg_k], start=1):
                if doc_idx in relevant:
                    dcg += 1.0 / np.log2(rank_pos + 1)

            rel_count = min(len(relevant), ndcg_k)
            idcg = 0.0
            for rank_pos in range(1, rel_count + 1):
                idcg += 1.0 / np.log2(rank_pos + 1)

            ndcg += (dcg / idcg) if idcg > 0 else 0.0
        else:
            # Empty ground truth for this query: it adds 0 to every metric
            # but still counts in Nq, so it is implicitly penalized.
            pass

    return {
        "N_queries": Nq,
        "R@1 (%)": (r1 / Nq) * 100,
        "R@5 (%)": (r5 / Nq) * 100,
        "R@10 (%)": (r10 / Nq) * 100,
        "R@20 (%)": (r20 / Nq) * 100,
        f"MRR@{mrr_k}": mrr / Nq,
        f"nDCG@{ndcg_k}": ndcg / Nq,
    }

# 5) ENCODE HELPERS
def encode_st(model: SentenceTransformer, texts: List[str], prefix: str) -> torch.Tensor:
    emb = model.encode(
        [prefix + t for t in texts],
        batch_size=BATCH_SIZE,
        show_progress_bar=True,
        convert_to_tensor=True,
        normalize_embeddings=True,   # cosine = dot product
    )
    return emb


# 6) LOAD & ENCODE MODEL HELPERS
def load_model_smart(cand):
    t0 = time.perf_counter()
    model = None
    err = None

    try:
        if "Snowflake" in cand["name"]:
            model = SnowflakeWrapper(cand["id"])
        else:
            kw = {"torch_dtype": torch.float16} if DEVICE == "cuda" else {}
            trust = cand.get("trust", False)
            model = SentenceTransformer(cand["id"], trust_remote_code=trust, device=DEVICE, model_kwargs=kw)
    except Exception as e:
        err = e

    t1 = time.perf_counter()
    return model, (t1 - t0), err


def encode_any(model, model_name: str, texts: List[str], prefix: str, is_query: bool) -> torch.Tensor:
    # Snowflake : wrapper transformers
    if "Snowflake" in model_name:
        return model.encode(
            texts,
            prompt_name="query" if is_query else None,
            batch_size=BATCH_SIZE,
            prefix_query=prefix,
            prefix_doc=prefix
        )
    # SentenceTransformer
    return model.encode(
        [prefix + t for t in texts],
        batch_size=BATCH_SIZE,
        show_progress_bar=True,
        convert_to_tensor=True,
        normalize_embeddings=True
    )


# 7) RUN BENCHMARK
results_concept = []
results_defonly = []

for cand in CANDIDATS:
    gc.collect()
    cuda_empty_cache()

    name = cand["name"]
    mid = cand["id"]
    trust = cand.get("trust", False)

    print(f"\nLoading: {name} ({mid}) ...")
    model, load_s, err = load_model_smart(cand)
    if err is not None or model is None:
        print(f"Load error for {name}: {err}")
        continue

    pq, pd_ = model_prefixes(name)

    # Encode queries
    t0 = time.perf_counter()
    emb_q = encode_any(model, name, QUERIES, pq, is_query=True)
    t1 = time.perf_counter()

    # Encode docs
    emb_d = encode_any(model, name, DOCS, pd_, is_query=False)
    t2 = time.perf_counter()

    # Retrieval
    t3 = time.perf_counter()
    topk = compute_topk_blockwise(emb_q.to(DEVICE), emb_d.to(DEVICE), k_eval=K_EVAL, block_size=BLOCK_SIZE)
    t4 = time.perf_counter()

    # Metrics
    m_concept = compute_metrics_from_topk(topk, GT_CONCEPT, mrr_k=MRR_K, ndcg_k=NDCG_K)
    m_defonly = compute_metrics_from_topk(topk, GT_DEFONLY, mrr_k=MRR_K, ndcg_k=NDCG_K)

    # Timings
    encq_s = t1 - t0
    encd_s = t2 - t1
    retr_s = t4 - t3
    total_s = t4 - t0

    row_common = {
        "Modèle": name,
        "Load(s)": load_s,
        "EncQ(s)": encq_s,
        "EncD(s)": encd_s,
        "Retr(s)": retr_s,
        "TotalCompute(s)": total_s,
        "Nq": m_concept["N_queries"],
        "Nd": len(DOCS),
        "K": min(K_EVAL, len(DOCS)),
    }

    results_concept.append({**row_common, **{k: v for k, v in m_concept.items() if k != "N_queries"}})
    results_defonly.append({**row_common, **{k: v for k, v in m_defonly.items() if k != "N_queries"}})

    print(
        f"{name} | Concept R@1={results_concept[-1]['R@1 (%)']:.1f}% "
        f"| DefOnly R@1={results_defonly[-1]['R@1 (%)']:.1f}% "
        f"| Total={total_s:.1f}s"
    )

    del model, emb_q, emb_d
    gc.collect()
    cuda_empty_cache()

# 8) DISPLAY FUNCTIONS
def row_gradient(val, row_min, row_max):
    if row_max == row_min:
        pct = 0
    else:
        pct = (val - row_min) / (row_max - row_min)
    pct = int(pct * 100)

    return (
        f"background: linear-gradient(90deg, "
        f"#1f3c88 {pct}%, "
        f"#1e1e1e {pct}%);"
        f"color: white;"
    )

def apply_rowwise_gradient(df, cols):
    styles = pd.DataFrame("", index=df.index, columns=df.columns)
    for idx in df.index:
        row_vals = df.loc[idx, cols].astype(float)
        rmin, rmax = row_vals.min(), row_vals.max()
        for col in cols:
            styles.loc[idx, col] = row_gradient(df.loc[idx, col], rmin, rmax)
    return styles

def display_ranked_table(df_sorted: pd.DataFrame, title: str):
    # expected columns
    gradient_cols = [
        "R@1 (%)",
        "R@5 (%)",
        "R@10 (%)",
        "R@20 (%)",
        f"MRR@{MRR_K}",
        f"nDCG@{NDCG_K}",
        "TotalCompute(s)",
    ]
    gradient_cols = [c for c in gradient_cols if c in df_sorted.columns]

    df_display = df_sorted[["Modèle"] + gradient_cols].copy().reset_index(drop=True)

    fmt = {
        "R@1 (%)": "{:.1f}",
        "R@5 (%)": "{:.1f}",
        "R@10 (%)": "{:.1f}",
        "R@20 (%)": "{:.1f}",
        f"MRR@{MRR_K}": "{:.3f}",
        f"nDCG@{NDCG_K}": "{:.3f}",
        "TotalCompute(s)": "{:.3f}",
    }
    fmt = {k: v for k, v in fmt.items() if k in df_display.columns}

    styled = (
        df_display.style
        .format(fmt)
        .hide(axis="index")
        .set_caption(title)
        .set_properties(**{
            "background-color": "#1e1e1e",
            "color": "white",
            "border-color": "#333333",
            "text-align": "center",
            "font-size": "12pt"
        })
        .set_table_styles([
            {"selector": "th", "props": [
                ("background-color", "#111111"),
                ("color", "white"),
                ("border-color", "#333333"),
                ("text-align", "center")
            ]},
            {"selector": "td", "props": [
                ("border-color", "#333333")
            ]},
            {"selector": "caption", "props": [
                ("caption-side", "top"),
                ("color", "white"),
                ("font-size", "14pt"),
                ("font-weight", "bold")
            ]}
        ])
        .apply(lambda _: apply_rowwise_gradient(df_display, gradient_cols), axis=None)
    )

    # Solid blue only on the "Modèle" column
    styled = styled.set_properties(
        subset=["Modèle"],
        **{
            "background-color": "#1f3c88",
            "color": "white",
            "font-weight": "bold"
        }
    )

    display(styled)

# 9) DISPLAY TABLES
df_concept = pd.DataFrame(results_concept)
df_defonly = pd.DataFrame(results_defonly)

df_concept_sorted = df_concept.sort_values(by="R@1 (%)", ascending=False).reset_index(drop=True)
df_defonly_sorted = df_defonly.sort_values(by="R@1 (%)", ascending=False).reset_index(drop=True)

display_ranked_table(df_concept_sorted, "Benchmark CONCEPT (tolerant) - same cluster_id (def or context)")
display_ranked_table(df_defonly_sorted, "Benchmark DEF-ONLY (strict) - def of the same cluster_id only")

Loaded rows: 468
Docs built (def+context): 312
Queries built: 780
Missing concept GT: 0
Missing def-only GT: 0

Loading: Solon-Large (OrdalieTech/SOLON-embeddings-large-0.1) ...


Solon-Large | Concept R@1=99.1% | DefOnly R@1=78.1% | Total=0.8s

Loading: Solon-Large-FT-Config1 (final_models\solon_large_finetuned_config1_merged) ...


The tokenizer you are loading from 'final_models\solon_large_finetuned_config1_merged' with an incorrect regex pattern: https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503/discussions/84#69121093e8b480e709447d5e. This will lead to incorrect tokenization. You should set the `fix_mistral_regex=True` flag when loading this tokenizer to fix this issue.


Solon-Large-FT-Config1 | Concept R@1=81.9% | DefOnly R@1=50.9% | Total=0.9s

Loading: Solon-Large-FT-Config2 (final_models\solon_large_finetuned_config2_merged) ...


The tokenizer you are loading from 'final_models\solon_large_finetuned_config2_merged' with an incorrect regex pattern: https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503/discussions/84#69121093e8b480e709447d5e. This will lead to incorrect tokenization. You should set the `fix_mistral_regex=True` flag when loading this tokenizer to fix this issue.


Solon-Large-FT-Config2 | Concept R@1=90.5% | DefOnly R@1=51.0% | Total=0.8s

Loading: Solon-Large-FT-Config3 (final_models\solon_large_finetuned_config3_merged) ...


The tokenizer you are loading from 'final_models\solon_large_finetuned_config3_merged' with an incorrect regex pattern: https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503/discussions/84#69121093e8b480e709447d5e. This will lead to incorrect tokenization. You should set the `fix_mistral_regex=True` flag when loading this tokenizer to fix this issue.


Solon-Large-FT-Config3 | Concept R@1=65.1% | DefOnly R@1=40.1% | Total=0.8s

Loading: Solon-Large-FT-Config4 (final_models\solon_large_finetuned_config4_merged) ...


The tokenizer you are loading from 'final_models\solon_large_finetuned_config4_merged' with an incorrect regex pattern: https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503/discussions/84#69121093e8b480e709447d5e. This will lead to incorrect tokenization. You should set the `fix_mistral_regex=True` flag when loading this tokenizer to fix this issue.


Solon-Large-FT-Config4 | Concept R@1=72.8% | DefOnly R@1=37.6% | Total=0.8s

Loading: E5-Large (intfloat/multilingual-e5-large) ...


E5-Large | Concept R@1=99.9% | DefOnly R@1=80.0% | Total=0.9s

Loading: E5-Large-instruct (intfloat/multilingual-e5-large-instruct) ...


E5-Large-instruct | Concept R@1=100.0% | DefOnly R@1=91.7% | Total=0.9s

Loading: E5-Large-FT-Config1 (final_models\e5_large_finetuned_config1_merged) ...


The tokenizer you are loading from 'final_models\e5_large_finetuned_config1_merged' with an incorrect regex pattern: https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503/discussions/84#69121093e8b480e709447d5e. This will lead to incorrect tokenization. You should set the `fix_mistral_regex=True` flag when loading this tokenizer to fix this issue.


E5-Large-FT-Config1 | Concept R@1=99.6% | DefOnly R@1=88.5% | Total=0.8s

Loading: E5-Large-FT-Config2 (final_models\e5_large_finetuned_config2_merged) ...


The tokenizer you are loading from 'final_models\e5_large_finetuned_config2_merged' with an incorrect regex pattern: https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503/discussions/84#69121093e8b480e709447d5e. This will lead to incorrect tokenization. You should set the `fix_mistral_regex=True` flag when loading this tokenizer to fix this issue.


E5-Large-FT-Config2 | Concept R@1=98.3% | DefOnly R@1=75.3% | Total=0.9s

Loading: E5-Large-FT-Config3 (final_models\e5_large_finetuned_config3_merged) ...


The tokenizer you are loading from 'final_models\e5_large_finetuned_config3_merged' with an incorrect regex pattern: https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503/discussions/84#69121093e8b480e709447d5e. This will lead to incorrect tokenization. You should set the `fix_mistral_regex=True` flag when loading this tokenizer to fix this issue.


E5-Large-FT-Config3 | Concept R@1=99.9% | DefOnly R@1=89.5% | Total=0.8s

Loading: E5-Large-FT-Config4 (final_models\e5_large_finetuned_config4_merged) ...


The tokenizer you are loading from 'final_models\e5_large_finetuned_config4_merged' with an incorrect regex pattern: https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503/discussions/84#69121093e8b480e709447d5e. This will lead to incorrect tokenization. You should set the `fix_mistral_regex=True` flag when loading this tokenizer to fix this issue.


E5-Large-FT-Config4 | Concept R@1=99.4% | DefOnly R@1=76.9% | Total=0.8s


**Benchmark CONCEPT (tolerant) - same cluster_id (def or context)**

| Modèle | R@1 (%) | R@5 (%) | R@10 (%) | R@20 (%) | MRR@10 | nDCG@10 | TotalCompute(s) |
|---|---|---|---|---|---|---|---|
| E5-Large-instruct | 100.0 | 100.0 | 100.0 | 100.0 | 1.000 | 0.989 | 0.887 |
| E5-Large | 99.9 | 100.0 | 100.0 | 100.0 | 0.999 | 0.993 | 0.864 |
| E5-Large-FT-Config3 | 99.9 | 100.0 | 100.0 | 100.0 | 0.999 | 0.987 | 0.812 |
| E5-Large-FT-Config1 | 99.6 | 100.0 | 100.0 | 100.0 | 0.998 | 0.975 | 0.820 |
| E5-Large-FT-Config4 | 99.4 | 100.0 | 100.0 | 100.0 | 0.996 | 0.978 | 0.784 |
| Solon-Large | 99.1 | 100.0 | 100.0 | 100.0 | 0.994 | 0.981 | 0.813 |
| E5-Large-FT-Config2 | 98.3 | 100.0 | 100.0 | 100.0 | 0.990 | 0.966 | 0.853 |
| Solon-Large-FT-Config2 | 90.5 | 97.9 | 98.7 | 99.4 | 0.934 | 0.890 | 0.820 |
| Solon-Large-FT-Config1 | 81.9 | 96.4 | 97.2 | 99.1 | 0.880 | 0.822 | 0.875 |
| Solon-Large-FT-Config4 | 72.8 | 89.0 | 94.6 | 97.6 | 0.798 | 0.738 | 0.799 |


**Benchmark DEF-ONLY (strict) - def of the same cluster_id only**

| Modèle | R@1 (%) | R@5 (%) | R@10 (%) | R@20 (%) | MRR@10 | nDCG@10 | TotalCompute(s) |
|---|---|---|---|---|---|---|---|
| E5-Large-instruct | 91.7 | 99.2 | 99.9 | 100.0 | 0.955 | 0.966 | 0.887 |
| E5-Large-FT-Config3 | 89.5 | 99.2 | 99.4 | 99.9 | 0.942 | 0.956 | 0.812 |
| E5-Large-FT-Config1 | 88.5 | 99.0 | 99.4 | 99.4 | 0.936 | 0.951 | 0.820 |
| E5-Large | 80.0 | 98.5 | 99.1 | 99.6 | 0.891 | 0.917 | 0.864 |
| Solon-Large | 78.1 | 98.5 | 98.8 | 99.6 | 0.880 | 0.908 | 0.813 |
| E5-Large-FT-Config4 | 76.9 | 97.3 | 97.8 | 99.0 | 0.869 | 0.897 | 0.784 |
| E5-Large-FT-Config2 | 75.3 | 97.7 | 98.2 | 99.1 | 0.862 | 0.893 | 0.853 |
| Solon-Large-FT-Config2 | 51.0 | 90.5 | 93.8 | 96.8 | 0.688 | 0.751 | 0.820 |
| Solon-Large-FT-Config1 | 50.9 | 88.2 | 91.9 | 94.0 | 0.672 | 0.734 | 0.875 |
| Solon-Large-FT-Config3 | 40.1 | 75.4 | 81.0 | 86.3 | 0.558 | 0.620 | 0.850 |

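To make the MRR@10 and nDCG@10 columns easier to read, here is a small worked example using binary relevance, in the spirit of `compute_metrics_from_topk`. It is a sketch under assumptions: the document indices and the `reciprocal_rank` and `ndcg` helper names are hypothetical, and the ranking is invented for illustration.

```python
import math


def reciprocal_rank(ranked, relevant, k):
    """1 / rank of the first relevant document in the top-k, else 0."""
    for pos, doc in enumerate(ranked[:k], start=1):
        if doc in relevant:
            return 1.0 / pos
    return 0.0


def ndcg(ranked, relevant, k):
    """Binary-relevance nDCG@k: DCG divided by the DCG of an ideal ranking."""
    dcg = sum(
        1.0 / math.log2(pos + 1)
        for pos, doc in enumerate(ranked[:k], start=1)
        if doc in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(pos + 1) for pos in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


# Hypothetical ranking: the definition (3) is first, an unrelated
# document (7) is second, and the context document (9) is third.
ranked = [3, 7, 9]
concept_relevant = {3, 9}  # def + context
def_relevant = {3}         # def only

print("RR concept  :", reciprocal_rank(ranked, concept_relevant, 10))
print("nDCG concept:", round(ndcg(ranked, concept_relevant, 10), 3))
print("nDCG def    :", round(ndcg(ranked, def_relevant, 10), 3))
```

In this toy case the top-1 result is relevant under both modes, yet the Concept nDCG stays below 1 because the second relevant document is not ranked second. The same effect would explain how a model can show R@1 = 100% with nDCG@10 below 1 in the Concept table.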

### **Evaluation on the full dataset (4500 lines)**

In [None]:
TASK = "Retrieve the definition or context of an administrative acronym or term."
INSTRUCT_Q = f"Instruct: {TASK}\nQuery: "
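As a rough illustration of how this instruction prefix is meant to be used, the sketch below prepends it to queries only and leaves documents untouched, which is the convention the notebook applies to the instruct models. The `apply_prefixes` helper and the sample texts are hypothetical.

```python
TASK = "Retrieve the definition or context of an administrative acronym or term."
INSTRUCT_Q = f"Instruct: {TASK}\nQuery: "


def apply_prefixes(queries, docs, query_prefix, doc_prefix):
    """Prepend model-specific prefixes to raw texts before encoding."""
    prefixed_queries = [query_prefix + q for q in queries]
    prefixed_docs = [doc_prefix + d for d in docs]
    return prefixed_queries, prefixed_docs


# Instruct-style model: instruction on the query side, no document prefix.
q_instruct, d_instruct = apply_prefixes(
    ["ASE"], ["hypothetical definition text"], INSTRUCT_Q, ""
)

# Non-instruct E5-style model: "query: " / "passage: " prefixes.
q_e5, d_e5 = apply_prefixes(
    ["ASE"], ["hypothetical definition text"], "query: ", "passage: "
)

print(repr(q_instruct[0]))
print(repr(q_e5[0]), repr(d_e5[0]))
```

Using the wrong prefix pair for a model family changes the text the encoder sees, so results across models are only comparable when each model receives the prefixes it expects.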

In [None]:
# CONFIG
INPUT_JSONL = Path("Dataset_Bercy_4k_lines.jsonl")
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

K_EVAL = 20
MRR_K = 10
NDCG_K = 10

BATCH_SIZE = 32
MAX_LENGTH = 512
BLOCK_SIZE = 128 if DEVICE == "cuda" else 32

SEED = 42
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(SEED)

# MODELS (embedding models)
CANDIDATS = [
    {"name": "Solon-Large", "id": "OrdalieTech/SOLON-embeddings-large-0.1"},
    {"name": "Solon-Large-FT-Config1", "id": str(LOCAL_DIR / "solon_large_finetuned_config1_merged")},
    {"name": "Solon-Large-FT-Config2", "id": str(LOCAL_DIR / "solon_large_finetuned_config2_merged")},
    {"name": "Solon-Large-FT-Config3", "id": str(LOCAL_DIR / "solon_large_finetuned_config3_merged")},
    {"name": "Solon-Large-FT-Config4", "id": str(LOCAL_DIR / "solon_large_finetuned_config4_merged")},
    {"name": "E5-Large", "id": "intfloat/multilingual-e5-large"},
    {"name": "E5-Large-instruct", "id": "intfloat/multilingual-e5-large-instruct"},
    {"name": "E5-Large-FT-Config1", "id": str(LOCAL_DIR / "e5_large_finetuned_config1_merged")},
    {"name": "E5-Large-FT-Config2", "id": str(LOCAL_DIR / "e5_large_finetuned_config2_merged")},
    {"name": "E5-Large-FT-Config3", "id": str(LOCAL_DIR / "e5_large_finetuned_config3_merged")},
    {"name": "E5-Large-FT-Config4", "id": str(LOCAL_DIR / "e5_large_finetuned_config4_merged")},
]

def prefixes_for(model_name: str):
    if "E5-Large-instruct" in model_name or "E5-Large-FT" in model_name:
        return INSTRUCT_Q, ""
    if "E5" in model_name:
        return "query: ", "passage: "
    if "Snowflake" in model_name:
        return "query: ", ""
    return "", ""

def norm_space(s: str) -> str:
    return " ".join(str(s).strip().split())

def cuda_empty_cache():
    if DEVICE == "cuda":
        torch.cuda.empty_cache()

class SnowflakeWrapper:
    """
    Wrapper exposing a Snowflake model (HF transformers) through an encode() API close to SentenceTransformer's.
    Mean pooling + L2 normalization.
    """
    def __init__(self, model_name: str):
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        config = AutoConfig.from_pretrained(model_name, trust_remote_code=True)

        if hasattr(config, "use_memory_efficient_attention"):
            config.use_memory_efficient_attention = False
        if hasattr(config, "unpad_inputs"):
            config.unpad_inputs = False

        self.model = AutoModel.from_pretrained(
            model_name,
            config=config,
            add_pooling_layer=False,
            trust_remote_code=True
        ).to(DEVICE)
        self.model.eval()

    @staticmethod
    def _mean_pooling(last_hidden_state: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
        mask = attention_mask.unsqueeze(-1).type_as(last_hidden_state)
        summed = (last_hidden_state * mask).sum(dim=1)
        denom = mask.sum(dim=1).clamp(min=1e-9)
        return summed / denom

    def encode(
        self,
        sentences: List[str],
        prompt_name: Optional[str] = None,
        batch_size: int = 32,
        prefix_query: str = "query: ",
        prefix_doc: str = ""
    ) -> torch.Tensor:
        prefix = prefix_query if prompt_name == "query" else prefix_doc
        inputs = [prefix + s for s in sentences]

        all_embeddings = []
        for i in range(0, len(inputs), batch_size):
            batch_texts = inputs[i:i + batch_size]
            batch_tokens = self.tokenizer(
                batch_texts,
                padding=True,
                truncation=True,
                return_tensors="pt",
                max_length=MAX_LENGTH
            ).to(DEVICE)

            with torch.inference_mode():
                outputs = self.model(**batch_tokens)
                emb = self._mean_pooling(outputs[0], batch_tokens["attention_mask"])
                emb = torch.nn.functional.normalize(emb, p=2, dim=1)
                all_embeddings.append(emb)

        return torch.cat(all_embeddings, dim=0) if all_embeddings else torch.empty((0, 0), device=DEVICE)

# 1) LOAD JSONL
rows = []
with INPUT_JSONL.open("r", encoding="utf-8") as f:
    for line in f:
        line = line.strip()
        if not line:
            continue
        rows.append(json.loads(line))
print(f"Loaded rows: {len(rows)}")

# 2) BUILD DOCS + QUERIES
DOCS: List[str] = []
DOC_META: List[dict] = []
DOC_INDEX_BY_CLUSTER_AND_TYPE: Dict[Tuple[str, str], int] = {}  # (cluster_id, pair_type) -> doc_idx
DOC_INDICES_BY_CLUSTER: Dict[str, List[int]] = defaultdict(list)

QUERIES: List[str] = []
Q_META: List[dict] = []

# Docs: one doc for each (cluster_id, def) and (cluster_id, context)
for r in rows:
    cid = r.get("cluster_id")
    ptype = r.get("pair_type")
    pos = norm_space(r.get("positive", ""))
    if not cid or ptype not in {"def", "context"} or not pos:
        continue

    key = (cid, ptype)
    if key in DOC_INDEX_BY_CLUSTER_AND_TYPE:
        continue

    doc_idx = len(DOCS)
    DOCS.append(pos)
    DOC_META.append({"cluster_id": cid, "pair_type": ptype})
    DOC_INDEX_BY_CLUSTER_AND_TYPE[key] = doc_idx
    DOC_INDICES_BY_CLUSTER[cid].append(doc_idx)

print(f"Docs built (def+context): {len(DOCS)}")

# Queries:
# - from def/context: anchor keyword + "C'est quoi"
# - from qa: anchor (question)
for r in rows:
    cid = r.get("cluster_id")
    ptype = r.get("pair_type")
    anchor = norm_space(r.get("anchor", ""))

    if not cid or not anchor:
        continue

    if ptype in {"def", "context"}:
        QUERIES.append(anchor)
        Q_META.append({"cluster_id": cid, "kind": "anchor_keyword", "source_pair_type": ptype})

        QUERIES.append(f"C'est quoi {anchor} ?")
        Q_META.append({"cluster_id": cid, "kind": "anchor_cestquoi", "source_pair_type": ptype})

    elif ptype == "qa":
        QUERIES.append(anchor)
        Q_META.append({"cluster_id": cid, "kind": "qa_question", "source_pair_type": "qa"})

print(f"Queries built: {len(QUERIES)}")

# 3) GROUND TRUTH (2 versions)
GT_CONCEPT: List[Set[int]] = []
GT_DEFONLY: List[Set[int]] = []

missing_concept = 0
missing_def = 0

for qm in Q_META:
    cid = qm["cluster_id"]

    rel_concept = set(DOC_INDICES_BY_CLUSTER.get(cid, []))  # def + context
    if not rel_concept:
        missing_concept += 1
    GT_CONCEPT.append(rel_concept)

    def_idx = DOC_INDEX_BY_CLUSTER_AND_TYPE.get((cid, "def"))
    if def_idx is None:
        missing_def += 1
        GT_DEFONLY.append(set())
    else:
        GT_DEFONLY.append({def_idx})

print(f"Missing concept GT: {missing_concept}")
print(f"Missing def-only GT: {missing_def}")

# 4) RETRIEVAL TOPK (blockwise)
def compute_topk_blockwise(emb_q: torch.Tensor, emb_d: torch.Tensor, k_eval: int, block_size: int) -> torch.Tensor:
    """
    emb_q: [Nq, dim], emb_d: [Nd, dim]
    returns indices: [Nq, K]
    """
    Nq = emb_q.size(0)
    K = min(k_eval, emb_d.size(0))
    emb_d_t = emb_d.T

    out = []
    for i in range(0, Nq, block_size):
        q_block = emb_q[i:i + block_size]
        scores = torch.matmul(q_block, emb_d_t)  # cosine if normalized
        topk = torch.topk(scores, k=K, dim=1).indices
        out.append(topk.detach().cpu())
        del scores, topk, q_block
        cuda_empty_cache()

    return torch.cat(out, dim=0)

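def compute_metrics_from_topk(
    topk_indices: torch.Tensor,
    ground_truth: List[Set[int]],
    mrr_k: int,
    ndcg_k: int
) -> Dict[str, float]:
    """
    Recall@{1,5,10,20} (hit rate), MRR@mrr_k and nDCG@ndcg_k with binary relevance.

    Recall@k counts a query as a hit when at least one relevant document is in the top k.
    Every metric is averaged over all queries, including those with an empty ground truth.
    """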
    Nq, K = topk_indices.shape
    k5 = min(5, K)
    k10 = min(10, K)
    k20 = min(20, K)
    mrr_k = min(mrr_k, K)
    ndcg_k = min(ndcg_k, K)

    r1 = r5 = r10 = r20 = 0
    mrr = 0.0
    ndcg = 0.0

    for i in range(Nq):
        relevant = ground_truth[i]
        ranked = topk_indices[i].tolist()

        if relevant:
            if ranked[0] in relevant:
                r1 += 1
            if any(idx in relevant for idx in ranked[:k5]):
                r5 += 1
            if any(idx in relevant for idx in ranked[:k10]):
                r10 += 1
            if any(idx in relevant for idx in ranked[:k20]):
                r20 += 1

            rr = 0.0
            for rank_pos, doc_idx in enumerate(ranked[:mrr_k], start=1):
                if doc_idx in relevant:
                    rr = 1.0 / rank_pos
                    break
            mrr += rr

            dcg = 0.0
            for rank_pos, doc_idx in enumerate(ranked[:ndcg_k], start=1):
                if doc_idx in relevant:
                    dcg += 1.0 / np.log2(rank_pos + 1)

            rel_count = min(len(relevant), ndcg_k)
            idcg = 0.0
            for rank_pos in range(1, rel_count + 1):
                idcg += 1.0 / np.log2(rank_pos + 1)

            ndcg += (dcg / idcg) if idcg > 0 else 0.0
        else:
            # Empty ground truth for this query: nothing is added to the counters.
            # Because every metric is divided by Nq below, the query is implicitly
            # scored as a miss (recall = 0, rr = 0, nDCG = 0) rather than skipped.
            pass

    return {
        "N_queries": Nq,
        "R@1 (%)": (r1 / Nq) * 100,
        "R@5 (%)": (r5 / Nq) * 100,
        "R@10 (%)": (r10 / Nq) * 100,
        "R@20 (%)": (r20 / Nq) * 100,
        f"MRR@{mrr_k}": mrr / Nq,
        f"nDCG@{ndcg_k}": ndcg / Nq,
    }

# 5) ENCODE HELPERS
def encode_st(model: SentenceTransformer, texts: List[str], prefix: str) -> torch.Tensor:
    emb = model.encode(
        [prefix + t for t in texts],
        batch_size=BATCH_SIZE,
        show_progress_bar=True,
        convert_to_tensor=True,
        normalize_embeddings=True,   # cosine = dot product
    )
    return emb


# 6) LOAD & ENCODE MODEL HELPERS
def load_model_smart(cand):
    t0 = time.perf_counter()
    model = None
    err = None

    try:
        if "Snowflake" in cand["name"]:
            model = SnowflakeWrapper(cand["id"])
        else:
            kw = {"torch_dtype": torch.float16} if DEVICE == "cuda" else {}
            trust = cand.get("trust", False)
            model = SentenceTransformer(cand["id"], trust_remote_code=trust, device=DEVICE, model_kwargs=kw)
    except Exception as e:
        err = e

    t1 = time.perf_counter()
    return model, (t1 - t0), err


def encode_any(model, model_name: str, texts: List[str], prefix: str, is_query: bool) -> torch.Tensor:
    # Snowflake: custom wrapper built on transformers
    if "Snowflake" in model_name:
        return model.encode(
            texts,
            prompt_name="query" if is_query else None,
            batch_size=BATCH_SIZE,
            prefix_query=prefix,
            prefix_doc=prefix
        )
    # SentenceTransformer
    return model.encode(
        [prefix + t for t in texts],
        batch_size=BATCH_SIZE,
        show_progress_bar=True,
        convert_to_tensor=True,
        normalize_embeddings=True
    )


# 7) RUN BENCHMARK
results_concept = []
results_defonly = []

for cand in CANDIDATS:
    gc.collect()
    cuda_empty_cache()

    name = cand["name"]
    mid = cand["id"]

    print(f"\nLoading: {name} ({mid}) ...")
    model, load_s, err = load_model_smart(cand)
    if err is not None or model is None:
        print(f"Load error for {name}: {err}")
        continue

    pq, pd_ = model_prefixes(name)

    # Encode queries
    t0 = time.perf_counter()
    emb_q = encode_any(model, name, QUERIES, pq, is_query=True)
    t1 = time.perf_counter()

    # Encode docs
    emb_d = encode_any(model, name, DOCS, pd_, is_query=False)
    t2 = time.perf_counter()

    # Retrieval
    t3 = time.perf_counter()
    topk = compute_topk_blockwise(emb_q.to(DEVICE), emb_d.to(DEVICE), k_eval=K_EVAL, block_size=BLOCK_SIZE)
    t4 = time.perf_counter()

    # Metrics
    m_concept = compute_metrics_from_topk(topk, GT_CONCEPT, mrr_k=MRR_K, ndcg_k=NDCG_K)
    m_defonly = compute_metrics_from_topk(topk, GT_DEFONLY, mrr_k=MRR_K, ndcg_k=NDCG_K)

    # Timings
    encq_s = t1 - t0
    encd_s = t2 - t1
    retr_s = t4 - t3
    total_s = t4 - t0

    row_common = {
        "Modèle": name,
        "Load(s)": load_s,
        "EncQ(s)": encq_s,
        "EncD(s)": encd_s,
        "Retr(s)": retr_s,
        "TotalCompute(s)": total_s,
        "Nq": m_concept["N_queries"],
        "Nd": len(DOCS),
        "K": min(K_EVAL, len(DOCS)),
    }

    results_concept.append({**row_common, **{k: v for k, v in m_concept.items() if k != "N_queries"}})
    results_defonly.append({**row_common, **{k: v for k, v in m_defonly.items() if k != "N_queries"}})

    print(
        f"{name} | Concept R@1={results_concept[-1]['R@1 (%)']:.1f}% "
        f"| DefOnly R@1={results_defonly[-1]['R@1 (%)']:.1f}% "
        f"| Total={total_s:.1f}s"
    )

    del model, emb_q, emb_d
    gc.collect()
    cuda_empty_cache()

# 8) DISPLAY HELPERS
def row_gradient(val, row_min, row_max):
    if row_max == row_min:
        pct = 0
    else:
        pct = (val - row_min) / (row_max - row_min)
    pct = int(pct * 100)

    return (
        f"background: linear-gradient(90deg, "
        f"#1f3c88 {pct}%, "
        f"#1e1e1e {pct}%);"
        f"color: white;"
    )

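def apply_rowwise_gradient(df, cols):
    # Note: min/max are computed per row across all `cols`, so columns on different
    # scales (percentages, [0, 1] scores, seconds) share a single range.
    styles = pd.DataFrame("", index=df.index, columns=df.columns)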
    for idx in df.index:
        row_vals = df.loc[idx, cols].astype(float)
        rmin, rmax = row_vals.min(), row_vals.max()
        for col in cols:
            styles.loc[idx, col] = row_gradient(df.loc[idx, col], rmin, rmax)
    return styles

def display_ranked_table(df_sorted: pd.DataFrame, title: str):
    # expected columns
    gradient_cols = [
        "R@1 (%)",
        "R@5 (%)",
        "R@10 (%)",
        "R@20 (%)",
        f"MRR@{MRR_K}",
        f"nDCG@{NDCG_K}",
        "TotalCompute(s)",
    ]
    gradient_cols = [c for c in gradient_cols if c in df_sorted.columns]

    df_display = df_sorted[["Modèle"] + gradient_cols].copy().reset_index(drop=True)

    fmt = {
        "R@1 (%)": "{:.1f}",
        "R@5 (%)": "{:.1f}",
        "R@10 (%)": "{:.1f}",
        "R@20 (%)": "{:.1f}",
        f"MRR@{MRR_K}": "{:.3f}",
        f"nDCG@{NDCG_K}": "{:.3f}",
        "TotalCompute(s)": "{:.3f}",
    }
    fmt = {k: v for k, v in fmt.items() if k in df_display.columns}

    styled = (
        df_display.style
        .format(fmt)
        .hide(axis="index")
        .set_caption(title)
        .set_properties(**{
            "background-color": "#1e1e1e",
            "color": "white",
            "border-color": "#333333",
            "text-align": "center",
            "font-size": "12pt"
        })
        .set_table_styles([
            {"selector": "th", "props": [
                ("background-color", "#111111"),
                ("color", "white"),
                ("border-color", "#333333"),
                ("text-align", "center")
            ]},
            {"selector": "td", "props": [
                ("border-color", "#333333")
            ]},
            {"selector": "caption", "props": [
                ("caption-side", "top"),
                ("color", "white"),
                ("font-size", "14pt"),
                ("font-weight", "bold")
            ]}
        ])
        .apply(lambda _: apply_rowwise_gradient(df_display, gradient_cols), axis=None)
    )

    # Solid blue on the "Modèle" column only
    styled = styled.set_properties(
        subset=["Modèle"],
        **{
            "background-color": "#1f3c88",
            "color": "white",
            "font-weight": "bold"
        }
    )

    display(styled)

# 9) DISPLAY TABLES
df_concept = pd.DataFrame(results_concept)
df_defonly = pd.DataFrame(results_defonly)

df_concept_sorted = df_concept.sort_values(by="R@1 (%)", ascending=False).reset_index(drop=True)
df_defonly_sorted = df_defonly.sort_values(by="R@1 (%)", ascending=False).reset_index(drop=True)

display_ranked_table(df_concept_sorted, "CONCEPT benchmark (lenient): same cluster_id (def or context)")
display_ranked_table(df_defonly_sorted, "DEF-ONLY benchmark (strict): def of the same cluster_id only")

Loaded rows: 4500
Docs built (def+context): 3000
Queries built: 7500
Missing concept GT: 0
Missing def-only GT: 0

Loading: Solon-Large (OrdalieTech/SOLON-embeddings-large-0.1) ...



Solon-Large | Concept R@1=95.7% | DefOnly R@1=76.2% | Total=9.3s

Loading: Solon-Large-FT-Config1 (final_models\solon_large_finetuned_config1_merged) ...


The tokenizer you are loading from 'final_models\solon_large_finetuned_config1_merged' with an incorrect regex pattern: https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503/discussions/84#69121093e8b480e709447d5e. This will lead to incorrect tokenization. You should set the `fix_mistral_regex=True` flag when loading this tokenizer to fix this issue.



Solon-Large-FT-Config1 | Concept R@1=50.9% | DefOnly R@1=28.0% | Total=7.9s

Loading: Solon-Large-FT-Config2 (final_models\solon_large_finetuned_config2_merged) ...


The tokenizer you are loading from 'final_models\solon_large_finetuned_config2_merged' with an incorrect regex pattern: https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503/discussions/84#69121093e8b480e709447d5e. This will lead to incorrect tokenization. You should set the `fix_mistral_regex=True` flag when loading this tokenizer to fix this issue.



Solon-Large-FT-Config2 | Concept R@1=56.7% | DefOnly R@1=31.1% | Total=8.1s

Loading: Solon-Large-FT-Config3 (final_models\solon_large_finetuned_config3_merged) ...


The tokenizer you are loading from 'final_models\solon_large_finetuned_config3_merged' with an incorrect regex pattern: https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503/discussions/84#69121093e8b480e709447d5e. This will lead to incorrect tokenization. You should set the `fix_mistral_regex=True` flag when loading this tokenizer to fix this issue.



Solon-Large-FT-Config3 | Concept R@1=34.6% | DefOnly R@1=16.9% | Total=8.5s

Loading: Solon-Large-FT-Config4 (final_models\solon_large_finetuned_config4_merged) ...


The tokenizer you are loading from 'final_models\solon_large_finetuned_config4_merged' with an incorrect regex pattern: https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503/discussions/84#69121093e8b480e709447d5e. This will lead to incorrect tokenization. You should set the `fix_mistral_regex=True` flag when loading this tokenizer to fix this issue.



Solon-Large-FT-Config4 | Concept R@1=36.2% | DefOnly R@1=15.9% | Total=7.4s

Loading: E5-Large (intfloat/multilingual-e5-large) ...



E5-Large | Concept R@1=98.7% | DefOnly R@1=80.8% | Total=9.3s

Loading: E5-Large-instruct (intfloat/multilingual-e5-large-instruct) ...



E5-Large-instruct | Concept R@1=97.6% | DefOnly R@1=91.3% | Total=7.6s

Loading: E5-Large-FT-Config1 (final_models\e5_large_finetuned_config1_merged) ...


The tokenizer you are loading from 'final_models\e5_large_finetuned_config1_merged' with an incorrect regex pattern: https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503/discussions/84#69121093e8b480e709447d5e. This will lead to incorrect tokenization. You should set the `fix_mistral_regex=True` flag when loading this tokenizer to fix this issue.



E5-Large-FT-Config1 | Concept R@1=0.3% | DefOnly R@1=0.3% | Total=8.8s

Loading: E5-Large-FT-Config2 (final_models\e5_large_finetuned_config2_merged) ...
Load error for E5-Large-FT-Config2: Error no file named pytorch_model.bin, model.safetensors, tf_model.h5, model.ckpt.index or flax_model.msgpack found in directory final_models\e5_large_finetuned_config2_merged.

Loading: E5-Large-FT-Config3 (final_models\e5_large_finetuned_config3_merged) ...
Load error for E5-Large-FT-Config3: Error no file named pytorch_model.bin, model.safetensors, tf_model.h5, model.ckpt.index or flax_model.msgpack found in directory final_models\e5_large_finetuned_config3_merged.

Loading: E5-Large-FT-Config4 (final_models\e5_large_finetuned_config4_merged) ...
Load error for E5-Large-FT-Config4: Error no file named pytorch_model.bin, model.safetensors, tf_model.h5, model.ckpt.index or flax_model.msgpack found in directory final_models\e5_large_finetuned_config4_merged.


CONCEPT benchmark (lenient): same cluster_id (def or context)

| Modèle | R@1 (%) | R@5 (%) | R@10 (%) | R@20 (%) | MRR@10 | nDCG@10 | TotalCompute(s) |
|---|---|---|---|---|---|---|---|
| E5-Large | 98.7 | 99.8 | 99.9 | 100.0 | 0.992 | 0.97 | 9.305 |
| E5-Large-instruct | 97.6 | 99.6 | 99.8 | 99.8 | 0.985 | 0.848 | 7.563 |
| Solon-Large | 95.7 | 99.1 | 99.5 | 99.7 | 0.972 | 0.921 | 9.34 |
| Solon-Large-FT-Config2 | 56.7 | 89.3 | 95.5 | 97.5 | 0.692 | 0.667 | 8.053 |
| Solon-Large-FT-Config1 | 50.9 | 76.2 | 83.9 | 90.0 | 0.617 | 0.577 | 7.928 |
| Solon-Large-FT-Config4 | 36.2 | 66.8 | 76.4 | 82.6 | 0.489 | 0.467 | 7.399 |
| Solon-Large-FT-Config3 | 34.6 | 59.3 | 67.8 | 75.1 | 0.448 | 0.437 | 8.502 |
| E5-Large-FT-Config1 | 0.3 | 1.1 | 4.4 | 11.8 | 0.009 | 0.011 | 8.827 |


DEF-ONLY benchmark (strict): def of the same cluster_id only

| Modèle | R@1 (%) | R@5 (%) | R@10 (%) | R@20 (%) | MRR@10 | nDCG@10 | TotalCompute(s) |
|---|---|---|---|---|---|---|---|
| E5-Large-instruct | 91.3 | 98.6 | 99.2 | 99.4 | 0.946 | 0.958 | 7.563 |
| E5-Large | 80.8 | 97.4 | 97.8 | 98.2 | 0.887 | 0.91 | 9.305 |
| Solon-Large | 76.2 | 96.7 | 97.6 | 98.1 | 0.859 | 0.889 | 9.34 |
| Solon-Large-FT-Config2 | 31.1 | 75.1 | 85.4 | 90.5 | 0.488 | 0.576 | 8.053 |
| Solon-Large-FT-Config1 | 28.0 | 66.3 | 75.8 | 82.3 | 0.443 | 0.519 | 7.928 |
| Solon-Large-FT-Config3 | 16.9 | 50.9 | 61.2 | 68.4 | 0.312 | 0.384 | 8.502 |
| Solon-Large-FT-Config4 | 15.9 | 51.8 | 63.8 | 71.9 | 0.309 | 0.388 | 7.399 |
| E5-Large-FT-Config1 | 0.3 | 0.7 | 3.1 | 8.8 | 0.007 | 0.012 | 8.827 |
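
To make the gap between the Concept and Def-only scores concrete, here is a minimal, self-contained sketch of how one ranking is scored under each ground truth. The document indices and the `score_ranking` helper are hypothetical. They only mirror the binary-relevance logic of `compute_metrics_from_topk` for a single query and are not part of the benchmark itself.

```python
import math
from typing import Dict, List, Set


def score_ranking(ranked: List[int], relevant: Set[int], k: int = 10) -> Dict[str, float]:
    """Score one query with binary relevance (hit@1, reciprocal rank, nDCG), cut at k."""
    top = ranked[:k]
    hit1 = 1.0 if top and top[0] in relevant else 0.0

    # Reciprocal rank of the first relevant document, 0 if none is in the top k.
    rr = 0.0
    for pos, doc in enumerate(top, start=1):
        if doc in relevant:
            rr = 1.0 / pos
            break

    # DCG over the top k, normalised by the best achievable DCG.
    dcg = sum(1.0 / math.log2(pos + 1) for pos, doc in enumerate(top, start=1) if doc in relevant)
    ideal = sum(1.0 / math.log2(pos + 1) for pos in range(1, min(len(relevant), k) + 1))
    ndcg = dcg / ideal if ideal > 0 else 0.0
    return {"hit@1": hit1, "rr": rr, "ndcg": ndcg}


# Hypothetical cluster: doc 7 is the definition, doc 8 is the usage context.
ranked = [8, 7, 3, 1]  # the model ranks the context first and the definition second

concept = score_ranking(ranked, relevant={7, 8})  # def + context are both relevant
def_only = score_ranking(ranked, relevant={7})    # only the definition is relevant

print("concept :", concept)
print("def-only:", def_only)
```

Under the concept ground truth this query is a hit at rank 1, while under def-only the same ranking misses R@1 and gets a reciprocal rank of 0.5. This is the situation described in the introduction (context returned instead of the definition).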

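The `Load error` lines in the log report that no weight file was found in three of the merged model directories, so those models are absent from both tables. A small pre-flight check along the following lines could flag such directories before the benchmark starts. This is only a sketch: `find_missing_weights` is a hypothetical helper, the file names are copied from the error message, and other layouts (for example sharded checkpoints) are not covered.

```python
from pathlib import Path
from typing import Dict, List

# File names copied from the load error message in the log.
WEIGHT_FILES = (
    "pytorch_model.bin",
    "model.safetensors",
    "tf_model.h5",
    "model.ckpt.index",
    "flax_model.msgpack",
)


def find_missing_weights(model_dirs: List[str]) -> Dict[str, bool]:
    """Map each local directory to True when none of the expected weight files is present."""
    report = {}
    for d in model_dirs:
        path = Path(d)
        has_weights = path.is_dir() and any((path / name).is_file() for name in WEIGHT_FILES)
        report[d] = not has_weights
    return report


# Hypothetical usage with the directory names that appear in the log.
dirs = [f"final_models/e5_large_finetuned_config{i}_merged" for i in range(1, 5)]
for d, missing in find_missing_weights(dirs).items():
    print(f"{d}: {'no weight file found' if missing else 'ok'}")
```

Hub model ids such as `intfloat/multilingual-e5-large` are not local directories and should be left out of the list passed to the helper, since they would be reported as missing.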