## LLM as a Judge

To evaluate the quality of the responses generated by an LLM, several metrics can be computed from the opinion of an external model. That model, the judge, rates criteria such as response quality and relevance.

In [1]:
from langchain_ollama import ChatOllama
from pydantic import BaseModel, Field
from sentence_transformers import SentenceTransformer
import mlflow
import numpy as np

from sklearn.metrics.pairwise import cosine_similarity
from Levenshtein import ratio as levenshtein_ratio
from difflib import SequenceMatcher
from tqdm.auto import tqdm
import pandas as pd




In [2]:
class BaseScorer:
    def __init__(self, model, system_prompt:str, metric_name:str=None, structure=None, user_prompt:str=None):

        self.model = model
        self.metric_name = metric_name
        self.structure = structure
        self.system_prompt = system_prompt
        self.user_prompt = user_prompt
        self.scores = []

    def score(self, responses: list, questions: list) -> list:
        model = ChatOllama(model=self.model, temperature=0.0)
        structure_model = model.with_structured_output(self.structure)

        eval_responses = []

        for response, question in zip(responses , questions):
            message = [
                    {"role": "system", "content": self.system_prompt},
                    {"role": "user", "content": f"# Instructions:\n\n{self.user_prompt}\n\n {{'Answer to evaluate' : {response}}}"}
                ]

            score = structure_model.invoke(message)
            eval_responses.append(score)

        self.scores = eval_responses

        return self.scores

    def log_metrics(self, *args):
        raise NotImplementedError("This method belongs to the subclass, since it depends on the metric information.")


# Judge selection

A judge should be impartial, objective, and free of bias, so the available models are evaluated and the one with the least variance in its responses is preferred. Two signals are used for this: the variance of the numeric scores and the difference between the justifications, measured with embeddings.
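
As a rough illustration of these two stability signals, the sketch below computes the variance of repeated numeric scores and the mean and variance of the pairwise cosine similarity between justification vectors. The names `cosine` and `judge_stability` and the toy vectors are assumptions made for this example; in the notebook the vectors would come from the `SentenceTransformer` model, and the standard library is used here only to keep the sketch dependency-free.

```python
from itertools import combinations
from math import sqrt
from statistics import mean, pvariance


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))


def judge_stability(scores, embeddings):
    """Summarize how consistent a judge is over repeated evaluations."""
    # Compare every unordered pair of justifications exactly once.
    sims = [cosine(u, v) for u, v in combinations(embeddings, 2)]
    return {
        "score_var": pvariance(scores),   # population variance, like np.var
        "cosine_mean": mean(sims),
        "cosine_var": pvariance(sims),
    }


# Toy data: three repeated ratings and three 2-D stand-ins for embeddings.
stats = judge_stability([4.0, 4.0, 3.0], [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
print(stats)
```

A lower `score_var` together with a higher `cosine_mean` would suggest a judge that repeats both its rating and its reasoning when shown the same answer.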

In [3]:
available_models = [
    "gpt-oss:20b",
    # "llama3.1:8b",
    # "mistral:latest",
    # "gemma3:27b",
    # "phi3:14b",
    # "qwen3:14b",
    # "deepseek-r1:32b",
]

embedding_model = SentenceTransformer("all-mpnet-base-v2")

In [4]:
class JudgeOutput(BaseModel):
    score: str
    justification: str

# Earlier Spanish draft of the prompt; it is overwritten by the English version assigned below.
select_judge_system_prompt = """
Eres un juez imparcial y objetivo. Tu tarea es evaluar la calidad de las respuestas generadas por diferentes modelos de lenguaje.

# Estructura del texto a evaluar:
-  Pregunta: pregunta realizada por el usuario.
-  Respuesta: respuesta generada por el modelo.

# Consideraciones importantes:
- La justificación debe estar en español.
- El apartado Respuesta a evaluar no necesariamente debe seguir una estructura fija. Sin embargo, debe estar en Markdown y contener un Resumen al final.
- La calificación debe estar en el rango que especifique el usuario.
"""
select_judge_system_prompt = """
## Role: Impartial and Objective Judge

You are an impartial and objective judge. Your task is to evaluate the quality of responses generated by different language models.

### Structure of the Text to Evaluate
The text is a JSON with the following structure:
```json
{
    "Answer to evaluate": "The response generated by the model."
}
```

### Important Considerations
- The **justification** must be written in **Spanish**.
- The **Answer** section does **not necessarily need to follow a fixed structure**, but it **must be written in Markdown** and include a **Summary** at the end.
- The **rating** must be within the **range specified by the user**.
"""

user_prompt = """
### Evaluation Criteria

Rate the response according to the following criteria:

- **Bad**: The response is very poor and does not answer the question.
- **Regular**: The response is poor; it attempts to answer the question but fails, causes confusion, and contradicts itself.
- **Well**: The response is average; it answers the question but not completely, makes incorrect assumptions, and misrepresents concepts.
- **Good**: The response is good; it answers the question completely but includes some errors.
- **Excellent**: The response is excellent; it answers the question completely and correctly.

You must also provide a **justification** for your score, explaining why you gave that rating and what aspects of the response led you to that conclusion.
The main question is to give an explanation for a fraud detection system, so the response should be focused on that topic.
"""


class JudgeScorer(BaseScorer):
    def __init__(self, model:str, system_prompt:str=select_judge_system_prompt, user_prompt:str=user_prompt):
        super().__init__(model=model, system_prompt=system_prompt, metric_name="correctness", structure=JudgeOutput, user_prompt=user_prompt)
        self.embeddings = None
        self.scores_num = []
        self.justifications = []

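    def process_scores(self):
        # Skip evaluations that failed (stored as None) so attribute access does not raise.
        valid_scores = [score for score in self.scores if score is not None]
        # Despite the name, these are still categorical labels; log_metrics maps them to numbers.
        numeric_scores = [score.score for score in valid_scores]
        justifications = [score.justification for score in valid_scores]

        self.embeddings = embedding_model.encode(justifications)

        self.scores_num = numeric_scores
        self.justifications = justifications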


    def make_embeddings_metrics(self):
        if self.embeddings is None:
            raise ValueError("Run process_scores before computing the embedding metrics.")

        # cosine_similarity returns pairwise similarities, not distances.
        similarity_matrix = cosine_similarity(self.embeddings)

        upper_tri_indices = np.triu_indices_from(similarity_matrix, k=1)
        upper_tri_values = similarity_matrix[upper_tri_indices]

        variance = np.var(upper_tri_values)
        std_dev = np.std(upper_tri_values)

        return {
            "cosine_similarity/mean": np.mean(upper_tri_values),
            "cosine_similarity/std": std_dev,
            "cosine_similarity/var": variance
        }

    def make_similarity_metrics(self):

        n = len(self.scores_num)

        lev_matrix = np.zeros((n, n))
        diff_matrix = np.zeros((n, n))

        # Only the upper triangle (i < j) is read below, so skip the diagonal and the lower half.
        for i in range(n):
            for j in range(i + 1, n):
                lev_matrix[i, j] = levenshtein_ratio(self.justifications[i], self.justifications[j])
                diff_matrix[i, j] = SequenceMatcher(None, self.justifications[i], self.justifications[j]).ratio()

        upper_tri_indices = np.triu_indices_from(lev_matrix, k=1)

        upper_tri_lev = lev_matrix[upper_tri_indices]
        upper_tri_diff = diff_matrix[upper_tri_indices]

        return {
            "levenshtein_similarity/mean": np.mean(upper_tri_lev),
            "levenshtein_similarity/std": np.std(upper_tri_lev),
            "levenshtein_similarity/var": np.var(upper_tri_lev),
            "diff_similarity/mean": np.mean(upper_tri_diff),
            "diff_similarity/std": np.std(upper_tri_diff),
            "diff_similarity/var": np.var(upper_tri_diff)
        }

    def categorical_to_numeric(self, score: str) -> float:
        score_mapping = {
            "Bad": 0.0,
            "Regular": 1.0,
            "Well": 2.0,
            "Good": 3.0,
            "Excellent": 4.0
        }

        return score_mapping.get(score, np.nan)

    def log_metrics(self):
        mlflow.log_metrics(self.make_embeddings_metrics())
        mlflow.log_metrics(self.make_similarity_metrics())

        # Map labels to numbers in a local list so self.scores_num is not overwritten
        # and log_metrics can safely be called more than once.
        numeric = [self.categorical_to_numeric(score) for score in self.scores_num if score is not None]

        # nan-aware statistics ignore labels that are missing from the mapping.
        mlflow.log_metrics(
            {
                "scores/mean": np.nanmean(numeric),
                "scores/std": np.nanstd(numeric),
                "scores/var": np.nanvar(numeric)
            }
        )
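
The pairwise comparison in `make_similarity_metrics` can also be written with `itertools.combinations`, which visits each unordered pair once without filling a full matrix. The following is a minimal sketch using only `difflib` from the standard library. The helper name `pairwise_text_similarity` and the sample strings are illustrative assumptions, and the Levenshtein ratio is left out to keep the example dependency-free.

```python
from difflib import SequenceMatcher
from itertools import combinations
from statistics import mean, pstdev, pvariance


def pairwise_text_similarity(texts):
    """Return mean/std/var of SequenceMatcher ratios over all unordered pairs."""
    ratios = [SequenceMatcher(None, a, b).ratio() for a, b in combinations(texts, 2)]
    if not ratios:
        raise ValueError("at least two texts are required")
    return {
        "diff_similarity/mean": mean(ratios),
        "diff_similarity/std": pstdev(ratios),      # population std, like np.std
        "diff_similarity/var": pvariance(ratios),   # population variance, like np.var
    }


# Hypothetical justifications: two identical, one different.
justifications = [
    "The answer is complete and correct.",
    "The answer is complete and correct.",
    "The answer is incomplete.",
]
metrics = pairwise_text_similarity(justifications)
print(metrics)
```

With ten repeated evaluations this covers 45 pairs, the same values that the upper triangle of the matrix holds.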

In [5]:
experiment_name = "judge_selection"
mlflow.set_experiment(experiment_name)

<Experiment: artifact_location='file:///F:/Documentos/git/fraud_ethereum_explanability/mlruns/120965715118781918', creation_time=1754540211730, experiment_id='120965715118781918', last_update_time=1754540211730, lifecycle_stage='active', name='judge_selection', tags={}>

In [6]:
prompts_df = pd.read_csv("data/prompts_example.csv")
example = prompts_df.iloc[0]["prompt"]

with open("data/system_prompt.txt", "r", encoding="utf-8") as f:
    system_prompt = f.read()

message  = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": example}
]

# test_response = ChatOllama(model="gpt-oss:20b", temperature=0.0).invoke(message)

In [7]:
# with open ("data/test_response.txt", "w", encoding="utf-8") as f:
#     f.write(test_response.content)

In [8]:
with open("data/test_response.txt", "r", encoding="utf-8") as f:
    test_response = f.read()

In [None]:
mlflow.langchain.autolog()
for model in tqdm(available_models):
    run_name=f"judge_{model}"

    with mlflow.start_run(run_name=run_name):
        print(f"Evaluando modelo: {model}")
        judge_scorer = JudgeScorer(model=model)

        # Simulate responses and questions
        responses = [test_response] * 10
        questions = [message[0]["content"] + "\n\n" + message[1]["content"]] * 10

        scores = judge_scorer.score(responses, questions)
        judge_scorer.process_scores()

        judge_scorer.log_metrics()

        print(f"Modelo: {model} evaluado y métricas registradas.")

Special case: models such as GPT
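
In this special case the judge's reply is parsed by hand instead of relying on `with_structured_output`, so the parsing step has to tolerate replies that contain no JSON at all. The snippet below is a hedged sketch of that idea: `extract_json_block` is a hypothetical helper, not part of any library, and the sample reply is invented for illustration.

```python
import json
import re

FENCE = "`" * 3  # the Markdown fence marker, built programmatically
JSON_BLOCK = re.compile(re.escape(FENCE) + r"json\s*(.*?)" + re.escape(FENCE), re.DOTALL)


def extract_json_block(reply: str):
    """Return the first fenced JSON object in `reply` as a dict, or None."""
    match = JSON_BLOCK.search(reply)
    if match is None:
        return None
    try:
        parsed = json.loads(match.group(1))
    except json.JSONDecodeError:
        return None
    # Only accept JSON objects; lists or scalars are treated as failures.
    return parsed if isinstance(parsed, dict) else None


reply = f'Here is my verdict:\n{FENCE}json\n{{"score": "Good", "justification": "ok"}}\n{FENCE}'
print(extract_json_block(reply))
print(extract_json_block("no JSON here"))
```

Returning `None` rather than raising lets the caller decide whether to retry, skip, or count the failed evaluation.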

In [13]:
user_prompt = """
### Evaluation Criteria

Rate the response according to the following criteria:

- **Bad**: The response is very poor and does not answer the question.
- **Regular**: The response is poor; it attempts to answer the question but fails, causes confusion, and contradicts itself.
- **Well**: The response is average; it answers the question but not completely, makes incorrect assumptions, and misrepresents concepts.
- **Good**: The response is good; it answers the question completely but includes some errors.
- **Excellent**: The response is excellent; it answers the question completely and correctly.

You must also provide a **justification** for your score, explaining why you gave that rating and what aspects of the response led you to that conclusion.
The main question is to give an explanation for a fraud detection system, so the response should be focused on that topic.

Your response **should** be in a JSON format with the following structure:

```json
{
    "score": "...",
    "justification": "..."
}
```
"""

In [14]:
import json

mlflow.langchain.autolog()
model = "gpt-oss:20b"

run_name=f"judge_{model}"

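def parse_json_response(response: str):
    # Return None when the reply has no fenced json block, instead of raising IndexError.
    parts = response.split("```json", 1)
    if len(parts) < 2:
        return None
    payload = parts[1].split("```")[0].strip()

    try:
        return JudgeOutput(**json.loads(payload))
    except (ValueError, TypeError):
        # json.JSONDecodeError and pydantic's ValidationError both subclass ValueError;
        # TypeError covers JSON that is valid but not an object.
        return None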

class JudgeGPT(JudgeScorer):
    def __init__(self, model:str, system_prompt:str=select_judge_system_prompt, user_prompt:str=user_prompt):
        super().__init__(model=model, system_prompt=system_prompt, user_prompt=user_prompt)
        self.structure = JudgeOutput

    def score(self, responses: list, questions: list) -> list:
        model = ChatOllama(model=self.model, temperature=0.0)

        eval_responses = []

        for response, question in zip(responses , questions):
            message = [
                    {"role": "system", "content": self.system_prompt},
                    {"role": "user", "content": f"# Instructions:\n\n{self.user_prompt}\n\n {{'Answer to evaluate' : {response}}}"}
                ]

            score = model.invoke(message)
            score = parse_json_response(score.content)
            eval_responses.append(score)

        self.scores = eval_responses

        return self.scores

with mlflow.start_run(run_name=run_name):
    print(f"Evaluando modelo: {model}")
    judge_scorer = JudgeGPT(model=model)

    # Simulate responses and questions
    responses = [test_response] * 10
    questions = [message[0]["content"] + "\n\n" + message[1]["content"]] * 10

    scores = judge_scorer.score(responses, questions)
    judge_scorer.process_scores()

    judge_scorer.log_metrics()

    print(f"Modelo: {model} evaluado y métricas registradas.")


Evaluando modelo: gpt-oss:20b
```json
{
    "score": "Excellent",
    "justification": "El texto responde de manera completa y precisa a la solicitud de explicar un sistema de detección de fraude en transacciones de Ethereum. Se presenta una estructura clara con secciones bien diferenciadas: variables transformadas, variables originales, importancia del modelo y análisis de la transacción específica. Cada tabla incluye valores, comentarios y una interpretación contextualizada, lo que facilita la comprensión del funcionamiento del modelo CatBoost. Además, se concluye con un resumen ejecutivo que resume los hallazgos clave y la decisión del modelo (FLAG = 1). No se observan contradicciones ni errores conceptuales significativos, y la información está bien organizada y escrita en Markdown, cumpliendo con los requisitos de formato. Por todo ello, la respuesta merece la calificación de \"Excellent\"."
}
```
```json
{
    "score": "Excellent",
    "justification": "El texto responde de maner