# Agentic Metrics with Groq as Judge

This notebook implements metrics for evaluating agent responses:
- **pass@K**: At least one of the K responses is correct
- **pass^K**: All K responses are correct
- **Tool Correctness**: Evaluates whether tools were used correctly
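
As a minimal sketch of the first two definitions, assuming each of the K responses has already been given a correctness score in [0.0, 1.0] and that a response counts as correct when its score reaches a threshold. The helper names `pass_at_k` and `pass_pow_k` and the sample scores are illustrative, not part of any library:

```python
def pass_at_k(scores: list[float], threshold: float) -> bool:
    """True if at least one of the K scores reaches the threshold."""
    return any(score >= threshold for score in scores)


def pass_pow_k(scores: list[float], threshold: float) -> bool:
    """True only if every one of the K scores reaches the threshold."""
    return all(score >= threshold for score in scores)


# Hypothetical scores for K=4 responses to the same question
scores = [1.0, 0.95, 0.0, 0.65]

print(pass_at_k(scores, threshold=0.8))   # True: two responses pass
print(pass_pow_k(scores, threshold=0.8))  # False: two responses fail
```

pass^K is the stricter of the two: a single failing response among the K is enough to make it false, so it measures consistency rather than best-case ability.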

## 1. Setup and Configuration

In [None]:
import os
import json
from typing import Any
from dataclasses import dataclass
from groq import Groq

# Configure the Groq API key
GROQ_API_KEY = os.getenv("GROQ_API_KEY", "")
if not GROQ_API_KEY:
    print("⚠️  GROQ_API_KEY not found. Set it with: export GROQ_API_KEY='your-key'")
    print("Get your key at: https://console.groq.com/")

# Initialize the Groq client
client = Groq(api_key=GROQ_API_KEY)

# Model to use (options: llama-3.3-70b-versatile, llama-3.1-70b-versatile, mixtral-8x7b-32768)
MODEL = "llama-3.3-70b-versatile"

print(f"✅ Groq client configured with model: {MODEL}")

✅ Groq client configured with model: llama-3.3-70b-versatile


## 2. Data Structures

In [2]:
@dataclass
class AgentResponse:
    """An agent's response to a query."""
    query: str
    answer: str
    agentic: dict[str, Any] | None = None  # Dict with tools_used, final_answer_uses_tools

@dataclass
class GroundTruth:
    """Expected answer (ground truth)."""
    expected_answer: str
    ground_truth_agentic: dict[str, Any] | None = None  # Dict with expected_tools, tool_sequence_matters

@dataclass
class ToolCorrectnessScore:
    """Tool correctness evaluation scores."""
    tool_selection_correct: float  # 0.0 - 1.0
    parameter_accuracy: float      # 0.0 - 1.0
    sequence_correct: float        # 0.0 - 1.0
    result_utilization: float      # 0.0 - 1.0
    overall_correctness: float     # Weighted average
    is_correct: bool               # True if overall >= threshold

@dataclass
class AgenticMetric:
    """Final evaluation metrics."""
    qa_id: str
    k: int
    threshold: float
    correctness_scores: list[float]  # Correctness score for each response (0.0 - 1.0)
    pass_at_k: bool
    pass_pow_k: bool
    correct_indices: list[int]
    tool_correctness: ToolCorrectnessScore | None = None

print("✅ Data structures defined")

✅ Data structures defined


## 2A. Explanation: Structure of the Agentic Dictionaries

### 📥 `agentic` (produced by the agent)

This dictionary describes **which tools the agent used** in its response:

```python
{
    "tools_used": [
        {
            "tool_id": "calc_001",           # Unique tool ID
            "tool_name": "calculator",       # Tool name
            "tool_step": 1,                  # Step in the sequence (1, 2, 3...)
            "parameters": {                  # Parameters passed to the tool
                "operation": "multiply",
                "a": 15,
                "b": 7
            },
            "result": 105                    # Result returned by the tool
        },
        # ... more tools, if any were used
    ],
    "final_answer_uses_tools": True        # Does the final answer use the results?
}
```

### 📤 `ground_truth_agentic` (expected)

This dictionary describes **which tools the agent SHOULD use**:

```python
{
    "expected_tools": [
        {
            "tool_id": "calc_001",           # Expected ID
            "tool_name": "calculator",       # Expected tool
            "tool_step": 1,                  # Expected step
            "parameters": {                  # Correct parameters
                "operation": "multiply",
                "a": 15,
                "b": 7
            }
        },
        # ... more expected tools
    ],
    "tool_sequence_matters": True          # Does the order matter? (True/False)
}
```
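
To make the relationship between the two dictionaries concrete, here is a rough deterministic check of step order. It is only a sketch: the notebook itself delegates this judgment to the Groq model, and the helper `steps_match` is a hypothetical name introduced here, assuming every tool entry carries `tool_name` and `tool_step` as in the structures above.

```python
from typing import Any


def steps_match(tools_used: list[dict[str, Any]],
                expected_tools: list[dict[str, Any]]) -> bool:
    """True when both lists hold the same (tool_step, tool_name) pairs."""
    used = sorted((t["tool_step"], t["tool_name"]) for t in tools_used)
    expected = sorted((t["tool_step"], t["tool_name"]) for t in expected_tools)
    return used == expected


# Expected order: search first, then calculate
expected = [
    {"tool_name": "web_search", "tool_step": 1},
    {"tool_name": "calculator", "tool_step": 2},
]
# Same tools, but the agent swapped the steps
swapped = [
    {"tool_name": "calculator", "tool_step": 1},
    {"tool_name": "web_search", "tool_step": 2},
]

print(steps_match(expected, expected))  # True
print(steps_match(swapped, expected))   # False
```

A check like this could serve as a sanity test alongside the LLM judge, since it ignores `tool_id` and `parameters` and only looks at which tool ran at which step.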

### 🔍 How Tool Correctness Is Evaluated

The `evaluate_tool_correctness()` function sends both dictionaries to Groq, which scores four aspects:

1. **Tool Selection** (weight 0.25):
   - Do the `tool_name` values match?
   - Are any tools missing, or are there extras?

2. **Parameter Accuracy** (weight 0.25):
   - Are the `parameters` identical?
   - Values are compared one by one

3. **Sequence Correct** (weight 0.25):
   - Are the `tool_step` values in the correct order?
   - Only applies if `tool_sequence_matters=True`

4. **Result Utilization** (weight 0.25):
   - Is `final_answer_uses_tools=True`?
   - Do the results appear in the final answer?

**Overall Correctness** = Weighted average of the 4 aspects
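
The arithmetic behind the overall score can be sketched as follows, assuming the four equal weights of 0.25 listed above. The function name `overall_correctness` is illustrative only:

```python
def overall_correctness(
    selection: float,
    parameters: float,
    sequence: float,
    utilization: float,
    weights: tuple[float, float, float, float] = (0.25, 0.25, 0.25, 0.25),
) -> float:
    """Weighted average of the four aspect scores (each in 0.0 - 1.0)."""
    scores = (selection, parameters, sequence, utilization)
    return sum(w * s for w, s in zip(weights, scores))


# Right tools, right parameters, results used, but completely wrong order
overall = overall_correctness(1.0, 1.0, 0.0, 1.0)
print(overall)          # 0.75
print(overall >= 0.75)  # True
print(overall >= 0.8)   # False
```

With equal weights, a response that fails one aspect entirely still scores 0.75, so it passes a threshold of 0.75 and fails one of 0.8. This is worth keeping in mind when choosing the tool threshold.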

## 3. Functions for Calling Groq

Groq evaluates the responses and returns the scores as JSON.
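
Because the judge's reply is model-generated text, it can help to parse it defensively. The sketch below assumes the reply is a JSON object with a numeric `correctness_score` field. The helper `parse_score` is a hypothetical name and makes no Groq call itself:

```python
import json


def parse_score(raw: str, key: str = "correctness_score") -> float:
    """Parse a judge reply and return the score clamped to [0.0, 1.0].

    Returns 0.0 if the reply is not valid JSON, is not an object,
    or holds a non-numeric value under `key`.
    """
    try:
        value = float(json.loads(raw).get(key, 0.0))
    except (json.JSONDecodeError, AttributeError, TypeError, ValueError):
        return 0.0
    return min(1.0, max(0.0, value))


print(parse_score('{"correctness_score": 0.95, "reasoning": "correct, verbose"}'))  # 0.95
print(parse_score('{"correctness_score": 1.7}'))  # 1.0 (clamped)
print(parse_score("not json at all"))             # 0.0
```

Clamping keeps an out-of-range score from distorting the threshold comparison, and falling back to 0.0 treats an unreadable reply as an incorrect answer.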

In [3]:
def evaluate_answer_correctness(query: str, answer: str, ground_truth: str) -> float:
    """
    Use Groq to evaluate whether an answer is correct.

    Returns:
        float: Correctness score between 0.0 (incorrect) and 1.0 (fully correct)
    """
    prompt = f"""You are a STRICT evaluator. Your task is to determine if an agent's answer is correct compared to the ground truth.

**Question:** {query}

**Agent's Answer:** {answer}

**Ground Truth:** {ground_truth}

Evaluate the correctness with STRICT criteria:

1. **Factual Accuracy** (most important): Is the core information factually correct?
2. **Precision**: Spelling errors, typos, or incorrect formatting should be penalized
3. **Completeness**: Does it answer what was asked?
4. **Format**: Natural language variations are acceptable ONLY if facts are perfect

IMPORTANT SCORING RULES:
- 1.0: Identical or perfectly correct with natural rephrasing (same facts, perfect spelling)
- 0.85-0.95: Correct facts with slightly more verbose explanation
- 0.65-0.75: Correct core fact BUT has typo/spelling error (e.g., "Poris" instead of "Paris")
- 0.5-0.65: Mostly correct but missing important details
- 0.3-0.5: Partially correct with significant errors
- 0.0-0.3: Wrong answer or completely incorrect

**Typo/Spelling Penalty**: 
- Single character typo in short answer: MAX score 0.75
- Multiple typos: MAX score 0.5
- Wrong word entirely: score below 0.3

**Wrong Answer**: Factually incorrect information must score below 0.3

Return ONLY a JSON object with this exact format:
{{
  "correctness_score": <float between 0.0 and 1.0>,
  "reasoning": "<brief explanation of your evaluation>"
}}

Examples:
- Q: "Capital of France?", A: "Paris", GT: "Paris" → 1.0 (perfect match)
- Q: "Capital of France?", A: "The capital of France is Paris", GT: "Paris" → 0.95 (correct, verbose)
- Q: "Capital of France?", A: "Poris", GT: "Paris" → 0.7 (TYPO PENALTY - core fact known but misspelled)
- Q: "Capital of France?", A: "Pariis", GT: "Paris" → 0.7 (TYPO PENALTY)
- Q: "Capital of France?", A: "Lyon", GT: "Paris" → 0.0 (completely wrong city)
- Q: "Capital of France?", A: "London", GT: "Paris" → 0.0 (wrong country)
"""

    try:
        response = client.chat.completions.create(
            model=MODEL,
            messages=[{"role": "user", "content": prompt}],
            temperature=0.0,
            response_format={"type": "json_object"}
        )
        
        result = json.loads(response.choices[0].message.content)
        return float(result.get("correctness_score", 0.0))
    
    except Exception as e:
        print(f"❌ Error evaluating answer: {e}")
        return 0.0

print("✅ evaluate_answer_correctness function defined")

✅ evaluate_answer_correctness function defined


In [4]:
def evaluate_tool_correctness(
    agentic: dict[str, Any],
    ground_truth_agentic: dict[str, Any],
    threshold: float = 0.75
) -> ToolCorrectnessScore:
    """
    Use Groq to evaluate whether tools were used correctly.

    Args:
        agentic: Dict with 'tools_used' and 'final_answer_uses_tools'
        ground_truth_agentic: Dict with 'expected_tools' and 'tool_sequence_matters'
        threshold: Minimum overall score for tool usage to count as correct (default: 0.75)

    Returns:
        ToolCorrectnessScore with the individual scores, overall, and is_correct
    """
    tools_used = agentic.get("tools_used", [])
    final_answer_uses_tools = agentic.get("final_answer_uses_tools", False)
    expected_tools = ground_truth_agentic.get("expected_tools", [])
    sequence_matters = ground_truth_agentic.get("tool_sequence_matters", True)
    
    prompt = f"""You are a STRICT evaluator of AI agent tool usage. Evaluate how correctly an agent used tools.

**Tools Used by Agent:**
{json.dumps(tools_used, indent=2)}

**Final Answer Uses Tools:** {final_answer_uses_tools}

**Expected Tools:**
{json.dumps(expected_tools, indent=2)}

**Sequence Matters:** {sequence_matters}

Evaluate the following aspects with STRICT criteria (each scored 0.0 to 1.0):

1. **tool_selection_correct**: 
   - Did the agent select the correct tools (by tool_name)?
   - 1.0 = all correct tools, 0.0 = wrong tools

2. **parameter_accuracy**: 
   - Were the parameters correct for each tool?
   - Compare parameter values EXACTLY
   - 1.0 = all parameters perfect, 0.0 = wrong parameters

3. **sequence_correct**: 
   - Were tools called in the correct order?
   - If sequence_matters=True: Compare tool_step for EACH tool
   - For each tool_name, check if its tool_step matches expected
   - Example: calculator at step 1 when expected at step 2 = WRONG ORDER
   - 1.0 = ALL tool_steps match, 0.0 = wrong order
   - If sequence_matters=False: Always return 1.0

4. **result_utilization**: 
   - Did the agent use the tool results properly?
   - Check if final_answer_uses_tools=True
   - 1.0 = excellent use, 0.0 = didn't use

CRITICAL FOR SEQUENCE:
- Compare each tool's tool_step with its expected tool_step
- "calculator" at step 1 when expected at step 2 = WRONG (0.0)
- "web_search" at step 2 when expected at step 1 = WRONG (0.0)
- Swapped steps = COMPLETELY WRONG ORDER (0.0)

Return ONLY a JSON object:
{{
  "tool_selection_correct": <float 0.0-1.0>,
  "parameter_accuracy": <float 0.0-1.0>,
  "sequence_correct": <float 0.0-1.0>,
  "result_utilization": <float 0.0-1.0>,
  "reasoning": "<explain why sequence_correct has its value>"
}}
"""

    try:
        response = client.chat.completions.create(
            model=MODEL,
            messages=[{"role": "user", "content": prompt}],
            temperature=0.0,
            response_format={"type": "json_object"}
        )
        
        result = json.loads(response.choices[0].message.content)
        
        # If the sequence does not matter, do not penalize it
        if not sequence_matters:
            result["sequence_correct"] = 1.0

        # Compute overall (weighted average, 25% each)
        overall = 0.25 * (
            result["tool_selection_correct"] +
            result["parameter_accuracy"] +
            result["sequence_correct"] +
            result["result_utilization"]
        )
        
        # Decide whether it is correct according to the threshold
        is_correct = overall >= threshold
        
        return ToolCorrectnessScore(
            tool_selection_correct=result["tool_selection_correct"],
            parameter_accuracy=result["parameter_accuracy"],
            sequence_correct=result["sequence_correct"],
            result_utilization=result["result_utilization"],
            overall_correctness=overall,
            is_correct=is_correct
        )
    
    except Exception as e:
        print(f"❌ Error evaluating tool correctness: {e}")
        return ToolCorrectnessScore(0.0, 0.0, 0.0, 0.0, 0.0, False)

print("✅ evaluate_tool_correctness defined (with threshold)")
print("   Returns: ToolCorrectnessScore with is_correct boolean")

✅ evaluate_tool_correctness defined (with threshold)
   Returns: ToolCorrectnessScore with is_correct boolean


## 4. Main Function: Computing Agentic Metrics

In [5]:
def calculate_agentic_metrics(
    qa_id: str,
    responses: list[AgentResponse],
    ground_truth: GroundTruth,
    threshold: float = 0.7,
    tool_threshold: float = 0.75,
    verbose: bool = True
) -> AgenticMetric:
    """
    Compute pass@K, pass^K, and tool correctness for K responses.

    Args:
        qa_id: Identifier of the question
        responses: List of K agent responses
        ground_truth: Expected correct answer
        threshold: Minimum score for a response to count as correct (default: 0.7)
        tool_threshold: Minimum overall score for tool usage to count as correct (default: 0.75)
        verbose: Print progress

    Returns:
        AgenticMetric with all computed metrics
    """
    k = len(responses)
    
    if verbose:
        print(f"\n{'='*70}")
        print(f"Evaluating QA ID: {qa_id}")
        print(f"K = {k} | Answer Threshold = {threshold} | Tool Threshold = {tool_threshold}")
        print(f"{'='*70}\n")

    # 1. Evaluate the correctness of each response
    correctness_scores = []
    correct_indices = []

    for i, response in enumerate(responses):
        if verbose:
            print(f"📝 Evaluating response {i+1}/{k}...")

        score = evaluate_answer_correctness(
            query=response.query,
            answer=response.answer,
            ground_truth=ground_truth.expected_answer
        )

        correctness_scores.append(score)

        if score >= threshold:
            correct_indices.append(i)

        if verbose:
            status = "✅ CORRECT" if score >= threshold else "❌ INCORRECT"
            print(f"   Score: {score:.3f} {status}\n")

    # 2. Compute pass@K and pass^K
    pass_at_k = len(correct_indices) > 0  # At least one correct
    pass_pow_k = len(correct_indices) == k  # All correct

    # 3. Evaluate tool correctness (if tools are expected)
    tool_correctness = None
    if ground_truth.ground_truth_agentic:
        if verbose:
            print("🔧 Evaluating tool correctness...\n")

        # Evaluate the first correct response (or the first response if none is correct)
        response_to_eval = responses[correct_indices[0]] if correct_indices else responses[0]

        if response_to_eval.agentic:
            tool_correctness = evaluate_tool_correctness(
                agentic=response_to_eval.agentic,
                ground_truth_agentic=ground_truth.ground_truth_agentic,
                threshold=tool_threshold
            )

            if verbose:
                print(f"   Tool Selection: {tool_correctness.tool_selection_correct:.3f}")
                print(f"   Parameter Accuracy: {tool_correctness.parameter_accuracy:.3f}")
                print(f"   Sequence Correct: {tool_correctness.sequence_correct:.3f}")
                print(f"   Result Utilization: {tool_correctness.result_utilization:.3f}")
                print(f"   Overall: {tool_correctness.overall_correctness:.3f}")
                print(f"   Is Correct: {tool_correctness.is_correct} ({'✅' if tool_correctness.is_correct else '❌'})\n")

    # 4. Build the result
    metric = AgenticMetric(
        qa_id=qa_id,
        k=k,
        threshold=threshold,
        correctness_scores=correctness_scores,
        pass_at_k=pass_at_k,
        pass_pow_k=pass_pow_k,
        correct_indices=correct_indices,
        tool_correctness=tool_correctness
    )

    if verbose:
        print(f"{'='*70}")
        print("📊 FINAL RESULTS")
        print(f"{'='*70}")
        print(f"   pass@{k}: {pass_at_k} ({'✅' if pass_at_k else '❌'})")
        print(f"   pass^{k}: {pass_pow_k} ({'✅' if pass_pow_k else '❌'})")
        print(f"   Correct: {len(correct_indices)}/{k}")
        if tool_correctness:
            print(f"   Tool Usage Correct: {tool_correctness.is_correct} ({'✅' if tool_correctness.is_correct else '❌'})")
        print(f"{'='*70}\n")

    return metric

print("✅ calculate_agentic_metrics defined (with tool_threshold)")

✅ calculate_agentic_metrics defined (with tool_threshold)


## 5. Usage Example: Simple Question without Tools

In [6]:
# Define the question
query = "What is the capital of France?"

# Define K=4 responses from different agents
responses = [
    AgentResponse(
        query=query,
        answer="Paris"
    ),
    AgentResponse(
        query=query,
        answer="The capital of France is Paris, a beautiful city known for the Eiffel Tower."
    ),
    AgentResponse(
        query=query,
        answer="London"  # Incorrect answer
    ),
    AgentResponse(
        query=query,
        answer="The caital of france is Parissss"  # Partially incorrect answer (typos)
    )
]

# Define the ground truth
ground_truth = GroundTruth(
    expected_answer="Paris"
)

# Compute the metrics
metrics = calculate_agentic_metrics(
    qa_id="capital_france",
    responses=responses,
    ground_truth=ground_truth,
    threshold=0.8,
    verbose=True
)


Evaluating QA ID: capital_france
K = 4 | Answer Threshold = 0.8 | Tool Threshold = 0.75

📝 Evaluating response 1/4...
   Score: 1.000 ✅ CORRECT

📝 Evaluating response 2/4...
   Score: 0.950 ✅ CORRECT

📝 Evaluating response 3/4...
   Score: 0.000 ❌ INCORRECT

📝 Evaluating response 4/4...
   Score: 0.650 ❌ INCORRECT

📊 FINAL RESULTS
   pass@4: True (✅)
   pass^4: False (❌)
   Correct: 2/4



## 5A. Rigor Test: Evaluating Typos

Let's verify that the system now penalizes typos correctly.

In [None]:
# Query that requires multiple tools in a specific order
query = "Search for Tokyo's population, then calculate 10% of it"

# Ground truth: expected order and correct parameters
ground_truth = GroundTruth(
    expected_answer="Tokyo has ~14 million people. 10% is 1.4 million.",
    ground_truth_agentic={
        "expected_tools": [
            {
                "tool_id": "search_001",
                "tool_name": "web_search",
                "tool_step": 1,  # Search first
                "parameters": {"query": "Tokyo population 2024"}
            },
            {
                "tool_id": "calc_001",
                "tool_name": "calculator",
                "tool_step": 2,  # Then calculate
                "parameters": {"operation": "multiply", "a": 14000000, "b": 0.1}
            }
        ],
        "tool_sequence_matters": True  # Order DOES matter
    }
)

# K=4 responses covering different cases
responses = [
    # 1. ALL CORRECT: tools, parameters, and sequence
    AgentResponse(
        query=query,
        answer="Tokyo has 14 million people. 10% is 1.4 million.",
        agentic={
            "tools_used": [
                {"tool_id": "search_001", "tool_name": "web_search", "tool_step": 1,
                 "parameters": {"query": "Tokyo population 2024"}, "result": "14 million"},
                {"tool_id": "calc_001", "tool_name": "calculator", "tool_step": 2,
                 "parameters": {"operation": "multiply", "a": 14000000, "b": 0.1}, "result": 1400000}
            ],
            "final_answer_uses_tools": True
        }
    ),
    # 2. WRONG SEQUENCE: calculates before searching
    AgentResponse(
        query=query,
        answer="10% is 1.4 million.",
        agentic={
            "tools_used": [
                {"tool_id": "calc_002", "tool_name": "calculator", "tool_step": 1,  # ❌ Should be 2
                 "parameters": {"operation": "multiply", "a": 14000000, "b": 0.1}, "result": 1400000},
                {"tool_id": "search_002", "tool_name": "web_search", "tool_step": 2,  # ❌ Should be 1
                 "parameters": {"query": "Tokyo population 2024"}, "result": "14 million"}
            ],
            "final_answer_uses_tools": True
        }
    ),
    # 3. WRONG PARAMETERS: uses addition instead of multiplication
    AgentResponse(
        query=query,
        answer="The result is around 14 million.",
        agentic={
            "tools_used": [
                {"tool_id": "search_003", "tool_name": "web_search", "tool_step": 1,
                 "parameters": {"query": "Tokyo population 2024"}, "result": "14 million"},
                {"tool_id": "calc_003", "tool_name": "calculator", "tool_step": 2,
                 "parameters": {"operation": "add", "a": 14000000, "b": 0.1}, "result": 14000000.1}  # ❌ Addition
            ],
            "final_answer_uses_tools": True
        }
    ),
    # 4. NO TOOLS: answers by hand (incorrect)
    AgentResponse(
        query=query,
        answer="Tokyo has around 10 million people, so 10% is 1 million.",
        agentic=None
    )
]

# Evaluate all responses
print("="*80)
print("EVALUATING MULTIPLE RESPONSES WITH TOOL USAGE")
print("="*80)

for i, response in enumerate(responses, 1):
    print(f"\n{'─'*80}")
    print(f"RESPONSE {i}")
    print(f"{'─'*80}")

    # Evaluate answer correctness
    answer_score = evaluate_answer_correctness(query, response.answer, ground_truth.expected_answer)
    print(f"Answer Score: {answer_score:.3f} ({'✅' if answer_score >= 0.7 else '❌'})")

    # Evaluate tool correctness if tools were used
    if response.agentic and ground_truth.ground_truth_agentic:
        tool_score = evaluate_tool_correctness(
            response.agentic,
            ground_truth.ground_truth_agentic,
            threshold=0.8
        )

        print(f"\n🔧 Tool Correctness:")
        print(f"   Selection: {tool_score.tool_selection_correct:.3f}")
        print(f"   Parameters: {tool_score.parameter_accuracy:.3f}")
        print(f"   Sequence: {tool_score.sequence_correct:.3f}")
        print(f"   Utilization: {tool_score.result_utilization:.3f}")
        print(f"   Overall: {tool_score.overall_correctness:.3f}")
        print(f"   Is Correct: {tool_score.is_correct} ({'✅' if tool_score.is_correct else '❌'})")
    else:
        print(f"\n🔧 Tool Correctness: N/A (no tools used)")

print(f"\n{'='*80}")
print("SUMMARY")
print(f"{'='*80}")
print("Response 1: ✅ ALL correct")
print("Response 2: ❌ Wrong sequence (order reversed)")
print("Response 3: ❌ Wrong parameters (uses addition instead of multiplication)")
print("Response 4: ❌ No tools used")
print(f"{'='*80}\n")

EVALUATING MULTIPLE RESPONSES WITH TOOL USAGE

────────────────────────────────────────────────────────────────────────────────
RESPONSE 1
────────────────────────────────────────────────────────────────────────────────
Answer Score: 1.000 (✅)

🔧 Tool Correctness:
   Selection: 1.000
   Parameters: 1.000
   Sequence: 1.000
   Utilization: 1.000
   Overall: 1.000
   Is Correct: True (✅)

────────────────────────────────────────────────────────────────────────────────
RESPONSE 2
‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚