# Agentic Metric Example

This notebook demonstrates how to use the **Agentic** metric from Fair Forge to evaluate AI agent responses with pass@K, pass^K, and tool correctness.

The metric evaluates:
- **pass@K**: At least one of K responses is correct (score >= threshold)
- **pass^K**: All K responses are correct
- **Tool Correctness**: Proper tool usage across 4 dimensions:
  - Selection (25%): Were the correct tools chosen?
  - Parameters (25%): Were correct parameters passed?
  - Sequence (25%): Were tools used in the correct order?
  - Utilization (25%): Were tool results used in the final answer?

## Installation

First, install Fair Forge with agentic support and required dependencies.

In [None]:
%pip install -q "alquimia-fair-forge[agentic]" langchain-groq

## Setup

Import required modules and configure your API key.

In [None]:
import json
import sys
from pathlib import Path

# Add examples directory to path for local imports
sys.path.insert(0, str(Path("../..").resolve()))

from langchain_groq import ChatGroq

from fair_forge import Retriever
from fair_forge.metrics.agentic import Agentic
from fair_forge.schemas import Dataset

In [None]:
import getpass

GROQ_API_KEY = getpass.getpass("Enter your Groq API key: ")

## Create an Agentic Retriever

The Agentic metric requires **K datasets that share the same qa_id values** but have different assistant_id values. Each dataset holds one agent's response to every question, so each question ends up with K responses.

In our example, we have K=3 responses for each question.
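
To make this layout concrete, the sketch below uses plain dictionaries rather than Fair Forge's `Dataset` schema. The field names `assistant_id`, `conversation`, `qa_id`, and `assistant` mirror the attributes accessed later in this notebook; the sample values and the helper `group_by_qa_id` are hypothetical and only illustrate how K responses line up per question.

```python
from collections import defaultdict

# Hypothetical stand-in for the dataset file: one entry per agent (K=3).
sample_datasets = [
    {"assistant_id": "agent_1", "conversation": [
        {"qa_id": "q1", "assistant": "4"},
        {"qa_id": "q2", "assistant": "Paris"},
    ]},
    {"assistant_id": "agent_2", "conversation": [
        {"qa_id": "q1", "assistant": "4"},
        {"qa_id": "q2", "assistant": "Lyon"},
    ]},
    {"assistant_id": "agent_3", "conversation": [
        {"qa_id": "q1", "assistant": "5"},
        {"qa_id": "q2", "assistant": "Paris"},
    ]},
]


def group_by_qa_id(datasets):
    """Collect the answers given to each question, keyed by qa_id."""
    grouped = defaultdict(list)
    for ds in datasets:
        for turn in ds["conversation"]:
            grouped[turn["qa_id"]].append((ds["assistant_id"], turn["assistant"]))
    return dict(grouped)


grouped = group_by_qa_id(sample_datasets)
for qa_id, answers in grouped.items():
    print(f"{qa_id}: K={len(answers)} -> {answers}")
```

Every `qa_id` should map to exactly K entries; a question missing from one dataset would show up here as a shorter list.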

In [None]:
class AgenticRetriever(Retriever):
    """Retriever that loads K agent responses for the same questions."""

    def load_dataset(self) -> list[Dataset]:
        # Resolved relative to the notebook's working directory
        dataset_path = Path("dataset_agentic.json")
        with open(dataset_path, encoding="utf-8") as infile:
            # Validate each raw entry against the Dataset schema
            return [Dataset.model_validate(entry) for entry in json.load(infile)]

## Preview the Dataset

Let's see the structure of our test dataset.

In [None]:
retriever = AgenticRetriever()
datasets = retriever.load_dataset()

print(f"Number of agent responses (K): {len(datasets)}")
print(f"\nAgent IDs: {[ds.assistant_id for ds in datasets]}")
print(f"Number of questions: {len(datasets[0].conversation)}")

print("\nQuestions (qa_ids):")
for batch in datasets[0].conversation:
    print(f"  - {batch.qa_id}: {batch.query}")

print("\nSample responses for first question:")
for ds in datasets:
    batch = ds.conversation[0]
    print(f"\n{ds.assistant_id}:")
    print(f"  Answer: {batch.assistant}")
    print(f"  Tools used: {len(batch.agentic['tools_used'])} tool(s)")

## Initialize the Judge Model

The Agentic metric uses an LLM judge to evaluate answer correctness.

In [None]:
judge_model = ChatGroq(
    model="llama-3.3-70b-versatile",
    api_key=GROQ_API_KEY,
    temperature=0.0,
)

## Run the Agentic Metric

The metric will evaluate:
1. Answer correctness for each of K responses
2. pass@K: At least one response is correct
3. pass^K: All responses are correct
4. Tool correctness: Proper tool usage

In [None]:
metrics = Agentic.run(
    AgenticRetriever,
    model=judge_model,
    threshold=0.7,
    tool_threshold=0.75,
    use_structured_output=False,
    verbose=True,
)

## Analyze Results

Let's examine the evaluation results for each question.

In [None]:
print(f"Total metrics: {len(metrics)}\n")
print("=" * 70)

for metric in metrics:
    print(f"\nQuestion ID: {metric.qa_id}")
    print(f"K (responses): {metric.k}")
    print(f"Threshold: {metric.threshold}")
    print(f"\nCorrectness scores: {[f'{s:.2f}' for s in metric.correctness_scores]}")
    print(f"Correct indices: {metric.correct_indices}")
    print(f"\npass@{metric.k}: {metric.pass_at_k} ({'✓' if metric.pass_at_k else '✗'})")
    print(f"pass^{metric.k}: {metric.pass_pow_k} ({'✓' if metric.pass_pow_k else '✗'})")
    
    if metric.tool_correctness:
        tc = metric.tool_correctness
        print("\nTool Correctness:")
        print(f"  Selection:    {tc.tool_selection_correct:.2f}")
        print(f"  Parameters:   {tc.parameter_accuracy:.2f}")
        print(f"  Sequence:     {tc.sequence_correct:.2f}")
        print(f"  Utilization:  {tc.result_utilization:.2f}")
        print(f"  Overall:      {tc.overall_correctness:.2f} ({'✓' if tc.is_correct else '✗'})")
    
    print("\n" + "-" * 70)

## Summary Statistics

In [None]:
pass_at_k_count = sum(1 for m in metrics if m.pass_at_k)
pass_pow_k_count = sum(1 for m in metrics if m.pass_pow_k)
tool_correct_count = sum(1 for m in metrics if m.tool_correctness and m.tool_correctness.is_correct)

print("Summary Statistics:")
print(f"Total questions evaluated: {len(metrics)}")
print(f"\npass@K success rate: {pass_at_k_count}/{len(metrics)} ({pass_at_k_count/len(metrics)*100:.1f}%)")
print(f"pass^K success rate: {pass_pow_k_count}/{len(metrics)} ({pass_pow_k_count/len(metrics)*100:.1f}%)")
print(f"Tool correctness rate: {tool_correct_count}/{len(metrics)} ({tool_correct_count/len(metrics)*100:.1f}%)")

avg_score = sum(sum(m.correctness_scores) for m in metrics) / sum(len(m.correctness_scores) for m in metrics)
print(f"\nAverage correctness score: {avg_score:.3f}")

## Visualize Results

In [None]:
import matplotlib.pyplot as plt
import numpy as np

fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Plot 1: Correctness scores by question
ax1 = axes[0, 0]
qa_ids = [m.qa_id for m in metrics]
x_pos = np.arange(len(qa_ids))

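k = metrics[0].k
bar_width = 0.75 / k  # 0.25 per bar when K=3
for i in range(k):
    # Questions with fewer than K scores get a zero-height bar
    scores = [m.correctness_scores[i] if i < len(m.correctness_scores) else 0 for m in metrics]
    ax1.bar(x_pos + i * bar_width, scores, bar_width, label=f'Response {i+1}', alpha=0.8)

# Draw the answer threshold that was passed to Agentic.run
ax1.axhline(y=metrics[0].threshold, color='r', linestyle='--', label='Threshold')
ax1.set_xlabel('Question')
ax1.set_ylabel('Correctness Score')
ax1.set_title('Correctness Scores by Question and Response')
ax1.set_xticks(x_pos + bar_width * (k - 1) / 2)  # center ticks under each group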
ax1.set_xticklabels([qa.replace('_', '\n') for qa in qa_ids], rotation=0, ha='center')
ax1.legend()
ax1.grid(axis='y', alpha=0.3)

# Plot 2: pass@K vs pass^K
ax2 = axes[0, 1]
pass_at_k = [m.pass_at_k for m in metrics]
pass_pow_k = [m.pass_pow_k for m in metrics]
x = np.arange(len(qa_ids))
width = 0.35

ax2.bar(x - width/2, pass_at_k, width, label='pass@K', color='green', alpha=0.7)
ax2.bar(x + width/2, pass_pow_k, width, label='pass^K', color='blue', alpha=0.7)
ax2.set_xlabel('Question')
ax2.set_ylabel('Result (True/False)')
ax2.set_title('pass@K vs pass^K by Question')
ax2.set_xticks(x)
ax2.set_xticklabels([qa.replace('_', '\n') for qa in qa_ids], rotation=0, ha='center')
ax2.set_yticks([0, 1])
ax2.set_yticklabels(['False', 'True'])
ax2.legend()
ax2.grid(axis='y', alpha=0.3)

# Plot 3: Tool correctness components
ax3 = axes[1, 0]
components = ['Selection', 'Parameters', 'Sequence', 'Utilization']
x_pos = np.arange(len(components))

for i, m in enumerate(metrics):
    if m.tool_correctness:
        tc = m.tool_correctness
        scores = [
            tc.tool_selection_correct,
            tc.parameter_accuracy,
            tc.sequence_correct,
            tc.result_utilization
        ]
        ax3.plot(x_pos, scores, marker='o', label=m.qa_id, linewidth=2)

ax3.axhline(y=0.75, color='r', linestyle='--', label='Threshold', alpha=0.5)
ax3.set_xlabel('Component')
ax3.set_ylabel('Score')
ax3.set_title('Tool Correctness Components by Question')
ax3.set_xticks(x_pos)
ax3.set_xticklabels(components)
ax3.legend()
ax3.grid(True, alpha=0.3)
ax3.set_ylim(0, 1.05)

# Plot 4: Overall metrics summary
ax4 = axes[1, 1]
metrics_data = [
    pass_at_k_count / len(metrics),
    pass_pow_k_count / len(metrics),
    tool_correct_count / len(metrics),
    avg_score
]
metric_names = ['pass@K\nRate', 'pass^K\nRate', 'Tool\nCorrect Rate', 'Avg\nScore']
colors = ['green', 'blue', 'orange', 'purple']

bars = ax4.bar(metric_names, metrics_data, color=colors, alpha=0.7)
ax4.set_ylabel('Rate / Score')
ax4.set_title('Overall Metrics Summary')
ax4.set_ylim(0, 1.05)
ax4.grid(axis='y', alpha=0.3)

for bar, value in zip(bars, metrics_data):
    height = bar.get_height()
    ax4.text(bar.get_x() + bar.get_width()/2., height + 0.02,
             f'{value:.2f}', ha='center', va='bottom', fontweight='bold')

plt.tight_layout()
plt.show()

## Understanding the Results

### pass@K vs pass^K

- **pass@K**: At least one of K responses is correct. This measures whether an agent *can* produce a correct answer with multiple attempts.
- **pass^K**: All K responses are correct. This measures consistency and reliability.
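
The two definitions above reduce to an `any` and an `all` over the per-response correctness scores. The following is a minimal sketch of that logic, not Fair Forge's implementation; the function names and sample scores are hypothetical, and treating an empty score list as a failure is an assumption made here.

```python
def pass_at_k(scores, threshold):
    """True if at least one response scores at or above the threshold."""
    return any(score >= threshold for score in scores)


def pass_pow_k(scores, threshold):
    """True if every response scores at or above the threshold.

    Assumption: an empty score list counts as a failure.
    """
    return len(scores) > 0 and all(score >= threshold for score in scores)


# Hypothetical correctness scores for K=3 responses to one question.
scores = [0.9, 0.6, 0.8]
threshold = 0.7

print(f"pass@{len(scores)}: {pass_at_k(scores, threshold)}")
print(f"pass^{len(scores)}: {pass_pow_k(scores, threshold)}")
```

With these sample scores, one response falls below 0.7, so the question passes pass@3 but fails pass^3.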

### Tool Correctness

Tool correctness evaluates four aspects:

1. **Selection (25%)**: Did the agent choose the right tools?
2. **Parameters (25%)**: Did the agent pass correct parameters?
3. **Sequence (25%)**: Did the agent use tools in the correct order? (only if `tool_sequence_matters=true`)
4. **Utilization (25%)**: Did the agent use tool results in the final answer?

The overall score is a weighted average of the four component scores. Each weight defaults to 0.25; pass `tool_weights` to change them.
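
As a rough illustration of the weighted average, the sketch below combines four component scores using the same keys that `tool_weights` accepts in this notebook. It is not the library's code: the function name `overall_tool_score`, the sample scores, and the normalization by the sum of the weights are assumptions made for this example.

```python
DEFAULT_WEIGHTS = {
    "selection": 0.25,
    "parameters": 0.25,
    "sequence": 0.25,
    "utilization": 0.25,
}


def overall_tool_score(component_scores, weights=None):
    """Weighted average of component scores.

    Assumption: weights are normalized by their sum, so they need not add up to 1.
    """
    weights = DEFAULT_WEIGHTS if weights is None else weights
    total_weight = sum(weights.values())
    weighted_sum = sum(weights[name] * component_scores[name] for name in weights)
    return weighted_sum / total_weight


# Hypothetical component scores for one question.
component_scores = {
    "selection": 1.0,
    "parameters": 0.5,
    "sequence": 1.0,
    "utilization": 0.5,
}
custom_weights = {"selection": 0.4, "parameters": 0.3, "sequence": 0.2, "utilization": 0.1}

default_score = overall_tool_score(component_scores)
custom_score = overall_tool_score(component_scores, custom_weights)
print(f"Default weights: {default_score:.2f}")
print(f"Custom weights:  {custom_score:.2f}")
```

Shifting weight toward selection raises the overall score here because selection is one of the stronger components in the sample data.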

### Use Cases

- **Agent Reliability**: Compare pass@K vs pass^K to understand if your agent is consistent
- **Tool Usage**: Identify which aspects of tool usage need improvement
- **Model Comparison**: Compare different models or prompting strategies
- **Quality Benchmarking**: Track agent performance over time
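
For the reliability use case, one way to read the two flags together is to sort each question into one of three groups. The sketch below is illustrative only: the labels, the helper `reliability_label`, and the sample results are hypothetical and not part of Fair Forge.

```python
from collections import Counter


def reliability_label(pass_at_k: bool, pass_pow_k: bool) -> str:
    """Classify a question by combining its pass@K and pass^K flags."""
    if pass_pow_k:
        return "consistent"    # every response was correct
    if pass_at_k:
        return "inconsistent"  # some, but not all, responses were correct
    return "failing"           # no response was correct


# Hypothetical per-question results: (qa_id, pass_at_k, pass_pow_k).
results = [
    ("q1", True, True),
    ("q2", True, False),
    ("q3", False, False),
]

labels = {qa_id: reliability_label(at_k, pow_k) for qa_id, at_k, pow_k in results}
counts = Counter(labels.values())

for qa_id, label in labels.items():
    print(f"{qa_id}: {label}")
print(dict(counts))
```

With real results, the same helper can be fed `m.pass_at_k` and `m.pass_pow_k` for each metric `m`; a large "inconsistent" group suggests the agent can solve the task but does not do so reliably.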

## Custom Thresholds and Weights

You can customize thresholds and tool correctness weights:

In [None]:
# Run with custom settings
custom_metrics = Agentic.run(
    AgenticRetriever,
    model=judge_model,
    threshold=0.8,  # Stricter answer threshold
    tool_threshold=0.7,  # More lenient tool threshold
    tool_weights={
        "selection": 0.4,  # Emphasize tool selection
        "parameters": 0.3,  # Emphasize parameters
        "sequence": 0.2,
        "utilization": 0.1,
    },
    verbose=False,
)

print("Custom evaluation with threshold=0.8:")
for m in custom_metrics:
    print(f"  {m.qa_id}: pass@{m.k}={m.pass_at_k}, pass^{m.k}={m.pass_pow_k}")