# Fair Forge Runners Example

This notebook demonstrates how to use the Fair Forge runners module to execute test datasets against AI systems.

## Overview

The runners module provides:
- **BaseRunner**: Abstract interface for implementing custom runners
- **AlquimiaRunner**: Implementation for Alquimia AI agents
- **Storage backends**: Local filesystem and LakeFS support for loading test datasets and saving results

## Setup

Create a `.env` file with your Alquimia credentials:
```.env
ALQUIMIA_API_KEY=...
ALQUIMIA_URL=...
AGENT_ID=...
CHANNEL_ID=...

```
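
Before configuring the runner, it can help to confirm that these variables are actually set. The following is a minimal sketch, assuming the four variable names from the `.env` file above; `missing_env_vars` is a hypothetical helper and not part of Fair Forge. In the notebook, call it after `load_dotenv()`.

```python
import os

# Variable names taken from the .env file above.
REQUIRED_VARS = ("ALQUIMIA_API_KEY", "ALQUIMIA_URL", "AGENT_ID", "CHANNEL_ID")


def missing_env_vars(names=REQUIRED_VARS, environ=None):
    """Return the names that are unset or empty in the given environment.

    Hypothetical helper for illustration only.
    """
    env = os.environ if environ is None else environ
    return [name for name in names if not env.get(name)]


missing = missing_env_vars()
if missing:
    print(f"Missing environment variables: {', '.join(missing)}")
else:
    print("All required environment variables are set")
```

Checking up front avoids silently falling back to placeholder values such as `"your-api-key"` when a variable is missing.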

Install the required dependencies:
```bash
uv venv
source .venv/bin/activate
uv pip install ".[runners,cloud]"
uv run jupyter lab
```

## Imports

In [1]:
import json
import uuid
from datetime import datetime
from pathlib import Path

from fair_forge.runners import AlquimiaRunner
from fair_forge.schemas import Batch, Dataset
from fair_forge.storage import create_local_storage

## Create Mock Test Data

Create a mock dataset with three test cases to demonstrate the runner functionality:

In [2]:
# Create mock batches (test cases)
mock_batches = [
    Batch(
        qa_id="test_001",
        query="What is the capital of France?",
        assistant="",  # Will be filled by the runner
        ground_truth_assistant="The capital of France is Paris.",
        observation="Basic geography question",
        agentic={},
        ground_truth_agentic={},
    ),
    Batch(
        qa_id="test_002",
        query="Explain quantum computing in simple terms.",
        assistant="",
        ground_truth_assistant="Quantum computing uses quantum mechanics principles to process information...",
        observation="Technical explanation test",
        agentic={},
        ground_truth_agentic={},
    ),
    Batch(
        qa_id="test_003",
        query="Write a haiku about programming.",
        assistant="",
        ground_truth_assistant="Code flows like water\nBugs hide in silent shadows\nDebugger reveals",
        observation="Creative writing test",
        agentic={},
        ground_truth_agentic={},
    ),
]

# Create mock dataset
mock_dataset = Dataset(
    session_id=f"test_session_{uuid.uuid4().hex[:8]}",
    assistant_id="test_assistant_001",
    language="english",
    context="General knowledge and creative writing test suite",
    conversation=mock_batches,
)

print(f"Created mock dataset: {mock_dataset.session_id}")
print(f"Number of test cases: {len(mock_dataset.conversation)}")

Created mock dataset: test_session_22549597
Number of test cases: 3


## Storage Setup

### Option 1: Local Storage

Use local filesystem for storing test datasets and results:

In [3]:
# Create local storage instance
local_storage = create_local_storage(
    tests_dir=Path("./test_datasets"),
    results_dir=Path("./test_results"),
    enabled_suites=None,  # Load all test suites
)

# Save mock dataset to local storage for later loading
test_datasets_dir = Path("./test_datasets")
test_datasets_dir.mkdir(exist_ok=True)

with open(test_datasets_dir / "mock_test_suite.json", "w") as f:
    json.dump(mock_dataset.model_dump(), f, indent=2)

print("Mock dataset saved to local storage")

2026-01-13 10:12:59.870 | INFO     | fair_forge.storage:create_local_storage:31 - Creating local filesystem storage


Mock dataset saved to local storage
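
To see what the saved file round-trips to, the sketch below writes a dictionary to JSON and reads it back. It is self-contained and makes two assumptions: a plain dict stands in for `mock_dataset.model_dump()`, with field names copied from the `Dataset` and `Batch` constructors above, and a temporary directory stands in for `./test_datasets`.

```python
import json
import tempfile
from pathlib import Path

# Plain dict standing in for mock_dataset.model_dump() (an assumption for this
# sketch); field names follow the constructors used earlier in the notebook.
dataset_dict = {
    "session_id": "test_session_example",
    "assistant_id": "test_assistant_001",
    "language": "english",
    "context": "General knowledge and creative writing test suite",
    "conversation": [
        {"qa_id": "test_001", "query": "What is the capital of France?", "assistant": ""},
    ],
}

# Write the JSON file and read it back, using a temporary directory
# instead of ./test_datasets so nothing is left on disk.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "mock_test_suite.json"
    path.write_text(json.dumps(dataset_dict, indent=2), encoding="utf-8")
    reloaded = json.loads(path.read_text(encoding="utf-8"))

print(reloaded["session_id"], len(reloaded["conversation"]))
```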


### Option 2: LakeFS Storage (Optional)

Use LakeFS for cloud-based storage:

In [None]:
# Uncomment and configure if using LakeFS.
# Note: create_lakefs_storage is not imported above; import it before use.
# lakefs_storage = create_lakefs_storage(
#     host="http://lakefs.example.com:8000",
#     username="admin",
#     password="your-password",
#     repo_id="fair-forge-tests",
#     enabled_suites=None,
#     tests_prefix="tests/",
#     results_prefix="results/",
#     branch_name="main",
# )
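
Rather than hard-coding the LakeFS password in the notebook, the connection settings can be read from the environment. This is a minimal sketch: the `LAKEFS_*` variable names and the `lakefs_settings_from_env` helper are assumptions made for illustration, while the dictionary keys mirror the keyword arguments in the commented call above.

```python
import os


def lakefs_settings_from_env(environ=None):
    """Collect LakeFS connection settings from environment variables.

    The LAKEFS_* names are hypothetical; adjust them to your deployment.
    """
    env = os.environ if environ is None else environ
    return {
        "host": env.get("LAKEFS_HOST", "http://lakefs.example.com:8000"),
        "username": env.get("LAKEFS_USERNAME", "admin"),
        "password": env["LAKEFS_PASSWORD"],  # no default: raise KeyError if unset
        "repo_id": env.get("LAKEFS_REPO_ID", "fair-forge-tests"),
        "enabled_suites": None,
        "tests_prefix": "tests/",
        "results_prefix": "results/",
        "branch_name": env.get("LAKEFS_BRANCH", "main"),
    }


# Demonstration with an explicit mapping instead of the real environment.
settings = lakefs_settings_from_env({"LAKEFS_PASSWORD": "example-only"})
print(sorted(settings))
```

With the real environment populated, the resulting dictionary could then be unpacked into the call, for example `create_lakefs_storage(**lakefs_settings_from_env())`.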

## Runner Setup

### AlquimiaRunner Configuration

Configure the Alquimia runner to execute tests against your AI agent:

In [4]:
import os

from dotenv import load_dotenv

# Load environment variables
load_dotenv()

# Configure Alquimia runner
# NOTE: Set these environment variables or replace with your actual values
runner = AlquimiaRunner(
    base_url=os.getenv("ALQUIMIA_URL", "https://api.alquimia.ai"),
    api_key=os.getenv("ALQUIMIA_API_KEY", "your-api-key"),
    agent_id=os.getenv("AGENT_ID", "your-agent-id"),
    channel_id=os.getenv("CHANNEL_ID", "your-channel-id"),
    api_version=os.getenv("ALQUIMIA_VERSION", ""),
)

print("Runner configured successfully")

Runner configured successfully


## Execute Tests

### Run Single Batch

Execute a single test case:

In [6]:
# Run a single batch
async def run_single_batch():
    batch = mock_batches[0]
    session_id = f"test_session_{uuid.uuid4().hex[:8]}"

    print(f"Running batch: {batch.qa_id}")
    print(f"Query: {batch.query}")

    updated_batch, success, exec_time = await runner.run_batch(batch, session_id)

    print(f"\nSuccess: {success}")
    print(f"Execution time: {exec_time:.2f}ms")
    print(f"Response: {updated_batch.assistant}")

    return updated_batch


# Execute (uncomment to run)
# result = await run_single_batch()
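
Top-level `await` works in Jupyter because the notebook already runs an event loop. In a plain Python script the coroutine has to be started with `asyncio.run` instead. The sketch below shows the pattern with a stand-in coroutine; `fake_run_batch` is hypothetical and only mimics the `(result, success, exec_time)` return shape of `runner.run_batch`.

```python
import asyncio


async def fake_run_batch(query: str) -> tuple[str, bool, float]:
    """Stand-in for runner.run_batch; returns (response, success, time_ms)."""
    await asyncio.sleep(0)  # yield to the event loop once, as a real call would
    return f"echo: {query}", True, 0.0


def main() -> tuple[str, bool, float]:
    # asyncio.run creates an event loop, runs the coroutine, then closes the loop.
    return asyncio.run(fake_run_batch("What is the capital of France?"))


result = main()
print(result)
```

Note that `asyncio.run` raises `RuntimeError` when called from a thread that already has a running event loop, so keep using `await` inside the notebook itself.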

### Run Complete Dataset

Execute all test cases in a dataset:

In [7]:
# Run complete dataset
async def run_complete_dataset():
    print(f"Running dataset: {mock_dataset.session_id}")
    print(f"Total batches: {len(mock_dataset.conversation)}\n")

    updated_dataset, summary = await runner.run_dataset(mock_dataset)

    print("\n" + "=" * 70)
    print("EXECUTION SUMMARY")
    print("=" * 70)
    print(f"Session ID: {summary['session_id']}")
    print(f"Total batches: {summary['total_batches']}")
    print(f"Successes: {summary['successes']}")
    print(f"Failures: {summary['failures']}")
    print(f"Total execution time: {summary['total_execution_time_ms']:.2f}ms")
    print(f"Average batch time: {summary['avg_batch_time_ms']:.2f}ms")
    print("=" * 70)

    return updated_dataset, summary


# Execute (uncomment to run)
# results, summary = await run_complete_dataset()
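
The summary keys printed above relate to one another in a simple way. The following sketch builds a dictionary with the same keys from per-batch `(success, exec_time_ms)` pairs. It is an illustration under assumptions: `summarize` is a hypothetical helper, and the real runner may measure total time differently (for example, including overhead between batches).

```python
def summarize(session_id: str, outcomes: list[tuple[bool, float]]) -> dict:
    """Build a summary dict from (success, exec_time_ms) pairs.

    Hypothetical helper; the keys match those read from `summary` above.
    """
    total = len(outcomes)
    successes = sum(1 for ok, _ in outcomes if ok)
    total_ms = sum(ms for _, ms in outcomes)
    return {
        "session_id": session_id,
        "total_batches": total,
        "successes": successes,
        "failures": total - successes,
        "total_execution_time_ms": total_ms,
        # Guard against an empty dataset to avoid dividing by zero.
        "avg_batch_time_ms": total_ms / total if total else 0.0,
    }


example = summarize("test_session_example", [(True, 100.0), (False, 300.0)])
print(example)
```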

## Save Results

Save execution results to storage:

In [8]:
# Save results to local storage
async def save_results(updated_dataset):
    run_id = str(uuid.uuid4())
    timestamp = datetime.now()

    result_path = local_storage.save_results(
        datasets=[updated_dataset],
        run_id=run_id,
        timestamp=timestamp,
    )

    print(f"Results saved to: {result_path}")
    return result_path


# Execute (uncomment to run after running dataset)
# result_path = await save_results(results)
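
The sample output in the pipeline section of this notebook shows a result path of the form `test_results/test_run_<date>_<time>_<run_id>.json`. The sketch below reproduces that naming pattern from a `run_id` and `timestamp`. It is inferred from the sample output only; `result_file_path` is a hypothetical helper, and the storage backend's actual implementation may differ.

```python
from datetime import datetime
from pathlib import Path


def result_file_path(results_dir: Path, run_id: str, timestamp: datetime) -> Path:
    """Build a result path matching the pattern seen in the sample output.

    Hypothetical helper; not taken from the Fair Forge source.
    """
    stamp = timestamp.strftime("%Y%m%d_%H%M%S")
    return results_dir / f"test_run_{stamp}_{run_id}.json"


example_path = result_file_path(
    Path("test_results"),
    "da1b62d2-2379-4dca-b0ec-1b05091f1b1c",
    datetime(2026, 1, 13, 10, 17, 59),
)
print(example_path)
```

Including both the timestamp and the UUID keeps file names sortable by time while avoiding collisions between runs started in the same second.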

## Load Datasets

Load test datasets from storage:

In [9]:
# Load datasets from local storage
loaded_datasets = local_storage.load_datasets()

print(f"Loaded {len(loaded_datasets)} dataset(s)")
for ds in loaded_datasets:
    print(f"  - {ds.session_id}: {len(ds.conversation)} batches")

2026-01-13 10:17:31.046 | INFO     | fair_forge.storage.local_storage:load_datasets:40 - Loading test datasets from test_datasets
2026-01-13 10:17:31.048 | INFO     | fair_forge.storage.local_storage:load_datasets:49 - Found 1 JSON test file(s)
2026-01-13 10:17:31.049 | INFO     | fair_forge.storage.local_storage:load_datasets:59 - Loading test dataset: mock_test_suite
2026-01-13 10:17:31.051 | SUCCESS  | fair_forge.storage.local_storage:load_datasets:76 - Loaded dataset from mock_test_suite
2026-01-13 10:17:31.053 | INFO     | fair_forge.storage.local_storage:load_datasets:86 - Loaded 1 dataset(s) with 3 total test case(s)


Loaded 1 dataset(s)
  - test_session_22549597: 3 batches


## Complete Pipeline Example

Put it all together in a complete pipeline:

In [11]:
async def complete_pipeline():
    """Complete test execution pipeline."""

    # 1. Load datasets from storage
    print("Step 1: Loading test datasets...")
    datasets = local_storage.load_datasets()
    print(f"Loaded {len(datasets)} dataset(s)\n")

    if not datasets:
        print("No datasets found!")
        return

    # 2. Execute all datasets
    print("Step 2: Executing datasets...")
    executed_datasets = []
    all_summaries = []

    for i, dataset in enumerate(datasets, 1):
        print(f"\n[{i}/{len(datasets)}] Processing: {dataset.session_id}")
        updated_dataset, summary = await runner.run_dataset(dataset)
        executed_datasets.append(updated_dataset)
        all_summaries.append(summary)

    # 3. Save results
    print("\nStep 3: Saving results...")
    run_id = str(uuid.uuid4())
    timestamp = datetime.now()
    result_path = local_storage.save_results(
        datasets=executed_datasets,
        run_id=run_id,
        timestamp=timestamp,
    )

    # 4. Print overall summary
    print("\n" + "=" * 70)
    print("OVERALL SUMMARY")
    print("=" * 70)
    total_batches = sum(s["total_batches"] for s in all_summaries)
    total_successes = sum(s["successes"] for s in all_summaries)
    total_failures = sum(s["failures"] for s in all_summaries)
    print(f"Total datasets: {len(datasets)}")
    print(f"Total test cases: {total_batches}")
    print(f"Successes: {total_successes}")
    print(f"Failures: {total_failures}")
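    # Guard against division by zero when the loaded datasets contain no test cases
    success_rate = (total_successes / total_batches * 100) if total_batches else 0.0
    print(f"Success rate: {success_rate:.1f}%")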
    print(f"Results saved to: {result_path}")
    print("=" * 70)


# Execute the complete pipeline (sends requests to the configured Alquimia agent)
await complete_pipeline()

2026-01-13 10:17:54.126 | INFO     | fair_forge.storage.local_storage:load_datasets:40 - Loading test datasets from test_datasets
2026-01-13 10:17:54.127 | INFO     | fair_forge.storage.local_storage:load_datasets:49 - Found 1 JSON test file(s)
2026-01-13 10:17:54.127 | INFO     | fair_forge.storage.local_storage:load_datasets:59 - Loading test dataset: mock_test_suite
2026-01-13 10:17:54.128 | SUCCESS  | fair_forge.storage.local_storage:load_datasets:76 - Loaded dataset from mock_test_suite
2026-01-13 10:17:54.129 | INFO     | fair_forge.storage.local_storage:load_datasets:86 - Loaded 1 dataset(s) with 3 total test case(s)
2026-01-13 10:17:54.129 | INFO     | fair_forge.runners.alquimia_runner:

Step 1: Loading test datasets...
Loaded 1 dataset(s)

Step 2: Executing datasets...

[1/1] Processing: test_session_22549597


2026-01-13 10:17:55,466 - httpx - INFO - HTTP Request: POST https://alquimia-hermes-alquimia-runtime.apps.rosa.alquimia.zvb4.p3.openshiftapps.com/event/infer/chat/vo-community-assistant?chat_history=50&agentspace=_default "HTTP/1.1 200 OK"
2026-01-13 10:17:55,875 - httpx - INFO - HTTP Request: GET https://alquimia-hermes-alquimia-runtime.apps.rosa.alquimia.zvb4.p3.openshiftapps.com/event/stream/task-b51a600b3a724608a6e69e95fb54925a?response_only=true "HTTP/1.1 200 OK"
2026-01-13 10:17:56.385 | DEBUG    | fair_forge.runners.alquimia_runner:run_batch:111 -   ✓ Batch test_001 completed (2253.7ms)
2026-01-13 10:17:56.385 | DEBUG    | fair_forge.runners.alquimia_runner:run_dataset:150 -   Batch 2/3: test_002
2026-01-13 10:17:57,206 - httpx - INFO - HTTP Request: POST https://alquimia-hermes-alquimia-runtime.apps.rosa.alquimia.zvb4.p3.openshiftapps.com/event/infer/chat/


Step 3: Saving results...

OVERALL SUMMARY
Total datasets: 1
Total test cases: 3
Successes: 3
Failures: 0
Success rate: 100.0%
Results saved to: test_results/test_run_20260113_101759_da1b62d2-2379-4dca-b0ec-1b05091f1b1c.json


## Creating Custom Runners

You can create custom runner implementations by extending `BaseRunner`:

In [None]:
import time
from typing import Any

from fair_forge.schemas.runner import BaseRunner


class MockRunner(BaseRunner):
    """Example mock runner for testing."""

    async def run_batch(self, batch: Batch, session_id: str, **kwargs: Any) -> tuple[Batch, bool, float]:
        """Return a mock response immediately."""
        # perf_counter is a monotonic clock, better suited to measuring durations
        start = time.perf_counter()

        # Mock response
        mock_response = f"Mock response for: {batch.query}"
        updated_batch = batch.model_copy(update={"assistant": mock_response})

        exec_time = (time.perf_counter() - start) * 1000  # milliseconds
        return updated_batch, True, exec_time

    async def run_dataset(self, dataset: Dataset, **kwargs: Any) -> tuple[Dataset, dict[str, Any]]:
        """Run all batches in the dataset and build a summary."""
        start = time.perf_counter()

        updated_batches = []
        successes = 0
        for batch in dataset.conversation:
            updated_batch, success, _ = await self.run_batch(batch, dataset.session_id)
            updated_batches.append(updated_batch)
            successes += int(success)

        updated_dataset = dataset.model_copy(update={"conversation": updated_batches})

        total = len(updated_batches)
        total_ms = (time.perf_counter() - start) * 1000
        summary = {
            "session_id": dataset.session_id,
            "total_batches": total,
            "successes": successes,
            "failures": total - successes,
            "total_execution_time_ms": total_ms,
            # Computed from the measured total instead of a hard-coded value
            "avg_batch_time_ms": total_ms / total if total else 0.0,
        }

        return updated_dataset, summary


# Test mock runner
mock_runner = MockRunner()
print("Custom MockRunner created successfully")