# 🧠 llm-eval-pro – Demo Notebook

This notebook runs the full evaluation pipeline:

1. Load config & dataset  
2. Run one or more models  
3. Compute metrics  
4. Inspect results and plots  

The default config uses a local `dummy-echo` model so you can run this
without any API keys. Later you can plug in OpenAI or other providers
via `configs/eval_config.yaml`.


In [None]:
from pathlib import Path

import pandas as pd

# Notebooks do not define `__file__`, so derive the root from the working directory.
# This assumes the notebook is run from the `notebooks/` directory.
PROJECT_ROOT = Path.cwd().resolve().parent

CONFIG_PATH = PROJECT_ROOT / "configs" / "eval_config.yaml"
DATASET_PATH = PROJECT_ROOT / "data" / "samples.json"
OUTPUT_DIR = PROJECT_ROOT / "outputs"

print("Project root:", PROJECT_ROOT)
print("Config path:", CONFIG_PATH)
print("Dataset path:", DATASET_PATH)
print("Output dir:", OUTPUT_DIR)

## 1. Inspect dataset

We start by looking at the evaluation dataset, which is what each model
will be asked to answer.
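
Before running anything, it can help to confirm that the records share a consistent shape. The sketch below is a minimal check on a small in-memory list. The field names `id`, `question`, and `reference` are hypothetical stand-ins, since the actual schema of `data/samples.json` is not shown in this notebook.

```python
from collections import Counter


def summarize_keys(records):
    """Count how often each key appears across a list of dict records."""
    counts = Counter()
    for record in records:
        counts.update(record.keys())
    return dict(counts)


def find_incomplete(records, required):
    """Return the indices of records that lack any of the required keys."""
    required = set(required)
    return [i for i, record in enumerate(records) if not required <= set(record)]


# Hypothetical toy records; the real dataset's fields may differ.
toy_records = [
    {"id": 1, "question": "2 + 2?", "reference": "4"},
    {"id": 2, "question": "Capital of France?"},  # missing "reference"
]

key_counts = summarize_keys(toy_records)
missing = find_incomplete(toy_records, required=["id", "question", "reference"])

print("Key counts:", key_counts)
print("Incomplete record indices:", missing)
```

To apply the same check to the real file, pass the loaded `data` list in place of `toy_records` and adjust `required` to the fields your dataset actually uses.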


In [None]:
import json

with DATASET_PATH.open("r", encoding="utf-8") as f:
    data = json.load(f)

print(f"Loaded {len(data)} examples")
pd.DataFrame(data).head()

## 2. Run evaluation

Here we run the main evaluation loop.  
This will:

- load `configs/eval_config.yaml`
- evaluate each configured model
- compute metrics per example
- write `per_example_metrics.csv` and `aggregated_metrics.csv` under `outputs/`
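
The steps above can be sketched as a plain loop. This is an illustration only, not the implementation of `run_evaluation`: the `dummy_echo` function, the `exact_match` metric, and the field names `id`, `question`, and `reference` are all assumptions made for the example.

```python
import csv
import io


def dummy_echo(prompt):
    """Hypothetical stand-in model: returns the prompt unchanged."""
    return prompt


def exact_match(prediction, reference):
    """Return 1.0 if the normalized strings are identical, else 0.0."""
    return float(prediction.strip().lower() == reference.strip().lower())


def evaluate(models, examples):
    """Run every model on every example and collect one row per pair."""
    rows = []
    for model_name, model_fn in models.items():
        for example in examples:
            prediction = model_fn(example["question"])
            rows.append(
                {
                    "model": model_name,
                    "id": example["id"],
                    "prediction": prediction,
                    "exact_match": exact_match(prediction, example["reference"]),
                }
            )
    return rows


# Hypothetical toy data; the real dataset and metrics may differ.
toy_examples = [
    {"id": 1, "question": "hello", "reference": "hello"},
    {"id": 2, "question": "2 + 2?", "reference": "4"},
]

rows = evaluate({"dummy-echo": dummy_echo}, toy_examples)

# Serialize to CSV in memory; a real run would write to a file instead.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()

print(csv_text)
```

An echo model only scores on examples whose reference equals the prompt, which makes it a useful smoke test for the pipeline rather than a meaningful baseline.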


In [None]:
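import sys

# If `eval` is not installed as a package, make the project root importable.
if str(PROJECT_ROOT) not in sys.path:
    sys.path.insert(0, str(PROJECT_ROOT))

from eval.compare_models import run_evaluation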

# Make sure outputs directory exists, then run
OUTPUT_DIR.mkdir(exist_ok=True, parents=True)

run_evaluation(CONFIG_PATH)

## 3. Per-example metrics

We can now inspect the per-example metrics to understand where a model
is doing well or failing.
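
One way to find failures is to filter on a metric and then check which examples are hard for every model. The sketch below uses a toy table, and the column names `example_id` and `exact_match` are hypothetical. The columns in `per_example_metrics.csv` depend on the configured metrics and may differ.

```python
import pandas as pd

# Hypothetical per-example table; real column names may differ.
toy_per_example = pd.DataFrame(
    {
        "model": ["dummy-echo", "dummy-echo", "model-b", "model-b"],
        "example_id": [1, 2, 1, 2],
        "exact_match": [1.0, 0.0, 1.0, 1.0],
    }
)

# Rows where the metric is zero are outright failures.
failures = toy_per_example[toy_per_example["exact_match"] == 0.0]

# Failure rate per example across models, hardest examples first.
failure_rate = (
    1.0 - toy_per_example.groupby("example_id")["exact_match"].mean()
).sort_values(ascending=False)

print(failures)
print(failure_rate)
```

Examples near the top of `failure_rate` fail for most models and are worth reading by hand, since they may point to an ambiguous question or a wrong reference answer rather than a model weakness.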


In [None]:
per_example_path = OUTPUT_DIR / "per_example_metrics.csv"
per_example_df = pd.read_csv(per_example_path)

print(per_example_df.shape)
per_example_df.head()

## 4. Aggregated metrics

For model comparison, we care about metrics aggregated across the whole
evaluation set.
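
As a rough picture of how per-example rows might relate to an aggregated table, the sketch below averages each numeric metric per model. Mean aggregation is an assumption here, as are the column names `exact_match` and `latency_s`. The pipeline may aggregate differently.

```python
import pandas as pd

# Hypothetical per-example table; real column names may differ.
toy_per_example = pd.DataFrame(
    {
        "model": ["dummy-echo", "dummy-echo", "model-b", "model-b"],
        "exact_match": [1.0, 0.0, 1.0, 1.0],
        "latency_s": [0.01, 0.03, 0.40, 0.60],
    }
)

# One row per model, one column per metric, averaged over all examples.
toy_aggregated = toy_per_example.groupby("model", as_index=False).mean(
    numeric_only=True
)

print(toy_aggregated)
```

Keeping `model` as a regular column (`as_index=False`) matches the layout the plotting cell below expects, where `model` is used as the x-axis.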


In [None]:
aggregated_path = OUTPUT_DIR / "aggregated_metrics.csv"
aggregated_df = pd.read_csv(aggregated_path)

aggregated_df

## 5. Visualize model comparison

We generate simple bar charts comparing models across each metric.


In [None]:
from eval.visualize import visualize_aggregated
import matplotlib.pyplot as plt

# Regenerate charts from aggregated metrics
visualize_aggregated(aggregated_path)

# Also display metrics inline in the notebook
plt.style.use("default")

for metric in [c for c in aggregated_df.columns if c != "model"]:
    ax = aggregated_df.plot(
        x="model",
        y=metric,
        kind="bar",
        legend=False,
        title=f"Model comparison – {metric}",
    )
    ax.set_ylabel(metric)
    ax.set_xlabel("model")
    plt.tight_layout()
    plt.show()