## **Usage and examples**

Before you start, follow the environment setup instructions in the README file.

### **Register available models**

In [1]:
from lib.registry.setup_providers import register_all_models
register_all_models()

[INFO] lib.registry.ModelRegistry: Registered provider 'openai'
[INFO] lib.registry.ModelRegistry: Registered provider 'anthropic'
[INFO] lib.registry.ModelRegistry: Registered provider 'google'
[INFO] lib.registry.ModelRegistry: Registered provider 'openrouter'


You can inspect the contents of the registry in two ways:
- list available providers
- list available models under a specific provider

In [2]:
from lib.registry.ModelRegistry import ModelRegistry

In [5]:
ModelRegistry.list_providers()

['openai', 'anthropic', 'google', 'openrouter']

In [7]:
ModelRegistry.list_models("openrouter")

['google/gemini-2.5-flash-lite',
 'google/gemini-2.5-pro',
 'google/gemini-3-pro-preview',
 'x-ai/grok-4-fast',
 'x-ai/grok-4.1-fast',
 'anthropic/claude-sonnet-4',
 'minimax/minimax-m2',
 'deepseek/deepseek-v3.2']

### **Modify providers and models**

You can modify providers and models in a few ways:
- Add a new model under a specific provider
- Remove a model from a provider's list
- Add a new provider
- Remove a provider

To add a model, you need to provide a model object and a provider object. This project uses the pydantic_ai library to create them. Many more providers are documented on the Pydantic AI website, and some models are compatible with the OpenAI API. When using pydantic_ai, every addition follows the same pattern as the example below, with the model and provider classes replaced by the appropriate ones. For details, see: https://ai.pydantic.dev/models/overview/

In [2]:
from pydantic_ai.models.openrouter import OpenRouterModel
from lib.providers.openrouter_config import OpenRouterConfig
from lib.registry.ModelRegistry import ModelRegistry

provider = OpenRouterConfig()._provider
mistral = OpenRouterModel(model_name="mistralai/devstral-2512:free", provider=provider)
ModelRegistry.add_model("openrouter", "devstral-2512:free", mistral)  # "devstral-2512:free" is the alias used in the registry

[INFO] lib.registry.ModelRegistry: Model 'devstral-2512:free' added to provider 'openrouter'


In [4]:
ModelRegistry.remove_model("openai", "o3-mini")

[INFO] lib.registry.ModelRegistry: Model 'o3-mini' removed from provider 'openai'
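
To make the registry operations above more concrete, here is a minimal in-memory sketch of how such a registry could be organised. It is not the project's `ModelRegistry`: the class name `SketchRegistry`, its instance-based design, and the nested-dictionary storage are all assumptions made for illustration.

```python
from typing import Any, Optional


class SketchRegistry:
    """Hypothetical stand-in for a provider/model registry (not the project's ModelRegistry)."""

    def __init__(self) -> None:
        # provider name -> {model alias -> model object}
        self._providers: dict[str, dict[str, Any]] = {}

    def register_provider(self, name: str, models: Optional[dict[str, Any]] = None) -> None:
        # Copy the mapping so later changes by the caller do not leak into the registry.
        self._providers[name] = dict(models or {})

    def remove_provider(self, name: str) -> None:
        del self._providers[name]

    def add_model(self, provider: str, alias: str, model: Any) -> None:
        if provider not in self._providers:
            raise KeyError(f"Unknown provider: {provider!r}")
        self._providers[provider][alias] = model

    def remove_model(self, provider: str, alias: str) -> None:
        del self._providers[provider][alias]

    def list_providers(self) -> list[str]:
        return list(self._providers)

    def list_models(self, provider: str) -> list[str]:
        return list(self._providers[provider])


registry = SketchRegistry()
registry.register_provider("openrouter")
# object() stands in for a real model object.
registry.add_model("openrouter", "devstral-2512:free", object())
print(registry.list_providers())
print(registry.list_models("openrouter"))
```

Storing models under an alias, as in this sketch, keeps the name used for lookups independent of the full model name expected by the provider.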


To add a provider, follow these steps:
- Add an API key for your new provider as an environment variable (name: *_API_KEY). Restart the kernel so that the key is added to the **keys** dictionary of the **ProviderKeys** class.
- Add a new child class of the **BaseProviderConfig** class. If you are using the pydantic_ai library, the quickest way is to copy an existing provider configuration class, then replace the provider and model classes with the appropriate ones from the Pydantic AI website, and the model names with those defined in the provider's API.
- Register your provider using your configuration class.

In [None]:
from lib.providers.anthropic_config import AnthropicConfig
ModelRegistry.register_provider(AnthropicConfig())

Note: if you don't add an API key, the provider is skipped. The same goes for all built-in providers. This means unused providers don't have to be removed, unless a new one was configured incorrectly.

In [None]:
ModelRegistry.remove_provider("anthropic")
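
The key discovery and the skip behaviour described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the helper names `collect_api_keys` and `providers_to_register` are hypothetical, and how **ProviderKeys** actually maps variable names to providers is not shown in this document.

```python
import os


def collect_api_keys(environ=None):
    """Map lowercased provider names to keys found in variables named <PROVIDER>_API_KEY."""
    environ = os.environ if environ is None else environ
    suffix = "_API_KEY"
    keys = {}
    for name, value in environ.items():
        # Ignore empty values and a bare "_API_KEY" with no provider prefix.
        if name.endswith(suffix) and len(name) > len(suffix) and value:
            keys[name[: -len(suffix)].lower()] = value
    return keys


def providers_to_register(configured, keys):
    """Keep only the configured providers that have a key; the rest are skipped."""
    return [name for name in configured if name in keys]


# A fake environment is used here so the sketch does not depend on real keys.
fake_env = {"OPENAI_API_KEY": "sk-example", "MYPROVIDER_API_KEY": "abc", "PATH": "/usr/bin"}
keys = collect_api_keys(fake_env)
print(providers_to_register(["openai", "anthropic", "myprovider"], keys))
```

With the fake environment above, this prints `['openai', 'myprovider']`: the provider without a key is left out instead of causing an error.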

### **Prompt builders**

Before running a test, check the dataset's column names and decide how the prompt should be built. For example, QA tests require a question and a list of choices, which come from separate columns in the dataset. A few prompt builders are available for common use cases.

Built-in prompt builders include:
- **basic_prompt**: for cases where you just want to pass one column and ask for an answer based on it,
- **provide_choices_prompt**: for cases where you want to attach a list of choices to your prompt (e.g. QA tasks),
- **classification_prompt**: for classification tasks with support for custom labels.

Before running the test be sure to choose one of the above or define your own builder.
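
If none of the built-in builders fits, a custom builder could look like the sketch below. It assumes that a builder receives one dataset row as a mapping and returns the prompt as a string, and that the dataset has `question` and `choices` columns. The function name `lettered_choices_prompt` and the example row are hypothetical.

```python
from typing import Any, Mapping


def lettered_choices_prompt(row: Mapping[str, Any]) -> str:
    """Hypothetical builder: the question followed by lettered choices, one per line."""
    lines = [str(row["question"])]
    for index, choice in enumerate(row["choices"]):
        letter = chr(ord("A") + index)  # 0 -> "A", 1 -> "B", ...
        lines.append(f"{letter}. {choice}")
    return "\n".join(lines)


# Illustrative row; the column names are assumptions about the dataset.
example_row = {
    "question": "Which organ pumps blood through the body?",
    "choices": ["Liver", "Heart", "Lung", "Kidney"],
}
print(lettered_choices_prompt(example_row))
```

If the dataset uses different column names, only the two keys read from `row` need to change.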

## **Run a benchmark**

Steps to run a benchmark:
- Using the **ModelSpec** class, define the ("provider", "model") pairs that you want to test and put them into a list.
- Using the **EvaluationSpec** class, define the evaluation methods you want to use and put them into a list. For each one, specify a name (used in the benchmark result dataset and report) and pass an evaluation function.
- Fill in all the **BenchmarkRunner** class parameters, including the lists created in the previous steps.
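
An evaluation function of the kind passed to **EvaluationSpec** might be written along the lines of the sketch below. It assumes the function receives the path to a JSONL results file and returns a single score. The function name `exact_match_rate` and the field names `response` and `expected` are hypothetical and may not match the files the project writes.

```python
import json
import tempfile
from pathlib import Path


def exact_match_rate(results_path):
    """Hypothetical evaluator: share of rows whose 'response' equals 'expected'."""
    total = 0
    correct = 0
    with open(results_path, encoding="utf-8") as handle:
        for line in handle:
            if not line.strip():
                continue  # tolerate blank lines
            record = json.loads(line)
            total += 1
            if str(record["response"]).strip() == str(record["expected"]).strip():
                correct += 1
    # Avoid division by zero for an empty file.
    return correct / total if total else 0.0


# Tiny illustrative results file (made-up rows, not real benchmark output).
rows = [
    {"response": "Heart", "expected": "Heart"},
    {"response": "Liver", "expected": "Kidney"},
    {"response": " Lung ", "expected": "Lung"},
    {"response": "Femur", "expected": "Femur"},
]
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "results.jsonl"
    path.write_text("\n".join(json.dumps(r) for r in rows), encoding="utf-8")
    score = exact_match_rate(path)
print(score)
```

For the four made-up rows above, the score is 0.75, since three responses match after surrounding whitespace is stripped.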

In [3]:
from lib.BenchmarkRunner import BenchmarkRunner, ModelSpec, EvaluationSpec
from lib.evaluators.evaluate_accuracy import evaluate_accuracy
from lib.prompt_builders.provide_choices_prompt import provide_choices_prompt

In [4]:
models = [
    ModelSpec("openrouter", "devstral-2512:free"),
    ModelSpec("openai", "gpt-4o-mini")
]

judge_prompt = "Evaluate the quality of the provided summary compared to the reference."

evaluations = [
    EvaluationSpec(
        name="accuracy",
        evaluator=lambda path: evaluate_accuracy(path)
    )
]

splits = {'test': 'anatomy/test-00000-of-00001.parquet'}

benchmark = BenchmarkRunner(
    test_name="MMLU_anatomy",
    dataset_path="hf://datasets/cais/mmlu/",
    prompt_builder=lambda row: provide_choices_prompt(row),
    system_prompt="You are taking an anatomy test. Provide the exact text of one of the choices as an answer. Do not include anything else in the answer aside from the correct choice text. Do not add any quotations and do not change letters case.",
    models=models,
    evaluations=evaluations,
    dataset_splits=splits['test'],
    max_concurrency=2,
    output_format="jsonl",
    report_format="json"
)

results = await benchmark.run()

[INFO] lib.BenchmarkRunner: Running benchmark for model openrouter:devstral-2512:free
[INFO] lib.TestRunner: Running test 'MMLU_anatomy' using provider=openrouter model=devstral-2512:free
[INFO] lib.TestRunner: Loading dataset: hf://datasets/cais/mmlu/
[INFO] lib.TestRunner: Loaded dataset with 135 rows
[INFO] lib.TestRunner: Running concurrent generation (135 rows, max_concurrency=2)
[INFO] lib.TestRunner: Results saved to: test_results\MMLU_anatomy\results_2026-01-23_16-31-37.jsonl
[INFO] lib.TestRunner: Test 'MMLU_anatomy' completed successfully
[INFO] lib.BenchmarkRunner: Evaluating openrouter:devstral-2512:free with accuracy
[INFO] lib.BenchmarkRunner: Running benchmark for model openai:gpt-4o-mini
[INFO] lib.TestRunner: Running test 'MMLU_anatomy' using provider=openai model=gpt-4o-mini
[INFO] lib.TestRunner: Loading dataset: hf://datasets/cais/mmlu/
[INFO] lib.TestRunner: Loaded dataset with 135 rows
[INFO] lib.TestRunner: Running concurrent generation (135 rows, max_concurrency