evaluation pipeline
This release aligns EvalEx more closely with inspect-ai, focusing on
task definition patterns, dataset integration, and scorer capabilities.
Task System Enhancements
Introduced the task/2 macro, which enables decorator-style task
definitions with automatic registry metadata collection. Tasks can now
be defined with this shorter syntax while remaining fully compatible
with existing module-based approaches. The Task struct has been expanded
to include
inspect-ai compatible fields: display_name, version, solver, setup,
cleanup, metrics, model configuration, model_roles, and resource limits
(message_limit, token_limit, time_limit, working_limit).
Added EvalEx.Task.Definition module to capture task metadata for registry
discovery, including function name, arity, parameters, and attributes.
The registry now supports both module-based tasks and decorated function
definitions through register_module/2.
Enhanced Task.from_module/1 to build complete task structs from behaviour
implementations, with safe fallbacks for optional callbacks.
Dataset Integration
New EvalEx.Dataset module provides adapters for converting
CrucibleDatasets structures into EvalEx samples. Supports MemoryDataset
items, raw maps, and existing Sample structs. Handles question/choices
input normalization for multiple choice datasets.
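To illustrate the question/choices normalization, here is a minimal
sketch. The module name DatasetSketch, the normalize/1 function, and the
:question, :choices, :input, and :target keys are assumptions made for
illustration. This is not the EvalEx.Dataset API, and the real adapter
may use different field names and return Sample structs.

```elixir
defmodule DatasetSketch do
  # Hypothetical helper: the key names and return shape are assumptions,
  # not the actual EvalEx.Dataset implementation.

  # Multiple choice item: the question becomes the input and the
  # choices list is carried through unchanged.
  def normalize(%{question: question, choices: choices} = item) when is_list(choices) do
    %{input: question, choices: choices, target: Map.get(item, :target)}
  end

  # Raw map that already has an :input key: fill in defaults only.
  def normalize(%{input: input} = item) do
    %{input: input, choices: Map.get(item, :choices, []), target: Map.get(item, :target)}
  end
end
```

Under these assumptions, both input shapes end up with the same keys, so
downstream scorers only need to handle one layout.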
Sample struct expanded with sandbox, files, and setup fields to match
inspect-ai Sample capabilities, enabling richer evaluation scenarios with
sandboxed execution environments and file handling.
Scorer Improvements
Rewrote EvalEx.Scorer.LLMJudge to support inspect-ai grading semantics.
It now parses "GRADE: C/I/P" patterns (correct, incorrect, partial) with
configurable partial credit (0.5 for P when enabled). Added model and
model_role parameters for explicit grader model specification.
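The grade parsing described above can be sketched roughly as follows.
GradeSketch, parse/2, and the :partial_credit option are hypothetical
names, not the LLMJudge implementation. Scoring C as 1.0 and I as 0.0,
and scoring P as 0.0 when partial credit is disabled, are also
assumptions; only the 0.5 value for P with partial credit enabled comes
from this change.

```elixir
defmodule GradeSketch do
  # Hypothetical helper, not the EvalEx.Scorer.LLMJudge implementation.
  # Returns {:ok, score} for a recognised grade letter, :error otherwise.
  def parse(text, opts \\ []) do
    partial_credit = Keyword.get(opts, :partial_credit, false)

    case Regex.run(~r/GRADE:\s*([CIP])/, text) do
      [_, "C"] -> {:ok, 1.0}
      [_, "I"] -> {:ok, 0.0}
      # P earns half credit only when partial credit is enabled.
      [_, "P"] when partial_credit -> {:ok, 0.5}
      # Assumption: without partial credit, P counts as incorrect.
      [_, "P"] -> {:ok, 0.0}
      nil -> :error
    end
  end
end
```

For example, GradeSketch.parse("Step 1 ...\nGRADE: P", partial_credit: true)
yields a half-credit score, while text with no grade line yields :error
so the caller can decide how to treat an unparseable response.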
Default template and instructions now match inspect-ai's model_graded_qa
scorer format, with step-by-step reasoning prompts and explicit grade
letter output requirements.
Added optional metrics/0 callback to Scorer behaviour for exposing
metric identifiers. LLMJudge reports accuracy and stderr metrics.
Metrics Module
Implemented accuracy/1 for computing mean accuracy from numeric scores.
Implemented stderr/1 for standard error of the mean calculation.
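As a rough illustration of the two helpers, the sketch below computes
the mean of a score list and the standard error of that mean as
s / sqrt(n), where s is the sample standard deviation. MetricsSketch is
a hypothetical module. The use of the n - 1 denominator and the 0.0
result for fewer than two scores are assumptions, not confirmed
behaviour of the EvalEx metrics module.

```elixir
defmodule MetricsSketch do
  # Hypothetical module; mirrors the described helpers under stated assumptions.

  # Mean of a non-empty list of numeric scores.
  def accuracy([_ | _] = scores), do: Enum.sum(scores) / length(scores)

  # Assumption: with fewer than two scores the standard error is reported as 0.0.
  def stderr(scores) when length(scores) < 2, do: 0.0

  # Standard error of the mean: sqrt(sample_variance / n),
  # with the sample variance using the n - 1 denominator (an assumption).
  def stderr(scores) do
    n = length(scores)
    mean = accuracy(scores)

    sum_of_squares =
      scores
      |> Enum.map(fn score -> (score - mean) * (score - mean) end)
      |> Enum.sum()

    :math.sqrt(sum_of_squares / (n - 1) / n)
  end
end
```

With scores [1.0, 0.0] the mean is 0.5, the sample variance is 0.5, and
the standard error is sqrt(0.5 / 2) = 0.5.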
Documentation and Testing
Updated README, QUICK_REFERENCE, and QUICK_SUMMARY to reflect new
capabilities and the version bump. Added
docs/20251224/INSPECT_AI_PARITY_REQUIREMENTS.md, which maps Python
inspect-ai sources to EvalEx implementation coverage.
Test suite expanded from 152 to 164 tests covering new functionality:
dataset adapters, task decorator registration, LLMJudge grade parsing
with partial credit, and metrics helpers. All tests pass and dialyzer
checks are clean.
Dependencies
Added crucible_datasets 0.5.1 to support dataset adapter functionality.
This release maintains backward compatibility while providing the
foundation for running inspect-ai style evaluation tasks within the
EvalEx framework.