Release v0.1.2: inspect-ai parity enhancements for task registry and · North-Shore-AI/eval_ex

v0.1.2
4975fc6
Verified

This commit was signed with the committer’s verified signature.

nshkrdotcom

SSH Key Fingerprint: 7E9kicni4Zs9x0ZdPw3mRTQmdFtF9t4LDAbO0Ve5vZA
Release v0.1.2: inspect-ai parity enhancements for task registry and evaluation pipeline

nshkrdotcom tagged this 25 Dec 01:57

This release aligns EvalEx more closely with inspect-ai, focusing on
task definition patterns, dataset integration, and enhanced scorer
capabilities.

Task System Enhancements

Introduced task/2 macro enabling decorator-style task definitions with
automatic registry metadata collection. Tasks can now be defined using a
more ergonomic syntax while maintaining full compatibility with existing
module-based approaches. The Task struct has been expanded to include
inspect-ai compatible fields: display_name, version, solver, setup,
cleanup, metrics, model configuration, model_roles, and resource limits
(message_limit, token_limit, time_limit, working_limit).

Added EvalEx.Task.Definition module to capture task metadata for registry
discovery, including function name, arity, parameters, and attributes.
The registry now supports both module-based tasks and decorated function
definitions through register_module/2.
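
The decorator-plus-registry idea described above can be sketched in Python, the language of inspect-ai. This is an illustration of the pattern only, not EvalEx's Elixir macro: the names `TaskDefinition`, `REGISTRY`, `task`, and `arithmetic` are hypothetical, and the recorded fields simply mirror the metadata listed in these notes (function name, arity, parameters, attributes).

```python
import inspect
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List


@dataclass
class TaskDefinition:
    """Hypothetical record of the metadata a registry could keep per task."""
    name: str
    arity: int
    parameters: List[str]
    attributes: Dict[str, Any] = field(default_factory=dict)


# Hypothetical in-memory registry keyed by function name.
REGISTRY: Dict[str, TaskDefinition] = {}


def task(**attributes: Any) -> Callable:
    """Decorator that records metadata about the function it wraps."""
    def decorator(fn: Callable) -> Callable:
        params = list(inspect.signature(fn).parameters)
        REGISTRY[fn.__name__] = TaskDefinition(
            name=fn.__name__,
            arity=len(params),
            parameters=params,
            attributes=dict(attributes),
        )
        return fn  # the function itself is left unchanged
    return decorator


@task(display_name="Arithmetic QA", version="1")
def arithmetic(limit=10):
    # Placeholder body; a real task would build a dataset, solver, and scorer.
    return {"samples": limit}
```

Looking up `REGISTRY["arithmetic"]` after import returns the collected metadata without calling the task, which is what makes registry discovery cheap.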

Enhanced Task.from_module/1 to build complete task structs from behaviour
implementations, with safe fallbacks for optional callbacks.

Dataset Integration

New EvalEx.Dataset module provides adapters for converting
CrucibleDatasets structures into EvalEx samples. Supports MemoryDataset
items, raw maps, and existing Sample structs. Handles question/choices
input normalization for multiple choice datasets.

Sample struct expanded with sandbox, files, and setup fields to match
inspect-ai Sample capabilities, enabling richer evaluation scenarios with
sandboxed execution environments and file handling.

Scorer Improvements

Completely rewrote EvalEx.Scorer.LLMJudge to support inspect-ai grading
semantics. Now parses "GRADE: C/I/P" patterns with configurable partial
credit (0.5 for P when enabled). Added model and model_role parameters
for explicit grader model specification.
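
A minimal Python sketch of the grade-parsing rule described above, given only as an illustration and not as the LLMJudge implementation. The regular expression, the function name `parse_grade`, the handling of `P` when partial credit is disabled (scored 0.0 here), and the `None` result for output with no grade are all assumptions.

```python
import re
from typing import Optional

# Assumed pattern: a single grade letter following "GRADE:".
GRADE_RE = re.compile(r"GRADE:\s*([CIP])\b")


def parse_grade(completion: str, partial_credit: bool = False) -> Optional[float]:
    """Map a grader completion to a numeric score, or None if no grade is found."""
    match = GRADE_RE.search(completion)
    if match is None:
        return None  # assumption: an unparseable reply yields no score
    letter = match.group(1)
    if letter == "C":
        return 1.0
    if letter == "P" and partial_credit:
        return 0.5
    # "I", or "P" with partial credit disabled (assumed to count as incorrect).
    return 0.0
```

For example, `parse_grade("The answer matches. GRADE: P", partial_credit=True)` yields 0.5, while the same text with the default setting yields 0.0 under the assumption above.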

Default template and instructions now match inspect-ai's model_graded_qa
scorer format, with step-by-step reasoning prompts and explicit grade
letter output requirements.

Added optional metrics/0 callback to Scorer behaviour for exposing
metric identifiers. LLMJudge reports accuracy and stderr metrics.

Metrics Module

Implemented accuracy/1 for computing mean accuracy from numeric scores.
Implemented stderr/1 for computing the standard error of the mean.
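
The two metrics can be sketched in Python as follows. This is an illustration of the formulas, not the EvalEx code: it assumes the standard error uses the sample standard deviation (n - 1 in the denominator), and that empty input or a single score yields 0.0.

```python
import math
from typing import Sequence


def accuracy(scores: Sequence[float]) -> float:
    """Mean of the numeric scores; 0.0 for an empty list (assumption)."""
    if not scores:
        return 0.0
    return sum(scores) / len(scores)


def stderr(scores: Sequence[float]) -> float:
    """Standard error of the mean: sample standard deviation / sqrt(n)."""
    n = len(scores)
    if n < 2:
        return 0.0  # assumption: undefined for fewer than two scores
    mean = sum(scores) / n
    variance = sum((s - mean) ** 2 for s in scores) / (n - 1)
    return math.sqrt(variance / n)
```

With scores `[0.0, 1.0]`, the mean is 0.5, the sample variance is 0.5, and the standard error is sqrt(0.5 / 2) = 0.5.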

Documentation and Testing

Updated README, QUICK_REFERENCE, and QUICK_SUMMARY to reflect new
capabilities and version bump. Added comprehensive documentation in
docs/20251224/INSPECT_AI_PARITY_REQUIREMENTS.md mapping Python
inspect-ai sources to EvalEx implementation coverage.

Test suite expanded from 152 to 164 tests covering new functionality:
dataset adapters, task decorator registration, LLMJudge grade parsing
with partial credit, and metrics helpers. All tests pass, and Dialyzer
checks are clean.

Dependencies

Added crucible_datasets 0.5.1 to support dataset adapter functionality.

This release maintains backward compatibility while providing the
foundation for running inspect-ai style evaluation tasks within the
EvalEx framework.
