Eval System
Agent evaluation framework for CortexPrism, providing structured task suites, pattern-based scoring, regression detection, and pre-built evaluation harnesses.
Three files in src/eval/:
| File | Purpose |
|---|---|
| `types.ts` | Core types: EvalTask, EvalResult, EvalSuite, EvalRunSummary, RegressionCheck |
| `scorer.ts` | Pattern-based scoring and regression detection |
| `runner.ts` | Suite execution, baseline management, regression analysis |
EvalTask fields:
| Field | Description |
|---|---|
| `id` | Task identifier |
| `category` | One of 8 task categories |
| `description` | Human-readable description |
| `prompt` | The prompt sent to the agent |
| `expectedPatterns` | Expected output patterns (at least one must match) |
| `expectedFiles` | Files expected to be created/modified with content checks |
| `expectedExitCode` | Expected shell exit code |
| `expectedToolSequence` | Expected tool calls in order |
| `maxRounds` | Maximum tool rounds allowed |
| `requiredTools` | Tools needed to run this task |
| `timeoutMs` | Task timeout in milliseconds |
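For orientation, the fields above could be typed roughly as follows. This is a sketch, not the contents of `types.ts`: the field types, which fields are optional, and the `ExpectedFile` shape (`path`, `contains`) are assumptions. Only the field names and the eight category names come from this page, and the example task is hypothetical.

```typescript
// Sketch only: field names come from the table above; types and optionality are assumed.
type EvalCategory =
  | "code_generation"
  | "bug_fix"
  | "refactoring"
  | "code_review"
  | "shell_command"
  | "file_operation"
  | "search_retrieval"
  | "tool_use_sequence";

// Hypothetical shape for one expected-file check.
interface ExpectedFile {
  path: string;
  contains?: string;
}

interface EvalTask {
  id: string;
  category: EvalCategory;
  description: string;
  prompt: string;
  expectedPatterns?: string[];
  expectedFiles?: ExpectedFile[];
  expectedExitCode?: number;
  expectedToolSequence?: string[];
  maxRounds?: number;
  requiredTools?: string[];
  timeoutMs?: number;
}

// Hypothetical task used purely for illustration.
const exampleTask: EvalTask = {
  id: "fizzbuzz-ts",
  category: "code_generation",
  description: "Generate a FizzBuzz function in TypeScript",
  prompt: "Write a TypeScript function fizzBuzz(n: number): string.",
  expectedPatterns: ["regex:function\\s+fizzBuzz", "contains:Fizz", "not_contains:TODO"],
  maxRounds: 3,
  timeoutMs: 30000,
};
```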
Task categories:
| Category | Description |
|---|---|
| `code_generation` | Generate new code |
| `bug_fix` | Fix bugs in existing code |
| `refactoring` | Refactor existing code |
| `code_review` | Review code for issues |
| `shell_command` | Execute shell commands |
| `file_operation` | Read/write/edit files |
| `search_retrieval` | Search and retrieve information |
| `tool_use_sequence` | Multi-tool sequence operations |
EvalResult fields:
| Field | Description |
|---|---|
| `taskId` | Matched task |
| `taskCategory` | Task category |
| `passed` | Overall pass/fail |
| `score` | 0.0–1.0 |
| `durationMs` | Execution duration in milliseconds |
| `tokensUsed` | Total tokens consumed |
| `costUsd` | Estimated cost in USD |
| `toolCallsMade` | Number of tool invocations |
| `error` | Error message if failed |
| `details` | Per-check details |
```ts
interface EvalSuite {
  name: string;
  description?: string;
  tasks: EvalTask[];
}
```
The scoreResponse() function evaluates agent output against expected patterns:
| Pattern Prefix | Behavior |
|---|---|
| `regex:<pattern>` | Case-insensitive regex match; detail: "matched" / "no match" |
| `contains:<text>` | Case-insensitive substring; detail: "found" / "not found" |
| `not_contains:<text>` | Must NOT contain; detail: "correctly absent" / "found forbidden: ..." |
| (no prefix) | Default fuzzy contains (case-insensitive substring) |
Pass condition: all patterns must match. If no patterns are provided, the response passes when the output is longer than 10 characters.
Score: matchingPatterns / totalPatterns, or 1.0 if no patterns are provided and output exists.
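The prefix rules, pass condition, and score formula above can be sketched in a few lines. This is an illustration under stated assumptions, not the code in `scorer.ts`: the names `checkPattern`, `scoreResponseSketch`, `PatternDetail`, and `ScoreOutcome` are hypothetical, unprefixed patterns are assumed to reuse the "found" / "not found" detail strings, and empty output with no patterns is assumed to score 0.

```typescript
// Sketch of the prefix rules described above; not the actual scorer.ts source.
interface PatternDetail {
  pattern: string;
  matched: boolean;
  detail: string;
}

interface ScoreOutcome {
  passed: boolean;
  score: number;
  details: PatternDetail[];
}

function checkPattern(output: string, pattern: string): PatternDetail {
  const lower = output.toLowerCase();
  if (pattern.startsWith("regex:")) {
    // Case-insensitive regex match. An invalid pattern would throw here.
    const matched = new RegExp(pattern.slice("regex:".length), "i").test(output);
    return { pattern, matched, detail: matched ? "matched" : "no match" };
  }
  if (pattern.startsWith("not_contains:")) {
    const text = pattern.slice("not_contains:".length);
    const found = lower.includes(text.toLowerCase());
    return {
      pattern,
      matched: !found,
      detail: found ? `found forbidden: ${text}` : "correctly absent",
    };
  }
  // "contains:" and unprefixed patterns are both case-insensitive substring checks.
  const text = pattern.startsWith("contains:") ? pattern.slice("contains:".length) : pattern;
  const matched = lower.includes(text.toLowerCase());
  // Assumption: unprefixed patterns reuse the "found" / "not found" detail strings.
  return { pattern, matched, detail: matched ? "found" : "not found" };
}

function scoreResponseSketch(output: string, patterns: string[] = []): ScoreOutcome {
  if (patterns.length === 0) {
    // No patterns: passing needs more than 10 characters; score is 1.0 when output exists.
    // Assumption: empty output scores 0.
    return { passed: output.length > 10, score: output.length > 0 ? 1.0 : 0.0, details: [] };
  }
  const details = patterns.map((p) => checkPattern(output, p));
  const matching = details.filter((d) => d.matched).length;
  return { passed: matching === patterns.length, score: matching / patterns.length, details };
}
```

With four patterns of which three match, this sketch reports a score of 0.75 and `passed: false`, since every pattern has to match for a pass.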
scoreFileContent() checks expected files:
- Verifies file existence
- Optionally checks that file content contains expected text
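The two file checks above might look like the following. This is a sketch, not the actual scoreFileContent(): a lookup function stands in for the file system so the example stays self-contained, and the names `FileCheck`, `FileCheckResult`, and `checkExpectedFile` are hypothetical. Whether the real content check is case-sensitive is not stated on this page; the sketch assumes it is.

```typescript
// Sketch only: a reader callback replaces real disk access.
interface FileCheck {
  path: string;
  contains?: string; // optional expected text
}

interface FileCheckResult {
  path: string;
  exists: boolean;
  contentOk: boolean;
}

function checkExpectedFile(
  check: FileCheck,
  read: (path: string) => string | undefined,
): FileCheckResult {
  const content = read(check.path);
  if (content === undefined) {
    // File is missing, so the content check cannot succeed either.
    return { path: check.path, exists: false, contentOk: false };
  }
  // Assumption: a case-sensitive substring check; no expected text means content is fine.
  const contentOk = check.contains === undefined || content.includes(check.contains);
  return { path: check.path, exists: true, contentOk };
}

// In-memory stand-in for files an agent might have written.
const fakeFiles = new Map<string, string>([
  ["src/add.ts", "export const add = (a: number, b: number) => a + b;"],
]);
const readFake = (path: string): string | undefined => fakeFiles.get(path);
```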
checkRegression() compares previous and current scores and returns `{ degraded: boolean; delta: number }`:
- Threshold: default 0.1 score difference
- Degraded: `previous.score - current.score > threshold`
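The comparison rule is small enough to sketch directly. The function name `checkRegressionSketch` and the `Scored` type are hypothetical; the default threshold and the inequality are taken from the description above.

```typescript
// Sketch of the regression rule: a drop larger than the threshold counts as degraded.
interface Scored {
  score: number; // 0.0 to 1.0
}

function checkRegressionSketch(
  previous: Scored,
  current: Scored,
  threshold = 0.1,
): { degraded: boolean; delta: number } {
  // Positive delta means the score went down.
  const delta = previous.score - current.score;
  return { degraded: delta > threshold, delta };
}
```

Because the comparison is a strict `>`, a drop that lands exactly on the threshold is not flagged, and floating-point rounding can tip borderline cases either way.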
runSuite() executes all tasks in a suite sequentially against a provider:
- Creates per-task sessions via `sessionDbFactory`
- Executes each task via `agentTurn()` with a timeout
- Scores output against expected patterns
- Checks expected files for existence and content
- Aggregates results into `EvalRunSummary` with per-category breakdown
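The steps above can be sketched as a sequential loop with a per-task timeout. This is a simplified illustration, not the code in `runner.ts`: an injected `AgentFn` stands in for `agentTurn()`, session creation and file checks are omitted, scoring is reduced to plain case-insensitive substring checks, and all type and function names (`MiniTask`, `MiniResult`, `MiniSummary`, `withTimeout`, `runSuiteSketch`) are hypothetical.

```typescript
// Simplified sketch of a sequential suite runner with per-task timeouts.
interface MiniTask {
  id: string;
  category: string;
  prompt: string;
  expectedPatterns: string[];
  timeoutMs: number;
}

interface MiniResult {
  taskId: string;
  taskCategory: string;
  passed: boolean;
  score: number;
  durationMs: number;
  error?: string;
}

interface MiniSummary {
  total: number;
  passed: number;
  byCategory: Record<string, { total: number; passed: number }>;
  results: MiniResult[];
}

// Stand-in for agentTurn(): takes a prompt, resolves to the agent's text output.
type AgentFn = (prompt: string) => Promise<string>;

// Rejects if `work` does not settle within `ms`. The underlying work is not cancelled.
async function withTimeout<T>(work: Promise<T>, ms: number): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_resolve, reject) => {
    timer = setTimeout(() => reject(new Error(`timed out after ${ms} ms`)), ms);
  });
  try {
    return await Promise.race([work, timeout]);
  } finally {
    clearTimeout(timer);
  }
}

async function runSuiteSketch(tasks: MiniTask[], agent: AgentFn): Promise<MiniSummary> {
  const summary: MiniSummary = { total: 0, passed: 0, byCategory: {}, results: [] };
  for (const task of tasks) {
    // Tasks run one at a time, in suite order.
    const started = Date.now();
    let result: MiniResult;
    try {
      const output = (await withTimeout(agent(task.prompt), task.timeoutMs)).toLowerCase();
      const hits = task.expectedPatterns.filter((p) => output.includes(p.toLowerCase())).length;
      const count = task.expectedPatterns.length;
      result = {
        taskId: task.id,
        taskCategory: task.category,
        passed: hits === count,
        score: count === 0 ? 1 : hits / count,
        durationMs: Date.now() - started,
      };
    } catch (err) {
      // Timeouts and agent errors are recorded as failed results, not thrown.
      result = {
        taskId: task.id,
        taskCategory: task.category,
        passed: false,
        score: 0,
        durationMs: Date.now() - started,
        error: err instanceof Error ? err.message : String(err),
      };
    }
    const bucket = summary.byCategory[task.category] ?? { total: 0, passed: 0 };
    bucket.total += 1;
    if (result.passed) bucket.passed += 1;
    summary.byCategory[task.category] = bucket;
    summary.total += 1;
    if (result.passed) summary.passed += 1;
    summary.results.push(result);
  }
  return summary;
}
```

Recording a timeout as a failed result rather than aborting the run keeps one slow task from hiding the results of the rest of the suite; whether the real runner makes the same choice is not stated on this page.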
| Function | Description |
|---|---|
| `setBaseline(runId)` | Mark a run as the baseline |
| `listBaselines()` | List saved baselines |
| `detectRegressions(previous, current)` | Compare two runs and return degraded tasks |
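As a rough illustration of the comparison detectRegressions() performs, the sketch below matches tasks by id across two runs and keeps those whose score dropped by more than the threshold. It is not the actual implementation: the input shape `RunTaskScore`, the name `detectRegressionsSketch`, the 0.1 default threshold being reused here, and the choice to skip tasks missing from the current run are all assumptions. The result fields mirror the RegressionCheck fields documented on this page.

```typescript
// Sketch: compare per-task scores from two runs and keep the degraded ones.
interface RunTaskScore {
  taskId: string;
  score: number;
}

interface RegressionCheckSketch {
  taskId: string;
  previousScore: number;
  currentScore: number;
  degraded: boolean;
  delta: number; // positive = degradation
}

function detectRegressionsSketch(
  previous: RunTaskScore[],
  current: RunTaskScore[],
  threshold = 0.1,
): RegressionCheckSketch[] {
  const currentById = new Map<string, number>();
  for (const r of current) currentById.set(r.taskId, r.score);

  const degradedTasks: RegressionCheckSketch[] = [];
  for (const prev of previous) {
    const currentScore = currentById.get(prev.taskId);
    // Assumption: tasks absent from the current run are skipped, not flagged.
    if (currentScore === undefined) continue;
    const delta = prev.score - currentScore;
    if (delta > threshold) {
      degradedTasks.push({
        taskId: prev.taskId,
        previousScore: prev.score,
        currentScore,
        degraded: true,
        delta,
      });
    }
  }
  return degradedTasks;
}
```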
```ts
interface RegressionCheck {
  taskId: string;
  previousScore: number;
  currentScore: number;
  degraded: boolean;
  delta: number; // positive = degradation
}
```
The eval system supports evaluation of retrieval-augmented generation (RAG) through `search_retrieval` category tasks. Expected patterns can validate that retrieved information appears in the agent's response.
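A retrieval task might pair a lookup prompt with patterns that only a successful retrieval would satisfy. The task below is hypothetical: its id, prompt, and patterns are invented for illustration, and only the `search_retrieval` category name and the pattern prefixes come from this page.

```typescript
// Hypothetical retrieval task: the patterns check that the retrieved fact
// made it into the answer and that the agent did not give up.
const retrievalTask = {
  id: "rag-license-lookup",
  category: "search_retrieval",
  description: "Look up the project license in the documentation",
  prompt: "Search the project docs and state which license CortexPrism is released under.",
  expectedPatterns: ["contains:MIT", "not_contains:could not find"],
  timeoutMs: 60000,
};
```

Pairing a `contains:` pattern for the expected fact with a `not_contains:` pattern for a refusal phrase guards against a response that mentions the fact only while saying it could not be verified.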
In-memory suite and run storage. Suites are registered by name and can be retrieved, listed, and executed. Runs are stored with summaries and can be promoted to baselines for regression comparison.
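A minimal sketch of such a store, assuming plain `Map`s keyed by suite name and run id. The class name, method names other than `setBaseline` and `listBaselines`, and the `StoredRun` fields are hypothetical, and the sketch assumes several baselines may coexist.

```typescript
// Sketch of an in-memory suite and run store; nothing is persisted.
interface StoredSuite {
  name: string;
  description?: string;
  tasks: unknown[];
}

interface StoredRun {
  id: string;
  suiteName: string;
  averageScore: number;
  isBaseline: boolean;
}

class InMemoryEvalStore {
  private suites = new Map<string, StoredSuite>();
  private runs = new Map<string, StoredRun>();

  // Registering a suite under an existing name replaces the earlier one.
  registerSuite(suite: StoredSuite): void {
    this.suites.set(suite.name, suite);
  }

  getSuite(name: string): StoredSuite | undefined {
    return this.suites.get(name);
  }

  listSuites(): string[] {
    return Array.from(this.suites.keys());
  }

  saveRun(run: StoredRun): void {
    this.runs.set(run.id, run);
  }

  // Returns false when the run id is unknown.
  setBaseline(runId: string): boolean {
    const run = this.runs.get(runId);
    if (run === undefined) return false;
    run.isBaseline = true;
    return true;
  }

  listBaselines(): StoredRun[] {
    return Array.from(this.runs.values()).filter((r) => r.isBaseline);
  }
}
```

Since everything lives in process memory, suites, runs, and baselines in a store like this are lost on restart.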
```sh
cortex eval list-suites               # List registered eval suites
cortex eval run <suite-name>          # Run a suite
cortex eval baseline <run-id>         # Set a run as baseline
cortex eval regressions <suite-name>  # Check for regressions
```
| Method | Path | Description |
|---|---|---|
| GET | `/api/eval/suites` | List suites |
| POST | `/api/eval/suites` | Create suite |
| POST | `/api/eval/suites/:name/run` | Run suite |
| GET | `/api/eval/runs` | List runs |
| GET | `/api/eval/runs/:id` | Get run results |
| POST | `/api/eval/runs/:id/baseline` | Set as baseline |
| GET | `/api/eval/baselines` | List baselines |
| GET | `/api/eval/regressions/:suite` | Detect regressions |
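The routes above map onto a run, baseline, regressions workflow. The helpers below only build the method and URL for three of those routes. They are a sketch: the helper names and the `EvalRequest` type are hypothetical, the base URL is a placeholder, and request and response bodies are left out because this page does not document them.

```typescript
// Paths are copied from the table above; everything else is illustrative.
interface EvalRequest {
  method: "GET" | "POST";
  url: string;
}

function runSuiteRequest(baseUrl: string, suiteName: string): EvalRequest {
  return {
    method: "POST",
    url: `${baseUrl}/api/eval/suites/${encodeURIComponent(suiteName)}/run`,
  };
}

function setBaselineRequest(baseUrl: string, runId: string): EvalRequest {
  return {
    method: "POST",
    url: `${baseUrl}/api/eval/runs/${encodeURIComponent(runId)}/baseline`,
  };
}

function regressionsRequest(baseUrl: string, suiteName: string): EvalRequest {
  return {
    method: "GET",
    url: `${baseUrl}/api/eval/regressions/${encodeURIComponent(suiteName)}`,
  };
}

// Placeholder base URL; the actual host and port depend on the deployment.
const placeholderBase = "http://localhost:8080";
const workflow: EvalRequest[] = [
  runSuiteRequest(placeholderBase, "smoke"),
  setBaselineRequest(placeholderBase, "run-1"),
  regressionsRequest(placeholderBase, "smoke"),
];
```

Each descriptor could then be passed to an HTTP client, for example `fetch(req.url, { method: req.method })`.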
- Memori Checkpoints — Agent state preservation for eval reproducibility
- Workflow Engine — DAG-based task orchestration
- Update System — Health checks with similar pass/fail model
CortexPrism — Open-source agentic AI harness · MIT License · Built with Deno 2.x + TypeScript