Eval System

Agent evaluation framework for CortexPrism, providing structured task suites, pattern-based scoring, regression detection, and pre-built evaluation harnesses.

Architecture

Three files in src/eval/:

File	Purpose
`types.ts`	Core types: EvalTask, EvalResult, EvalSuite, EvalRunSummary, RegressionCheck
`scorer.ts`	Pattern-based scoring and regression detection
`runner.ts`	Suite execution, baseline management, regression analysis

Core Types

EvalTask

Field	Description
`id`	Task identifier
`category`	One of 8 task categories
`description`	Human-readable description
`prompt`	The prompt sent to the agent
`expectedPatterns`	Expected output patterns; all must match for the task to pass (see Scoring)
`expectedFiles`	Files expected to be created/modified with content checks
`expectedExitCode`	Expected shell exit code
`expectedToolSequence`	Expected tool calls in order
`maxRounds`	Maximum tool rounds allowed
`requiredTools`	Tools needed to run this task
`timeoutMs`	Task timeout in milliseconds

Task Categories

Category	Description
`code_generation`	Generate new code
`bug_fix`	Fix bugs in existing code
`refactoring`	Refactor existing code
`code_review`	Review code for issues
`shell_command`	Execute shell commands
`file_operation`	Read/write/edit files
`search_retrieval`	Search and retrieve information
`tool_use_sequence`	Multi-tool sequence operations

EvalResult

Field	Description
`taskId`	Matched task
`taskCategory`	Task category
`passed`	Overall pass/fail
`score`	0.0–1.0
`durationMs`	Execution duration
`tokensUsed`	Total tokens consumed
`costUsd`	Estimated cost
`toolCallsMade`	Number of tool invocations
`error`	Error message if failed
`details`	Per-check details

EvalSuite

{
  name: string;
  description?: string;
  tasks: EvalTask[];
}
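
As an illustration of how these types fit together, the following sketch defines a small suite. The field types are inferred from the tables above rather than copied from `types.ts`, `expectedFiles` is left out because its shape is not specified here, and the task ids, prompts, and tool names are made up for the example.

```typescript
// Field types are assumptions inferred from the tables above;
// the real definitions live in src/eval/types.ts.
type TaskCategory =
  | "code_generation"
  | "bug_fix"
  | "refactoring"
  | "code_review"
  | "shell_command"
  | "file_operation"
  | "search_retrieval"
  | "tool_use_sequence";

interface EvalTask {
  id: string;
  category: TaskCategory;
  description: string;
  prompt: string;
  expectedPatterns?: string[];
  expectedExitCode?: number;
  expectedToolSequence?: string[];
  maxRounds?: number;
  requiredTools?: string[];
  timeoutMs?: number;
  // expectedFiles is omitted: its shape is not specified in this document.
}

interface EvalSuite {
  name: string;
  description?: string;
  tasks: EvalTask[];
}

// Hypothetical suite; ids, prompts, and tool names are illustrative only.
const smokeSuite: EvalSuite = {
  name: "smoke",
  description: "Quick sanity checks",
  tasks: [
    {
      id: "shell-echo",
      category: "shell_command",
      description: "Echo a fixed string",
      prompt: "Run `echo hello` and report the output.",
      expectedPatterns: ["contains:hello", "not_contains:command not found"],
      expectedExitCode: 0,
      requiredTools: ["shell"],
      timeoutMs: 30000,
    },
    {
      id: "docs-lookup",
      category: "search_retrieval",
      description: "Find the default regression threshold",
      prompt: "What is the default regression threshold in the eval system?",
      expectedPatterns: ["regex:0\\.1\\b"],
      maxRounds: 5,
    },
  ],
};
```

The second task shows the kind of `search_retrieval` check described under RAG Eval: the pattern only passes if the retrieved value appears in the response.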

Scoring

The scoreResponse() function evaluates agent output against expected patterns:

Pattern Prefix	Behavior
`regex:<pattern>`	Case-insensitive regex match; detail: "matched" / "no match"
`contains:<text>`	Case-insensitive substring; detail: "found" / "not found"
`not_contains:<text>`	Must NOT contain; detail: "correctly absent" / "found forbidden: ..."
(no prefix)	Default: plain case-insensitive substring match

Pass condition: all patterns must match. If no patterns are provided, the task passes when the output is longer than 10 characters.

Score: matchingPatterns / totalPatterns, or 1.0 if no patterns and output exists.
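
The rules above can be sketched roughly as follows. This is not the code in `scorer.ts`: the function name `scoreResponseSketch`, the return shape, and the detail strings used for unprefixed patterns are assumptions, and the no-pattern case follows the two statements above literally (pass needs more than 10 characters, score is 1.0 for any non-empty output).

```typescript
interface PatternDetail {
  pattern: string;
  matched: boolean;
  detail: string;
}

interface ScoreOutcome {
  passed: boolean;
  score: number;
  details: PatternDetail[];
}

// Evaluate one pattern against the output. An invalid regex will throw here;
// how the real scorer handles that case is not documented.
function checkPattern(output: string, pattern: string): PatternDetail {
  const lower = output.toLowerCase();

  if (pattern.startsWith("regex:")) {
    const source = pattern.slice("regex:".length);
    const matched = new RegExp(source, "i").test(output);
    return { pattern, matched, detail: matched ? "matched" : "no match" };
  }

  if (pattern.startsWith("not_contains:")) {
    const text = pattern.slice("not_contains:".length);
    const present = lower.includes(text.toLowerCase());
    return {
      pattern,
      matched: !present,
      detail: present ? `found forbidden: ${text}` : "correctly absent",
    };
  }

  // "contains:" prefix or no prefix: case-insensitive substring check.
  const text = pattern.startsWith("contains:")
    ? pattern.slice("contains:".length)
    : pattern;
  const found = lower.includes(text.toLowerCase());
  return { pattern, matched: found, detail: found ? "found" : "not found" };
}

function scoreResponseSketch(output: string, patterns: string[] = []): ScoreOutcome {
  if (patterns.length === 0) {
    // No patterns: pass on length, score on mere existence (literal reading).
    return {
      passed: output.length > 10,
      score: output.length > 0 ? 1.0 : 0.0,
      details: [],
    };
  }
  const details = patterns.map((p) => checkPattern(output, p));
  const matching = details.filter((d) => d.matched).length;
  return {
    passed: matching === patterns.length,
    score: matching / patterns.length,
    details,
  };
}
```

With this sketch, an output matching one of two patterns scores 0.5 and fails, since passing requires every pattern to match.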

File Content Scoring

scoreFileContent() checks expected files:

Verifies file existence
Optionally checks that file content contains expected text
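
A minimal sketch of these two checks is shown below. The `ExpectedFile` shape, the `FileReader` callback, and the detail strings are hypothetical; file access is injected so the example stays self-contained, and the content check is case-sensitive here, which the real `scoreFileContent()` may or may not be.

```typescript
// Hypothetical shape for one expectedFiles entry; the real type is in types.ts.
interface ExpectedFile {
  path: string;
  contains?: string;
}

interface FileCheck {
  path: string;
  passed: boolean;
  detail: string;
}

// Returns the file's text, or undefined when the file does not exist.
type FileReader = (path: string) => string | undefined;

function scoreFileContentSketch(expected: ExpectedFile[], readFile: FileReader): FileCheck[] {
  return expected.map(({ path, contains }) => {
    const content = readFile(path);
    if (content === undefined) {
      return { path, passed: false, detail: "file missing" };
    }
    if (contains !== undefined && !content.includes(contains)) {
      return { path, passed: false, detail: `missing expected text: ${contains}` };
    }
    return {
      path,
      passed: true,
      detail: contains === undefined ? "exists" : "exists with expected text",
    };
  });
}

// Usage with an in-memory stand-in for the file system.
const fakeFiles = new Map<string, string>([
  ["src/add.ts", "export const add = (a: number, b: number) => a + b;"],
]);

const fileChecks = scoreFileContentSketch(
  [
    { path: "src/add.ts", contains: "export const add" },
    { path: "src/missing.ts" },
  ],
  (p) => fakeFiles.get(p),
);
```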

Regression Detection

checkRegression() compares previous and current scores:

{ degraded: boolean; delta: number }

Threshold: Default 0.1 score difference
Degraded: previous.score - current.score > threshold
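
The comparison is small enough to sketch in full. The name `checkRegressionSketch` and the parameter types are assumptions; only the return shape, the default threshold, and the degradation rule come from the description above.

```typescript
interface Scored {
  score: number;
}

function checkRegressionSketch(
  previous: Scored,
  current: Scored,
  threshold: number = 0.1,
): { degraded: boolean; delta: number } {
  // Positive delta means the current run scored lower than the previous one.
  const delta = previous.score - current.score;
  return { degraded: delta > threshold, delta };
}

const bigDrop = checkRegressionSketch({ score: 1.0 }, { score: 0.5 });
const smallDrop = checkRegressionSketch({ score: 1.0 }, { score: 0.95 });
const improvement = checkRegressionSketch({ score: 0.5 }, { score: 1.0 });
```

Because the comparison is a strict `>` on floating-point values, a drop that lands almost exactly on the threshold may fall on either side of it; a drop of 0.5 is clearly flagged and a drop of 0.05 is not.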

Runner

runSuite() executes all tasks in a suite sequentially against a provider:

Creates per-task sessions via sessionDbFactory
Executes each task via agentTurn() with a timeout
Scores output against expected patterns
Checks expected files for existence and content
Aggregates results into EvalRunSummary with per-category breakdown
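
The final aggregation step might look roughly like the sketch below. The summary shape (`RunSummarySketch`, `CategoryStats`) is hypothetical, since the fields of `EvalRunSummary` are not listed in this document, and the earlier steps (sessions, `agentTurn()`, timeouts) are not shown.

```typescript
// Subset of the EvalResult fields listed above.
interface EvalResult {
  taskId: string;
  taskCategory: string;
  passed: boolean;
  score: number;
  durationMs: number;
}

// Hypothetical summary shape; the real EvalRunSummary is defined in types.ts.
interface CategoryStats {
  total: number;
  passed: number;
  meanScore: number;
}

interface RunSummarySketch {
  total: number;
  passed: number;
  meanScore: number;
  totalDurationMs: number;
  byCategory: Record<string, CategoryStats>;
}

function summarizeSketch(results: EvalResult[]): RunSummarySketch {
  const buckets = new Map<string, { total: number; passed: number; scoreSum: number }>();
  let passed = 0;
  let scoreSum = 0;
  let totalDurationMs = 0;

  for (const r of results) {
    // Accumulate per-category counters.
    const bucket = buckets.get(r.taskCategory) ?? { total: 0, passed: 0, scoreSum: 0 };
    bucket.total += 1;
    bucket.passed += r.passed ? 1 : 0;
    bucket.scoreSum += r.score;
    buckets.set(r.taskCategory, bucket);

    // Accumulate run-wide counters.
    passed += r.passed ? 1 : 0;
    scoreSum += r.score;
    totalDurationMs += r.durationMs;
  }

  const byCategory: Record<string, CategoryStats> = {};
  buckets.forEach((b, category) => {
    byCategory[category] = {
      total: b.total,
      passed: b.passed,
      meanScore: b.scoreSum / b.total,
    };
  });

  return {
    total: results.length,
    passed,
    meanScore: results.length === 0 ? 0 : scoreSum / results.length,
    totalDurationMs,
    byCategory,
  };
}

const exampleSummary = summarizeSketch([
  { taskId: "fix-1", taskCategory: "bug_fix", passed: true, score: 1.0, durationMs: 100 },
  { taskId: "fix-2", taskCategory: "bug_fix", passed: false, score: 0.5, durationMs: 200 },
  { taskId: "sh-1", taskCategory: "shell_command", passed: true, score: 1.0, durationMs: 50 },
]);
```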

Baseline Comparison

Function	Description
`setBaseline(runId)`	Mark a run as the baseline
`listBaselines()`	List saved baselines
`detectRegressions(previous, current)`	Compare two runs and return degraded tasks

Regression Output

{
  taskId: string;
  previousScore: number;
  currentScore: number;
  degraded: boolean;
  delta: number;  // positive = degradation
}
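
One way to produce records of this shape from two runs is sketched below. The input type `TaskScore`, the function name, and the choice to skip tasks that have no previous score are assumptions; the real `detectRegressions()` may take full run summaries instead.

```typescript
interface TaskScore {
  taskId: string;
  score: number;
}

interface RegressionCheck {
  taskId: string;
  previousScore: number;
  currentScore: number;
  degraded: boolean;
  delta: number; // positive = degradation
}

function detectRegressionsSketch(
  previous: TaskScore[],
  current: TaskScore[],
  threshold: number = 0.1,
): RegressionCheck[] {
  // Index the previous run by task id for O(1) lookups.
  const previousById = new Map<string, number>();
  for (const r of previous) previousById.set(r.taskId, r.score);

  const regressions: RegressionCheck[] = [];
  for (const r of current) {
    const previousScore = previousById.get(r.taskId);
    // Assumption: a task with no previous score has nothing to regress from.
    if (previousScore === undefined) continue;

    const delta = previousScore - r.score;
    if (delta > threshold) {
      regressions.push({
        taskId: r.taskId,
        previousScore,
        currentScore: r.score,
        degraded: true,
        delta,
      });
    }
  }
  return regressions;
}

const degradedTasks = detectRegressionsSketch(
  [
    { taskId: "a", score: 1.0 },
    { taskId: "b", score: 1.0 },
    { taskId: "c", score: 0.5 },
  ],
  [
    { taskId: "a", score: 0.5 }, // dropped by 0.5
    { taskId: "b", score: 1.0 }, // unchanged
    { taskId: "c", score: 1.0 }, // improved
    { taskId: "d", score: 0.0 }, // new task, skipped
  ],
);
```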

RAG Eval

RAG evaluation is supported through search_retrieval category tasks: expected patterns can validate that retrieved information appears in the agent's response.

Pre-built Harnesses

Suites and runs are stored in memory. Suites are registered by name and can be retrieved, listed, and executed. Runs are stored with their summaries and can be promoted to baselines for regression comparison.
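
A store along these lines could be sketched with two maps. The class `InMemoryEvalStore`, the `StoredRun` shape, and every method other than `setBaseline` and `listBaselines` are hypothetical names chosen for the example; whether a suite may have more than one baseline at a time is also an assumption (this sketch allows it).

```typescript
interface EvalSuite {
  name: string;
  description?: string;
  tasks: unknown[];
}

// Hypothetical stored-run shape; the real store keeps full run summaries.
interface StoredRun {
  runId: string;
  suiteName: string;
  meanScore: number;
  isBaseline: boolean;
}

class InMemoryEvalStore {
  private suites = new Map<string, EvalSuite>();
  private runs = new Map<string, StoredRun>();

  registerSuite(suite: EvalSuite): void {
    this.suites.set(suite.name, suite);
  }

  getSuite(suiteName: string): EvalSuite | undefined {
    return this.suites.get(suiteName);
  }

  listSuites(): string[] {
    return Array.from(this.suites.keys());
  }

  saveRun(runId: string, suiteName: string, meanScore: number): void {
    this.runs.set(runId, { runId, suiteName, meanScore, isBaseline: false });
  }

  // Promote a stored run to a baseline for later regression comparison.
  setBaseline(runId: string): void {
    const run = this.runs.get(runId);
    if (run === undefined) throw new Error(`unknown run: ${runId}`);
    run.isBaseline = true;
  }

  listBaselines(): StoredRun[] {
    return Array.from(this.runs.values()).filter((r) => r.isBaseline);
  }
}

const evalStore = new InMemoryEvalStore();
evalStore.registerSuite({ name: "smoke", tasks: [] });
evalStore.saveRun("run-1", "smoke", 0.9);
evalStore.saveRun("run-2", "smoke", 0.8);
evalStore.setBaseline("run-1");
```

Because everything lives in process memory, suites, runs, and baselines in a store like this are lost on restart.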

CLI and REST API

cortex eval list-suites                    # List registered eval suites
cortex eval run <suite-name>               # Run a suite
cortex eval baseline <run-id>              # Set a run as baseline
cortex eval regressions <suite-name>       # Check for regressions

Method	Path	Description
`GET`	`/api/eval/suites`	List suites
`POST`	`/api/eval/suites`	Create suite
`POST`	`/api/eval/suites/:name/run`	Run suite
`GET`	`/api/eval/runs`	List runs
`GET`	`/api/eval/runs/:id`	Get run results
`POST`	`/api/eval/runs/:id/baseline`	Set as baseline
`GET`	`/api/eval/baselines`	List baselines
`GET`	`/api/eval/regressions/:suite`	Detect regressions
