Eval System

scarecr0w12 edited this page Jun 19, 2026 · 2 revisions

Agent evaluation framework for CortexPrism, providing structured task suites, pattern-based scoring, regression detection, and pre-built evaluation harnesses.

Architecture

Three files in src/eval/:

| File | Purpose |
| --- | --- |
| types.ts | Core types: EvalTask, EvalResult, EvalSuite, EvalRunSummary, RegressionCheck |
| scorer.ts | Pattern-based scoring and regression detection |
| runner.ts | Suite execution, baseline management, regression analysis |

Core Types

EvalTask

| Field | Description |
| --- | --- |
| id | Task identifier |
| category | One of 8 task categories |
| description | Human-readable description |
| prompt | The prompt sent to the agent |
| expectedPatterns | Expected output patterns, checked by scoreResponse() (see Scoring for the pass condition) |
| expectedFiles | Files expected to be created/modified with content checks |
| expectedExitCode | Expected shell exit code |
| expectedToolSequence | Expected tool calls in order |
| maxRounds | Maximum tool rounds allowed |
| requiredTools | Tools needed to run this task |
| timeoutMs | Task timeout in milliseconds |

Task Categories

| Category | Description |
| --- | --- |
| code_generation | Generate new code |
| bug_fix | Fix bugs in existing code |
| refactoring | Refactor existing code |
| code_review | Review code for issues |
| shell_command | Execute shell commands |
| file_operation | Read/write/edit files |
| search_retrieval | Search and retrieve information |
| tool_use_sequence | Multi-tool sequence operations |
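
The field and category tables above map naturally onto TypeScript types. The following is a minimal sketch, not the actual contents of src/eval/types.ts: the field types, which fields are optional, the shape of an expectedFiles entry, and every value in the example task are assumptions made for illustration.

```typescript
// Hypothetical reconstruction of the task types; field types and optionality are assumptions.
type TaskCategory =
  | "code_generation"
  | "bug_fix"
  | "refactoring"
  | "code_review"
  | "shell_command"
  | "file_operation"
  | "search_retrieval"
  | "tool_use_sequence";

// Assumed shape of an expected-file check: a path plus optional required text.
interface ExpectedFile {
  path: string;
  contains?: string;
}

interface EvalTask {
  id: string;
  category: TaskCategory;
  description: string;
  prompt: string;
  expectedPatterns?: string[];
  expectedFiles?: ExpectedFile[];
  expectedExitCode?: number;
  expectedToolSequence?: string[];
  maxRounds?: number;
  requiredTools?: string[];
  timeoutMs?: number;
}

// Example task; the id, prompt, and patterns are illustrative only.
const fizzbuzzTask: EvalTask = {
  id: "codegen-fizzbuzz",
  category: "code_generation",
  description: "Write a FizzBuzz function",
  prompt: "Write a TypeScript function fizzbuzz(n) that returns the FizzBuzz string for n.",
  expectedPatterns: ["contains:function fizzbuzz", "regex:%\\s*15", "not_contains:TODO"],
  maxRounds: 3,
  timeoutMs: 30000,
};
```

Modelling the category as a string-literal union means a misspelled category fails at compile time rather than silently landing in an unknown bucket.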

EvalResult

| Field | Description |
| --- | --- |
| taskId | Matched task |
| taskCategory | Task category |
| passed | Overall pass/fail |
| score | 0.0–1.0 |
| durationMs | Execution duration |
| tokensUsed | Total tokens consumed |
| costUsd | Estimated cost |
| toolCallsMade | Number of tool invocations |
| error | Error message if failed |
| details | Per-check details |

EvalSuite

{
  name: string;
  description?: string;
  tasks: EvalTask[];
}

Scoring

The scoreResponse() function evaluates agent output against expected patterns:

| Pattern Prefix | Behavior |
| --- | --- |
| regex:<pattern> | Case-insensitive regex match; detail: "matched" / "no match" |
| contains:<text> | Case-insensitive substring; detail: "found" / "not found" |
| not_contains:<text> | Must NOT contain; detail: "correctly absent" / "found forbidden: ..." |
| (no prefix) | Default fuzzy contains (case-insensitive substring) |

Pass condition: all patterns must match. If no patterns are provided, the task passes if the output is longer than 10 characters.

Score: matchingPatterns / totalPatterns, or 1.0 if no patterns are provided and output exists.
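
The prefix rules, pass condition, and score formula above can be sketched as follows. This is an illustrative reimplementation, not the code in scorer.ts: the function signature, the PatternDetail and ScoreOutcome types, the checkPattern helper, and the detail strings used for unprefixed patterns are assumptions. The sketch also assumes that a case-insensitive not_contains check and a score of 0.0 for empty output are the intended behavior.

```typescript
// Illustrative sketch of the scoring rules; types and signature are assumptions.
interface PatternDetail {
  pattern: string;
  matched: boolean;
  detail: string;
}

interface ScoreOutcome {
  passed: boolean;
  score: number;
  details: PatternDetail[];
}

// Hypothetical helper: evaluates one pattern according to its prefix.
function checkPattern(output: string, pattern: string): PatternDetail {
  const lower = output.toLowerCase();
  if (pattern.startsWith("regex:")) {
    // An invalid regex throws here; a real scorer would likely catch that.
    const matched = new RegExp(pattern.slice("regex:".length), "i").test(output);
    return { pattern, matched, detail: matched ? "matched" : "no match" };
  }
  if (pattern.startsWith("not_contains:")) {
    const text = pattern.slice("not_contains:".length);
    const found = lower.includes(text.toLowerCase());
    return { pattern, matched: !found, detail: found ? `found forbidden: ${text}` : "correctly absent" };
  }
  // "contains:" and unprefixed patterns are both case-insensitive substring checks.
  const text = pattern.startsWith("contains:") ? pattern.slice("contains:".length) : pattern;
  const found = lower.includes(text.toLowerCase());
  return { pattern, matched: found, detail: found ? "found" : "not found" };
}

function scoreResponse(output: string, patterns: string[] = []): ScoreOutcome {
  if (patterns.length === 0) {
    // No patterns: pass needs more than 10 characters, score is 1.0 for any non-empty output.
    return { passed: output.length > 10, score: output.length > 0 ? 1.0 : 0.0, details: [] };
  }
  const details = patterns.map((p) => checkPattern(output, p));
  const matching = details.filter((d) => d.matched).length;
  return { passed: matching === patterns.length, score: matching / patterns.length, details };
}
```

With three of four patterns satisfied, this sketch reports a score of 0.75 and passed: false, since the pass condition requires every pattern to match.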

File Content Scoring

scoreFileContent() checks expected files:

  • Verifies file existence
  • Optionally checks that file content contains expected text
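
A minimal sketch of the two checks above, under stated assumptions: the ExpectedFile shape, the FileCheck result type, the detail strings, and the case-sensitive content match are all hypothetical, and file access is injected as a callback so the example stays self-contained rather than mirroring how scorer.ts actually reads files.

```typescript
// Illustrative sketch; types, detail strings, and case sensitivity are assumptions.
interface ExpectedFile {
  path: string;
  contains?: string;
}

interface FileCheck {
  path: string;
  passed: boolean;
  detail: string;
}

// Returns the file's text, or undefined when the file does not exist.
type FileReader = (path: string) => string | undefined;

function scoreFileContent(expected: ExpectedFile[], readFile: FileReader): FileCheck[] {
  return expected.map(({ path, contains }) => {
    const content = readFile(path);
    if (content === undefined) {
      return { path, passed: false, detail: "file missing" };
    }
    if (contains !== undefined && !content.includes(contains)) {
      return { path, passed: false, detail: "expected text not found" };
    }
    return { path, passed: true, detail: contains === undefined ? "exists" : "exists with expected text" };
  });
}
```

In a Node.js setting the reader could wrap fs.existsSync and fs.readFileSync; in tests it can be backed by a plain Map of path to content.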

Regression Detection

checkRegression() compares previous and current scores and returns:

{ degraded: boolean; delta: number }

  • Delta: previous.score - current.score, so a positive value means the score dropped
  • Threshold: default 0.1 score difference
  • Degraded: true when delta > threshold
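
The comparison is small enough to sketch in full. The parameter shapes and the RegressionOutcome type name below are assumptions; only the return fields, the default threshold of 0.1, and the degraded rule come from the description above.

```typescript
// Illustrative sketch; the signature and type name are assumptions.
interface RegressionOutcome {
  degraded: boolean;
  delta: number;
}

function checkRegression(
  previous: { score: number },
  current: { score: number },
  threshold: number = 0.1,
): RegressionOutcome {
  // Positive delta means the current score is lower than the previous one.
  const delta = previous.score - current.score;
  return { degraded: delta > threshold, delta };
}
```

Because the comparison is strict, a drop smaller than or equal to the threshold is not flagged, and an improvement yields a negative delta with degraded: false.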

Runner

runSuite() executes all tasks in a suite sequentially against a provider:

  1. Creates per-task sessions via sessionDbFactory
  2. Executes each task via agentTurn() with a timeout
  3. Scores output against expected patterns
  4. Checks expected files for existence and content
  5. Aggregates results into EvalRunSummary with per-category breakdown
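
Step 5 above, the aggregation into a summary with a per-category breakdown, might look roughly like the sketch below. The actual EvalRunSummary fields are not listed on this page, so the summary shape, the CategoryStats type, and the summarizeRun name are hypothetical; only the EvalResult fields used here come from the EvalResult table.

```typescript
// Illustrative sketch; the summary shape and function name are assumptions.
interface EvalResult {
  taskId: string;
  taskCategory: string;
  passed: boolean;
  score: number;
  durationMs: number;
}

interface CategoryStats {
  total: number;
  passed: number;
  avgScore: number;
}

interface EvalRunSummary {
  total: number;
  passed: number;
  failed: number;
  avgScore: number;
  totalDurationMs: number;
  byCategory: Record<string, CategoryStats>;
}

function summarizeRun(results: EvalResult[]): EvalRunSummary {
  const byCategory: Record<string, CategoryStats> = {};
  const scoreSums: Record<string, number> = {};
  let passed = 0;
  let scoreSum = 0;
  let totalDurationMs = 0;

  for (const r of results) {
    let stats = byCategory[r.taskCategory];
    if (!stats) {
      stats = { total: 0, passed: 0, avgScore: 0 };
      byCategory[r.taskCategory] = stats;
    }
    stats.total += 1;
    if (r.passed) {
      stats.passed += 1;
      passed += 1;
    }
    scoreSums[r.taskCategory] = (scoreSums[r.taskCategory] ?? 0) + r.score;
    scoreSum += r.score;
    totalDurationMs += r.durationMs;
  }

  // Convert per-category score sums into averages.
  for (const category of Object.keys(byCategory)) {
    const stats = byCategory[category];
    if (stats) stats.avgScore = (scoreSums[category] ?? 0) / stats.total;
  }

  return {
    total: results.length,
    passed,
    failed: results.length - passed,
    avgScore: results.length > 0 ? scoreSum / results.length : 0,
    totalDurationMs,
    byCategory,
  };
}
```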

Baseline Comparison

| Function | Description |
| --- | --- |
| setBaseline(runId) | Mark a run as the baseline |
| listBaselines() | List saved baselines |
| detectRegressions(previous, current) | Compare two runs and return degraded tasks |

Regression Output

{
  taskId: string;
  previousScore: number;
  currentScore: number;
  degraded: boolean;
  delta: number;  // positive = degradation
}
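
As a rough illustration of how two runs could be compared to produce records of this shape, the sketch below matches results by taskId and keeps only the degraded ones. It assumes that each run is passed as a list of per-task scores, that tasks missing from the previous run are skipped, and that the default threshold of 0.1 applies; the ScoredTask type is hypothetical.

```typescript
// Illustrative sketch; input shape, skip rule, and ScoredTask are assumptions.
interface ScoredTask {
  taskId: string;
  score: number;
}

interface RegressionCheck {
  taskId: string;
  previousScore: number;
  currentScore: number;
  degraded: boolean;
  delta: number; // positive = degradation
}

function detectRegressions(
  previous: ScoredTask[],
  current: ScoredTask[],
  threshold: number = 0.1,
): RegressionCheck[] {
  const previousScores = new Map<string, number>();
  for (const p of previous) previousScores.set(p.taskId, p.score);

  const regressions: RegressionCheck[] = [];
  for (const c of current) {
    const previousScore = previousScores.get(c.taskId);
    if (previousScore === undefined) continue; // no earlier score to compare against
    const delta = previousScore - c.score;
    if (delta > threshold) {
      regressions.push({ taskId: c.taskId, previousScore, currentScore: c.score, degraded: true, delta });
    }
  }
  return regressions;
}
```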

RAG Eval

Supports evaluation of retrieval-augmented generation through search_retrieval category tasks. Expected patterns can validate that retrieved information appears in the agent's response.
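
For instance, a retrieval task could assert that a fact from the indexed material shows up in the answer and that the agent did not give up. The task below is purely hypothetical: the RetrievalTask type is a trimmed-down stand-in for EvalTask, and the id, prompt, and patterns are invented for illustration.

```typescript
// Hypothetical search_retrieval task; all values are illustrative.
interface RetrievalTask {
  id: string;
  category: "search_retrieval";
  description: string;
  prompt: string;
  expectedPatterns: string[];
}

const ragTask: RetrievalTask = {
  id: "rag-regression-threshold",
  category: "search_retrieval",
  description: "Answer a question that requires looking up project documentation",
  prompt: "According to the project docs, what is the default regression threshold?",
  expectedPatterns: [
    "contains:0.1", // the retrieved fact must appear in the response
    "regex:threshold", // the answer should be about the threshold
    "not_contains:I could not find", // the agent should not report a failed lookup
  ],
};
```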

Pre-built Harnesses

Suites and runs are stored in memory. Suites are registered by name and can be retrieved, listed, and executed. Runs are stored with their summaries and can be promoted to baselines for regression comparison.
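
One way to picture this storage is a pair of maps keyed by suite name and run ID. The class below is a sketch under that assumption: EvalRegistry, StoredRun, registerSuite, getSuite, listSuites, and saveRun are hypothetical names, while setBaseline and listBaselines follow the function names listed under Baseline Comparison (their real signatures and return types may differ).

```typescript
// Illustrative in-memory registry; class, types, and most method names are assumptions.
interface EvalSuite {
  name: string;
  description?: string;
  tasks: { id: string }[];
}

interface StoredRun {
  runId: string;
  suiteName: string;
  summary: { avgScore: number };
  isBaseline: boolean;
}

class EvalRegistry {
  private readonly suites = new Map<string, EvalSuite>();
  private readonly runs = new Map<string, StoredRun>();

  registerSuite(suite: EvalSuite): void {
    this.suites.set(suite.name, suite);
  }

  getSuite(name: string): EvalSuite | undefined {
    return this.suites.get(name);
  }

  listSuites(): string[] {
    return Array.from(this.suites.keys());
  }

  saveRun(run: Omit<StoredRun, "isBaseline">): void {
    this.runs.set(run.runId, { ...run, isBaseline: false });
  }

  // Returns false when the run ID is unknown.
  setBaseline(runId: string): boolean {
    const run = this.runs.get(runId);
    if (!run) return false;
    run.isBaseline = true;
    return true;
  }

  listBaselines(): StoredRun[] {
    return Array.from(this.runs.values()).filter((r) => r.isBaseline);
  }
}
```

Because everything lives in process memory, suites and runs in a design like this would not survive a restart unless they are re-registered or persisted elsewhere.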

CLI and REST API

cortex eval list-suites                    # List registered eval suites
cortex eval run <suite-name>               # Run a suite
cortex eval baseline <run-id>              # Set a run as baseline
cortex eval regressions <suite-name>       # Check for regressions

| Method | Path | Description |
| --- | --- | --- |
| GET | /api/eval/suites | List suites |
| POST | /api/eval/suites | Create suite |
| POST | /api/eval/suites/:name/run | Run suite |
| GET | /api/eval/runs | List runs |
| GET | /api/eval/runs/:id | Get run results |
| POST | /api/eval/runs/:id/baseline | Set as baseline |
| GET | /api/eval/baselines | List baselines |
| GET | /api/eval/regressions/:suite | Detect regressions |
