Jiva v0.3.48 - Code-Mode Benchmark Suite
Release Date: June 29, 2026
Summary
Jiva's --code mode works well with gpt-oss-120b but degrades with other models - Sarvam-105b (4096-token output cap) chokes on tasks that need larger writes/edits, and Krutrim struggles with longer, multi-step tasks. Until now there was no objective way to measure how a given model + config performs in code mode.
v0.3.48 adds a built-in, TDD-style, deterministically-scored benchmark suite. It drives CodeAgent through a sequence of coding tasks of increasing complexity in isolated throwaway workspaces, scoring each task with Node's built-in test runner (node --test) — no LLM judge, no network, no external test framework. The result is a per-task pass/fail plus diagnostic metrics (iterations, tokens, wall-time, and whether the iteration cap was hit) that surface exactly where a model breaks down.
The suite is invokable from both the CLI (jiva benchmark) and the HTTP API (/api/benchmark/*).
The task suite
A single evolving Node library (taskstore) is the substrate. Each tier's workspace is scaffolded from the canonical solution of all prior tiers plus a new failing test, so the tiers genuinely build on one another while each is measured in isolation. The suite mixes building from scratch with improving/fixing existing code:
| Tier | Task | Capability exercised | Kind |
|---|---|---|---|
| 1 | Create createTask |
Write a new file | scratch |
| 2 | Add addTask / removeTask / listTasks |
Read & extend a file | extend |
| 3 | Fix the toggleTask bug |
Diagnose & fix a targeted bug | bugfix |
| 4 | Add task priorities | Multi-function feature | extend |
| 5 | Split into model / query modules |
Multi-file refactor without regressions | refactor |
| 6 | Implement sortTasks with edge cases |
Algorithmic reasoning (nulls, stability) | extend |
| 7 | Add JSON serialize / deserialize |
New module + serialization | extend |
| 8 | Debug the id-collision integration bug | Long-horizon cross-module debugging | bugfix |
Later tiers (5/7/8) require larger outputs and more iterations — the regime where short-output and rate-limited models fail. Scoring is cumulative: tier N's workspace runs tests 1..N, so a refactor that breaks an earlier tier is caught.
Anti-gaming: every test file is byte-hashed before the agent runs. If the agent edits a test to make it "pass", the task fails with reason tests-modified.
CLI: jiva benchmark
# Run tiers 1–3 against the configured model
jiva benchmark --max-tier 3
# Run the whole suite and write a JSON report
jiva benchmark --output report.json
# Run specific tasks, machine-readable output
jiva benchmark --tasks t05-refactor-split,t08-debug-id-collision --json
# Carry the agent's own output forward between tiers (cascading, like a real session)
jiva benchmark --continuousOptions: --config, --max-tier, --tasks, --max-iterations, --timeout, --lsp, --continuous, --output, --json, --keep-workspaces. The process exits non-zero when any task fails, so it drops straight into CI.
The CLI prints a per-task table (status, iterations, tests passed, tokens, time), a summary line with the highest tier passed (a single capability-ceiling number), and the diagnostic notes for any failures.
HTTP API
Registered under the existing /api auth middleware:
| Route | Purpose |
|---|---|
GET /api/benchmark/tasks |
List available tasks (metadata only) |
POST /api/benchmark/run |
Run the suite, return the full SuiteResult JSON |
POST /api/benchmark/run/stream |
Run the suite, stream per-task progress via SSE (task-start, task-done, done, error) |
Request body: { maxTier?, tasks?, maxIterations?, timeoutMs?, lsp?, continuous? }. The orchestrator is built once from the stored config and cached across requests.
Architecture
src/code/benchmark/
types.ts — BenchmarkTask, TaskResult, SuiteResult, RunnerOptions
fixtures.ts — canonical taskstore source (golden state per tier)
tests.ts — cumulative node:test files (the protected tests)
tasks.ts — the 8 ordered tiers + selection helpers
verify.ts — runNodeTests(): tamper check + `node --test` (scoped to the protected tests) + parse
agent-factory.ts — minimal CodeAgent per task (no MCP/persona/persistence)
orchestrator-factory.ts — shared ModelOrchestrator builder (CLI + HTTP)
runner.ts — scaffolds workspaces, drives the agent, collects metrics
report.ts — CLI table + JSON serializer
index.ts — public entry point
Both interfaces converge on runBenchmark(orchestrator, tasks, options, progress). The runner creates an isolated mkdtemp workspace per task (the user's repo is never touched), races agent.chat() against a wall-clock timeout, runs the verifier, then tears everything down.
A self-test (scripts/bench-selftest.mjs, run after npm run build) validates the fixtures and verifier without any model: for every tier it asserts the golden solution passes, the scaffold fails, and tampering is detected (24 checks).
Benchmark suites (taskstore + micro-CRM)
The benchmark is now organised into suites with two scoring modes, laying the
groundwork for a tiered strategy (baseline → capability → frontier):
taskstore(baseline, gating). The original 8-tier TDD suite. Binary pass/fail,
tasks build on one another. Answers "does this model + config do code mode at all?"microcrm(capability, scored). A building suite that builds a CRM REST API
using only Node built-ins —node:http+node:sqlite— graded by the fraction of
spec tests passed (51 tests across 5 tasks). Tier 1 builds the base API from scratch
(a genuine large-output test). Tiers 2-5 scaffold the working base and add one harder
feature each: atomic bulk insert (transaction rollback), advanced querying
(combined filters + sorting +hasMorepagination), weighted pipeline analytics, and
idempotency-key dedup on create. Zero external dependencies, fully deterministic, real
HTTP + real SQLite. The report shows the pass-rate and lists the exact spec tests missed.
Requires Node ≥ 22.5.
Output-length flag. CodeAgent now counts output-token truncations per turn
(AgentResponse.truncationEvents), and the benchmark flags a failed task as
[output-limited] when it failed after hitting the model's output cap — distinguishing a
model that can't emit a large enough response (e.g. a hard 4096-token cap) from one that
got the logic wrong.
New scoring mode: scored suites report Score: N/M tests (P%) instead of all-or-nothing,
and the runner aggregates per-suite test totals. Verifier now also captures failing test
names. Run jiva benchmark --list to see all suites/tasks; --suite <id> to pick one.
HTTP gains GET /api/benchmark/suites and a suite field on the run routes.
Code-mode compatibility fixes (surfaced by the benchmark)
The first benchmark runs immediately exposed two code-mode issues that hurt every model and broke weaker ones (Sarvam-105b, Krutrim) outright. Both are fixed in this release:
-
write_file/edit_filenow acceptpathas an alias forfile_path. Models frequently emit the commonpathargument; provider-side tool-call validation (Groq/Sarvam) then rejected the call with a 400 (missing properties: 'file_path') before it reached the tool, wasting iterations and tokens. The schema now hard-requires onlycontent(write) /old_string+new_string(edit), accepts eitherpathorfile_path, andexecute()validates the path itself.repairFailedToolCallalso normalizespath→file_pathas a fallback. -
Schema/argument errors are no longer misdiagnosed as output-token truncation. Groq codes both truncated and schema-invalid tool calls as
tool_use_failed. The handler previously treated any such error on a file tool as "content too large" and told the model to "write in stages" — the wrong remedy. It now inspectsfailed_generation: a complete, parseable payload is treated as a schema error (with a targeted, tool-specific parameter correction), and only a genuinely cut-off payload routes to the write-in-stages path. -
Sarvam-105b output budget corrected to 4096. The setup-wizard preset requested 8192 completion tokens, but Sarvam's API caps output at 4096 (requesting more is rejected). The preset now uses 4096.
-
Graceful "continue?" prompt at the step limit. When code mode hits its iteration limit before finishing, the agent now reports a
stopReason: 'max-iterations', and the interactive CLI offers the user a choice — Continue working… (resumes with full history) or Stop for now — instead of silently emitting[Max iterations reached without a final response]. (HTTP behaviour is unchanged.)
Files Changed
| File | Change |
|---|---|
src/code/benchmark/* |
New — the benchmark module (suites, runner, verifier, report) |
src/code/benchmark/suites.ts |
New — suite registry (taskstore + microcrm) |
src/code/benchmark/microcrm/* |
New — micro-CRM scored suite + node:http/node:sqlite test & reference assets |
scripts/copy-benchmark-assets.mjs |
New — copies benchmark .mjs assets into dist on build |
src/code/tools/write.ts |
Accept path alias; relax schema required to content |
src/code/tools/edit.ts |
Accept path alias; relax schema required to old_string/new_string |
src/code/agent.ts |
Distinguish truncation vs schema errors; targeted schema-arg correction; path→file_path repair; truncationEvents + stopReason on the response |
src/interfaces/cli/setup-wizard.ts |
Sarvam defaultMaxTokens 8192 → 4096 |
src/interfaces/cli/repl.ts |
Offer Continue/Stop when code mode hits its step limit |
src/core/agent-interface.ts |
stopReason on AgentChatResponse |
src/interfaces/cli/index.ts |
New benchmark command |
src/interfaces/http/routes/benchmark.ts |
New — benchmark HTTP routes |
src/interfaces/http/index.ts |
Register setupBenchmarkRoutes |
scripts/bench-selftest.mjs |
New — fixture/verifier self-test |
docs/guides/benchmark-suite.md |
New — usage & extension guide |
package.json |
Version 0.3.47 → 0.3.48 |
Upgrade
npm install -g jiva-core@0.3.48No config changes required. The benchmark reuses the model stack from your existing configuration.