Jiva v0.3.48 - Code-Mode Benchmark Suite

Release Date: June 29, 2026

Summary

Jiva's --code mode works well with gpt-oss-120b but degrades with other models: Sarvam-105b (4096-token output cap) chokes on tasks that need larger writes or edits, and Krutrim struggles with longer, multi-step tasks. Until now, there was no objective way to measure how a given model + config performs in code mode.

v0.3.48 adds a built-in, TDD-style, deterministically-scored benchmark suite. It drives CodeAgent through a sequence of coding tasks of increasing complexity in isolated throwaway workspaces, scoring each task with Node's built-in test runner (node --test) — no LLM judge, no network, no external test framework. The result is a per-task pass/fail plus diagnostic metrics (iterations, tokens, wall-time, and whether the iteration cap was hit) that surface exactly where a model breaks down.

The suite is invokable from both the CLI (jiva benchmark) and the HTTP API (/api/benchmark/*).

The task suite

A single evolving Node library (taskstore) is the substrate. Each tier's workspace is scaffolded from the canonical solution of all prior tiers plus a new failing test, so the tiers genuinely build on one another while each is measured in isolation. The suite mixes building from scratch with improving/fixing existing code:

Tier	Task	Capability exercised	Kind
1	Create `createTask`	Write a new file	scratch
2	Add `addTask` / `removeTask` / `listTasks`	Read & extend a file	extend
3	Fix the `toggleTask` bug	Diagnose & fix a targeted bug	bugfix
4	Add task priorities	Multi-function feature	extend
5	Split into `model` / `query` modules	Multi-file refactor without regressions	refactor
6	Implement `sortTasks` with edge cases	Algorithmic reasoning (nulls, stability)	extend
7	Add JSON `serialize` / `deserialize`	New module + serialization	extend
8	Debug the id-collision integration bug	Long-horizon cross-module debugging	bugfix

Later tiers (5/7/8) require larger outputs and more iterations — the regime where short-output and rate-limited models fail. Scoring is cumulative: tier N's workspace runs tests 1..N, so a refactor that breaks an earlier tier is caught.

Anti-gaming: every test file is byte-hashed before the agent runs. If the agent edits a test to make it "pass", the task fails with reason tests-modified.
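The tamper check described above can be sketched roughly as follows. This is a minimal illustration, not the code in verify.ts: the helper names (snapshotTests, modifiedTests), the choice of SHA-256, and the injectable reader are all assumptions made for the example.

```typescript
import { createHash } from "node:crypto";
import { readFileSync } from "node:fs";

// Hypothetical reader type so the sketch can be exercised without touching disk.
type ReadFile = (path: string) => Uint8Array | string;

const readFromDisk: ReadFile = (path) => readFileSync(path);

// Digest of the raw bytes; SHA-256 is an assumption, any stable hash would do.
function sha256(data: Uint8Array | string): string {
  return createHash("sha256").update(data).digest("hex");
}

// Record one digest per protected test file before the agent runs.
function snapshotTests(paths: string[], read: ReadFile = readFromDisk): Map<string, string> {
  return new Map(paths.map((p): [string, string] => [p, sha256(read(p))]));
}

// After the agent finishes, list every file whose bytes changed or that can no longer be read.
function modifiedTests(before: Map<string, string>, read: ReadFile = readFromDisk): string[] {
  const changed: string[] = [];
  for (const [path, digest] of before) {
    let current: string | null;
    try {
      current = sha256(read(path));
    } catch {
      current = null; // a deleted or unreadable test counts as tampering
    }
    if (current !== digest) changed.push(path);
  }
  return changed;
}
```

A non-empty result from modifiedTests would correspond to failing the task with reason tests-modified.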

CLI: `jiva benchmark`

# Run tiers 1–3 against the configured model
jiva benchmark --max-tier 3

# Run the whole suite and write a JSON report
jiva benchmark --output report.json

# Run specific tasks, machine-readable output
jiva benchmark --tasks t05-refactor-split,t08-debug-id-collision --json

# Carry the agent's own output forward between tiers (cascading, like a real session)
jiva benchmark --continuous

Options: --config, --max-tier, --tasks, --max-iterations, --timeout, --lsp, --continuous, --output, --json, --keep-workspaces. The process exits non-zero when any task fails, so it drops straight into CI.

The CLI prints a per-task table (status, iterations, tests passed, tokens, time), a summary line with the highest tier passed (a single capability-ceiling number), and the diagnostic notes for any failures.

HTTP API

Registered under the existing /api auth middleware:

Route	Purpose
`GET /api/benchmark/tasks`	List available tasks (metadata only)
`POST /api/benchmark/run`	Run the suite, return the full `SuiteResult` JSON
`POST /api/benchmark/run/stream`	Run the suite, stream per-task progress via SSE (`task-start`, `task-done`, `done`, `error`)

Request body: { maxTier?, tasks?, maxIterations?, timeoutMs?, lsp?, continuous? }. The orchestrator is built once from the stored config and cached across requests.
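For clients of the streaming route, a small frame parser along these lines may be useful. It is a sketch under assumptions: the field types in BenchmarkRunRequest are inferred from the option names, the payload of each event is assumed to be JSON carried in the data field, and parseSseFrames is a hypothetical helper rather than part of Jiva.

```typescript
// Assumed shape of the request body; only the field names come from the release notes.
interface BenchmarkRunRequest {
  maxTier?: number;
  tasks?: string[];
  maxIterations?: number;
  timeoutMs?: number;
  lsp?: boolean;
  continuous?: boolean;
}

interface SseEvent {
  event: string; // e.g. "task-start", "task-done", "done", "error"
  data: string;  // raw payload; assumed to be JSON
}

// Split a text buffer into complete SSE frames plus the unfinished remainder.
function parseSseFrames(buffer: string): { events: SseEvent[]; rest: string } {
  const frames = buffer.replace(/\r\n/g, "\n").split("\n\n");
  const rest = frames.pop() ?? ""; // the last piece may be a partial frame
  const events: SseEvent[] = [];
  for (const frame of frames) {
    let event = "message"; // default event name when no event field is present
    const dataLines: string[] = [];
    for (const line of frame.split("\n")) {
      if (line.startsWith(":")) continue; // comment or keep-alive line
      const idx = line.indexOf(":");
      const field = idx === -1 ? line : line.slice(0, idx);
      let value = idx === -1 ? "" : line.slice(idx + 1);
      if (value.startsWith(" ")) value = value.slice(1);
      if (field === "event") event = value;
      else if (field === "data") dataLines.push(value);
    }
    if (dataLines.length > 0) events.push({ event, data: dataLines.join("\n") });
  }
  return { events, rest };
}

// Example body: run tiers 1 to 3 with a per-task timeout of two minutes.
const exampleBody: BenchmarkRunRequest = { maxTier: 3, timeoutMs: 120_000 };
```

A caller would append each received chunk to rest, call parseSseFrames again, and stop reading once a done or error event arrives.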

Architecture

src/code/benchmark/
  types.ts                — BenchmarkTask, TaskResult, SuiteResult, RunnerOptions
  fixtures.ts             — canonical taskstore source (golden state per tier)
  tests.ts                — cumulative node:test files (the protected tests)
  tasks.ts                — the 8 ordered tiers + selection helpers
  verify.ts               — runNodeTests(): tamper check + `node --test` (scoped to the protected tests) + parse
  agent-factory.ts        — minimal CodeAgent per task (no MCP/persona/persistence)
  orchestrator-factory.ts — shared ModelOrchestrator builder (CLI + HTTP)
  runner.ts               — scaffolds workspaces, drives the agent, collects metrics
  report.ts               — CLI table + JSON serializer
  index.ts                — public entry point

Both interfaces converge on runBenchmark(orchestrator, tasks, options, progress). The runner creates an isolated mkdtemp workspace per task (the user's repo is never touched), races agent.chat() against a wall-clock timeout, runs the verifier, then tears everything down.
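The per-task lifecycle (throwaway workspace, timeout race, teardown) might look roughly like this. The helper names raceWithTimeout and withWorkspace and the jiva-bench- directory prefix are illustrative assumptions; runner.ts may be organised differently.

```typescript
import { mkdtemp, rm } from "node:fs/promises";
import { tmpdir } from "node:os";
import { join } from "node:path";

type Outcome<T> = { kind: "done"; value: T } | { kind: "timeout" };

// Race a unit of work against a wall-clock timer.
// Note: losing the race does not cancel the work; it only stops waiting for it.
async function raceWithTimeout<T>(work: Promise<T>, timeoutMs: number): Promise<Outcome<T>> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<Outcome<T>>((resolve) => {
    timer = setTimeout(() => resolve({ kind: "timeout" }), timeoutMs);
  });
  try {
    return await Promise.race([
      work.then((value): Outcome<T> => ({ kind: "done", value })),
      timeout,
    ]);
  } finally {
    clearTimeout(timer); // avoid keeping the process alive after the work wins
  }
}

// Create an isolated temp directory, run fn inside it, and always remove it afterwards.
async function withWorkspace<T>(fn: (dir: string) => Promise<T>): Promise<T> {
  const dir = await mkdtemp(join(tmpdir(), "jiva-bench-"));
  try {
    return await fn(dir);
  } finally {
    await rm(dir, { recursive: true, force: true });
  }
}
```

In this shape, a task run would be something like withWorkspace((dir) => raceWithTimeout(runAgentIn(dir), timeoutMs)), where runAgentIn is a placeholder for scaffolding the tier and calling agent.chat().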

A self-test (scripts/bench-selftest.mjs, run after npm run build) validates the fixtures and verifier without any model: for every tier it asserts that the golden solution passes, the scaffold fails, and tampering is detected (3 checks × 8 tiers = 24 checks).

Benchmark suites (taskstore + micro-CRM)

The benchmark is now organised into suites with two scoring modes, laying the
groundwork for a tiered strategy (baseline → capability → frontier):

taskstore (baseline, gating). The original 8-tier TDD suite. Binary pass/fail;
tasks build on one another. Answers "does this model + config do code mode at all?"

microcrm (capability, scored). A suite that builds a CRM REST API using only Node
built-ins (node:http + node:sqlite), graded by the fraction of spec tests passed
(51 tests across 5 tasks). Tier 1 builds the base API from scratch (a genuine
large-output test). Tiers 2-5 each start from the working base and add one harder
feature: atomic bulk insert (transaction rollback), advanced querying (combined
filters + sorting + hasMore pagination), weighted pipeline analytics, and
idempotency-key dedup on create. Zero external dependencies, fully deterministic, real
HTTP + real SQLite. The report shows the pass rate and lists the exact spec tests missed.
Requires Node ≥ 22.5.

Output-length flag. CodeAgent now counts output-token truncations per turn
(AgentResponse.truncationEvents), and the benchmark flags a failed task as
[output-limited] when it failed after hitting the model's output cap — distinguishing a
model that can't emit a large enough response (e.g. a hard 4096-token cap) from one that
got the logic wrong.
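As a rough illustration of the flagging rule, assuming a result record that carries the pass status and the truncation count (the TaskOutcome type and failureFlag helper are hypothetical names, not the benchmark's own):

```typescript
interface TaskOutcome {
  passed: boolean;
  truncationEvents: number; // mirrors AgentResponse.truncationEvents
}

// Only failed tasks are flagged; a task that passed despite truncations is left alone.
function failureFlag(outcome: TaskOutcome): "[output-limited]" | null {
  if (outcome.passed) return null;
  return outcome.truncationEvents > 0 ? "[output-limited]" : null;
}
```

Under this sketch, a failure with zero truncation events carries no flag and is read as a logic failure.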

New scoring mode. Scored suites report Score: N/M tests (P%) instead of all-or-nothing,
and the runner aggregates per-suite test totals. The verifier now also captures failing
test names. Run jiva benchmark --list to see all suites and tasks, and pass --suite <id>
to pick one.
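The per-suite aggregation behind the Score line can be pictured like this. The TaskScore shape, the helper names, and the rounding to a whole percent are assumptions for illustration; the actual report may format the percentage differently.

```typescript
interface TaskScore {
  taskId: string;
  passed: number;        // spec tests passed in this task
  total: number;         // spec tests run in this task
  failedTests: string[]; // names captured by the verifier
}

// Sum test counts across all tasks in a suite and render the summary line.
function formatScore(tasks: TaskScore[]): string {
  const passed = tasks.reduce((n, t) => n + t.passed, 0);
  const total = tasks.reduce((n, t) => n + t.total, 0);
  const pct = total === 0 ? 0 : Math.round((passed / total) * 100);
  return `Score: ${passed}/${total} tests (${pct}%)`;
}

// Collect the missed spec tests, prefixed with the task they belong to.
function missedTests(tasks: TaskScore[]): string[] {
  return tasks.flatMap((t) => t.failedTests.map((name) => `${t.taskId}: ${name}`));
}
```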

HTTP gains GET /api/benchmark/suites and a suite field on the run routes.

Code-mode compatibility fixes (surfaced by the benchmark)

The first benchmark runs immediately exposed code-mode issues that hurt every model and broke weaker ones (Sarvam-105b, Krutrim) outright. The following fixes ship in this release:

write_file / edit_file now accept path as an alias for file_path. Models frequently emit the common path argument; provider-side tool-call validation (Groq/Sarvam) then rejected the call with a 400 (missing properties: 'file_path') before it reached the tool, wasting iterations and tokens. The schema now hard-requires only content (write) / old_string+new_string (edit), accepts either path or file_path, and execute() validates the path itself. repairFailedToolCall also normalizes path→file_path as a fallback.
Schema/argument errors are no longer misdiagnosed as output-token truncation. Groq codes both truncated and schema-invalid tool calls as tool_use_failed. The handler previously treated any such error on a file tool as "content too large" and told the model to "write in stages" — the wrong remedy. It now inspects failed_generation: a complete, parseable payload is treated as a schema error (with a targeted, tool-specific parameter correction), and only a genuinely cut-off payload routes to the write-in-stages path.
Sarvam-105b output budget corrected to 4096. The setup-wizard preset requested 8192 completion tokens, but Sarvam's API caps output at 4096 (requesting more is rejected). The preset now uses 4096.
Graceful "continue?" prompt at the step limit. When code mode hits its iteration limit before finishing, the agent now reports a stopReason: 'max-iterations', and the interactive CLI offers the user a choice — Continue working… (resumes with full history) or Stop for now — instead of silently emitting [Max iterations reached without a final response]. (HTTP behaviour is unchanged.)
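The first two fixes above can be sketched as a pair of small helpers. This is a simplified approximation, not the code in agent.ts: the function names are hypothetical, and using JSON.parse as the test for "complete, parseable payload" is an assumption about how failed_generation is inspected.

```typescript
type ToolArgs = Record<string, unknown>;

// Accept path as an alias: if file_path is absent and path is a string, rename it.
function normalizePathAlias(args: ToolArgs): ToolArgs {
  if (args["file_path"] === undefined && typeof args["path"] === "string") {
    const { path, ...rest } = args;
    return { ...rest, file_path: path };
  }
  return args;
}

type FailureKind = "schema-error" | "truncated";

// A payload that parses as JSON was emitted in full, so the problem is its arguments.
// A payload that fails to parse is treated here as cut off by the output cap.
function classifyFailedGeneration(failedGeneration: string): FailureKind {
  try {
    JSON.parse(failedGeneration);
    return "schema-error";
  } catch {
    return "truncated";
  }
}
```

Only the truncated case would route to the write-in-stages guidance; the schema-error case would get a parameter correction instead.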

Files Changed

File	Change
`src/code/benchmark/*`	New — the benchmark module (suites, runner, verifier, report)
`src/code/benchmark/suites.ts`	New — suite registry (taskstore + microcrm)
`src/code/benchmark/microcrm/*`	New — micro-CRM scored suite + `node:http`/`node:sqlite` test & reference assets
`scripts/copy-benchmark-assets.mjs`	New — copies benchmark `.mjs` assets into dist on build
`src/code/tools/write.ts`	Accept `path` alias; relax schema `required` to `content`
`src/code/tools/edit.ts`	Accept `path` alias; relax schema `required` to `old_string`/`new_string`
`src/code/agent.ts`	Distinguish truncation vs schema errors; targeted schema-arg correction; `path`→`file_path` repair; `truncationEvents` + `stopReason` on the response
`src/interfaces/cli/setup-wizard.ts`	Sarvam `defaultMaxTokens` 8192 → 4096
`src/interfaces/cli/repl.ts`	Offer Continue/Stop when code mode hits its step limit
`src/core/agent-interface.ts`	`stopReason` on `AgentChatResponse`
`src/interfaces/cli/index.ts`	New `benchmark` command
`src/interfaces/http/routes/benchmark.ts`	New — benchmark HTTP routes
`src/interfaces/http/index.ts`	Register `setupBenchmarkRoutes`
`scripts/bench-selftest.mjs`	New — fixture/verifier self-test
`docs/guides/benchmark-suite.md`	New — usage & extension guide
`package.json`	Version `0.3.47` → `0.3.48`

Upgrade

npm install -g jiva-core@0.3.48

No config changes required. The benchmark reuses the model stack from your existing configuration.
