v0.3.48

abi-chatterjee released this 30 Jun 07:35
12d1d33

Jiva v0.3.48 - Code-Mode Benchmark Suite

Release Date: June 29, 2026


Summary

Jiva's --code mode works well with gpt-oss-120b but degrades with other models: Sarvam-105b (4096-token output cap) chokes on tasks that need larger writes/edits, and Krutrim struggles with longer, multi-step tasks. Until now there was no objective way to measure how a given model + config performs in code mode.

v0.3.48 adds a built-in, TDD-style, deterministically-scored benchmark suite. It drives CodeAgent through a sequence of coding tasks of increasing complexity in isolated throwaway workspaces, scoring each task with Node's built-in test runner (node --test) — no LLM judge, no network, no external test framework. The result is a per-task pass/fail plus diagnostic metrics (iterations, tokens, wall-time, and whether the iteration cap was hit) that surface exactly where a model breaks down.

The suite is invokable from both the CLI (jiva benchmark) and the HTTP API (/api/benchmark/*).


The task suite

A single evolving Node library (taskstore) is the substrate. Each tier's workspace is scaffolded from the canonical solution of all prior tiers plus a new failing test, so the tiers genuinely build on one another while each is measured in isolation. The suite mixes building from scratch with improving/fixing existing code:

| Tier | Task | Capability exercised | Kind |
|------|------|----------------------|------|
| 1 | Create createTask | Write a new file | scratch |
| 2 | Add addTask / removeTask / listTasks | Read & extend a file | extend |
| 3 | Fix the toggleTask bug | Diagnose & fix a targeted bug | bugfix |
| 4 | Add task priorities | Multi-function feature | extend |
| 5 | Split into model / query modules | Multi-file refactor without regressions | refactor |
| 6 | Implement sortTasks with edge cases | Algorithmic reasoning (nulls, stability) | extend |
| 7 | Add JSON serialize / deserialize | New module + serialization | extend |
| 8 | Debug the id-collision integration bug | Long-horizon cross-module debugging | bugfix |

Later tiers (5/7/8) require larger outputs and more iterations — the regime where short-output and rate-limited models fail. Scoring is cumulative: tier N's workspace runs tests 1..N, so a refactor that breaks an earlier tier is caught.
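
The cumulative rule can be illustrated with a small sketch. The file-naming scheme and both function names below are hypothetical; the real protected tests live in src/code/benchmark/tests.ts and may be organised differently.

```typescript
// Hypothetical naming scheme for the protected test files (an assumption,
// not the names used by the actual suite).
function testFilesForTier(tier: number): string[] {
  if (!Number.isInteger(tier) || tier < 1) {
    throw new RangeError(`tier must be a positive integer, got ${tier}`);
  }
  const files: string[] = [];
  for (let n = 1; n <= tier; n++) {
    files.push(`test/tier-${n}.test.mjs`);
  }
  return files;
}

// A tier passes only if every test file from tier 1 up to that tier passes,
// so a regression in an earlier tier fails the current one.
function tierPasses(tier: number, passingFiles: string[]): boolean {
  return testFilesForTier(tier).every(
    (file) => passingFiles.indexOf(file) !== -1,
  );
}
```

Under this sketch, a tier-5 refactor that leaves the tier-5 test green but breaks the tier-2 test still fails, because tier 5 is checked against all five files.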

Anti-gaming: every test file is byte-hashed before the agent runs. If the agent edits a test to make it "pass", the task fails with reason tests-modified.
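
A minimal sketch of this kind of tamper check follows. The release notes say only that test files are byte-hashed; the choice of SHA-256, the in-memory file map, and the function names are assumptions made for illustration, not the verifier's actual code.

```typescript
import { createHash } from "node:crypto";

// Maps a test file name to its contents. In-memory stand-in for the
// workspace (an assumption; the real verifier reads files from disk).
type FileContents = Record<string, string>;

// Hash every protected test file before the agent runs.
function snapshotHashes(files: FileContents): Record<string, string> {
  const hashes: Record<string, string> = {};
  for (const name of Object.keys(files)) {
    hashes[name] = createHash("sha256").update(files[name]).digest("hex");
  }
  return hashes;
}

// After the agent finishes, list every protected file whose hash changed
// or which no longer exists.
function findModifiedTests(
  before: Record<string, string>,
  after: FileContents,
): string[] {
  const current = snapshotHashes(after);
  return Object.keys(before).filter((name) => current[name] !== before[name]);
}

// The failure reason string is taken from the release notes.
function tamperReason(modified: string[]): "tests-modified" | null {
  return modified.length > 0 ? "tests-modified" : null;
}
```

Treating a deleted test file the same as an edited one is a design choice of this sketch; the notes do not say how the real verifier handles deletion.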


CLI: jiva benchmark

# Run tiers 1–3 against the configured model
jiva benchmark --max-tier 3

# Run the whole suite and write a JSON report
jiva benchmark --output report.json

# Run specific tasks, machine-readable output
jiva benchmark --tasks t05-refactor-split,t08-debug-id-collision --json

# Carry the agent's own output forward between tiers (cascading, like a real session)
jiva benchmark --continuous

Options: --config, --max-tier, --tasks, --max-iterations, --timeout, --lsp, --continuous, --output, --json, --keep-workspaces. The process exits non-zero when any task fails, so it drops straight into CI.

The CLI prints a per-task table (status, iterations, tests passed, tokens, time), a summary line with the highest tier passed (a single capability-ceiling number), and the diagnostic notes for any failures.
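
The summary number and the CI behaviour could be derived roughly as below. The TaskOutcome shape is hypothetical (the real TaskResult type is in src/code/benchmark/types.ts), the exit code 1 is an assumption (the notes say only "non-zero"), and the sketch takes "highest tier passed" to mean the largest passing tier number, which the notes do not spell out.

```typescript
// Hypothetical, trimmed-down result shape used only for this sketch.
interface TaskOutcome {
  tier: number;
  passed: boolean;
}

// Largest tier number among the passing tasks, or 0 if nothing passed.
function highestTierPassed(results: readonly TaskOutcome[]): number {
  let highest = 0;
  for (const result of results) {
    if (result.passed && result.tier > highest) {
      highest = result.tier;
    }
  }
  return highest;
}

// Non-zero when any task failed, so a CI job fails along with it.
function exitCodeFor(results: readonly TaskOutcome[]): number {
  return results.every((result) => result.passed) ? 0 : 1;
}
```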


HTTP API

Registered under the existing /api auth middleware:

| Route | Purpose |
|-------|---------|
| GET /api/benchmark/tasks | List available tasks (metadata only) |
| POST /api/benchmark/run | Run the suite, return the full SuiteResult JSON |
| POST /api/benchmark/run/stream | Run the suite, stream per-task progress via SSE (task-start, task-done, done, error) |

Request body: { maxTier?, tasks?, maxIterations?, timeoutMs?, lsp?, continuous? }. The orchestrator is built once from the stored config and cached across requests.
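
A client-side sketch of these routes follows. The field names and SSE event names come from the notes above; the field types, the helper names, and the example payloads are assumptions, and auth headers are omitted because the /api auth scheme is not described here.

```typescript
// Field names are from the release notes; the types are assumptions.
interface BenchmarkRunRequest {
  maxTier?: number;
  tasks?: string[];
  maxIterations?: number;
  timeoutMs?: number;
  lsp?: boolean;
  continuous?: boolean;
}

// Builds fetch() options for POST /api/benchmark/run (or /run/stream).
function buildRunRequest(body: BenchmarkRunRequest) {
  return {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(body),
  };
}

interface SseFrame {
  event: string;
  data: string;
}

// Parses complete Server-Sent Events frames out of a text buffer.
// Frames are separated by a blank line; the payload is returned as a raw
// string because the event payload shapes are not documented here.
function parseSseFrames(text: string): SseFrame[] {
  const frames: SseFrame[] = [];
  for (const block of text.split(/\r?\n\r?\n/)) {
    let event = "message"; // SSE default when no "event:" field is present
    const data: string[] = [];
    for (const line of block.split(/\r?\n/)) {
      if (line.startsWith("event:")) {
        event = line.slice("event:".length).trim();
      } else if (line.startsWith("data:")) {
        data.push(line.slice("data:".length).replace(/^ /, ""));
      }
    }
    if (data.length > 0) {
      frames.push({ event, data: data.join("\n") });
    }
  }
  return frames;
}
```

The options object can be passed as the second argument to fetch() together with the server's base URL. The parser expects whole frames, so a streaming client would need to buffer partial chunks until a blank line arrives before calling it.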


Architecture

src/code/benchmark/
  types.ts                — BenchmarkTask, TaskResult, SuiteResult, RunnerOptions
  fixtures.ts             — canonical taskstore source (golden state per tier)
  tests.ts                — cumulative node:test files (the protected tests)
  tasks.ts                — the 8 ordered tiers + selection helpers
  verify.ts               — runNodeTests(): tamper check + `node --test` (scoped to the protected tests) + parse
  agent-factory.ts        — minimal CodeAgent per task (no MCP/persona/persistence)
  orchestrator-factory.ts — shared ModelOrchestrator builder (CLI + HTTP)
  runner.ts               — scaffolds workspaces, drives the agent, collects metrics
  report.ts               — CLI table + JSON serializer
  index.ts                — public entry point

Both interfaces converge on runBenchmark(orchestrator, tasks, options, progress). The runner creates an isolated mkdtemp workspace per task (the user's repo is never touched), races agent.chat() against a wall-clock timeout, runs the verifier, then tears everything down.
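
Racing a call against a wall-clock timeout is commonly done with Promise.race. The helper below is a generic sketch of that pattern, not the runner's code; the name withTimeout and the error message are hypothetical.

```typescript
// Resolves or rejects with `work` if it settles first; otherwise rejects
// once `timeoutMs` has elapsed. The timer is always cleared so a finished
// task does not keep the process alive.
function withTimeout<T>(work: Promise<T>, timeoutMs: number): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_resolve, reject) => {
    timer = setTimeout(() => {
      reject(new Error(`timed out after ${timeoutMs} ms`));
    }, timeoutMs);
  });
  return Promise.race([work, timeout]).finally(() => {
    if (timer !== undefined) {
      clearTimeout(timer);
    }
  });
}
```

In the runner's terms this would wrap something like agent.chat(prompt) with the configured timeout. Note that losing the race does not cancel the underlying work; stopping a still-running agent needs a separate mechanism, which this sketch does not cover.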

A self-test (scripts/bench-selftest.mjs, run after npm run build) validates the fixtures and verifier without any model: for every tier it asserts the golden solution passes, the scaffold fails, and tampering is detected (24 checks).


Benchmark suites (taskstore + micro-CRM)

The benchmark is now organised into suites with two scoring modes, laying the
groundwork for a tiered strategy (baseline → capability → frontier):

  • taskstore (baseline, gating). The original 8-tier TDD suite. Binary pass/fail;
    tasks build on one another. Answers "does this model + config do code mode at all?"
  • microcrm (capability, scored). A suite that builds a CRM REST API using only
    Node built-ins (node:http + node:sqlite), graded by the fraction of spec tests
    passed (51 tests across 5 tasks). Tier 1 builds the base API from scratch
    (a genuine large-output test). Tiers 2-5 scaffold the working base and add one harder
    feature each: atomic bulk insert (transaction rollback), advanced querying
    (combined filters + sorting + hasMore pagination), weighted pipeline analytics, and
    idempotency-key dedup on create. Zero external dependencies, fully deterministic, real
    HTTP + real SQLite. The report shows the pass rate and lists the exact spec tests missed.
    Requires Node ≥ 22.5.

Output-length flag. CodeAgent now counts output-token truncations per turn
(AgentResponse.truncationEvents), and the benchmark flags a failed task as
[output-limited] when it failed after hitting the model's output cap — distinguishing a
model that can't emit a large enough response (e.g. a hard 4096-token cap) from one that
got the logic wrong.
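
The flagging rule might look like the sketch below. The truncationEvents field and the [output-limited] tag come from the notes; the TaskRun shape and the "any truncation on a failed task" threshold are assumptions.

```typescript
// Hypothetical per-task record; only truncationEvents is named in the notes.
interface TaskRun {
  passed: boolean;
  truncationEvents: number; // output-token truncations seen during the task
}

// Tags a failure as output-limited when the task failed and the model hit
// its output cap at least once; passing tasks are never tagged.
function failureTag(run: TaskRun): string | null {
  if (run.passed) {
    return null;
  }
  return run.truncationEvents > 0 ? "[output-limited]" : null;
}
```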

New scoring mode: scored suites report Score: N/M tests (P%) instead of all-or-nothing,
and the runner aggregates per-suite test totals. Verifier now also captures failing test
names. Run jiva benchmark --list to see all suites/tasks; --suite <id> to pick one.
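
As a worked example of the scored format, the sketch below aggregates per-task test counts and renders the Score line. The TaskScore shape, the function names, and rounding to a whole percent are assumptions; the counts in the usage note are illustrative, not measured results.

```typescript
// Hypothetical per-task tally for a scored suite.
interface TaskScore {
  testsPassed: number;
  testsTotal: number;
}

// Sums per-task counts into suite-level totals.
function aggregate(tasks: readonly TaskScore[]): TaskScore {
  return tasks.reduce(
    (sum, task) => ({
      testsPassed: sum.testsPassed + task.testsPassed,
      testsTotal: sum.testsTotal + task.testsTotal,
    }),
    { testsPassed: 0, testsTotal: 0 },
  );
}

// Renders "Score: N/M tests (P%)"; an empty suite is reported as 0%.
function formatScore(score: TaskScore): string {
  const pct =
    score.testsTotal === 0
      ? 0
      : Math.round((score.testsPassed / score.testsTotal) * 100);
  return `Score: ${score.testsPassed}/${score.testsTotal} tests (${pct}%)`;
}
```

For instance, a run that passes 38 of the 51 microcrm spec tests would render as Score: 38/51 tests (75%), since 38/51 is about 0.745.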

HTTP gains GET /api/benchmark/suites and a suite field on the run routes.

Code-mode compatibility fixes (surfaced by the benchmark)

The first benchmark runs immediately exposed code-mode issues that hurt every model and broke weaker ones (Sarvam-105b, Krutrim) outright. The four below are fixed in this release:

  1. write_file / edit_file now accept path as an alias for file_path. Models frequently emit the common path argument; provider-side tool-call validation (Groq/Sarvam) then rejected the call with a 400 (missing properties: 'file_path') before it reached the tool, wasting iterations and tokens. The schema now hard-requires only content (write) / old_string+new_string (edit), accepts either path or file_path, and execute() validates the path itself. repairFailedToolCall also normalizes path → file_path as a fallback.

  2. Schema/argument errors are no longer misdiagnosed as output-token truncation. Groq codes both truncated and schema-invalid tool calls as tool_use_failed. The handler previously treated any such error on a file tool as "content too large" and told the model to "write in stages" — the wrong remedy. It now inspects failed_generation: a complete, parseable payload is treated as a schema error (with a targeted, tool-specific parameter correction), and only a genuinely cut-off payload routes to the write-in-stages path.

  3. Sarvam-105b output budget corrected to 4096. The setup-wizard preset requested 8192 completion tokens, but Sarvam's API caps output at 4096 (requesting more is rejected). The preset now uses 4096.

  4. Graceful "continue?" prompt at the step limit. When code mode hits its iteration limit before finishing, the agent now reports a stopReason: 'max-iterations', and the interactive CLI offers the user a choice — Continue working… (resumes with full history) or Stop for now — instead of silently emitting [Max iterations reached without a final response]. (HTTP behaviour is unchanged.)
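
Fixes 1 and 2 can be sketched as two small helpers. Both function names are hypothetical, and the sketch assumes that failed_generation carries the tool-call arguments as JSON text and that file_path wins when both keys are present; neither detail is confirmed by the notes.

```typescript
type ToolArgs = Record<string, unknown>;

// Fix 1: accept `path` as an alias for `file_path` and validate that one of
// them is a non-empty string. The result never contains the `path` key.
function normalizePathArg(args: ToolArgs): ToolArgs {
  const { path, ...rest } = args;
  const filePath = rest["file_path"];
  if (typeof filePath === "string" && filePath.length > 0) {
    return rest;
  }
  if (typeof path === "string" && path.length > 0) {
    return { ...rest, file_path: path };
  }
  throw new Error("missing file path: expected 'file_path' or 'path'");
}

type FailureKind = "schema-error" | "truncated";

// Fix 2: a payload that parses completely is a schema/argument problem;
// one that does not parse is treated as cut off by the output-token cap.
// Simplification: any unparseable payload counts as truncated here.
function classifyFailedGeneration(failedGeneration: string): FailureKind {
  try {
    JSON.parse(failedGeneration);
    return "schema-error";
  } catch (err) {
    return "truncated";
  }
}
```

Only the "truncated" branch would route to the write-in-stages advice; the "schema-error" branch would get the parameter correction instead.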

Files Changed

| File | Change |
|------|--------|
| src/code/benchmark/* | New: the benchmark module (suites, runner, verifier, report) |
| src/code/benchmark/suites.ts | New: suite registry (taskstore + microcrm) |
| src/code/benchmark/microcrm/* | New: micro-CRM scored suite + node:http/node:sqlite test & reference assets |
| scripts/copy-benchmark-assets.mjs | New: copies benchmark .mjs assets into dist on build |
| src/code/tools/write.ts | Accept path alias; relax schema required to content |
| src/code/tools/edit.ts | Accept path alias; relax schema required to old_string/new_string |
| src/code/agent.ts | Distinguish truncation vs schema errors; targeted schema-arg correction; path → file_path repair; truncationEvents + stopReason on the response |
| src/interfaces/cli/setup-wizard.ts | Sarvam defaultMaxTokens 8192 → 4096 |
| src/interfaces/cli/repl.ts | Offer Continue/Stop when code mode hits its step limit |
| src/core/agent-interface.ts | stopReason on AgentChatResponse |
| src/interfaces/cli/index.ts | New benchmark command |
| src/interfaces/http/routes/benchmark.ts | New: benchmark HTTP routes |
| src/interfaces/http/index.ts | Register setupBenchmarkRoutes |
| scripts/bench-selftest.mjs | New: fixture/verifier self-test |
| docs/guides/benchmark-suite.md | New: usage & extension guide |
| package.json | Version 0.3.47 → 0.3.48 |

Upgrade

npm install -g jiva-core@0.3.48

No config changes required. The benchmark reuses the model stack from your existing configuration.