v1.5.0 — Hard-Task Stress Test (2026-05-27)

Theme: Push past the D1/D5 ceiling. v1.4.x had top configs hitting
100% on the canonical 8-task refactor set, which made the leaderboard
uninformative at the top end. v1.5 adds a new harder task shape (D6)
designed to actually stress 30B local models, and stress-tests the
v1.4.1 champion configurations against it.

What's new

New D6 task class: hard refactors (4 tasks, 80 acceptance tests)

Each D6 task is a single-file Python implementation challenge with
comprehensive pytest coverage. The reference solutions are all ≤200
LOC, so the difficulty isn't volume — it's the breadth of corner
cases the model has to internalise from the prompt alone.

Task	LOC	Tests	What it stresses
`d6-lru-ttl-cache`	100	23	OrderedDict-based LRU + monkey-patchable TTL clock + careful eviction accounting
`d6-token-bucket`	60	14	Lazy refill correctness + multi-key isolation + non-positive-arg validation
`d6-toposort`	90	16	Kahn's algorithm with deterministic tie-break + DFS cycle detection with path reconstruction
`d6-mini-template`	200	27	Recursive-descent parser + AST evaluator + escape filters + nested if/for + comments

Total: ~450 LOC of reference solution and 80 pytest tests.
D6 uses the same overlay+pytest scoring path as D1/D5, so no new
scoring infrastructure is needed.
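
To give a feel for the kind of corner case the table refers to, here is a minimal sketch of Kahn's algorithm with a deterministic tie-break. It is not the d6-toposort reference solution: the function name `toposort`, the input shape (each node mapped to the set of nodes it depends on), and the lexicographic tie-break rule are all assumptions made for illustration.

```python
import heapq


def toposort(deps: dict[str, set[str]]) -> list[str]:
    """Return nodes in dependency order; ties are broken lexicographically.

    `deps` maps each node to the set of nodes it depends on (a hypothetical
    input shape, not taken from the D6 task prompt).
    """
    # Collect every node, including ones that only appear as a dependency.
    nodes = set(deps)
    for targets in deps.values():
        nodes |= targets

    indegree = {n: len(deps.get(n, ())) for n in nodes}
    dependents: dict[str, list[str]] = {n: [] for n in nodes}
    for node, targets in deps.items():
        for target in targets:
            dependents[target].append(node)

    # A min-heap of ready nodes makes the output independent of set order.
    ready = [n for n, degree in indegree.items() if degree == 0]
    heapq.heapify(ready)

    order: list[str] = []
    while ready:
        node = heapq.heappop(ready)
        order.append(node)
        for dependent in dependents[node]:
            indegree[dependent] -= 1
            if indegree[dependent] == 0:
                heapq.heappush(ready, dependent)

    if len(order) != len(nodes):
        # Some nodes never reached indegree 0, so they sit on a cycle.
        # Reconstructing the cycle path is omitted in this sketch.
        raise ValueError("cycle detected")
    return order
```

Without the heap, two nodes that become ready at the same time could be emitted in either order, which is the sort of nondeterminism an acceptance test with a fixed expected output would reject.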

Stress-test sweeps

Two new sweep configs target the v1.4.1 audit's top configurations:

- configs/v1.5-hard-gemma4.yaml: aider+gemma4 on always-cloud and
  heuristic (the v1.3 marquee profile). 4 tasks × 2 strategies × 3
  seeds = 24 rows.
- configs/v1.5-hard-qwen3.6.yaml: cline+qwen3.6 on always-cloud,
  always-local, and cascade (the v1.4.1 champion profile). 4 tasks ×
  3 strategies × 3 seeds = 36 rows.

Total 60 new rows of hard-task data. Cost cap: $50 cloud spend.
Wall-time cap: 6 hours.
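
The row counts follow from a full cross product of tasks, strategies, and seeds. The sketch below reproduces that arithmetic using the task names from the D6 table and the seeds from the reproduce commands. The `sweep_rows` helper is hypothetical and is not part of the `bench` CLI.

```python
from itertools import product

D6_TASKS = ["d6-lru-ttl-cache", "d6-token-bucket", "d6-toposort", "d6-mini-template"]
SEEDS = [42, 7, 13]


def sweep_rows(strategies: list[str]) -> list[tuple[str, str, int]]:
    """Enumerate one (task, strategy, seed) row per combination."""
    return list(product(D6_TASKS, strategies, SEEDS))


gemma4_rows = sweep_rows(["always-cloud", "heuristic"])
qwen_rows = sweep_rows(["always-cloud", "always-local", "cascade"])

# 4 x 2 x 3 = 24, 4 x 3 x 3 = 36, and 24 + 36 = 60 rows in total.
print(len(gemma4_rows), len(qwen_rows), len(gemma4_rows) + len(qwen_rows))
```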

Findings

Both sweeps complete. Full analysis in
personal/reports/publish-v1.5/article.html §9.5.

Agent	Model	Strategy	Pass	Cloud-frac	Notes
aider	gemma4:31b	always-cloud	12/12 (100%)	100%	ceiling
aider	gemma4:31b	heuristic	7/12 (58%)	61%	v1.3 marquee profile falls off on D6
cline	qwen3.6:35b	always-cloud	12/12 (100%)	100%	ceiling
cline	qwen3.6:35b	always-local	8/12 (67%)	0%	30B local-only ceiling — $0 cloud spend
cline	qwen3.6:35b	cascade	9/12 (75%)	13%	v1.4.1 champion holds 75% but loses on mini-template
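
The Pass percentages, and the gaps quoted in the key findings that follow, can be re-derived from the raw counts in the table above. This is a minimal sketch: the dictionary layout and the `pass_pct` helper are illustrative only and do not reflect how `bench analyze` stores or reports results.

```python
# (agent, strategy) -> (passed, total), copied from the findings table.
RESULTS = {
    ("aider", "always-cloud"): (12, 12),
    ("aider", "heuristic"): (7, 12),
    ("cline", "always-cloud"): (12, 12),
    ("cline", "always-local"): (8, 12),
    ("cline", "cascade"): (9, 12),
}


def pass_pct(passed: int, total: int) -> int:
    """Pass rate as a whole-number percentage, rounded to nearest."""
    return round(100 * passed / total)


rates = {key: pass_pct(*counts) for key, counts in RESULTS.items()}

# Always-cloud vs heuristic for aider, in percentage points.
cloud_gap = rates[("aider", "always-cloud")] - rates[("aider", "heuristic")]
# Cascade vs always-local for cline, in percentage points.
cascade_gap = rates[("cline", "cascade")] - rates[("cline", "always-local")]
```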

Key findings:

- 30B local-only solves 67% of hard refactors with zero cloud spend.
  cline + qwen3.6:35b passes token-bucket and toposort 3/3 each and
  partially passes lru-ttl-cache and mini-template. Two of the four
  failures are cline-on-local session bugs rather than model-quality
  problems.
- The v1.4.1 cascade champion drops from 100% to 75% on D6. Cascade only
  marginally beats always-local (75% vs 67%) because the router has no
  global view of task difficulty. The d6-mini-template recursive-descent
  parser is the hard wall.
- Always-cloud (gpt-5.5) is 100% on both configs. The cloud advantage on
  D6 is real: 42 percentage points over heuristic gemma4 (100% vs 58%).
- A pytest-parser bug in src/hybrid_coding_eval/agents/aider.py was caught
  by a new parametrized test (tests/agents/test_aider_parser.py) and
  fixed. The bug undercounted passes for aider rows where `X failed`
  appeared before `Y passed` in the summary line. Rescoring the v1.5
  gemma4 data found 0 affected rows, so the fix is preventative.
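
For readers unfamiliar with this failure mode, here is a minimal sketch of order-independent parsing of a pytest summary line. It is not the code in src/hybrid_coding_eval/agents/aider.py: the function name `parse_pytest_summary` and the regex are assumptions, and the sample line in the usage note mimics pytest's usual `N failed, M passed in Ts` footer.

```python
import re

# Match each "<count> <outcome>" pair on its own, so their order is irrelevant.
_COUNT_RE = re.compile(r"(\d+) (passed|failed|errors?|skipped)\b")


def parse_pytest_summary(line: str) -> dict[str, int]:
    """Extract outcome counts from a pytest summary line (hypothetical helper)."""
    counts = {"passed": 0, "failed": 0, "error": 0, "skipped": 0}
    for number, outcome in _COUNT_RE.findall(line):
        # pytest prints "1 error" or "2 errors"; fold both into one key.
        key = "error" if outcome.startswith("error") else outcome
        counts[key] += int(number)
    return counts
```

With this approach, `parse_pytest_summary("==== 2 failed, 21 passed in 0.31s ====")` reports 21 passes and 2 failures. A pattern that expects the passed count to come first would miss the passes in that same line.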

How to reproduce

git checkout v1.5.0
./scripts/reproduce.sh    # one-time env setup if not done yet
./bench sweep --config configs/v1.5-hard-gemma4.yaml \
    --strategies always-cloud,heuristic --seeds 42,7,13
./bench sweep --config configs/v1.5-hard-qwen3.6.yaml \
    --strategies always-cloud,always-local,cascade --seeds 42,7,13
./bench analyze results/runs/v1.5-hard-gemma4
./bench analyze results/runs/v1.5-hard-qwen3.6

Expected wall time on M4 Max 64 GB: ~30 min for gemma4 sweep,
~80 min for qwen3.6 sweep. Expected cloud spend at gpt-5.5
list pricing: ~$8 total across both sweeps.

Migration from v1.4.4

Zero migration cost. v1.5 is purely additive:

- The D1–D5 task shapes are unchanged. The v1.4.1 canonical dataset
  remains the headline dataset.
- The D6 shape uses the same overlay+pytest scoring path as D1/D5.
  No new scoring dependencies.
- Existing configs continue to work. The new configs are isolated to
  configs/v1.5-hard-*.yaml.
