v1.5.0 — Hard-Task Stress Test (2026-05-27)
Theme: Push past the D1/D5 ceiling. v1.4.x had top configs hitting
100% on the canonical 8-task refactor set, which made the leaderboard
uninformative at the top end. v1.5 adds a new harder task shape (D6)
designed to actually stress 30B local models, and stress-tests the
v1.4.1 champion configurations against it.
What's new
New D6 task class: hard refactors (4 tasks, 80 acceptance tests)
Each D6 task is a single-file Python implementation challenge with
comprehensive pytest coverage. The reference solutions are all ≤200
LOC, so the difficulty isn't volume — it's the breadth of corner
cases the model has to internalise from the prompt alone.
| Task | LOC | Tests | What it stresses |
|---|---|---|---|
| d6-lru-ttl-cache | 100 | 23 | OrderedDict-based LRU + monkey-patchable TTL clock + careful eviction accounting |
| d6-token-bucket | 60 | 14 | Lazy refill correctness + multi-key isolation + non-positive-arg validation |
| d6-toposort | 90 | 16 | Kahn's with deterministic tie-break + DFS cycle detection with path reconstruction |
| d6-mini-template | 200 | 27 | Recursive-descent parser + AST evaluator + escape filters + nested if/for + comments |
Total: ~450 LOC of reference solution, 80 pytest tests.
D6 uses the same overlay+pytest scoring path as D1/D5, so no new
scoring infrastructure is needed.
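To give a feel for the corner cases in the table, here is a minimal sketch of the three behaviours listed for d6-token-bucket: lazy refill, multi-key isolation, and non-positive-argument validation. It is not the reference solution. The class name, method names, and the injectable `clock` parameter are assumptions made for illustration.

```python
import time


class TokenBucket:
    """Per-key token bucket with lazy refill (illustrative sketch only)."""

    def __init__(self, capacity, refill_rate, clock=time.monotonic):
        # Non-positive-arg validation: reject bad configuration up front.
        if capacity <= 0 or refill_rate <= 0:
            raise ValueError("capacity and refill_rate must be positive")
        self.capacity = capacity
        self.refill_rate = refill_rate  # tokens credited per second
        self._clock = clock  # injectable so tests can control time
        self._buckets = {}  # key -> (tokens, timestamp of last update)

    def allow(self, key, cost=1):
        if cost <= 0:
            raise ValueError("cost must be positive")
        now = self._clock()
        # Multi-key isolation: an unseen key starts with its own full bucket.
        tokens, last = self._buckets.get(key, (self.capacity, now))
        # Lazy refill: credit elapsed time only when the key is touched,
        # and never exceed the configured capacity.
        tokens = min(self.capacity, tokens + (now - last) * self.refill_rate)
        allowed = tokens >= cost
        if allowed:
            tokens -= cost
        self._buckets[key] = (tokens, now)
        return allowed
```

Passing a fake clock (for example `clock=lambda: fake_now[0]`) lets a test advance time deterministically instead of sleeping, which is the same idea as the monkey-patchable TTL clock mentioned for d6-lru-ttl-cache.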
Stress-test sweeps
Two new sweep configs target the v1.4.1 audit's top configurations:
- configs/v1.5-hard-gemma4.yaml: aider+gemma4 on always-cloud and
  heuristic (the v1.3 marquee profile). 4 tasks × 2 strategies × 3
  seeds = 24 rows.
- configs/v1.5-hard-qwen3.6.yaml: cline+qwen3.6 on always-cloud,
  always-local, and cascade (the v1.4.1 champion profile). 4 tasks ×
  3 strategies × 3 seeds = 36 rows.
Total: 60 new rows of hard-task data. Cost cap: $50 cloud spend.
Wall-time cap: 6 hours.
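The row counts follow from a full cross product of tasks, strategies, and seeds. The sketch below enumerates that grid to check the arithmetic. It assumes the seeds are the three listed in the reproduce commands (42, 7, 13). The dictionary layout is illustrative and does not mirror the actual YAML schema.

```python
from itertools import product

TASKS = ["d6-lru-ttl-cache", "d6-token-bucket", "d6-toposort", "d6-mini-template"]
SEEDS = [42, 7, 13]

# Strategies per sweep, keyed by config name (illustrative layout only).
SWEEPS = {
    "v1.5-hard-gemma4": ["always-cloud", "heuristic"],
    "v1.5-hard-qwen3.6": ["always-cloud", "always-local", "cascade"],
}


def sweep_rows(strategies):
    """Return every (task, strategy, seed) combination for one sweep."""
    return list(product(TASKS, strategies, SEEDS))


row_counts = {name: len(sweep_rows(strats)) for name, strats in SWEEPS.items()}
total_rows = sum(row_counts.values())
print(row_counts, total_rows)
```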
Findings
Both sweeps completed. The full analysis is in
personal/reports/publish-v1.5/article.html §9.5.
| Agent | Model | Strategy | Pass | Cloud-frac | Notes |
|---|---|---|---|---|---|
| aider | gemma4:31b | always-cloud | 12/12 (100%) | 100% | ceiling |
| aider | gemma4:31b | heuristic | 7/12 (58%) | 61% | v1.3 marquee profile falls off on D6 |
| cline | qwen3.6:35b | always-cloud | 12/12 (100%) | 100% | ceiling |
| cline | qwen3.6:35b | always-local | 8/12 (67%) | 0% | 30B local-only ceiling — $0 cloud spend |
| cline | qwen3.6:35b | cascade | 9/12 (75%) | 13% | v1.4.1 champion holds 75% but loses on mini-template |
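The percentages in the table are pass counts over 12 rows (4 tasks × 3 seeds), rounded to the nearest whole percent. A short sketch that recomputes them and the gaps between strategies, using only the counts copied from the table:

```python
# (agent, model, strategy, passed, total), copied from the table above.
ROWS = [
    ("aider", "gemma4:31b", "always-cloud", 12, 12),
    ("aider", "gemma4:31b", "heuristic", 7, 12),
    ("cline", "qwen3.6:35b", "always-cloud", 12, 12),
    ("cline", "qwen3.6:35b", "always-local", 8, 12),
    ("cline", "qwen3.6:35b", "cascade", 9, 12),
]


def pass_pct(passed, total):
    """Pass rate as a whole-number percentage."""
    return round(100 * passed / total)


rates = {(agent, strategy): pass_pct(p, t) for agent, _model, strategy, p, t in ROWS}

# Gaps in percentage points between strategies on the same agent/model pair.
cloud_vs_heuristic = rates[("aider", "always-cloud")] - rates[("aider", "heuristic")]
cascade_vs_local = rates[("cline", "cascade")] - rates[("cline", "always-local")]
print(rates, cloud_vs_heuristic, cascade_vs_local)
```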
Key findings:
- 30B local-only solves 67% of hard refactors with zero cloud spend. cline + qwen3.6:35b
  nails token-bucket and toposort 3/3 each and partially passes lru-ttl-cache and mini-template.
  Two of the four "failures" are cline-on-local session bugs, not model quality.
- The v1.4.1 cascade champion drops from 100% to 75% on D6. Cascade only marginally
  beats always-local (75% vs 67%) because the router has no global view of task difficulty.
  The d6-mini-template recursive-descent parser is the hard wall.
- Always-cloud (gpt-5.5) scores 100% on both configs. The cloud advantage on D6 is real:
  42 percentage points over heuristic gemma4.
- The pytest-parser bug fixed in src/hybrid_coding_eval/agents/aider.py was caught by a
  new parametrized test (tests/agents/test_aider_parser.py). The bug undercounted
  passes for aider rows where "X failed" appeared before "Y passed" in the summary line.
  Rescoring the v1.5 gemma4 data found 0 affected rows, so the fix is preventative.
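The parser bug in the last finding is an ordering problem: pytest prints the failed count before the passed count in its summary line whenever there are failures. The sketch below shows one order-independent way to read such a line. It is an assumption about the general approach, not the code in src/hybrid_coding_eval/agents/aider.py. The function name and the set of labels handled are hypothetical.

```python
import re

# Matches "<count> <label>" pairs anywhere in the line, in any order.
_COUNT_RE = re.compile(r"\b(\d+) (passed|failed|skipped|error)s?\b")


def parse_pytest_summary(line):
    """Return outcome counts from a pytest summary line, defaulting to 0."""
    counts = {"passed": 0, "failed": 0, "skipped": 0, "error": 0}
    for number, label in _COUNT_RE.findall(line):
        counts[label] = int(number)
    return counts


# "failed" appears before "passed", the ordering that triggered the undercount.
mixed = parse_pytest_summary("=== 2 failed, 21 passed in 0.12s ===")
clean = parse_pytest_summary("=== 23 passed in 0.10s ===")
print(mixed["passed"], mixed["failed"], clean["passed"])
```

Scanning for every count/label pair, rather than matching one fixed pattern from left to right, means the result does not depend on which outcome pytest happens to print first.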
How to reproduce
git checkout v1.5.0
./scripts/reproduce.sh # one-time env setup if not done yet
./bench sweep --config configs/v1.5-hard-gemma4.yaml \
--strategies always-cloud,heuristic --seeds 42,7,13
./bench sweep --config configs/v1.5-hard-qwen3.6.yaml \
--strategies always-cloud,always-local,cascade --seeds 42,7,13
./bench analyze results/runs/v1.5-hard-gemma4
./bench analyze results/runs/v1.5-hard-qwen3.6

Expected wall time on M4 Max 64 GB: ~30 min for gemma4 sweep,
~80 min for qwen3.6 sweep. Expected cloud spend at gpt-5.5
list pricing: ~$8 total across both sweeps.
Migration from v1.4.4
Zero migration cost. v1.5 is purely additive:
- The D1–D5 task shapes are unchanged. The v1.4.1 canonical dataset
  remains the headline dataset.
- The D6 shape uses the same overlay+pytest scoring path as D1/D5.
  No new scoring dependencies.
- Existing configs continue to work. The new configs are isolated to
  configs/v1.5-hard-*.yaml.