v1.5.0 — Hard-Task Stress Test

@sanchitmonga22 sanchitmonga22 released this 27 May 10:08
· 10 commits to main since this release

v1.5.0 — Hard-Task Stress Test (2026-05-27)

Theme: Push past the D1/D5 ceiling. v1.4.x had top configs hitting
100% on the canonical 8-task refactor set, which made the leaderboard
uninformative at the top end. v1.5 adds a new harder task shape (D6)
designed to actually stress 30B local models, and stress-tests the
v1.4.1 champion configurations against it.

What's new

New D6 task class: hard refactors (4 tasks, 80 acceptance tests)

Each D6 task is a single-file Python implementation challenge with
comprehensive pytest coverage. The reference solutions are all ≤200
LOC, so the difficulty isn't volume — it's the breadth of corner
cases the model has to internalise from the prompt alone.

| Task | LOC | Tests | What it stresses |
|---|---|---|---|
| d6-lru-ttl-cache | 100 | 23 | OrderedDict-based LRU + monkey-patchable TTL clock + careful eviction accounting |
| d6-token-bucket | 60 | 14 | Lazy refill correctness + multi-key isolation + non-positive-arg validation |
| d6-toposort | 90 | 16 | Kahn's with deterministic tie-break + DFS cycle detection with path reconstruction |
| d6-mini-template | 200 | 27 | Recursive-descent parser + AST evaluator + escape filters + nested if/for + comments |

Total: ~450 LOC of reference solution, 80 pytest tests.
D6 uses the same overlay+pytest scoring path as D1/D5, so no new
scoring infrastructure is needed.
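
To give a feel for the kind of corner case D6 targets, here is a minimal sketch of Kahn's algorithm with a deterministic tie-break, the first half of what d6-toposort stresses. It is an illustration only, not the reference solution: the `toposort` signature, the "smallest ready node first" rule, and the `ValueError` on cycles are all assumptions, and the DFS cycle-path reconstruction is left out.

```python
import heapq
from collections import defaultdict


def toposort(edges, nodes=()):
    """Kahn's algorithm; ties are broken by emitting the smallest ready node.

    `edges` is an iterable of (src, dst) pairs; `nodes` optionally lists
    isolated nodes. Both names are hypothetical, chosen for this sketch.
    """
    successors = defaultdict(list)
    indegree = {n: 0 for n in nodes}
    for src, dst in edges:
        successors[src].append(dst)
        indegree.setdefault(src, 0)
        indegree[dst] = indegree.get(dst, 0) + 1

    # A min-heap makes the output independent of input edge order.
    ready = [n for n, d in indegree.items() if d == 0]
    heapq.heapify(ready)

    order = []
    while ready:
        node = heapq.heappop(ready)
        order.append(node)
        for nxt in successors[node]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                heapq.heappush(ready, nxt)

    # Any node never emitted still has an unmet dependency, i.e. a cycle.
    if len(order) != len(indegree):
        raise ValueError("graph contains a cycle")
    return order


print(toposort([("b", "d"), ("a", "d"), ("c", "a")]))
```

Without the heap, the order in which ready nodes are emitted depends on dict or edge ordering, which is exactly the sort of detail a test suite can pin down and a model can miss.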

Stress-test sweeps

Two new sweep configs target the v1.4.1 audit's top configurations:

  • configs/v1.5-hard-gemma4.yaml — aider+gemma4 on always-cloud and
    heuristic (the v1.3 marquee profile). 4 tasks × 2 strategies × 3
    seeds = 24 rows.
  • configs/v1.5-hard-qwen3.6.yaml — cline+qwen3.6 on always-cloud,
    always-local, and cascade (the v1.4.1 champion profile). 4 tasks ×
    3 strategies × 3 seeds = 36 rows.

Total: 24 + 36 = 60 new rows of hard-task data. Cost cap: $50 cloud
spend. Wall-time cap: 6 hours.
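
The row counts follow from a full grid over tasks, strategies, and seeds. The sketch below reproduces that arithmetic; it assumes each row is one (task, strategy, seed) combination and takes the seed values from the reproduce commands. The `sweep_rows` helper is hypothetical and not part of the bench CLI.

```python
from itertools import product

TASKS = ["d6-lru-ttl-cache", "d6-token-bucket", "d6-toposort", "d6-mini-template"]
SEEDS = [42, 7, 13]


def sweep_rows(strategies):
    """Enumerate one row per (task, strategy, seed) combination."""
    return list(product(TASKS, strategies, SEEDS))


gemma4_rows = sweep_rows(["always-cloud", "heuristic"])
qwen_rows = sweep_rows(["always-cloud", "always-local", "cascade"])

# 4 tasks x 2 strategies x 3 seeds, and 4 tasks x 3 strategies x 3 seeds
print(len(gemma4_rows), len(qwen_rows), len(gemma4_rows) + len(qwen_rows))
```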

Findings

Both sweeps complete. Full analysis in
personal/reports/publish-v1.5/article.html §9.5.

| Agent | Model | Strategy | Pass | Cloud-frac | Notes |
|---|---|---|---|---|---|
| aider | gemma4:31b | always-cloud | 12/12 (100%) | 100% | ceiling |
| aider | gemma4:31b | heuristic | 7/12 (58%) | 61% | v1.3 marquee profile falls off on D6 |
| cline | qwen3.6:35b | always-cloud | 12/12 (100%) | 100% | ceiling |
| cline | qwen3.6:35b | always-local | 8/12 (67%) | 0% | 30B local-only ceiling; $0 cloud spend |
| cline | qwen3.6:35b | cascade | 9/12 (75%) | 13% | v1.4.1 champion holds 75% but loses on mini-template |
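
The percentages in the Pass column are pass counts over 12 rows (4 tasks × 3 seeds), rounded to the nearest whole percent. A small sketch of that arithmetic, using only the counts from the table; the `RESULTS` layout and `pass_rate` helper are illustrative and not taken from the analysis code:

```python
# (agent, model, strategy) -> (rows passed, rows total), copied from the table
RESULTS = {
    ("aider", "gemma4:31b", "always-cloud"): (12, 12),
    ("aider", "gemma4:31b", "heuristic"): (7, 12),
    ("cline", "qwen3.6:35b", "always-cloud"): (12, 12),
    ("cline", "qwen3.6:35b", "always-local"): (8, 12),
    ("cline", "qwen3.6:35b", "cascade"): (9, 12),
}


def pass_rate(passed, total):
    """Pass rate as a whole-number percentage."""
    return round(100 * passed / total)


rates = {key: pass_rate(*counts) for key, counts in RESULTS.items()}
for (agent, model, strategy), pct in rates.items():
    print(f"{agent:5} {model:12} {strategy:13} {pct:3d}%")

# Gaps in percentage points between strategies
cloud_vs_heuristic = (rates[("aider", "gemma4:31b", "always-cloud")]
                      - rates[("aider", "gemma4:31b", "heuristic")])
cascade_vs_local = (rates[("cline", "qwen3.6:35b", "cascade")]
                    - rates[("cline", "qwen3.6:35b", "always-local")])
print(cloud_vs_heuristic, cascade_vs_local)
```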

Key findings:

  1. 30B local-only solves 67% of hard refactors with zero cloud spend. cline + qwen3.6:35b
    nails token-bucket and toposort 3/3 each, and partially passes lru-ttl-cache and mini-template.
    Two of the four "failures" are cline-on-local session bugs, not model quality.
  2. The v1.4.1 cascade champion drops from 100% to 75% on D6. Cascade only marginally
    beats always-local (75% vs 67%) because the router has no global view of task difficulty.
    The d6-mini-template recursive-descent parser is the hard wall.
  3. Always-cloud (gpt-5.5) is 100% on both configs. The cloud advantage on D6 is real:
    42 percentage points over heuristic gemma4 (100% vs 58%).
  4. A new parametrized test (tests/agents/test_aider_parser.py) caught a pytest-parser bug
    in src/hybrid_coding_eval/agents/aider.py, which is now fixed. The bug undercounted
    passes for aider rows whose pytest summary line listed "X failed" before "Y passed".
    Rescoring the v1.5 gemma4 data found 0 affected rows, so the fix is preventative.
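
The parser bug in finding 4 is an ordering assumption: pytest prints failures before passes in its summary line whenever any test fails. One order-independent way to read such a line is sketched below. This is not the code in aider.py; `parse_summary` and the set of labels it tracks are assumptions made for illustration.

```python
import re

# Matches "<count> <label>" pairs wherever they appear in the summary line.
_COUNT = re.compile(r"(\d+) (passed|failed|errors?|skipped)")


def parse_summary(line):
    """Return outcome counts from a pytest summary line, in any label order."""
    counts = {"passed": 0, "failed": 0, "error": 0, "skipped": 0}
    for number, label in _COUNT.findall(line):
        key = "error" if label.startswith("error") else label
        counts[key] += int(number)
    return counts


print(parse_summary("===== 2 failed, 21 passed in 0.12s ====="))
print(parse_summary("===== 23 passed in 0.10s ====="))
```

Collecting every count/label pair, rather than matching a fixed "passed ... failed" pattern, means a leading "failed" count cannot hide the "passed" count that follows it.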

How to reproduce

git checkout v1.5.0
./scripts/reproduce.sh    # one-time env setup if not done yet
./bench sweep --config configs/v1.5-hard-gemma4.yaml \
    --strategies always-cloud,heuristic --seeds 42,7,13
./bench sweep --config configs/v1.5-hard-qwen3.6.yaml \
    --strategies always-cloud,always-local,cascade --seeds 42,7,13
./bench analyze results/runs/v1.5-hard-gemma4
./bench analyze results/runs/v1.5-hard-qwen3.6

Expected wall time on M4 Max 64 GB: ~30 min for gemma4 sweep,
~80 min for qwen3.6 sweep. Expected cloud spend at gpt-5.5
list pricing: ~$8 total across both sweeps.

Migration from v1.4.4

Zero migration cost. v1.5 is purely additive:

  • The D1–D5 task shapes are unchanged. The v1.4.1 canonical dataset
    remains the headline dataset.
  • The D6 shape uses the same overlay+pytest scoring path as D1/D5.
    No new scoring dependencies.
  • Existing configs continue to work. The new configs are isolated to
    configs/v1.5-hard-*.yaml.