feat(benchmarks): bundle-based Test B real-project design (1,800/wave; 9,000 full) by Davincc77 · Pull Request #90 · Davincc77/klickdskill

Davincc77 · 2026-05-29T05:28:59Z

Summary

- Implements the final approved Test B "real project" design for v4.1: 5 representative project bundles × 150 sessions × 12 conditions = 9,000 outputs for the full design, with a 1,800-output long pilot per bundle.
- No real LLM calls. No publish / no tag / no release / no Zenodo / no npm / no PyPI.
- The full 5-bundle design is intentionally not launchable as a single run. It must be dispatched as 5 separate waves of the long pilot, with a manual review of the prior wave's audit report before each new wave.
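
The output counts follow directly from the design dimensions. A minimal sketch of that arithmetic, assuming each output corresponds to one (bundle, session, condition) triple; the constant and function names are illustrative and not taken from the repository:

```python
from itertools import product

# Dimensions quoted in the summary; the identifiers are hypothetical.
N_BUNDLES = 5
PHASES_PER_BUNDLE = 10
SESSIONS_PER_PHASE = 15
SESSIONS_PER_BUNDLE = PHASES_PER_BUNDLE * SESSIONS_PER_PHASE  # 150
N_CONDITIONS = 12


def enumerate_specs(bundle_indices):
    """Return one (bundle, session, condition) triple per expected output."""
    return list(
        product(bundle_indices, range(SESSIONS_PER_BUNDLE), range(N_CONDITIONS))
    )


one_wave = enumerate_specs([0])                  # a single long-pilot wave
full_design = enumerate_specs(range(N_BUNDLES))  # all five waves combined

print(len(one_wave))     # 1800
print(len(full_design))  # 9000
```

Five waves of 1,800 outputs each therefore cover the 9,000-output full design.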

What changed

- `benchmarks/v4.1/fixtures/bundles.py`: deterministic bundle generator. 5 bundles, 10 phases × 15 sessions, role / language / contradiction anchors, JSONL + `bundle_manifest.json` with per-file SHA-256.
- `benchmarks/v4.1/prompts/test_b_bundles.py`: 12 condition builders. The user probe and generation config are the same across all 12 conditions; only the prepended memory block differs.
- `benchmarks/v4.1/runner/executor_b_bundles.py`: bundle pilot executor. Reuses the retry / backoff / batching / JSONL / resumability primitives.
- `benchmarks/v4.1/runner/runner.py`: new `pilot-test-b-bundles` subcommand. Hard caps: `--bundles` ≤ 1, `--concurrency` ∈ [1, 2], `--sessions-per-bundle` ≤ 150. `--full-design` is intentionally refused.
- `benchmarks/v4.1/runner/audit_b_bundles.py`: robust auditor. Hard checks: condition balance, bundle/phase/session/role coverage, hash completeness, secret scan, forbidden claim phrases, missing timestamps. Soft checks: per-condition cost curves and session-depth token growth bins.
- `.github/workflows/benchmark-v41-pilot-testb-bundles.yml`: manual-only workflow. Provider locked to `gemini`. All inputs are validated and capped before secret access. `execute=false` by default.
- `benchmarks/v4.1/tests/test_pilot_test_b_bundles.py`: 16 tests (mock provider only). Covers full=9000 specs, pilot=1800 specs, 5 bundles, 150 sessions, 12 conditions, all 10 phases, prompt determinism, runner caps (including `--full-design` refusal), plan-only and execute paths, audit PASS, and audit FAIL on forbidden claim phrases.
- `README.md` and `BENCHMARK_PROTOCOL.md`: updated with the final design, scientific rationale, cost/throughput caution, and the exact dispatch command for the long pilot.
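
The hard caps on the `pilot-test-b-bundles` subcommand could be enforced along the following lines. This is a sketch only: the flag names and limits come from the list above, while the function name, the defaults, and the error messages are assumptions rather than the contents of `runner.py`.

```python
import argparse

# Limits quoted in this PR; the constant names are hypothetical.
MAX_BUNDLES = 1
MIN_CONCURRENCY, MAX_CONCURRENCY = 1, 2
MAX_SESSIONS_PER_BUNDLE = 150


def parse_pilot_args(argv):
    """Parse pilot flags and refuse anything above the hard caps."""
    parser = argparse.ArgumentParser(prog="pilot-test-b-bundles")
    # Defaults below are assumptions made for this sketch.
    parser.add_argument("--bundles", type=int, default=1)
    parser.add_argument("--concurrency", type=int, default=1)
    parser.add_argument("--sessions-per-bundle", type=int, default=150)
    parser.add_argument("--full-design", action="store_true")
    args = parser.parse_args(argv)

    # parser.error() prints a usage message and exits with status 2.
    if args.full_design:
        parser.error("--full-design is refused; dispatch one wave at a time")
    if args.bundles > MAX_BUNDLES:
        parser.error(f"--bundles must be <= {MAX_BUNDLES}")
    if not MIN_CONCURRENCY <= args.concurrency <= MAX_CONCURRENCY:
        parser.error(
            f"--concurrency must be in [{MIN_CONCURRENCY}, {MAX_CONCURRENCY}]"
        )
    if args.sessions_per_bundle > MAX_SESSIONS_PER_BUNDLE:
        parser.error(f"--sessions-per-bundle must be <= {MAX_SESSIONS_PER_BUNDLE}")
    return args


args = parse_pilot_args(["--concurrency", "2", "--sessions-per-bundle", "150"])
print(args.bundles, args.concurrency, args.sessions_per_bundle)
```

Validating before any provider object is constructed keeps an over-sized request from reaching the network at all.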

12 conditions (in audit order)

`no_memory`, `prompt_history`, `manual_context_repetition`, `project_docs_only`, `xklickd_static_bundle`, `xklickd_compressed_bundle`, `xklickd_cross_session_resume`, `xklickd_cross_language`, `xklickd_cross_agent`, `xklickd_human_veto`, `xklickd_contradiction_handling`, `xklickd_ci_weakening_resistance`.
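
To illustrate the "same probe, different memory block" property of these conditions, here is a hedged sketch. The condition names are the twelve listed above; `build_memory_block` and `build_prompt` are hypothetical stand-ins, and the placeholder memory text is not what `test_b_bundles.py` actually emits.

```python
CONDITIONS = [
    "no_memory", "prompt_history", "manual_context_repetition",
    "project_docs_only", "xklickd_static_bundle", "xklickd_compressed_bundle",
    "xklickd_cross_session_resume", "xklickd_cross_language",
    "xklickd_cross_agent", "xklickd_human_veto",
    "xklickd_contradiction_handling", "xklickd_ci_weakening_resistance",
]


def build_memory_block(condition, bundle_facts):
    """Hypothetical builder: return the text prepended for one condition."""
    if condition == "no_memory":
        return ""
    # Placeholder content; the real builders differ per condition.
    return f"[{condition}]\n" + "\n".join(bundle_facts)


def build_prompt(condition, bundle_facts, user_probe):
    """Prepend the condition's memory block; the probe is never modified."""
    memory = build_memory_block(condition, bundle_facts)
    prompt = f"{memory}\n\n{user_probe}" if memory else user_probe
    return {"condition": condition, "user_probe": user_probe, "prompt": prompt}


facts = ["Role: backend maintainer", "Language: Python"]  # illustrative facts
probe = "Which test framework does this project use?"     # illustrative probe
specs = [build_prompt(c, facts, probe) for c in CONDITIONS]

print(len({s["prompt"] for s in specs}))      # number of distinct prompts
print(len({s["user_probe"] for s in specs}))  # number of distinct probes
```

Holding the probe and generation config fixed means any difference between conditions can be attributed to the memory block alone.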

Exact dispatch — long pilot (1,800 outputs, plan-only first)

```bash
gh workflow run benchmark-v41-pilot-testb-bundles.yml \
  -f bundle_index=0 \
  -f sessions_per_bundle=150 \
  -f concurrency=2 \
  -f seed=4242 \
  -f provider=gemini \
  -f execute=false
```

To actually call Gemini after human review, dispatch again with `execute=true`. To run the full design, repeat with `bundle_index` set to 1, 2, 3, and 4 in turn, reviewing each wave's audit report before dispatching the next.
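
The five-wave sequence can be written out ahead of time for review. The sketch below only builds and prints the `gh workflow run` command for each wave and does not execute anything. It assumes every wave reuses the same `sessions_per_bundle`, `concurrency`, `seed`, and `provider` values as the command above, which this PR does not state explicitly; `dispatch_command` is a hypothetical helper.

```python
import shlex

WORKFLOW = "benchmark-v41-pilot-testb-bundles.yml"


def dispatch_command(bundle_index, execute=False):
    """Build (but do not run) the dispatch command for one wave."""
    fields = {
        "bundle_index": bundle_index,
        "sessions_per_bundle": 150,
        "concurrency": 2,
        "seed": 4242,  # assumption: same seed for every wave
        "provider": "gemini",
        "execute": str(execute).lower(),
    }
    cmd = ["gh", "workflow", "run", WORKFLOW]
    for key, value in fields.items():
        cmd += ["-f", f"{key}={value}"]
    return cmd


# Print the plan-only command for each wave; a human reviews the audit
# report from wave N before wave N+1 is dispatched.
for index in range(5):
    print(shlex.join(dispatch_command(index)))
```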

Test plan

- `python3 -m pytest benchmarks/v4.1/tests/`: 91 passed (16 new + 75 existing)
- Bundle generator smoke test: 5 × 150 = 750 sessions, full design = 9000 outputs, long pilot = 1800 outputs
- All 12 conditions produce distinct prompts; user probe is byte-identical across conditions
- Runner refuses `--full-design`, `--bundles > 1`, `--concurrency > 2`, `--sessions-per-bundle > 150`
- Plan-only path emits `expected_outputs = 1800` without provider call
- Audit passes on mock-provider output; fails on injected forbidden claim phrases
- Manual workflow dispatch with `execute=false` (intentionally not run by this PR)
- Real Gemini long pilot (intentionally not run by this PR)
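
The audit pass/fail behaviour exercised in the test plan might look roughly like this. It is a simplified sketch covering three of the hard checks (condition balance, missing timestamps, forbidden claim phrases). The record fields `condition`, `output`, and `timestamp`, the function name, and the example phrase are all assumptions; the auditor's actual phrase list and schema are not shown in this PR.

```python
from collections import Counter


def audit_records(records, expected_conditions, forbidden_phrases):
    """Return a list of hard-check failures; an empty list means PASS."""
    failures = []

    # Condition balance: every condition present, all with equal counts.
    counts = Counter(r["condition"] for r in records)
    missing = sorted(set(expected_conditions) - set(counts))
    if missing:
        failures.append(f"missing conditions: {missing}")
    if len(set(counts.values())) > 1:
        failures.append(f"unbalanced conditions: {dict(counts)}")

    for i, record in enumerate(records):
        if not record.get("timestamp"):
            failures.append(f"record {i}: missing timestamp")
        text = record.get("output", "").lower()
        for phrase in forbidden_phrases:
            if phrase.lower() in text:
                failures.append(f"record {i}: forbidden phrase {phrase!r}")
    return failures


conditions = ["no_memory", "prompt_history"]
phrases = ["guaranteed improvement"]  # placeholder phrase, not the real list
clean = [
    {"condition": c, "output": "ok", "timestamp": "2026-05-29T05:28:59Z"}
    for c in conditions
]
tainted = [dict(clean[0], output="This is a guaranteed improvement."), clean[1]]

print("PASS" if not audit_records(clean, conditions, phrases) else "FAIL")
print("PASS" if not audit_records(tainted, conditions, phrases) else "FAIL")
```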

🤖 Generated with Claude Code

Implements the final approved Test B benchmark for v4.1: 5 representative project bundles x 150 sessions x 12 conditions = 9,000 outputs full design, with a 1,800-output long pilot per bundle.

- fixtures/bundles.py: deterministic generator for 5 bundles, 10 phases of 15 sessions each, role/language/contradiction anchors per fact, JSONL + bundle_manifest.json with SHA-256 per file.
- prompts/test_b_bundles.py: 12 condition builders. Same user probe and generation config across conditions; only the prepended memory block differs.
- runner/executor_b_bundles.py: bundle pilot executor. Reuses retry/backoff/batching/JSONL primitives so the mock provider drives tests with no network.
- runner/runner.py: new pilot-test-b-bundles subcommand. Hard caps: bundles<=1, concurrency<=2, sessions_per_bundle<=150. --full-design is intentionally refused; the full 5-bundle design is launched as five separate waves.
- runner/audit_b_bundles.py: robust auditor. Hard checks for condition balance, bundle/phase/session/role coverage, hash completeness, secret scan, forbidden claim phrases, and missing timestamps. Soft per-condition cost curves and session-depth token growth bins.
- .github/workflows/benchmark-v41-pilot-testb-bundles.yml: manual-only workflow. Provider locked to gemini. Validates inputs, hard-caps bundle_index/concurrency/sessions/retry/backoff/sleep before secret access. execute=false by default.
- tests/test_pilot_test_b_bundles.py: 16 tests covering full=9000 and pilot=1800 spec counts, 5 bundles, 150 sessions, 12 conditions, all phases, prompt determinism, runner caps including --full-design refusal, plan-only and execute paths with mock provider, audit pass, and audit failure on forbidden claim phrases. Mock provider only; no network calls.
- README.md and BENCHMARK_PROTOCOL.md updated with the final design, scientific rationale, cost/throughput caution, and the exact gh workflow run dispatch command for the long pilot.

No publish / no tag / no release / no Zenodo / no npm / no PyPI. No real LLM calls are made by tests or by the runner under default flags.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Davincc77 merged commit 9448382 into main May 29, 2026
3 checks passed

Davincc77 deleted the feat/benchmark-v41-test-b-bundles branch May 29, 2026 05:34

Davincc77 merged 1 commit into main from feat/benchmark-v41-test-b-bundles
