Tracking issue for the deferred second pass of the perf benchmark suite.
Where to see existing results
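The harness and first-pass data live on the benchmark-harness branch:
- bench/results/summary.md (aggregated tables + deltas)
- bench/results/runs.jsonl (50 runs total, one row per (stack, throttle, simCount, rep))
- bench/run-bench.mjs
- bench/README.md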
Branch commits:
d388bed bench: fix scheduled-rerun.sh PATH for systemd-run --user
3933c62 bench: add PR 31 stack, apples-to-apples comparison, scheduled rerun
0e954fa bench: measure PR 1 head, fix workspace-default-hidden bug
d287312 bench: add A/B/C perf harness for visualizer rendering
Headlines from the first pass (5 reps/cell)
Apples-to-apples (1 session, both stacks render 1 canvas):
| | 1× throttle | 4× throttle |
|---|---|---|
| Baseline (59ccf4e) | 45 FPS, 27 ms p95, 0 long tasks | 10 FPS, 116 ms p95, 933 long tasks / 87 s blocking |
| PR 1 (f5d9976) | 197 FPS, 7 ms p95 | 39 FPS, 42 ms p95, 1 long task |
| Δ | FPS +334%, p95 −74% | FPS +274%, p95 −64%, long tasks −99.9% |
Workload-matched (3 sessions; baseline shows 1 canvas, PR 1 / PR 31 show 3):
| | 1× throttle | 4× throttle |
|---|---|---|
| Baseline | 45.6 FPS / 26 ms p95 / 0 long tasks | 9.0 FPS / 133 ms p95 / 816 long tasks |
| PR 1 (f5d9976) | 54.4 FPS / 34 ms p95 | 9.4 FPS / 139 ms p95 |
| PR 31 (df3bd94) | 65.4 FPS / 32 ms p95 | 11.7 FPS / 115 ms p95 |
| PR 31 vs baseline | FPS +43%, p95 +24% | FPS +30%, p95 −13.5% |
| PR 31 vs PR 1 | FPS +20%, p95 −6.6% | FPS +24%, p95 −16.8% |
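For anyone re-deriving the delta rows, here is a minimal sketch of the percent-change calculation, using the workload-matched 1× throttle cells for baseline and PR 31. The helper name `pctDelta` is hypothetical and not taken from the harness. Recomputing from the rounded cells can land a point or so away from the published deltas, presumably because those were derived from unrounded means.

```javascript
// Percent change of `candidate` relative to `baseline`.
// `pctDelta` is an illustrative helper, not part of run-bench.mjs.
function pctDelta(baseline, candidate) {
  return ((candidate - baseline) / baseline) * 100;
}

// Workload-matched, 1× throttle: baseline vs PR 31 (rounded table values).
const fpsDelta = pctDelta(45.6, 65.4);
const p95Delta = pctDelta(26, 32);

console.log(`FPS ${fpsDelta.toFixed(1)}%, p95 ${p95Delta.toFixed(1)}%`);
// prints "FPS 43.4%, p95 23.1%"
```

A positive FPS delta is an improvement, while a positive p95 delta is a regression in frame time, so the two signs should be read in opposite directions.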
Plan for the rerun (this issue)
Re-execute the full matrix on the same machine to:
- Confirm first-pass numbers are reproducible
- Establish run-to-run variance for any future regression-tracking
- Catch any drift introduced by post-PR-31 commits
Cells (5 reps × 2 throttle levels each; 10 runs × 5 cells × ~2 min ≈ 100 to 110 min total):
| stack | sim-count | rationale |
|---|---|---|
| A-base (59ccf4e) | 1 | apples-to-apples baseline |
| C-pr1 (f5d9976) | 1 | apples-to-apples PR 1 |
| A-base (59ccf4e) | 3 | workload-matched baseline (UI still shows 1 canvas) |
| C-pr1 (f5d9976) | 3 | workload-matched PR 1 |
| D-pr31 (df3bd94) | 3 | workload-matched PR 31 |
Worktrees already exist locally at ../baseline-tree, ../pr1-tree, ../pr31-tree. Each has its app/dist/app.js built. The script bench/scheduled-rerun.sh walks the matrix, skipping cells whose app build is missing, and appends to runs.jsonl.
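The matrix walk described above can be sketched roughly as follows. This is not the contents of scheduled-rerun.sh. The stack-to-worktree mapping is inferred from the worktree names, and `planRuns` and the `hasBuild` predicate are hypothetical names introduced for illustration.

```javascript
// Cells from the plan. The `tree` mapping is an assumption based on the
// worktree names; the real script may resolve paths differently.
const CELLS = [
  { stack: "A-base", simCount: 1, tree: "../baseline-tree" },
  { stack: "C-pr1", simCount: 1, tree: "../pr1-tree" },
  { stack: "A-base", simCount: 3, tree: "../baseline-tree" },
  { stack: "C-pr1", simCount: 3, tree: "../pr1-tree" },
  { stack: "D-pr31", simCount: 3, tree: "../pr31-tree" },
];
const THROTTLES = [1, 4];
const REPS = 5;

// Expand cells × throttle × rep, skipping any cell whose app build is missing.
// `hasBuild` is a caller-supplied predicate (hypothetical).
function planRuns(cells, hasBuild) {
  const runs = [];
  for (const cell of cells) {
    if (!hasBuild(`${cell.tree}/app/dist/app.js`)) continue;
    for (const throttle of THROTTLES) {
      for (let rep = 1; rep <= REPS; rep++) {
        runs.push({ stack: cell.stack, simCount: cell.simCount, throttle, rep });
      }
    }
  }
  return runs;
}

// All builds present: the full matrix.
const allRuns = planRuns(CELLS, () => true);
console.log(allRuns.length); // 50

// PR 31 build missing: both throttle levels of that cell are skipped.
const withoutPr31 = planRuns(CELLS, (p) => !p.startsWith("../pr31-tree"));
console.log(withoutPr31.length); // 40
```

In a real Node script the predicate would likely be `fs.existsSync` from `node:fs`; it is injected here so the sketch runs without touching the filesystem.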
Run command (when ready):
cd /home/claudette/multi-agent-flow/instance_2_source/source/bench
bash scheduled-rerun.sh
# Or in the background, with logs:
nohup bash scheduled-rerun.sh > results/rerun-$(date -u +%Y%m%dT%H%M%SZ).log 2>&1 &
Known caveats
- First-pass scheduled-rerun.sh had a PATH bug: under systemd-run --user, ~/.local/bin isn't on PATH and pnpm (which the sim spawns) wasn't found. The 16:19 UTC scheduled rerun fired but every cell failed before completing a rep. Fixed in d388bed.
- The simulator only exists on PR-1+ checkouts. It produces JSONL the relay consumes — that input is identical across stacks, so it doesn't bias results.
- Performance numbers are machine-specific. Anything compared against the first pass needs to run on the same VM.
- Headless Chromium FPS cap is removed via --disable-frame-rate-limit --disable-gpu-vsync. Heap precision comes from --enable-precise-memory-info. Both are essential for the numbers to be meaningful.
Acceptance
- New rows appended to runs.jsonl for each of the 5 cells (10 rows × 5 = 50 new rows)
- Updated summary.md shows tighter or unchanged variance vs first pass
- No cell shows >10% drift from the first-pass mean (otherwise: investigate before drawing conclusions)
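The variance and drift criteria above could be checked with something like the sketch below. It assumes each runs.jsonl row carries `stack`, `throttle`, and `simCount` (per the row key noted for that file) plus a numeric metric field; the field name `fps` and every helper name here are hypothetical. How first-pass and rerun rows are told apart in the appended file is left to the caller. The sample rows are synthetic and only exercise the logic.

```javascript
// Parse JSONL text into an array of row objects, ignoring blank lines.
function parseJsonl(text) {
  return text
    .split("\n")
    .filter((line) => line.trim() !== "")
    .map((line) => JSON.parse(line));
}

function mean(xs) {
  return xs.reduce((a, b) => a + b, 0) / xs.length;
}

// Sample standard deviation (n - 1 denominator); 0 for fewer than two values.
function stddev(xs) {
  if (xs.length < 2) return 0;
  const m = mean(xs);
  return Math.sqrt(xs.reduce((a, x) => a + (x - m) ** 2, 0) / (xs.length - 1));
}

// Collect one metric per (stack, throttle, simCount) group.
function groupMetric(rows, metric) {
  const groups = new Map();
  for (const r of rows) {
    const key = `${r.stack}|${r.throttle}|${r.simCount}`;
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key).push(r[metric]);
  }
  return groups;
}

// Compare rerun means to first-pass means; `limit` is a fraction (0.10 = 10%).
function checkDrift(firstRows, rerunRows, metric, limit = 0.10) {
  const first = groupMetric(firstRows, metric);
  const rerun = groupMetric(rerunRows, metric);
  const report = [];
  for (const [key, values] of rerun) {
    if (!first.has(key)) continue; // no first-pass data to compare against
    const base = mean(first.get(key));
    const drift = (mean(values) - base) / base;
    report.push({
      key,
      drift,
      sdFirst: stddev(first.get(key)),
      sdRerun: stddev(values),
      ok: Math.abs(drift) <= limit,
    });
  }
  return report;
}

// Synthetic rows, not real benchmark data.
const firstPass = parseJsonl(
  '{"stack":"X","throttle":1,"simCount":1,"rep":1,"fps":100}\n' +
  '{"stack":"X","throttle":1,"simCount":1,"rep":2,"fps":110}\n'
);
const rerun = parseJsonl(
  '{"stack":"X","throttle":1,"simCount":1,"rep":1,"fps":120}\n' +
  '{"stack":"X","throttle":1,"simCount":1,"rep":2,"fps":132}\n'
);
const report = checkDrift(firstPass, rerun, "fps");
console.log(report[0].ok); // false: the mean moved from 105 to 126, a 20% drift
```

Any group with `ok: false` is one to investigate before drawing conclusions, and comparing `sdRerun` against `sdFirst` gives a rough read on whether variance tightened.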
🤖 Generated with Claude Code