
Re-run benchmark matrix (track 2nd-pass measurements) #37

@DFearing

Description

Tracking issue for the deferred second pass of the perf benchmark suite.

Where to see existing results

The harness and first-pass data live on the benchmark-harness branch:

Branch commits:

d388bed bench: fix scheduled-rerun.sh PATH for systemd-run --user
3933c62 bench: add PR 31 stack, apples-to-apples comparison, scheduled rerun
0e954fa bench: measure PR 1 head, fix workspace-default-hidden bug
d287312 bench: add A/B/C perf harness for visualizer rendering

Headlines from the first pass (5 reps/cell)

Apples-to-apples (1 session, both stacks render 1 canvas):

|  | 1× throttle | 4× throttle |
| --- | --- | --- |
| Baseline (59ccf4e) | 45 FPS, 27 ms p95, 0 long tasks | 10 FPS, 116 ms p95, 933 long tasks / 87 s blocking |
| PR 1 (f5d9976) | 197 FPS, 7 ms p95 | 39 FPS, 42 ms p95, 1 long task |
| Δ | FPS +334%, p95 −74% | FPS +274%, p95 −64%, long tasks −99.9% |

Workload-matched (3 sessions; baseline shows 1 canvas, PR 1 / PR 31 show 3):

|  | 1× throttle | 4× throttle |
| --- | --- | --- |
| Baseline | 45.6 FPS / 26 ms p95 / 0 long tasks | 9.0 FPS / 133 ms p95 / 816 long tasks |
| PR 1 (f5d9976) | 54.4 FPS / 34 ms p95 | 9.4 FPS / 139 ms p95 |
| PR 31 (df3bd94) | 65.4 FPS / 32 ms p95 | 11.7 FPS / 115 ms p95 |
| PR 31 vs baseline | FPS +43%, p95 +24% | FPS +30%, p95 −13.5% |
| PR 31 vs PR 1 | FPS +20%, p95 −6.6% | FPS +24%, p95 −16.8% |

Plan for the rerun (this issue)

Re-execute the full matrix on the same machine to:

  • Confirm first-pass numbers are reproducible
  • Establish run-to-run variance for any future regression-tracking
  • Catch any drift introduced by post-PR-31 commits
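Once rerun rows land in runs.jsonl, per-cell run-to-run variance can be summarized with a short awk pass. A minimal sketch, assuming (hypothetically) that each row carries `"cell"` and `"fps"` keys; the real runs.jsonl schema should be checked first:

```shell
# Per-cell mean and coefficient of variation (stddev/mean) of FPS.
# Field names "cell" and "fps" are illustrative, not confirmed by the
# harness; the demo rows below stand in for real measurements.
cat > /tmp/variance-demo.jsonl <<'EOF'
{"cell":"A-base-1","fps":45.0}
{"cell":"A-base-1","fps":46.2}
{"cell":"A-base-1","fps":44.8}
EOF

awk -F'"fps":' '
{
  split($1, a, "\"cell\":\"") ; split(a[2], c, "\"") ; cell = c[1]
  fps = $2 + 0                      # numeric prefix of what follows "fps":
  n[cell]++ ; s[cell] += fps ; ss[cell] += fps * fps
}
END {
  for (k in n) {
    mean = s[k] / n[k]
    var  = ss[k] / n[k] - mean * mean
    printf "%s mean=%.1f cv=%.1f%%\n", k, mean, sqrt(var) / mean * 100
  }
}' /tmp/variance-demo.jsonl | tee /tmp/variance-summary.txt
```

A coefficient of variation per cell gives a concrete threshold to quote when the acceptance criteria talk about "tighter or unchanged variance".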

Cells (5 reps × 2 throttle levels each; total 10 reps × 5 cells × ~2 min ≈ 100 min):

| stack | sim-count | rationale |
| --- | --- | --- |
| A-base (59ccf4e) | 1 | apples-to-apples baseline |
| C-pr1 (f5d9976) | 1 | apples-to-apples PR 1 |
| A-base (59ccf4e) | 3 | workload-matched baseline (UI still shows 1 canvas) |
| C-pr1 (f5d9976) | 3 | workload-matched PR 1 |
| D-pr31 (df3bd94) | 3 | workload-matched PR 31 |

Worktrees already exist locally at ../baseline-tree, ../pr1-tree, ../pr31-tree. Each has its app/dist/app.js built. The script bench/scheduled-rerun.sh walks the matrix, skipping cells whose app build is missing, and appends to runs.jsonl.
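The walk/skip/append logic is roughly the following (a simplified sketch, not the actual scheduled-rerun.sh; demo directories under /tmp stand in for the real worktrees, and the per-rep measurement is a placeholder):

```shell
# Sketch of the matrix walk: skip cells whose app build is missing,
# append one JSONL row per rep. Demo trees replace ../baseline-tree etc.
set -euo pipefail

base=/tmp/bench-demo
rm -rf "$base" && mkdir -p "$base"/{baseline-tree,pr1-tree,pr31-tree}/app/dist
# Give two trees a build; leave pr31-tree unbuilt to exercise the skip path.
touch "$base"/baseline-tree/app/dist/app.js "$base"/pr1-tree/app/dist/app.js

RESULTS="$base/runs.jsonl"
# "tree:sim_count" pairs, mirroring the five cells in the table above.
CELLS=("baseline-tree:1" "pr1-tree:1" "baseline-tree:3" "pr1-tree:3" "pr31-tree:3")

for cell in "${CELLS[@]}"; do
  tree="$base/${cell%%:*}"
  sims=${cell##*:}
  if [ ! -f "$tree/app/dist/app.js" ]; then   # skip unbuilt checkouts
    echo "skip: ${cell%%:*} has no app build" >&2
    continue
  fi
  for rep in 1 2 3 4 5; do
    # Placeholder row; the real harness launches Chromium and records
    # FPS / p95 / long-task fields here.
    printf '{"tree":"%s","sims":%s,"rep":%s}\n' "${cell%%:*}" "$sims" "$rep" >> "$RESULTS"
  done
done
wc -l < "$RESULTS"   # 4 runnable cells × 5 reps = 20 rows
```

The skip-on-missing-build behavior matters here because the unbuilt tree fails silently to stderr rather than aborting the whole matrix.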

Run command (when ready):

cd /home/claudette/multi-agent-flow/instance_2_source/source/bench
bash scheduled-rerun.sh
# Or in the background, with logs:
nohup bash scheduled-rerun.sh > results/rerun-$(date -u +%Y%m%dT%H%M%SZ).log 2>&1 &

Known caveats

  • First-pass scheduled-rerun.sh had a PATH bug — under systemd-run --user, ~/.local/bin isn't on PATH and pnpm (which the sim spawns) wasn't found. The 16:19 UTC scheduled rerun fired but every cell failed before completing a rep. Fixed in d388bed.
  • The simulator only exists on PR-1+ checkouts. It produces JSONL the relay consumes — that input is identical across stacks, so it doesn't bias results.
  • Performance numbers are machine-specific. Anything compared against the first pass needs to run on the same VM.
  • Headless Chromium FPS cap is removed via --disable-frame-rate-limit --disable-gpu-vsync. Heap precision comes from --enable-precise-memory-info. Both are essential for the numbers to be meaningful.
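The PATH caveat can be guarded against with a defensive prepend at the top of the script. This is an illustrative guard only; the actual fix in d388bed may be different:

```shell
# Under systemd-run --user, ~/.local/bin (where pnpm lives on this box)
# is not on PATH by default. Prepend it once, idempotently.
# Illustrative; the real fix in d388bed may differ.
case ":$PATH:" in
  *":$HOME/.local/bin:"*) ;;                  # already present, do nothing
  *) export PATH="$HOME/.local/bin:$PATH" ;;  # prepend once
esac

command -v pnpm >/dev/null || echo "warning: pnpm still not found" >&2
```

The `case` guard keeps the entry from being prepended repeatedly if the script is sourced more than once in the same unit.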

Acceptance

  • New rows appended to runs.jsonl for each of the 5 cells (10 rows × 5 = 50 new rows)
  • Updated summary.md shows tighter or unchanged variance vs first pass
  • No cell shows >10% drift from the first-pass mean (otherwise: investigate before drawing conclusions)
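The >10% drift gate can be expressed as a small helper once per-cell mean FPS values are extracted from summary.md. A sketch, where the first-pass means come from the tables above but the rerun values below are invented for illustration:

```shell
# Flag any cell whose rerun mean drifts more than 10% from the
# first-pass mean. Rerun numbers below are made-up examples.
check_drift() {
  awk -v first="$1" -v rerun="$2" -v cell="$3" 'BEGIN {
    drift = (rerun - first) / first * 100
    abs = (drift < 0) ? -drift : drift
    printf "%s: %+.1f%% %s\n", cell, drift, (abs > 10 ? "INVESTIGATE" : "ok")
  }'
}

check_drift 45.6 44.9 "A-base x3"   # → A-base x3: -1.5% ok
check_drift 65.4 73.2 "D-pr31 x3"   # → D-pr31 x3: +11.9% INVESTIGATE
```

Anything flagged INVESTIGATE should block conclusions until the cause (thermal state, background load, code drift) is identified.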

🤖 Generated with Claude Code
