
Re-run benchmark matrix (track 2nd-pass measurements) #37

@DFearing

Description

Tracking issue for the deferred second pass of the perf benchmark suite.

Where to see existing results

The harness and first-pass data live on the benchmark-harness branch:

Branch commits:

d388bed bench: fix scheduled-rerun.sh PATH for systemd-run --user
3933c62 bench: add PR 31 stack, apples-to-apples comparison, scheduled rerun
0e954fa bench: measure PR 1 head, fix workspace-default-hidden bug
d287312 bench: add A/B/C perf harness for visualizer rendering

Headlines from the first pass (5 reps/cell)

Apples-to-apples (1 session, both stacks render 1 canvas):

|  | 1× throttle | 4× throttle |
| --- | --- | --- |
| Baseline (59ccf4e) | 45 FPS, 27 ms p95, 0 long tasks | 10 FPS, 116 ms p95, 933 long tasks / 87 s blocking |
| PR 1 (f5d9976) | 197 FPS, 7 ms p95 | 39 FPS, 42 ms p95, 1 long task |
| Δ | FPS +334%, p95 −74% | FPS +274%, p95 −64%, long tasks −99.9% |

Workload-matched (3 sessions; baseline shows 1 canvas, PR 1 / PR 31 show 3):

|  | 1× throttle | 4× throttle |
| --- | --- | --- |
| Baseline | 45.6 FPS / 26 ms p95 / 0 long tasks | 9.0 FPS / 133 ms p95 / 816 long tasks |
| PR 1 (f5d9976) | 54.4 FPS / 34 ms p95 | 9.4 FPS / 139 ms p95 |
| PR 31 (df3bd94) | 65.4 FPS / 32 ms p95 | 11.7 FPS / 115 ms p95 |
| PR 31 vs baseline | FPS +43%, p95 +24% | FPS +30%, p95 −13.5% |
| PR 31 vs PR 1 | FPS +20%, p95 −6.6% | FPS +24%, p95 −16.8% |

Plan for the rerun (this issue)

Re-execute the full matrix on the same machine to:

  • Confirm first-pass numbers are reproducible
  • Establish run-to-run variance for any future regression-tracking
  • Catch any drift introduced by post-PR-31 commits
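Once rerun rows land in runs.jsonl, per-cell run-to-run variance can be summarized with a short awk pass. A minimal sketch, assuming (hypothetically) that each row carries `"cell"` and `"fps"` keys; the real runs.jsonl schema should be checked first:

```shell
# Per-cell mean and coefficient of variation (stddev/mean) of FPS.
# Field names "cell" and "fps" are illustrative, not confirmed by the
# harness; the demo rows below stand in for real measurements.
cat > /tmp/variance-demo.jsonl <<'EOF'
{"cell":"A-base-1","fps":45.0}
{"cell":"A-base-1","fps":46.2}
{"cell":"A-base-1","fps":44.8}
EOF

awk -F'"fps":' '
{
  split($1, a, "\"cell\":\"") ; split(a[2], c, "\"") ; cell = c[1]
  fps = $2 + 0                      # numeric prefix of what follows "fps":
  n[cell]++ ; s[cell] += fps ; ss[cell] += fps * fps
}
END {
  for (k in n) {
    mean = s[k] / n[k]
    var  = ss[k] / n[k] - mean * mean
    printf "%s mean=%.1f cv=%.1f%%\n", k, mean, sqrt(var) / mean * 100
  }
}' /tmp/variance-demo.jsonl | tee /tmp/variance-summary.txt
```

A coefficient of variation per cell gives a concrete threshold to quote when the acceptance criteria talk about "tighter or unchanged variance".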

Cells (5 reps × 2 throttle levels each; total 10 reps × 5 cells × ~2 min ≈ 100 min):

| stack | sim-count | rationale |
| --- | --- | --- |
| A-base (59ccf4e) | 1 | apples-to-apples baseline |
| C-pr1 (f5d9976) | 1 | apples-to-apples PR 1 |
| A-base (59ccf4e) | 3 | workload-matched baseline (UI still shows 1 canvas) |
| C-pr1 (f5d9976) | 3 | workload-matched PR 1 |
| D-pr31 (df3bd94) | 3 | workload-matched PR 31 |

Worktrees already exist locally at ../baseline-tree, ../pr1-tree, ../pr31-tree. Each has its app/dist/app.js built. The script bench/scheduled-rerun.sh walks the matrix, skipping cells whose app build is missing, and appends to runs.jsonl.
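The walk/skip/append logic is roughly the following (a simplified sketch, not the actual scheduled-rerun.sh; demo directories under /tmp stand in for the real worktrees, and the per-rep measurement is a placeholder):

```shell
# Sketch of the matrix walk: skip cells whose app build is missing,
# append one JSONL row per rep. Demo trees replace ../baseline-tree etc.
set -euo pipefail

base=/tmp/bench-demo
rm -rf "$base" && mkdir -p "$base"/{baseline-tree,pr1-tree,pr31-tree}/app/dist
# Give two trees a build; leave pr31-tree unbuilt to exercise the skip path.
touch "$base"/baseline-tree/app/dist/app.js "$base"/pr1-tree/app/dist/app.js

RESULTS="$base/runs.jsonl"
# "tree:sim_count" pairs, mirroring the five cells in the table above.
CELLS=("baseline-tree:1" "pr1-tree:1" "baseline-tree:3" "pr1-tree:3" "pr31-tree:3")

for cell in "${CELLS[@]}"; do
  tree="$base/${cell%%:*}"
  sims=${cell##*:}
  if [ ! -f "$tree/app/dist/app.js" ]; then   # skip unbuilt checkouts
    echo "skip: ${cell%%:*} has no app build" >&2
    continue
  fi
  for rep in 1 2 3 4 5; do
    # Placeholder row; the real harness launches Chromium and records
    # FPS / p95 / long-task fields here.
    printf '{"tree":"%s","sims":%s,"rep":%s}\n' "${cell%%:*}" "$sims" "$rep" >> "$RESULTS"
  done
done
wc -l < "$RESULTS"   # 4 runnable cells × 5 reps = 20 rows
```

The skip-on-missing-build behavior matters here because the unbuilt tree fails silently to stderr rather than aborting the whole matrix.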

Run command (when ready):

cd /home/claudette/multi-agent-flow/instance_2_source/source/bench
bash scheduled-rerun.sh
# Or in the background, with logs:
nohup bash scheduled-rerun.sh > results/rerun-$(date -u +%Y%m%dT%H%M%SZ).log 2>&1 &

Known caveats

  • First-pass scheduled-rerun.sh had a PATH bug — under systemd-run --user, ~/.local/bin isn't on PATH and pnpm (which the sim spawns) wasn't found. The 16:19 UTC scheduled rerun fired but every cell failed before completing a rep. Fixed in d388bed.
  • The simulator only exists on PR-1+ checkouts. It produces JSONL the relay consumes — that input is identical across stacks, so it doesn't bias results.
  • Performance numbers are machine-specific. Anything compared against the first pass needs to run on the same VM.
  • Headless Chromium FPS cap is removed via --disable-frame-rate-limit --disable-gpu-vsync. Heap precision comes from --enable-precise-memory-info. Both are essential for the numbers to be meaningful.
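The PATH caveat can be guarded against with a defensive prepend at the top of the script. This is an illustrative guard only; the actual fix in d388bed may be different:

```shell
# Under systemd-run --user, ~/.local/bin (where pnpm lives on this box)
# is not on PATH by default. Prepend it once, idempotently.
# Illustrative; the real fix in d388bed may differ.
case ":$PATH:" in
  *":$HOME/.local/bin:"*) ;;                  # already present, do nothing
  *) export PATH="$HOME/.local/bin:$PATH" ;;  # prepend once
esac

command -v pnpm >/dev/null || echo "warning: pnpm still not found" >&2
```

The `case` guard keeps the entry from being prepended repeatedly if the script is sourced more than once in the same unit.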

Acceptance

  • New rows appended to runs.jsonl for each of the 5 cells (10 rows × 5 = 50 new rows)
  • Updated summary.md shows tighter or unchanged variance vs first pass
  • No cell shows >10% drift from the first-pass mean (otherwise: investigate before drawing conclusions)
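The >10% drift gate can be expressed as a small helper once per-cell mean FPS values are extracted from summary.md. A sketch, where the first-pass means come from the tables above but the rerun values below are invented for illustration:

```shell
# Flag any cell whose rerun mean drifts more than 10% from the
# first-pass mean. Rerun numbers below are made-up examples.
check_drift() {
  awk -v first="$1" -v rerun="$2" -v cell="$3" 'BEGIN {
    drift = (rerun - first) / first * 100
    abs = (drift < 0) ? -drift : drift
    printf "%s: %+.1f%% %s\n", cell, drift, (abs > 10 ? "INVESTIGATE" : "ok")
  }'
}

check_drift 45.6 44.9 "A-base x3"   # → A-base x3: -1.5% ok
check_drift 65.4 73.2 "D-pr31 x3"   # → D-pr31 x3: +11.9% INVESTIGATE
```

Anything flagged INVESTIGATE should block conclusions until the cause (thermal state, background load, code drift) is identified.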

🤖 Generated with Claude Code
