
chore(mcp,testing,ci): wire MCP load tests into existing Playwright + CI infra #2129

@msywulak

Description

Problem

eval/load-tests/mcp/ (#2070, PR #2128) is reproducible-by-hand — clone, install k6, lift a bearer from a connected Claude Desktop / Cursor session, run. That's enough to answer "does the curve break at N sessions" once. It is not enough to keep the perf doc fresh: nothing runs the scripts on a cadence, nothing captures results, and the bearer-acquisition step is undocumented enough to be a footgun.

The shorter follow-up framing was "ship a scripts/print-bearer.ts." On closer inspection, the repo already has a working Playwright suite at e2e/browser/: a global-setup.ts that logs in and saves storage state, plus an auth.spec.ts exercising the auth flow. A standalone token printer would ignore that infra — the right move is to extend it.

Proposal

Extend the existing Playwright + CI infrastructure so the MCP load tests run on a cadence against staging (or prod) without manual bearer juggling.

1. Token minting via Playwright

Add a fixture (or extend e2e/browser/global-setup.ts) that, after the existing login flow, drives the OAuth 2.1 loopback flow against the running API to mint a workspace-bound JWT and write it to disk:

e2e/browser/global-setup.ts
  → login as load-test user
  → GET /.well-known/oauth-authorization-server (discovery)
  → POST /api/auth/oauth2/register (DCR with redirect_uri = http://127.0.0.1:<port>/callback)
  → GET /api/auth/oauth2/authorize?... (authorization-code-with-PKCE; Playwright clicks consent)
  → POST /api/auth/oauth2/token (exchange code + verifier; include resource indicator)
  → write JWT + workspaceId to e2e/.load-test-bearer.json (gitignored)

The plumbing already exists in plugins/mcp/src/init/hosted.ts (runHostedAuthFlow) — the test seams (fetchImpl, serveImpl, openBrowserImpl) make it possible to wrap that helper from inside Playwright instead of building a fresh DCR/PKCE round-trip.
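A minimal sketch of the fixture's output side, assuming the seam wrapping works out. `mintBearer` below is a stand-in for the real `runHostedAuthFlow` call (its actual signature lives in plugins/mcp/src/init/hosted.ts, and a real version would drive the consent click via Playwright); the on-disk shape, including the `mintedAt` field, is an assumption, not a decision:

```typescript
import { mkdirSync, writeFileSync } from "node:fs";
import { dirname } from "node:path";

interface BearerFile {
  jwt: string;
  workspaceId: string;
  mintedAt: string; // ISO timestamp — lets k6 warn on stale tokens
}

// Placeholder for wrapping runHostedAuthFlow via its fetchImpl / serveImpl /
// openBrowserImpl seams. The real fixture would run DCR + PKCE here.
async function mintBearer(): Promise<{ jwt: string; workspaceId: string }> {
  return { jwt: "eyJ...", workspaceId: "ws_loadtest" }; // hypothetical values
}

// Mint a bearer and persist it where k6 can read it (path per decision #5).
export async function writeBearerFile(
  path = "e2e/.load-test-bearer.json",
): Promise<BearerFile> {
  const { jwt, workspaceId } = await mintBearer();
  const payload: BearerFile = {
    jwt,
    workspaceId,
    mintedAt: new Date().toISOString(),
  };
  mkdirSync(dirname(path), { recursive: true });
  writeFileSync(path, JSON.stringify(payload, null, 2) + "\n");
  return payload;
}
```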

2. CI workflow

Add .github/workflows/load-test-mcp.yml:

  • Trigger: weekly cron + manual workflow_dispatch (full 5-min stages × 5 sessions × 3 scripts ≈ 75 min wall time — too expensive for per-PR)
  • Steps: install k6 → run Playwright global-setup (or new mint-bearer fixture) → read JWT from disk → run the three k6 scripts pointed at staging → upload summary.json per scenario as workflow artifacts
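The steps above could wire together roughly like this — a hedged config sketch, not the final file: action versions, secret names, the `--grep mint-bearer` selector, and the scenario glob are all assumptions:

```yaml
# Sketch of .github/workflows/load-test-mcp.yml
name: load-test-mcp
on:
  schedule:
    - cron: "0 6 * * 1" # weekly; exact slot TBD
  workflow_dispatch: {}
jobs:
  k6:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: grafana/setup-k6-action@v1
      - name: Mint bearer via Playwright global-setup
        run: npx playwright test --config e2e/browser/playwright.config.ts --grep mint-bearer
        env:
          LOADTEST_EMAIL: ${{ secrets.LOADTEST_EMAIL }}
          LOADTEST_PASSWORD: ${{ secrets.LOADTEST_PASSWORD }}
      - name: Run k6 scenarios
        run: |
          for s in eval/load-tests/mcp/*.js; do
            k6 run "$s" --summary-export "summary-$(basename "$s" .js).json"
          done
      - uses: actions/upload-artifact@v4
        with:
          name: k6-summaries
          path: summary-*.json
```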

3. Result capture — in scope, minimal

Skipping this defeats the point: a workflow that runs and produces artifacts nobody reads is theater. The minimal-but-effective approach:

  • Open a long-lived "MCP load-test results history" tracking issue
  • Each workflow run posts a comment with: timestamp, git SHA, target env, per-scenario P50/P95/P99 + throughput, and a link to the workflow run
  • The comment formatter optionally bolds cells that regressed >25% vs the previous run, so a notification is informative on its own
  • The perf doc (apps/docs/content/docs/architecture/mcp-performance.mdx) references the tracking issue as the live source — TBD cells become "see results history at #XXXX"

Implementation: ~30 lines of script that parses summary.json and shells out to gh issue comment. Same PR.
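That script could look like the following — a sketch under stated assumptions, not the implementation: the ScenarioStats shape, the threshold constant, and how stats get extracted from summary.json are all assumptions; only the >25% bolding rule and the `gh issue comment` invocation come from the proposal above.

```typescript
import { execFileSync } from "node:child_process";

// Per-scenario stats, assumed to be pulled out of k6's summary.json upstream.
interface ScenarioStats {
  scenario: string;
  p50: number; // latency, ms
  p95: number;
  p99: number;
  rps: number; // throughput
}

const REGRESSION_THRESHOLD = 0.25;

// Bold a latency cell when it is >25% slower than the previous run.
function cell(curr: number, prev: number | undefined): string {
  const text = `${curr.toFixed(1)}ms`;
  return prev !== undefined && curr > prev * (1 + REGRESSION_THRESHOLD)
    ? `**${text}**`
    : text;
}

// Build the markdown comment: timestamp, SHA, env, run link, then the table.
export function formatComment(
  runs: ScenarioStats[],
  previous: Map<string, ScenarioStats>,
  meta: { sha: string; env: string; runUrl: string },
): string {
  const rows = runs.map((r) => {
    const prev = previous.get(r.scenario);
    return `| ${r.scenario} | ${cell(r.p50, prev?.p50)} | ${cell(r.p95, prev?.p95)} | ${cell(r.p99, prev?.p99)} | ${r.rps.toFixed(1)} req/s |`;
  });
  return [
    `**${new Date().toISOString()}** · \`${meta.sha}\` · ${meta.env} · [run](${meta.runUrl})`,
    "",
    "| scenario | P50 | P95 | P99 | throughput |",
    "| --- | --- | --- | --- | --- |",
    ...rows,
  ].join("\n");
}

// Append the comment to the long-lived tracking issue via the gh CLI.
export function postComment(issue: number, body: string): void {
  execFileSync("gh", ["issue", "comment", String(issue), "--body", body]);
}
```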

Future expansion (separate follow-up — out of scope here): push to OpenStatus or a Grafana board if the tracking-issue history grows unwieldy. Defer until cadence is live and the volume justifies it.

Test-user provisioning — what we need

This issue cannot land without an answer to:

  • Target environment: ✅ prod (api.useatlas.dev) — Atlas is pre-launch, no staging environment exists yet. Decided 2026-05-06.
  • Test user: dedicated email + password, stored in GH Actions secrets. Either provisioned manually + invited to a load-test workspace, or scripted via scripts/provision-loadtest-user.ts if there's an admin path that supports it.
  • Workspace fixture: load-test workspace must have the NovaMart demo dataset attached so lib.js's fixture pool (total_gmv, customers, orders) resolves. Either pre-seed manually or include a seed step in the workflow.
  • Cadence: weekly cron + manual trigger? Or manual-trigger-only initially while we build confidence?

Decisions log

| # | Question | Answer | Decided |
| --- | --- | --- | --- |
| 1 | Target environment | prod — pre-launch, no staging exists yet | 2026-05-06 |
| 2 | Test-user provisioning | scripted: scripts/provision-loadtest-user.ts runs once locally against prod admin path, creates user + workspace, outputs creds for GH secrets | 2026-05-06 |
| 3 | Workspace / dataset | open | — |
| 4 | Cadence | open | — |
| 5 | Bearer fixture path | open (assume e2e/.load-test-bearer.json if not specified) | — |
| 6 | Result-capture loop | in scope, minimal: append-comment per run to a long-lived tracking issue; perf doc links to it as live source. External dashboards out of scope. | 2026-05-06 |
| 7 | Failure notification | open | — |

(Updated as decisions land. The Decisions log is the source of truth — discussions in PR threads or chat should land here before implementation starts.)

Acceptance criteria

  • Playwright fixture (or global-setup.ts extension) mints a workspace-bound JWT against the configured API and writes it where k6 can read it
  • .github/workflows/load-test-mcp.yml runs the three k6 scripts on a defined cadence + on-demand
  • Workflow artifacts include summary.json per scenario
  • eval/load-tests/mcp/README.md updated to point at the workflow as the primary way to run; manual instructions stay as the secondary path
  • Test user creds, workspace ID, base URL all driven by GH Actions secrets — no plaintext in the repo
  • Long-lived "MCP load-test results history" tracking issue opened
  • Workflow posts a formatted-table comment to the tracking issue per run, with regression flagging vs prior run
  • apps/docs/content/docs/architecture/mcp-performance.mdx updated — TBD cells → "see results at #XXXX"

Out of scope

  • External dashboards (OpenStatus / Grafana) for results — defer until tracking-issue volume justifies it
  • Migrating the existing perf doc TBD cells to specific numbers in the doc — the doc points at the tracking issue instead; per-run numbers live there

    Labels

    area: mcp (MCP server) · area: testing (test infrastructure, utilities, coverage) · chore (CI, deps, maintenance, infra)
