Problem
eval/load-tests/mcp/ (#2070, PR #2128) is reproducible-by-hand — clone, install k6, lift a bearer from a connected Claude Desktop / Cursor session, run. That's enough to answer "does the curve break at N sessions" once. It is not enough to keep the perf doc fresh: nothing runs the scripts on a cadence, nothing captures results, and the bearer-acquisition step is undocumented enough to be a footgun.
The shorter follow-up framing was "ship a scripts/print-bearer.ts." On a closer look at the repo: a working Playwright suite already lives at e2e/browser/ with a global-setup.ts that does login + saves storage state, plus an auth.spec.ts exercising the auth flow. Building a standalone token printer ignores that infra — the right move is to extend it.
Proposal
Extend the existing Playwright + CI infrastructure so the MCP load tests run on a cadence against the target environment (prod for now, since no staging environment exists yet; see the Decisions log) without manual bearer juggling.
1. Token minting via Playwright
Add a fixture (or extend e2e/browser/global-setup.ts) that, after the existing login flow, drives the OAuth 2.1 loopback flow against the running API to mint a workspace-bound JWT and write it to disk:
e2e/browser/global-setup.ts
→ login as load-test user
→ GET /.well-known/oauth-authorization-server (discovery)
→ POST /api/auth/oauth2/register (DCR with redirect_uri = http://127.0.0.1:<port>/callback)
→ GET /api/auth/oauth2/authorize?... (authorization-code-with-PKCE; Playwright clicks consent)
→ POST /api/auth/oauth2/token (exchange code + verifier; include resource indicator)
→ write JWT + workspaceId to e2e/.load-test-bearer.json (gitignored)
The plumbing already exists in plugins/mcp/src/init/hosted.ts (runHostedAuthFlow) — the test seams (fetchImpl, serveImpl, openBrowserImpl) make it possible to wrap that helper from inside Playwright instead of building a fresh DCR/PKCE round-trip.
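For orientation, the PKCE and authorize-URL steps of the flow above could look roughly like the sketch below. It only illustrates the protocol steps and is not how runHostedAuthFlow is implemented; wrapping that helper remains the proposal. The function names, the AuthorizeParams shape, and the resource value are hypothetical. The endpoint path comes from the flow listing above.

```typescript
import { createHash, randomBytes } from "node:crypto";

/** S256 code challenge for a PKCE verifier (RFC 7636, section 4.2). */
function challengeFor(verifier: string): string {
  return createHash("sha256").update(verifier, "ascii").digest("base64url");
}

/** Fresh verifier/challenge pair; 32 random bytes encode to a 43-character verifier. */
function createPkcePair(): { verifier: string; challenge: string } {
  const verifier = randomBytes(32).toString("base64url");
  return { verifier, challenge: challengeFor(verifier) };
}

// Hypothetical parameter bag for illustration only.
interface AuthorizeParams {
  apiBase: string; // e.g. "https://api.useatlas.dev"
  clientId: string; // returned by the DCR call
  redirectUri: string; // loopback callback registered during DCR
  challenge: string;
  state: string;
  resource: string; // resource indicator (RFC 8707); the actual value is an assumption
}

/** Build the authorization-code-with-PKCE URL that Playwright would navigate to. */
function buildAuthorizeUrl(p: AuthorizeParams): string {
  const url = new URL("/api/auth/oauth2/authorize", p.apiBase);
  url.search = new URLSearchParams({
    response_type: "code",
    client_id: p.clientId,
    redirect_uri: p.redirectUri,
    code_challenge: p.challenge,
    code_challenge_method: "S256",
    state: p.state,
    resource: p.resource,
  }).toString();
  return url.toString();
}
```

The verifier stays in memory until the token exchange; only the challenge is sent on the authorize request.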
2. CI workflow
Add .github/workflows/load-test-mcp.yml:
- Trigger: weekly cron + manual workflow_dispatch (full 5-min stages × 5 sessions × 3 scripts ≈ 75 min wall time, too expensive to run per PR)
- Steps: install k6 → run Playwright global-setup (or new mint-bearer fixture) → read JWT from disk → run the three k6 scripts pointed at the target environment (prod, per Decisions log #1) → upload summary.json per scenario as workflow artifacts
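A rough sketch of the "read JWT from disk → run k6" hand-off, assuming the bearer file holds a JSON object with jwt and workspaceId fields. The env var names MCP_BEARER and MCP_WORKSPACE_ID are hypothetical, since this issue does not say what lib.js reads. The expiry check is an added precaution: a token that expires partway through a ~75 min run would skew the results.

```typescript
interface LoadTestBearer {
  jwt: string;
  workspaceId: string;
}

/** Parse and validate the contents of e2e/.load-test-bearer.json. */
function parseBearerFile(raw: string): LoadTestBearer {
  const data: unknown = JSON.parse(raw);
  if (typeof data !== "object" || data === null) {
    throw new Error("bearer file must contain a JSON object");
  }
  const { jwt, workspaceId } = data as Record<string, unknown>;
  if (typeof jwt !== "string" || jwt.split(".").length !== 3) {
    throw new Error("jwt is missing or is not a three-part JWT");
  }
  if (typeof workspaceId !== "string" || workspaceId === "") {
    throw new Error("workspaceId is missing");
  }
  return { jwt, workspaceId };
}

/** Seconds until the JWT's exp claim, or null when the token carries no exp. */
function secondsUntilExpiry(jwt: string, nowMs: number): number | null {
  const payload = JSON.parse(Buffer.from(jwt.split(".")[1], "base64url").toString("utf8"));
  return typeof payload.exp === "number" ? payload.exp - Math.floor(nowMs / 1000) : null;
}

/** Arguments and environment for one k6 run. The env var names are hypothetical. */
function k6Invocation(script: string, bearer: LoadTestBearer, summaryPath: string) {
  return {
    // --summary-export is only needed if the scripts do not already write summary.json.
    args: ["run", `--summary-export=${summaryPath}`, script],
    // Passed through the environment so the token does not appear in the process arguments.
    env: { MCP_BEARER: bearer.jwt, MCP_WORKSPACE_ID: bearer.workspaceId },
  };
}
```

A workflow step could call secondsUntilExpiry before starting and fail fast when less than the planned wall time remains.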
3. Result capture — in scope, minimal
Skipping this defeats the point: a workflow that runs and produces artifacts nobody reads is theater. The minimal-but-effective approach:
- Open a long-lived "MCP load-test results history" tracking issue
- Each workflow run posts a comment with: timestamp, git SHA, target env, per-scenario P50/P95/P99 + throughput, and a link to the workflow run
- The comment formatter optionally bolds cells that regressed >25% vs the previous run, so a notification is informative on its own
- The perf doc (apps/docs/content/docs/architecture/mcp-performance.mdx) references the tracking issue as the live source: TBD cells become "see results history at #XXXX"
Implementation: ~30 lines of script that parses summary.json and shells out to gh issue comment. Same PR.
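A minimal sketch of the comment formatter, assuming the per-scenario numbers have already been pulled out of each summary.json into a small ScenarioStats object. That shape, the scenario names, and the meta fields are hypothetical. The >25% rule treats latency as regressing upward and throughput as regressing downward. Note that k6's default trend stats stop at p(95), so P99 would have to be enabled in the scripts (for example through summaryTrendStats).

```typescript
type ScenarioStats = { p50: number; p95: number; p99: number; rps: number };
type Metric = keyof ScenarioStats;

const REGRESSION_THRESHOLD = 0.25;

/** True when the metric moved more than 25% in the bad direction versus the previous run. */
function regressed(metric: Metric, current: number, previous: number | undefined): boolean {
  if (previous === undefined || previous <= 0) return false;
  const change = (current - previous) / previous;
  // Latency regresses when it goes up; throughput regresses when it goes down.
  return metric === "rps" ? change < -REGRESSION_THRESHOLD : change > REGRESSION_THRESHOLD;
}

/** One table cell, bolded when it regressed. */
function cell(metric: Metric, current: number, previous: number | undefined): string {
  const text = metric === "rps" ? current.toFixed(1) : `${current.toFixed(0)} ms`;
  return regressed(metric, current, previous) ? `**${text}**` : text;
}

/** Markdown body for one tracking-issue comment. */
function formatComment(
  meta: { timestamp: string; sha: string; env: string; runUrl: string },
  current: Record<string, ScenarioStats>,
  previous: Partial<Record<string, ScenarioStats>> = {},
): string {
  const metrics = ["p50", "p95", "p99", "rps"] as const;
  const rows = Object.entries(current).map(([name, stats]) => {
    const prev = previous[name];
    const cells = metrics.map((m) => cell(m, stats[m], prev?.[m]));
    return `| ${name} | ${cells.join(" | ")} |`;
  });
  return [
    `${meta.timestamp} · ${meta.sha} · ${meta.env} · ${meta.runUrl}`,
    "",
    "| Scenario | P50 | P95 | P99 | req/s |",
    "| --- | --- | --- | --- | --- |",
    ...rows,
  ].join("\n");
}
```

The returned string could be written to a file and handed to gh issue comment via its --body-file flag; where the previous run's numbers come from (a cached artifact, or the last comment parsed back) is left open here.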
Future expansion (separate follow-up — out of scope here): push to OpenStatus or a Grafana board if the tracking-issue history grows unwieldy. Defer until cadence is live and the volume justifies it.
Test-user provisioning — what we need
This issue cannot land without an answer to:
- Target environment: ✅ prod (api.useatlas.dev). Atlas is pre-launch and no staging environment exists yet. Decided 2026-05-06.
- Test user: dedicated email + password, stored in GH Actions secrets. Either provisioned manually + invited to a load-test workspace, or scripted via scripts/provision-loadtest-user.ts if there's an admin path that supports it.
- Workspace fixture: load-test workspace must have the NovaMart demo dataset attached so lib.js's fixture pool (total_gmv, customers, orders) resolves. Either pre-seed manually or include a seed step in the workflow.
- Cadence: weekly cron + manual trigger? Or manual-trigger-only initially while we build confidence?
Decisions log
| # | Question | Answer | Decided |
| --- | --- | --- | --- |
| 1 | Target environment | prod (pre-launch, no staging exists yet) | 2026-05-06 |
| 2 | Test-user provisioning | scripted: scripts/provision-loadtest-user.ts runs once locally against prod admin path, creates user + workspace, outputs creds for GH secrets | 2026-05-06 |
| 3 | Workspace / dataset | open | - |
| 4 | Cadence | open | - |
| 5 | Bearer fixture path | open (assume e2e/.load-test-bearer.json if not specified) | - |
| 6 | Result-capture loop | in scope, minimal: append-comment per run to a long-lived tracking issue; perf doc links to it as live source. External dashboards out of scope. | 2026-05-06 |
| 7 | Failure notification | open | - |
(Updated as decisions land. The Decisions log is the source of truth — discussions in PR threads or chat should land here before implementation starts.)
Out of scope
- External dashboards (OpenStatus / Grafana) for results — defer until tracking-issue volume justifies it
- Migrating the existing perf doc TBD cells to specific numbers in the doc — the doc points at the tracking issue instead; per-run numbers live there
Acceptance criteria
- Fixture (or global-setup.ts extension) mints a workspace-bound JWT against the configured API and writes it where k6 can read it
- .github/workflows/load-test-mcp.yml runs the three k6 scripts on a defined cadence + on-demand
- summary.json per scenario uploaded as workflow artifacts
- eval/load-tests/mcp/README.md updated to point at the workflow as the primary way to run; manual instructions stay as the secondary path
- apps/docs/content/docs/architecture/mcp-performance.mdx updated: TBD cells → "see results at #XXXX"
Related
- e2e/browser/global-setup.ts: existing Playwright auth + storage-state pattern to reuse
- plugins/mcp/src/init/hosted.ts:204: runHostedAuthFlow with test seams
- packages/mcp/src/eval/auth.ts: in-process auth helper (reference for token-shape requirements; not the right primitive here)