
chore(mcp,testing,ci): wire MCP load tests into existing Playwright + CI infra #2129

@msywulak

Description

Problem

eval/load-tests/mcp/ (#2070, PR #2128) is reproducible-by-hand — clone, install k6, lift a bearer from a connected Claude Desktop / Cursor session, run. That's enough to answer "does the curve break at N sessions" once. It is not enough to keep the perf doc fresh: nothing runs the scripts on a cadence, nothing captures results, and the bearer-acquisition step is undocumented enough to be a footgun.

The shorter follow-up framing was "ship a scripts/print-bearer.ts." On closer inspection, the repo already has a working Playwright suite at e2e/browser/: a global-setup.ts that logs in and saves storage state, plus an auth.spec.ts exercising the auth flow. A standalone token printer would ignore that infra — the right move is to extend it.

Proposal

Extend the existing Playwright + CI infrastructure so the MCP load tests run on a cadence against staging (or prod) without manual bearer juggling.

1. Token minting via Playwright

Add a fixture (or extend e2e/browser/global-setup.ts) that, after the existing login flow, drives the OAuth 2.1 loopback flow against the running API to mint a workspace-bound JWT and write it to disk:

e2e/browser/global-setup.ts
  → login as load-test user
  → GET /.well-known/oauth-authorization-server (discovery)
  → POST /api/auth/oauth2/register (DCR with redirect_uri = http://127.0.0.1:<port>/callback)
  → GET /api/auth/oauth2/authorize?... (authorization-code-with-PKCE; Playwright clicks consent)
  → POST /api/auth/oauth2/token (exchange code + verifier; include resource indicator)
  → write JWT + workspaceId to e2e/.load-test-bearer.json (gitignored)

The plumbing already exists in plugins/mcp/src/init/hosted.ts (runHostedAuthFlow) — the test seams (fetchImpl, serveImpl, openBrowserImpl) make it possible to wrap that helper from inside Playwright instead of building a fresh DCR/PKCE round-trip.
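A minimal sketch of the fixture's output side, assuming the seam wrapping works out. `mintBearer` below is a stand-in for the real `runHostedAuthFlow` call (its actual signature lives in plugins/mcp/src/init/hosted.ts, and a real version would drive the consent click via Playwright); the on-disk shape, including the `mintedAt` field, is an assumption, not a decision:

```typescript
import { mkdirSync, writeFileSync } from "node:fs";
import { dirname } from "node:path";

interface BearerFile {
  jwt: string;
  workspaceId: string;
  mintedAt: string; // ISO timestamp — lets k6 warn on stale tokens
}

// Placeholder for wrapping runHostedAuthFlow via its fetchImpl / serveImpl /
// openBrowserImpl seams. The real fixture would run DCR + PKCE here.
async function mintBearer(): Promise<{ jwt: string; workspaceId: string }> {
  return { jwt: "eyJ...", workspaceId: "ws_loadtest" }; // hypothetical values
}

// Mint a bearer and persist it where k6 can read it (path per decision #5).
export async function writeBearerFile(
  path = "e2e/.load-test-bearer.json",
): Promise<BearerFile> {
  const { jwt, workspaceId } = await mintBearer();
  const payload: BearerFile = {
    jwt,
    workspaceId,
    mintedAt: new Date().toISOString(),
  };
  mkdirSync(dirname(path), { recursive: true });
  writeFileSync(path, JSON.stringify(payload, null, 2) + "\n");
  return payload;
}
```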

2. CI workflow

Add .github/workflows/load-test-mcp.yml:

  • Trigger: weekly cron + manual workflow_dispatch (full 5-min stages × 5 sessions × 3 scripts ≈ 75 min wall time — too expensive for per-PR)
  • Steps: install k6 → run Playwright global-setup (or new mint-bearer fixture) → read JWT from disk → run the three k6 scripts pointed at staging → upload summary.json per scenario as workflow artifacts
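The steps above could wire together roughly like this — a hedged config sketch, not the final file: action versions, secret names, the `--grep mint-bearer` selector, and the scenario glob are all assumptions:

```yaml
# Sketch of .github/workflows/load-test-mcp.yml
name: load-test-mcp
on:
  schedule:
    - cron: "0 6 * * 1" # weekly; exact slot TBD
  workflow_dispatch: {}
jobs:
  k6:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: grafana/setup-k6-action@v1
      - name: Mint bearer via Playwright global-setup
        run: npx playwright test --config e2e/browser/playwright.config.ts --grep mint-bearer
        env:
          LOADTEST_EMAIL: ${{ secrets.LOADTEST_EMAIL }}
          LOADTEST_PASSWORD: ${{ secrets.LOADTEST_PASSWORD }}
      - name: Run k6 scenarios
        run: |
          for s in eval/load-tests/mcp/*.js; do
            k6 run "$s" --summary-export "summary-$(basename "$s" .js).json"
          done
      - uses: actions/upload-artifact@v4
        with:
          name: k6-summaries
          path: summary-*.json
```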

3. Result capture — in scope, minimal

Skipping this defeats the point: a workflow that runs and produces artifacts nobody reads is theater. The minimal-but-effective approach:

  • Open a long-lived "MCP load-test results history" tracking issue
  • Each workflow run posts a comment with: timestamp, git SHA, target env, per-scenario P50/P95/P99 + throughput, and a link to the workflow run
  • The comment formatter optionally bolds cells that regressed >25% vs the previous run, so a notification is informative on its own
  • The perf doc (apps/docs/content/docs/architecture/mcp-performance.mdx) references the tracking issue as the live source — TBD cells become "see results history at #XXXX"

Implementation: ~30 lines of script that parses summary.json and shells out to gh issue comment. Same PR.
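That script could look like the following — a sketch under stated assumptions, not the implementation: the ScenarioStats shape, the threshold constant, and how stats get extracted from summary.json are all assumptions; only the >25% bolding rule and the `gh issue comment` invocation come from the proposal above.

```typescript
import { execFileSync } from "node:child_process";

// Per-scenario stats, assumed to be pulled out of k6's summary.json upstream.
interface ScenarioStats {
  scenario: string;
  p50: number; // latency, ms
  p95: number;
  p99: number;
  rps: number; // throughput
}

const REGRESSION_THRESHOLD = 0.25;

// Bold a latency cell when it is >25% slower than the previous run.
function cell(curr: number, prev: number | undefined): string {
  const text = `${curr.toFixed(1)}ms`;
  return prev !== undefined && curr > prev * (1 + REGRESSION_THRESHOLD)
    ? `**${text}**`
    : text;
}

// Build the markdown comment: timestamp, SHA, env, run link, then the table.
export function formatComment(
  runs: ScenarioStats[],
  previous: Map<string, ScenarioStats>,
  meta: { sha: string; env: string; runUrl: string },
): string {
  const rows = runs.map((r) => {
    const prev = previous.get(r.scenario);
    return `| ${r.scenario} | ${cell(r.p50, prev?.p50)} | ${cell(r.p95, prev?.p95)} | ${cell(r.p99, prev?.p99)} | ${r.rps.toFixed(1)} req/s |`;
  });
  return [
    `**${new Date().toISOString()}** · \`${meta.sha}\` · ${meta.env} · [run](${meta.runUrl})`,
    "",
    "| scenario | P50 | P95 | P99 | throughput |",
    "| --- | --- | --- | --- | --- |",
    ...rows,
  ].join("\n");
}

// Append the comment to the long-lived tracking issue via the gh CLI.
export function postComment(issue: number, body: string): void {
  execFileSync("gh", ["issue", "comment", String(issue), "--body", body]);
}
```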

Future expansion (separate follow-up — out of scope here): push to OpenStatus or a Grafana board if the tracking-issue history grows unwieldy. Defer until cadence is live and the volume justifies it.

Test-user provisioning — what we need

This issue cannot land without an answer to:

  • Target environment: ✅ prod (api.useatlas.dev) — Atlas is pre-launch, no staging environment exists yet. Decided 2026-05-06.
  • Test user: dedicated email + password, stored in GH Actions secrets. Either provisioned manually + invited to a load-test workspace, or scripted via scripts/provision-loadtest-user.ts if there's an admin path that supports it.
  • Workspace fixture: load-test workspace must have the NovaMart demo dataset attached so lib.js's fixture pool (total_gmv, customers, orders) resolves. Either pre-seed manually or include a seed step in the workflow.
  • Cadence: weekly cron + manual trigger? Or manual-trigger-only initially while we build confidence?

Decisions log

| # | Question | Answer | Decided |
| --- | --- | --- | --- |
| 1 | Target environment | prod — pre-launch, no staging exists yet | 2026-05-06 |
| 2 | Test-user provisioning | scripted: scripts/provision-loadtest-user.ts runs once locally against prod admin path, creates user + workspace, outputs creds for GH secrets | 2026-05-06 |
| 3 | Workspace / dataset | open | — |
| 4 | Cadence | open | — |
| 5 | Bearer fixture path | open (assume e2e/.load-test-bearer.json if not specified) | — |
| 6 | Result-capture loop | in scope, minimal: append-comment per run to a long-lived tracking issue; perf doc links to it as live source. External dashboards out of scope. | 2026-05-06 |
| 7 | Failure notification | open | — |

(Updated as decisions land. The Decisions log is the source of truth — discussions in PR threads or chat should land here before implementation starts.)

Acceptance criteria

  • Playwright fixture (or global-setup.ts extension) mints a workspace-bound JWT against the configured API and writes it where k6 can read it
  • .github/workflows/load-test-mcp.yml runs the three k6 scripts on a defined cadence + on-demand
  • Workflow artifacts include summary.json per scenario
  • eval/load-tests/mcp/README.md updated to point at the workflow as the primary way to run; manual instructions stay as the secondary path
  • Test user creds, workspace ID, base URL all driven by GH Actions secrets — no plaintext in the repo
  • Long-lived "MCP load-test results history" tracking issue opened
  • Workflow posts a formatted-table comment to the tracking issue per run, with regression flagging vs prior run
  • apps/docs/content/docs/architecture/mcp-performance.mdx updated — TBD cells → "see results at #XXXX"

Out of scope

  • External dashboards (OpenStatus / Grafana) for results — defer until tracking-issue volume justifies it
  • Migrating the existing perf doc TBD cells to specific numbers in the doc — the doc points at the tracking issue instead; per-run numbers live there

    Labels

    area: mcp (MCP server) · area: testing (test infrastructure, utilities, coverage) · chore (CI, deps, maintenance, infra)
