Summary
The gating-mode OAuth refresh path issues stateless JWE blobs with no consumed-set tracking. A captured refresh JWE can be redeemed many times in parallel — each redemption mints a fresh access+refresh pair until the JWE's exp (default 30 days). OAuth 2.1 §4.13.2 and the MCP authorization spec (2025-11-25) require refresh-token rotation with reuse detection: when a previously-redeemed jti is replayed, the entire token family must be invalidated and the user forced to re-authenticate.
Threat model
An attacker who briefly captures a refresh JWE (leaked log line, intermediate proxy, compromised browser extension, etc.) currently gets a silent 30-day window of access-token minting against the legitimate user's identity. The legitimate user has no signal that this is happening, and the attacker can keep redeeming the captured JWE until it expires.
With reuse detection in place, the window closes the moment the legitimate client refreshes after the attacker (or vice-versa): the jti the loser presents is in the consumed-set, the entire family is revoked, both parties are rejected on subsequent attempts, and the user re-authenticates once. The auth event is a clear signal.
Spec references
This was flagged as H-2 in the internal OAuth security review (artifact in the deployment wiki, mcp-oauth-debugging.md).
Approach
- Embed two new claims into MCP-issued refresh JWEs:
  - jti: fresh 16-byte random value per issuance
  - family_id: fresh 16-byte random value at the initial code→token exchange, stable across the refresh chain
- Add two small ClickHouse tables in a new altinity database:
  - oauth_refresh_consumed_jtis (jti, family_id, consumed_at): append-only log of redeemed refresh tokens
  - oauth_refresh_revoked_families (family_id, revoked_at, reason): families flagged after reuse detection
  - Both [Replicated]MergeTree with TTL consumed_at + INTERVAL 35 DAY to bound storage
- On each grant_type=refresh_token:
  - SELECT: is jti in consumed, or family_id in revoked? If yes → INSERT family into revoked, return invalid_grant
  - Else → INSERT consumed jti, mint new pair with same family_id, fresh jti
- Gated by config flag oauth.refresh_revokes_tracking (default false, opt-in) so existing deployments keep working until operators run the DDL + GRANT.
Query volume: lookups happen only on grant_type=refresh_token, never on regular MCP requests (those validate access tokens locally via HMAC). At realistic load (~hundreds of clients, refresh ~1×/h) that's ~0.5 qps cluster-wide — negligible.
Out of scope
- Forward mode. Forward mode wraps an upstream Auth0 refresh token in our JWE; Auth0 itself rotates and detects reuse upstream. Detecting at our wrapper layer too would be defense-in-depth but adds complexity for marginal gain. Startup validation will refuse mode: forward + refresh_revokes_tracking: true.
- Single-node KV stores (EmbeddedRocksDB, KeeperMap). EmbeddedRocksDB doesn't replicate across MCP pods → reuse window opens between instances. KeeperMap couples OAuth correctness to Keeper ensemble health. Plain [Replicated]MergeTree is the right tool at this query volume.
- Other findings from the review (H-3 X-Forwarded-Proto, H-4 per-DCR-client consent, M-1..M-4, L-1..L-2). Each gets its own issue.
Operator prerequisites
Before flipping the flag, operators must:
- Run docs/sql/oauth-state.sql (clustered or single-node flavor) as a CH admin user; this creates the altinity database, both tables, and GRANT INSERT, SELECT ON altinity.* TO mcp_service.
- Verify mcp_service does not have a READONLY=1 profile (the connection used for state writes cannot be read-only). Startup validation also enforces cfg.ClickHouse.read_only=false when the flag is on.
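The two startup checks mentioned in this issue (no forward mode, no read-only ClickHouse connection when the flag is on) could look roughly like this. The Config struct and its field names are hypothetical simplifications for illustration; the real configuration layout may differ.

```go
package main

import (
	"errors"
	"fmt"
)

// Config is a hypothetical, trimmed-down view of the settings involved.
type Config struct {
	OAuthMode              string // "gating" or "forward"
	RefreshRevokesTracking bool   // oauth.refresh_revokes_tracking
	ClickHouseReadOnly     bool   // cfg.ClickHouse.read_only
}

// ValidateRefreshTracking refuses configurations in which reuse detection
// is enabled but cannot work as described.
func ValidateRefreshTracking(cfg Config) error {
	if !cfg.RefreshRevokesTracking {
		return nil // flag off: nothing to enforce
	}
	if cfg.OAuthMode == "forward" {
		return errors.New("oauth.refresh_revokes_tracking is not supported with mode: forward")
	}
	if cfg.ClickHouseReadOnly {
		return errors.New("oauth.refresh_revokes_tracking requires clickhouse read_only=false")
	}
	return nil
}

func main() {
	err := ValidateRefreshTracking(Config{OAuthMode: "forward", RefreshRevokesTracking: true})
	fmt.Println(err)
}
```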
Legacy-token policy
Refresh tokens issued before deploy lack the family_id/jti claims. They are rejected with invalid_grant — clients re-authenticate once. This avoids a "silent bypass" window where reuse detection wouldn't apply to in-flight tokens.
Failure modes
Hard fail. If any CH state operation fails (unreachable, permission denied, timeout), the refresh request returns HTTP 500 server_error with a structured ERR-level zerolog line. We never silently fall through to "mint a new pair anyway" — that would defeat the security control.
Rollback: set oauth.refresh_revokes_tracking: false in the Helm values and run helm upgrade. The refresh path returns to its current stateless behavior. Tables can stay populated; the TTL expires their rows within 35 days.
Implementation plan
A working branch (wip/oauth-h2) carries the iterative work + image builds + per-deployment smoke tests. Once stable, a clean feature/oauth-refresh-reuse-detection branch off main will be opened as the PR (Closes #<this-issue>).