Refresh-token reuse detection in gating mode (H-2) #103

@BorisTyshkevich

Summary

The gating-mode OAuth refresh path issues stateless JWE blobs with no consumed-set tracking. A captured refresh JWE can be redeemed many times in parallel — each redemption mints a fresh access+refresh pair until the JWE's exp (default 30 days). OAuth 2.1 §4.13.2 and the MCP authorization spec (2025-11-25) require refresh-token rotation with reuse detection: when a previously-redeemed jti is replayed, the entire token family must be invalidated and the user forced to re-authenticate.

Threat model

An attacker who briefly captures a refresh JWE (leaked log line, intermediate proxy, compromised browser extension, etc.) currently gets a silent 30-day window of access-token minting against the legitimate user's identity. The legitimate user has no signal that this is happening, and the attacker can keep redeeming the captured JWE until it expires.

With reuse detection in place, the window closes the moment the legitimate client refreshes after the attacker (or vice versa): the jti the loser presents is already in the consumed set, the entire family is revoked, both parties are rejected on subsequent attempts, and the user re-authenticates once. The forced re-authentication is itself a clear signal that a refresh token was replayed.

Spec references

This was flagged as H-2 in the internal OAuth security review (artifact in the deployment wiki, mcp-oauth-debugging.md).

Approach

  1. Embed two new claims into MCP-issued refresh JWEs:
    • jti — fresh 16-byte random per issuance
    • family_id — fresh 16-byte random at initial code→token exchange, stable across the refresh chain
  2. Add two small ClickHouse tables in a new altinity database:
    • oauth_refresh_consumed_jtis (jti, family_id, consumed_at) — append-only log of redeemed refresh tokens
    • oauth_refresh_revoked_families (family_id, revoked_at, reason) — families flagged after reuse detection
    • Both [Replicated]MergeTree, with a 35-day TTL on the timestamp column (TTL consumed_at + INTERVAL 35 DAY; revoked_at for the families table) to bound storage
  3. On each grant_type=refresh_token:
    • SELECT — is jti in consumed, or family_id in revoked? If yes → INSERT family into revoked, return invalid_grant
    • Else → INSERT consumed jti, mint new pair with same family_id, fresh jti
  4. Gated by config flag oauth.refresh_revokes_tracking (default false, opt-in) so existing deployments keep working until operators run the DDL + GRANT.
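
The rotation and reuse-detection steps above can be sketched in Go as follows. This is a minimal illustration, not the planned implementation: `memStore` is a hypothetical in-memory stand-in for the two ClickHouse tables, `RefreshClaims`, `newFamily` and `redeem` are made-up names, and JWE encryption and the other token claims are left out.

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"errors"
	"fmt"
	"sync"
)

// RefreshClaims holds only the two new claims; other JWE claims are omitted.
type RefreshClaims struct {
	JTI      string
	FamilyID string
}

// ErrInvalidGrant corresponds to the OAuth invalid_grant error code.
var ErrInvalidGrant = errors.New("invalid_grant")

// memStore is a hypothetical in-memory stand-in for the two tables.
type memStore struct {
	mu       sync.Mutex
	consumed map[string]string // jti -> family_id
	revoked  map[string]string // family_id -> reason
}

func newMemStore() *memStore {
	return &memStore{consumed: map[string]string{}, revoked: map[string]string{}}
}

// randomID returns a fresh 16-byte random value, hex-encoded.
func randomID() (string, error) {
	b := make([]byte, 16)
	if _, err := rand.Read(b); err != nil {
		return "", err
	}
	return hex.EncodeToString(b), nil
}

// newFamily models the initial code→token exchange: new family_id, new jti.
func newFamily() (RefreshClaims, error) {
	fam, err := randomID()
	if err != nil {
		return RefreshClaims{}, err
	}
	jti, err := randomID()
	if err != nil {
		return RefreshClaims{}, err
	}
	return RefreshClaims{JTI: jti, FamilyID: fam}, nil
}

// redeem models step 3: reject and revoke the family on reuse, otherwise
// record the presented jti and rotate to a fresh jti in the same family.
func (s *memStore) redeem(c RefreshClaims) (RefreshClaims, error) {
	s.mu.Lock()
	defer s.mu.Unlock()
	_, reused := s.consumed[c.JTI]
	_, dead := s.revoked[c.FamilyID]
	if reused || dead {
		if !dead {
			s.revoked[c.FamilyID] = "reuse_detected"
		}
		return RefreshClaims{}, ErrInvalidGrant
	}
	jti, err := randomID()
	if err != nil {
		return RefreshClaims{}, err
	}
	s.consumed[c.JTI] = c.FamilyID
	return RefreshClaims{JTI: jti, FamilyID: c.FamilyID}, nil
}

func main() {
	s := newMemStore()
	first, err := newFamily()
	if err != nil {
		panic(err)
	}
	second, err := s.redeem(first)
	fmt.Println("first refresh accepted:", err == nil)
	_, err = s.redeem(first) // replay of an already-consumed jti
	fmt.Println("replay rejected:", errors.Is(err, ErrInvalidGrant))
	_, err = s.redeem(second) // the whole family is now revoked
	fmt.Println("rotated token rejected after revocation:", errors.Is(err, ErrInvalidGrant))
}
```

In the sketch the replay of `first` revokes the family, so even the legitimately rotated `second` token is refused afterwards, which matches the "both parties are rejected" behavior described in the threat model.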

Query volume: lookups happen only on grant_type=refresh_token, never on regular MCP requests (those validate access tokens locally via HMAC). At realistic load (~hundreds of clients, refresh ~1×/h) that's ~0.5 qps cluster-wide — negligible.
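
As a rough check of that estimate, assume for illustration 600 clients, one refresh per client per hour, and three ClickHouse statements per refresh (for example two lookups and one insert). All three numbers are assumptions, not measurements:

$$
\frac{600\ \text{clients} \times 1\ \text{refresh/h}}{3600\ \text{s/h}} \times 3\ \text{statements/refresh} = 0.5\ \text{qps}
$$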

Out of scope

  • Forward mode. Forward mode wraps an upstream Auth0 refresh token in our JWE; Auth0 itself rotates and detects reuse upstream. Detecting at our wrapper layer too would be defense-in-depth but adds complexity for marginal gain. Startup validation will refuse mode: forward + refresh_revokes_tracking: true.
  • ClickHouse KV table engines (EmbeddedRocksDB, KeeperMap). EmbeddedRocksDB data is local to a single ClickHouse node and is not replicated, so MCP pods connected to different nodes would not share state → reuse window opens between instances. KeeperMap couples OAuth correctness to Keeper ensemble health. Plain [Replicated]MergeTree is the right tool at this query volume.
  • Other findings from the review (H-3 X-Forwarded-Proto, H-4 per-DCR-client consent, M-1..M-4, L-1..L-2). Each gets its own issue.

Operator prerequisites

Before flipping the flag, operators must:

  1. Run docs/sql/oauth-state.sql (clustered or single-node flavor) as a CH admin user — creates the altinity database, both tables, and GRANT INSERT, SELECT ON altinity.* TO mcp_service.
  2. Verify that mcp_service is not assigned a settings profile with readonly=1 (the connection used for state writes cannot be read-only). Startup validation also enforces cfg.ClickHouse.read_only=false when the flag is on.
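
The two startup checks mentioned in this issue (no forward mode with tracking enabled, no read-only ClickHouse connection with tracking enabled) could look roughly like the sketch below. The `Config` struct and its field names are hypothetical and only mirror the settings discussed here; the real configuration types may differ.

```go
package main

import (
	"errors"
	"fmt"
)

// Config mirrors only the settings discussed in this issue.
// The struct and field names are hypothetical.
type Config struct {
	Mode                   string // "gating" or "forward"
	RefreshRevokesTracking bool   // oauth.refresh_revokes_tracking
	ClickHouseReadOnly     bool   // cfg.ClickHouse.read_only
}

// validate refuses flag combinations that cannot work once tracking is on.
func validate(cfg Config) error {
	if !cfg.RefreshRevokesTracking {
		return nil // flag off: existing deployments are unaffected
	}
	if cfg.Mode == "forward" {
		return errors.New("refresh_revokes_tracking is not supported with mode: forward")
	}
	if cfg.ClickHouseReadOnly {
		return errors.New("refresh_revokes_tracking requires read_only=false")
	}
	return nil
}

func main() {
	fmt.Println(validate(Config{Mode: "forward", RefreshRevokesTracking: true}))
	fmt.Println(validate(Config{Mode: "gating", RefreshRevokesTracking: true, ClickHouseReadOnly: true}))
	fmt.Println(validate(Config{Mode: "gating", RefreshRevokesTracking: true}))
}
```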

Legacy-token policy

Refresh tokens issued before deploy lack the family_id/jti claims. They are rejected with invalid_grant — clients re-authenticate once. This avoids a "silent bypass" window where reuse detection wouldn't apply to in-flight tokens.
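
A minimal sketch of that legacy check, assuming the decrypted JWE payload is available as a generic claims map. The function name `checkClaims` and the map representation are assumptions made for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrInvalidGrant corresponds to the OAuth invalid_grant error code.
var ErrInvalidGrant = errors.New("invalid_grant")

// checkClaims rejects refresh tokens that lack the jti or family_id claim,
// which is the case for tokens issued before reuse detection was deployed.
func checkClaims(claims map[string]any) error {
	for _, key := range []string{"jti", "family_id"} {
		v, ok := claims[key].(string)
		if !ok || v == "" {
			return fmt.Errorf("%w: missing %s claim", ErrInvalidGrant, key)
		}
	}
	return nil
}

func main() {
	legacy := map[string]any{"sub": "user-1"}
	current := map[string]any{"sub": "user-1", "jti": "a1", "family_id": "f1"}
	fmt.Println("legacy rejected:", errors.Is(checkClaims(legacy), ErrInvalidGrant))
	fmt.Println("current accepted:", checkClaims(current) == nil)
}
```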

Failure modes

Hard fail. If any CH state operation fails (unreachable, permission denied, timeout), the refresh request returns HTTP 500 server_error with a structured ERR-level zerolog line. We never silently fall through to "mint a new pair anyway" — that would defeat the security control.

Rollback: set oauth.refresh_revokes_tracking: false in the Helm values and run helm upgrade. The refresh path returns to its current stateless behavior. The tables can stay populated; TTL expires the remaining rows within 35 days.

Implementation plan

A working branch (wip/oauth-h2) carries the iterative work + image builds + per-deployment smoke tests. Once stable, a clean feature/oauth-refresh-reuse-detection branch off main will be opened as the PR (Closes #<this-issue>).

Labels: enhancement