Refresh-token reuse detection in gating mode (H-2)

## Summary

The gating-mode OAuth refresh path issues stateless JWE blobs with no consumed-set tracking. A captured refresh JWE can be redeemed many times in parallel — each redemption mints a fresh access+refresh pair until the JWE's exp (default 30 days). OAuth 2.1 §4.13.2 and the MCP authorization spec (2025-11-25) require refresh-token *rotation with reuse detection*: when a previously-redeemed jti is replayed, the entire token *family* must be invalidated and the user forced to re-authenticate.

## Threat model

An attacker who briefly captures a refresh JWE (leaked log line, intermediate proxy, compromised browser extension, etc.) currently gets a silent 30-day window of access-token minting against the legitimate user's identity. The legitimate user has no signal that this is happening. The attacker can keep redeeming the captured JWE until it expires, and every redemption also hands back a fresh refresh token that can be redeemed in turn.

With reuse detection in place, the window closes the moment the legitimate client refreshes after the attacker (or vice versa): the jti the loser presents is already in the consumed-set, so the entire family is revoked, both parties are rejected on subsequent attempts, and the user re-authenticates once. That forced re-authentication is a clear signal that a token was replayed.

## Spec references

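- [RFC 6749 §10.4 — Refresh Tokens](https://datatracker.ietf.org/doc/html/rfc6749#section-10.4) — says the authorization server SHOULD deploy means to detect refresh-token abuse when client authentication is not possible, and describes refresh-token rotation as one such means: the previous refresh token is invalidated but retained, so when both the attacker and the legitimate client use a compromised token, one of them presents an invalidated token and reveals the breach.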
- [OAuth 2.1 (draft) §4.13.2 — Refresh Token Protection](https://datatracker.ietf.org/doc/html/draft-ietf-oauth-v2-1) — same requirement, MUST level for public clients with refresh-token rotation.
- [MCP authorization spec 2025-11-25](https://modelcontextprotocol.io/specification/2025-11-25/basic/authorization) — references OAuth 2.1, inherits the rotation+reuse-detection requirement for MCP servers issuing refresh tokens.

This was flagged as **H-2** in the internal OAuth security review (artifact in the deployment wiki, `mcp-oauth-debugging.md`).

## Approach

1. Embed two new claims into MCP-issued refresh JWEs:
   - `jti` — fresh 16-byte random per issuance
   - `family_id` — fresh 16-byte random at initial code→token exchange, **stable across the refresh chain**
2. Add two small ClickHouse tables in a new `altinity` database:
   - `oauth_refresh_consumed_jtis` (jti, family_id, consumed_at) — append-only log of redeemed refresh tokens
   - `oauth_refresh_revoked_families` (family_id, revoked_at, reason) — families flagged after reuse detection
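   - Both use `[Replicated]MergeTree` with a 35-day TTL to bound storage (`TTL consumed_at + INTERVAL 35 DAY` on the consumed table, the same interval on `revoked_at` for the families table)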
3. On each `grant_type=refresh_token`:
   - SELECT — is jti in consumed, or family_id in revoked? If yes → INSERT family into revoked, return `invalid_grant`
   - Else → INSERT consumed jti, mint new pair with same `family_id`, fresh `jti`
4. Gated by config flag `oauth.refresh_revokes_tracking` (default `false`, opt-in) so existing deployments keep working until operators run the DDL + GRANT.
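
The check-then-consume logic in step 3 can be sketched as follows. This is a minimal in-memory illustration, not the service's implementation: two Python dicts stand in for the ClickHouse tables, and `RefreshState`, `issue_initial`, `redeem`, and the `reuse_detected` reason string are hypothetical names chosen for the example. It also ignores concurrency between simultaneous redemptions.

```python
import secrets
from dataclasses import dataclass, field


class InvalidGrant(Exception):
    """Stands in for the OAuth `invalid_grant` error response."""


@dataclass
class RefreshState:
    # In-memory stand-ins for the two tables described above (assumption).
    consumed_jtis: dict = field(default_factory=dict)     # jti -> family_id
    revoked_families: dict = field(default_factory=dict)  # family_id -> reason


def new_id() -> str:
    # 16 random bytes, hex-encoded to 32 characters.
    return secrets.token_hex(16)


def issue_initial() -> dict:
    # Initial code->token exchange: fresh jti and fresh family_id.
    return {"jti": new_id(), "family_id": new_id()}


def redeem(state: RefreshState, claims: dict) -> dict:
    jti, family_id = claims["jti"], claims["family_id"]
    # Replayed jti, or a family that was already revoked: revoke and reject.
    if jti in state.consumed_jtis or family_id in state.revoked_families:
        state.revoked_families.setdefault(family_id, "reuse_detected")
        raise InvalidGrant(f"family {family_id} revoked")
    # First redemption: record the jti, then rotate within the same family.
    state.consumed_jtis[jti] = family_id
    return {"jti": new_id(), "family_id": family_id}
```

Redeeming a token once returns new claims with the same `family_id`. Redeeming the same token a second time raises `InvalidGrant` and revokes the family, after which the rotated token from the first redemption is rejected as well.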

Query volume: lookups happen only on `grant_type=refresh_token`, never on regular MCP requests (those validate access tokens locally via HMAC). At realistic load (~hundreds of clients, refresh ~1×/h) that's ~0.5 qps cluster-wide — negligible.
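
As a rough check on that estimate, the arithmetic can be written out. The inputs are assumptions for illustration only: 900 clients as one reading of "hundreds", one refresh per client per hour, and two queries per refresh (one SELECT, one INSERT, per step 3).

```python
def refresh_qps(clients: int, refreshes_per_hour: float, queries_per_refresh: int) -> float:
    # Queries per second = total queries per hour / 3600 seconds.
    return clients * refreshes_per_hour * queries_per_refresh / 3600.0


# Hypothetical inputs: 900 clients, 1 refresh/h each, 2 queries per refresh.
print(refresh_qps(900, 1.0, 2))  # 0.5
```

Smaller fleets land proportionally lower, so the stated figure reads as an upper bound for deployments in the hundreds of clients.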

## Out of scope

- **Forward mode.** Forward mode wraps an upstream Auth0 refresh token in our JWE; Auth0 itself rotates and detects reuse upstream. Detecting at our wrapper layer too would be defense-in-depth but adds complexity for marginal gain. Startup validation will refuse `mode: forward` + `refresh_revokes_tracking: true`.
- **KV table engines** (EmbeddedRocksDB, KeeperMap). EmbeddedRocksDB data is local to a single ClickHouse node and is not replicated, so MCP pods talking to different replicas would see different consumed-sets and a reuse window opens between them. KeeperMap couples OAuth correctness to Keeper ensemble health. Plain `[Replicated]MergeTree` is the right tool at this query volume.
- **Other findings from the review** (H-3 X-Forwarded-Proto, H-4 per-DCR-client consent, M-1..M-4, L-1..L-2). Each gets its own issue.

## Operator prerequisites

Before flipping the flag, operators must:

1. Run `docs/sql/oauth-state.sql` (clustered or single-node flavor) as a CH admin user — creates the `altinity` database, both tables, and `GRANT INSERT, SELECT ON altinity.* TO mcp_service`.
2. Verify that `mcp_service` is not assigned a settings profile with `readonly=1` (the connection used for state writes cannot be read-only). Startup validation also enforces `cfg.ClickHouse.read_only=false` when the flag is on.
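
The two startup checks mentioned in this issue (no `mode: forward` together with the flag, and no read-only ClickHouse connection together with the flag) could look roughly like this. The `Config` shape and the `validate` function are hypothetical and only mirror the config keys named above.

```python
from dataclasses import dataclass


@dataclass
class Config:
    mode: str                       # "gating" or "forward"
    refresh_revokes_tracking: bool  # oauth.refresh_revokes_tracking
    clickhouse_read_only: bool      # cfg.ClickHouse.read_only


def validate(cfg: Config) -> list:
    """Return a list of startup errors; an empty list means the config is accepted."""
    errors = []
    if not cfg.refresh_revokes_tracking:
        # Flag off: neither check applies, existing deployments keep working.
        return errors
    if cfg.mode == "forward":
        errors.append("refresh_revokes_tracking is not supported with mode: forward")
    if cfg.clickhouse_read_only:
        errors.append("refresh_revokes_tracking requires a writable ClickHouse connection")
    return errors
```

Returning every violation at once, rather than stopping at the first, lets an operator fix the whole config in one pass.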

## Legacy-token policy

Refresh tokens issued before deploy lack the `family_id`/`jti` claims. They are rejected with `invalid_grant` — clients re-authenticate once. This avoids a "silent bypass" window where reuse detection wouldn't apply to in-flight tokens.
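
A minimal sketch of that policy, assuming the refresh JWE has already been decrypted into a claims dict. The function name and return convention are hypothetical.

```python
from typing import Optional

REQUIRED_CLAIMS = ("jti", "family_id")


def check_refresh_claims(claims: dict) -> Optional[str]:
    """Return "invalid_grant" for legacy tokens, or None when both claims are present."""
    # A missing or empty claim means the token predates reuse detection.
    missing = [name for name in REQUIRED_CLAIMS if not claims.get(name)]
    if missing:
        return "invalid_grant"
    return None
```

Treating an empty claim the same as a missing one keeps a malformed token from slipping past the consumed-set lookup.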

## Failure modes

Hard fail. If any CH state operation fails (unreachable, permission denied, timeout), the refresh request returns HTTP 500 `server_error` with a structured ERR-level zerolog line. We never silently fall through to "mint a new pair anyway" — that would defeat the security control.
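
The hard-fail rule can be illustrated with a small handler sketch. Everything here is hypothetical scaffolding: `StateStoreError`, `handle_refresh`, and the two callables stand in for the real state client and token minting, and Python's `logging` stands in for the structured zerolog line. The 400 status for `invalid_grant` follows RFC 6749 §5.2.

```python
import logging

logger = logging.getLogger("oauth.refresh")


class StateStoreError(Exception):
    """Any failed state operation: unreachable, permission denied, timeout."""


def handle_refresh(claims: dict, check_and_consume, mint_pair):
    """Return (http_status, body). `check_and_consume` returns True on detected reuse."""
    try:
        reused = check_and_consume(claims)
    except StateStoreError as exc:
        # Hard fail: log at error level and never fall through to minting.
        logger.error("refresh state operation failed: %s", exc)
        return 500, {"error": "server_error"}
    if reused:
        return 400, {"error": "invalid_grant"}
    return 200, mint_pair(claims)
```

The minting call sits after the state check on the only path that reaches it, so a store failure cannot produce a new token pair.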

Rollback: set `oauth.refresh_revokes_tracking: false` in the Helm values and run `helm upgrade`. The refresh path returns to its current stateless behavior. The tables can stay populated; the TTL ages their rows out after 35 days.

## Implementation plan

A working branch (`wip/oauth-h2`) carries the iterative work + image builds + per-deployment smoke tests. Once stable, a clean `feature/oauth-refresh-reuse-detection` branch off `main` will be opened as the PR (`Closes #<this-issue>`).

