Skip to content

fix(mcp): restore cloud-snapshot mode for Recce Summary agent (DRC-3384)#1359

Merged
kentwelcome merged 1 commit into
mainfrom
feature/drc-3384-recce-gitlab-summary-generation-hits-401-auth-error-at-cloud
May 7, 2026
Merged

fix(mcp): restore cloud-snapshot mode for Recce Summary agent (DRC-3384)#1359
kentwelcome merged 1 commit into
mainfrom
feature/drc-3384-recce-gitlab-summary-generation-hits-401-auth-error-at-cloud

Conversation

@kentwelcome
Copy link
Copy Markdown
Member

Summary

recce mcp-server had collapsed two distinct cloud paths into a single live CloudBackend branch. With RECCE_SESSION_ID in the environment (the contract used by the Recce Summary agent), the server tried to spawn a live cloud instance via POST /api/v2/sessions/{id}/instance and 401'd before MCP could start — breaking GitLab summary generation in production (DRC-3384).

This PR splits mcp-server into three explicit modes:

Mode Inputs Behavior
cloud-session (public) --cloud --session <id> CloudBackend proxies tool calls to a live Recce Cloud session
cloud-snapshot (internal) RECCE_SESSION_ID / RECCE_SNAPSHOT_ID env CloudStateLoader downloads state file from S3, served as a local MCP. Used by the Recce Summary agent.
local (default) none Local MCP from local artifacts/state

Root cause

patch_derived_args() promotes session_id (which Click reads from RECCE_SESSION_ID via the hidden --session-id option) to cloud=True. After e7b7f81 added CloudBackend, the env-driven flag also routed into the live-session branch — but the Recce Summary agent only has snapshot artifacts, not a running cloud instance, so the spawn call 401'd.

Fix

  • Capture explicit --cloud / --session flags before patch_derived_args() so explicit cloud-session can still be told apart from env-driven cloud-snapshot afterward.
  • Route cloud-snapshot through create_state_loader_by_args (legacy path) and reset kwargs[\"cloud\"] = False before run_mcp_server so it does not take the CloudBackend branch.
  • Distinct startup banners per mode for diagnosability.

Test plan

  • tests/test_cli.py::TestCommandMCPServer — all 6 tests pass
    • new test_cmd_mcp_server_cloud_snapshot_mode_via_env_var: RECCE_SESSION_ID → state loader, no live spawn
    • new test_cmd_mcp_server_cloud_session_mode_skips_state_loader: --cloud --sessionCloudBackend, no state loader
  • Full tests/test_cli.py suite passes (21 tests)
  • Manual verification: recce mcp-server --sse --debug with RECCE_SESSION_ID set should print Loading Recce state from cloud snapshot (...) and start an SSE server (no POST /instance call)
  • Manual verification: recce mcp-server --cloud --session <id> still spawns a live cloud session

Related

  • Linear: DRC-3384
  • Regressed by: e7b7f81 (feat(mcp): add cloud-mode support for recce mcp-server, DRC-3345)
  • Partial earlier fix: e3687ab (DRC-3383) — accepted env-var session_id but routed it to the wrong branch

🤖 Generated with Claude Code

`recce mcp-server` collapsed two distinct cloud paths into a single live
CloudBackend branch. Setting RECCE_SESSION_ID (the Recce Summary agent's
contract) routed into the live-session spawn (POST /api/v2/sessions/{id}/
instance) and 401'd before the MCP server could start, breaking GitLab
summary generation in production.

Split mcp-server into three explicit modes:

- cloud-session  (public)   `--cloud --session <id>`
    CloudBackend proxies tool calls to a live Recce Cloud session.
- cloud-snapshot (internal) `RECCE_SESSION_ID` / `RECCE_SNAPSHOT_ID` env
    CloudStateLoader downloads the state file from S3 and the MCP server
    runs locally against that snapshot. Used by the Recce Summary agent.
- local          (default)   no flags / no env

Capture --cloud / --session BEFORE patch_derived_args() so explicit
cloud-session can still be told apart from env-driven cloud-snapshot
after the helper promotes session_id to cloud=True. Reset cloud=False
before run_mcp_server() in non-session paths so the legacy snapshot
flow doesn't take the CloudBackend branch.

Tests updated:
- cloud-snapshot via RECCE_SESSION_ID env -> state loader, no live spawn
- cloud-session via --cloud --session -> CloudBackend, no state loader

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Kent Huang <kent@infuseai.io>
@codecov
Copy link
Copy Markdown

codecov Bot commented May 7, 2026

Codecov Report

❌ Patch coverage is 89.58333% with 5 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
recce/cli.py 80.76% 5 Missing ⚠️
Files with missing lines Coverage Δ
tests/test_cli.py 100.00% <100.00%> (ø)
recce/cli.py 67.17% <80.76%> (+0.11%) ⬆️

... and 4 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@kentwelcome
Copy link
Copy Markdown
Member Author

Code Review: PR #1359

Files reviewed: 2 (recce/cli.py, tests/test_cli.py)
Categories: Business logic, Tests
Passes run: A, C, D, E, F, G, H
HEAD: 0553a97a

Validation Results

Pass A: Correctness & Logic — PASS

Traced the three-mode dispatch end-to-end:

  • explicit_cloud is captured from kwargs.get("cloud", False) before patch_derived_args() mutates it, so env-driven snapshot mode (which causes patch_derived_args to flip cloud=True) does not falsely look like an explicit --cloud. (recce/cli.py:2587)
  • is_cloud_snapshot = kwargs.get("cloud", False) and not is_cloud_session is read after patch_derived_args(), so it correctly captures the env-promoted cloud=True while excluding the explicit-flag case. (recce/cli.py:2604)
  • The state-loader gate if not is_cloud_session and (state_file or is_cloud_snapshot) runs create_state_loader_by_args(state_file, **kwargs) while kwargs["cloud"] is still Truecreate_state_loader_by_args (recce/cli.py:48-105) reads kwargs["cloud"] to decide between CloudStateLoader and FileStateLoader, so the order matters and is correct here. (recce/cli.py:2613)
  • The reset kwargs["cloud"] = False happens after state-loader creation but before run_mcp_server is called, so run_mcp_server (recce/mcp_server.py:2535) takes the else branch and does not spawn CloudBackend. (recce/cli.py:2637)

Pass C: Cross-Reference Consistency — PASS

  • run_mcp_server signature: cloud: bool = False, session: Optional[str] = None (recce/mcp_server.py:2478-2484). The PR passes session=cloud_session (None for snapshot, set for session) and cloud via **kwargs. Internal guard if session and not cloud: raise ValueError (line 2511) holds in both paths because cloud_session is None in snapshot mode.
  • create_state_loader_by_args(state_file, **kwargs) reads cloud, session_id, share_url, state_file_host, api_token from kwargs — all present in the snapshot kwargs path.
  • Test mocks (mock_create_state_loader.return_value = MagicMock(), etc.) match the real function signature: create_state_loader_by_args returns a state loader instance.

Pass D: Error Handling & Edge Cases — PASS

  • if explicit_cloud and not cloud_session: exit(1) and if cloud_session and not explicit_cloud: exit(1) enforce mutual presence of the public flags; env-driven snapshot mode bypasses both correctly. (recce/cli.py:2592, 2595)
  • State-loader creation is wrapped at the call site (in create_state_loader, recce/cli.py:21-35) with RecceCloudException and broad Exception handling that exits cleanly. Pre-existing.

Pass E: Test Coverage & Quality — PASS with NOTE

  • Two new tests cover both the cloud-snapshot env-var path and the cloud-session explicit-flag path, with non-vacuous assertions on state_loader, cloud, and session kwargs.
  • Mocks now include recce.cli.create_state_loader_by_args, which is the correct patch path (the function is referenced via recce.cli from the CLI entry point).
  • All 6 TestCommandMCPServer tests pass locally.

Pass F: Diff-Specific Checks — PASS

  • The is_cloud_mcp variable was renamed to is_cloud_session / is_cloud_snapshot. Searched the diff and surrounding code: no stale is_cloud_mcp references remain in recce/cli.py.
  • Banner messages updated for all three modes including the SSE/stdio variant, no orphaned print branches.

Pass G: Performance — N/A

CLI startup path; no hot loops or query changes.

Pass H: Async/Concurrency — N/A

asyncio.run(run_mcp_server(...)) is unchanged; no new awaits or tasks introduced.

Verification Results

  • uv run pytest tests/test_cli.py::TestCommandMCPServer — 6 passed
  • make flake8 — clean

Verdict: GO

All validation passes clean.

Notes

  1. Behavior tightening — not a regression. Pre-PR, --cloud plus RECCE_SESSION_ID env (no --session flag) was accepted via the or kwargs.get("session_id") fallback at the old cloud_session = kwargs.pop(...) or kwargs.get("session_id") line. Post-PR, this combination now hits the --session is required when using --cloud error. This is consistent with the PR's intent (separate the public flag-driven path from the internal env-driven path), but is a small behavior change worth noting if any tooling out there mixed --cloud with the env var. The Recce Summary agent's contract is env-only, so it is unaffected.

  2. session_id stays in kwargs after the cloud reset. kwargs["cloud"] = False is set before run_mcp_server, but kwargs["session_id"] (from env) is left in place and forwarded into load_context(**kwargs). This matches pre-PR behavior in the legacy snapshot path and state_loader is already provided so session_id is functionally redundant downstream — calling out for awareness, not as a defect.

  3. No test for the inverted-validation path --session sess-X without --cloud. The error branch (recce/cli.py:2595) is symmetric to the existing test_cmd_mcp_server_cloud_requires_session and trivially correct, but there's no negative test covering it.

What I could not verify

  • End-to-end behavior against a real CloudStateLoader download from S3 — both new tests mock create_state_loader_by_args. The actual cloud-snapshot flow against the Recce Summary agent's deployment is out of reach from a local review.
  • Whether anything in the Recce Cloud / Summary agent deployment passes --cloud alongside RECCE_SESSION_ID (see Note 1).

What I looked for and did not find

  • Stale references to the removed is_cloud_mcp variable.
  • Type or signature drift between run_mcp_server and the CLI call site.
  • State-loader creation order bugs (it correctly runs while cloud=True is still in kwargs).
  • Test mocks drifting from the real function signatures.
  • Logic inversions in the new is_cloud_session / is_cloud_snapshot predicates.
  • Resource leaks or swallowed exceptions in the new branches.
  • PII or secrets in the new banner messages (session_id is logged but it is not a secret in this codebase's existing convention).

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR restores the intended “cloud-snapshot” behavior for recce mcp-server when invoked by the Recce Summary agent via RECCE_SESSION_ID, while preserving an explicit “cloud-session” mode for end users (--cloud --session <id>). It prevents env-driven snapshot runs from incorrectly taking the live CloudBackend branch (and failing early with a 401 before MCP starts).

Changes:

  • Split MCP startup logic into explicit modes (cloud-session vs cloud-snapshot vs local) by capturing explicit flags before patch_derived_args().
  • Route env-driven snapshot startup through create_state_loader_by_args() and force cloud=False when calling run_mcp_server to avoid CloudBackend.
  • Add/adjust CLI tests to validate snapshot-via-env uses a state loader and explicit cloud-session skips it.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

File Description
recce/cli.py Introduces explicit mode detection for mcp-server, routes snapshot mode through cloud state loader, and prevents accidental CloudBackend activation.
tests/test_cli.py Adds coverage for env-driven cloud-snapshot behavior and explicit --cloud --session cloud-session behavior.

Comment thread recce/cli.py
@even-wei
Copy link
Copy Markdown
Contributor

even-wei commented May 7, 2026

Code Review: PR #1359

Files reviewed: 2 (recce/cli.py, tests/test_cli.py)
Categories: Business logic, Tests
Passes run: A, B, C, D, E, F, G, H
HEAD: 0553a97a (independent re-review at user's request)

Summary of approach

Independently traced the dispatch logic for every legal input combination, verified the production caller of the cloud-snapshot path, and re-ran the full CLI test suite. This is a fresh adversarial read; not a re-statement of the prior review.

Validation Results

Pass A: Correctness & Logic — PASS

Traced 9 input combinations end-to-end against the new dispatch:

# Input Result
1 --cloud --session sess-X cloud-session → CloudBackend, no state loader, banner cloud-session ✓
2 RECCE_SESSION_ID=sess-Y (no flags) cloud-snapshot → state loader, cloud=False to run_mcp_server, banner cloud-snapshot ✓
3 bare recce mcp-server local, single-env probe runs, banner local ✓
4 recce mcp-server state.json local + FileStateLoader (cloud=False in create_state_loader_by_args) ✓
5 --cloud alone exit(1) "--session is required" ✓ (recce/cli.py:2592-2593)
6 --session sess-Z alone exit(1) "--cloud is required" ✓ (recce/cli.py:2595-2596)
7 --cloud --session sess-A + RECCE_SESSION_ID=sess-B cloud-session uses sess-A; sess-B remains in kwargs but is ignored by CloudBackend.create(session_id=cloud_session)
8 RECCE_SHARE_URL=... (env share URL) cloud-snapshot, session_label falls back to share_url (recce/cli.py:2649); create_state_loader_by_args routes share_url through CloudStateLoader's share_id path ✓
9 RECCE_SNAPSHOT_ID=snap-X (legacy env) mapped to session_id by Click envvar=["RECCE_SNAPSHOT_ID","RECCE_SESSION_ID"] (recce/cli.py:2376); same as case 2 ✓

Order-of-operations is critical and correct:

  • cloud_session = kwargs.pop("cloud_session", None) and explicit_cloud = kwargs.get("cloud", False) are captured before patch_derived_args() (recce/cli.py:2586-2588). This is the actual fix — without this, env-driven snapshot mode is indistinguishable from explicit cloud-session after patch_derived_args flips cloud=True.
  • create_state_loader_by_args(state_file, **kwargs) runs while kwargs["cloud"] is still True (recce/cli.py:2615), so it correctly picks CloudStateLoader (recce/cli.py:81-100). The reset kwargs["cloud"] = False (recce/cli.py:2638-2639) happens after state-loader creation but before run_mcp_server, so run_mcp_server's cloud branch (recce/mcp_server.py:2535) does not spawn CloudBackend.

Pass B: Security — PASS

No secrets logged. session_label in the cloud-snapshot banner echoes session_id/share_url — neither is a secret in this codebase's existing convention (already logged elsewhere). No SQL/command/path injection surface introduced; this is pure CLI dispatch.

Pass C: Cross-Reference Consistency — PASS

  • run_mcp_server signature (sse, host, port, cloud=False, session=None, **kwargs) (recce/mcp_server.py:2478-2484) matches the call site run_mcp_server(sse=sse, host=host, port=port, session=cloud_session, **kwargs) (recce/cli.py:2667). cloud arrives via **kwargs. Internal guard if session and not cloud: raise ValueError (line 2511) holds in both branches: cloud-session (cloud=True), cloud-snapshot (cloud=False, session=None).
  • Test mocks at recce.cli.create_state_loader_by_args patch the right symbol — create_state_loader_by_args is referenced via the module's local namespace at recce/cli.py:2615.
  • recce.mcp_server.run_mcp_server patched with new_callable=MagicMock works correctly because the lazy import from recce.mcp_server import run_mcp_server (recce/cli.py:2552) executes at runtime after the patch is applied — the local name binds to the patched MagicMock.
  • is_cloud_mcp variable fully removed; grep -rn "is_cloud_mcp" recce/ tests/ returns no matches.

Pass D: Error Handling & Edge Cases — PASS

  • Both validation branches (recce/cli.py:2592-2596) call exit(1) cleanly. The KeyboardInterrupt/asyncio.CancelledError and broad Exception handlers (recce/cli.py:2668-2678) are unchanged.
  • The kwargs["cloud"] = False reset is gated by if not is_cloud_session (recce/cli.py:2638), not by if is_cloud_snapshot — important, because it also covers the local case where kwargs["cloud"] was already False (no-op, harmless) and prevents any future code path that mutates cloud after patch_derived_args from leaking into run_mcp_server.

Pass E: Test Coverage & Quality — PASS with NOTE

  • Two new tests cover both the cloud-snapshot env-var path and the cloud-session explicit-flag path with non-vacuous assertions.
  • All 6 TestCommandMCPServer tests pass; full CLI suite (tests/test_cli.py, tests/test_cli_cache.py, tests/test_cli_mcp_optional.py) — 64 passed.
  • Codecov reports 5 uncovered lines on the patch (89.58%). I traced them: 4 are SSE/stdio banner console.print calls in the new branches (recce/cli.py:2644-2645, 2655, 2658-2659) and 1 is the --cloud is required when using --session error branch (recce/cli.py:2596-2597). All 4 banner lines are pure prints; the validation branch is symmetric to the already-tested --cloud without --session case.

Pass F: Diff-Specific Checks — PASS

  • Banner messages updated for all three modes including the SSE variant; no orphaned print branches.
  • cloud_session = kwargs.pop("cloud_session", None) or kwargs.get("session_id") (pre-PR) became cloud_session = kwargs.pop("cloud_session", None) (post-PR) — the or kwargs.get("session_id") fallback was deliberately removed to separate the public flag-driven path from the internal env-driven path. Verified by running production caller search (see "Production caller verification" below) — no breakage.

Pass G: Performance — N/A

CLI startup path; no hot loops, queries, or memory concerns.

Pass H: Async/Concurrency — N/A

asyncio.run(run_mcp_server(...)) unchanged; no new awaits, tasks, or shared state.

Verification Results

  • uv run pytest tests/test_cli.py::TestCommandMCPServer -v6 passed
  • uv run pytest tests/test_cli.py tests/test_cli_cache.py tests/test_cli_mcp_optional.py64 passed
  • make flake8 — clean

Production caller verification

The recce mcp-server command is spawned by recce_instance_launcher/src/recce_ai_summary_entrypoint:95 in the recce-cloud-infra repo:

recce mcp-server --sse --debug 2> /dev/null &

with RECCE_SESSION_ID set in the container env. No --cloud flag, no --session flag — exactly the cloud-snapshot path this PR establishes. The PR's tightening (rejecting --cloud without --session) does not affect this caller because the entrypoint never passed --cloud. The 401 regression (DRC-3384) traced exactly to this entrypoint hitting the live-spawn branch after e7b7f815 collapsed the two cloud paths.

Verdict: GO

No BLOCKERs, no ISSUEs. The dispatcher is correct on all combinations I traced, the production caller is exercised by the cloud-snapshot path, the full CLI suite passes, and lint is clean.

Notes

  1. Help text doesn't reflect the new modes. recce mcp-server --help (recce/cli.py:2492-2538) only shows local-mode examples. The --session Click option help (recce/cli.py:2483: "Recce Cloud session ID for cloud MCP mode") is vague and doesn't distinguish cloud-session from cloud-snapshot. Copilot already flagged this. UX/docs only — not blocking.
  2. session_id is forwarded into run_mcp_server in cloud-snapshot mode. After the kwargs["cloud"] = False reset, kwargs["session_id"] (from env) remains. It flows through **kwargs into load_context(**kwargs) (recce/mcp_server.py:2551) but is ignored there because state_loader is provided. Matches pre-PR behavior; harmless. Worth knowing if load_context ever starts validating unknown kwargs.
  3. No negative test for --session sess-X without --cloud. The error branch (recce/cli.py:2595-2596) is symmetric to test_cmd_mcp_server_cloud_requires_session and trivially correct, but uncovered (one of the 5 codecov lines). Worth a 3-line test if you're touching this file again.
  4. Behavior tightening — pre-PR vs post-PR. Pre-PR, --cloud plus RECCE_SESSION_ID env (without --session) was accepted via the removed or kwargs.get("session_id") fallback. Post-PR, this combination errors at "--session is required when using --cloud." Verified the only known production caller (recce_ai_summary_entrypoint) doesn't pass --cloud when only the env var is set, so this is safe — but flagging in case any other tooling out there mixed them.

What I could not verify

  • End-to-end behavior against a real CloudStateLoader S3 download — both new tests mock create_state_loader_by_args. The actual cloud-snapshot fetch happens out of reach from a local review.
  • Whether any non-public tooling beyond recce_ai_summary_entrypoint invokes recce mcp-server with --cloud alongside RECCE_SESSION_ID (Note 4).

What I looked for and did not find

  • Stale is_cloud_mcp references after the rename (none).
  • Order-of-operations bug between patch_derived_args, create_state_loader_by_args, and the kwargs["cloud"] = False reset (order is correct).
  • Type/signature drift between run_mcp_server and the CLI call site.
  • Test mocks drifting from the real function signatures (create_state_loader_by_args, run_mcp_server).
  • Logic inversions in is_cloud_session / is_cloud_snapshot predicates.
  • Resource leaks, swallowed exceptions, or partial-failure paths in the new branches.
  • Hidden interaction with RECCE_SHARE_URL env (case 8 above — handled correctly via session_label fallback).
  • Click option-name collisions — --session (mapped to cloud_session kwarg, line 2483) vs hidden --session-id (mapped to session_id kwarg, line 234). Distinct keys; no shadowing.

Copy link
Copy Markdown
Contributor

@even-wei even-wei left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Claude Code Review (re-review at user's request): No critical issues found. Dispatcher is correct on all 9 traced combinations; production caller (recce_ai_summary_entrypoint) confirmed to use the cloud-snapshot path. 4 NOTEs about help-text UX, session_id forwarding, missing negative test, and a behavior tightening — all non-blocking. See review comment for details.

@kentwelcome kentwelcome merged commit 0d39a55 into main May 7, 2026
26 checks passed
@kentwelcome kentwelcome deleted the feature/drc-3384-recce-gitlab-summary-generation-hits-401-auth-error-at-cloud branch May 7, 2026 06:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants