
Improve the re-connect handling mechanism for MCP with auth#875

Merged
rapids-bot[bot] merged 10 commits into NVIDIA:release/1.3 from
yczhang-nv:yuchen-fix-reconnect-mcp-auth
Sep 30, 2025

Conversation

@yczhang-nv
Contributor

@yczhang-nv yczhang-nv commented Sep 29, 2025

Description

Added auth_flow_timeout to better handle MCP client timeouts and reconnection logic when authentication is involved.

There are now two timeout options: tool_call_timeout (shorter) and auth_flow_timeout (longer). The rules are:

  • Uses shorter timeout (default 60s) when auth token is cached
  • Uses longer timeout (default 300s) when authentication may be needed
  • Prevents reconnection during active authentication flows

In practice, the first tool call that requires authentication gets a 300s timeout, and the client does not try to reconnect if that timeout is hit. After the first successful authentication, subsequent tool calls that use the cached token get a 60s timeout, and the client tries to reconnect once that 60s timeout is hit.
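The selection rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the code in client_base.py: the TimeoutPolicy class, its has_auth_provider field, and the token_expires_at attribute are hypothetical names chosen for the example. Only the two option names and their defaults (60s and 300s) come from the description.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class TimeoutPolicy:
    """Hypothetical stand-in for the client's timeout selection (not the PR's code)."""

    tool_call_timeout: timedelta = timedelta(seconds=60)
    auth_flow_timeout: timedelta = timedelta(seconds=300)
    has_auth_provider: bool = False
    # Assumption: the token cache exposes an expiry time; None means nothing is cached.
    token_expires_at: Optional[datetime] = None

    def has_cached_auth_token(self) -> bool:
        if self.token_expires_at is None:
            return False
        return self.token_expires_at > datetime.now(timezone.utc)

    def get_tool_call_timeout(self) -> timedelta:
        # Without auth, or with a valid cached token, no interactive flow is expected.
        if not self.has_auth_provider or self.has_cached_auth_token():
            return self.tool_call_timeout
        # Authentication may be needed, so leave time for the user to complete it.
        return self.auth_flow_timeout
```

Under these assumptions an expired token is treated the same as a missing one, so the longer timeout applies again once the cached token lapses.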

Closes AIQ-1966

By Submitting this PR I confirm:

  • I am familiar with the Contributing Guidelines.
  • We require that all contributors "sign-off" on their commits. This certifies that the contribution is your original work, or you have rights to submit it under the same license, or a compatible license.
    • Any contribution which contains commits that are not Signed-Off will not be accepted.
  • When the PR is ready for review, new or existing tests cover these changes.
  • When the PR is ready for review, the documentation is up to date with these changes.

Summary by CodeRabbit

  • New Features

    • Configurable authentication flow timeout and reconnection options.
    • Dynamic tool-call timeouts based on authentication state; extended time during interactive auth.
    • Increased default tool-call timeout for more resilient operations.
    • Clearer errors on authentication timeouts and prevention of reconnects during active authentication.
  • Tests

    • Comprehensive test coverage for authentication timeout handling, token caching scenarios, timeout selection, and reconnect behavior.

Signed-off-by: Yuchen Zhang <yuchenz@nvidia.com>
@yczhang-nv yczhang-nv self-assigned this Sep 29, 2025
@coderabbitai

coderabbitai bot commented Sep 29, 2025

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Walkthrough

Introduces authentication-state tracking and dynamic timeout resolution in the MCP client. Adds auth_flow_timeout, adjusts tool_call_timeout defaults, and alters reconnect logic to skip reconnect during active authentication. Updates constructors and tool invocation paths accordingly. Adds comprehensive tests for auth token caching, timeout selection, and reconnect behavior.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| MCP client auth/timeout and reconnect logic (packages/nvidia_nat_mcp/src/nat/plugins/mcp/client_base.py) | Adds is_authenticating flag to AuthAdapter; introduces auth_flow_timeout; increases default tool_call_timeout to 60s; implements _has_cached_auth_token and _get_tool_call_timeout; wires dynamic timeouts into call_tool and call_tool_with_meta; updates constructors for MCPBaseClient and transports; modifies _with_reconnect to skip reconnect during active auth and handle auth timeouts distinctly; updates get_tools and MCPToolClient usage to rely on parent for timeouts. |
| Tests for authentication timeout and reconnect behavior (tests/nat/mcp/test_mcp_auth_timeout.py) | New tests covering default/configured timeouts, token cache scenarios, timeout resolution, AuthAdapter state, reconnect behavior during auth, and integration paths validating read_timeout_seconds selection and state transitions. |

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant App
  participant MCPClient
  participant AuthAdapter
  participant Transport

  App->>MCPClient: call_tool(tool, args)
  MCPClient->>MCPClient: _has_cached_auth_token()
  alt No cached token
    MCPClient->>MCPClient: timeout = auth_flow_timeout
    MCPClient->>AuthAdapter: set is_authenticating = true
  else Cached token
    MCPClient->>MCPClient: timeout = tool_call_timeout
  end

  MCPClient->>Transport: send tool call (timeout=resolved)
  par Response or 401
    Transport-->>MCPClient: result or 401
  and Timeout
    Transport--x MCPClient: TimeoutError
  end

  alt Timeout during active auth
    MCPClient->>MCPClient: do not reconnect
    MCPClient-->>App: raise auth timeout
  else Other errors/timeouts
    MCPClient->>MCPClient: _with_reconnect() per policy
    MCPClient->>Transport: reconnect and retry (if enabled)
    Transport-->>MCPClient: result or error
  end

  MCPClient->>AuthAdapter: finally set is_authenticating = false
  MCPClient-->>App: tool result or error
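
The error branches of the diagram can be approximated in Python as follows. This is a sketch under stated assumptions, not the actual _with_reconnect: the function name, its parameters, the max_attempts default, and the RuntimeError message are all hypothetical, and the authentication state is passed in as a callable for simplicity.

```python
import asyncio
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


async def with_reconnect_sketch(
    call: Callable[[], Awaitable[T]],
    reconnect: Callable[[], Awaitable[None]],
    is_authenticating: Callable[[], bool],
    max_attempts: int = 2,
) -> T:
    try:
        return await call()
    except Exception as exc:
        if is_authenticating():
            # During an active auth flow, never reconnect: surface the failure.
            if isinstance(exc, (TimeoutError, asyncio.TimeoutError)):
                raise RuntimeError("Authentication timed out; not reconnecting") from exc
            raise
        last_error = exc
    # Not authenticating: reconnect and retry a bounded number of times.
    for _ in range(max_attempts):
        await reconnect()
        try:
            return await call()
        except Exception as exc:
            last_error = exc
    raise last_error
```

A real client would normally add backoff between attempts and honor a configuration switch for disabling reconnects; both are left out here to keep the control flow visible.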

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

Suggested labels

bug, non-breaking

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 52.94% which is insufficient. The required threshold is 80.00%. | You can run @coderabbitai generate docstrings to improve docstring coverage. |
✅ Passed checks (2 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Title Check | ✅ Passed | The title “Improve the re-connect handling mechanism for MCP with auth” uses the imperative verb “Improve,” stays under the 72-character limit, and clearly summarizes the main focus of the changeset: enhancing the reconnect logic in the MCP client when authentication is involved. It directly reflects the PR objectives around modifying reconnect behavior during authentication flows and does not introduce unrelated or generic terminology. This makes it a concise and descriptive title that aligns well with the established requirements. |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |


Signed-off-by: Yuchen Zhang <yuchenz@nvidia.com>
@yczhang-nv
Contributor Author

@coderabbitai review

@coderabbitai

coderabbitai bot commented Sep 29, 2025

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@coderabbitai coderabbitai bot added the bug (Something isn't working) and non-breaking (Non-breaking change) labels Sep 29, 2025

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (2)
tests/nat/mcp/test_mcp_auth_timeout.py (1)

489-605: LGTM - Comprehensive integration test coverage.

The integration tests excellently verify:

  • Correct timeout used with cached token
  • Extended timeout used without cached token
  • Dynamic timeout switching after authentication completes

Optional: Silence unused variable warnings.

Lines 527 and 566 unpack args but never use it. Consider using _args instead to indicate intentionally unused:

-        args, kwargs = call_args[0]
+        _args, kwargs = call_args[0]
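
An alternative that avoids the unused positional tuple altogether is to read the keyword arguments directly from the mock. The sketch below uses only the standard unittest.mock API. The session object, the tool name, and the arguments are placeholders, not taken from the test file.

```python
import asyncio
from datetime import timedelta
from unittest.mock import AsyncMock


def recorded_read_timeout() -> timedelta:
    session = AsyncMock()  # placeholder for the mocked MCP session

    async def invoke() -> None:
        await session.call_tool("echo", {"text": "hi"}, read_timeout_seconds=timedelta(seconds=60))

    asyncio.run(invoke())
    # call_args.kwargs holds only the keyword arguments of the most recent call.
    return session.call_tool.call_args.kwargs["read_timeout_seconds"]
```

Reading call_args.kwargs makes the assertion independent of how many positional arguments the call received.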
packages/nvidia_nat_mcp/src/nat/plugins/mcp/client_base.py (1)

316-339: Encapsulate token cache check behind a public API
Rather than reaching into _auth_code_provider._authenticated_tokens, add a public has_valid_cached_token() (e.g. on AuthProviderBase and implement it in MCPOAuth2Provider) so clients don’t depend on provider internals.
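
A rough shape for such an API is sketched below. Everything here is hypothetical: AuthProviderSketch and InMemoryProviderSketch stand in for AuthProviderBase and MCPOAuth2Provider, whose real interfaces are not shown in this conversation, and the CachedToken fields are assumptions.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Dict


@dataclass
class CachedToken:
    access_token: str
    expires_at: datetime  # assumption: timezone-aware expiry timestamp


class AuthProviderSketch(ABC):
    """Hypothetical base class exposing the cache check publicly."""

    @abstractmethod
    def has_valid_cached_token(self) -> bool:
        """Return True if at least one cached token has not expired."""


class InMemoryProviderSketch(AuthProviderSketch):
    def __init__(self) -> None:
        self._tokens: Dict[str, CachedToken] = {}

    def store(self, user_id: str, token: CachedToken) -> None:
        self._tokens[user_id] = token

    def has_valid_cached_token(self) -> bool:
        now = datetime.now(timezone.utc)
        return any(token.expires_at > now for token in self._tokens.values())
```

With this shape the client would call provider.has_valid_cached_token() and would no longer need to know how tokens are stored.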

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between eb08c6e and 4e4610b.

📒 Files selected for processing (2)
  • packages/nvidia_nat_mcp/src/nat/plugins/mcp/client_base.py (14 hunks)
  • tests/nat/mcp/test_mcp_auth_timeout.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (3)
**/*

⚙️ CodeRabbit configuration file

**/*: # Code Review Instructions

  • Ensure the code follows best practices and coding standards.
  • For Python code, follow PEP 20 and PEP 8 for style guidelines.
  • Check for security vulnerabilities and potential issues.
  • Python methods should use type hints for all parameters and return values.
    Example:
    def my_function(param1: int, param2: str) -> bool:
        pass
  • For Python exception handling, ensure proper stack trace preservation:
    • When re-raising exceptions: use bare raise statements to maintain the original stack trace,
      and use logger.error() (not logger.exception()) to avoid duplicate stack trace output.
    • When catching and logging exceptions without re-raising: always use logger.exception()
      to capture the full stack trace information.

Documentation Review Instructions

  • Verify that documentation and comments are clear and comprehensive.
  • Verify that the documentation doesn't contain any TODOs, FIXMEs or placeholder text like "lorem ipsum".
  • Verify that the documentation doesn't contain any offensive or outdated terms.
  • Verify that documentation and comments are free of spelling mistakes. Ensure the documentation doesn't contain any words listed in the ci/vale/styles/config/vocabularies/nat/reject.txt file; words that might appear to be spelling mistakes but are listed in the ci/vale/styles/config/vocabularies/nat/accept.txt file are OK.

Misc.

  • All code (except .mdc files that contain Cursor rules) should be licensed under the Apache License 2.0, and should contain an Apache License 2.0 header comment at the top of each file.
  • Confirm that copyright years are up to date whenever a file is changed.

Files:

  • packages/nvidia_nat_mcp/src/nat/plugins/mcp/client_base.py
  • tests/nat/mcp/test_mcp_auth_timeout.py
packages/**/*

⚙️ CodeRabbit configuration file

packages/**/*:

  • This directory contains optional plugin packages for the toolkit; each should contain a pyproject.toml file.
  • The pyproject.toml file should declare a dependency on nvidia-nat or another package with a name starting with nvidia-nat-. This dependency should be declared using ~=<version>, and the version should be a two digit version (ex: ~=1.0).
  • Not all packages contain Python code; if they do, they should also contain their own set of tests, in a tests/ directory at the same level as the pyproject.toml file.

Files:

  • packages/nvidia_nat_mcp/src/nat/plugins/mcp/client_base.py
tests/**/*.py

⚙️ CodeRabbit configuration file

tests/**/*.py:

  • Ensure that tests are comprehensive, cover edge cases, and validate the functionality of the code.
  • Test functions should be named using the test_ prefix, using snake_case.
  • Any frequently repeated code should be extracted into pytest fixtures.
  • Pytest fixtures should define the name argument when applying the pytest.fixture decorator. The fixture function being decorated should be named using the fixture_ prefix, using snake_case.
    Example:
    @pytest.fixture(name="my_fixture")
    def fixture_my_fixture():
        pass

Files:

  • tests/nat/mcp/test_mcp_auth_timeout.py
🪛 Ruff (0.13.1)
packages/nvidia_nat_mcp/src/nat/plugins/mcp/client_base.py

298-298: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


299-301: Avoid specifying long messages outside the exception class

(TRY003)


303-303: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


336-336: Consider moving this statement to an else block

(TRY300)


337-337: Do not catch blind exception: Exception

(BLE001)

tests/nat/mcp/test_mcp_auth_timeout.py

121-121: Possible hardcoded password assigned to argument: "client_secret"

(S106)


145-145: Possible hardcoded password assigned to argument: "client_secret"

(S106)


170-170: Possible hardcoded password assigned to argument: "client_secret"

(S106)


191-191: Possible hardcoded password assigned to argument: "client_secret"

(S106)


240-240: Possible hardcoded password assigned to argument: "client_secret"

(S106)


270-270: Possible hardcoded password assigned to argument: "client_secret"

(S106)


299-299: Possible hardcoded password assigned to argument: "client_secret"

(S106)


313-313: Possible hardcoded password assigned to argument: "client_secret"

(S106)


332-332: Possible hardcoded password assigned to argument: "client_secret"

(S106)


353-353: Avoid specifying long messages outside the exception class

(TRY003)


369-369: Possible hardcoded password assigned to argument: "client_secret"

(S106)


390-390: Avoid specifying long messages outside the exception class

(TRY003)


406-406: Possible hardcoded password assigned to argument: "client_secret"

(S106)


432-432: Avoid specifying long messages outside the exception class

(TRY003)


448-448: Possible hardcoded password assigned to argument: "client_secret"

(S106)


474-474: Avoid specifying long messages outside the exception class

(TRY003)


498-498: Possible hardcoded password assigned to argument: "client_secret"

(S106)


527-527: Unpacked variable args is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)


540-540: Possible hardcoded password assigned to argument: "client_secret"

(S106)


566-566: Unpacked variable args is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)


579-579: Possible hardcoded password assigned to argument: "client_secret"

(S106)

🔇 Additional comments (16)
packages/nvidia_nat_mcp/src/nat/plugins/mcp/client_base.py (10)

62-63: LGTM - Clean authentication state tracking.

The addition of the is_authenticating flag provides a clear mechanism to coordinate reconnection logic with active authentication flows.


90-104: LGTM - Proper authentication flow state management.

The implementation correctly:

  • Sets the authentication flag before the retry attempt
  • Resets the flag in a finally block to ensure cleanup even on exceptions
  • Uses raise to preserve the stack trace (per coding guidelines)
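
The set-then-reset pattern praised here can be illustrated with a small sketch. AuthStateSketch and run_auth_retry are hypothetical names; in the PR the flag lives on AuthAdapter, whose full code is not shown in this conversation.

```python
import asyncio
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


class AuthStateSketch:
    """Hypothetical holder for the flag; the PR places it on AuthAdapter."""

    def __init__(self) -> None:
        self.is_authenticating = False


async def run_auth_retry(state: AuthStateSketch, retry: Callable[[], Awaitable[T]]) -> T:
    # Mark the flow as active before the retry so reconnect logic can see it.
    state.is_authenticating = True
    try:
        return await retry()
    finally:
        # Runs on success, exception, and cancellation alike.
        state.is_authenticating = False
```

Because the reset sits in a finally block, a timeout or error during the retry cannot leave the flag stuck at True.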

166-178: LGTM - Well-designed timeout configuration.

The timeout parameters are well-chosen:

  • 60s default for normal tool calls (increased from previous 5s) is reasonable
  • 300s for interactive authentication flows provides adequate time for user interaction
  • Clear documentation distinguishes between the two timeouts

Also applies to: 197-198


341-355: LGTM - Clean dynamic timeout resolution.

The method appropriately selects timeout based on authentication state and provides helpful debug logging for the extended timeout case.


286-314: LGTM - Correct authentication-aware reconnect logic.

The implementation properly:

  • Skips reconnection during active authentication flows
  • Provides clear timeout messaging for abandoned authentication
  • Uses logger.error() with bare raise (correct per coding guidelines for re-raising)
  • Falls through to normal reconnect logic for non-auth errors

Note: The static analysis hints suggesting logger.exception() are incorrect in this context—the coding guidelines explicitly state to use logger.error() when re-raising to avoid duplicate stack traces.
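
The two logging cases from the coding guidelines can be contrasted in a short sketch. The function names and the logger name are illustrative only.

```python
import logging
from typing import Callable, Optional

logger = logging.getLogger("logging_sketch")


def reraise_example(operation: Callable[[], int]) -> int:
    try:
        return operation()
    except ValueError as exc:
        # Re-raising: log without a traceback, since the caller will see it anyway.
        logger.error("Operation failed, re-raising: %s", exc)
        raise


def swallow_example(operation: Callable[[], int], default: Optional[int] = None) -> Optional[int]:
    try:
        return operation()
    except ValueError:
        # Not re-raising: logger.exception() records the full traceback here.
        logger.exception("Operation failed, returning default")
        return default
```

The first form keeps the original stack trace through the bare raise. The second is the only place the trace would otherwise be lost, so it is logged in full.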


383-391: LGTM - Proper parent client wiring.

The change from passing tool_call_timeout directly to passing parent_client=self enables dynamic timeout resolution based on authentication state.


460-474: LGTM - Consistent timeout parameter propagation across transport clients.

All three transport implementations (SSE, Stdio, StreamableHTTP) correctly:

  • Accept the new timeout parameters
  • Use consistent defaults (60s, 300s)
  • Propagate parameters to the base class

Also applies to: 509-525, 569-585


622-636: LGTM - Simplified tool client with proper parent delegation.

The refactoring correctly removes timeout management from MCPToolClient and delegates to the parent client, which can dynamically determine the appropriate timeout based on authentication state.

Also applies to: 664-699


433-439: Confirm request_read_timeout_seconds accepts a timedelta. The code currently passes a datetime.timedelta—if ClientSession.send_request expects a numeric value, convert via timeout.total_seconds().


444-449: Verify read_timeout_seconds accepts timedelta
We weren’t able to locate ClientSession.call_tool’s signature in this repo—please confirm whether it accepts a timedelta for read_timeout_seconds or if you need to pass timeout.total_seconds() instead.
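
If the receiving API turns out to expect a plain number, a small normalizing helper keeps the call sites unchanged. This is a sketch only: to_seconds is a hypothetical helper, and which type the session methods actually accept still has to be confirmed against the MCP SDK.

```python
from datetime import timedelta
from typing import Union


def to_seconds(timeout: Union[timedelta, int, float]) -> float:
    """Return the timeout as float seconds, accepting timedelta or numeric input."""
    if isinstance(timeout, timedelta):
        return timeout.total_seconds()
    return float(timeout)
```

For example, to_seconds(timedelta(minutes=5)) evaluates to 300.0, and numeric inputs pass through as floats.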

tests/nat/mcp/test_mcp_auth_timeout.py (6)

43-74: LGTM - Well-structured test mocks.

The mock infrastructure properly:

  • Implements the MCPBaseClient interface
  • Provides async context manager protocol
  • Uses AsyncMock for session mocking
  • Supports side effect injection for testing different scenarios

Note: Static analysis warnings about hardcoded passwords are false positives—these are test fixtures, not production secrets.


82-98: LGTM - Essential configuration validation.

The tests appropriately verify:

  • Custom timeout parameters are stored correctly
  • Default values match expectations (60s tool, 300s auth)

106-211: LGTM - Comprehensive token cache validation tests.

Excellent test coverage including:

  • No auth provider case
  • Valid cached token
  • Expired token
  • Empty token cache
  • Multiple tokens with mixed validity

The tests properly mock the internal auth provider structure and verify all edge cases.


219-285: LGTM - Complete timeout selection logic coverage.

The tests verify all three timeout selection scenarios:

  • No auth provider → normal timeout
  • Cached token → normal timeout
  • No cached token → extended auth timeout

293-318: LGTM - Essential AuthAdapter state verification.

The tests verify:

  • Initial authentication state is False
  • AuthAdapter properly stores the auth provider reference

326-481: LGTM - Thorough reconnect behavior validation.

Excellent test coverage of the reconnect logic matrix:

  • ✅ Timeout during auth → no reconnect, specific error message
  • ✅ Error during auth → no reconnect, propagates original error
  • ✅ Timeout when not authenticating → reconnect attempted
  • ✅ Error when not authenticating → reconnect attempted

The tests properly verify reconnect call counts and use fast backoff timings for efficient test execution.

@yczhang-nv yczhang-nv changed the base branch from develop to release/1.3 September 30, 2025 16:18
…econnect-mcp-auth

Signed-off-by: Yuchen Zhang <yuchenz@nvidia.com>
Signed-off-by: Yuchen Zhang <yuchenz@nvidia.com>
Signed-off-by: Yuchen Zhang <yuchenz@nvidia.com>
Signed-off-by: Yuchen Zhang <yuchenz@nvidia.com>
Signed-off-by: Yuchen Zhang <yuchenz@nvidia.com>
@yczhang-nv yczhang-nv marked this pull request as ready for review September 30, 2025 17:36
@yczhang-nv yczhang-nv requested a review from a team as a code owner September 30, 2025 17:36
Signed-off-by: Yuchen Zhang <yuchenz@nvidia.com>
Signed-off-by: Yuchen Zhang <yuchenz@nvidia.com>
Signed-off-by: Yuchen Zhang <yuchenz@nvidia.com>
@yczhang-nv
Contributor Author

/merge

@rapids-bot rapids-bot bot merged commit 26353b3 into NVIDIA:release/1.3 Sep 30, 2025
17 checks passed
@yczhang-nv yczhang-nv deleted the yuchen-fix-reconnect-mcp-auth branch October 1, 2025 23:35

Labels

bug (Something isn't working), non-breaking (Non-breaking change)


3 participants