Improve the re-connect handling mechanism for MCP with auth#875
Conversation
Signed-off-by: Yuchen Zhang <yuchenz@nvidia.com>
|
Important Review skippedAuto reviews are disabled on base/target branches other than the default branch. Please check the settings in the CodeRabbit UI or the You can disable this status message by setting the WalkthroughIntroduces authentication-state tracking and dynamic timeout resolution in the MCP client. Adds auth_flow_timeout, adjusts tool_call_timeout defaults, and alters reconnect logic to skip reconnect during active authentication. Updates constructors and tool invocation paths accordingly. Adds comprehensive tests for auth token caching, timeout selection, and reconnect behavior. Changes
Sequence Diagram(s)sequenceDiagram
autonumber
participant App
participant MCPClient
participant AuthAdapter
participant Transport
App->>MCPClient: call_tool(tool, args)
MCPClient->>MCPClient: _has_cached_auth_token()
alt No cached token
MCPClient->>MCPClient: timeout = auth_flow_timeout
MCPClient->>AuthAdapter: set is_authenticating = true
else Cached token
MCPClient->>MCPClient: timeout = tool_call_timeout
end
MCPClient->>Transport: send tool call (timeout=resolved)
par Response or 401
Transport-->>MCPClient: result or 401
and Timeout
Transport--x MCPClient: TimeoutError
end
alt Timeout during active auth
MCPClient->>MCPClient: do not reconnect
MCPClient-->>App: raise auth timeout
else Other errors/timeouts
MCPClient->>MCPClient: _with_reconnect() per policy
MCPClient->>Transport: reconnect and retry (if enabled)
Transport-->>MCPClient: result or error
end
MCPClient->>AuthAdapter: finally set is_authenticating = false
MCPClient-->>App: tool result or error
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related PRs
Suggested labels
Pre-merge checks and finishing touches❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Signed-off-by: Yuchen Zhang <yuchenz@nvidia.com>
|
@coderabbitai review |
✅ Actions performedReview triggered.
|
There was a problem hiding this comment.
Actionable comments posted: 0
🧹 Nitpick comments (2)
tests/nat/mcp/test_mcp_auth_timeout.py (1)
489-605: LGTM - Comprehensive integration test coverage.The integration tests excellently verify:
- Correct timeout used with cached token
- Extended timeout used without cached token
- Dynamic timeout switching after authentication completes
Optional: Silence unused variable warnings.
Lines 527 and 566 unpack
argsbut never use it. Consider using_argsinstead to indicate intentionally unused:- args, kwargs = call_args[0] + _args, kwargs = call_args[0]packages/nvidia_nat_mcp/src/nat/plugins/mcp/client_base.py (1)
316-339: Encapsulate token cache check behind a public API
Rather than reaching into_auth_code_provider._authenticated_tokens, add a publichas_valid_cached_token()(e.g. on AuthProviderBase and implement it in MCPOAuth2Provider) so clients don’t depend on provider internals.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
packages/nvidia_nat_mcp/src/nat/plugins/mcp/client_base.py(14 hunks)tests/nat/mcp/test_mcp_auth_timeout.py(1 hunks)
🧰 Additional context used
📓 Path-based instructions (3)
**/*
⚙️ CodeRabbit configuration file
**/*: # Code Review Instructions
- Ensure the code follows best practices and coding standards. - For Python code, follow
PEP 20 and
PEP 8 for style guidelines.- Check for security vulnerabilities and potential issues. - Python methods should use type hints for all parameters and return values.
Example:def my_function(param1: int, param2: str) -> bool: pass- For Python exception handling, ensure proper stack trace preservation:
- When re-raising exceptions: use bare
raisestatements to maintain the original stack trace,
and uselogger.error()(notlogger.exception()) to avoid duplicate stack trace output.- When catching and logging exceptions without re-raising: always use
logger.exception()
to capture the full stack trace information.Documentation Review Instructions - Verify that documentation and comments are clear and comprehensive. - Verify that the documentation doesn't contain any TODOs, FIXMEs or placeholder text like "lorem ipsum". - Verify that the documentation doesn't contain any offensive or outdated terms. - Verify that documentation and comments are free of spelling mistakes, ensure the documentation doesn't contain any
words listed in the
ci/vale/styles/config/vocabularies/nat/reject.txtfile, words that might appear to be
spelling mistakes but are listed in theci/vale/styles/config/vocabularies/nat/accept.txtfile are OK.Misc. - All code (except .mdc files that contain Cursor rules) should be licensed under the Apache License 2.0,
and should contain an Apache License 2.0 header comment at the top of each file.
- Confirm that copyright years are up-to date whenever a file is changed.
Files:
packages/nvidia_nat_mcp/src/nat/plugins/mcp/client_base.pytests/nat/mcp/test_mcp_auth_timeout.py
packages/**/*
⚙️ CodeRabbit configuration file
packages/**/*: - This directory contains optional plugin packages for the toolkit, each should contain apyproject.tomlfile. - Thepyproject.tomlfile should declare a dependency onnvidia-nator another package with a name starting
withnvidia-nat-. This dependency should be declared using~=<version>, and the version should be a two
digit version (ex:~=1.0).
- Not all packages contain Python code, if they do they should also contain their own set of tests, in a
tests/directory at the same level as thepyproject.tomlfile.
Files:
packages/nvidia_nat_mcp/src/nat/plugins/mcp/client_base.py
tests/**/*.py
⚙️ CodeRabbit configuration file
tests/**/*.py: - Ensure that tests are comprehensive, cover edge cases, and validate the functionality of the code. - Test functions should be named using thetest_prefix, using snake_case. - Any frequently repeated code should be extracted into pytest fixtures. - Pytest fixtures should define the name argument when applying the pytest.fixture decorator. The fixture
function being decorated should be named using thefixture_prefix, using snake_case. Example:
@pytest.fixture(name="my_fixture")
def fixture_my_fixture():
pass
Files:
tests/nat/mcp/test_mcp_auth_timeout.py
🪛 Ruff (0.13.1)
packages/nvidia_nat_mcp/src/nat/plugins/mcp/client_base.py
298-298: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
299-301: Avoid specifying long messages outside the exception class
(TRY003)
303-303: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
336-336: Consider moving this statement to an else block
(TRY300)
337-337: Do not catch blind exception: Exception
(BLE001)
tests/nat/mcp/test_mcp_auth_timeout.py
121-121: Possible hardcoded password assigned to argument: "client_secret"
(S106)
145-145: Possible hardcoded password assigned to argument: "client_secret"
(S106)
170-170: Possible hardcoded password assigned to argument: "client_secret"
(S106)
191-191: Possible hardcoded password assigned to argument: "client_secret"
(S106)
240-240: Possible hardcoded password assigned to argument: "client_secret"
(S106)
270-270: Possible hardcoded password assigned to argument: "client_secret"
(S106)
299-299: Possible hardcoded password assigned to argument: "client_secret"
(S106)
313-313: Possible hardcoded password assigned to argument: "client_secret"
(S106)
332-332: Possible hardcoded password assigned to argument: "client_secret"
(S106)
353-353: Avoid specifying long messages outside the exception class
(TRY003)
369-369: Possible hardcoded password assigned to argument: "client_secret"
(S106)
390-390: Avoid specifying long messages outside the exception class
(TRY003)
406-406: Possible hardcoded password assigned to argument: "client_secret"
(S106)
432-432: Avoid specifying long messages outside the exception class
(TRY003)
448-448: Possible hardcoded password assigned to argument: "client_secret"
(S106)
474-474: Avoid specifying long messages outside the exception class
(TRY003)
498-498: Possible hardcoded password assigned to argument: "client_secret"
(S106)
527-527: Unpacked variable args is never used
Prefix it with an underscore or any other dummy variable pattern
(RUF059)
540-540: Possible hardcoded password assigned to argument: "client_secret"
(S106)
566-566: Unpacked variable args is never used
Prefix it with an underscore or any other dummy variable pattern
(RUF059)
579-579: Possible hardcoded password assigned to argument: "client_secret"
(S106)
🔇 Additional comments (16)
packages/nvidia_nat_mcp/src/nat/plugins/mcp/client_base.py (10)
62-63: LGTM - Clean authentication state tracking.The addition of the
is_authenticatingflag provides a clear mechanism to coordinate reconnection logic with active authentication flows.
90-104: LGTM - Proper authentication flow state management.The implementation correctly:
- Sets the authentication flag before the retry attempt
- Resets the flag in a
finallyblock to ensure cleanup even on exceptions- Uses
raiseto preserve the stack trace (per coding guidelines)
166-178: LGTM - Well-designed timeout configuration.The timeout parameters are well-chosen:
- 60s default for normal tool calls (increased from previous 5s) is reasonable
- 300s for interactive authentication flows provides adequate time for user interaction
- Clear documentation distinguishes between the two timeouts
Also applies to: 197-198
341-355: LGTM - Clean dynamic timeout resolution.The method appropriately selects timeout based on authentication state and provides helpful debug logging for the extended timeout case.
286-314: LGTM - Correct authentication-aware reconnect logic.The implementation properly:
- Skips reconnection during active authentication flows
- Provides clear timeout messaging for abandoned authentication
- Uses
logger.error()with bareraise(correct per coding guidelines for re-raising)- Falls through to normal reconnect logic for non-auth errors
Note: The static analysis hints suggesting
logger.exception()are incorrect in this context—the coding guidelines explicitly state to uselogger.error()when re-raising to avoid duplicate stack traces.
383-391: LGTM - Proper parent client wiring.The change from passing
tool_call_timeoutdirectly to passingparent_client=selfenables dynamic timeout resolution based on authentication state.
460-474: LGTM - Consistent timeout parameter propagation across transport clients.All three transport implementations (SSE, Stdio, StreamableHTTP) correctly:
- Accept the new timeout parameters
- Use consistent defaults (60s, 300s)
- Propagate parameters to the base class
Also applies to: 509-525, 569-585
622-636: LGTM - Simplified tool client with proper parent delegation.The refactoring correctly removes timeout management from
MCPToolClientand delegates to the parent client, which can dynamically determine the appropriate timeout based on authentication state.Also applies to: 664-699
433-439: Confirmrequest_read_timeout_secondsaccepts atimedelta. The code currently passes adatetime.timedelta—ifClientSession.send_requestexpects a numeric value, convert viatimeout.total_seconds().
444-449: Verifyread_timeout_secondsacceptstimedelta
We weren’t able to locateClientSession.call_tool’s signature in this repo—please confirm whether it accepts atimedeltaforread_timeout_secondsor if you need to passtimeout.total_seconds()instead.tests/nat/mcp/test_mcp_auth_timeout.py (6)
43-74: LGTM - Well-structured test mocks.The mock infrastructure properly:
- Implements the
MCPBaseClientinterface- Provides async context manager protocol
- Uses
AsyncMockfor session mocking- Supports side effect injection for testing different scenarios
Note: Static analysis warnings about hardcoded passwords are false positives—these are test fixtures, not production secrets.
82-98: LGTM - Essential configuration validation.The tests appropriately verify:
- Custom timeout parameters are stored correctly
- Default values match expectations (60s tool, 300s auth)
106-211: LGTM - Comprehensive token cache validation tests.Excellent test coverage including:
- No auth provider case
- Valid cached token
- Expired token
- Empty token cache
- Multiple tokens with mixed validity
The tests properly mock the internal auth provider structure and verify all edge cases.
219-285: LGTM - Complete timeout selection logic coverage.The tests verify all three timeout selection scenarios:
- No auth provider → normal timeout
- Cached token → normal timeout
- No cached token → extended auth timeout
293-318: LGTM - Essential AuthAdapter state verification.The tests verify:
- Initial authentication state is
False- AuthAdapter properly stores the auth provider reference
326-481: LGTM - Thorough reconnect behavior validation.Excellent test coverage of the reconnect logic matrix:
- ✅ Timeout during auth → no reconnect, specific error message
- ✅ Error during auth → no reconnect, propagates original error
- ✅ Timeout when not authenticating → reconnect attempted
- ✅ Error when not authenticating → reconnect attempted
The tests properly verify reconnect call counts and use fast backoff timings for efficient test execution.
…econnect-mcp-auth Signed-off-by: Yuchen Zhang <yuchenz@nvidia.com>
Signed-off-by: Yuchen Zhang <yuchenz@nvidia.com>
Signed-off-by: Yuchen Zhang <yuchenz@nvidia.com>
Signed-off-by: Yuchen Zhang <yuchenz@nvidia.com>
Signed-off-by: Yuchen Zhang <yuchenz@nvidia.com>
|
/merge |
Description
Added
auth_flow_timeoutto better handle the mcp Client timeout and reconnection logic with auth.Two timeout options now:
tool_call_timeout(shorter) andauth_flow_timeout(longer). The rules are:The user experience will be like: the first tool call with auth will get 300s timeout, and the client won't try to re-connect if it hits the timeout. After the first success authentication, the following tool calls with cached token will get 60s timeout, and the client will try to reconnect after it hits 60s.
Closes AIQ-1966
By Submitting this PR I confirm:
Summary by CodeRabbit
New Features
Tests