Prevent retry storms in nested method calls#803
Prevent retry storms in nested method calls#803rapids-bot[bot] merged 3 commits intoNVIDIA:developfrom
Conversation
Added `instance_context_aware` flag to the retry decorator to avoid retry storms caused by nested method calls. This ensures retries are not triggered recursively in the same instance's execution context. Additional tests were implemented to verify functionality across edge cases, including nested, async, deep nesting, and multi-instance scenarios. Signed-off-by: dnandakumar-nv <dnandakumar@nvidia.com>
|
Warning Rate limit exceeded@dnandakumar-nv has exceeded the limit for the number of commits or files that can be reviewed per hour. Please wait 4 minutes and 18 seconds before requesting another review. ⌛ How to resolve this issue?After the wait time has elapsed, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout. Please see our FAQ for further information. 📒 Files selected for processing (1)
WalkthroughIntroduces an instance-aware retry context in the retry decorator to prevent nested retry storms. Adds a flag to track per-instance retry context, short-circuiting inner retries during outer retries across sync, async, and generator wrappers. Updates patch_with_retry to enable this behavior and adds comprehensive tests validating nested and async scenarios. Changes
Sequence Diagram(s)sequenceDiagram
autonumber
actor C as Caller
participant O as NestedService.outer_method
participant D as RetryDecorator (outer)
participant I as NestedService.inner_method
participant DI as RetryDecorator (inner)
Note over D: instance_context_aware=true
C->>D: call outer_method(...)
activate D
D->>D: check _in_retry_context(self)==False
D->>D: set _in_retry_context(self)=True
D->>O: invoke wrapped outer_method
activate O
O->>DI: call inner_method(...)
activate DI
DI->>DI: detect _in_retry_context(self)==True
DI->>I: execute once (no retries)
deactivate DI
O-->>D: result or exception
deactivate O
alt error and retryable
D->>D: backoff + retry (exponential)
D->>O: re-invoke outer_method
end
D->>D: clear _in_retry_context(self)
D-->>C: final result or raised error
deactivate D
Note over DI: Inner retries only when called<br/>outside any existing context
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Pre-merge checks and finishing touches❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
/ok to test 0f77e53 |
willkill07
left a comment
There was a problem hiding this comment.
One style nit, but otherwise LGTM
Replaces the previous retry context tracking mechanism with a dedicated `_RetryContext` class using a context manager. This improves code readability, ensures proper cleanup of retry state, and avoids redundant flags and manual state management. Signed-off-by: dnandakumar-nv <dnandakumar@nvidia.com>
There was a problem hiding this comment.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
tests/nat/utils/test_retry_wrapper.py (1)
175-180: Add@pytest.mark.asyncioor enableasyncio_mode=auto— pytest-asyncio defaults to strictpyproject.toml contains
pytest-asyncio==0.24.*(pyproject.toml:212); with noasyncio_modeset, pytest-asyncio defaults tostrict, so async tests must be explicitly marked or the config must be changed. (pytest-asyncio.readthedocs.io)+@pytest.mark.asyncio async def test_async_retry():
♻️ Duplicate comments (4)
src/nat/utils/exception_handlers/automatic_retries.py (4)
145-161: Fix: instance flag isn’t concurrency-safe; use ContextVar (per-task) instead.setting
_in_retry_contexton the instance leaks across concurrent calls on the same object (async tasks/threads), causing unrelated top-level calls to skip their own retries. Scope the “in-retry” marker to the current task/thread via ContextVar and remove attribute mutation entirely.Apply this diff to replace the helpers:
- def _check_retry_context(args) -> bool: - """Check if we're already in a retry context for this instance.""" - if not use_context_aware or not args: - return False - obj = args[0] - return getattr(obj, '_in_retry_context', False) - - def _set_retry_context(args, value: bool) -> None: - """Set the retry context flag on the instance.""" - if use_context_aware and args: - obj = args[0] - try: - object.__setattr__(obj, '_in_retry_context', value) - except Exception: - # If we can't set the attribute, just continue without context tracking - pass + # Per-task/thread context to avoid cross-call interference. + _IN_RETRY_CTX: "ContextVar[frozenset[int]]" + try: + _IN_RETRY_CTX # type: ignore[misc] + except NameError: + from contextvars import ContextVar + _IN_RETRY_CTX = ContextVar("_in_retry_ctx", default=frozenset()) + + def _in_retry_context(args) -> bool: + if not use_context_aware or not args: + return False + return id(args[0]) in _IN_RETRY_CTX.get() + + def _enter_retry_context(args): + if not use_context_aware or not args: + return None + current = _IN_RETRY_CTX.get() + return _IN_RETRY_CTX.set(current | {id(args[0])}) + + def _exit_retry_context(token) -> None: + if token is not None: + _IN_RETRY_CTX.reset(token)Note: add
from contextvars import ContextVarnear imports if not present.
211-232: Refactor sync-generator wrapper to use task/thread-local context.Prevents cross-call interference and removes attribute mutation.
- # Skip retries if already in retry context - if _check_retry_context(args): + # Skip retries if already in a retry context for this instance + if _in_retry_context(args): yield from fn(*args, **kw) return - - _set_retry_context(args, True) - try: + token = _enter_retry_context(args) + try: delay = base_delay for attempt in range(retries): call_args = copy.deepcopy(args) if use_deepcopy else args call_kwargs = copy.deepcopy(kw) if use_deepcopy else kw try: yield from fn(*call_args, **call_kwargs) return except retry_on as exc: if (not _want_retry(exc, code_patterns=retry_codes, msg_substrings=retry_on_messages) or attempt == retries - 1): raise time.sleep(delay) delay *= backoff finally: - _set_retry_context(args, False) + _exit_retry_context(token)
163-182: Refactor async wrapper to use task-local retry context.Removes fragile instance flag, eliminates try/finally boilerplate, and prevents cross-call bleed.
- # Skip retries if already in retry context - if _check_retry_context(args): + # Skip retries if already in a retry context for this instance + if _in_retry_context(args): return await fn(*args, **kw) - - _set_retry_context(args, True) - try: - delay = base_delay - for attempt in range(retries): - call_args = copy.deepcopy(args) if use_deepcopy else args - call_kwargs = copy.deepcopy(kw) if use_deepcopy else kw - try: - return await fn(*call_args, **call_kwargs) - except retry_on as exc: - if (not _want_retry(exc, code_patterns=retry_codes, msg_substrings=retry_on_messages) - or attempt == retries - 1): - raise - await asyncio.sleep(delay) - delay *= backoff - finally: - _set_retry_context(args, False) + token = _enter_retry_context(args) + try: + delay = base_delay + for attempt in range(retries): + call_args = copy.deepcopy(args) if use_deepcopy else args + call_kwargs = copy.deepcopy(kw) if use_deepcopy else kw + try: + return await fn(*call_args, **call_kwargs) + except retry_on as exc: + if (not _want_retry(exc, code_patterns=retry_codes, msg_substrings=retry_on_messages) + or attempt == retries - 1): + raise + await asyncio.sleep(delay) + delay *= backoff + finally: + _exit_retry_context(token)
235-254: Refactor sync wrapper to use task/thread-local context.Same fix as above.
- # Skip retries if already in retry context - if _check_retry_context(args): + # Skip retries if already in a retry context for this instance + if _in_retry_context(args): return fn(*args, **kw) - - _set_retry_context(args, True) - try: + token = _enter_retry_context(args) + try: delay = base_delay for attempt in range(retries): call_args = copy.deepcopy(args) if use_deepcopy else args call_kwargs = copy.deepcopy(kw) if use_deepcopy else kw try: return fn(*call_args, **call_kwargs) except retry_on as exc: if (not _want_retry(exc, code_patterns=retry_codes, msg_substrings=retry_on_messages) or attempt == retries - 1): raise time.sleep(delay) delay *= backoff finally: - _set_retry_context(args, False) + _exit_retry_context(token)
🧹 Nitpick comments (7)
src/nat/utils/exception_handlers/automatic_retries.py (3)
135-139: Docstring: clarify concurrency semantics ofinstance_context_aware.Explicitly state the context is task/thread-local (ContextVar), avoids cross-call interference, and only gates nested calls on the same instance.
instance_context_aware: - If True, the decorator will check for a retry context flag on the first - argument (assumed to be 'self'). If the flag is set, retries are skipped - to prevent retry storms in nested method calls. + If True, the decorator gates retries by a per‑task/thread context keyed + to the first argument (assumed to be `self`). When the instance is + already “in retry” within the current task/thread, inner calls skip + their own retry loops to prevent nested retry storms.
299-309: Behavioral change enabled by default: confirm no regressions or add an opt-out.
patch_with_retrynow always enables instance-aware gating. Confirm downstream callers aren’t relying on independent inner retries, or consider adding an opt-out parameter (defaulting to True) and documenting it.
92-101: Type-safety: guardcode is Nonebefore matching patterns.Prevents passing
Noneinto_code_matches(helps pyright). No behavior change.Suggested change (outside the current diff):
code = _extract_status_code(exc) if code is not None and any(_code_matches(code, p) for p in code_patterns): logger.info("Retrying on exception %s with matched code %s", exc, code) return Truetests/nat/utils/test_retry_wrapper.py (4)
123-139: Align fixture naming with repo guidelines.Use
fixture_prefix and setname=...in the decorator.-@pytest.fixture(autouse=True) -def fast_sleep(monkeypatch): +@pytest.fixture(name="fast_sleep", autouse=True) +def fixture_fast_sleep(monkeypatch):
253-254: Fix lint: replace ambiguous multiplication sign×in comments.Ruff RUF003 flags the symbol; use plain
x.- # This would result in inner_calls = 4 (2 attempts × 2 retries) + # This would result in inner_calls = 4 (2 attempts x 2 retries) ... - # The key point: inner_calls is NOT 9 (3 outer attempts × 3 inner retries each) + # The key point: inner_calls is NOT 9 (3 outer attempts x 3 inner retries each)Also applies to: 328-330
260-262: Comment correction: attempt count math.
outer_calls == 3corresponds to initial + 2 retries (inner failure forced a second outer retry), not “+ 1 retry”.- assert svc.outer_calls == 3 # Initial + 1 retry + assert svc.outer_calls == 3 # Initial + 2 retries (inner failure triggers second outer retry)
183-401: Add a concurrency test to prevent regressions.Ensure two concurrent top-level calls on the same instance don’t suppress each other’s retries (the ContextVar fix should make this pass).
@pytest.mark.asyncio async def test_concurrent_calls_do_not_share_retry_context(): svc = NestedService() svc = ar.patch_with_retry(svc, retries=2, base_delay=0, retry_codes=["5xx"]) svc.outer_failures = 1 svc.inner_failures = 1 async def run(): return await svc.async_outer() # Two concurrent top-level calls on the same instance r1, r2 = await asyncio.gather(run(), run()) assert r1 == "async-outer(async-inner-ok)" assert r2 == "async-outer(async-inner-ok)" # Each call should have done its own retries independently. assert svc.outer_calls == 4 assert svc.inner_calls == 2
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
src/nat/utils/exception_handlers/automatic_retries.py(3 hunks)tests/nat/utils/test_retry_wrapper.py(1 hunks)
🧰 Additional context used
📓 Path-based instructions (9)
src/**/*.py
📄 CodeRabbit inference engine (.cursor/rules/general.mdc)
src/**/*.py: All importable Python code must live under src/
All public APIs in src/ require Python 3.11+ type hints on parameters and return values; prefer typing/collections.abc abstractions; use typing.Annotated when useful
Provide Google-style docstrings for every public module, class, function, and CLI command; first line concise with a period; surround code entities with backticks
Files:
src/nat/utils/exception_handlers/automatic_retries.py
src/nat/**/*
📄 CodeRabbit inference engine (.cursor/rules/general.mdc)
Core functionality under src/nat should prioritize backward compatibility when changed
Files:
src/nat/utils/exception_handlers/automatic_retries.py
⚙️ CodeRabbit configuration file
This directory contains the core functionality of the toolkit. Changes should prioritize backward compatibility.
Files:
src/nat/utils/exception_handlers/automatic_retries.py
**/*.py
📄 CodeRabbit inference engine (.cursor/rules/general.mdc)
**/*.py: Follow PEP 8/20 style; format with yapf (column_limit=120) and use 4-space indentation; end files with a single newline
Run ruff (ruff check --fix) per pyproject.toml; fix warnings unless explicitly ignored; ruff is linter-only
Use snake_case for functions/variables, PascalCase for classes, and UPPER_CASE for constants
Treat pyright warnings as errors during development
Exception handling: preserve stack traces and avoid duplicate logging
When re-raising exceptions, use bareraiseand log with logger.error(), not logger.exception()
When catching and not re-raising, log with logger.exception() to capture stack trace
Validate and sanitize all user input; prefer httpx with SSL verification and follow OWASP Top‑10
Use async/await for I/O-bound work; profile CPU-heavy paths with cProfile/mprof; cache with functools.lru_cache or external cache; leverage NumPy vectorization when beneficial
**/*.py: Programmatic use: create TestLLMConfig(response_seq=[...], delay_ms=...), add with builder.add_llm("", cfg).
When retrieving the test LLM wrapper, use builder.get_llm(name, wrapper_type=LLMFrameworkEnum.) and call the framework’s method (e.g., ainvoke, achat, call).
Files:
src/nat/utils/exception_handlers/automatic_retries.pytests/nat/utils/test_retry_wrapper.py
**/*.{py,sh,md,yml,yaml,toml,ini,json,ipynb,txt,rst}
📄 CodeRabbit inference engine (.cursor/rules/general.mdc)
**/*.{py,sh,md,yml,yaml,toml,ini,json,ipynb,txt,rst}: Every file must start with the standard SPDX Apache-2.0 header; keep copyright years up‑to‑date
All source files must include the SPDX Apache‑2.0 header; do not bypass CI header checks
Files:
src/nat/utils/exception_handlers/automatic_retries.pytests/nat/utils/test_retry_wrapper.py
**/*.{py,md}
📄 CodeRabbit inference engine (.cursor/rules/general.mdc)
Never hard‑code version numbers in code or docs; versions are derived by setuptools‑scm
Files:
src/nat/utils/exception_handlers/automatic_retries.pytests/nat/utils/test_retry_wrapper.py
**/*.{py,yaml,yml}
📄 CodeRabbit inference engine (.cursor/rules/nat-test-llm.mdc)
**/*.{py,yaml,yml}: Configure response_seq as a list of strings; values cycle per call, and [] yields an empty string.
Configure delay_ms to inject per-call artificial latency in milliseconds for nat_test_llm.
Files:
src/nat/utils/exception_handlers/automatic_retries.pytests/nat/utils/test_retry_wrapper.py
**/*
⚙️ CodeRabbit configuration file
**/*: # Code Review Instructions
- Ensure the code follows best practices and coding standards. - For Python code, follow
PEP 20 and
PEP 8 for style guidelines.- Check for security vulnerabilities and potential issues. - Python methods should use type hints for all parameters and return values.
Example:def my_function(param1: int, param2: str) -> bool: pass- For Python exception handling, ensure proper stack trace preservation:
- When re-raising exceptions: use bare
raisestatements to maintain the original stack trace,
and uselogger.error()(notlogger.exception()) to avoid duplicate stack trace output.- When catching and logging exceptions without re-raising: always use
logger.exception()
to capture the full stack trace information.Documentation Review Instructions - Verify that documentation and comments are clear and comprehensive. - Verify that the documentation doesn't contain any TODOs, FIXMEs or placeholder text like "lorem ipsum". - Verify that the documentation doesn't contain any offensive or outdated terms. - Verify that documentation and comments are free of spelling mistakes, ensure the documentation doesn't contain any
words listed in the
ci/vale/styles/config/vocabularies/nat/reject.txtfile, words that might appear to be
spelling mistakes but are listed in theci/vale/styles/config/vocabularies/nat/accept.txtfile are OK.Misc. - All code (except .mdc files that contain Cursor rules) should be licensed under the Apache License 2.0,
and should contain an Apache License 2.0 header comment at the top of each file.
- Confirm that copyright years are up-to date whenever a file is changed.
Files:
src/nat/utils/exception_handlers/automatic_retries.pytests/nat/utils/test_retry_wrapper.py
tests/**/*.py
📄 CodeRabbit inference engine (.cursor/rules/general.mdc)
Unit tests must live under tests/ and use configured markers (e2e, integration, etc.)
Files:
tests/nat/utils/test_retry_wrapper.py
⚙️ CodeRabbit configuration file
tests/**/*.py: - Ensure that tests are comprehensive, cover edge cases, and validate the functionality of the code. - Test functions should be named using thetest_prefix, using snake_case. - Any frequently repeated code should be extracted into pytest fixtures. - Pytest fixtures should define the name argument when applying the pytest.fixture decorator. The fixture
function being decorated should be named using thefixture_prefix, using snake_case. Example:
@pytest.fixture(name="my_fixture")
def fixture_my_fixture():
pass
Files:
tests/nat/utils/test_retry_wrapper.py
**/tests/**/*.py
📄 CodeRabbit inference engine (.cursor/rules/general.mdc)
**/tests/**/*.py: Test functions must use the test_ prefix and snake_case
Extract repeated test code into pytest fixtures; fixtures should set name=... in @pytest.fixture and functions named with fixture_ prefix
Mark expensive tests with @pytest.mark.slow or @pytest.mark.integration
Use pytest with pytest-asyncio for async code; mock external services with pytest_httpserver or unittest.mock
Files:
tests/nat/utils/test_retry_wrapper.py
🧬 Code graph analysis (1)
tests/nat/utils/test_retry_wrapper.py (1)
src/nat/utils/exception_handlers/automatic_retries.py (1)
patch_with_retry(274-347)
🪛 Ruff (0.12.2)
src/nat/utils/exception_handlers/automatic_retries.py
158-160: try-except-pass detected, consider logging the exception
(S110)
158-158: Do not catch blind exception: Exception
(BLE001)
200-200: Consider moving this statement to an else block
(TRY300)
224-224: Consider moving this statement to an else block
(TRY300)
tests/nat/utils/test_retry_wrapper.py
253-253: Comment contains ambiguous × (MULTIPLICATION SIGN). Did you mean x (LATIN SMALL LETTER X)?
(RUF003)
328-328: Comment contains ambiguous × (MULTIPLICATION SIGN). Did you mean x (LATIN SMALL LETTER X)?
(RUF003)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
- GitHub Check: CI Pipeline / Test (arm64, 3.13)
- GitHub Check: CI Pipeline / Test (arm64, 3.12)
- GitHub Check: CI Pipeline / Test (amd64, 3.12)
- GitHub Check: CI Pipeline / Test (arm64, 3.11)
|
/ok to test fdc8776 |
|
/merge |
Added
instance_context_awareflag to the retry decorator to avoid retry storms caused by nested method calls. This ensures retries are not triggered recursively in the same instance's execution context. Additional tests were implemented to verify functionality across edge cases, including nested, async, deep nesting, and multi-instance scenarios.Description
Added
instance_context_awareflag to the retry decorator to avoid retry storms caused by nested method calls. This ensures retries are not triggered recursively in the same instance's execution context. Additional tests were implemented to verify functionality across edge cases, including nested, async, deep nesting, and multi-instance scenarios.Summary
Problem
patch_with_retrywrapped every public method with exponential backoff.Approach
instance_context_awareto_retry_decorator._in_retry_contextflag on the instance.patch_with_retryby default.Implementation
src/nat/utils/exception_handlers/automatic_retries.py_retry_decorator(...):instance_context_aware: bool = False._check_retry_context,_set_retry_context.patch_with_retry(...):instance_context_aware=True.tests/nat/utils/test_retry_wrapper.pytest_nested_retry_preventiontest_inner_method_can_still_retry_when_called_directlytest_deep_nestingtest_async_nested_retry_preventiontest_multiple_instances_dont_interferetest_exception_propagation_in_nested_callstest_retry_storm_prevention_with_all_methods_failingBehavior Change
Backward Compatibility
Performance/Robustness
Edge Cases
_in_retry_contextcannot be set on an object (e.g., restricted attributes), we gracefully skip context tracking for that instance; behavior reverts to previous semantics for that object.Testing
Files Changed
src/nat/utils/exception_handlers/automatic_retries.pytests/nat/utils/test_retry_wrapper.pyChecklist
Risk
By Submitting this PR I confirm:
Summary by CodeRabbit
Improvements
Bug Fixes
Tests