Prevent retry storms in nested method calls by dnandakumar-nv · Pull Request #803 · NVIDIA/NeMo-Agent-Toolkit

dnandakumar-nv · 2025-09-16T20:13:47Z

Added instance_context_aware flag to the retry decorator to avoid retry storms caused by nested method calls. This ensures retries are not triggered recursively in the same instance's execution context. Additional tests were implemented to verify functionality across edge cases, including nested, async, deep nesting, and multi-instance scenarios.

Description

Added instance_context_aware flag to the retry decorator to avoid retry storms caused by nested method calls. This ensures retries are not triggered recursively in the same instance's execution context. Additional tests were implemented to verify functionality across edge cases, including nested, async, deep nesting, and multi-instance scenarios.

Summary

Fixes exponential “retry storms” when a public method (with retries) calls other public methods (also wrapped with retries) on the same instance.
Ensures, per instance, only the outermost call performs retries; nested calls execute once per outer attempt.
Adds comprehensive tests for sync, async, deep nesting, multi-instance isolation, and exception propagation.

Problem

patch_with_retry wrapped every public method with exponential backoff.
When method A calls method B (both wrapped), failures caused compounded retries (e.g., 3×3 = 9 calls), creating a retry storm and unexpected behavior.

Approach

Introduce an instance-level retry context to gate nested retries:
- Add instance_context_aware to _retry_decorator.
- When a wrapped method starts retries, set a _in_retry_context flag on the instance.
- Nested calls on the same instance detect the flag and skip their own retry loops, letting the outermost call control retries.
- Flag is set/unset with try/finally to avoid leaks; if setting the flag fails (immutable/slots), we safely no-op and continue without context tracking.
Enable this behavior from patch_with_retry by default.

Implementation

File: src/nat/utils/exception_handlers/automatic_retries.py
- _retry_decorator(...):
  - New param: instance_context_aware: bool = False.
  - New helpers: _check_retry_context, _set_retry_context.
  - Apply context gating to sync, async, and (a)generator wrappers.
- patch_with_retry(...):
  - Pass instance_context_aware=True.
Tests: tests/nat/utils/test_retry_wrapper.py
- Added/updated:
  - test_nested_retry_prevention
  - test_inner_method_can_still_retry_when_called_directly
  - test_deep_nesting
  - test_async_nested_retry_prevention
  - test_multiple_instances_dont_interfere
  - test_exception_propagation_in_nested_calls
  - test_retry_storm_prevention_with_all_methods_failing

Behavior Change

Previously: nested public calls each executed their own retry loops (e.g., O(retries^depth)).
Now: only the outermost call retries; nested calls run once per outer attempt (O(retries × depth)). Direct calls to methods still retry as configured.

Backward Compatibility

Public API unchanged.
Behavior change only affects nested calls within the same instance; direct method calls retain original retry behavior.
If any user relied on nested retries compounding, they’ll now see fewer total attempts (intended fix).

Performance/Robustness

Eliminates exponential blow-ups in requests and sleep delays.
Reduces load, latency, and noise from nested backoff.

Edge Cases

If _in_retry_context cannot be set on an object (e.g., restricted attributes), we gracefully skip context tracking for that instance; behavior reverts to previous semantics for that object.

Testing

Covers sync, async, generator pathways.
Verifies:
- Nested calls don’t compound retries.
- Direct calls still retry as configured.
- Deep nesting behaves linearly with the outer retry count.
- Multiple instances are isolated.
- Exceptions propagate as expected.

Files Changed

src/nat/utils/exception_handlers/automatic_retries.py
tests/nat/utils/test_retry_wrapper.py

Checklist

Prevent retry storms via instance context
Sync/async/(a)gen parity
Unit tests added and adjusted
No public API changes
Comments and naming are clear

Risk

Low. Targeted change with comprehensive tests; default behavior remains intuitive and safer under nesting.

By Submitting this PR I confirm:

I am familiar with the Contributing Guidelines.
We require that all contributors "sign-off" on their commits. This certifies that the contribution is your original work, or you have rights to submit it under the same license, or a compatible license.
- Any contribution which contains commits that are not Signed-Off will not be accepted.
When the PR is ready for review, new or existing tests cover these changes.
When the PR is ready for review, the documentation is up to date with these changes.

Summary by CodeRabbit

Improvements
- Instance-aware retry handling to prevent redundant retries in nested calls (sync and async), improving reliability, performance, and reducing duplicate requests and rate-limit pressure.
Bug Fixes
- Eliminated retry storms when a retried method invokes other retried methods on the same object; outer retries proceed while inner calls execute once as intended.
Tests
- Comprehensive tests covering nested, deep, async, multi-instance, and failure-propagation scenarios to ensure correct behavior and stability.

Added `instance_context_aware` flag to the retry decorator to avoid retry storms caused by nested method calls. This ensures retries are not triggered recursively in the same instance's execution context. Additional tests were implemented to verify functionality across edge cases, including nested, async, deep nesting, and multi-instance scenarios. Signed-off-by: dnandakumar-nv <dnandakumar@nvidia.com>

…storms

copy-pr-bot · 2025-09-16T20:13:51Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

coderabbitai · 2025-09-16T20:13:54Z

Warning

Rate limit exceeded

@dnandakumar-nv has exceeded the limit for the number of commits or files that can be reviewed per hour. Please wait 4 minutes and 18 seconds before requesting another review.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

📥 Commits

Reviewing files that changed from the base of the PR and between 0f77e53 and fdc8776.

📒 Files selected for processing (1)

src/nat/utils/exception_handlers/automatic_retries.py (3 hunks)

Walkthrough

Introduces an instance-aware retry context in the retry decorator to prevent nested retry storms. Adds a flag to track per-instance retry context, short-circuiting inner retries during outer retries across sync, async, and generator wrappers. Updates patch_with_retry to enable this behavior and adds comprehensive tests validating nested and async scenarios.

Changes

Cohort / File(s)	Summary
Retry decorator: instance-aware context `src/nat/utils/exception_handlers/automatic_retries.py`	Adds instance_context_aware flag to `_retry_decorator`; implements per-instance `_in_retry_context` via `_check_retry_context` and `_set_retry_context`; updates `_call_with_retry_async`, `_agen_with_retry`, `_gen_with_retry`, `_sync_with_retry` to short-circuit retries when already in context; `patch_with_retry` passes `instance_context_aware=True`; updates docstring.
Tests: nested retry storm prevention `tests/nat/utils/test_retry_wrapper.py`	Adds `NestedService` with sync/async nested methods and counters; introduces tests covering nested sync/async flows, deep nesting, direct inner retries, all-failing cases, instance isolation, and exception propagation; uses `ar.patch_with_retry(retries=3, base_delay=0, retry_codes=...)`.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  actor C as Caller
  participant O as NestedService.outer_method
  participant D as RetryDecorator (outer)
  participant I as NestedService.inner_method
  participant DI as RetryDecorator (inner)

  Note over D: instance_context_aware=true

  C->>D: call outer_method(...)
  activate D
  D->>D: check _in_retry_context(self)==False
  D->>D: set _in_retry_context(self)=True
  D->>O: invoke wrapped outer_method
  activate O

  O->>DI: call inner_method(...)
  activate DI
  DI->>DI: detect _in_retry_context(self)==True
  DI->>I: execute once (no retries)
  deactivate DI

  O-->>D: result or exception
  deactivate O

  alt error and retryable
    D->>D: backoff + retry (exponential)
    D->>O: re-invoke outer_method
  end

  D->>D: clear _in_retry_context(self)
  D-->>C: final result or raised error
  deactivate D

  Note over DI: Inner retries only when called<br/>outside any existing context

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 73.91% which is insufficient. The required threshold is 80.00%.	You can run `@coderabbitai generate docstrings` to improve docstring coverage.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title Check	✅ Passed	The title "Prevent retry storms in nested method calls" is concise, uses the imperative mood, and accurately summarizes the primary change (adding instance-level retry context to avoid nested retry storms); it is specific and remains well under the ~72-character guideline. It directly reflects the PR objectives and is clear for reviewers.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

dnandakumar-nv · 2025-09-16T20:14:44Z

/ok to test 0f77e53

willkill07

One style nit, but otherwise LGTM

src/nat/utils/exception_handlers/automatic_retries.py

Replaces the previous retry context tracking mechanism with a dedicated `_RetryContext` class using a context manager. This improves code readability, ensures proper cleanup of retry state, and avoids redundant flags and manual state management. Signed-off-by: dnandakumar-nv <dnandakumar@nvidia.com>

coderabbitai

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

tests/nat/utils/test_retry_wrapper.py (1)
175-180: Add @pytest.mark.asyncio or enable asyncio_mode=auto — pytest-asyncio defaults to strict

pyproject.toml contains pytest-asyncio==0.24.* (pyproject.toml:212); with no asyncio_mode set, pytest-asyncio defaults to strict, so async tests must be explicitly marked or the config must be changed. (pytest-asyncio.readthedocs.io)
+@pytest.mark.asyncio
 async def test_async_retry():

♻️ Duplicate comments (4)

src/nat/utils/exception_handlers/automatic_retries.py (4)

145-161: Fix: instance flag isn’t concurrency-safe; use ContextVar (per-task) instead.

setting _in_retry_context on the instance leaks across concurrent calls on the same object (async tasks/threads), causing unrelated top-level calls to skip their own retries. Scope the “in-retry” marker to the current task/thread via ContextVar and remove attribute mutation entirely.

Apply this diff to replace the helpers:

-        def _check_retry_context(args) -> bool:
-            """Check if we're already in a retry context for this instance."""
-            if not use_context_aware or not args:
-                return False
-            obj = args[0]
-            return getattr(obj, '_in_retry_context', False)
-
-        def _set_retry_context(args, value: bool) -> None:
-            """Set the retry context flag on the instance."""
-            if use_context_aware and args:
-                obj = args[0]
-                try:
-                    object.__setattr__(obj, '_in_retry_context', value)
-                except Exception:
-                    # If we can't set the attribute, just continue without context tracking
-                    pass
+        # Per-task/thread context to avoid cross-call interference.
+        _IN_RETRY_CTX: "ContextVar[frozenset[int]]"
+        try:
+            _IN_RETRY_CTX  # type: ignore[misc]
+        except NameError:
+            from contextvars import ContextVar
+            _IN_RETRY_CTX = ContextVar("_in_retry_ctx", default=frozenset())
+
+        def _in_retry_context(args) -> bool:
+            if not use_context_aware or not args:
+                return False
+            return id(args[0]) in _IN_RETRY_CTX.get()
+
+        def _enter_retry_context(args):
+            if not use_context_aware or not args:
+                return None
+            current = _IN_RETRY_CTX.get()
+            return _IN_RETRY_CTX.set(current | {id(args[0])})
+
+        def _exit_retry_context(token) -> None:
+            if token is not None:
+                _IN_RETRY_CTX.reset(token)

Note: add from contextvars import ContextVar near imports if not present.

211-232: Refactor sync-generator wrapper to use task/thread-local context.

Prevents cross-call interference and removes attribute mutation.

-            # Skip retries if already in retry context
-            if _check_retry_context(args):
+            # Skip retries if already in a retry context for this instance
+            if _in_retry_context(args):
                 yield from fn(*args, **kw)
                 return
-
-            _set_retry_context(args, True)
-            try:
+            token = _enter_retry_context(args)
+            try:
                 delay = base_delay
                 for attempt in range(retries):
                     call_args = copy.deepcopy(args) if use_deepcopy else args
                     call_kwargs = copy.deepcopy(kw) if use_deepcopy else kw
                     try:
                         yield from fn(*call_args, **call_kwargs)
                         return
                     except retry_on as exc:
                         if (not _want_retry(exc, code_patterns=retry_codes, msg_substrings=retry_on_messages)
                                 or attempt == retries - 1):
                             raise
                         time.sleep(delay)
                         delay *= backoff
             finally:
-                _set_retry_context(args, False)
+                _exit_retry_context(token)

163-182: Refactor async wrapper to use task-local retry context.

Removes fragile instance flag, eliminates try/finally boilerplate, and prevents cross-call bleed.

-            # Skip retries if already in retry context
-            if _check_retry_context(args):
+            # Skip retries if already in a retry context for this instance
+            if _in_retry_context(args):
                 return await fn(*args, **kw)
-
-            _set_retry_context(args, True)
-            try:
-                delay = base_delay
-                for attempt in range(retries):
-                    call_args = copy.deepcopy(args) if use_deepcopy else args
-                    call_kwargs = copy.deepcopy(kw) if use_deepcopy else kw
-                    try:
-                        return await fn(*call_args, **call_kwargs)
-                    except retry_on as exc:
-                        if (not _want_retry(exc, code_patterns=retry_codes, msg_substrings=retry_on_messages)
-                                or attempt == retries - 1):
-                            raise
-                        await asyncio.sleep(delay)
-                        delay *= backoff
-            finally:
-                _set_retry_context(args, False)
+            token = _enter_retry_context(args)
+            try:
+                delay = base_delay
+                for attempt in range(retries):
+                    call_args = copy.deepcopy(args) if use_deepcopy else args
+                    call_kwargs = copy.deepcopy(kw) if use_deepcopy else kw
+                    try:
+                        return await fn(*call_args, **call_kwargs)
+                    except retry_on as exc:
+                        if (not _want_retry(exc, code_patterns=retry_codes, msg_substrings=retry_on_messages)
+                                or attempt == retries - 1):
+                            raise
+                        await asyncio.sleep(delay)
+                        delay *= backoff
+            finally:
+                _exit_retry_context(token)

235-254: Refactor sync wrapper to use task/thread-local context.

Same fix as above.

-            # Skip retries if already in retry context
-            if _check_retry_context(args):
+            # Skip retries if already in a retry context for this instance
+            if _in_retry_context(args):
                 return fn(*args, **kw)
-
-            _set_retry_context(args, True)
-            try:
+            token = _enter_retry_context(args)
+            try:
                 delay = base_delay
                 for attempt in range(retries):
                     call_args = copy.deepcopy(args) if use_deepcopy else args
                     call_kwargs = copy.deepcopy(kw) if use_deepcopy else kw
                     try:
                         return fn(*call_args, **call_kwargs)
                     except retry_on as exc:
                         if (not _want_retry(exc, code_patterns=retry_codes, msg_substrings=retry_on_messages)
                                 or attempt == retries - 1):
                             raise
                         time.sleep(delay)
                         delay *= backoff
             finally:
-                _set_retry_context(args, False)
+                _exit_retry_context(token)

🧹 Nitpick comments (7)

src/nat/utils/exception_handlers/automatic_retries.py (3)
135-139: Docstring: clarify concurrency semantics of instance_context_aware.

Explicitly state the context is task/thread-local (ContextVar), avoids cross-call interference, and only gates nested calls on the same instance.
     instance_context_aware:
-        If True, the decorator will check for a retry context flag on the first
-        argument (assumed to be 'self'). If the flag is set, retries are skipped
-        to prevent retry storms in nested method calls.
+        If True, the decorator gates retries by a per‑task/thread context keyed
+        to the first argument (assumed to be `self`). When the instance is
+        already “in retry” within the current task/thread, inner calls skip
+        their own retry loops to prevent nested retry storms.
299-309: Behavioral change enabled by default: confirm no regressions or add an opt-out.

patch_with_retry now always enables instance-aware gating. Confirm downstream callers aren’t relying on independent inner retries, or consider adding an opt-out parameter (defaulting to True) and documenting it.

92-101: Type-safety: guard code is None before matching patterns.

Prevents passing None into _code_matches (helps pyright). No behavior change.

Suggested change (outside the current diff):
code = _extract_status_code(exc)
if code is not None and any(_code_matches(code, p) for p in code_patterns):
    logger.info("Retrying on exception %s with matched code %s", exc, code)
    return True
tests/nat/utils/test_retry_wrapper.py (4)
123-139: Align fixture naming with repo guidelines.

Use fixture_ prefix and set name=... in the decorator.
-@pytest.fixture(autouse=True)
-def fast_sleep(monkeypatch):
+@pytest.fixture(name="fast_sleep", autouse=True)
+def fixture_fast_sleep(monkeypatch):
253-254: Fix lint: replace ambiguous multiplication sign × in comments.

Ruff RUF003 flags the symbol; use plain x.
-    # This would result in inner_calls = 4 (2 attempts × 2 retries)
+    # This would result in inner_calls = 4 (2 attempts x 2 retries)
...
-    # The key point: inner_calls is NOT 9 (3 outer attempts × 3 inner retries each)
+    # The key point: inner_calls is NOT 9 (3 outer attempts x 3 inner retries each)
Also applies to: 328-330

260-262: Comment correction: attempt count math.

outer_calls == 3 corresponds to initial + 2 retries (inner failure forced a second outer retry), not “+ 1 retry”.
-    assert svc.outer_calls == 3  # Initial + 1 retry
+    assert svc.outer_calls == 3  # Initial + 2 retries (inner failure triggers second outer retry)
183-401: Add a concurrency test to prevent regressions.

Ensure two concurrent top-level calls on the same instance don’t suppress each other’s retries (the ContextVar fix should make this pass).
@pytest.mark.asyncio
async def test_concurrent_calls_do_not_share_retry_context():
    svc = NestedService()
    svc = ar.patch_with_retry(svc, retries=2, base_delay=0, retry_codes=["5xx"])
    svc.outer_failures = 1
    svc.inner_failures = 1

    async def run():
        return await svc.async_outer()

    # Two concurrent top-level calls on the same instance
    r1, r2 = await asyncio.gather(run(), run())
    assert r1 == "async-outer(async-inner-ok)"
    assert r2 == "async-outer(async-inner-ok)"
    # Each call should have done its own retries independently.
    assert svc.outer_calls == 4
    assert svc.inner_calls == 2

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between dc32961 and 0f77e53.

📒 Files selected for processing (2)

src/nat/utils/exception_handlers/automatic_retries.py (3 hunks)
tests/nat/utils/test_retry_wrapper.py (1 hunks)

🧰 Additional context used

📓 Path-based instructions (9)

src/**/*.py

📄 CodeRabbit inference engine (.cursor/rules/general.mdc)

src/**/*.py: All importable Python code must live under src/
All public APIs in src/ require Python 3.11+ type hints on parameters and return values; prefer typing/collections.abc abstractions; use typing.Annotated when useful
Provide Google-style docstrings for every public module, class, function, and CLI command; first line concise with a period; surround code entities with backticks

Files:

src/nat/utils/exception_handlers/automatic_retries.py

src/nat/**/*

📄 CodeRabbit inference engine (.cursor/rules/general.mdc)

Core functionality under src/nat should prioritize backward compatibility when changed

Files:

src/nat/utils/exception_handlers/automatic_retries.py

⚙️ CodeRabbit configuration file

This directory contains the core functionality of the toolkit. Changes should prioritize backward compatibility.

Files:

src/nat/utils/exception_handlers/automatic_retries.py

**/*.py

📄 CodeRabbit inference engine (.cursor/rules/general.mdc)

**/*.py: Follow PEP 8/20 style; format with yapf (column_limit=120) and use 4-space indentation; end files with a single newline
Run ruff (ruff check --fix) per pyproject.toml; fix warnings unless explicitly ignored; ruff is linter-only
Use snake_case for functions/variables, PascalCase for classes, and UPPER_CASE for constants
Treat pyright warnings as errors during development
Exception handling: preserve stack traces and avoid duplicate logging
When re-raising exceptions, use bare raise and log with logger.error(), not logger.exception()
When catching and not re-raising, log with logger.exception() to capture stack trace
Validate and sanitize all user input; prefer httpx with SSL verification and follow OWASP Top‑10
Use async/await for I/O-bound work; profile CPU-heavy paths with cProfile/mprof; cache with functools.lru_cache or external cache; leverage NumPy vectorization when beneficial

**/*.py: Programmatic use: create TestLLMConfig(response_seq=[...], delay_ms=...), add with builder.add_llm("", cfg).
When retrieving the test LLM wrapper, use builder.get_llm(name, wrapper_type=LLMFrameworkEnum.) and call the framework’s method (e.g., ainvoke, achat, call).

Files:

src/nat/utils/exception_handlers/automatic_retries.py
tests/nat/utils/test_retry_wrapper.py

**/*.{py,sh,md,yml,yaml,toml,ini,json,ipynb,txt,rst}

📄 CodeRabbit inference engine (.cursor/rules/general.mdc)

**/*.{py,sh,md,yml,yaml,toml,ini,json,ipynb,txt,rst}: Every file must start with the standard SPDX Apache-2.0 header; keep copyright years up‑to‑date
All source files must include the SPDX Apache‑2.0 header; do not bypass CI header checks

Files:

src/nat/utils/exception_handlers/automatic_retries.py
tests/nat/utils/test_retry_wrapper.py

**/*.{py,md}

📄 CodeRabbit inference engine (.cursor/rules/general.mdc)

Never hard‑code version numbers in code or docs; versions are derived by setuptools‑scm

Files:

src/nat/utils/exception_handlers/automatic_retries.py
tests/nat/utils/test_retry_wrapper.py

**/*.{py,yaml,yml}

📄 CodeRabbit inference engine (.cursor/rules/nat-test-llm.mdc)

**/*.{py,yaml,yml}: Configure response_seq as a list of strings; values cycle per call, and [] yields an empty string.
Configure delay_ms to inject per-call artificial latency in milliseconds for nat_test_llm.

Files:

src/nat/utils/exception_handlers/automatic_retries.py
tests/nat/utils/test_retry_wrapper.py

**/*

⚙️ CodeRabbit configuration file

**/*: # Code Review Instructions
Ensure the code follows best practices and coding standards. - For Python code, follow
PEP 20 and
PEP 8 for style guidelines.
Check for security vulnerabilities and potential issues. - Python methods should use type hints for all parameters and return values.
Example:
def my_function(param1: int, param2: str) -> bool:
    pass
For Python exception handling, ensure proper stack trace preservation:

When re-raising exceptions: use bare raise statements to maintain the original stack trace,
and use logger.error() (not logger.exception()) to avoid duplicate stack trace output.

When catching and logging exceptions without re-raising: always use logger.exception()
to capture the full stack trace information.
Documentation Review Instructions - Verify that documentation and comments are clear and comprehensive. - Verify that the documentation doesn't contain any TODOs, FIXMEs or placeholder text like "lorem ipsum". - Verify that the documentation doesn't contain any offensive or outdated terms. - Verify that documentation and comments are free of spelling mistakes, ensure the documentation doesn't contain any

words listed in the ci/vale/styles/config/vocabularies/nat/reject.txt file, words that might appear to be
spelling mistakes but are listed in the ci/vale/styles/config/vocabularies/nat/accept.txt file are OK.

Misc. - All code (except .mdc files that contain Cursor rules) should be licensed under the Apache License 2.0,

and should contain an Apache License 2.0 header comment at the top of each file.

Confirm that copyright years are up-to date whenever a file is changed.

Files:

src/nat/utils/exception_handlers/automatic_retries.py
tests/nat/utils/test_retry_wrapper.py

tests/**/*.py

📄 CodeRabbit inference engine (.cursor/rules/general.mdc)

Unit tests must live under tests/ and use configured markers (e2e, integration, etc.)

Files:

tests/nat/utils/test_retry_wrapper.py

⚙️ CodeRabbit configuration file

tests/**/*.py: - Ensure that tests are comprehensive, cover edge cases, and validate the functionality of the code. - Test functions should be named using the test_ prefix, using snake_case. - Any frequently repeated code should be extracted into pytest fixtures. - Pytest fixtures should define the name argument when applying the pytest.fixture decorator. The fixture
function being decorated should be named using the fixture_ prefix, using snake_case. Example:
@pytest.fixture(name="my_fixture")
def fixture_my_fixture():
pass

Files:

tests/nat/utils/test_retry_wrapper.py

**/tests/**/*.py

📄 CodeRabbit inference engine (.cursor/rules/general.mdc)

**/tests/**/*.py: Test functions must use the test_ prefix and snake_case
Extract repeated test code into pytest fixtures; fixtures should set name=... in @pytest.fixture and functions named with fixture_ prefix
Mark expensive tests with @pytest.mark.slow or @pytest.mark.integration
Use pytest with pytest-asyncio for async code; mock external services with pytest_httpserver or unittest.mock

Files:

tests/nat/utils/test_retry_wrapper.py

🧬 Code graph analysis (1)

tests/nat/utils/test_retry_wrapper.py (1)

src/nat/utils/exception_handlers/automatic_retries.py (1)

patch_with_retry (274-347)

🪛 Ruff (0.12.2)

src/nat/utils/exception_handlers/automatic_retries.py

158-160: try-except-pass detected, consider logging the exception

(S110)

158-158: Do not catch blind exception: Exception

(BLE001)

200-200: Consider moving this statement to an else block

(TRY300)

224-224: Consider moving this statement to an else block

(TRY300)

tests/nat/utils/test_retry_wrapper.py

253-253: Comment contains ambiguous × (MULTIPLICATION SIGN). Did you mean x (LATIN SMALL LETTER X)?

(RUF003)

328-328: Comment contains ambiguous × (MULTIPLICATION SIGN). Did you mean x (LATIN SMALL LETTER X)?

(RUF003)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)

GitHub Check: CI Pipeline / Test (arm64, 3.13)
GitHub Check: CI Pipeline / Test (arm64, 3.12)
GitHub Check: CI Pipeline / Test (amd64, 3.12)
GitHub Check: CI Pipeline / Test (arm64, 3.11)

src/nat/utils/exception_handlers/automatic_retries.py

tests/nat/utils/test_retry_wrapper.py

dnandakumar-nv · 2025-09-16T20:36:36Z

/ok to test fdc8776

dnandakumar-nv · 2025-09-16T20:49:38Z

/merge

dnandakumar-nv added 2 commits September 16, 2025 16:11

Merge remote-tracking branch 'upstream/develop' into bug-fixes/retry-…

0f77e53

…storms

dnandakumar-nv requested a review from a team as a code owner September 16, 2025 20:13

dnandakumar-nv added bug Something isn't working non-breaking Non-breaking change labels Sep 16, 2025

willkill07 approved these changes Sep 16, 2025

View reviewed changes

src/nat/utils/exception_handlers/automatic_retries.py Outdated Show resolved Hide resolved

coderabbitai bot reviewed Sep 16, 2025

View reviewed changes

src/nat/utils/exception_handlers/automatic_retries.py Outdated Show resolved Hide resolved

tests/nat/utils/test_retry_wrapper.py Show resolved Hide resolved

rapids-bot bot merged commit 7276c00 into NVIDIA:develop Sep 16, 2025
17 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Prevent retry storms in nested method calls#803

Prevent retry storms in nested method calls#803
rapids-bot[bot] merged 3 commits intoNVIDIA:developfrom
dnandakumar-nv:bug-fixes/retry-storms

dnandakumar-nv commented Sep 16, 2025 •

edited by coderabbitai bot

Loading

Uh oh!

copy-pr-bot bot commented Sep 16, 2025

Uh oh!

coderabbitai bot commented Sep 16, 2025 •

edited

Loading

Rate limit exceeded

Uh oh!

dnandakumar-nv commented Sep 16, 2025

Uh oh!

willkill07 left a comment

Uh oh!

Uh oh!

coderabbitai bot left a comment

Misc. - All code (except .mdc files that contain Cursor rules) should be licensed under the Apache License 2.0,

Uh oh!

Uh oh!

Uh oh!

dnandakumar-nv commented Sep 16, 2025

Uh oh!

dnandakumar-nv commented Sep 16, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

dnandakumar-nv commented Sep 16, 2025 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Summary

Problem

Approach

Implementation

Behavior Change

Backward Compatibility

Performance/Robustness

Edge Cases

Testing

Files Changed

Checklist

Risk

By Submitting this PR I confirm:

Summary by CodeRabbit

Uh oh!

copy-pr-bot bot commented Sep 16, 2025

Uh oh!

coderabbitai bot commented Sep 16, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Rate limit exceeded

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Pre-merge checks and finishing touches

Uh oh!

dnandakumar-nv commented Sep 16, 2025

Uh oh!

willkill07 left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Misc. - All code (except .mdc files that contain Cursor rules) should be licensed under the Apache License 2.0,

Uh oh!

Uh oh!

Uh oh!

dnandakumar-nv commented Sep 16, 2025

Uh oh!

dnandakumar-nv commented Sep 16, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

dnandakumar-nv commented Sep 16, 2025 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Sep 16, 2025 •

edited

Loading