Optimize retry logic with memory management improvements by dnandakumar-nv · Pull Request #1014 · NVIDIA/NeMo-Agent-Toolkit

dnandakumar-nv · 2025-10-15T20:20:27Z

Introduced optimizations to reduce memory overhead during retries. Added features include clearing exception tracebacks, periodic garbage collection, and support for shallow/deep copy of arguments. Accompanied by extensive test coverage for both new and existing functionality.

By Submitting this PR I confirm:

I am familiar with the Contributing Guidelines.
We require that all contributors "sign-off" on their commits. This certifies that the contribution is your original work, or you have rights to submit it under the same license, or a compatible license.
- Any contribution which contains commits that are not Signed-Off will not be accepted.
When the PR is ready for review, new or existing tests cover these changes.
When the PR is ready for review, the documentation is up to date with these changes.

Summary by CodeRabbit

New Features
- Enhanced retry behavior with configurable argument copying (shallow vs. deep), adjustable garbage-collection frequency, and optional traceback clearing between attempts.
- Consistent improvements across sync, async, and generator flows; last error re-raised after retry exhaustion.
Documentation
- Updated API documentation to describe new retry options and defaults.
Tests
- Added extensive tests for memory optimizations, weak-reference handling, GC controls, traceback clearing, and async/generator compatibility.

Introduced optimizations to reduce memory overhead during retries. Added features include clearing exception tracebacks, periodic garbage collection, and support for shallow/deep copy of arguments. Accompanied by extensive test coverage for both new and existing functionality. Signed-off-by: dnandakumar-nv <dnandakumar@nvidia.com>

copy-pr-bot · 2025-10-15T20:20:31Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

coderabbitai · 2025-10-15T20:20:56Z

Walkthrough

Adds memory-optimized retry handling with shallow/deep copy controls, weakref-based instance gating, optional traceback clearing, and periodic GC during retries. Updates patch_with_retry signature to include deep_copy, gc_frequency, and clear_tracebacks. Applies changes across sync/async (including generators). Introduces comprehensive tests covering new behaviors and configurations.

Changes

Cohort / File(s)	Summary
Retry infrastructure and memory management `src/nat/utils/exception_handlers/automatic_retries.py`	Refactors retry internals: adds shallow/deep argument copy helpers, traceback clearing, periodic GC, weakref-based instance-context gating, last_exception tracking, and updated retry loops. Public API change: patch_with_retry now accepts deep_copy, gc_frequency, clear_tracebacks; internal decorator updated accordingly.
Tests for retry memory features `tests/nat/utils/test_retry_wrapper.py`	Adds tests for shallow vs deep copy, traceback clearing toggle, GC frequency, weakref handling, non-weakref objects, sync/async and generator compatibility, nested retry behavior, and deterministic validation via monkeypatching sleeps and GC.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant Caller
  participant API as patch_with_retry
  participant Deco as _retry_decorator
  participant Wrap as WrappedFn
  participant Target as Target Function
  participant GC as Garbage Collector

  Caller->>API: call patch_with_retry(fn, deep_copy, gc_frequency, clear_tracebacks, ...)
  API->>Deco: configure shallow_copy = not deep_copy, options
  Deco->>Caller: returns WrappedFn

  Caller->>Wrap: invoke with args
  rect rgba(230,240,255,0.5)
    note right of Wrap: Attempt loop (1..max_retries)
    Wrap->>Wrap: prepare args (shallow/deep copy)
    alt instance-context gating (weakref)
      Wrap->>Wrap: verify/skip based on weakref availability
    end
    Wrap->>Target: call
    alt success
      Target-->>Wrap: result
      Wrap-->>Caller: return result
    else failure (Exception e)
      Wrap->>Wrap: record last_exception
      opt clear_tracebacks
        Wrap->>Wrap: detach traceback chains
      end
      opt periodic GC (every gc_frequency)
        Wrap->>GC: gc.collect()
      end
      Wrap->>Wrap: _want_retry?
      alt retry allowed
        Wrap->>Wrap: backoff/sleep
      else no more retries
        Wrap-->>Caller: raise last_exception
      end
    end
  end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Suggested labels

breaking, improvement

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 51.16% which is insufficient. The required threshold is 80.00%.	You can run `@coderabbitai generate docstrings` to improve docstring coverage.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title Check	✅ Passed	The title succinctly summarizes the main change, uses the imperative mood, and stays under the character limit, clearly reflecting the memory-optimization enhancements to the retry logic.

✨ Finishing touches

📝 Generate docstrings

🧪 Generate unit tests (beta)

Create PR with unit tests
Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

src/nat/utils/exception_handlers/automatic_retries.py (1)
421-438: Fix docstring parameter name.

The docstring references deepcopy: (line 435), but the parameter is named deep_copy (line 421). This inconsistency will confuse users and break documentation tools that parse docstrings.

Apply this diff:
-    deepcopy:
+    deep_copy:
         If True, each retry receives deep‑copied *args and **kwargs* to avoid
         mutating shared state between attempts.

🧹 Nitpick comments (2)

src/nat/utils/exception_handlers/automatic_retries.py (2)
47-59: Prefer unpacking over tuple concatenation.

Line 57 concatenates tuples using +, which creates an intermediate tuple. Using unpacking is more idiomatic and efficient.

Apply this diff:
-        return (args[0], ) + copy.deepcopy(args[1:]), copy.deepcopy(kwargs)
+        return (args[0], *copy.deepcopy(args[1:])), copy.deepcopy(kwargs)
240-247: Consider logging exceptions in cleanup path.

The try-except-pass block silently swallows all exceptions during retry context cleanup. While this defensive approach prevents cleanup failures from propagating, it can hide real issues.

Consider logging at debug level to aid troubleshooting:
                        try:
                            object.__setattr__(obj, "_in_retry_context", False)
                        except Exception:
-                            pass
+                            logger.debug(
+                                "Failed to clear retry context flag during cleanup",
+                                exc_info=True
+                            )

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c39e258 and 43a7da5.

📒 Files selected for processing (2)

src/nat/utils/exception_handlers/automatic_retries.py (8 hunks)
tests/nat/utils/test_retry_wrapper.py (1 hunks)

🧰 Additional context used

📓 Path-based instructions (8)

**/*.{py,yaml,yml}

📄 CodeRabbit inference engine (.cursor/rules/nat-test-llm.mdc)

**/*.{py,yaml,yml}: Configure response_seq as a list of strings; values cycle per call, and [] yields an empty string.
Configure delay_ms to inject per-call artificial latency in milliseconds for nat_test_llm.

Files:

tests/nat/utils/test_retry_wrapper.py
src/nat/utils/exception_handlers/automatic_retries.py

**/*.py

📄 CodeRabbit inference engine (.cursor/rules/nat-test-llm.mdc)

**/*.py: Programmatic use: create TestLLMConfig(response_seq=[...], delay_ms=...), add with builder.add_llm("", cfg).
When retrieving the test LLM wrapper, use builder.get_llm(name, wrapper_type=LLMFrameworkEnum.) and call the framework’s method (e.g., ainvoke, achat, call).

**/*.py: In code comments/identifiers use NAT abbreviations as specified: nat for API namespace/CLI, nvidia-nat for package name, NAT for env var prefixes; do not use these abbreviations in documentation
Follow PEP 20 and PEP 8; run yapf with column_limit=120; use 4-space indentation; end files with a single trailing newline
Run ruff check --fix as linter (not formatter) using pyproject.toml config; fix warnings unless explicitly ignored
Respect naming: snake_case for functions/variables, PascalCase for classes, UPPER_CASE for constants
Treat pyright warnings as errors during development
Exception handling: use bare raise to re-raise; log with logger.error() when re-raising to avoid duplicate stack traces; use logger.exception() when catching without re-raising
Provide Google-style docstrings for every public module, class, function, and CLI command; first line concise and ending with a period; surround code entities with backticks
Validate and sanitize all user input, especially in web or CLI interfaces
Prefer httpx with SSL verification enabled by default and follow OWASP Top-10 recommendations
Use async/await for I/O-bound work; profile CPU-heavy paths with cProfile or mprof before optimizing; cache expensive computations with functools.lru_cache or external cache; leverage NumPy vectorized operations when beneficial

Files:

tests/nat/utils/test_retry_wrapper.py
src/nat/utils/exception_handlers/automatic_retries.py

tests/**/*.py

📄 CodeRabbit inference engine (.cursor/rules/general.mdc)

Unit tests reside under tests/ and should use markers defined in pyproject.toml (e.g., integration)

Files:

tests/nat/utils/test_retry_wrapper.py

⚙️ CodeRabbit configuration file

tests/**/*.py: - Ensure that tests are comprehensive, cover edge cases, and validate the functionality of the code. - Test functions should be named using the test_ prefix, using snake_case. - Any frequently repeated code should be extracted into pytest fixtures. - Pytest fixtures should define the name argument when applying the pytest.fixture decorator. The fixture
function being decorated should be named using the fixture_ prefix, using snake_case. Example:
@pytest.fixture(name="my_fixture")
def fixture_my_fixture():
pass

Files:

tests/nat/utils/test_retry_wrapper.py

{tests/**/*.py,examples/*/tests/**/*.py}

📄 CodeRabbit inference engine (.cursor/rules/general.mdc)

{tests/**/*.py,examples/*/tests/**/*.py}: Use pytest (with pytest-asyncio for async); name test files test_*.py; test functions start with test_; extract repeated code into fixtures; fixtures must set name in decorator and be named with fixture_ prefix
Mock external services with pytest_httpserver or unittest.mock; do not hit live endpoints
Mark expensive tests with @pytest.mark.slow or @pytest.mark.integration

Files:

tests/nat/utils/test_retry_wrapper.py

**/*

⚙️ CodeRabbit configuration file

**/*: # Code Review Instructions
Ensure the code follows best practices and coding standards. - For Python code, follow
PEP 20 and
PEP 8 for style guidelines.
Check for security vulnerabilities and potential issues. - Python methods should use type hints for all parameters and return values.
Example:
def my_function(param1: int, param2: str) -> bool:
    pass
For Python exception handling, ensure proper stack trace preservation:

When re-raising exceptions: use bare raise statements to maintain the original stack trace,
and use logger.error() (not logger.exception()) to avoid duplicate stack trace output.

When catching and logging exceptions without re-raising: always use logger.exception()
to capture the full stack trace information.
Documentation Review Instructions - Verify that documentation and comments are clear and comprehensive. - Verify that the documentation doesn't contain any TODOs, FIXMEs or placeholder text like "lorem ipsum". - Verify that the documentation doesn't contain any offensive or outdated terms. - Verify that documentation and comments are free of spelling mistakes, ensure the documentation doesn't contain any

words listed in the ci/vale/styles/config/vocabularies/nat/reject.txt file, words that might appear to be
spelling mistakes but are listed in the ci/vale/styles/config/vocabularies/nat/accept.txt file are OK.

Misc. - All code (except .mdc files that contain Cursor rules) should be licensed under the Apache License 2.0,

and should contain an Apache License 2.0 header comment at the top of each file.

Confirm that copyright years are up-to date whenever a file is changed.

Files:

tests/nat/utils/test_retry_wrapper.py
src/nat/utils/exception_handlers/automatic_retries.py

src/**/*.py

📄 CodeRabbit inference engine (.cursor/rules/general.mdc)

All importable Python code must live under src/ (or packages//src/)

Files:

src/nat/utils/exception_handlers/automatic_retries.py

src/nat/**/*

📄 CodeRabbit inference engine (.cursor/rules/general.mdc)

Changes in src/nat should prioritize backward compatibility

Files:

src/nat/utils/exception_handlers/automatic_retries.py

⚙️ CodeRabbit configuration file

This directory contains the core functionality of the toolkit. Changes should prioritize backward compatibility.

Files:

src/nat/utils/exception_handlers/automatic_retries.py

{src/**/*.py,packages/*/src/**/*.py}

📄 CodeRabbit inference engine (.cursor/rules/general.mdc)

All public APIs must have Python 3.11+ type hints on parameters and return values; prefer typing/collections.abc abstractions; use typing.Annotated when useful

Files:

src/nat/utils/exception_handlers/automatic_retries.py

🧬 Code graph analysis (2)

tests/nat/utils/test_retry_wrapper.py (1)

src/nat/utils/exception_handlers/automatic_retries.py (3)

_clear_exception_context (61-80)

_retry_decorator (168-409)

patch_with_retry (412-493)

src/nat/utils/exception_handlers/automatic_retries.py (1)

src/nat/llm/utils/thinking.py (1)

decorate (86-116)

🪛 Ruff (0.14.0)

tests/nat/utils/test_retry_wrapper.py

440-440: Abstract raise to an inner function

(TRY301)

440-440: Avoid specifying long messages outside the exception class

(TRY003)

src/nat/utils/exception_handlers/automatic_retries.py

57-57: Consider (args[0], *copy.deepcopy(args[1:])) instead of concatenation

Replace with (args[0], *copy.deepcopy(args[1:]))

(RUF005)

203-203: _RetryContext.__slots__ is not sorted

Apply a natural sort to _RetryContext.__slots__

(RUF023)

234-234: Consider moving this statement to an else block

(TRY300)

235-235: Do not catch blind exception: Exception

(BLE001)

246-247: try-except-pass detected, consider logging the exception

(S110)

246-246: Do not catch blind exception: Exception

(BLE001)

305-305: Consider moving this statement to an else block

(TRY300)

342-342: Consider moving this statement to an else block

(TRY300)

🔇 Additional comments (9)

tests/nat/utils/test_retry_wrapper.py (3)

452-530: LGTM! Comprehensive coverage of traceback clearing feature.

Both test functions effectively validate the clear_tracebacks configuration:

test_traceback_clearing confirms that traceback cleanup is invoked on each failed retry attempt.

test_traceback_not_cleared_when_disabled ensures the feature respects the disabled flag.

The use of monkeypatch to track calls while preserving the original behavior is a good testing pattern.

532-576: LGTM! Clear validation of shallow vs deep copy semantics.

Both tests effectively demonstrate the difference between shallow and deep copy behavior:

With deep_copy=False (default), modifications to mutable arguments during failed attempts persist.

With deep_copy=True, each retry receives independent copies, preventing cross-attempt pollution.

This is critical for ensuring predictable retry behavior when dealing with mutable state.

578-720: LGTM! Thorough coverage of memory optimization features.

These tests validate critical aspects of the memory management implementation:

test_gc_frequency confirms garbage collection is invoked at the correct intervals (attempts 3 and 6 with frequency=3).

test_weak_reference_cleanup verifies that using weak references doesn't prevent object cleanup.

Generator and async tests ensure memory optimizations work across all function types.

test_retry_context_with_non_weakref_objects correctly handles built-in types like list that don't support weak references.

Note: Line 440 has static analysis hints (TRY301, TRY003) about exception handling style. These are acceptable for test helper code and don't need to be addressed.

src/nat/utils/exception_handlers/automatic_retries.py (6)

61-81: LGTM! Robust exception traceback cleanup.

The recursive clearing of __traceback__, __cause__, and __context__ is essential for breaking reference cycles and freeing memory. The defensive exception handling appropriately guards against read-only attributes or other edge cases.

83-87: LGTM! Appropriate periodic garbage collection.

The implementation correctly triggers GC at the specified frequency, skipping the initial attempt (attempt 0) to avoid unnecessary overhead before any failures occur.

168-198: LGTM! Well-designed memory optimization parameters.

The updated signature provides sensible defaults:

shallow_copy=True prioritizes performance (shallow copy is much faster than deep copy).

gc_frequency=3 balances memory cleanup with GC overhead.

clear_tracebacks=True enables memory optimization by default.

The skip_self_in_deepcopy logic correctly prevents deep copying the instance object in instance methods, which would be wasteful and potentially problematic.

249-285: LGTM! Consistent memory optimization in async retry path.

The implementation correctly:

Applies shallow or deep copy based on configuration.

Clears exception tracebacks when enabled.

Runs periodic garbage collection.

Propagates the last exception if all retries are exhausted.

The final if last_exception: raise last_exception (lines 283-284) serves as a defensive fallback, though it should be unreachable when retries > 0 because the loop always returns or raises within the retry logic.

286-396: LGTM! Consistent memory optimizations across all wrapper variants.

All three retry wrappers (async generator, sync generator, and sync function) correctly apply the same memory optimization pattern:

Conditional shallow/deep copy of arguments.

Traceback clearing when enabled.

Periodic garbage collection.

Proper exception tracking and propagation.

The consistent implementation across all execution paths ensures uniform behavior regardless of whether the wrapped function is sync, async, or a generator.

440-453: LGTM! Clean parameter translation to internal API.

The inversion of deep_copy to shallow_copy (line 441) correctly translates the public API to the internal implementation. The new memory optimization parameters (gc_frequency, clear_tracebacks) are properly passed to the decorator, and instance_context_aware=True (line 453) ensures retry storm prevention is enabled for patched instances.

dnandakumar-nv · 2025-10-15T20:25:22Z

/ok to test 43a7da5

dnandakumar-nv · 2025-10-15T22:03:04Z

/merge

dnandakumar-nv requested a review from a team as a code owner October 15, 2025 20:20

coderabbitai bot added breaking Breaking change improvement Improvement to existing functionality labels Oct 15, 2025

dnandakumar-nv added non-breaking Non-breaking change and removed breaking Breaking change labels Oct 15, 2025

coderabbitai bot reviewed Oct 15, 2025

View reviewed changes

AnuradhaKaruppiah approved these changes Oct 15, 2025

View reviewed changes

rapids-bot bot merged commit d0ac939 into NVIDIA:develop Oct 15, 2025
16 of 17 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Optimize retry logic with memory management improvements#1014

Optimize retry logic with memory management improvements#1014
rapids-bot[bot] merged 1 commit intoNVIDIA:developfrom
dnandakumar-nv:rettry_optimizations

dnandakumar-nv commented Oct 15, 2025 •

edited by coderabbitai bot

Loading

Uh oh!

copy-pr-bot bot commented Oct 15, 2025

Uh oh!

coderabbitai bot commented Oct 15, 2025 •

edited

Loading

Uh oh!

coderabbitai bot left a comment

Misc. - All code (except .mdc files that contain Cursor rules) should be licensed under the Apache License 2.0,

Uh oh!

dnandakumar-nv commented Oct 15, 2025

Uh oh!

dnandakumar-nv commented Oct 15, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

dnandakumar-nv commented Oct 15, 2025 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

By Submitting this PR I confirm:

Summary by CodeRabbit

Uh oh!

copy-pr-bot bot commented Oct 15, 2025

Uh oh!

coderabbitai bot commented Oct 15, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Suggested labels

Pre-merge checks and finishing touches

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Misc. - All code (except .mdc files that contain Cursor rules) should be licensed under the Apache License 2.0,

Uh oh!

dnandakumar-nv commented Oct 15, 2025

Uh oh!

dnandakumar-nv commented Oct 15, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

dnandakumar-nv commented Oct 15, 2025 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Oct 15, 2025 •

edited

Loading