Optimize retry logic with memory management improvements#1014
Optimize retry logic with memory management improvements#1014rapids-bot[bot] merged 1 commit intoNVIDIA:developfrom
Conversation
Introduced optimizations to reduce memory overhead during retries. Added features include clearing exception tracebacks, periodic garbage collection, and support for shallow/deep copy of arguments. Accompanied by extensive test coverage for both new and existing functionality. Signed-off-by: dnandakumar-nv <dnandakumar@nvidia.com>
WalkthroughAdds memory-optimized retry handling with shallow/deep copy controls, weakref-based instance gating, optional traceback clearing, and periodic GC during retries. Updates patch_with_retry signature to include deep_copy, gc_frequency, and clear_tracebacks. Applies changes across sync/async (including generators). Introduces comprehensive tests covering new behaviors and configurations. Changes
Sequence Diagram(s)sequenceDiagram
autonumber
participant Caller
participant API as patch_with_retry
participant Deco as _retry_decorator
participant Wrap as WrappedFn
participant Target as Target Function
participant GC as Garbage Collector
Caller->>API: call patch_with_retry(fn, deep_copy, gc_frequency, clear_tracebacks, ...)
API->>Deco: configure shallow_copy = not deep_copy, options
Deco->>Caller: returns WrappedFn
Caller->>Wrap: invoke with args
rect rgba(230,240,255,0.5)
note right of Wrap: Attempt loop (1..max_retries)
Wrap->>Wrap: prepare args (shallow/deep copy)
alt instance-context gating (weakref)
Wrap->>Wrap: verify/skip based on weakref availability
end
Wrap->>Target: call
alt success
Target-->>Wrap: result
Wrap-->>Caller: return result
else failure (Exception e)
Wrap->>Wrap: record last_exception
opt clear_tracebacks
Wrap->>Wrap: detach traceback chains
end
opt periodic GC (every gc_frequency)
Wrap->>GC: gc.collect()
end
Wrap->>Wrap: _want_retry?
alt retry allowed
Wrap->>Wrap: backoff/sleep
else no more retries
Wrap-->>Caller: raise last_exception
end
end
end
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Suggested labels
Pre-merge checks and finishing touches❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/nat/utils/exception_handlers/automatic_retries.py (1)
421-438: Fix docstring parameter name.The docstring references
deepcopy:(line 435), but the parameter is nameddeep_copy(line 421). This inconsistency will confuse users and break documentation tools that parse docstrings.Apply this diff:
- deepcopy: + deep_copy: If True, each retry receives deep‑copied *args and **kwargs* to avoid mutating shared state between attempts.
🧹 Nitpick comments (2)
src/nat/utils/exception_handlers/automatic_retries.py (2)
47-59: Prefer unpacking over tuple concatenation.Line 57 concatenates tuples using
+, which creates an intermediate tuple. Using unpacking is more idiomatic and efficient.Apply this diff:
- return (args[0], ) + copy.deepcopy(args[1:]), copy.deepcopy(kwargs) + return (args[0], *copy.deepcopy(args[1:])), copy.deepcopy(kwargs)
240-247: Consider logging exceptions in cleanup path.The
try-except-passblock silently swallows all exceptions during retry context cleanup. While this defensive approach prevents cleanup failures from propagating, it can hide real issues.Consider logging at debug level to aid troubleshooting:
try: object.__setattr__(obj, "_in_retry_context", False) except Exception: - pass + logger.debug( + "Failed to clear retry context flag during cleanup", + exc_info=True + )
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
src/nat/utils/exception_handlers/automatic_retries.py(8 hunks)tests/nat/utils/test_retry_wrapper.py(1 hunks)
🧰 Additional context used
📓 Path-based instructions (8)
**/*.{py,yaml,yml}
📄 CodeRabbit inference engine (.cursor/rules/nat-test-llm.mdc)
**/*.{py,yaml,yml}: Configure response_seq as a list of strings; values cycle per call, and [] yields an empty string.
Configure delay_ms to inject per-call artificial latency in milliseconds for nat_test_llm.
Files:
tests/nat/utils/test_retry_wrapper.pysrc/nat/utils/exception_handlers/automatic_retries.py
**/*.py
📄 CodeRabbit inference engine (.cursor/rules/nat-test-llm.mdc)
**/*.py: Programmatic use: create TestLLMConfig(response_seq=[...], delay_ms=...), add with builder.add_llm("", cfg).
When retrieving the test LLM wrapper, use builder.get_llm(name, wrapper_type=LLMFrameworkEnum.) and call the framework’s method (e.g., ainvoke, achat, call).
**/*.py: In code comments/identifiers use NAT abbreviations as specified: nat for API namespace/CLI, nvidia-nat for package name, NAT for env var prefixes; do not use these abbreviations in documentation
Follow PEP 20 and PEP 8; run yapf with column_limit=120; use 4-space indentation; end files with a single trailing newline
Run ruff check --fix as linter (not formatter) using pyproject.toml config; fix warnings unless explicitly ignored
Respect naming: snake_case for functions/variables, PascalCase for classes, UPPER_CASE for constants
Treat pyright warnings as errors during development
Exception handling: use bare raise to re-raise; log with logger.error() when re-raising to avoid duplicate stack traces; use logger.exception() when catching without re-raising
Provide Google-style docstrings for every public module, class, function, and CLI command; first line concise and ending with a period; surround code entities with backticks
Validate and sanitize all user input, especially in web or CLI interfaces
Prefer httpx with SSL verification enabled by default and follow OWASP Top-10 recommendations
Use async/await for I/O-bound work; profile CPU-heavy paths with cProfile or mprof before optimizing; cache expensive computations with functools.lru_cache or external cache; leverage NumPy vectorized operations when beneficial
Files:
tests/nat/utils/test_retry_wrapper.pysrc/nat/utils/exception_handlers/automatic_retries.py
tests/**/*.py
📄 CodeRabbit inference engine (.cursor/rules/general.mdc)
Unit tests reside under tests/ and should use markers defined in pyproject.toml (e.g., integration)
Files:
tests/nat/utils/test_retry_wrapper.py
⚙️ CodeRabbit configuration file
tests/**/*.py: - Ensure that tests are comprehensive, cover edge cases, and validate the functionality of the code. - Test functions should be named using thetest_prefix, using snake_case. - Any frequently repeated code should be extracted into pytest fixtures. - Pytest fixtures should define the name argument when applying the pytest.fixture decorator. The fixture
function being decorated should be named using thefixture_prefix, using snake_case. Example:
@pytest.fixture(name="my_fixture")
def fixture_my_fixture():
pass
Files:
tests/nat/utils/test_retry_wrapper.py
{tests/**/*.py,examples/*/tests/**/*.py}
📄 CodeRabbit inference engine (.cursor/rules/general.mdc)
{tests/**/*.py,examples/*/tests/**/*.py}: Use pytest (with pytest-asyncio for async); name test files test_*.py; test functions start with test_; extract repeated code into fixtures; fixtures must set name in decorator and be named with fixture_ prefix
Mock external services with pytest_httpserver or unittest.mock; do not hit live endpoints
Mark expensive tests with @pytest.mark.slow or @pytest.mark.integration
Files:
tests/nat/utils/test_retry_wrapper.py
**/*
⚙️ CodeRabbit configuration file
**/*: # Code Review Instructions
- Ensure the code follows best practices and coding standards. - For Python code, follow
PEP 20 and
PEP 8 for style guidelines.- Check for security vulnerabilities and potential issues. - Python methods should use type hints for all parameters and return values.
Example:def my_function(param1: int, param2: str) -> bool: pass- For Python exception handling, ensure proper stack trace preservation:
- When re-raising exceptions: use bare
raisestatements to maintain the original stack trace,
and uselogger.error()(notlogger.exception()) to avoid duplicate stack trace output.- When catching and logging exceptions without re-raising: always use
logger.exception()
to capture the full stack trace information.Documentation Review Instructions - Verify that documentation and comments are clear and comprehensive. - Verify that the documentation doesn't contain any TODOs, FIXMEs or placeholder text like "lorem ipsum". - Verify that the documentation doesn't contain any offensive or outdated terms. - Verify that documentation and comments are free of spelling mistakes, ensure the documentation doesn't contain any
words listed in the
ci/vale/styles/config/vocabularies/nat/reject.txtfile, words that might appear to be
spelling mistakes but are listed in theci/vale/styles/config/vocabularies/nat/accept.txtfile are OK.Misc. - All code (except .mdc files that contain Cursor rules) should be licensed under the Apache License 2.0,
and should contain an Apache License 2.0 header comment at the top of each file.
- Confirm that copyright years are up-to date whenever a file is changed.
Files:
tests/nat/utils/test_retry_wrapper.pysrc/nat/utils/exception_handlers/automatic_retries.py
src/**/*.py
📄 CodeRabbit inference engine (.cursor/rules/general.mdc)
All importable Python code must live under src/ (or packages//src/)
Files:
src/nat/utils/exception_handlers/automatic_retries.py
src/nat/**/*
📄 CodeRabbit inference engine (.cursor/rules/general.mdc)
Changes in src/nat should prioritize backward compatibility
Files:
src/nat/utils/exception_handlers/automatic_retries.py
⚙️ CodeRabbit configuration file
This directory contains the core functionality of the toolkit. Changes should prioritize backward compatibility.
Files:
src/nat/utils/exception_handlers/automatic_retries.py
{src/**/*.py,packages/*/src/**/*.py}
📄 CodeRabbit inference engine (.cursor/rules/general.mdc)
All public APIs must have Python 3.11+ type hints on parameters and return values; prefer typing/collections.abc abstractions; use typing.Annotated when useful
Files:
src/nat/utils/exception_handlers/automatic_retries.py
🧬 Code graph analysis (2)
tests/nat/utils/test_retry_wrapper.py (1)
src/nat/utils/exception_handlers/automatic_retries.py (3)
_clear_exception_context(61-80)_retry_decorator(168-409)patch_with_retry(412-493)
src/nat/utils/exception_handlers/automatic_retries.py (1)
src/nat/llm/utils/thinking.py (1)
decorate(86-116)
🪛 Ruff (0.14.0)
tests/nat/utils/test_retry_wrapper.py
440-440: Abstract raise to an inner function
(TRY301)
440-440: Avoid specifying long messages outside the exception class
(TRY003)
src/nat/utils/exception_handlers/automatic_retries.py
57-57: Consider (args[0], *copy.deepcopy(args[1:])) instead of concatenation
Replace with (args[0], *copy.deepcopy(args[1:]))
(RUF005)
203-203: _RetryContext.__slots__ is not sorted
Apply a natural sort to _RetryContext.__slots__
(RUF023)
234-234: Consider moving this statement to an else block
(TRY300)
235-235: Do not catch blind exception: Exception
(BLE001)
246-247: try-except-pass detected, consider logging the exception
(S110)
246-246: Do not catch blind exception: Exception
(BLE001)
305-305: Consider moving this statement to an else block
(TRY300)
342-342: Consider moving this statement to an else block
(TRY300)
🔇 Additional comments (9)
tests/nat/utils/test_retry_wrapper.py (3)
452-530: LGTM! Comprehensive coverage of traceback clearing feature.Both test functions effectively validate the
clear_tracebacksconfiguration:
test_traceback_clearingconfirms that traceback cleanup is invoked on each failed retry attempt.test_traceback_not_cleared_when_disabledensures the feature respects the disabled flag.The use of monkeypatch to track calls while preserving the original behavior is a good testing pattern.
532-576: LGTM! Clear validation of shallow vs deep copy semantics.Both tests effectively demonstrate the difference between shallow and deep copy behavior:
- With
deep_copy=False(default), modifications to mutable arguments during failed attempts persist.- With
deep_copy=True, each retry receives independent copies, preventing cross-attempt pollution.This is critical for ensuring predictable retry behavior when dealing with mutable state.
578-720: LGTM! Thorough coverage of memory optimization features.These tests validate critical aspects of the memory management implementation:
test_gc_frequencyconfirms garbage collection is invoked at the correct intervals (attempts 3 and 6 with frequency=3).test_weak_reference_cleanupverifies that using weak references doesn't prevent object cleanup.- Generator and async tests ensure memory optimizations work across all function types.
test_retry_context_with_non_weakref_objectscorrectly handles built-in types likelistthat don't support weak references.Note: Line 440 has static analysis hints (TRY301, TRY003) about exception handling style. These are acceptable for test helper code and don't need to be addressed.
src/nat/utils/exception_handlers/automatic_retries.py (6)
61-81: LGTM! Robust exception traceback cleanup.The recursive clearing of
__traceback__,__cause__, and__context__is essential for breaking reference cycles and freeing memory. The defensive exception handling appropriately guards against read-only attributes or other edge cases.
83-87: LGTM! Appropriate periodic garbage collection.The implementation correctly triggers GC at the specified frequency, skipping the initial attempt (attempt 0) to avoid unnecessary overhead before any failures occur.
168-198: LGTM! Well-designed memory optimization parameters.The updated signature provides sensible defaults:
shallow_copy=Trueprioritizes performance (shallow copy is much faster than deep copy).gc_frequency=3balances memory cleanup with GC overhead.clear_tracebacks=Trueenables memory optimization by default.The
skip_self_in_deepcopylogic correctly prevents deep copying the instance object in instance methods, which would be wasteful and potentially problematic.
249-285: LGTM! Consistent memory optimization in async retry path.The implementation correctly:
- Applies shallow or deep copy based on configuration.
- Clears exception tracebacks when enabled.
- Runs periodic garbage collection.
- Propagates the last exception if all retries are exhausted.
The final
if last_exception: raise last_exception(lines 283-284) serves as a defensive fallback, though it should be unreachable whenretries > 0because the loop always returns or raises within the retry logic.
286-396: LGTM! Consistent memory optimizations across all wrapper variants.All three retry wrappers (async generator, sync generator, and sync function) correctly apply the same memory optimization pattern:
- Conditional shallow/deep copy of arguments.
- Traceback clearing when enabled.
- Periodic garbage collection.
- Proper exception tracking and propagation.
The consistent implementation across all execution paths ensures uniform behavior regardless of whether the wrapped function is sync, async, or a generator.
440-453: LGTM! Clean parameter translation to internal API.The inversion of
deep_copytoshallow_copy(line 441) correctly translates the public API to the internal implementation. The new memory optimization parameters (gc_frequency,clear_tracebacks) are properly passed to the decorator, andinstance_context_aware=True(line 453) ensures retry storm prevention is enabled for patched instances.
|
/ok to test 43a7da5 |
|
/merge |
Introduced optimizations to reduce memory overhead during retries. Added features include clearing exception tracebacks, periodic garbage collection, and support for shallow/deep copy of arguments. Accompanied by extensive test coverage for both new and existing functionality.
By Submitting this PR I confirm:
Summary by CodeRabbit
New Features
Documentation
Tests