Skip to content

Fixing the repeated step id bug#1032

Merged
zhongxuanwang-nv merged 4 commits intoNVIDIA:release/1.3from
zhongxuanwang-nv:zxw_repeated_msg_fix
Oct 17, 2025
Merged

Fixing the repeated step id bug#1032
zhongxuanwang-nv merged 4 commits intoNVIDIA:release/1.3from
zhongxuanwang-nv:zxw_repeated_msg_fix

Conversation

@zhongxuanwang-nv
Copy link
Member

@zhongxuanwang-nv zhongxuanwang-nv commented Oct 17, 2025

Description

Closes #1030 , AIQ-2117

Thanks a lot @willkill07 for proposing the @cached_property!!

By Submitting this PR I confirm:

  • I am familiar with the Contributing Guidelines.
  • We require that all contributors "sign-off" on their commits. This certifies that the contribution is your original work, or you have rights to submit it under the same license, or a compatible license.
    • Any contribution which contains commits that are not Signed-Off will not be accepted.
  • When the PR is ready for review, new or existing tests cover these changes.
  • When the PR is ready for review, the documentation is up to date with these changes.

Summary by CodeRabbit

  • Chores

    • Optimized internal caching mechanism for improved performance
    • Enhanced diagnostic warning messages with additional context for better troubleshooting
  • Tests

    • Added verification test for manager instance consistency and state tracking accuracy

Signed-off-by: Daniel Wang <daniewang@nvidia.com>
@zhongxuanwang-nv zhongxuanwang-nv self-assigned this Oct 17, 2025
@coderabbitai
Copy link

coderabbitai bot commented Oct 17, 2025

Walkthrough

Modified Context.intermediate_step_manager from a regular property to a cached property, ensuring the same manager instance is returned across multiple accesses. Enhanced warning messages in the intermediate step manager with additional context. Added test to verify manager instance caching behavior.

Changes

Cohort / File(s) Summary
Manager Instance Caching
src/nat/builder/context.py
Changed intermediate_step_manager from @property to @cached_property with added functools import. Manager instance is now cached after first access rather than created fresh on each call.
Enhanced Logging
src/nat/builder/intermediate_step_manager.py
Augmented two warning log messages to provide additional context: when outstanding start step not found (notes step may have started in different context or completed), and when no matching start step found for chunk (notes chunk may have started in different context).
Verification Test
tests/nat/builder/test_intermediate_step_manager.py
Added test_context_returns_same_manager_instance() to verify that Context.intermediate_step_manager returns the same instance on multiple accesses and correctly tracks START/END events across shared references.

Sequence Diagram(s)

sequenceDiagram
    participant Code as Application Code
    participant Ctx as Context Instance
    participant Mgr as Intermediate Step Manager
    participant Events as Event Handler

    rect rgb(220, 240, 220)
    Note over Code,Mgr: Before: Manager recreated each access
    Code->>Ctx: .intermediate_step_manager (access 1)
    Ctx->>Mgr: create new Manager A
    Mgr-->>Code: Manager A
    Code->>Mgr: push_intermediate_step(START)
    activate Mgr
    Mgr->>Mgr: record in _outstanding_start_steps
    deactivate Mgr
    end

    rect rgb(240, 220, 220)
    Code->>Ctx: .intermediate_step_manager (access 2)
    Ctx->>Mgr: create new Manager B
    Mgr-->>Code: Manager B
    Code->>Mgr: push_intermediate_step(END)
    activate Mgr
    Mgr->>Mgr: search _outstanding_start_steps (EMPTY!)
    Mgr-->>Events: WARNING: step not found
    deactivate Mgr
    end

    rect rgb(220, 240, 220)
    Note over Code,Mgr: After: Manager instance reused
    Code->>Ctx: .intermediate_step_manager (access 1)
    Ctx->>Mgr: return cached Manager
    Mgr-->>Code: same Manager instance
    Code->>Mgr: push_intermediate_step(START)
    activate Mgr
    Mgr->>Mgr: record in _outstanding_start_steps
    deactivate Mgr
    end

    rect rgb(200, 230, 255)
    Code->>Ctx: .intermediate_step_manager (access 2)
    Ctx->>Mgr: return cached Manager
    Mgr-->>Code: same Manager instance
    Code->>Mgr: push_intermediate_step(END)
    activate Mgr
    Mgr->>Mgr: search _outstanding_start_steps (FOUND!)
    Mgr->>Mgr: clear entry
    deactivate Mgr
    end
Loading

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

Pre-merge checks and finishing touches

✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title Check ✅ Passed The pull request title "Fixing the repeated step id bug" is concise at 31 characters, well under the 72-character limit, and clearly describes the main change—addressing the repeated step ID bug. While the title uses gerund form ("Fixing") rather than strict imperative mood ("Fix"), it is directly related to the changeset which implements a cached property mechanism to ensure consistent manager instances and prevent step ID duplication. The title accurately conveys the core objective of the changes.
Linked Issues Check ✅ Passed The changes directly address the objectives in linked issue #1030. The core fix—converting intermediate_step_manager from a regular @property to @cached_property—ensures the same manager instance is returned on multiple accesses, which resolves the root cause of step IDs being unrecognized. The new test test_context_returns_same_manager_instance validates that the same instance is consistently returned and that START/END events are properly tracked across multiple references. The enhanced warning messages provide better debugging context without introducing functional changes, supporting the overall goal of ensuring workflow spans are properly closed and preventing "Not all spans were closed" errors.
Out of Scope Changes Check ✅ Passed All changes in this pull request are directly scoped to fixing the repeated step ID bug. The @cached_property change in context.py implements the core fix, the enhanced warning messages in intermediate_step_manager.py provide debugging support without introducing new functionality, and the new test specifically validates the fix. No unrelated modifications, refactoring, or feature additions are present. Each change contributes to resolving the underlying issue where step IDs were not being found due to multiple manager instances being created.
Docstring Coverage ✅ Passed Docstring coverage is 80.00% which is sufficient. The required threshold is 80.00%.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

📜 Recent review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b55c00e and e512b0c.

📒 Files selected for processing (3)
  • src/nat/builder/context.py (2 hunks)
  • src/nat/builder/intermediate_step_manager.py (2 hunks)
  • tests/nat/builder/test_intermediate_step_manager.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (3)
**/*

⚙️ CodeRabbit configuration file

**/*: # Code Review Instructions

  • Ensure the code follows best practices and coding standards. - For Python code, follow
    PEP 20 and
    PEP 8 for style guidelines.
  • Check for security vulnerabilities and potential issues. - Python methods should use type hints for all parameters and return values.
    Example:
    def my_function(param1: int, param2: str) -> bool:
        pass
  • For Python exception handling, ensure proper stack trace preservation:
    • When re-raising exceptions: use bare raise statements to maintain the original stack trace,
      and use logger.error() (not logger.exception()) to avoid duplicate stack trace output.
    • When catching and logging exceptions without re-raising: always use logger.exception()
      to capture the full stack trace information.

Documentation Review Instructions - Verify that documentation and comments are clear and comprehensive. - Verify that the documentation doesn't contain any TODOs, FIXMEs or placeholder text like "lorem ipsum". - Verify that the documentation doesn't contain any offensive or outdated terms. - Verify that documentation and comments are free of spelling mistakes, ensure the documentation doesn't contain any

words listed in the ci/vale/styles/config/vocabularies/nat/reject.txt file, words that might appear to be
spelling mistakes but are listed in the ci/vale/styles/config/vocabularies/nat/accept.txt file are OK.

Misc. - All code (except .mdc files that contain Cursor rules) should be licensed under the Apache License 2.0,

and should contain an Apache License 2.0 header comment at the top of each file.

  • Confirm that copyright years are up-to date whenever a file is changed.

Files:

  • tests/nat/builder/test_intermediate_step_manager.py
  • src/nat/builder/context.py
  • src/nat/builder/intermediate_step_manager.py
tests/**/*.py

⚙️ CodeRabbit configuration file

tests/**/*.py: - Ensure that tests are comprehensive, cover edge cases, and validate the functionality of the code. - Test functions should be named using the test_ prefix, using snake_case. - Any frequently repeated code should be extracted into pytest fixtures. - Pytest fixtures should define the name argument when applying the pytest.fixture decorator. The fixture
function being decorated should be named using the fixture_ prefix, using snake_case. Example:
@pytest.fixture(name="my_fixture")
def fixture_my_fixture():
pass

Files:

  • tests/nat/builder/test_intermediate_step_manager.py
src/nat/**/*

⚙️ CodeRabbit configuration file

This directory contains the core functionality of the toolkit. Changes should prioritize backward compatibility.

Files:

  • src/nat/builder/context.py
  • src/nat/builder/intermediate_step_manager.py
🔇 Additional comments (3)
src/nat/builder/context.py (1)

22-22: LGTM! Core fix correctly implements instance caching.

Converting intermediate_step_manager to a @cached_property ensures that a single manager instance is returned across multiple accesses within the same Context. This fixes the root cause where START and END events were inadvertently routed to different manager instances, preventing spans from closing properly. The caching is per-Context instance (not global), which is correct for maintaining isolation between workflow runs.

Also applies to: 171-171

src/nat/builder/intermediate_step_manager.py (1)

94-97: LGTM! Enhanced warning messages provide better diagnostics.

The updated warning messages now include context about why a step might not be found (e.g., started in a different context or already completed). This improves debuggability and helps developers understand legitimate edge cases versus actual bugs.

Also applies to: 153-154

tests/nat/builder/test_intermediate_step_manager.py (1)

99-123: LGTM! Test effectively validates the caching behavior.

The test correctly verifies that Context.intermediate_step_manager returns the same instance across multiple accesses and that START/END events are properly tracked when the manager is accessed through different references. This directly validates the fix for the span-not-closed issue.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Signed-off-by: Daniel Wang <daniewang@nvidia.com>
Signed-off-by: Daniel Wang <daniewang@nvidia.com>
@zhongxuanwang-nv zhongxuanwang-nv added bug Something isn't working non-breaking Non-breaking change labels Oct 17, 2025
@zhongxuanwang-nv
Copy link
Member Author

@coderabbitai review

@coderabbitai
Copy link

coderabbitai bot commented Oct 17, 2025

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@zhongxuanwang-nv zhongxuanwang-nv marked this pull request as ready for review October 17, 2025 21:11
@zhongxuanwang-nv zhongxuanwang-nv requested a review from a team as a code owner October 17, 2025 21:11
@willkill07 willkill07 linked an issue Oct 17, 2025 that may be closed by this pull request
2 tasks
@zhongxuanwang-nv zhongxuanwang-nv merged commit 143757f into NVIDIA:release/1.3 Oct 17, 2025
16 of 17 checks passed
bbednarski9 pushed a commit to bbednarski9/NeMo-Agent-Toolkit that referenced this pull request Oct 20, 2025
## Description
<!-- Note: The pull request title will be included in the CHANGELOG. -->
<!-- Provide a standalone description of changes in this PR. -->
<!-- Reference any issues closed by this PR with "closes NVIDIA#1234". All PRs
should have an issue they close-->
Closes NVIDIA#1030 , AIQ-2117

Thanks a lot @willkill07 for proposing the `@cached_property`!!


## By Submitting this PR I confirm:
- I am familiar with the [Contributing
Guidelines](https://github.com/NVIDIA/NeMo-Agent-Toolkit/blob/develop/docs/source/resources/contributing.md).
- We require that all contributors "sign-off" on their commits. This
certifies that the contribution is your original work, or you have
rights to submit it under the same license, or a compatible license.
- Any contribution which contains commits that are not Signed-Off will
not be accepted.
- When the PR is ready for review, new or existing tests cover these
changes.
- When the PR is ready for review, the documentation is up to date with
these changes.


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

* **Chores**
  * Optimized internal caching mechanism for improved performance
* Enhanced diagnostic warning messages with additional context for
better troubleshooting

* **Tests**
* Added verification test for manager instance consistency and state
tracking accuracy

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Signed-off-by: Daniel Wang <daniewang@nvidia.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working non-breaking Non-breaking change

Projects

None yet

Development

Successfully merging this pull request may close these issues.

phoenix telemetry span not closed

2 participants