[https://nvbugs/6104831][fix] Detach pruned trie children#13572

Open
chienchunhung wants to merge 1 commit into NVIDIA:main from chienchunhung:dev/nvbug-6104831-cascade-prune-fix

@chienchunhung
Collaborator

@chienchunhung chienchunhung commented Apr 28, 2026

Summary by CodeRabbit

  • Bug Fixes

    • Fixed a critical assertion error in batch manager operations that could cause crashes during memory block removal and reallocation sequences.
  • Tests

    • Added comprehensive unit tests covering batch manager edge cases to prevent assertion failure regression.

Summary

Fixes the KV-cache-block trie invariant violation from NVBugs 6104831 by detaching a removed child node's parent back-pointer before erasing it from the parent's child map.

This prevents later clearValue() / freeBlockAndAllDescendants() cascade pruning from walking back to a parent that no longer owns the child and tripping the assertion "cascade prune: parent did not find this node as a child".
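For reviewers who want the ordering spelled out, here is a minimal sketch of the idea. It is not the code in templatedTrie.h: `SketchNode`, `addChild`, `removeChild`, `pruneIfEmpty`, and `pruneThrowsAfterRemoval` are hypothetical names, and the prune is a single step rather than a full cascade. It only illustrates why a stale back-pointer makes the parent lookup fail, and why resetting it before the erase avoids that.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical stand-in for a trie node; NOT the real Node from templatedTrie.h.
struct SketchNode
{
    explicit SketchNode(std::string k) : key(std::move(k)) {}

    std::string key;
    SketchNode* parent = nullptr; // back-pointer to the owning parent
    std::map<std::string, std::shared_ptr<SketchNode>> children;

    std::shared_ptr<SketchNode> addChild(std::string const& k)
    {
        auto child = std::make_shared<SketchNode>(k);
        child->parent = this;
        children[k] = child;
        return child;
    }

    // Drop the edge to child k. With detach == true the child's back-pointer is
    // reset before the map entry is erased (the ordering described in this PR).
    void removeChild(std::string const& k, bool detach)
    {
        auto it = children.find(k);
        if (it == children.end())
        {
            return;
        }
        if (detach)
        {
            it->second->parent = nullptr;
        }
        children.erase(it);
    }

    // One prune step: remove this node from its parent if it is an empty leaf.
    // A stale back-pointer makes the lookup fail and trips the invariant check.
    void pruneIfEmpty()
    {
        if (parent == nullptr || !children.empty())
        {
            return;
        }
        auto it = parent->children.find(key);
        if (it == parent->children.end() || it->second.get() != this)
        {
            throw std::logic_error("cascade prune: parent did not find this node as a child");
        }
        SketchNode* owner = parent;
        parent = nullptr;
        owner->children.erase(it); // may destroy *this; nothing is touched afterwards
    }
};

// Returns true if pruning a child that was already removed from its parent throws.
bool pruneThrowsAfterRemoval(bool detach)
{
    SketchNode root("root");
    auto child = root.addChild("a"); // the caller keeps the child alive
    root.removeChild("a", detach);
    try
    {
        child->pruneIfEmpty();
    }
    catch (std::logic_error const&)
    {
        return true;
    }
    return false;
}
```

In this sketch, `pruneThrowsAfterRemoval(false)` models the pre-fix behavior (the orphan still points at a parent that no longer lists it), while `pruneThrowsAfterRemoval(true)` models the fixed ordering, where the orphan has no parent and the prune is a no-op.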

Related PR

Depends on #13571, which adds the reproducer tests.

Test Coverage

Updated the four NVBugs 6104831 RadixBlockTreeTest reproducers from EXPECT_THROW to EXPECT_NO_THROW, so they now verify the fixed behavior for:

  • orphaned block detach
  • orphaned subtree eviction via freeBlockAndAllDescendants
  • storeBlocks-style block re-keying
  • prefix-overlapping insert / evict / reuse stress loop

Also verified the full radixBlockTreeTest suite remains green.
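As a rough picture of what the prefix-overlapping insert / evict / reuse loop exercises, the sketch below runs that pattern against a toy trie and checks that every child's back-pointer agrees with the map that owns it. This is an illustration under assumptions, not the RadixBlockTreeTest code: `ToyNode`, `insertPath`, `evict`, `edgesConsistent`, and `stressLoop` are hypothetical names, and the real tree stores KV-cache blocks rather than bare integer keys.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <vector>

// Hypothetical toy trie keyed by int tokens; NOT the real radix block tree.
struct ToyNode
{
    int key = 0;
    ToyNode* parent = nullptr;
    std::map<int, std::shared_ptr<ToyNode>> children;
};

// Insert a token path below root, reusing existing prefix nodes; returns the leaf.
std::shared_ptr<ToyNode> insertPath(std::shared_ptr<ToyNode> const& root, std::vector<int> const& path)
{
    auto node = root;
    for (int token : path)
    {
        auto& slot = node->children[token];
        if (!slot)
        {
            slot = std::make_shared<ToyNode>();
            slot->key = token;
            slot->parent = node.get();
        }
        node = slot;
    }
    return node;
}

// Evict a leaf and prune ancestors that become empty. The back-pointer is
// cleared before the parent's map entry is erased.
void evict(ToyNode* node)
{
    while (node != nullptr && node->parent != nullptr && node->children.empty())
    {
        ToyNode* parent = node->parent;
        auto it = parent->children.find(node->key);
        assert(it != parent->children.end() && it->second.get() == node);
        node->parent = nullptr;     // detach first
        parent->children.erase(it); // may destroy node; it is not used afterwards
        node = parent;
    }
}

// Invariant: every child listed in a node's map points back at that node.
bool edgesConsistent(ToyNode const& node)
{
    for (auto const& entry : node.children)
    {
        ToyNode const& child = *entry.second;
        if (child.parent != &node || child.key != entry.first || !edgesConsistent(child))
        {
            return false;
        }
    }
    return true;
}

// Repeatedly insert overlapping paths, evict every leaf, and re-check the invariant.
bool stressLoop(int rounds)
{
    auto root = std::make_shared<ToyNode>();
    std::vector<std::vector<int>> const paths = {{1, 2, 3}, {1, 2, 4}, {1, 5}};
    for (int round = 0; round < rounds; ++round)
    {
        std::vector<std::shared_ptr<ToyNode>> leaves;
        for (auto const& path : paths)
        {
            leaves.push_back(insertPath(root, path));
        }
        if (!edgesConsistent(*root))
        {
            return false;
        }
        for (auto const& leaf : leaves)
        {
            evict(leaf.get());
            if (leaf->parent != nullptr || !edgesConsistent(*root))
            {
                return false;
            }
        }
        if (!root->children.empty())
        {
            return false;
        }
    }
    return true;
}
```

Holding the leaves in a vector while evicting mirrors the situation where something outside the trie still references a node after it has been unlinked, which is the case where a stale back-pointer would matter.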

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

Add unit tests that isolate the KV-cache-block trie invariant violation behind the cascade-prune assertion seen in NVBugs 6104831.

Signed-off-by: Chien-Chun Hung <2679986+chienchunhung@users.noreply.github.com>
@chienchunhung chienchunhung force-pushed the dev/nvbug-6104831-cascade-prune-fix branch from a873694 to d301166 Compare April 28, 2026 20:48
chienchunhung added a commit to chienchunhung/TensorRT-LLM that referenced this pull request Apr 30, 2026
…rmanent wedge

Document the multi-signature disaggregated-serving wedge surfaced by the
rc11 deployment. The report covers the 1P1D reproducer harness, the six
labelled failure signatures (sender-side broken-promise after ready,
trie cascade-prune assertion, decode-side bad optional access, gen-side
checkGenTransferStatus blocking on at_least_num=1, receiver-side queued
cancel broken-promise, and the suspected control-path send stall),
their mapping to chained test/fix PR pairs (NVIDIA#13571/NVIDIA#13572 for sig #2,
NVIDIA#13639/NVIDIA#13640 for sig #1), the in-flight fixes for sig #4 and sig #5,
and the relationship to the unrelated companion fixes NVIDIA#12718 and NVIDIA#13119
which are not in rc11. Includes an investigation timeline that explains
why each signature surfaced only after the previous one was fixed, and
a test-coverage analysis of why the existing unit and integration tests
did not catch any of these bugs.

Signed-off-by: Chien-Chun Hung <2679986+chienchunhung@users.noreply.github.com>
Made-with: Cursor
@chienchunhung chienchunhung requested a review from thorjohnsen May 1, 2026 00:32
@chienchunhung chienchunhung marked this pull request as ready for review May 1, 2026 00:32
@coderabbitai
Contributor

coderabbitai Bot commented May 1, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: b6ece539-5c1b-44c7-9cf5-37e53f2ee667

📥 Commits

Reviewing files that changed from the base of the PR and between c30d9c7 and d301166.

📒 Files selected for processing (2)
  • cpp/include/tensorrt_llm/batch_manager/templatedTrie.h
  • cpp/tests/unit_tests/batch_manager/radixBlockTreeTest.cpp

📝 Walkthrough

This PR adds a bidirectional edge cleanup in the Node::clearNode method to properly detach child nodes from their parent by resetting back-pointers, and introduces four comprehensive unit tests to validate the fix prevents cascade-prune assertion failures during block reattachment operations.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Bidirectional edge cleanup and regression tests: cpp/include/tensorrt_llm/batch_manager/templatedTrie.h, cpp/tests/unit_tests/batch_manager/radixBlockTreeTest.cpp | Added bidirectional edge update in Node::clearNode to reset child node back-pointers before removal, and introduced four new unit tests covering orphaned block detachment, cascade-prune scenarios, block re-keying sequences, and stress testing for overlapping prefix operations. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
✅ Passed checks (4 passed)
| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The PR title clearly identifies the fix: detaching pruned trie children to prevent cascade-prune assertion errors. |
| Description check | ✅ Passed | The PR description includes all required sections: summary of the fix, related PR dependency, comprehensive test coverage details, and completed PR checklist. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

@chienchunhung chienchunhung requested a review from SimengLiu-nv May 1, 2026 17:31
@chienchunhung
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #46515 [ run ] triggered by Bot. Commit: d301166

@tensorrt-cicd
Collaborator

PR_Github #46515 [ run ] completed with state SUCCESS. Commit: d301166
/LLM/main/L0_MergeRequest_PR pipeline #36574 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again
