[None][fix] Revert the transport backend to UCX #9352

Shixiaowei02 · 2025-11-21T03:50:27Z

Summary by CodeRabbit

Configuration Changes
- Default backend for KV cache transmission changed from NIXL to UCX
- Updated environment variable fallback logic for backend selection priority
Documentation
- Updated configuration documentation to reflect new default backend selection

coderabbitai · 2025-11-21T03:53:52Z

📝 Walkthrough

The default backend for KV cache transceiver is changed from NIXL to UCX across the codebase. This includes updates to the C++ implementation, Python implementation, and corresponding documentation to reflect the new default backend selection behavior.

Changes

| Cohort / File(s) | Summary |
|---|---|
| C++ Cache Transceiver `cpp/tensorrt_llm/batch_manager/cacheTransceiver.cpp` | Default backend selection updated: when `CacheTransceiverConfig::BackendType` is `DEFAULT`, the fallback now selects UCX instead of NIXL |
| Python Cache Transceiver `tensorrt_llm/_torch/pyexecutor/kv_cache_transceiver.py` | Default backend changed from NIXL to UCX; environment variable fallback mapping reordered to prioritize `TRTLLM_USE_NIXL_KVCACHE` before MPI when applicable |
| Documentation `docs/source/features/disagg-serving.md` | Updated default backend description in `cache_transceiver_config` docs from "NIXL" to "UCX" |
| Example Configuration `examples/disaggregated/README.md` | Updated example YAML mapping for DEFAULT backend from "(i.e., NIXL)" to "(i.e., UCX)" |
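
The fallback described in the table above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in `kv_cache_transceiver.py`: the `BackendType` enum, the `resolve_backend` helper, the environment variable names, and the `"1"` enabling value are all hypothetical choices made for the example. Only the NIXL-before-MPI ordering and the final UCX fallback come from the walkthrough.

```python
import os
from enum import Enum


class BackendType(Enum):
    """Hypothetical stand-in for the transceiver backend enum."""
    DEFAULT = "DEFAULT"
    UCX = "UCX"
    NIXL = "NIXL"
    MPI = "MPI"


# Assumed variable names, checked in priority order (NIXL before MPI).
ENV_OVERRIDES = (
    ("TRTLLM_USE_NIXL_KVCACHE", BackendType.NIXL),
    ("TRTLLM_USE_MPI_KVCACHE", BackendType.MPI),
)


def resolve_backend(requested, env=None):
    """Return the backend to use for KV cache transmission."""
    env = os.environ if env is None else env
    # An explicit choice always wins over any fallback.
    if requested is not BackendType.DEFAULT:
        return requested
    # DEFAULT: honor the first enabled environment override.
    for name, backend in ENV_OVERRIDES:
        if env.get(name) == "1":
            return backend
    # No override set: fall back to UCX (previously NIXL).
    return BackendType.UCX
```

Passing `env` explicitly keeps the helper easy to test without mutating the process environment.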

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

- Changes follow a consistent pattern of replacing the default backend value across multiple files
- Documentation updates are straightforward descriptive changes
- Environment variable fallback logic is reordered but maintains the same conditional structure and warning behavior

Pre-merge checks and finishing touches

❌ Failed checks (2 warnings)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Description check | ⚠️ Warning | The PR description is incomplete; it contains only '@coderabbitai summary' without any actual explanation of the issue, solution, test coverage, or checklist items required by the template. | Add a comprehensive description explaining why the backend is being reverted to UCX, list relevant tests that safeguard the changes, and complete the PR checklist items. |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | You can run `@coderabbitai generate docstrings` to improve docstring coverage. |

✅ Passed checks (1 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title clearly describes the main change: reverting the transport backend default from NIXL back to UCX, which is the primary change across all modified files. |

coderabbitai

Actionable comments posted: 2

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 39e6418 and 673f1bb.

📒 Files selected for processing (4)

cpp/tensorrt_llm/batch_manager/cacheTransceiver.cpp (1 hunks)
docs/source/features/disagg-serving.md (1 hunks)
examples/disaggregated/README.md (1 hunks)
tensorrt_llm/_torch/pyexecutor/kv_cache_transceiver.py (1 hunks)

🧰 Additional context used

🧠 Learnings (6)

📚 Learning: 2025-08-20T06:56:02.889Z

Learnt from: eopXD
Repo: NVIDIA/TensorRT-LLM PR: 6768
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:577-579
Timestamp: 2025-08-20T06:56:02.889Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, maxSequenceLength is now enforced as a non-optional argument in the BlockManager constructor, so concerns about std::nullopt defaulting to 0 are not applicable. When windowSize > maxSequenceLength, a warning should be added instead of handling optional parameter cases.

Applied to files:

cpp/tensorrt_llm/batch_manager/cacheTransceiver.cpp

📚 Learning: 2025-08-21T09:41:49.347Z

Learnt from: eopXD
Repo: NVIDIA/TensorRT-LLM PR: 6768
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:2010-2045
Timestamp: 2025-08-21T09:41:49.347Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, updateSequenceCacheBlockOffsets is specifically for updating bookkeeping when blocks are added during the context phase, not for refreshing offsets after detach operations. During detach operations, GenerationRequest::removeFrontBlock handles the necessary cache block bookkeeping internally.

Applied to files:

cpp/tensorrt_llm/batch_manager/cacheTransceiver.cpp

📚 Learning: 2025-09-23T15:12:38.312Z

Learnt from: nv-lschneider
Repo: NVIDIA/TensorRT-LLM PR: 7910
File: cpp/tensorrt_llm/thop/allreduceOp.cpp:352-446
Timestamp: 2025-09-23T15:12:38.312Z
Learning: In TensorRT-LLM NCCL device implementation, NCCL version 2.28+ requirements are handled at runtime in the nccl_device/config layer rather than with compile-time guards. This allows the allreduceOp to remain version-agnostic and delegates version compatibility validation to the appropriate lower-level components that can gracefully handle unsupported configurations.

Applied to files:

cpp/tensorrt_llm/batch_manager/cacheTransceiver.cpp
tensorrt_llm/_torch/pyexecutor/kv_cache_transceiver.py

📚 Learning: 2025-08-15T06:46:54.897Z

Learnt from: eopXD
Repo: NVIDIA/TensorRT-LLM PR: 6767
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-15T06:46:54.897Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp addToken function, newly allocated blocks are unshared by design. The beam search path in addToken (when sequence.getNumTokens() > windowSize) is currently broken/non-functional with SWA, so the block allocation doesn't follow a shared-then-unshared pattern.

Applied to files:

cpp/tensorrt_llm/batch_manager/cacheTransceiver.cpp

📚 Learning: 2025-08-20T06:48:45.368Z

Learnt from: eopXD
Repo: NVIDIA/TensorRT-LLM PR: 6768
File: cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h:0-0
Timestamp: 2025-08-20T06:48:45.368Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, updateSequenceCacheBlockOffsets is only called when adding a sequence, not during detach operations. During detach, the cache block bookkeeping is handled by GenerationRequest::removeFrontBlock.

Applied to files:

cpp/tensorrt_llm/batch_manager/cacheTransceiver.cpp

📚 Learning: 2025-07-17T09:01:27.402Z

Learnt from: amitz-nv
Repo: NVIDIA/TensorRT-LLM PR: 5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks `is_adapter_in_cpu_cache()` and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.

Applied to files:

tensorrt_llm/_torch/pyexecutor/kv_cache_transceiver.py

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)

GitHub Check: Pre-commit Check

🔇 Additional comments (3)

cpp/tensorrt_llm/batch_manager/cacheTransceiver.cpp (1)

92-92: LGTM! Default backend correctly changed to UCX.

The fallback logic is clear and consistent with the PR objective to revert from NIXL to UCX as the default backend.

docs/source/features/disagg-serving.md (1)

121-121: LGTM! Documentation accurately updated.

The documentation now correctly states that UCX is the default backend, consistent with the code changes.

examples/disaggregated/README.md (1)

15-15: LGTM! Documentation accurately updated.

The comment now correctly indicates that DEFAULT maps to UCX, consistent with the code changes.
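
Because the meaning of `DEFAULT` changes with this PR, a deployment that depends on a particular transport may prefer to name it explicitly. The sketch below shows one way to do that on a configuration loaded as a plain dictionary. It is an illustration only: the `with_pinned_backend` helper is hypothetical, and the `backend` key under `cache_transceiver_config` and the set of accepted values are assumptions to check against the documentation for your version.

```python
from copy import deepcopy

# Assumed set of accepted backend names.
KNOWN_BACKENDS = ("DEFAULT", "UCX", "NIXL", "MPI")


def with_pinned_backend(config, backend):
    """Return a copy of `config` with the transceiver backend set explicitly."""
    if backend not in KNOWN_BACKENDS:
        raise ValueError(f"unknown backend: {backend!r}")
    pinned = deepcopy(config)  # leave the caller's configuration untouched
    pinned.setdefault("cache_transceiver_config", {})["backend"] = backend
    return pinned


# Usage: keep NIXL even though DEFAULT now maps to UCX.
base = {"cache_transceiver_config": {"backend": "DEFAULT"}}
pinned = with_pinned_backend(base, "NIXL")
```

Naming the backend explicitly makes the configuration independent of future changes to the default.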

tensorrt_llm/_torch/pyexecutor/kv_cache_transceiver.py

reasonsolo · 2025-11-21T04:20:31Z

Should we unwaive some (maybe not all) disagg tests?

Shixiaowei02 · 2025-11-21T08:02:03Z

> Should we unwaive some (maybe not all) disagg tests?

I suggest we first gather clearer evidence and establish the causal relationship before unwaiving, to avoid disrupting CI.

Signed-off-by: Shixiaowei02 <39303645+Shixiaowei02@users.noreply.github.com>

Shixiaowei02 · 2025-11-21T08:22:49Z

/bot run

tensorrt-cicd · 2025-11-21T08:28:56Z

PR_Github #25334 [ run ] triggered by Bot. Commit: 25e0c8e

bo-nv · 2025-11-21T09:11:13Z

cpp/tensorrt_llm/batch_manager/cacheTransceiver.cpp

        else
        {
-            backendType = executor::CacheTransceiverConfig::BackendType::NIXL;
+            backendType = executor::CacheTransceiverConfig::BackendType::UCX;


Can we only revert #9247?

> Can we only revert #9247?

OK, let’s not rush to fully revert for now. We'll try to find more clues before making a decision. Thanks! @bo-nv

tensorrt-cicd · 2025-11-21T12:37:19Z

PR_Github #25334 [ run ] completed with state SUCCESS. Commit: 25e0c8e
/LLM/main/L0_MergeRequest_PR pipeline #19162 completed with status: 'FAILURE'

Shixiaowei02 requested review from a team as code owners November 21, 2025 03:50

Shixiaowei02 requested review from QiJune, Tabrizian, bo-nv, chuangz0, kaiyux, pcastonguay and reasonsolo November 21, 2025 03:50

coderabbitai bot reviewed Nov 21, 2025

reasonsolo approved these changes Nov 21, 2025

Shixiaowei02 added 2 commits November 21, 2025 16:22

Revert the transport backend to UCX

b93883f

Signed-off-by: Shixiaowei02 <39303645+Shixiaowei02@users.noreply.github.com>

update

25e0c8e

Signed-off-by: Shixiaowei02 <39303645+Shixiaowei02@users.noreply.github.com>

Shixiaowei02 force-pushed the user/xiaoweis/revert-nixl branch from 76f7959 to 25e0c8e Compare November 21, 2025 08:22

bo-nv reviewed Nov 21, 2025
