
[https://nvbugs/6074014][fix] Min-reduce available host memory to ensure that all ranks agree about whether prefetch is enabled#13161

Merged
achartier merged 1 commit into NVIDIA:main from dhansen-nvidia:enable_prefetch_fix
Apr 21, 2026

Conversation

@dhansen-nvidia
Collaborator

@dhansen-nvidia dhansen-nvidia commented Apr 17, 2026

Summary by CodeRabbit

  • Bug Fixes
  • Fixed inconsistent prefetch decisions during weight loading, where local ranks could independently reach different prefetching choices. Available host memory is now synchronized across all local ranks, so every rank makes the same prefetch decision and weight loading behaves reliably in multi-device configurations.

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update the tava architecture diagram if the PR contains a significant design change.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

…ure that all ranks agree about whether prefetch is enabled

Signed-off-by: Dan Hansen <1+dhansen-nvidia@users.noreply.github.com>
@dhansen-nvidia dhansen-nvidia requested a review from a team as a code owner April 17, 2026 19:20
@dhansen-nvidia dhansen-nvidia requested a review from brb-nv April 17, 2026 19:20
@dhansen-nvidia
Collaborator Author

/bot run

@coderabbitai
Contributor

coderabbitai Bot commented Apr 17, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: d1edf0ae-e5bb-4e40-8ecb-05176dd12ffd

📥 Commits

Reviewing files that changed from the base of the PR and between 813d877 and 299b34b.

📒 Files selected for processing (1)
  • tensorrt_llm/_torch/models/checkpoints/hf/weight_loader.py

📝 Walkthrough

Added MPI-aware memory detection to weight loader prefetch logic. Introduces a static method that computes available host memory with collective synchronization across local ranks when multi-device mode is enabled, replacing direct per-rank memory queries for consistent prefetch decisions.

Changes

Cohort / File(s): Memory-aware prefetch logic — tensorrt_llm/_torch/models/checkpoints/hf/weight_loader.py
Summary: Added a _get_local_available_host_memory() static method that performs an MPI allreduce (minimum) on available host memory when ENABLE_MULTI_DEVICE is enabled. Updated load_weights() to compare the total prefetch size against 90% of the synchronized value instead of per-rank local memory, ensuring consistent prefetch decisions across local ranks.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (2 warnings)

Description check — ⚠️ Warning: The PR description is the template with uncompleted sections; the 'Description', 'Test Coverage', and required checklist items are not filled in — only the template placeholder comments remain. Resolution: complete the Description section explaining the issue and solution, add Test Coverage details listing relevant tests, and provide substantive checklist answers beyond the template.
Docstring Coverage — ⚠️ Warning: Docstring coverage is 50.00%, below the required 80.00% threshold. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (1 passed)
Title check — ✅ Passed: The title is clear and specific, accurately describing the main change: min-reducing available host memory to ensure consistent prefetch decisions across ranks.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

Comment @coderabbitai help to get the list of available commands and usage tips.

@tensorrt-cicd
Collaborator

PR_Github #44065 [ run ] triggered by Bot. Commit: 299b34b Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #44065 [ run ] completed with state SUCCESS. Commit: 299b34b
/LLM/main/L0_MergeRequest_PR pipeline #34496 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@dhansen-nvidia
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #44131 [ run ] triggered by Bot. Commit: 299b34b Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #44131 [ run ] completed with state SUCCESS. Commit: 299b34b
/LLM/main/L0_MergeRequest_PR pipeline #34557 completed with status: 'SUCCESS'

CI Report

Link to invocation

Collaborator

@brb-nv brb-nv left a comment


LGTM.

@achartier achartier merged commit 96bb8b7 into NVIDIA:main Apr 21, 2026
9 of 10 checks passed
