[#12257][fix] Use the first non-None result returned by hf download workers by kev-bi · Pull Request #12259 · NVIDIA/TensorRT-LLM

kev-bi · 2026-03-16T23:47:44Z

Summary by CodeRabbit

Bug Fixes
- Enhanced robustness of model directory selection in distributed model download operations.

Description

This PR addresses the issue reported in #12257.

There appears to be an edge case when running the TensorRT-LLM backend in Dynamo where model_dirs[0] does not hold the rank-0 result. Rather than always using model_dirs[0], this change iterates through model_dirs and uses the first non-None value encountered. Since only the rank-0 worker downloads the model and all other workers return None, there should be only one non-None value in model_dirs. This handles both the assumed case, where model_dirs[0] holds the result, and the edge case, where some other entry in the list holds it.
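The selection logic described above can be sketched roughly as follows. This is a minimal illustration, not the actual code in `_download_hf_model_if_needed`: the helper name `first_non_none`, the example paths, and the choice to raise when every entry is None are all assumptions made for the sketch.

```python
from typing import Optional, Sequence


def first_non_none(model_dirs: Sequence[Optional[str]]) -> str:
    """Return the first non-None entry gathered from the download workers.

    Hypothetical helper: the real change lives inside
    _download_hf_model_if_needed and may be written differently.
    """
    for model_dir in model_dirs:
        if model_dir is not None:
            return model_dir
    # Assumption: treat "no worker returned a path" as an error.
    raise RuntimeError("No worker returned a model directory")


# Assumed case: the rank-0 result sits in slot 0.
print(first_non_none(["/tmp/example-model", None, None]))
# Edge case from #12257: the result lands in another slot.
print(first_non_none([None, "/tmp/example-model", None]))
```

Both calls yield the same path, which is the point of the fix: the position of the rank-0 result in the gathered list no longer matters.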

Test Coverage

The existing tests should cover this change: model_dir is still fetched and assigned; only the way it is fetched differs. Let me know if it would be preferable to add a test that triggers the edge case.

PR Checklist

Please review the following before submitting your PR:

- PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
- PR follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
- Test cases are provided for new code paths (see test instructions).
- Any new dependencies have been scanned for license and vulnerabilities.
- CODEOWNERS updated if ownership changes.
- Documentation updated as needed.
- Update tava architecture diagram if there is a significant design change in PR.
- The reviewers assigned automatically/manually are appropriate for the PR.
- Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

coderabbitai · 2026-03-16T23:51:21Z

Walkthrough

Enhanced robustness in model directory selection by filtering out None values when distributing models across nodes, and simplified the accompanying docstring for clarity.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Model directory selection robustness `tensorrt_llm/llmapi/llm_utils.py` | Updated `_download_hf_model_if_needed` to select the first non-None model directory instead of always using the first element, preventing potential None value handling issues. Simplified docstring by removing rank-specific note. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~5 minutes

🚥 Pre-merge checks | ✅ 3

✅ Passed checks (3 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title clearly identifies the specific fix: using the first non-None result from hf download workers instead of assuming the first element. It's concise, references the issue `#12257`, and accurately reflects the main change. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00%, which is sufficient. The required threshold is 80.00%. |
| Description check | ✅ Passed | The PR description adequately explains the issue, solution, test coverage, and includes all required checklist items from the template. |

Superjomn

LGTM, thanks for the contribution.

Superjomn · 2026-03-20T08:52:44Z

/bot run

tensorrt-cicd · 2026-03-20T08:58:29Z

PR_Github #39724 [ run ] triggered by Bot. Commit: 233d8f9

tensorrt-cicd · 2026-03-20T13:12:01Z

PR_Github #39724 [ run ] completed with state SUCCESS. Commit: 233d8f9
/LLM/main/L0_MergeRequest_PR pipeline #30920 completed with status: 'FAILURE'
