[#12593][feat] AutoDeploy: onboard DeepSeek-R1 (#12601)
Conversation
Force-pushed 129f9de to 4b25712
📝 Walkthrough
This change introduces registry-based configuration management for DeepSeek-R1 model deployment, including updated runtime parameters, dequantization hooks for FP8 weights, refined sharding logic for non-divisible world sizes, and corresponding accuracy reference entries and tests.
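For context on the "non-divisible world sizes" item in the walkthrough, sharding a dimension across ranks reduces to `torch.tensor_split`-style chunking. A dependency-free sketch (illustrative only; the sizes below are hypothetical):

```python
# Chunk sizes produced by torch.tensor_split semantics when splitting a
# dimension of size n into k parts: the first n % k chunks get one extra
# element, so non-divisible sizes still cover every element exactly once.
def tensor_split_sizes(n: int, k: int) -> list[int]:
    base, extra = divmod(n, k)
    return [base + 1 if i < extra else base for i in range(k)]

print(tensor_split_sizes(10, 3))  # [4, 3, 3]
print(tensor_split_sizes(3, 8))   # [1, 1, 1, 0, 0, 0, 0, 0]
```

Note that some chunks can be empty when the dimension is smaller than the world size, which is why the sharding code needs the extra guards discussed below.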
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
🚥 Pre-merge checks: ✅ 2 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Actionable comments posted: 4
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (3)
tests/unittest/auto_deploy/multigpu/transformations/library/test_tp_sharding.py (1)
1-1: ⚠️ Potential issue | 🟠 Major: Add the required NVIDIA copyright header.
This modified Python test file is missing the required NVIDIA OSS copyright header at the top.
As per coding guidelines: “Add NVIDIA copyright header to ALL new files; update year on modified files.”
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/unittest/auto_deploy/multigpu/transformations/library/test_tp_sharding.py` at line 1, Add the required NVIDIA OSS copyright header to the top of the test file (tests/unittest/auto_deploy/multigpu/transformations/library/test_tp_sharding.py) by inserting the standard NVIDIA header block as the first lines of the file; if this is a modified file rather than new, update the copyright year range in that header accordingly so it reflects the current year.
tensorrt_llm/_torch/auto_deploy/transform/library/sharding.py (1)
1-1: ⚠️ Potential issue | 🟠 Major: Add the required NVIDIA copyright header.
This modified Python file is missing the required NVIDIA OSS copyright header at the top.
As per coding guidelines: “All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the year of its latest meaningful modification.”
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_torch/auto_deploy/transform/library/sharding.py` at line 1, This file is missing the required NVIDIA OSS copyright header: add the standard NVIDIA copyright header as the very first lines of tensorrt_llm/_torch/auto_deploy/transform/library/sharding.py (before the module docstring """Transformations to support graph sharding."""), include the correct year of latest meaningful modification and the canonical NVIDIA wording used across the repo, and ensure the header remains a top-of-file comment block so tools and license scanners recognize it.
tensorrt_llm/_torch/auto_deploy/models/custom/modeling_deepseek.py (1)
1-1: ⚠️ Potential issue | 🟡 Minor: Update the copyright year on this modified file.
This file changed in this PR but still carries a 2025 header. Please bump it to 2026.
Possible fix:

```diff
-# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
+# Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved.
```

As per coding guidelines, "Add NVIDIA copyright header to ALL new files; update year on modified files."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_torch/auto_deploy/models/custom/modeling_deepseek.py` at line 1, Update the copyright header at the top of tensorrt_llm/_torch/auto_deploy/models/custom/modeling_deepseek.py by changing the year from 2025 to 2026 so the file header reflects the modification year; locate the file's top-of-file copyright comment and replace "2025" with "2026".
🧹 Nitpick comments (5)
tensorrt_llm/_torch/auto_deploy/custom_ops/normalization/triton_rms_norm.py (1)
54-59: Good fix for the stride mismatch bug.
The contiguity guard correctly prevents out-of-bounds writes when `hidden_states` is a non-contiguous view. The comment clearly explains the root cause.
Consider applying the same fix to `triton_fused_add_rms_norm_quant_fp8.py`. Both `rms_norm_quant_fp8()` (lines 75–77) and `fused_add_rms_norm_quant_fp8()` (lines 181–200) compute strides from reshaped tensors without ensuring contiguity first, exhibiting the same vulnerability.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_torch/auto_deploy/custom_ops/normalization/triton_rms_norm.py` around lines 54 - 59, Add the same contiguity guard used in triton_rms_norm.py to triton_fused_add_rms_norm_quant_fp8.py: inside rms_norm_quant_fp8() (around lines where hidden_states is reshaped and strides are computed) and inside fused_add_rms_norm_quant_fp8() (before computing strides from the reshaped input tensors), check if the relevant input tensors (e.g., hidden_states or the addend) are contiguous and call .contiguous() to replace them if not; this prevents input_stride != output_stride mismatches that lead to out-of-bounds writes in the Triton kernel.
tensorrt_llm/_torch/auto_deploy/transform/library/sharding.py (1)
606-616: Harden `_split_scale` against invalid `world_size`/`rank` inputs.
Line 615 can raise `ZeroDivisionError` for `world_size <= 0`, and out-of-range ranks can silently map to the wrong shard group. Add explicit argument checks up front.
Proposed patch:

```diff
 scale_dim = scale.shape[dim]
 if scale_dim <= 0:
     raise ValueError(f"Invalid scale dimension ({scale_dim}) for dim={dim}.")
+if world_size <= 0:
+    raise ValueError(f"world_size must be > 0, got {world_size}.")
+if rank < 0 or rank >= world_size:
+    raise ValueError(f"rank must be in [0, {world_size}), got {rank}.")
 if scale_dim >= world_size:
     return torch.tensor_split(scale, world_size, dim=dim)[rank]
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_torch/auto_deploy/transform/library/sharding.py` around lines 606 - 616, The _split_scale function lacks validation for world_size and rank which can cause ZeroDivisionError or incorrect grouping; add upfront argument checks: validate that world_size is an integer > 0 and raise ValueError if not, and validate that rank is an integer in [0, world_size-1] and raise IndexError or ValueError if out of range; keep the existing logic (use torch.tensor_split and proportional binning) after these checks so torch.tensor_split(scale, world_size, dim=dim)[rank] and the group computation cannot divide by zero or map an invalid rank to the wrong group.
tests/unittest/auto_deploy/multigpu/transformations/library/test_tp_sharding.py (1)
106-117: Nice regression test; consider covering the new invalid-dimension guard too.
This test covers the non-divisible mapping path well. Add a companion case for empty scale rows to assert the `ValueError` path introduced in `_split_scale`.
Suggested additional test:
```diff
+def test_finegrained_fp8_split_scale_invalid_scale_dim():
+    scale = torch.empty(0, 4, dtype=torch.float32)
+    with pytest.raises(ValueError, match="Invalid scale dimension"):
+        FineGrainedFP8WeightShardingInfo._split_scale(
+            scale=scale, dim=0, rank=0, world_size=8
+        )
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/unittest/auto_deploy/multigpu/transformations/library/test_tp_sharding.py` around lines 106 - 117, Add a companion unit test that asserts FineGrainedFP8WeightShardingInfo._split_scale raises a ValueError when the computed row slice would be empty: construct a small scale tensor with fewer rows than the mapping requires (e.g., scale with shape (N, ...) and a larger world_size so some ranks map to group >= N), call _split_scale for a rank that maps outside the available rows, and use pytest.raises(ValueError) to verify the guard path.
examples/auto_deploy/model_registry/generate_csv.py (2)
9-9: Unused constant `REPO_ROOT`.
`REPO_ROOT` is defined but never used in the script. Consider removing it to avoid confusion.

```diff
-REPO_ROOT = Path(__file__).resolve().parents[3]
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/auto_deploy/model_registry/generate_csv.py` at line 9, The constant REPO_ROOT defined as Path(__file__).resolve().parents[3] in generate_csv.py is unused; remove the REPO_ROOT declaration (and its import usage if only used for that) to avoid dead code and confusion—search for the symbol REPO_ROOT in this file and delete the assignment and any now-unneeded imports (e.g., Path) or replace with a local path usage if it was intended to be used.
21-32: Consider adding basic error handling for robustness.
The script doesn't handle cases where `models.yaml` is missing or malformed. While this is a development utility, adding minimal error handling would improve usability.
🛡️ Optional: Add error handling
```diff
 def main():
+    if not MODELS_YAML.exists():
+        print(f"Error: {MODELS_YAML} not found")
+        return
+
     with open(MODELS_YAML) as f:
         data = yaml.safe_load(f)
+    if not data or "models" not in data:
+        print("Error: Invalid models.yaml format")
+        return
+
     rows = []
     for entry in data.get("models", []):
-        name = entry["name"]
+        name = entry.get("name")
+        if not name:
+            continue
         hf_link = f"{HF_BASE_URL}/{name}"
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/auto_deploy/model_registry/generate_csv.py` around lines 21 - 32, Wrap file reading and YAML parsing in try/except to handle missing or malformed MODELS_YAML and validate the structure before iterating: in main() catch FileNotFoundError when opening MODELS_YAML and yaml.YAMLError from yaml.safe_load, log/print a clear error and exit/return with non‑zero status; also validate that data.get("models") is a list before looping and handle missing keys (entry["name"]) by skipping or reporting the malformed entry; keep references to MODELS_YAML, yaml.safe_load, build_command, HF_BASE_URL and rows so the fixes are made in the same scope.
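A runnable sketch of the defensive-loading pattern suggested above. `load_models` is a hypothetical name, and `parse` stands in for `yaml.safe_load` so the sketch stays dependency-free (with PyYAML, the `except ValueError` branch would catch `yaml.YAMLError` instead):

```python
import sys
from pathlib import Path

def load_models(models_path: Path, parse):
    """Defensively load a model-registry file instead of crashing.

    `parse` is a text -> object function (stand-in for yaml.safe_load).
    Returns a list of well-formed model entries, or [] on any error.
    """
    try:
        text = models_path.read_text()
    except FileNotFoundError:
        print(f"Error: {models_path} not found", file=sys.stderr)
        return []
    try:
        data = parse(text)
    except ValueError as exc:  # json.JSONDecodeError / yaml.YAMLError analogue
        print(f"Error: malformed {models_path}: {exc}", file=sys.stderr)
        return []
    models = data.get("models") if isinstance(data, dict) else None
    if not isinstance(models, list):
        print("Error: expected a top-level 'models' list", file=sys.stderr)
        return []
    # Skip malformed entries instead of raising KeyError on entry["name"].
    return [m for m in models if isinstance(m, dict) and m.get("name")]
```

For example, `load_models(Path("models.json"), json.loads)` returns only the entries that carry a `name`, and an empty list for a missing or malformed file.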
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@examples/auto_deploy/model_registry/generate_csv.py`:
- Around line 1-6: Add the required NVIDIA copyright header at the top of the
generate_csv.py module above the module docstring; include the current year (or
year of latest meaningful modification) and follow the project's standard NVIDIA
copyright header template so the file-level docstring and imports remain
unchanged.
In `@tensorrt_llm/_torch/auto_deploy/models/custom/mla_rope_utils.py`:
- Around line 107-115: The code currently silently casts FP8 kv_b_proj weights
to bfloat16 when the corresponding scale_key is missing (see state_dict,
scale_key, w_key, kv_b_proj, layer_idx, and the w.to(torch.bfloat16) branch);
change this to fail fast by raising a clear exception instead of continuing:
replace the silent fallback (and possibly the ad_logger.warning_once) with
raising a RuntimeError (or ValueError) that includes the missing scale_key,
w_key, and layer_idx so loading stops immediately and the checkpoint-format
problem is surfaced.
- Around line 119-130: The code currently expands FP8 scales assuming scale has
shape [N/block_n, K/block_k], but triton_fp8_quantize_1x128 emits scales with
shape [ceil(K/128), N] (K-chunk major), so the axes are swapped; fix by treating
scale as (num_k_chunks, N), transpose it to (N, num_k_chunks), then
repeat_interleave along dim=1 by the FP8 chunk size (128) to expand K, and
finally slice to [:N, :K]; update the expansion using the existing symbols
scale, w (N, K), and state_dict[w_key] so dequantization uses the correct
per-128-K-chunk scales (assume block_k = 128 as in triton_fp8_quantize_1x128).
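The axis bookkeeping described in the item above can be illustrated with NumPy (hypothetical sizes; assumes `block_k = 128` as in `triton_fp8_quantize_1x128`):

```python
import numpy as np

# Weight w is (N, K); the quantizer emits scales shaped (ceil(K/128), N),
# i.e. K-chunk major, so the axes must be swapped before expansion.
N, K, block_k = 6, 300, 128
num_k_chunks = -(-K // block_k)  # ceil(300 / 128) = 3
scale = np.arange(num_k_chunks * N, dtype=np.float32).reshape(num_k_chunks, N)

# (num_k_chunks, N) -> (N, num_k_chunks), broadcast each chunk's scale
# across its 128 K-columns, then trim the padding from the last chunk.
expanded = np.repeat(scale.T, block_k, axis=1)[:, :K]

assert expanded.shape == (N, K)
# Column k of any row uses the scale of K-chunk k // block_k:
assert expanded[0, 129] == scale[1, 0]
```

In the actual fix the same transpose / `repeat_interleave` / slice sequence would be applied to the torch scale tensor before multiplying into the dequantized weight.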
In `@tests/integration/defs/accuracy/test_llm_api_autodeploy.py`:
- Around line 954-964: The _get_registry_yaml_extra() helper currently only
parses world_size from filenames; modify it to first open each YAML in the
provided yaml_extra list, safe-load the YAML content (e.g., yaml.safe_load) and,
if a top-level integer key "world_size" exists, use that value as world_size,
then only if not found fall back to the existing regex filename extraction for
patterns like world_size_N.yaml to preserve backward compatibility; ensure you
reference and update the logic inside the _get_registry_yaml_extra function so
tests like deepseek-ai/DeepSeek-R1 (yaml_extra: ['deepseek-r1.yaml'] with
world_size: 8) return world_size=8.
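The YAML-first, filename-fallback logic described above can be sketched as follows. `get_world_size` is a hypothetical helper name; a real implementation would use `yaml.safe_load`, but a line scan is used here to keep the sketch dependency-free:

```python
import re
import tempfile
from pathlib import Path

def get_world_size(yaml_path: Path):
    """Prefer an explicit top-level `world_size:` key in the YAML; fall back
    to the legacy world_size_N.yaml filename convention; else return None."""
    for line in yaml_path.read_text().splitlines():
        m = re.fullmatch(r"world_size:\s*(\d+)", line.strip())
        if m:
            return int(m.group(1))
    m = re.search(r"world_size_(\d+)\.yaml$", yaml_path.name)
    return int(m.group(1)) if m else None

with tempfile.TemporaryDirectory() as tmp:
    cfg = Path(tmp) / "deepseek-r1.yaml"
    cfg.write_text("model: deepseek-ai/DeepSeek-R1\nworld_size: 8\n")
    print(get_world_size(cfg))  # 8
```

With this ordering, `deepseek-r1.yaml` carrying `world_size: 8` resolves to 8, while an older `world_size_4.yaml` without the key still resolves to 4 via its filename.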
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: fa984ac1-742f-4595-9d94-8584a5065b8d
📒 Files selected for processing (12)
examples/auto_deploy/build_and_run_ad.py
examples/auto_deploy/model_registry/configs/deepseek-r1.yaml
examples/auto_deploy/model_registry/generate_csv.py
examples/auto_deploy/model_registry/models.yaml
tensorrt_llm/_torch/auto_deploy/custom_ops/normalization/triton_rms_norm.py
tensorrt_llm/_torch/auto_deploy/models/custom/mla_rope_utils.py
tensorrt_llm/_torch/auto_deploy/models/custom/modeling_deepseek.py
tensorrt_llm/_torch/auto_deploy/transform/library/sharding.py
tests/integration/defs/accuracy/references/gsm8k.yaml
tests/integration/defs/accuracy/references/mmlu.yaml
tests/integration/defs/accuracy/test_llm_api_autodeploy.py
tests/unittest/auto_deploy/multigpu/transformations/library/test_tp_sharding.py
Force-pushed 11cf343 to b55bfb1
taylor-yb-lee left a comment:
LGTM, except question about the accuracy value commits
/bot run --stage-list "DGX_B200-4_GPUs-AutoDeploy-1, DGX_H100-4_GPUs-AutoDeploy-1, DGX_H100-4_GPUs-AutoDeploy-Post-Merge-1"
PR_Github #41490 [ run ] triggered by Bot. Commit:
PR_Github #41490 [ run ] completed with state
/bot run
PR_Github #41505 [ run ] triggered by Bot. Commit:
PR_Github #41505 [ run ] completed with state
Force-pushed 2744cf2 to cef786a
/bot run --stage-list "DGX_B200-4_GPUs-AutoDeploy-1, DGX_H100-4_GPUs-AutoDeploy-1, DGX_H100-4_GPUs-AutoDeploy-Post-Merge-1"
PR_Github #41642 [ run ] triggered by Bot. Commit:
PR_Github #41642 [ run ] completed with state
/bot run
PR_Github #41661 [ run ] triggered by Bot. Commit:
PR_Github #41661 [ run ] completed with state
/bot run
PR_Github #42093 [ run ] triggered by Bot. Commit:
PR_Github #42093 [ run ] completed with state
Signed-off-by: Gal Hubara Agam <96368689+galagam@users.noreply.github.com>
/bot run --stage-list "DGX_B200-4_GPUs-AutoDeploy-1, DGX_H100-4_GPUs-AutoDeploy-1, DGX_H100-4_GPUs-AutoDeploy-Post-Merge-1,DGX_B200-8_GPUs-AutoDeploy-Post-Merge-1"
PR_Github #42135 [ run ] triggered by Bot. Commit:
PR_Github #42135 [ run ] completed with state
/bot run --stage-list "DGX_B200-4_GPUs-AutoDeploy-1, DGX_H100-4_GPUs-AutoDeploy-1, DGX_H100-4_GPUs-AutoDeploy-Post-Merge-1,DGX_B200-8_GPUs-AutoDeploy-Post-Merge-1" --reuse-test --disbale-fail-fast
PR_Github #42146 Bot args parsing error: usage: /bot [-h]
/bot run --stage-list "DGX_B200-4_GPUs-AutoDeploy-1, DGX_H100-4_GPUs-AutoDeploy-1, DGX_H100-4_GPUs-AutoDeploy-Post-Merge-1,DGX_B200-8_GPUs-AutoDeploy-Post-Merge-1" --reuse-test --disable-fail-fast
PR_Github #42149 [ run ] triggered by Bot. Commit:
PR_Github #42149 [ run ] completed with state
/bot run --stage-list "DGX_B200-4_GPUs-AutoDeploy-1, DGX_H100-4_GPUs-AutoDeploy-1, DGX_H100-4_GPUs-AutoDeploy-Post-Merge-1,DGX_B200-8_GPUs-AutoDeploy-Post-Merge-1" --reuse-test --disable-fail-fast
PR_Github #42178 [ run ] triggered by Bot. Commit:
PR_Github #42178 [ run ] completed with state
/bot run
PR_Github #42202 [ run ] triggered by Bot. Commit:
PR_Github #42202 [ run ] completed with state
/bot run
PR_Github #42283 [ run ] triggered by Bot. Commit:
PR_Github #42283 [ run ] completed with state
/bot run
PR_Github #42361 [ run ] triggered by Bot. Commit:
PR_Github #42361 [ run ] completed with state
/bot run --reuse-test
PR_Github #42483 [ run ] triggered by Bot. Commit:
PR_Github #42483 [ run ] completed with state
Summary by CodeRabbit
New Features
- `--use-registry` flag.
Bug Fixes
Tests
Description
Registers DeepSeek-R1 in the AutoDeploy model registry and fixes all blocking issues needed to run it end-to-end with torch-cudagraph backend.
Added accuracy test for development (not added to test lists yet).
Output is no longer gibberish; MMLU and GSM8K are both ~75, but we need to check whether higher accuracy can be recovered (TRT-LLM reports higher accuracy for the NVFP4 checkpoint).
Fixes:
Test Coverage
tests/integration/defs/accuracy/test_llm_api_autodeploy.py::TestModelRegistryAccuracy::test_autodeploy_from_registry[deepseek-ai_DeepSeek-R1-True]
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
To see a list of available CI bot commands, please comment `/bot help`.