[https://nvbugs/6244474][fix] AutoDeploy: skip explicit shape-prop after MLIR elementwise fusion by tensorrt-cicd · Pull Request #14795 · NVIDIA/TensorRT-LLM

tensorrt-cicd · 2026-05-31T08:36:42Z

Summary

Fixes nvbugs/6244474: Llama-3.1-8B-Instruct-FP8 AutoDeploy pipeline crashes at attention output reshape (modeling_llama3.py:189) when both fuse_rope_into_trtllm_attention and mlir_elementwise_fusion are enabled.

Root cause

Both transforms run in the post_load_fusion stage. There, fuse_rope_into_trtllm_attention deliberately rewires Q/K/V to a single fused-QKV tensor of shape (B, S, 6144) and records _trtllm_fused_qkv in node.meta; the actual op swap to trtllm_mha_with_cache happens later, at cache_init. While the graph sits in this intermediate state, mlir_elementwise_fusion._apply unconditionally called run_shape_prop(new_gm) after FX reconstruction. FakeTensorProp then re-evaluated torch_attention.register_fake with query=key=value=fused_qkv and produced an output of shape (B, S, 6144), which cannot be reshaped by the downstream attn_output.reshape(B, S, num_heads * head_dim = 4096).
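The shape arithmetic behind the crash can be checked with a minimal, pure-Python sketch. It assumes the usual Llama-3.1-8B attention geometry (32 query heads, 8 KV heads, head_dim 128) and assumes the fake attention kernel reports an output shaped like its query argument; the helper names are hypothetical and are not part of TensorRT-LLM.

```python
from math import prod

# Llama-3.1-8B attention geometry (assumed here for illustration).
NUM_HEADS, NUM_KV_HEADS, HEAD_DIM = 32, 8, 128


def fused_qkv_width(num_heads: int, num_kv_heads: int, head_dim: int) -> int:
    """Last-dim size of a tensor that packs Q, K and V side by side."""
    return (num_heads + 2 * num_kv_heads) * head_dim


def fake_attention_out_shape(query_shape: tuple) -> tuple:
    """Assumed behaviour of the fake kernel: output is shaped like `query`."""
    return tuple(query_shape)


def checked_reshape(src_shape: tuple, dst_shape: tuple) -> tuple:
    """Mimic the element-count check that a reshape performs."""
    if prod(src_shape) != prod(dst_shape):
        raise ValueError(f"cannot reshape {src_shape} into {dst_shape}")
    return tuple(dst_shape)


B, S = 2, 16  # arbitrary batch size and sequence length
target = (B, S, NUM_HEADS * HEAD_DIM)
fused = (B, S, fused_qkv_width(NUM_HEADS, NUM_KV_HEADS, HEAD_DIM))

# Valid graph: query is the real Q projection, so the reshape succeeds.
assert checked_reshape(fake_attention_out_shape(target), target) == target

# Intermediate graph: query is the fused QKV tensor, so the reshape fails.
try:
    checked_reshape(fake_attention_out_shape(fused), target)
except ValueError as err:
    print(err)
```

With these numbers the fused width is (32 + 2 * 8) * 128 = 6144, while the reshape target is 32 * 128 = 4096, which matches the two shapes quoted above.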

The transform's YAML config already declares run_shape_prop: false (see mlir/agent_learnings.md §6: "Prefer run_shape_prop: false unless the transform specifically needs re-propagated shapes"). The redundant inline call contradicted that intent; it was a stale leftover from before the YAML flag was added in 7a4752df2e. The combination became reachable only after #13859 moved mlir_elementwise_fusion to run after fuse_rope_into_trtllm_attention in the YAML ordering, and was first triggered by the Llama-3.1-8B FP8 config tuning in #14622.
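Why the ordering matters can be illustrated with a toy model, in which the graph is a plain dict and each transform is a function. This is only a sketch under those assumptions: the function bodies are hypothetical stand-ins and do not reproduce the real AutoDeploy transforms.

```python
def fuse_rope_into_trtllm_attention(graph: dict) -> None:
    """Toy stand-in: rewire Q/K/V to the fused tensor, leave the op untouched."""
    graph["attention_inputs"] = "fused_qkv"


def toy_shape_prop(graph: dict) -> None:
    """Toy stand-in for shape propagation over the attention node."""
    if (
        graph["attention_op"] == "torch_attention"
        and graph["attention_inputs"] == "fused_qkv"
    ):
        raise RuntimeError("attention output has fused-QKV width; reshape fails")


def mlir_elementwise_fusion_old(graph: dict) -> None:
    """Toy stand-in for the pre-fix behaviour: always re-propagate shapes inline."""
    toy_shape_prop(graph)


def run_stage(transforms) -> dict:
    """Apply the transforms in order to a fresh toy graph."""
    graph = {"attention_op": "torch_attention", "attention_inputs": "separate_qkv"}
    for transform in transforms:
        transform(graph)
    return graph


def crashes(transforms) -> bool:
    """Report whether running the transforms in this order raises."""
    try:
        run_stage(transforms)
    except RuntimeError:
        return True
    return False


# Old ordering: shape propagation sees separate Q/K/V and passes.
print(crashes([mlir_elementwise_fusion_old, fuse_rope_into_trtllm_attention]))
# New ordering: shape propagation sees the intermediate fused state.
print(crashes([fuse_rope_into_trtllm_attention, mlir_elementwise_fusion_old]))
```

In this model the inline shape propagation is harmless as long as it runs before the rewiring, which is consistent with the bug staying latent until the ordering changed.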

Fix

- Drop the inline canonicalize_graph + run_shape_prop calls in MLIRElementwiseFusion._apply.
- Return is_clean=False, has_valid_shapes=False so the framework's _run_cleanup runs canonicalization (since run_graph_cleanup defaults to True) and downstream transforms with requires_shape_prop=True re-derive shapes via the framework's standard path. By that point insert_cached_attention has performed the real op swap and the graph is in a valid state.
- Remove the stale waiver for perf/test_perf_sanity.py::test_e2e[aggr_upload-llama3_1_8b_fp8_ad_hopper-llama3_1_8b_ad_ws1_1k1k].
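The deferred behaviour described in this fix can be sketched with a toy pipeline. The field and flag names (is_clean, has_valid_shapes, requires_shape_prop, run_graph_cleanup) are taken from the description above, but ToyPipeline, its run method, and the two helper transforms are hypothetical and only approximate how the real framework might schedule cleanup.

```python
from dataclasses import dataclass, replace
from typing import Callable, List


@dataclass(frozen=True)
class TransformInfo:
    """Toy version of the status a transform reports after running."""
    is_clean: bool
    has_valid_shapes: bool


class ToyPipeline:
    """Hypothetical driver that runs cleanup and shape propagation lazily."""

    def __init__(self) -> None:
        self.info = TransformInfo(is_clean=True, has_valid_shapes=True)
        self.log: List[str] = []

    def run(
        self,
        name: str,
        apply: Callable[[TransformInfo], TransformInfo],
        requires_shape_prop: bool = False,
        run_graph_cleanup: bool = True,
    ) -> None:
        # Re-derive shapes only when a transform asks for them and they are stale.
        if requires_shape_prop and not self.info.has_valid_shapes:
            self.log.append(f"shape_prop before {name}")
            self.info = replace(self.info, has_valid_shapes=True)
        self.info = apply(self.info)
        self.log.append(f"apply {name}")
        # Canonicalize after the transform if it left the graph unclean.
        if run_graph_cleanup and not self.info.is_clean:
            self.log.append(f"canonicalize after {name}")
            self.info = replace(self.info, is_clean=True)


def modifies_graph(info: TransformInfo) -> TransformInfo:
    """A transform that rewrites the graph and reports both flags as False."""
    return TransformInfo(is_clean=False, has_valid_shapes=False)


def reads_graph(info: TransformInfo) -> TransformInfo:
    """A transform that leaves the graph, and therefore the flags, unchanged."""
    return info


pipeline = ToyPipeline()
pipeline.run("mlir_elementwise_fusion", modifies_graph)
pipeline.run("insert_cached_attention", modifies_graph)
pipeline.run("downstream_transform", reads_graph, requires_shape_prop=True)
for entry in pipeline.log:
    print(entry)
```

In this sketch the only shape propagation happens immediately before downstream_transform, that is, after insert_cached_attention has run, so it never observes the intermediate fused-QKV state.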

Summary by CodeRabbit

Refactor
- Optimized internal graph transformation pipeline by deferring validation operations to later processing stages for improved efficiency during model compilation.

coderabbitai · 2026-05-31T08:38:43Z

📝 Walkthrough

This pull request modifies the MLIR elementwise fusion transform to defer graph cleanup and shape propagation operations to later framework stages rather than performing them immediately. The returned TransformInfo now marks the intermediate graph as non-canonical and lacking valid shapes.

Changes

Deferred graph cleanup

Layer / File(s)	Summary
Step 6 deferred cleanup logic `tensorrt_llm/_torch/auto_deploy/transform/library/mlir_elementwise_fusion.py`	Step 6 removes immediate graph canonicalization and shape propagation calls, instead deferring these cleanup operations to later stages. `TransformInfo` now reports `is_clean=False` and `has_valid_shapes=False` to indicate the graph is in an intermediate state where tensor ranks may be intentionally invalid during later Q/K/V rewiring.

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Title check	✅ Passed	The title clearly and specifically describes the main change: skipping explicit shape propagation after MLIR elementwise fusion, which matches the file modified and the PR's core objective.
Description check	✅ Passed	The PR description provides comprehensive context including root cause analysis, detailed explanation of the fix, and references to related issues and commits.


MrGeva · 2026-05-31T14:07:17Z

/bot run --extra-stage "DGX_B200-4_GPUs-AutoDeploy-1, DGX_H100-4_GPUs-AutoDeploy-1" --disable-fail-fast

tensorrt-cicd · 2026-05-31T14:13:44Z

PR_Github #51255 [ run ] triggered by Bot. Commit: ae001c3 Link to invocation

tensorrt-cicd · 2026-05-31T21:38:06Z

PR_Github #51255 [ run ] completed with state FAILURE. Commit: ae001c3
/LLM/main/L0_MergeRequest_PR pipeline #40676 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

MrGeva · 2026-06-01T03:58:19Z

/bot run --extra-stage "DGX_B200-4_GPUs-AutoDeploy-1, DGX_H100-4_GPUs-AutoDeploy-1" --disable-fail-fast

tensorrt-cicd · 2026-06-01T04:03:52Z

PR_Github #51302 [ run ] triggered by Bot. Commit: ae001c3 Link to invocation

tensorrt-cicd · 2026-06-01T08:46:31Z

PR_Github #51302 [ run ] completed with state SUCCESS. Commit: ae001c3
/LLM/main/L0_MergeRequest_PR pipeline #40719 completed with status: 'FAILURE'


MrGeva · 2026-06-01T12:19:46Z

/bot run --extra-stage "DGX_B200-4_GPUs-AutoDeploy-1, DGX_H100-4_GPUs-AutoDeploy-1" --disable-fail-fast

MrGeva · 2026-06-02T13:50:33Z

/bot run --extra-stage "DGX_B200-4_GPUs-AutoDeploy-1, DGX_H100-4_GPUs-AutoDeploy-1" --disable-fail-fast

tensorrt-cicd · 2026-06-02T15:57:58Z

PR_Github #51628 [ run ] triggered by Bot. Commit: ae001c3 Link to invocation

tensorrt-cicd · 2026-06-02T20:40:00Z

PR_Github #51628 [ run ] completed with state FAILURE. Commit: ae001c3
/LLM/main/L0_MergeRequest_PR pipeline #41013 completed with status: 'FAILURE'


MrGeva · 2026-06-03T05:23:39Z

/bot run --extra-stage "DGX_B200-4_GPUs-AutoDeploy-1, DGX_H100-4_GPUs-AutoDeploy-1" --disable-fail-fast

tensorrt-cicd · 2026-06-03T05:29:33Z

PR_Github #51756 [ run ] triggered by Bot. Commit: ae001c3 Link to invocation

… fusion

The MLIR elementwise fusion transform unconditionally ran fake-tensor shape propagation on the FX-reconstructed graph. On Llama-3.1-8B FP8 (post-NVIDIA#14622 YAML), fuse_rope_into_trtllm_attention rewires Q/K/V to a single fused-QKV tensor of shape (B, S, 6144) and stores _trtllm_fused_qkv in node.meta; the actual op swap happens later at cache_init. While the graph sits in this deliberately invalid intermediate state at post_load_fusion, FakeTensorProp re-evaluates torch_attention.register_fake with query=value=fused_qkv and produces (B, S, 6144), which then fails the downstream attn_output.reshape(B, S, num_heads * head_dim = 4096) at modeling_llama3.py:189.

The transform's YAML config already declares run_shape_prop: false (see mlir/agent_learnings.md section 6: 'Prefer run_shape_prop: false unless the transform specifically needs re-propagated shapes'). The redundant inline call contradicts that intent. Drop it and return has_valid_shapes=False so downstream transforms that require shape metadata re-derive it via the framework's _run_cleanup, which handles these intermediate states correctly.

Signed-off-by: tensorrt-cicd <90828364+tensorrt-cicd@users.noreply.github.com>

MrGeva · 2026-06-03T05:53:00Z

/bot run

tensorrt-cicd · 2026-06-03T05:59:28Z

PR_Github #51768 [ run ] triggered by Bot. Commit: 30979b5 Link to invocation

tensorrt-cicd · 2026-06-03T06:02:59Z

PR_Github #51756 [ run ] completed with state ABORTED. Commit: ae001c3

Link to invocation

tensorrt-cicd · 2026-06-03T07:07:38Z

PR_Github #51768 [ run ] completed with state SUCCESS. Commit: 30979b5
/LLM/main/L0_MergeRequest_PR pipeline #41135 completed with status: 'FAILURE'


MrGeva · 2026-06-03T07:57:14Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-06-03T08:02:42Z

PR_Github #51799 [ run ] triggered by Bot. Commit: 30979b5 Link to invocation

tensorrt-cicd · 2026-06-03T13:10:14Z

PR_Github #51799 [ run ] completed with state SUCCESS. Commit: 30979b5
/LLM/main/L0_MergeRequest_PR pipeline #41163 completed with status: 'FAILURE'


MrGeva · 2026-06-03T14:26:07Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-06-03T14:32:01Z

PR_Github #51860 [ run ] triggered by Bot. Commit: 30979b5 Link to invocation

tensorrt-cicd · 2026-06-03T15:56:56Z

PR_Github #51860 [ run ] completed with state SUCCESS. Commit: 30979b5
/LLM/main/L0_MergeRequest_PR pipeline #41218 completed with status: 'FAILURE'


MrGeva · 2026-06-04T06:35:32Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-06-04T06:42:27Z

PR_Github #51997 [ run ] triggered by Bot. Commit: 30979b5 Link to invocation

tensorrt-cicd · 2026-06-04T08:45:31Z

PR_Github #51997 [ run ] completed with state SUCCESS. Commit: 30979b5
/LLM/main/L0_MergeRequest_PR pipeline #41340 completed with status: 'SUCCESS'

CI Report

Link to invocation

…ter MLIR elementwise fusion (NVIDIA#14795)

Signed-off-by: tensorrt-cicd <90828364+tensorrt-cicd@users.noreply.github.com>
Signed-off-by: yufeiwu-nv <230315618+yufeiwu-nv@users.noreply.github.com>

tensorrt-cicd requested a review from a team as a code owner May 31, 2026 08:36

tensorrt-cicd requested a review from suyoggupta May 31, 2026 08:36

tensorrt-cicd assigned MrGeva May 31, 2026

github-actions Bot assigned tensorrt-cicd May 31, 2026

tensorrt-cicd force-pushed the repair-bot-bug6244474 branch from 4e2a6bc to ae001c3 Compare May 31, 2026 10:43

MrGeva approved these changes May 31, 2026

View reviewed changes

MrGeva changed the title ~~[https://nvbugs/6244474][fix] Remove the inline run_shape_prop(new_gm) call and report has_valid_shapes=False~~ [https://nvbugs/6244474] [fix] Remove the inline run_shape_prop(new_gm) call and report has_valid_shapes=False May 31, 2026

MrGeva changed the title ~~[https://nvbugs/6244474] [fix] Remove the inline run_shape_prop(new_gm) call and report has_valid_shapes=False~~ [https://nvbugs/6244474][fix] Remove the inline run_shape_prop(new_gm) call and report has_valid_shapes=False May 31, 2026

MrGeva changed the title ~~[https://nvbugs/6244474][fix] Remove the inline run_shape_prop(new_gm) call and report has_valid_shapes=False~~ [https://nvbugs/6244474][fix] AutoDeploy: skip explicit shape-prop after MLIR elementwise fusion May 31, 2026

suyoggupta approved these changes Jun 1, 2026

View reviewed changes

MrGeva enabled auto-merge (squash) June 1, 2026 12:41

tensorrt-cicd added 2 commits June 3, 2026 08:51

[nvbugs/6244474][chore] Remove stale waiver after fix

30979b5

Signed-off-by: tensorrt-cicd <90828364+tensorrt-cicd@users.noreply.github.com>

MrGeva force-pushed the repair-bot-bug6244474 branch from ae001c3 to 30979b5 Compare June 3, 2026 05:51

MrGeva merged commit 8b0eba9 into NVIDIA:main Jun 4, 2026
7 checks passed
