[https://nvbugs/6244474][fix] AutoDeploy: skip explicit shape-prop after MLIR elementwise fusion #14795
Walkthrough
This pull request modifies the MLIR elementwise fusion transform to defer graph cleanup and shape propagation operations to later framework stages rather than performing them immediately.
4e2a6bc to ae001c3
/bot run --extra-stage "DGX_B200-4_GPUs-AutoDeploy-1, DGX_H100-4_GPUs-AutoDeploy-1" --disable-fail-fast
PR_Github #51255 [ run ] triggered by Bot. Commit:
PR_Github #51255 [ run ] completed with state
… fusion

The MLIR elementwise fusion transform unconditionally ran fake-tensor shape propagation on the FX-reconstructed graph. On Llama-3.1-8B FP8 (post-NVIDIA#14622 YAML), fuse_rope_into_trtllm_attention rewires Q/K/V to a single fused-QKV tensor of shape (B, S, 6144) and stores _trtllm_fused_qkv in node.meta; the actual op swap happens later at cache_init. While the graph sits in this deliberately invalid intermediate state at post_load_fusion, FakeTensorProp re-evaluates torch_attention.register_fake with query=value=fused_qkv and produces (B, S, 6144), which then fails the downstream attn_output.reshape(B, S, num_heads * head_dim = 4096) at modeling_llama3.py:189.

The transform's YAML config already declares run_shape_prop: false (see mlir/agent_learnings.md section 6: 'Prefer run_shape_prop: false unless the transform specifically needs re-propagated shapes'). The redundant inline call contradicts that intent. Drop it and return has_valid_shapes=False so downstream transforms that require shape metadata re-derive it via the framework's _run_cleanup, which handles these intermediate states correctly.

Signed-off-by: tensorrt-cicd <90828364+tensorrt-cicd@users.noreply.github.com>
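The shape arithmetic behind the failure can be checked without any framework. The sketch below is illustrative only: the head counts (32 query heads, 8 KV heads, head_dim 128) are assumed from the public Llama-3.1-8B configuration and are not stated in this PR, and fused_qkv_width and can_reshape are hypothetical helper names, not AutoDeploy functions.

```python
from math import prod

# Assumed Llama-3.1-8B attention geometry (not stated in the PR itself).
NUM_HEADS, NUM_KV_HEADS, HEAD_DIM = 32, 8, 128


def fused_qkv_width(num_heads, num_kv_heads, head_dim):
    """Last-dim width of a tensor that concatenates Q, K and V."""
    return (num_heads + 2 * num_kv_heads) * head_dim


def can_reshape(src_shape, dst_shape):
    """A reshape is only legal when both shapes hold the same element count."""
    return prod(src_shape) == prod(dst_shape)


B, S = 2, 16  # arbitrary batch size and sequence length for the check
fused = fused_qkv_width(NUM_HEADS, NUM_KV_HEADS, HEAD_DIM)

# Shape reported when the attention fake kernel is fed the fused-QKV tensor.
propagated = (B, S, fused)
# Shape the model code expects: (B, S, num_heads * head_dim).
expected = (B, S, NUM_HEADS * HEAD_DIM)

print(fused, NUM_HEADS * HEAD_DIM, can_reshape(propagated, expected))
```

Under these assumptions the fused width is 4096 + 1024 + 1024 = 6144, so a propagated (B, S, 6144) output cannot be reshaped to (B, S, 4096) for any B and S.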
ae001c3 to 30979b5
…ter MLIR elementwise fusion (NVIDIA#14795)
Signed-off-by: tensorrt-cicd <90828364+tensorrt-cicd@users.noreply.github.com>
Signed-off-by: yufeiwu-nv <230315618+yufeiwu-nv@users.noreply.github.com>
Summary

Fixes nvbugs/6244474: Llama-3.1-8B-Instruct-FP8 AutoDeploy pipeline crashes at attention output reshape (modeling_llama3.py:189) when both fuse_rope_into_trtllm_attention and mlir_elementwise_fusion are enabled.

Root cause

Both transforms run in the post_load_fusion stage. In post_load_fusion, fuse_rope_into_trtllm_attention deliberately rewires Q/K/V to a single fused-QKV tensor of shape (B, S, 6144) and records _trtllm_fused_qkv in node.meta; the actual op swap to trtllm_mha_with_cache happens later at cache_init. While the graph sits in this intermediate state, mlir_elementwise_fusion._apply was unconditionally calling run_shape_prop(new_gm) after FX reconstruction. FakeTensorProp then re-evaluated torch_attention.register_fake with query=key=value=fused_qkv, producing an output of shape (B, S, 6144), which fails the downstream attn_output.reshape(B, S, num_heads * head_dim = 4096).

The transform's YAML config already declares run_shape_prop: false (see mlir/agent_learnings.md §6: "Prefer run_shape_prop: false unless the transform specifically needs re-propagated shapes"). The redundant inline call contradicted that intent; it was a stale leftover from before the YAML flag was added in 7a4752df2e. The combination became reachable only after #13859 moved mlir_elementwise_fusion to run after fuse_rope_into_trtllm_attention in the YAML ordering, and was first triggered by the Llama-3.1-8B FP8 config tuning in #14622.

Fix
- Remove the inline canonicalize_graph + run_shape_prop calls in MLIRElementwiseFusion._apply.
- Return is_clean=False, has_valid_shapes=False so the framework's _run_cleanup runs canonicalization (since run_graph_cleanup defaults to True) and downstream transforms with requires_shape_prop=True re-derive shapes via the framework's standard path, by which point insert_cached_attention has performed the real op swap and the graph is in a valid state.
- perf/test_perf_sanity.py::test_e2e[aggr_upload-llama3_1_8b_fp8_ad_hopper-llama3_1_8b_ad_ws1_1k1k].
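The deferral described in the fix can be pictured with a toy model. This is a minimal sketch, not the TensorRT-LLM implementation: the Info and Config classes and the cleanup_actions function are hypothetical stand-ins, and only the flag names (is_clean, has_valid_shapes, run_graph_cleanup, requires_shape_prop) come from the description above. The rule it encodes, that canonicalization runs when the graph is reported not clean and shape propagation runs only for a transform that requires it, is an assumption about how the framework consumes these flags.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Info:
    """Hypothetical summary a transform returns about the graph it leaves behind."""
    is_clean: bool = False
    has_valid_shapes: bool = False


@dataclass
class Config:
    """Hypothetical per-transform settings, named after the flags in the PR text."""
    run_graph_cleanup: bool = True      # stated default in the PR description
    requires_shape_prop: bool = False


def cleanup_actions(prev: Info, cfg: Config) -> List[str]:
    """Decide which deferred steps run before the next transform (assumed logic)."""
    actions = []
    if cfg.run_graph_cleanup and not prev.is_clean:
        actions.append("canonicalize")
    if cfg.requires_shape_prop and not prev.has_valid_shapes:
        actions.append("shape_prop")
    return actions


# What the fusion transform reports after the fix: nothing is guaranteed.
after_fusion = Info(is_clean=False, has_valid_shapes=False)

# A later transform that does not need shapes only triggers canonicalization.
print(cleanup_actions(after_fusion, Config()))
# A later transform that needs shapes triggers propagation at that point instead.
print(cleanup_actions(after_fusion, Config(requires_shape_prop=True)))
```

The design point is that the transform reports what it did not guarantee instead of repairing the graph itself, so shape propagation happens only when a later transform asks for it.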