fix: Float32RMSNorm torch.compile crash on PyTorch 2.11+ #1650

Merged

akoumpa merged 5 commits into main from hemild/fix-torch-compile-rmsnorm on Apr 2, 2026

Conversation

@hemildesai (Contributor)

Summary

  • Extract Float32RMSNorm.forward into a standalone _float32_rms_norm_fwd() compiled function, which removes the dynamo guards on self and module state and reduces the number of guard-state combinations (a minimal sketch of the pattern follows this list)
  • Bump the default dynamo cache_size_limit to 64 unconditionally in compile_utils.py so that per-method @torch.compile decorators don't hit FailOnRecompileLimitHit from varying ndim/autocast/grad_mode combinations (a config sketch follows the RMSNorm sketch below)
  • Root cause: PyTorch 2.11 enforces its recompilation limit (default 8) more strictly, and MoE training with variable-length sequences triggers more than 8 guard-state combinations
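The snippet below is a minimal sketch of the refactor in the first bullet, assuming a standard fp32 RMSNorm formula; the function name _float32_rms_norm_fwd comes from this PR's description, while the exact signature and body in the repo may differ.

```python
import torch
from torch import nn


@torch.compile  # compiling a free function avoids dynamo guards on `self` and module state
def _float32_rms_norm_fwd(x: torch.Tensor, weight: torch.Tensor, eps: float) -> torch.Tensor:
    # Upcast to fp32 for the reduction, normalize, then cast back to the input dtype.
    x_fp32 = x.float()
    variance = x_fp32.pow(2).mean(-1, keepdim=True)
    x_fp32 = x_fp32 * torch.rsqrt(variance + eps)
    return weight * x_fp32.to(x.dtype)


class Float32RMSNorm(nn.Module):
    def __init__(self, hidden_size: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(hidden_size))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # forward is now a thin wrapper; only tensors and a scalar cross the compile boundary
        return _float32_rms_norm_fwd(x, self.weight, self.eps)
```

Because the compiled callable no longer closes over the module, dynamo only needs to guard on the tensor arguments and eps rather than on module attributes.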

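For the cache_size_limit bump, a hedged sketch of the relevant dynamo knob follows; the helper name configure_torch_dynamo appears in the review discussion, but the actual compile_utils.py implementation may differ in name and structure.

```python
import torch._dynamo


def configure_torch_dynamo(cache_size_limit: int = 64) -> None:
    # Per this PR, PyTorch 2.11 fails with FailOnRecompileLimitHit once a
    # compiled function exceeds its recompile budget (default 8). Raising the
    # limit gives per-method @torch.compile room for the guard-state
    # combinations produced by varying ndim, autocast, and grad_mode.
    torch._dynamo.config.cache_size_limit = cache_size_limit
```

As noted in the review below, a module-level default like this only takes effect when full model compilation is disabled, since apply_torch_compile otherwise applies CompileConfig's own (higher) limit.
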
Test plan

  • Verified that the fix runs 50 steps of Qwen3 MoE 30B SFT with rms_norm: torch_fp32 on PyTorch 2.11 (torch 2.11.0a0+eb65b36914.nv26.02) without a crash
  • TPS/GPU: 3138 (fixed torch_fp32) vs 3218 (TE RMSNorm), comparable performance
  • Pre-commit hooks pass (ruff, ruff-format)
  • CI unit tests

🤖 Generated with Claude Code

copy-pr-bot (Bot) commented on Apr 1, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@hemildesai (Contributor, Author)

/claude review

@hemildesai (Contributor, Author)

/ok to test 3f567d4

hemildesai changed the title from "Fix Float32RMSNorm torch.compile crash on PyTorch 2.11+" to "fix: Float32RMSNorm torch.compile crash on PyTorch 2.11+" on Apr 1, 2026
Comment thread on nemo_automodel/components/utils/compile_utils.py (Outdated)

claude (Bot) left a comment

Clean refactor — extracting the forward into a standalone compiled function to reduce dynamo guards makes sense.

One concern flagged inline: the module-level configure_torch_dynamo(cache_size_limit=64) sets a limit lower than the existing default of 256 used by CompileConfig. This is only effective when full model compile is disabled (otherwise apply_torch_compile overrides it). The comment says "bump" which is misleading — worth clarifying the intent or aligning the values.

@hemildesai (Contributor, Author)

/claude review

@hemildesai (Contributor, Author)

/ok to test 3406485

@hemildesai (Contributor, Author)

/claude review

@hemildesai (Contributor, Author)

/ok to test b18eaae

Fixes L0_Unit_Tests_CPU failure introduced by 980f23d on main.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: hemildesai <hemild@nvidia.com>
@hemildesai (Contributor, Author)

/claude review

@hemildesai (Contributor, Author)

/ok to test a2f4ee8

claude (Bot) left a comment

LGTM

akoumpa enabled auto-merge (squash) on Apr 2, 2026, 01:45
akoumpa merged commit ec2f724 into main on Apr 2, 2026
53 checks passed
akoumpa deleted the hemild/fix-torch-compile-rmsnorm branch on Apr 2, 2026, 02:22