feat: add TE FusedAdam QuantizedTensor compatibility patch #1417
Merged
hemildesai merged 6 commits into main on Mar 2, 2026
Conversation
Add runtime monkey-patch for Transformer Engine's FusedAdam optimizer to handle QuantizedTensor parameters whose .shape does not carry correct metadata for tensor allocation. The patch dequantizes params before creating optimizer state buffers. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Hemil Desai <hemild@nvidia.com>
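Conceptually, the patch swaps in a version of `_initialize_state` that allocates optimizer state from a dequantized view of the parameter. A simplified sketch of the idea (the signature, the extra branches for scales, FP8 state and param remainders, and all names below are assumptions, not the actual `te_patches.py` code):

```python
import torch

def _patched_initialize_state(self, param, state_name, zero_buffer, dtype=torch.float32):
    # A QuantizedTensor's .shape may not describe the storage needed for optimizer
    # state, so allocate from the dequantized tensor instead of the quantized param.
    data = param.dequantize() if hasattr(param, "dequantize") else param
    alloc = torch.zeros_like if zero_buffer else torch.empty_like
    self.state[param][state_name] = alloc(data, dtype=dtype)
```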
Update the skip-guard to check for all three specific lines from NVIDIA/TransformerEngine#2535 (dequantize, zeros_like, empty_like) instead of just the string "QuantizedTensor". Also update copyright year to 2026 on new files. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Hemil Desai <hemild@nvidia.com>
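A minimal sketch of that kind of source-level guard, assuming the three markers below paraphrase (rather than quote) the upstream fix lines:

```python
import inspect

def upstream_fix_present(initialize_state_fn) -> bool:
    # The fix from NVIDIA/TransformerEngine#2535 dequantizes the param and allocates
    # state via zeros_like / empty_like on the dequantized tensor. Requiring all three
    # markers, not just "QuantizedTensor", avoids false positives on unrelated mentions.
    source = inspect.getsource(initialize_state_fn)
    markers = ("dequantize", "zeros_like", "empty_like")  # paraphrased markers
    return all(marker in source for marker in markers)
```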
Use the existing is_te_min_version utility to skip the monkey-patch entirely when TE >= 2.12, where the fix from NVIDIA/TransformerEngine#2535 is already included. Also update copyright year to 2026 on new files. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Hemil Desai <hemild@nvidia.com>
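Putting the two guards together, the entry point might look roughly like this (function and helper names are illustrative; `is_te_min_version` is the repo utility named in the commit message and is assumed to be importable):

```python
def maybe_apply_fusedadam_patch():
    try:
        import transformer_engine  # noqa: F401
    except ImportError:
        return  # TE not installed: nothing to patch
    if is_te_min_version("2.12"):
        return  # upstream fix from NVIDIA/TransformerEngine#2535 already included
    _install_fusedadam_patch()  # hypothetical helper that swaps in the patched method
```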
hemildesai (Contributor, Author) commented:
/ok to test e04fdff
Exercise all code paths in the patched function body: regular params, QuantizedTensor dequantization, zero_buffer, store_param_remainders, scale creation for non-float32, and uint8/FP8 quantizer branch. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Hemil Desai <hemild@nvidia.com>
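Against the simplified `_patched_initialize_state` sketch above (not the real TE class), a test for the QuantizedTensor branch could be as small as the following; the actual tests in this PR also cover the scale, FP8, and param-remainder branches:

```python
import types
from collections import defaultdict

import torch

def test_quantized_param_branch():
    class FakeQuantized:
        # Duck-typed stand-in for a QuantizedTensor: only dequantize() is needed here.
        def __init__(self, data):
            self._data = data

        def dequantize(self):
            return self._data

    opt = types.SimpleNamespace(state=defaultdict(dict))  # bare optimizer stand-in
    param = FakeQuantized(torch.randn(4, 8))
    _patched_initialize_state(opt, param, "exp_avg", zero_buffer=True)
    assert opt.state[param]["exp_avg"].shape == torch.Size([4, 8])
    assert opt.state[param]["exp_avg"].dtype == torch.float32
```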
TE FusedAdam accepts torch.dtype kwargs (master_weight_dtype, exp_avg_dtype, exp_avg_sq_dtype) but YAML configs produce strings. Resolve them via dtype_from_str before instantiation, following the same pattern used in _dist_setup.py for mp_policy dtypes. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Hemil Desai <hemild@nvidia.com>
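The resolution itself is small; a sketch under the assumption that the strings may or may not carry a `torch.` prefix (this re-implements the idea locally for illustration rather than importing the repo's `dtype_from_str`):

```python
import torch

def dtype_from_str(name: str) -> torch.dtype:
    # Accept both "torch.bfloat16" and "bfloat16"; local re-implementation for
    # illustration only, the repo uses its own dtype_from_str utility.
    attr = getattr(torch, name.removeprefix("torch."))
    if not isinstance(attr, torch.dtype):
        raise ValueError(f"{name!r} does not name a torch.dtype")
    return attr

def resolve_te_adam_dtype_kwargs(kwargs: dict) -> dict:
    # Convert the three TE FusedAdam dtype kwargs if they arrived as YAML strings;
    # values that are already torch.dtype objects (or absent) are left untouched.
    for key in ("master_weight_dtype", "exp_avg_dtype", "exp_avg_sq_dtype"):
        if isinstance(kwargs.get(key), str):
            kwargs[key] = dtype_from_str(kwargs[key])
    return kwargs
```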
Cover all branches of the dtype resolution logic: resolving all three TE FusedAdam dtype kwargs from strings, without torch prefix, preserving existing torch.dtype objects, missing attrs, and partial attrs. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Hemil Desai <hemild@nvidia.com>
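For the sketch above, those branches translate into tests along these lines (the PR's real tests exercise `build_optimizer` itself):

```python
import torch

def test_resolve_te_adam_dtype_kwargs():
    kwargs = {
        "master_weight_dtype": "torch.float32",  # string with torch. prefix
        "exp_avg_dtype": "bfloat16",             # string without prefix
        "exp_avg_sq_dtype": torch.float16,       # already a dtype: preserved as-is
    }
    resolved = resolve_te_adam_dtype_kwargs(kwargs)
    assert resolved["master_weight_dtype"] is torch.float32
    assert resolved["exp_avg_dtype"] is torch.bfloat16
    assert resolved["exp_avg_sq_dtype"] is torch.float16
```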
hemildesai (Contributor, Author) commented:
/ok to test 81ff313
adil-a (Collaborator) approved these changes on Mar 2, 2026 and left a comment:
LGTM, but I want to ask that we update the pinned version for TE in the next release.
@thomasdhc the issue this patch works around is already fixed in TE top-of-tree, so we should update the pinned version accordingly as well. Thank you!
hemildesai added a commit that referenced this pull request on Mar 4, 2026:
feat: add TE FusedAdam QuantizedTensor compatibility patch (#1417)
SwekeR-463 pushed a commit to SwekeR-463/Automodel that referenced this pull request on Mar 11, 2026:
feat: add TE FusedAdam QuantizedTensor compatibility patch (…Mo#1417)
linnanwang pushed a commit that referenced this pull request on Apr 24, 2026:
feat: add TE FusedAdam QuantizedTensor compatibility patch (#1417)
Summary
- Adds `nemo_automodel/shared/te_patches.py` with a runtime monkey-patch for Transformer Engine's `FusedAdam._initialize_state` to handle `QuantizedTensor` parameters.
- Applies the patch in `TrainFinetuneRecipeForNextTokenPrediction.setup()` (in `train_ft.py`), right after `apply_cache_compatibility_patches()`.

Example config:
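A minimal sketch of what such a config could look like (the `_target_` path, learning rate, and dtype values are assumptions; the dtype keys are the TE FusedAdam kwargs named above):

```yaml
optimizer:
  _target_: transformer_engine.pytorch.optimizers.FusedAdam  # assumed import path
  lr: 1.0e-4
  master_weight_dtype: torch.float32
  exp_avg_dtype: torch.bfloat16
  exp_avg_sq_dtype: torch.bfloat16
```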
Context
This is a workaround for NVIDIA/TransformerEngine#2535, specifically
The patch is idempotent and auto-skips if TE is not installed or if the upstream fix is already present.
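For illustration, one way to get that idempotence (the wrapper and the `_automodel_patched` marker attribute are hypothetical, not the actual `te_patches.py` code):

```python
def apply_patch_once(optimizer_cls):
    # Tag the replacement method so calling this twice installs only one wrapper.
    current = optimizer_cls._initialize_state
    if getattr(current, "_automodel_patched", False):
        return

    def _patched(self, *args, **kwargs):
        # The real patch dequantizes QuantizedTensor params before allocating state.
        return current(self, *args, **kwargs)

    _patched._automodel_patched = True
    optimizer_cls._initialize_state = _patched
```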
Test plan
- `uv run pytest tests/unit_tests/shared/test_te_patches.py -vs`: 6/6 passing

🤖 Generated with Claude Code