Add opt-in MXFP8 LM-head output projection by gdengk · Pull Request #4825 · NVIDIA/Megatron-LM

gdengk · 2026-05-15T21:50:19Z

What does this PR do?

Adds an opt-in TE-based LM-head ColumnParallelLinear that runs under the MXFP8 autocast context. Controlled by the new fp8_output_proj config flag; active only when fp8=True and fp8_recipe='mxfp8'. Defaults to off, so existing flows are unaffected.

This is the main-branch equivalent of #4484 and #4489 (merged to 26.04-alpha), ported as a self-contained module instead of layering on the alpha-only LinearCrossEntropyModule wrapper.

Changes

New module megatron/core/transformer/mxfp8_output_proj.py with TELinearCrossEntropyModule and is_te_mxfp8_output_proj_active.
GPTModel.__init__ conditionally swaps the output layer when the gate is on.
New TransformerConfig.fp8_output_proj field (default False).
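
The gate and the conditional swap described above can be sketched as a pure-Python predicate. This is a minimal illustration, not the PR's code: `SketchConfig`, `is_output_proj_active_sketch`, and `pick_output_layer_name` are hypothetical stand-ins, `fp8` is assumed to be "on" for any truthy value, and the recipe is assumed to be comparable as a plain string.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SketchConfig:
    """Hypothetical stand-in for the three config fields the gate reads."""

    fp8: Optional[str] = None         # assumption: any truthy value means FP8 is on
    fp8_recipe: Optional[str] = None  # assumption: recipe compared as a plain string
    fp8_output_proj: bool = False     # opt-in flag, off by default


def is_output_proj_active_sketch(cfg: SketchConfig) -> bool:
    """Return True only when the flag, FP8, and the mxfp8 recipe are all set."""
    return bool(cfg.fp8_output_proj) and bool(cfg.fp8) and cfg.fp8_recipe == "mxfp8"


def pick_output_layer_name(cfg: SketchConfig) -> str:
    """Mimic the conditional swap by naming the class that would be built."""
    if is_output_proj_active_sketch(cfg):
        return "TELinearCrossEntropyModule"
    return "ColumnParallelLinear"
```

With the default config the sketch selects `ColumnParallelLinear`, which matches the claim that existing flows are unaffected unless the flag is switched on.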

Tests

tests/unit_tests/transformer/test_mxfp8_output_proj.py:

Gating helper across config combinations (pure-Python).
Constructor validations for unsupported options.
GPTModel default uses ColumnParallelLinear.
GPTModel uses TELinearCrossEntropyModule under mxfp8 (Blackwell-gated).

Contribution process

Pre-checks

I have added relevant unit tests
I have added relevant functional tests
I have added proper typing to my code
I have added relevant documentation
I have run the autoformatter.sh on my PR

Covers:

* is_te_mxfp8_output_proj_active branches (pure-Python, no GPU)
* TELinearCrossEntropyModule constructor validations (early raises, no GPU init required since each fires before super().__init__())
* GPTModel default uses ColumnParallelLinear
* GPTModel uses TELinearCrossEntropyModule when fp8_output_proj is enabled under the mxfp8 recipe (Blackwell-only, skipped otherwise)
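
The point about constructor validations firing before super().__init__() can be illustrated with a small sketch. Everything here is hypothetical: `ExpensiveBase` stands in for a base class whose constructor would need a GPU, and `unsupported_option` is a placeholder, since the thread does not list the options the real module rejects.

```python
class ExpensiveBase:
    """Hypothetical base whose constructor stands in for GPU-side setup."""

    init_calls = 0  # counts how often the costly constructor actually ran

    def __init__(self, in_features: int, out_features: int) -> None:
        ExpensiveBase.init_calls += 1
        self.in_features = in_features
        self.out_features = out_features


class ValidatingHead(ExpensiveBase):
    """Hypothetical subclass that rejects unsupported options up front."""

    def __init__(
        self, in_features: int, out_features: int, *, unsupported_option: bool = False
    ) -> None:
        # Validate first: the base constructor never runs if this raises.
        if unsupported_option:
            raise ValueError("unsupported_option=True is not supported by this sketch")
        super().__init__(in_features, out_features)


def rejects_without_init() -> bool:
    """Return True if a bad argument raises before the base constructor runs."""
    before = ExpensiveBase.init_calls
    try:
        ValidatingHead(8, 16, unsupported_option=True)
    except ValueError:
        return ExpensiveBase.init_calls == before
    return False
```

Because the check precedes the base constructor, a test for the rejection path can run on a machine with no accelerator.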

Reject fp8_output_proj=True when fp8 is off or fp8_recipe is not 'mxfp8' at config-construction time, so misconfiguration fails fast instead of only being caught by the runtime RuntimeError in TELinearCrossEntropyModule. The runtime check is retained as defense-in-depth against later mutation.
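
A minimal sketch of this two-layer check, assuming a dataclass-style config: `ValidatedConfigSketch` and `build_head_sketch` are hypothetical names, and the exception type raised at construction time (`ValueError` here) is an assumption, since the commit message only names the runtime `RuntimeError`.

```python
from dataclasses import dataclass
from typing import Optional


def _gate_ok(fp8: Optional[str], fp8_recipe: Optional[str]) -> bool:
    """True when FP8 is on and the recipe is mxfp8 (recipe as a plain string)."""
    return bool(fp8) and fp8_recipe == "mxfp8"


@dataclass
class ValidatedConfigSketch:
    """Hypothetical config that fails fast on an inconsistent flag."""

    fp8: Optional[str] = None
    fp8_recipe: Optional[str] = None
    fp8_output_proj: bool = False

    def __post_init__(self) -> None:
        # Construction-time check: misconfiguration is reported immediately.
        if self.fp8_output_proj and not _gate_ok(self.fp8, self.fp8_recipe):
            raise ValueError("fp8_output_proj requires fp8 to be on with fp8_recipe='mxfp8'")


def build_head_sketch(cfg: ValidatedConfigSketch) -> str:
    """Hypothetical builder that re-checks the gate at use time."""
    # Defense in depth: fields may have been mutated after construction.
    if cfg.fp8_output_proj and not _gate_ok(cfg.fp8, cfg.fp8_recipe):
        raise RuntimeError("fp8_output_proj is set but the mxfp8 recipe is not active")
    return "mxfp8-head" if cfg.fp8_output_proj else "default-head"
```

The second check matters because `__post_init__` runs only once; assigning `cfg.fp8_recipe = "delayed"` afterwards bypasses it, and only the runtime check catches that.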

copy-pr-bot · 2026-05-15T21:50:22Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

github-actions · 2026-05-15T21:50:27Z

This PR has been automatically converted to draft because all PRs must start as drafts.

When you are ready for review, click Ready for Review to begin the review process. This will:

Add the oncall reviewer (optional reviewer)
Add required review teams based on your changes

See the contribution guide for more details.

dingqingy-nv · 2026-05-24T07:14:45Z

/claude review

claude

LGTM

Phlip79 · 2026-05-24T08:47:54Z

/ok to test c292e98

Phlip79 · 2026-05-24T08:50:04Z

+        # so GPTModel.sharded_state_dict's no-extra-state invariant still holds.
+        return None
+
+    def set_extra_state(self, state):


Add a warning here in case someone attempts to call set_extra_state?
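
One way the suggested warning could look, as a sketch only: this assumes the `return None` line in the diff belongs to `get_extra_state`, and the thread does not say whether or how the PR adopted the suggestion. `NoExtraStateHead` is a hypothetical plain-Python class; only the hook names `get_extra_state` and `set_extra_state` mirror `torch.nn.Module`.

```python
import warnings
from typing import Any


class NoExtraStateHead:
    """Hypothetical module-like class that carries no extra state."""

    def get_extra_state(self) -> None:
        # Report no extra state so nothing extra is written to checkpoints.
        return None

    def set_extra_state(self, state: Any) -> None:
        # Ignore incoming state, but warn if a checkpoint actually carried some.
        if state is not None:
            warnings.warn(
                "NoExtraStateHead ignores extra state found in the checkpoint",
                UserWarning,
                stacklevel=2,
            )
```

Warning only on a non-`None` payload keeps the normal load path silent while still flagging a checkpoint that unexpectedly contains extra state for this layer.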

Phlip79 · 2026-05-24T08:57:27Z

/ok to test 859fd91

Phlip79 · 2026-05-24T09:18:02Z

/ok to test 13976e4

Phlip79 · 2026-05-26T20:55:09Z

LGTM, but is this only applicable for GPTModel or is this also relevant for HybridModel?

I'll add support in a follow-up PR for this. There are a few other features that also need to be added to HybridModel.

gdengk · 2026-05-28T16:45:43Z

/ok to test 5bc6d74

gdengk · 2026-05-28T17:03:21Z

/ok to test 2b624cc

svcnvidia-nemo-ci · 2026-05-28T20:41:25Z

🔄 Merge queue validation started!

You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/26600989893

svcnvidia-nemo-ci · 2026-05-28T21:30:16Z

🔄 Merge queue validation started!

You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/26603411168

a added 3 commits May 15, 2026 14:02

gdengk requested review from a team as code owners May 15, 2026 21:50

svcnvidia-nemo-ci marked this pull request as draft May 15, 2026 21:50

Merge branch 'main' into gaod/main/mxfp8-output-proj

c292e98

gdengk marked this pull request as ready for review May 15, 2026 21:50

svcnvidia-nemo-ci requested a review from a team May 15, 2026 21:51

svcnvidia-nemo-ci added the complexity: medium label May 15, 2026

dingqingy-nv added the 26.06 label May 24, 2026

claude Bot approved these changes May 24, 2026

copy-pr-bot Bot temporarily deployed to public May 24, 2026 08:48 Inactive

Phlip79 approved these changes May 24, 2026

Apply lint formatting

859fd91

copy-pr-bot Bot temporarily deployed to public May 24, 2026 08:58 Inactive

Phlip79 added 2 commits May 24, 2026 09:17

Fix MXFP8 output projection lint

e644545

Merge branch 'main' into gaod/main/mxfp8-output-proj

13976e4

copy-pr-bot Bot temporarily deployed to public May 24, 2026 09:18 Inactive

copy-pr-bot Bot temporarily deployed to test May 24, 2026 09:18 Inactive

copy-pr-bot Bot temporarily deployed to public May 24, 2026 09:21 Inactive

Phlip79 removed the request for review from a team May 26, 2026 20:56

jaredcasper reviewed May 28, 2026

Comment thread megatron/core/transformer/mxfp8_output_proj.py Outdated

Comment thread megatron/core/transformer/mxfp8_output_proj.py Outdated

Comment thread megatron/core/transformer/mxfp8_output_proj.py Outdated

address the comments

5bc6d74

copy-pr-bot Bot temporarily deployed to public May 28, 2026 16:46 Inactive

copy-pr-bot Bot temporarily deployed to public May 28, 2026 16:49 Inactive

copy-pr-bot Bot temporarily deployed to public May 28, 2026 16:56 Inactive

fix lint

2b624cc

copy-pr-bot Bot temporarily deployed to public May 28, 2026 17:03 Inactive

copy-pr-bot Bot temporarily deployed to test May 28, 2026 17:04 Inactive

copy-pr-bot Bot temporarily deployed to public May 28, 2026 17:07 Inactive

copy-pr-bot Bot temporarily deployed to public May 28, 2026 17:15 Inactive

jaredcasper approved these changes May 28, 2026

ericharper approved these changes May 28, 2026

svcnvidia-nemo-ci added the Approved label (all necessary approvals have been made) and removed the Final Review label (PR is in the "final review" stage) May 28, 2026

ericharper added this pull request to the merge queue May 28, 2026

github-merge-queue Bot removed this pull request from the merge queue due to failed status checks May 28, 2026

7 participants