
Feat: MLA eagle#689

Merged
h-guo18 merged 11 commits into main from haoguo/mla-eagle
Dec 19, 2025

Conversation

@h-guo18
Contributor

@h-guo18 h-guo18 commented Dec 15, 2025

What does this PR do?

Type of change: New Feature

Overview:

  • Add MLA EAGLE support
    • Add a new argument, "eagle_decoder_type", to switch between the llama and kimik2 EAGLE decoders;
    • Add patches to dynamically load the kimik2 model implementation;
    • Add a new default config for Kimi K2;
    • Refactor EAGLE export to support multi-layer/multi-type EAGLE export concisely;
      • Rename some modules to simplify the export logic;
  • Other minor improvements.

Usage

# Add a code snippet demonstrating how to use this
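The usage placeholder above was left unfilled. As a purely hypothetical sketch (not code from this PR), selecting the draft decoder via the new "eagle_decoder_type" argument might look like the following; the decoder class names and config keys here are illustrative assumptions, not the PR's actual identifiers:

```python
# Hypothetical sketch, NOT code from this PR: illustrates how the new
# "eagle_decoder_type" argument could select between the llama and kimik2
# (MLA) EAGLE draft decoders. Class names are illustrative placeholders.
EAGLE_DECODERS = {
    "llama": "LlamaEagleDecoderLayer",    # default, llama-style attention
    "kimik2": "KimiK2EagleDecoderLayer",  # MLA attention, Kimi K2 style
}

def resolve_eagle_decoder(eagle_config: dict) -> str:
    """Return the decoder class name for the configured eagle_decoder_type."""
    decoder_type = eagle_config.get("eagle_decoder_type", "llama")
    try:
        return EAGLE_DECODERS[decoder_type]
    except KeyError:
        raise ValueError(
            f"Unknown eagle_decoder_type {decoder_type!r}; "
            f"expected one of {sorted(EAGLE_DECODERS)}"
        ) from None

print(resolve_eagle_decoder({"eagle_decoder_type": "kimik2"}))
```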

Testing

  • Tested that Kimi K2 Thinking works with eagle_decoder_type=kimik2 (screenshot omitted).
  • Tested that Llama 3.2 1B works with eagle_decoder_type=llama (screenshot omitted).

Before your PR is "Ready for review"

  • Make sure you read and follow Contributor guidelines and your commits are signed.
  • Is this change backward compatible?: Yes/No
  • Did you write any new necessary tests?: Yes/No
  • Did you add or update any necessary documentation?: Yes/No
  • Did you update Changelog?: Yes/No

Additional Information

@copy-pr-bot

copy-pr-bot Bot commented Dec 15, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@h-guo18 h-guo18 self-assigned this Dec 15, 2025
@h-guo18 h-guo18 changed the title MLA eagle Feat: MLA eagle Dec 15, 2025
@copy-pr-bot

copy-pr-bot Bot commented Dec 15, 2025

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

Comment thread examples/speculative_decoding/eagle_config.json
Comment thread examples/speculative_decoding/eagle_utils.py
Comment thread modelopt/torch/speculative/plugins/megatron_eagle.py
Comment thread modelopt/torch/speculative/plugins/transformers.py
Comment thread modelopt/torch/speculative/plugins/transformers.py
@yeyu-nvidia
Contributor

We will also need to add eagle_decoder_type support in megatron_eagle.py, as well as export support. This can be left to a follow-up PR.

Comment thread modelopt/torch/speculative/utils.py
Contributor

@benchislett benchislett left a comment


Also, are we loading an MoE layer here? Are we overriding it with MLP somehow?

Comment thread modelopt/torch/export/plugins/hf_spec_export.py
Comment thread modelopt/torch/export/plugins/hf_spec_export.py
@codecov

codecov Bot commented Dec 18, 2025

Codecov Report

❌ Patch coverage is 37.50000% with 25 lines in your changes missing coverage. Please review.
✅ Project coverage is 74.65%. Comparing base (b286165) to head (11187be).
⚠️ Report is 1 commit behind head on main.

Files with missing lines Patch % Lines
modelopt/torch/speculative/utils.py 28.57% 25 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #689      +/-   ##
==========================================
- Coverage   74.73%   74.65%   -0.09%     
==========================================
  Files         192      192              
  Lines       18870    18909      +39     
==========================================
+ Hits        14103    14117      +14     
- Misses       4767     4792      +25     

@yeyu-nvidia
Contributor

Please test the end-to-end parallel-draft pipeline (convert, train, and export) before merging.

@h-guo18 h-guo18 requested a review from yeyu-nvidia December 18, 2025 03:55
Comment thread examples/speculative_decoding/eagle_config.json Outdated
Contributor

@yeyu-nvidia yeyu-nvidia left a comment


Please address the comments

Comment thread examples/speculative_decoding/eagle_config.json Outdated
@yeyu-nvidia yeyu-nvidia dismissed their stale review December 18, 2025 06:47

Discussed offline

@h-guo18 h-guo18 marked this pull request as ready for review December 18, 2025 06:58
@h-guo18 h-guo18 requested review from a team as code owners December 18, 2025 06:58
Comment thread examples/speculative_decoding/eagle_config.json Outdated
@@ -43,3 +44,4 @@ def modify(
self.eagle_report_acc = eagle_report_acc
self.eagle_reuse_base_decoder = eagle_reuse_base_decoder
Contributor


This should be removed since you have eagle_decoder_type now

Contributor Author


Doesn't Megatron still use this argument?

@yeyu-nvidia
Contributor

yeyu-nvidia commented Dec 18, 2025

Training failed on a8. Command to reproduce: bash launch_train.sh --save_steps 20 --data_path /workspace/scratch.yeyu_hw/Daring-Anteater/llama3.2_1B_fp8.jsonl --training_seq_len 512

[rank7]: Traceback (most recent call last):
[rank7]: File "/workspace/scratch.yeyu_hw/TensorRT-Model-Optimizer/examples/speculative_decoding/main.py", line 263, in
[rank7]: train()
[rank7]: File "/workspace/scratch.yeyu_hw/TensorRT-Model-Optimizer/examples/speculative_decoding/main.py", line 257, in train
[rank7]: trainer.train(resume_from_checkpoint=checkpoint)
[rank7]: File "/usr/local/lib/python3.12/dist-packages/transformers/trainer.py", line 2325, in train
[rank7]: return inner_training_loop(
[rank7]: ^^^^^^^^^^^^^^^^^^^^
[rank7]: File "/usr/local/lib/python3.12/dist-packages/transformers/trainer.py", line 2674, in _inner_training_loop
[rank7]: tr_loss_step = self.training_step(model, inputs, num_items_in_batch)
[rank7]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank7]: File "/usr/local/lib/python3.12/dist-packages/transformers/trainer.py", line 4020, in training_step
[rank7]: loss = self.compute_loss(model, inputs, num_items_in_batch=num_items_in_batch)
[rank7]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank7]: File "/workspace/scratch.yeyu_hw/TensorRT-Model-Optimizer/examples/speculative_decoding/eagle_utils.py", line 502, in compute_loss
[rank7]: loss, outputs = super().compute_loss(return_outputs=True, *args, **kwargs)
[rank7]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank7]: File "/usr/local/lib/python3.12/dist-packages/transformers/trainer.py", line 4110, in compute_loss
[rank7]: outputs = model(**inputs)
[rank7]: ^^^^^^^^^^^^^^^
[rank7]: File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
[rank7]: return self._call_impl(*args, **kwargs)
[rank7]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank7]: File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1762, in _call_impl
[rank7]: return forward_call(*args, **kwargs)
[rank7]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank7]: File "/usr/local/lib/python3.12/dist-packages/torch/nn/parallel/distributed.py", line 1633, in forward
[rank7]: inputs, kwargs = self._pre_forward(*inputs, **kwargs)
[rank7]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank7]: File "/usr/local/lib/python3.12/dist-packages/torch/nn/parallel/distributed.py", line 1529, in _pre_forward
[rank7]: self._sync_buffers()
[rank7]: File "/usr/local/lib/python3.12/dist-packages/torch/nn/parallel/distributed.py", line 2166, in _sync_buffers
[rank7]: self._sync_module_buffers(authoritative_rank)
[rank7]: File "/usr/local/lib/python3.12/dist-packages/torch/nn/parallel/distributed.py", line 2170, in _sync_module_buffers
[rank7]: self._default_broadcast_coalesced(authoritative_rank=authoritative_rank)
[rank7]: File "/usr/local/lib/python3.12/dist-packages/torch/nn/parallel/distributed.py", line 2192, in _default_broadcast_coalesced
[rank7]: self._distributed_broadcast_coalesced(bufs, bucket_size, authoritative_rank)
[rank7]: File "/usr/local/lib/python3.12/dist-packages/torch/nn/parallel/distributed.py", line 2107, in _distributed_broadcast_coalesced
[rank7]: dist._broadcast_coalesced(
[rank7]: RuntimeError: No backend type associated with device type cpu
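For context on the error above: DDP's buffer synchronization broadcasts every registered buffer through the process-group backend, and a GPU-only group (NCCL) has no backend registered for CPU tensors, so a single buffer left on CPU is enough to trigger this failure. The usual fix is to move the whole model, buffers included, onto the CUDA device before wrapping it in DDP. The following is a toy, torch-free model of that dispatch, purely illustrative and not PyTorch internals:

```python
# Toy illustration (NOT PyTorch source) of the backend dispatch behind
# "RuntimeError: No backend type associated with device type cpu".
REGISTERED_BACKENDS = {"cuda": "nccl"}  # typical GPU-only training process group

def backend_for(device_type: str) -> str:
    """Look up the collective backend for a tensor's device type."""
    try:
        return REGISTERED_BACKENDS[device_type]
    except KeyError:
        raise RuntimeError(
            f"No backend type associated with device type {device_type}"
        ) from None

def sync_buffers(buffer_device_types):
    # DDP broadcasts every registered buffer; one CPU-resident buffer
    # is enough to make the whole sync fail.
    for dev in buffer_device_types:
        backend_for(dev)

sync_buffers(["cuda", "cuda"])  # fine: all buffers on the GPU
```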

@yeyu-nvidia
Contributor

Please test the end-to-end parallel-draft pipeline (convert, train, and export) before merging.

tested

@h-guo18
Contributor Author

h-guo18 commented Dec 18, 2025

Training failed on a8. Command to reproduce: bash launch_train.sh --save_steps 20 --data_path /workspace/scratch.yeyu_hw/Daring-Anteater/llama3.2_1B_fp8.jsonl --training_seq_len 512

[... full traceback as quoted above ...]
[rank7]: RuntimeError: No backend type associated with device type cpu

fixed

Comment thread examples/speculative_decoding/eagle_config.json Outdated
Contributor

@yeyu-nvidia yeyu-nvidia left a comment


Megatron will need some API refactoring due to this PR. We will need to add MLA to megatron as well.

h-guo18 and others added 11 commits December 18, 2025 22:39
Signed-off-by: h-guo18 <67671475+h-guo18@users.noreply.github.com>
Signed-off-by: h-guo18 <67671475+h-guo18@users.noreply.github.com>
Signed-off-by: h-guo18 <67671475+h-guo18@users.noreply.github.com>
Signed-off-by: h-guo18 <67671475+h-guo18@users.noreply.github.com>
Signed-off-by: h-guo18 <67671475+h-guo18@users.noreply.github.com>
Signed-off-by: h-guo18 <67671475+h-guo18@users.noreply.github.com>
Signed-off-by: h-guo18 <67671475+h-guo18@users.noreply.github.com>
Signed-off-by: h-guo18 <67671475+h-guo18@users.noreply.github.com>
Signed-off-by: yeyu-nvidia <yeyu@nvidia.com>
Signed-off-by: h-guo18 <67671475+h-guo18@users.noreply.github.com>
Signed-off-by: h-guo18 <67671475+h-guo18@users.noreply.github.com>
import torch.distributed
from huggingface_hub import snapshot_download
from torch import nn
from transformers.cache_utils import DynamicCache
Collaborator


These utils require transformers and would be better moved to /plugins. No need to change now; just a remark.

Collaborator

@ChenhanYu ChenhanYu left a comment


I left a short comment regarding the dependency on transformers. This is great work.

@h-guo18 h-guo18 enabled auto-merge (squash) December 18, 2025 23:57
@h-guo18 h-guo18 merged commit bdd10c2 into main Dec 19, 2025
36 checks passed
@h-guo18 h-guo18 deleted the haoguo/mla-eagle branch December 19, 2025 00:11
@h-guo18
Contributor Author

h-guo18 commented Dec 19, 2025

Megatron will need some API refactoring due to this PR. We will need to add MLA to megatron as well.

I think we should only refactor something if a new feature needs it, not merely because of this PR. I would appreciate knowing if there is a better way to implement this feature. Thanks

@yeyu-nvidia
Contributor

yeyu-nvidia commented Dec 19, 2025

I think we should only refactor something if a new feature needs it, not merely because of this PR. I would appreciate knowing if there is a better way to implement this feature. Thanks

Isn't MLA a need for Megatron? This PR disables eagle_reuse_base_decoder for HF and introduces an MLA decoder. Don't we need to refactor to enable eagle_decoder_type for Megatron as well and deprecate eagle_reuse_base_decoder? What is the definition of "need for a new feature" if we claim to support something but only halfway support it?


Labels

None yet

Projects

None yet

4 participants