Add speculative decoding support with MTP layers #3594
Merged
santhnm2 merged 114 commits into NVIDIA:main on Mar 11, 2026
Conversation
Signed-off-by: Keshav Santhanam <ksanthanam@nvidia.com>
What does this PR do?
This PR adds speculative decoding support for inference.
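At a high level, one generation step feeds the last sampled token plus K draft tokens through the model in a single pass, then greedily verifies the drafts against the base model's own predictions (the phases are detailed under "How it works" below). A toy sketch of that loop, with hypothetical names and a stand-in model that just returns a next-token prediction per input position:

```python
def speculative_step(model, tokens, draft_tokens):
    """One speculative decoding step (toy sketch): feed the last sampled
    token plus K draft tokens, verify greedily, return accepted tokens
    plus the base model's bonus token."""
    inputs = [tokens[-1]] + draft_tokens       # 1 + K tokens in one pass
    preds = model(tokens[:-1] + inputs)        # single forward pass
    preds = preds[-len(inputs):]               # predictions at the 1+K new positions

    accepted = []
    for k, draft in enumerate(draft_tokens):
        if preds[k] == draft:                  # draft matches base model output
            accepted.append(draft)
        else:
            break                              # acceptance is consecutive only
    bonus = preds[len(accepted)]               # base model's own next token
    return accepted + [bonus]                  # sequence advances by accepted + 1
```

For example, with a toy model that predicts `token + 1` at every position, drafting `[6, 7, 99]` after token `5` accepts `6` and `7`, rejects `99`, and appends the base model's `8` instead, so the sequence advances by three tokens in one step.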
How it works
Each generation step proceeds in phases:

1. Input construction: For each active decode request, 1 + K tokens are fed into the model: the previously sampled token plus K speculative tokens produced by the MTP heads in the previous step. These are interleaved across requests with matching position IDs. (`dynamic_context.py:update_requests`)
2. Forward pass: The model processes all tokens in one pass. The base decoder produces logits at every position (note: `materialize_only_last_token_logits` must be off). The MTP heads produce K additional sets of logits from their lightweight transformer/Mamba layers, cached in `model._mtp_logits_cache`. These are concatenated into a `[1+K, seq_len, vocab_size]` logit tensor. (`text_generation_controller.py:_dynamic_step_forward_logits`)
3. Sampling and verification: Both base and MTP logits are sampled (grouped into temperature/top_k/top_p buckets for efficiency). A greedy token-matching verification then determines how many speculative tokens to accept: the speculative token at position t+k is accepted iff the base model's output at position t+k-1 equals it. Acceptance is consecutive: once a mismatch occurs, all subsequent speculative tokens for that request are rejected (enforced via `cummin`). (`text_generation_controller.py:_dynamic_step_sample_logits_and_verify_tokens`)
4. KV cache rewind: For rejected tokens, the KV cache is rolled back: block offsets are decremented, and if the rewind crosses a block boundary, the block is released back to the allocator. For Mamba/hybrid models, SSM recurrent state is restored from intermediate snapshots captured during the Triton kernel execution. (`text_generation_controller.py:_rewind_kv_cache`)
5. Bookkeeping: Sequence lengths advance by `accepted_count + 1` (not just 1). Finish conditions (EOS, max length) are checked, and the accepted and sampled tokens are appended to each request's output. New MTP-sampled tokens are staged for the next step.

MTP head architecture
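The structure of an MTP head can be sketched in miniature before looking at the real layer: combine the previous depth's hidden states with shifted input embeddings through a projection, then run a small block. All names below are hypothetical stand-ins; the actual `MultiTokenPredictionLayer` uses full transformer/Mamba blocks and a learned `eh_proj`:

```python
class MTPLayerSketch:
    """Toy MTP layer: merge previous-depth hidden states with shifted
    input embeddings, then apply a per-position block."""
    def __init__(self, proj, block):
        self.proj = proj      # stand-in for the learned eh_proj projection
        self.block = block    # stand-in for the transformer/Mamba block

    def __call__(self, hidden, embeddings):
        # Shift embeddings left by one: depth d predicts position t + d + 1
        # (last position reuses the final embedding in this toy version).
        shifted = embeddings[1:] + embeddings[-1:]
        # Combine hidden state and shifted embedding, project, run the block.
        combined = [self.proj(h + e) for h, e in zip(hidden, shifted)]
        return [self.block(c) for c in combined]
```

Stacking K such layers, each consuming the previous layer's output, yields the depth-wise repetition described below; sharing one instance across depths corresponds to weight sharing.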
Each `MultiTokenPredictionLayer` (`multi_token_prediction.py`) takes the hidden states from the previous depth, concatenates them with shifted input embeddings via a learned projection (`eh_proj`), and runs the result through a transformer block (or Mamba stack). This is repeated K times to produce predictions at positions t+1 through t+K. An `mtp_use_repeated_layer` option shares weights across all K layers.

Mamba/hybrid SSM support
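The snapshot-and-rewind idea used for recurrent state in this section can be illustrated with a toy sketch. The names are hypothetical; in the real code the snapshots are written by the Triton kernel in `mamba_ssm.py` into a pre-allocated buffer:

```python
class SnapshotBuffer:
    """Toy stand-in for the pre-allocated per-layer snapshot buffer:
    one slot for the recurrent state after each of the 1 + K
    speculative positions."""
    def __init__(self, k):
        self.snapshots = [None] * (k + 1)   # one slot per speculative step

    def record(self, step, state):
        # Called once per speculative position during the forward pass,
        # so no extra work is needed at rewind time.
        self.snapshots[step] = state

    def rewind(self, accepted_count):
        # After verification, the correct state is recovered by indexing
        # with the number of accepted tokens; nothing is recomputed.
        return self.snapshots[accepted_count]
```

With K = 2 drafts, recording states after each position and then calling `rewind(1)` returns the state as of one accepted speculative token, which mirrors how the real buffer is indexed by accepted token count.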
SSM models require special handling because they carry recurrent state that must be rollback-able:
- … K slots, so no explicit save/restore is needed.
- The modified Triton kernel (`mamba_ssm.py`) dumps intermediate state snapshots at every speculative step into a pre-allocated `[num_layers, max_requests, K+1, ...]` buffer, with zero extra GPU-CPU sync. On rewind, the correct snapshot is indexed by accepted token count.

Contribution process
```mermaid
flowchart LR
    A[Pre-checks] --> B[PR Tests]
    subgraph Code Review/Approval
        C1[Expert Review] --> C2[Final Review]
    end
    B --> C1
    C2 --> D[Merge]
```

Pre-checks
… Core 0.8)

Code review
The following process is enforced via the CODEOWNERS file for changes into `megatron/core`. For changes outside of `megatron/core`, it is up to the PR author whether or not to tag the Final Reviewer team.

For MRs into `main` branch
Feel free to message or comment @mcore-oncall to help accelerate your merge into main. The less complex your PR is, the faster it will be approved and merged!
(Step 1): Add PR label `Expert Review`
(Step 2): Collect the expert reviewers' reviews
- Add the `Expert Review` label when your PR is ready for review.
Final Review might get declined if these requirements are not fulfilled.
(Step 3): Final Review
- Add the `Final Review` label
(Optional Step 4): Cherry-pick into release branch
If this PR also needs to be merged into `core_r*` release branches, after this PR has been merged, select Cherry-pick to open a new PR into the release branch.

For MRs into `dev` branch
The proposed review process for the `dev` branch is under active discussion. MRs are mergeable after one approval by either eharper@nvidia.com or zijiey@nvidia.com.

Merging your PR
Any member of `core-adlr` and `core-nemo` will be able to merge your PR.