Description
The focus for Megatron Core MoE is to provide comprehensive support for the latest MoE architectures, advanced parallelism strategies, and performance optimizations for Blackwell. This is a tentative roadmap and subject to change.
Megatron MoE Supported Features
Model Support
✅ DeepSeek
  ✅ DeepSeek-V2
  ✅ DeepSeek-V3, including MTP
  ✅ DeepSeek-V3.2
✅ Qwen
  ✅ Qwen2-57B-A14B
  ✅ Qwen3-235B-A22B
  ✅ Qwen3.5
✅ Kimi-K2
✅ Mixtral
🚀 DeepSeek-V4 (ongoing)
Core MoE Functionality
✅ Token-dropless MoE - routing without dropping tokens at capacity limits
✅ Top-K Router with flexible K selection
✅ Auxiliary load-balancing losses for balanced expert utilization (see the routing sketch after this list)
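To make the routing and balancing terms above concrete, here is a minimal PyTorch-style sketch of top-K token routing with a Switch-Transformer-style auxiliary load-balancing loss. The function name and exact loss form are illustrative assumptions, not Megatron Core's actual router implementation.

```python
import torch
import torch.nn.functional as F

def topk_route(logits: torch.Tensor, k: int):
    """Hypothetical helper: top-K token routing plus a Switch-style aux loss.

    logits: [num_tokens, num_experts] raw router scores for each token.
    Returns the chosen expert ids, their routing weights, and the aux loss.
    """
    num_tokens, num_experts = logits.shape
    probs = F.softmax(logits, dim=-1)                 # routing probabilities
    topk_probs, topk_ids = probs.topk(k, dim=-1)      # K experts per token, no token dropped

    # f_i: fraction of tokens dispatched to expert i (hard top-K assignment).
    dispatch = F.one_hot(topk_ids, num_experts).sum(dim=1).float()  # [num_tokens, num_experts]
    tokens_per_expert = dispatch.mean(dim=0)
    # P_i: mean routing probability assigned to expert i (soft assignment).
    mean_probs = probs.mean(dim=0)

    # Aux loss ~ num_experts * sum_i f_i * P_i; it is smallest when load is uniform,
    # so adding it (scaled by a small coefficient) to the LM loss balances experts.
    aux_loss = num_experts * torch.sum(tokens_per_expert * mean_probs)
    return topk_ids, topk_probs, aux_loss

# Example: route 8 tokens across 4 experts with top-2 selection.
ids, weights, aux = topk_route(torch.randn(8, 4), k=2)
```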
Advanced Parallelism
✅ Expert Parallel (EP) with 3D parallelism integration
✅ Full parallelism combo: EP + DP + TP + PP + SP support
✅ Context Parallel (CP) for long sequence MoE training
✅ Parallel Folding - Heterogeneous Parallelism Mappings for Efficient Large-Scale MoE Model Training (see the sizing sketch after this list)
✅ Distributed Optimizer for MoE (ZeRO-1 equivalent)
✅ (🚀 New!) Megatron FSDP / HSDP with full expert parallel support
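As a rough illustration of how these dimensions compose, the sketch below works through group sizes for a hypothetical 1024-GPU job in which attention layers and expert layers use different mappings over the same GPUs (the parallel-folding idea). All numbers and variable names are made up for the example; nothing here calls Megatron Core APIs or flags.

```python
# Illustrative arithmetic only; the names below are not Megatron Core APIs or flags.
world_size = 1024                      # total GPUs in a hypothetical job

# Dense/attention layers: world_size = TP * CP * DP * PP
tp, cp, pp = 2, 1, 8
dp = world_size // (tp * cp * pp)      # -> 64 data-parallel replicas

# Expert (MoE) layers are "folded" onto the same GPUs with their own mapping:
# world_size = ETP * EP * EDP * PP, where EP shards the experts themselves.
etp, ep = 1, 64
edp = world_size // (etp * ep * pp)    # -> 2 expert-data-parallel replicas

assert tp * cp * dp * pp == world_size
assert etp * ep * edp * pp == world_size
print(f"dense mapping : TP={tp} CP={cp} DP={dp} PP={pp}")
print(f"expert mapping: ETP={etp} EP={ep} EDP={edp} PP={pp}")
```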
Optimizations
Memory
Communication
✅ DeepEP support for H100 and B200
✅ (🚀 New!) HybridEP for GB200
✅ 1F1B EP A2A Overlap - Hiding Expert Parallel Communication with 1F1B Pipeline Schedule
✅ DP/PP/TP/EP Communication Overlapping
Computation
✅ Advanced fusions for Router, Permutation, MLA RoPE, FP8 casting, etc.
✅ cuDNN fused attention and FlashAttention integration
✅ GroupedGEMM and Gradient Accumulation Fusion (reference semantics sketched after this list)
✅ Production-ready CUDA Graph support for MoE
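For readers unfamiliar with grouped GEMM, the loop below spells out the math such a kernel computes for MoE experts: one independently sized matmul per expert over contiguous token chunks. The helper name and shapes are assumptions for illustration; a real grouped-GEMM kernel fuses all of these matmuls into a single launch.

```python
import torch

def grouped_gemm_reference(tokens, expert_weights, tokens_per_expert):
    """Reference (loop-based) semantics of a grouped GEMM over expert batches.

    tokens:            [num_tokens, hidden], permuted so tokens routed to the
                       same expert are contiguous.
    expert_weights:    [num_experts, hidden, ffn], one weight matrix per expert.
    tokens_per_expert: per-expert token counts, summing to num_tokens.
    """
    outputs, start = [], 0
    for e, count in enumerate(tokens_per_expert):
        chunk = tokens[start:start + count]          # tokens routed to expert e
        outputs.append(chunk @ expert_weights[e])    # per-expert GEMM
        start += count
    return torch.cat(outputs, dim=0)

# Example: 10 tokens split 4/3/3 across three experts.
out = grouped_gemm_reference(torch.randn(10, 16),
                             torch.randn(3, 16, 32),
                             [4, 3, 3])
```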
Optimizer
✅ Muon and Layer-wise distributed optimizer
Precision Support
✅ GroupedGEMM including FP8/MXFP8 support
✅ FP8 weights with BF16 optimizer states
✅ Full FP8 training support (see the TransformerEngine sketch after this list)
🚀 NVFP4 Training
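For orientation, here is a minimal, generic TransformerEngine FP8 example using delayed scaling. It is standalone TE usage under assumed recipe values and layer sizes, not Megatron Core's internal FP8/MXFP8 integration.

```python
import torch
import transformer_engine.pytorch as te
from transformer_engine.common.recipe import DelayedScaling, Format

# Needs an FP8-capable GPU (Hopper/Blackwell class); recipe values are arbitrary.
fp8_recipe = DelayedScaling(fp8_format=Format.HYBRID,
                            amax_history_len=16,
                            amax_compute_algo="max")

layer = te.Linear(4096, 4096, bias=True,
                  params_dtype=torch.bfloat16).cuda()
inp = torch.randn(128, 4096, device="cuda", dtype=torch.bfloat16)

# The GEMM inside the autocast region runs in FP8; parameters stay in BF16.
with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    out = layer(inp)

out.float().sum().backward()
```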
Developer Experience
✅ [MoE Model Zoo](https://github.com/yanring/Megatron-MoE-ModelZoo) with pre-training best practices
✅ MCore2HF Converter for ecosystem compatibility via megatron-bridge
✅ Distributed Checkpointing Support
✅ Runtime Upcycling Support for efficient model scaling
✅ Layer-wise logging for detailed monitoring
(2026 Q2) Megatron MoE Roadmap
Model Support
DeepSeek-V4
- Functionality
  - Model Architecture
  - Muon support
  - Packed Sequence
  - Long Context
  - Convergence
- Performance Optimization
- E2E Training Recipes
- Long-Term Topics
Qwen3.5
- Functionality
- Performance Optimization
  - Gated Delta Rule (GDN) Optimization
    - Perf Optimization
    - Memory Optimization
- Long Context Training
- Multi-Modal Training for Qwen3.5
General Optimizations
- CUDA Graph
- Long Context & Context Parallel
- Communication Optimization
Ongoing Long-term Features
- E2E performance optimization for DeepSeek-V4, Qwen3.5, and other fine-grained MoEs
- Extreme sparse MoE training exploration
- Migration from GPTModel to HybridModel
- Sync-free and full-iteration CUDA Graph MoE training
- THD and Long Context
- Megatron FSDP performance optimization for MoE training
- MoE training with MegaKernels
- Kernel fusions and optimizations for MoE models from TE (MoE training optimization, TransformerEngine#2438)
(2026 Q1) Highlights
- Performance & Kernel Optimizations
- Long Context & Context Parallel
- Model & Architecture
- Advanced Functionality
- CUDA Graph Enhancements
Call for Community Contributions
- Model implementations - Additional MoE model variants
- Performance testing - Performance tests across different platforms and workloads
- Documentation and tutorials - Best practices and optimization guides
- Bug fixes
This roadmap reflects the collective efforts of NVIDIA and our collaborators.
Credits: MCore MoE Team and @sbhavani
Labels: roadmap, moe, call-for-contribution