Description
The focus for Megatron Core MoE is to provide comprehensive support for the latest MoE architectures, advanced parallelism strategies, and performance optimizations for Blackwell. This is a tentative roadmap and subject to change.
Megatron MoE Supported Features
Model Support
✅ DeepSeek
  ✅ DeepSeek-V2
  ✅ DeepSeek-V3, including MTP
  ✅ DeepSeek-V3.2
✅ Qwen
  ✅ Qwen2-57B-A14B
  ✅ Qwen3-235B-A22B
  ✅ Qwen3.5
✅ Kimi-K2
✅ Mixtral
🚀 DeepSeek-V4 (ongoing)
Core MoE Functionality
✅ Token-dropless MoE - routing without dropping tokens at capacity limits
✅ Top-K Router with flexible K selection
✅ Auxiliary load-balancing losses for balanced expert utilization (see the routing sketch after this list)
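To make the routing and balancing terms above concrete, here is a minimal PyTorch-style sketch of top-K token routing with a Switch-Transformer-style auxiliary load-balancing loss. The function name and exact loss form are illustrative assumptions, not Megatron Core's actual router implementation.

```python
import torch
import torch.nn.functional as F

def topk_route(logits: torch.Tensor, k: int):
    """Hypothetical helper: top-K token routing plus a Switch-style aux loss.

    logits: [num_tokens, num_experts] raw router scores for each token.
    Returns the chosen expert ids, their routing weights, and the aux loss.
    """
    num_tokens, num_experts = logits.shape
    probs = F.softmax(logits, dim=-1)                 # routing probabilities
    topk_probs, topk_ids = probs.topk(k, dim=-1)      # K experts per token, no token dropped

    # f_i: fraction of tokens dispatched to expert i (hard top-K assignment).
    dispatch = F.one_hot(topk_ids, num_experts).sum(dim=1).float()  # [num_tokens, num_experts]
    tokens_per_expert = dispatch.mean(dim=0)
    # P_i: mean routing probability assigned to expert i (soft assignment).
    mean_probs = probs.mean(dim=0)

    # Aux loss ~ num_experts * sum_i f_i * P_i; it is smallest when load is uniform,
    # so adding it (scaled by a small coefficient) to the LM loss balances experts.
    aux_loss = num_experts * torch.sum(tokens_per_expert * mean_probs)
    return topk_ids, topk_probs, aux_loss

# Example: route 8 tokens across 4 experts with top-2 selection.
ids, weights, aux = topk_route(torch.randn(8, 4), k=2)
```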
Advanced Parallelism
✅ Expert Parallel (EP) with 3D parallelism integration
✅ Full parallelism combo: EP + DP + TP + PP + SP support
✅ Context Parallel (CP) for long sequence MoE training
✅ Parallel Folding - Heterogeneous Parallelism Mappings for Efficient Large-Scale MoE Model Training (see the sizing sketch after this list)
✅ Distributed Optimizer for MoE (ZeRO-1 equivalent)
✅ (🚀 New!) Megatron FSDP / HSDP with full expert parallel support
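As a rough illustration of how these dimensions compose, the sketch below works through group sizes for a hypothetical 1024-GPU job in which attention layers and expert layers use different mappings over the same GPUs (the parallel-folding idea). All numbers and variable names are made up for the example; nothing here calls Megatron Core APIs or flags.

```python
# Illustrative arithmetic only; the names below are not Megatron Core APIs or flags.
world_size = 1024                      # total GPUs in a hypothetical job

# Dense/attention layers: world_size = TP * CP * DP * PP
tp, cp, pp = 2, 1, 8
dp = world_size // (tp * cp * pp)      # -> 64 data-parallel replicas

# Expert (MoE) layers are "folded" onto the same GPUs with their own mapping:
# world_size = ETP * EP * EDP * PP, where EP shards the experts themselves.
etp, ep = 1, 64
edp = world_size // (etp * ep * pp)    # -> 2 expert-data-parallel replicas

assert tp * cp * dp * pp == world_size
assert etp * ep * edp * pp == world_size
print(f"dense mapping : TP={tp} CP={cp} DP={dp} PP={pp}")
print(f"expert mapping: ETP={etp} EP={ep} EDP={edp} PP={pp}")
```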
Optimizations
Memory
Communication
✅ DeepEP support for H100 and B200
✅ (🚀 New!) HybridEP for GB200
✅ 1F1B EP A2A Overlap - Hiding Expert Parallel Communication with 1F1B Pipeline Schedule
✅ DP/PP/TP/EP Communication Overlapping
Computation
✅ Advanced fusions for Router, Permutation, MLA RoPE, FP8 casting, etc.
✅ cuDNN fused attention and FlashAttention integration
✅ GroupedGEMM and Gradient Accumulation Fusion (reference semantics sketched after this list)
✅ Production-ready CUDA Graph support for MoE
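For readers unfamiliar with grouped GEMM, the loop below spells out the math such a kernel computes for MoE experts: one independently sized matmul per expert over contiguous token chunks. The helper name and shapes are assumptions for illustration; a real grouped-GEMM kernel fuses all of these matmuls into a single launch.

```python
import torch

def grouped_gemm_reference(tokens, expert_weights, tokens_per_expert):
    """Reference (loop-based) semantics of a grouped GEMM over expert batches.

    tokens:            [num_tokens, hidden], permuted so tokens routed to the
                       same expert are contiguous.
    expert_weights:    [num_experts, hidden, ffn], one weight matrix per expert.
    tokens_per_expert: per-expert token counts, summing to num_tokens.
    """
    outputs, start = [], 0
    for e, count in enumerate(tokens_per_expert):
        chunk = tokens[start:start + count]          # tokens routed to expert e
        outputs.append(chunk @ expert_weights[e])    # per-expert GEMM
        start += count
    return torch.cat(outputs, dim=0)

# Example: 10 tokens split 4/3/3 across three experts.
out = grouped_gemm_reference(torch.randn(10, 16),
                             torch.randn(3, 16, 32),
                             [4, 3, 3])
```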
Optimizer
✅ Muon and Layer-wise distributed optimizer
Precision Support
✅ GroupedGEMM including FP8/MXFP8 support
✅ FP8 weights with BF16 optimizer states
✅ Full FP8 training support (see the TransformerEngine sketch after this list)
🚀 NVFP4 Training
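For orientation, here is a minimal, generic TransformerEngine FP8 example using delayed scaling. It is standalone TE usage under assumed recipe values and layer sizes, not Megatron Core's internal FP8/MXFP8 integration.

```python
import torch
import transformer_engine.pytorch as te
from transformer_engine.common.recipe import DelayedScaling, Format

# Needs an FP8-capable GPU (Hopper/Blackwell class); recipe values are arbitrary.
fp8_recipe = DelayedScaling(fp8_format=Format.HYBRID,
                            amax_history_len=16,
                            amax_compute_algo="max")

layer = te.Linear(4096, 4096, bias=True,
                  params_dtype=torch.bfloat16).cuda()
inp = torch.randn(128, 4096, device="cuda", dtype=torch.bfloat16)

# The GEMM inside the autocast region runs in FP8; parameters stay in BF16.
with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    out = layer(inp)

out.float().sum().backward()
```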
Developer Experience
✅ [MoE Model Zoo](https://github.com/yanring/Megatron-MoE-ModelZoo) with pre-training best practices
✅ MCore2HF Converter for ecosystem compatibility via megatron-bridge
✅ Distributed Checkpointing Support
✅ Runtime Upcycling Support for efficient model scaling
✅ Layer-wise logging for detailed monitoring
(2026 Q2) Megatron MoE Roadmap
Model Support
DeepSeek-V4
- Functionality
  - Model Architecture
  - Muon support
  - Packed Sequence
  - Long Context
  - Convergence
- Performance Optimization
- E2E Training Recipes
- Long-Term Topics
Qwen3.5
- Functionality
- Performance Optimization
  - Gated Delta Rule (GDN) Optimization
    - Perf Optimization
    - Memory Optimization
- Long Context Training
- Multi-Modal Training for Qwen3.5
General Optimizations
- CUDA Graph
- Long Context & Context Parallel
- Communication Optimization
Ongoing Long-term Features
- E2E performance optimization for DeepSeek-V4, Qwen3.5, and other fine-grained MoEs
- Extreme sparse MoE training exploration
- Migration from GPTModel to HybridModel
- Sync-free and full-iteration CUDA Graph MoE training
- THD and Long Context
- Megatron FSDP performance optimization for MoE training
- MoE training with MegaKernels
- Kernel fusions and optimizations for MoE models from TE (MoE training optimization, TransformerEngine#2438)
(2026 Q1) Highlights
- Performance & Kernel Optimizations
- Long Context & Context Parallel
- Model & Architecture
- Advanced Functionality
- CUDA Graph Enhancements
Call for Community Contributions
- Model implementations - Additional MoE model variants
- Performance testing - Performance tests across different platforms and workloads
- Documentation and tutorials - Best practices and optimization guides
- Bug fixes
This roadmap reflects the collective efforts of NVIDIA and our collaborators.
Credits: MCore MoE Team and @sbhavani
Labels: roadmap, moe, call-for-contribution