
[ROADMAP][2026 Q2] Megatron Core MoE Roadmap #4815

@Victarry

Description

The focus for Megatron Core MoE is comprehensive support for the latest MoE architectures, advanced parallelism strategies, and performance optimizations for Blackwell. This roadmap is tentative and subject to change.


Megatron MoE Supported Features

Model Support

  • DeepSeek
    • ✅ DeepSeek-V2
    • ✅ DeepSeek-V3, including MTP
    • ✅ DeepSeek-V3.2
  • Qwen
    • ✅ Qwen2-57B-A14B
    • ✅ Qwen3-235B-A22B
    • ✅ Qwen3.5
  • Kimi-K2
  • Mixtral
  • 🚀 DeepSeek-V4 (ongoing)

Core MoE Functionality

  • Token-dropless MoE - routing without dropping tokens
  • Top-K Router with flexible K selection
  • Load-balancing losses to keep expert utilization even (see the sketch after this list)
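
A minimal sketch of the two items above in plain PyTorch (all names here are illustrative, not Megatron Core's internal API): the router selects the top-k experts per token and adds a switch-style auxiliary loss that penalizes uneven expert load.

```python
import torch
import torch.nn.functional as F

def topk_router(hidden, gate_weight, k=2, aux_coeff=1e-2):
    """Illustrative top-k MoE router with a load-balancing aux loss.

    hidden:      [num_tokens, hidden_dim]
    gate_weight: [hidden_dim, num_experts]
    """
    logits = hidden @ gate_weight                     # [tokens, experts]
    probs = F.softmax(logits, dim=-1)
    topk_probs, topk_ids = probs.topk(k, dim=-1)      # flexible K selection
    topk_probs = topk_probs / topk_probs.sum(-1, keepdim=True)

    # Aux loss: num_experts * sum_e(fraction of tokens routed to expert e
    #                               * mean router probability of expert e).
    num_experts = probs.size(-1)
    token_frac = F.one_hot(topk_ids, num_experts).float().mean(dim=(0, 1))
    mean_prob = probs.mean(dim=0)
    aux_loss = aux_coeff * num_experts * (token_frac * mean_prob).sum()
    return topk_ids, topk_probs, aux_loss
```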

Advanced Parallelism

  • Expert Parallel (EP) with 3D parallelism integration
  • Full parallelism combo: EP + DP + TP + PP + SP support (see the process-group sketch after this list)
  • Context Parallel (CP) for long sequence MoE training
  • Parallel Folding - Heterogeneous Parallelism Mappings for Efficient Large-Scale MoE Model Training
  • Distributed Optimizer for MoE (ZeRO-1 equivalent)
  • (🚀 New!) Megatron FSDP / HSDP with full expert parallel support
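
Megatron Core builds and manages these process groups internally; purely to illustrate how EP composes with DP and TP, here is a standalone sketch using PyTorch's DeviceMesh (the 2x2x2 layout and dimension names are hypothetical):

```python
import torch
import torch.distributed as dist
from torch.distributed.device_mesh import init_device_mesh

# Launch with: torchrun --nproc_per_node=8 this_script.py
dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

# Factor 8 GPUs into DP x EP x TP = 2 x 2 x 2. Expert weights are sharded
# along "ep"; attention and dense layers are replicated across it.
mesh = init_device_mesh("cuda", (2, 2, 2), mesh_dim_names=("dp", "ep", "tp"))

ep_group = mesh["ep"].get_group()  # all-to-all token dispatch runs here
tp_group = mesh["tp"].get_group()  # tensor-parallel collectives run here
dp_group = mesh["dp"].get_group()  # gradient all-reduce / ZeRO-1 runs here
```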

Optimizations

  • Memory
  • Communication
    • DeepEP support for H100 and B200
    • (🚀 New!) HybridEP for GB200
    • 1F1B EP A2A Overlap - Hiding Expert Parallel Communication with the 1F1B Pipeline Schedule (the all-to-all dispatch itself is sketched after this list)
    • DP/PP/TP/EP Communication Overlapping
  • Computation
    • Advanced fusions for Router, Permutation, MLA RoPE, FP8 casting, etc.
    • cuDNN fused Attention and FlashAttn integration
    • GroupedGEMM and Gradient Accumulation Fusion
    • Production-ready CUDA Graph support for MoE
  • Optimizer
    • Muon and Layer-wise distributed optimizer
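
To make the EP communication pattern concrete, a minimal sketch of all-to-all token dispatch, which is what DeepEP/HybridEP implement with optimized kernels (this version uses plain torch.distributed and hypothetical shapes, not Megatron's dispatcher API):

```python
import torch
import torch.distributed as dist

def dispatch_tokens(tokens, send_counts, ep_group):
    """Route each token to the EP rank hosting its chosen expert.

    tokens:      [num_local_tokens, hidden], pre-sorted by destination rank
    send_counts: list[int] of length ep_size
    """
    send_t = torch.tensor(send_counts, device=tokens.device)
    recv_t = torch.empty_like(send_t)
    # First A2A exchanges per-rank token counts (dispatch is variable-length).
    dist.all_to_all_single(recv_t, send_t, group=ep_group)
    recv_counts = recv_t.tolist()

    # Second A2A moves the token activations themselves.
    output = tokens.new_empty(sum(recv_counts), tokens.size(1))
    dist.all_to_all_single(
        output, tokens,
        output_split_sizes=recv_counts,
        input_split_sizes=send_counts,
        group=ep_group,
    )
    return output, recv_counts

# After local expert computation, the mirror-image all-to-all ("combine")
# returns results to the originating ranks; 1F1B A2A overlap hides both
# exchanges behind another micro-batch's compute.
```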

Precision Support

  • GroupedGEMM including FP8/MXFP8 support
  • FP8 weights with BF16 optimizer states
  • Full FP8 training support (see the sketch after this list)
  • 🚀 NVFP4 Training
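
As an illustration of what FP8 training looks like at the call site, a minimal sketch using Transformer Engine's fp8_autocast (the recipe values are placeholders; Megatron Core wires this up through its own configuration):

```python
import torch
import transformer_engine.pytorch as te
from transformer_engine.common.recipe import DelayedScaling, Format

# Hybrid format: E4M3 for forward tensors, E5M2 for gradients.
fp8_recipe = DelayedScaling(fp8_format=Format.HYBRID, amax_history_len=16)

layer = te.Linear(4096, 4096, bias=True).cuda()
x = torch.randn(8, 4096, device="cuda", requires_grad=True)

with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    y = layer(x)   # GEMM runs in FP8; master weights stay in higher precision
y.sum().backward()
```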

Developer Experience

  • ✅ [MoE Model Zoo](https://github.com/yanring/Megatron-MoE-ModelZoo) with pre-training best practices
  • MCore2HF converter in megatron-bridge for ecosystem compatibility
  • Distributed Checkpointing Support
  • Runtime Upcycling Support for efficient model scaling
  • Layer-wise logging for detailed monitoring

(2026 Q2) Megatron MoE Roadmap

Model Support

DeepSeek-V4

Functionality

Performance Optimization

E2E Training Recipes

  • 4K Dense Training
  • 16K Dense Training
  • 64K Sparse Training
  • 1M Sparse Training

Long-Term Topics

  • Anticipatory Routing
  • Flexible Activation Checkpointing
  • MegaMoE

Qwen3.5

Functionality

  • Qwen3-Next / Qwen3.5 functionality support
  • Packed sequence support (see the sketch after this list)
  • HF ↔ MCore checkpoint conversion
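
Packed (THD) batches are typically described by cumulative sequence lengths rather than padding; a minimal sketch of building them (the function name is illustrative):

```python
import torch

def pack_sequences(seqs):
    """Concatenate variable-length sequences into one THD batch.

    seqs: list of [seq_len_i, hidden] tensors.
    Returns the packed [total_tokens, hidden] tensor plus cu_seqlens,
    which attention kernels use to keep each segment's attention separate.
    """
    lengths = torch.tensor([s.size(0) for s in seqs], dtype=torch.int32)
    cu_seqlens = torch.zeros(len(seqs) + 1, dtype=torch.int32)
    cu_seqlens[1:] = torch.cumsum(lengths, dim=0)
    packed = torch.cat(seqs, dim=0)  # no padding tokens anywhere
    return packed, cu_seqlens
```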

Performance Optimization

  • Gated Delta Rule (GDN) Optimization (a reference recurrence is sketched after this list)
    • GDN kernel fusion on Blackwell (in progress)
    • GDN FlashQLA integration
    • Pre-gated delta rule kernel fusion (conv1d + memory ops)
  • Other Perf Optimizations
    • Full-iteration CUDA Graph capture to reduce the CPU overhead of MoE
    • Gated RMSNorm implementation in TE
  • Memory Optimization
    • GDN fine-grained activation offloading (P0.5)
    • GDN selective recompute (in_proj, conv1d, gated_delta_rule) (P1)
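
For reference, a naive sketch of the recurrence those GDN kernels fuse, following the Gated DeltaNet formulation with a scalar forget gate alpha_t and write strength beta_t (the exact gating placement is an assumption; the fused kernels compute this in chunked, parallel form):

```python
import torch

def gated_delta_rule(q, k, v, alpha, beta):
    """Naive reference recurrence for the gated delta rule.

    q, k: [seq, d_k]; v: [seq, d_v]; alpha, beta: [seq], gates in (0, 1).
    State update: S_t = alpha_t * (I - beta_t * k_t k_t^T) S_{t-1}
                        + beta_t * k_t v_t^T
    """
    d_k, d_v = k.size(-1), v.size(-1)
    S = k.new_zeros(d_k, d_v)
    outs = []
    for t in range(k.size(0)):
        kt, vt = k[t], v[t]
        # Delta-rule "erase then write", decayed by the forget gate.
        S = alpha[t] * (S - beta[t] * torch.outer(kt, kt @ S)) \
            + beta[t] * torch.outer(kt, vt)
        outs.append(q[t] @ S)   # o_t = S_t^T q_t
    return torch.stack(outs)
```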

Long Context Training

  • GDN context parallel in A2A/Ulysses mode (see the sketch after this list)
  • GDN context parallel in AllGather mode - enables Qwen3.5-397B with >64K context (FLA PR, MCore POC PR) (in progress)
  • Dynamic CP support for Multi-modal Training
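
A minimal sketch of the Ulysses-style all-to-all behind A2A-mode context parallelism: each CP rank trades its sequence shard for a head shard before attention and reverses the exchange afterwards (shapes and the helper name are illustrative):

```python
import torch
import torch.distributed as dist

def seq_to_head_shard(x, cp_group):
    """All-to-all: [seq/P, heads, d] -> [seq, heads/P, d].

    Before: each CP rank holds every head for a slice of the sequence.
    After:  each CP rank holds a slice of heads for the full sequence,
            so attention can span the complete (long) context.
    """
    p = dist.get_world_size(cp_group)
    s, h, d = x.shape
    assert h % p == 0
    # Chunk the head dim p ways; A2A swaps head chunks across ranks.
    x = x.reshape(s, p, h // p, d).permute(1, 0, 2, 3).contiguous()
    out = torch.empty_like(x)
    dist.all_to_all_single(out, x, group=cp_group)
    return out.reshape(p * s, h // p, d)
```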

Multi-Modal Training for Qwen3.5

  • Out-of-the-box E2E FSDP example for THD (packed-sequence) training with Qwen3.5
  • Dynamic CP Support
  • Heterogeneous Parallelism for Encoder-Decoder

General Optimizations

CUDA Graph
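
For context, the standard PyTorch capture/replay pattern that full-iteration CUDA Graph work builds on (static buffers, warm-up, capture, replay; the model here is a stand-in):

```python
import torch

model = torch.nn.Linear(1024, 1024).cuda()
static_x = torch.randn(8, 1024, device="cuda")

# Warm up on a side stream so capture sees steady-state allocations.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        static_y = model(static_x)
torch.cuda.current_stream().wait_stream(s)

# Capture one step into a graph.
g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    static_y = model(static_x)

# Replay: refill the static input, then launch the entire captured graph
# with a single CPU call, eliminating per-kernel launch overhead.
static_x.copy_(torch.randn(8, 1024, device="cuda"))
g.replay()
```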

Long Context & Context Parallel

  • Dynamic CP Support
  • Dynamic CP + CUDA Graph

Communication Optimization

Ongoing Long-term Features

(2026 Q1) Highlights

Performance & Kernel Optimizations

Long Context & Context Parallel

Model & Architecture

Advanced Functionality

CUDA Graph Enhancements


Call for Community Contributions

  • Model implementations - Additional MoE model variants
  • Performance testing - Benchmarks across different platforms and workloads
  • Documentation and tutorials - Best practices and optimization guides
  • Bug fixes

This roadmap reflects the collective efforts of NVIDIA and our collaborators.

Credits: MCore MoE Team and @sbhavani

Labels: roadmap, moe, call-for-contribution
