Changelog Details
Model Collection Support
LLM / VLM
- Qwen3.5 text bridges (dense + MoE) (PR#3769, community @HowardZorn)
- DeepSeek V4 bridge and DeepSeek-V4-Flash pretraining recipes (PR#3562, PR#3893)
- Ernie 4.5 text-only MoE and VL bridges (PR#3263, community @bo-ke)
- GLM-5 / GLM-5.1 (MoE + MLA + DSA) bridge and provider (PR#2913, PR#3635)
- GLM-4.7 / GLM-4.7-Flash support (PR#2983)
- StepFun Step-3.5-Flash (PR#3525) and Step-3.7-Flash (PR#4043)
- MiMo-V2-Flash support (PR#3163, community @beccohov)
- Gemma 4 MoE (26B-A4B) and dense (31B) models, LLM + VLM (PR#3148, PR#3885, community @pavelgein)
- Falcon H1 hybrid Transformer + Mamba support (PR#1462, community @dhiaEddineRhaiem)
- Ling MoE V2 support (PR#2028, community @ccclyu)
Multimodal
- Nemotron-3 Nano Omni support, including model, recipe, and examples (PR#3760)
- Qwen3-Omni-MoE training support (PR#3317, community @hbhflw2000)
- Qwen3-ASR support (PR#2836, PR#3273)
- Nemotron Diffusion (Nemotron-Labs-Diffusion) model support (PR#3105)
Training & Functionality
- MegatronMIMO (Multimodel-In-Multimodel-Out) is a new feature for training multimodal models with heterogeneous parallelism (e.g. different model parallelism for the image encoder and the text decoder). NeMo 26.06 supports non-colocated training, where the encoder and decoder are placed on different ranks (PR#2004, PR#2007, PR#2869, PR#2870), and MegatronMIMO model conversion (PR#3905), with a focus on dense models. Colocated training (encoder and decoder on the same rank) and MoE models will be supported in the next release.
- Energon v7 support, including metadata and stateless cookers (PR#4090)
- Energon updates for video and multi-image (PR#3691)
- Eval-time context parallelism via decentralized process-group rebinding (PR#3755)
- Deterministic training support for performance recipes (PR#3543)
- Evaluator backend integration (SFT + inference + evaluation, demonstrated on GPT-OSS) (PR#2990)
- LoRA support for non-shared expert adapters (PR#3408)
- Configurable async checkpoint strategy (PR#3153); MSC support for FSDP DTensors (PR#3300)
- Fast dataloading configs and documentation (PR#3351)
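As a rough illustration of the non-colocated MegatronMIMO layout described in the list above, the sketch below splits a set of ranks into disjoint encoder and decoder sets, each with its own tensor-parallel size. `ModuleLayout` and `non_colocated_layout` are hypothetical names made up for this example; they are not part of the Megatron-Bridge or MegatronMIMO API, and the real feature may assign ranks differently.

```python
from dataclasses import dataclass


@dataclass
class ModuleLayout:
    """Hypothetical placement record for one module (not a Megatron-Bridge type)."""

    name: str
    ranks: list[int]
    tp_size: int

    def tp_groups(self) -> list[list[int]]:
        # Consecutive ranks form one tensor-parallel group.
        return [
            self.ranks[i:i + self.tp_size]
            for i in range(0, len(self.ranks), self.tp_size)
        ]


def non_colocated_layout(world_size, encoder_ranks, encoder_tp, decoder_tp):
    """Split ranks [0, world_size) into disjoint encoder and decoder sets."""
    decoder_ranks = world_size - encoder_ranks
    if encoder_ranks <= 0 or decoder_ranks <= 0:
        raise ValueError("both modules need at least one rank")
    if encoder_ranks % encoder_tp or decoder_ranks % decoder_tp:
        raise ValueError("rank count must be divisible by the TP size")
    encoder = ModuleLayout("encoder", list(range(encoder_ranks)), encoder_tp)
    decoder = ModuleLayout(
        "decoder", list(range(encoder_ranks, world_size)), decoder_tp
    )
    return encoder, decoder


# Assumed example: 8 ranks, 2 for the image encoder (TP=1), 6 for the decoder (TP=2).
encoder, decoder = non_colocated_layout(
    world_size=8, encoder_ranks=2, encoder_tp=1, decoder_tp=2
)
print(encoder.tp_groups())  # [[0], [1]]
print(decoder.tp_groups())  # [[2, 3], [4, 5], [6, 7]]
```

In a colocated layout, by contrast, the two rank sets would overlap, so the disjointness shown here is what distinguishes the non-colocated mode.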
Low-Precision Bridge & Checkpoint Conversion
- Quantize-then-gather weight export (FP8 / MXFP4) for faster RL trainer→rollout weight sync (PR#2737, community @hy2826)
- DeepSeek V4 quantization-scale emission during HF export (PR#3969)
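To give a feel for why quantizing before the gather can speed up weight sync, the back-of-the-envelope sketch below compares the bytes that would be gathered for one tensor in BF16, FP8, and MXFP4. It assumes MXFP4 stores 4-bit elements with one shared 8-bit scale per block of 32 elements, and it ignores FP8 scale overhead. The helper `gathered_bytes` is hypothetical and does not reflect the actual implementation in PR#2737.

```python
def gathered_bytes(num_params: int, bits_per_elem: int,
                   block_size: int = 0, scale_bits: int = 0) -> int:
    """Bytes moved when gathering a tensor stored at the given precision."""
    payload_bits = num_params * bits_per_elem
    if block_size:
        # One shared scale per block of elements (ceiling division).
        num_blocks = -(-num_params // block_size)
        payload_bits += num_blocks * scale_bits
    return -(-payload_bits // 8)


n = 1024  # illustrative element count
bf16 = gathered_bytes(n, 16)                             # gather-then-quantize baseline
fp8 = gathered_bytes(n, 8)                               # assumption: scales ignored
mxfp4 = gathered_bytes(n, 4, block_size=32, scale_bits=8)

print(bf16, fp8, mxfp4)  # 2048 1024 544
```

Under these assumptions, gathering already-quantized shards moves roughly half (FP8) or about a quarter (MXFP4) of the BF16 volume.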
Performance
- fp4_param_gather enabled in MixedPrecisionConfig (PR#3364)
- Qwen3-Next 80B GB200/GB300 parallel mappings (PR#3168)
- CUDA graph support for Qwen3-VL LLM and vision-encoder submodules (PR#2334); full-iteration CUDA graph for GPT-OSS recipes (PR#4140)
Megatron-LM ↔ Megatron-Bridge Unification
- Megatron Inference integrated into Bridge: MCore Inference Engine examples, model wrappers, pure-LLM inference CLI, and inference_optimized path (PR#3897)
- Tokenizer unification: MCore tokenizer config promoted as the shared surface (Bridge side: PR#3451; MCore side: MCore PR#4406)
- Training-loop upstreaming (in progress): Bridge's config + builder patterns are moving into Megatron-LM, including ConfigContainer (MCore PR#4227), serialization base (MCore PR#4309), Mamba config + builder (MCore PR#4550), GPT config + builder (MCore PR#4741), and supporting utils (MCore PR#4872)
Developer Experience & Compatibility
- RL API refactoring: model creation, config override, training loop, export, and LoRA for RL (PR#3813)
- AGENTS.md and AI-coding-agent skills updated (recipe-recommender, NeMo-RL & verl E2E testing) (PR#3256, PR#3277, PR#3831)
Examples & Tutorials
- MegatronMIMO Qwen3.5-VL non-colocated SFT tutorial + LLaVA tutorial (PR#4239)
- Qwen3-0.6B 128K long-context SFT recipe with YaRN RoPE scaling (PR#3316)
- HuggingFace ↔ Megatron-FSDP weight conversion (PR#3512); online HF load/save for Megatron-FSDP (PR#1910)
ModelOpt
- LoRA × ModelOpt × DeepSeek architecture support (PR#3612)
Community Contributions
A big thank you to our community contributors for their valuable support!
Known issues:
- Moonlight and Nemotron v3 Nano training recipes show performance degradation with TP > 1. As a workaround, set TP=1 and use HybridEP. We have root-caused this regression to the base PyTorch image upgrade from 26.02 to 26.04 and are actively working on a fix.
- Step-3.7-Flash forward-pass outputs have not been fully verified.
- Some examples/ scripts have known minor issues: MiniMax M2 (conversion/export saving), GLM-4.5V (exported tokenizer artifacts), FLUX (tokenizer setup), and WAN (inference setup/dependencies).