Release NVIDIA Megatron-Bridge 0.5.0 · NVIDIA-NeMo/Megatron-Bridge

Changelog Details

LLM / VLM

Qwen3.5 text bridges (dense + MoE) (PR#3769, community @HowardZorn)
DeepSeek V4 bridge and DeepSeek-V4-Flash pretraining recipes (PR#3562, PR#3893)
Ernie 4.5 text-only MoE and VL bridges (PR#3263, community @bo-ke)
GLM-5 / GLM-5.1 (MoE + MLA + DSA) bridge and provider (PR#2913, PR#3635)
GLM-4.7 / GLM-4.7-Flash support (PR#2983)
StepFun Step-3.5-Flash (PR#3525) and Step-3.7-Flash (PR#4043)
MiMo-V2-Flash support (PR#3163, community @beccohov)
Gemma 4 MoE (26B-A4B) and dense (31B) models, LLM + VLM (PR#3148, PR#3885, community @pavelgein)
Falcon H1 hybrid Transformer + Mamba support (PR#1462, community @dhiaEddineRhaiem)
Ling MoE V2 support (PR#2028, community @ccclyu)

Multimodal

MegatronMIMO (Multimodel-In-Multimodel-Out) is a new feature for training multimodal models with heterogeneous parallelism (e.g. different model parallelism for the image encoder and the text decoder). NeMo 26.06 supports non-colocated training, i.e. the encoder and decoder are placed on different ranks (PR#2004, PR#2007, PR#2869, PR#2870), and MegatronMIMO model conversion (PR#3905), with a focus on dense models. Colocated training (i.e. encoder and decoder on the same ranks) and MoE models will be supported in the next release.
Energon v7 support, including metadata and stateless cookers (PR#4090)
Energon updates for video and multi-image (PR#3691)
Eval-time context parallelism via decentralized process-group rebinding (PR#3755)
Deterministic training support for performance recipes (PR#3543)
Evaluator backend integration (SFT + inference + evaluation, demonstrated on GPT-OSS) (PR#2990)
LoRA option to use separate, non-shared adapters per expert (PR#3408)
Configurable async checkpoint strategy (PR#3153); MSC support for FSDP DTensors (PR#3300)
Fast dataloading configs and documentation (PR#3351)
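
The non-colocated layout mentioned in the MegatronMIMO item above (encoder and decoder on different ranks, each with its own model parallelism) can be pictured with a small rank-partitioning sketch. This is only an illustration of the idea, not the Megatron-Bridge API: the function `split_ranks`, its parameters, and the choice to give the encoder the lowest-numbered ranks are all assumptions made for this example.

```python
from typing import Dict, List


def split_ranks(world_size: int, encoder_ranks: int,
                encoder_tp: int, decoder_tp: int) -> Dict[str, List[List[int]]]:
    """Hypothetical helper: partition ranks into disjoint encoder/decoder sets.

    Each set is then chunked into tensor-parallel (TP) groups of its own size,
    which is what "heterogeneous parallelism" means in this sketch.
    """
    if not 0 < encoder_ranks < world_size:
        raise ValueError("encoder_ranks must leave at least one decoder rank")
    decoder_ranks = world_size - encoder_ranks
    if encoder_ranks % encoder_tp or decoder_ranks % decoder_tp:
        raise ValueError("each rank set must be divisible by its TP size")

    # Assumption: the encoder takes the lowest ranks, the decoder the rest.
    enc = list(range(encoder_ranks))
    dec = list(range(encoder_ranks, world_size))

    def tp_groups(ranks: List[int], tp: int) -> List[List[int]]:
        # Consecutive ranks form one TP group.
        return [ranks[i:i + tp] for i in range(0, len(ranks), tp)]

    return {"encoder": tp_groups(enc, encoder_tp),
            "decoder": tp_groups(dec, decoder_tp)}


# 8 ranks: 2 encoder ranks with TP=1, 6 decoder ranks with TP=2.
layout = split_ranks(world_size=8, encoder_ranks=2, encoder_tp=1, decoder_tp=2)
print(layout)
```

In a colocated layout, by contrast, the encoder and decoder groups would be built over the same set of ranks rather than disjoint ones.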

Quantize-then-gather weight export (FP8 / MXFP4) for faster RL trainer→rollout weight sync (PR#2737, community @hy2826)
DeepSeek V4 quantization-scale emission during HF export (PR#3969)
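
As a rough illustration of why quantizing before gathering can speed up trainer→rollout weight sync, the arithmetic below compares the bytes that would have to be gathered for one tensor in BF16, FP8, and MXFP4. It is a back-of-envelope sketch, not the implementation from PR#2737. The helper `gathered_bytes` is hypothetical; the MXFP4 layout (4-bit elements with one 8-bit shared scale per 32-element block) follows the OCP Microscaling format, and the FP8 case assumes a single per-tensor scale whose size is negligible.

```python
def gathered_bytes(num_elements: int, bits_per_element: int,
                   block_size: int = 0, scale_bits: int = 0) -> int:
    """Bytes needed to hold a tensor, including any per-block scale factors."""
    payload_bits = num_elements * bits_per_element
    scale_total_bits = 0
    if block_size:
        num_blocks = -(-num_elements // block_size)  # ceiling division
        scale_total_bits = num_blocks * scale_bits
    return (payload_bits + scale_total_bits + 7) // 8  # round up to whole bytes


n = 1 << 20  # example tensor with 1,048,576 elements

bf16 = gathered_bytes(n, 16)                             # gather, then quantize
fp8 = gathered_bytes(n, 8)                               # per-tensor scale ignored
mxfp4 = gathered_bytes(n, 4, block_size=32, scale_bits=8)

print(f"BF16:  {bf16} bytes")
print(f"FP8:   {fp8} bytes ({bf16 / fp8:.2f}x smaller)")
print(f"MXFP4: {mxfp4} bytes ({bf16 / mxfp4:.2f}x smaller)")
```

Under these assumptions, each rank sends about half (FP8) or a little over a quarter (MXFP4) of the BF16 volume; actual speedups depend on the quantization cost and the interconnect.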

fp4_param_gather enabled in MixedPrecisionConfig (PR#3364)
Qwen3-Next 80B GB200/GB300 parallel mappings (PR#3168)
CUDA graph support for Qwen3-VL LLM and vision-encoder submodules (PR#2334); full-iteration CUDA graph for GPT-OSS recipes (PR#4140)

Megatron Inference integrated into Bridge — MCore Inference Engine examples, model wrappers, pure-LLM inference CLI, and inference_optimized path (PR#3897)
Tokenizer unification — MCore tokenizer config promoted as the shared surface (Bridge side: PR#3451; MCore side: MCore PR#4406)
Training-loop upstreaming (in progress) — Bridge's config + builder patterns moving into Megatron-LM: ConfigContainer (MCore PR#4227), serialization base (MCore PR#4309), Mamba config + builder (MCore PR#4550), GPT config + builder (MCore PR#4741), supporting utils (MCore PR#4872)

RL API refactoring — model creation, config override, training loop, export, and LoRA for RL (PR#3813)
AGENTS.md and AI-coding-agent skills updated (recipe-recommender, NeMo-RL & verl E2E testing) (PR#3256, PR#3277, PR#3831)

MegatronMIMO Qwen3.5-VL non-colocated SFT tutorial + LLaVA tutorial (PR#4239)
Qwen3-0.6B 128K long-context SFT recipe with YaRN RoPE scaling (PR#3316)
HuggingFace ↔ Megatron-FSDP weight conversion (PR#3512); online HF load/save for Megatron-FSDP (PR#1910)

ModelOpt

A big thank you to our community contributors for their valuable support!

Known issues:

The Moonlight and Nemotron v3 Nano model training recipes show performance degradation with TP > 1. As a workaround, set TP=1 and use HybridEP. We have root-caused this regression to the base PyTorch image upgrade from 26.02 to 26.04 and are actively working on a fix.
Step-3.7-Flash forward-pass outputs have not been fully verified.
Some examples/ scripts have known minor issues: MiniMax M2 (conversion/export saving), GLM-4.5V (exported tokenizer artifacts), FLUX (tokenizer setup), and WAN (inference setup/dependencies).
