Releases · NVIDIA/Megatron-LM
NVIDIA Megatron Core 0.9.0
- Uneven pipeline parallelism
- Enable pipeline parallelism where the first and last pipeline stages have fewer transformer layers than the intermediate stages
- Per-layer CUDA graph support for GPT training with Transformer Engine modules
- Enable different TP sizes for the vision encoder
- Enable pipeline parallelism for T5 and LLaVA models
- Support multi-tile, multi-image input in LLaVA models
- MoE
- FP8 support
- Runtime upcycling support
- Dispatcher implementation optimizations
- Shared expert support with overlapping optimizations (see the configuration sketch after this list)
- Qwen model support
- Known Issues
- When using sequence parallelism, dropout in the transformer block forward pass does not use the appropriate RNG context.
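The MoE features above map onto fields of megatron.core's TransformerConfig. Below is a minimal, hedged sketch rather than the exact 0.9.0 API: the shared-expert field name (moe_shared_expert_intermediate_size) is an assumption based on recent Megatron-Core versions and may differ in the installed release.

```python
# Hedged sketch: configuring an FP8 MoE GPT model through TransformerConfig.
# Fields marked "assumed" may differ across Megatron-Core versions.
from megatron.core.transformer.transformer_config import TransformerConfig

config = TransformerConfig(
    num_layers=24,
    hidden_size=4096,
    num_attention_heads=32,
    # MoE settings
    num_moe_experts=8,                     # experts per MoE layer
    moe_router_topk=2,                     # top-k routing
    moe_grouped_gemm=True,                 # grouped GEMM over local experts
    moe_token_dispatcher_type="alltoall",  # all-to-all token dispatcher
    # FP8 training via Transformer Engine ("hybrid" = e4m3 fwd / e5m2 bwd)
    fp8="hybrid",
    # Shared expert (assumed field name, for illustration only)
    moe_shared_expert_intermediate_size=4096,
)
```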
NVIDIA Megatron Core 0.8.0
- Multimodal
- Added initial support for training vision-language models using the LLaVA architecture
- Added initial support for inference with multimodal inputs
- An end-to-end multimodal example, from data collection through training to evaluation, is provided in examples/multimodal
- MoE
- Context parallelism support
- Distributed checkpointing support for grouped GEMM
- Mamba
- Added initial support for training and inference of Mamba-2 models
- Support for hybrid models consisting of Mamba-2, attention, and MLP layers (see the pattern sketch after this list)
- Examples provided in examples/mamba
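For the hybrid Mamba-2 models, examples/mamba describes the per-layer layout with a pattern string. The sketch below only illustrates that idea; the symbol meanings ('M' = Mamba-2, '*' = self-attention, '-' = MLP) are an assumption and should be checked against the example scripts.

```python
# Hedged sketch: building a hybrid layer-pattern string in the spirit of
# examples/mamba. Symbol meanings are assumed: 'M' = Mamba-2 layer,
# '*' = self-attention layer, '-' = MLP layer.
def build_hybrid_pattern(num_blocks: int) -> str:
    """Repeat a [Mamba-2, Mamba-2, attention, MLP] block num_blocks times."""
    return "MM*-" * num_blocks

print(build_hybrid_pattern(4))  # 'MM*-MM*-MM*-MM*-' -> a 16-layer hybrid stack
```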
NVIDIA Megatron Core 0.7.0
- MoE
- Token drop support (see the configuration sketch after this list)
- Several efficiency optimizations
- Improved model parallelism
- Memory optimizations
- Distributed checkpointing
- Enabled for RETRO
- Asynchronous checkpoint saving
- Several minor bug fixes, speed improvements, and memory optimizations
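Token dropping is configured through expert-capacity settings on TransformerConfig. The sketch below is a hedged illustration: the capacity-related field names (moe_expert_capacity_factor, moe_pad_expert_input_to_capacity) are taken from later Megatron-Core releases and may not match 0.7.0 exactly.

```python
# Hedged sketch: enabling MoE token dropping via expert capacity limits.
from megatron.core.transformer.transformer_config import TransformerConfig

config = TransformerConfig(
    num_layers=12,
    hidden_size=2048,
    num_attention_heads=16,
    num_moe_experts=8,
    moe_router_topk=2,
    moe_token_dispatcher_type="alltoall",
    # Cap each expert at capacity_factor * (tokens_per_rank / num_experts);
    # overflow tokens are dropped, and expert inputs are padded to capacity.
    moe_expert_capacity_factor=1.25,        # assumed field name
    moe_pad_expert_input_to_capacity=True,  # assumed field name
)
```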
NVIDIA Megatron Core 0.6.0
- MoE (Mixture of Experts)
- Performance optimization
- Communication optimizations for multi-GPU and single-GPU settings
- 23% improvement (323 TFLOPS/GPU) over MCore 0.5.0 on Mixtral with Hopper BF16
- GroupedMLP enhancement for Hopper
- DP overlapping: support for overlapping computation with gradient reduction and parameter gathering
- All-to-All based Token Dispatcher
- Layer-wise logging for the load-balancing loss
- Improved expert parallelism support, including the distributed optimizer
- Performance optimization
- Distributed optimizer
- RETRO
- Data processing
- BERT
- Distributed checkpointing support for BERT
- Distributed checkpointing (see the usage sketch after this list)
- PyTorch-native distributed backend
- Improved saving/loading speed
- TensorRT-LLM Export
- Integration with TensorRT Model Optimizer for post-training quantization (PTQ)
- Text generation driver to perform PTQ in Megatron-LM
- Llama2 and Nemotron3-8B examples using the TensorRT-LLM unified build API to build engines after training
- Several minor enhancements, bug fixes, and documentation updates
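The distributed checkpointing flow referenced above roughly looks as follows. This is a minimal sketch, assuming `model` is a megatron.core module (e.g. GPTModel) exposing sharded_state_dict(), that torch.distributed and model parallelism are already initialized, and that the checkpoint path is hypothetical.

```python
# Minimal sketch of PyTorch-native distributed checkpointing in megatron.core.
from megatron.core import dist_checkpointing

ckpt_dir = "/checkpoints/iter_0001000"  # hypothetical path

# Save: each rank writes only the parameter shards it owns.
dist_checkpointing.save(model.sharded_state_dict(), ckpt_dir)

# Load: build the sharded state dict first, then fill it from disk; this
# allows loading into a different tensor/pipeline-parallel layout.
loaded_state_dict = dist_checkpointing.load(model.sharded_state_dict(), ckpt_dir)
model.load_state_dict(loaded_state_dict)
```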
NVIDIA Megatron Core 0.5.0
Key Features and Enhancements
Megatron Core documentation is now live!
Model Features
- MoE (Mixture of Experts)
- Support for Z-loss, load balancing, and Sinkhorn routing
- Layer and communications refactor
- Richer parallelism mappings: EP can be combined with other model-parallel techniques for larger MoE variants, e.g. EP + TP + DP + SP + PP (see the initialization sketch after this list)
- Token-dropless architecture with top-k routing
- Performance optimization with GroupedGEMM when the number of local experts is greater than 1
- Distributed checkpointing
- Interleaved rotary embedding
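As a rough illustration of combining EP with TP/PP/DP, the process groups can be set up through megatron.core.parallel_state. This is a hedged sketch: the expert_model_parallel_size keyword follows later Megatron-Core releases and may be named differently in 0.5.0; it assumes launch under torchrun on enough GPUs.

```python
# Hedged sketch: initializing combined TP + PP + EP (+ implicit DP) groups.
# Run under torchrun so rank/world-size come from the environment.
import os
import torch
from megatron.core import parallel_state

torch.distributed.init_process_group(backend="nccl")
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

parallel_state.initialize_model_parallel(
    tensor_model_parallel_size=2,    # TP
    pipeline_model_parallel_size=2,  # PP
    expert_model_parallel_size=2,    # EP (assumed keyword)
)
# Remaining ranks form the data-parallel (DP) group; sequence parallelism (SP)
# reuses the TP group and is toggled in the model config.
```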
Datasets
- Masked WordPiece datasets for BERT and T5
- Raw and mock datasets
Parallelism
Performance
- Activation offloading to CPU
- RoPE and SwiGLU fusion
- Sliding window attention via Transformer Engine (see the configuration sketch after this list)
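These performance options are toggled on the model config. A hedged sketch follows; the apply_rope_fusion and window_size field names are assumptions taken from later Megatron-Core releases.

```python
# Hedged sketch: RoPE/SwiGLU fusion and sliding window attention settings.
import torch.nn.functional as F
from megatron.core.transformer.transformer_config import TransformerConfig

config = TransformerConfig(
    num_layers=32,
    hidden_size=4096,
    num_attention_heads=32,
    # SwiGLU MLP (gated linear unit with SiLU activation)
    gated_linear_unit=True,
    activation_func=F.silu,
    # Fused RoPE kernel (assumed flag name)
    apply_rope_fusion=True,
    # Sliding window attention via Transformer Engine: (left, right) window
    window_size=(4096, 0),  # assumed field name
)
```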
General Improvements
- Timers
NVIDIA Megatron Core 0.4.0
Key Features and Enhancements
Models
- BERT
- RETRO
- T5
Parallelism
- Mixture of Experts support for GPT
- Model parallel efficient Distributed Data Parallel (DDP)
- Context Parallel (2D Tensor Parallel) support
Datasets
- GPT Dataset
- Blended Dataset (see the dataset sketch after this list)
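The GPT and blended datasets are built through megatron.core.datasets. The sketch below is heavily hedged: the builder and config signatures follow later Megatron-Core releases, the data prefix is hypothetical, and a real MegatronTokenizer plus preprocessed .bin/.idx files are needed for it to actually run.

```python
# Hedged sketch: building blended GPT train/valid/test datasets.
from megatron.core.datasets.blended_megatron_dataset_builder import (
    BlendedMegatronDatasetBuilder,
)
from megatron.core.datasets.gpt_dataset import GPTDataset, GPTDatasetConfig

tokenizer = None  # placeholder: replace with a real MegatronTokenizer

config = GPTDatasetConfig(
    random_seed=1234,
    sequence_length=2048,
    blend=["/data/my-corpus_text_document"],  # hypothetical preprocessed prefix
    split="949,50,1",                          # train/valid/test proportions
    tokenizer=tokenizer,
    reset_position_ids=False,
    reset_attention_mask=False,
    eod_mask_loss=False,
)

# Requested sample counts for train/valid/test.
train_ds, valid_ds, test_ds = BlendedMegatronDatasetBuilder(
    GPTDataset, [1000, 100, 100], lambda: True, config
).build()
```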
23.04
Merge branch 'pip_package' into 'main': add pip package for megatron.core (see merge request ADLR/megatron-lm!598)
v2.5
Merge branch 'sc21' into 'main': scripts for SC21 (see merge request ADLR/megatron-lm!298)