v0.1.8
Pre-release
This release significantly expands VeOmni’s model coverage, distributed training capabilities, and training workflow support since v0.1.7. Major updates include broader Qwen family support with Qwen3.5 and Transformers v5 integrations, the introduction of Extra Parallel and multiple EP/FSDP/SP improvements, new DPO and DiT/WAN training workflows, stronger MoE and checkpoint compatibility, and much deeper test, CI, and documentation coverage across both GPU and NPU environments.
Highlights
Model Coverage & Training Workflows
- Added Qwen3.5 MoE language support and extended Qwen3.5 ViT coverage, significantly broadening the Qwen3.5 training path (#547, #552, #602).
- Upgraded Qwen2 and Qwen2-VL integrations to the Transformers v5 stack (#526, #543).
- Added DPOTrainer together with Qwen3 DPO configs and examples (#558, #583).
- Added support for DiT / WAN training workflows, including new task entrypoints, configs, and examples (#570).
- Fixed `freeze_vit` behavior for Qwen3.5 models (#616).
- Improved compatibility for DeepSeek V3 rollout and FA4-related paths (#609, #582).
- Added FLOPs counting support for `qwen3_5` and `qwen3_5_moe` (#561).
- Improved hub-kernel loader compatibility with Transformers v5.3.0+ (#633).
Distributed Training & Parallelism
- Introduced Extra Parallel, a breaking upgrade to VeOmni’s distributed parallel abstraction (#429).
- Added `mixed_precision` support in `fsdp_config` and CPU parameter loading for FSDP1 (#627, #612).
- Extended EP support to merged FC1 and quack GEMM backends, and improved EP/FSDP behavior for expert modules (#588, #577).
- Added a fused Triton kernel for MoE load-balancing loss, improving distributed MoE training efficiency (#560).
- Improved Sequence Parallel stability with a zero-division guard in `ReduceLoss` and new `roll_with_sequence_parallel` utilities (#618, #608).
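The zero-division guard mentioned for `ReduceLoss` covers a case that can arise under Sequence Parallel: a shard, or a whole batch, may contain no valid (non-ignored) tokens, so a token-weighted mean would divide by zero. The following is a minimal sketch of the idea in plain Python, not VeOmni's implementation, and the actual fix in #618 may differ. `reduce_token_mean` and its arguments are hypothetical names, and the per-rank lists stand in for values that would normally be combined with an all-reduce.

```python
from typing import Sequence


def reduce_token_mean(loss_sums: Sequence[float], token_counts: Sequence[int]) -> float:
    """Token-weighted mean loss across ranks, guarded against zero tokens.

    loss_sums[i] is the summed loss on rank i; token_counts[i] is the number
    of valid (non-ignored) tokens on rank i. Both names are hypothetical.
    """
    if len(loss_sums) != len(token_counts):
        raise ValueError("loss_sums and token_counts must have the same length")

    total_loss = sum(loss_sums)        # stands in for an all-reduce(SUM) of losses
    total_tokens = sum(token_counts)   # stands in for an all-reduce(SUM) of counts

    # Guard: if no rank saw a valid token, return 0.0 instead of dividing by zero.
    if total_tokens == 0:
        return 0.0
    return total_loss / total_tokens
```

Calling `reduce_token_mean([2.0, 0.0], [4, 0])` models a step where the second rank holds only padding; the all-zero case `reduce_token_mean([0.0, 0.0], [0, 0])` is the one the guard exists for.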
Data Pipeline
- Added worker-side multi-source dynamic batching resume, a breaking change for resume behavior in multi-source data loading (#603).
- Added support for non-divisible frame alignment with `frame_factor_remainder` and improved video preprocessing behavior (#587, #585).
- Improved data transform robustness for `source_name` handling and multimodal edge cases (#553, #554, #550).
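As an illustration of non-divisible frame alignment, some video models expect a frame count of the form `k * frame_factor + remainder` rather than an exact multiple of `frame_factor`. The sketch below shows one plausible reading of that rule. It is an assumption about what `frame_factor_remainder` means, not VeOmni's code, and `align_num_frames` is a hypothetical helper.

```python
def align_num_frames(num_frames: int, frame_factor: int, remainder: int = 0) -> int:
    """Largest count <= num_frames with count % frame_factor == remainder.

    Assumed semantics for illustration only. Returns 0 when num_frames is
    too small to satisfy the constraint.
    """
    if frame_factor <= 0:
        raise ValueError("frame_factor must be positive")
    if not 0 <= remainder < frame_factor:
        raise ValueError("remainder must satisfy 0 <= remainder < frame_factor")
    if num_frames < remainder:
        return 0
    # Drop trailing frames until the count matches k * frame_factor + remainder.
    return num_frames - (num_frames - remainder) % frame_factor
```

Under these assumed semantics, a clip with 83 frames, a factor of 4, and a remainder of 1 would be trimmed to 81 frames, while a clip that already has 81 frames would be left unchanged.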
Checkpointing & Compatibility
- Added runtime checkpoint tensor conversion for Qwen3-MoE Transformers v5 fused expert weights (#589).
- Added a DCP consolidation patch for HDFS FUSE compatibility and fixed duplicate checkpoint saves (#536, #595).
- Preserved tokenizer config in the MoE merge script and fixed checkpoint writer filtering logic (#622, #593).
Testing, CI & Documentation
- Added dummy forward and FSDP equivalence tests to improve distributed correctness coverage (#620).
- Expanded and aligned NPU CI coverage with GPU, including additional unit and end-to-end cases (#566, #567, #597, #623).
- Expanded the testing documentation and consolidated test helpers (#631).
- Updated Ascend TorchCodec installation docs and added a helper installation script (#613).
- Added a CI verifier to ensure patchgen outputs stay in sync (#559).
Breaking Changes
- Extra Parallel is now the main abstraction for expert-style parallelism, with follow-up API/doc updates from `ep_plan` to `extra_parallel_plan` (#429, #579).
- Worker-side multi-source dynamic batching resume changes resume semantics and should be validated carefully before upgrading existing jobs (#603).
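Because the rename from `ep_plan` to `extra_parallel_plan` is breaking, it can help to scan project code and configs for leftover references before upgrading. The snippet below is a generic sketch, not a VeOmni tool. `find_old_name` is a hypothetical helper, and it only reports matching lines without rewriting anything.

```python
import re
from typing import List, Tuple

# Match `ep_plan` as a name or name suffix (e.g. `_ep_plan`), but not inside
# unrelated words such as `deep_plan`.
OLD_NAME = re.compile(r"(?<![A-Za-z0-9])ep_plan(?![A-Za-z0-9_])")


def find_old_name(text: str) -> List[Tuple[int, str]]:
    """Return (line_number, stripped_line) pairs that still mention `ep_plan`."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        if OLD_NAME.search(line):
            hits.append((lineno, line.strip()))
    return hits
```

One way to use it is to read each file with `pathlib.Path.read_text()` and pass the contents to `find_old_name`, then review the reported lines by hand, since whether a given match needs renaming depends on how the project uses the old name.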
New Contributors
- @hjshi84 made their first contribution in #536
- @JorgenWan made their first contribution in #429
- @deerlu made their first contribution in #552
- @nono-Sang made their first contribution in #611
- @xzzWZY made their first contribution in #595
- @whisylan made their first contribution in #608
- @cls1206 made their first contribution in #623
Full Changelog: v0.1.7...v0.1.8