v0.1.8

Pre-release

@deerlu released this 09 Apr 08:47
· 140 commits to main since this release
2877702

Compared with v0.1.7, this release expands VeOmni’s model coverage, distributed training capabilities, and training workflow support. Major updates include broader Qwen family support (Qwen3.5 and Transformers v5 integrations), the introduction of Extra Parallel alongside multiple EP/FSDP/SP improvements, new DPO and DiT/WAN training workflows, stronger MoE and checkpoint compatibility, and broader test, CI, and documentation coverage across both GPU and NPU environments.

Highlights

Model Coverage & Training Workflows

  • Added Qwen3.5 MoE language support and extended Qwen3.5 ViT coverage, broadening the Qwen3.5 training path (#547, #552, #602).
  • Upgraded Qwen2 and Qwen2-VL integrations to the Transformers v5 stack (#526, #543).
  • Added DPOTrainer together with Qwen3 DPO configs and examples (#558, #583).
  • Added support for DiT / WAN training workflows, including new task entrypoints, configs, and examples (#570).
  • Fixed freeze_vit behavior for Qwen3.5 models (#616).
  • Improved compatibility for DeepSeek V3 rollout and FA4-related paths (#609, #582).
  • Added FLOPs counting support for qwen3_5 and qwen3_5_moe (#561).
  • Improved hub-kernel loader compatibility with Transformers v5.3.0+ (#633).

Distributed Training & Parallelism

  • Introduced Extra Parallel, a breaking upgrade to VeOmni’s distributed parallel abstraction (#429).
  • Added mixed_precision support in fsdp_config and CPU parameter loading for FSDP1 (#627, #612).
  • Extended EP support to merged FC1 and quack GEMM backends, and improved EP/FSDP behavior for expert modules (#588, #577).
  • Added a fused Triton kernel for MoE load-balancing loss, improving distributed MoE training efficiency (#560).
  • Improved Sequence Parallel stability with a zero-division guard in ReduceLoss and new roll_with_sequence_parallel utilities (#618, #608).
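
To illustrate what a zero-division guard in a loss reduction protects against, here is a minimal sketch in plain Python. It is not VeOmni's ReduceLoss implementation: the function name `reduce_token_mean_loss` and the idea of passing per-rank sums as lists are assumptions made for illustration, standing in for the values an all-reduce would gather across sequence-parallel ranks. When every rank holds only masked (ignored) tokens, the total token count is zero and a naive division would fail or produce NaN.

```python
from typing import Sequence


def reduce_token_mean_loss(
    loss_sums: Sequence[float], token_counts: Sequence[int]
) -> float:
    """Token-weighted mean loss over all ranks, with a zero-division guard.

    loss_sums[i] is the summed loss on (hypothetical) rank i and
    token_counts[i] is the number of unmasked tokens on that rank.
    In real distributed code both totals would come from an all-reduce.
    """
    if len(loss_sums) != len(token_counts):
        raise ValueError("loss_sums and token_counts must have the same length")

    total_loss = sum(loss_sums)
    total_tokens = sum(token_counts)

    # Guard: if no rank contributed a valid token, report a zero loss
    # instead of dividing by zero.
    if total_tokens == 0:
        return 0.0
    return total_loss / total_tokens


if __name__ == "__main__":
    print(reduce_token_mean_loss([2.0, 4.0], [1, 3]))
    print(reduce_token_mean_loss([0.0, 0.0], [0, 0]))
```

Returning 0.0 for an all-masked batch is one possible design choice; skipping the step or clamping the denominator are alternatives, and the source does not say which one #618 uses.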

Data Pipeline

  • Added worker-side multi-source dynamic batching resume, a breaking change for resume behavior in multi-source data loading (#603).
  • Added support for non-divisible frame alignment with frame_factor_remainder and improved video preprocessing behavior (#587, #585).
  • Improved data transform robustness for source_name handling and multimodal edge cases (#553, #554, #550).
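
As a rough illustration of non-divisible frame alignment, the sketch below picks the largest usable frame count that leaves a given remainder when divided by a frame factor (for example, counts of the form 4k + 1). This is one plausible reading of `frame_factor_remainder`, not a description of VeOmni's actual preprocessing: the helper `align_frame_count` and its parameter `frame_factor` are hypothetical names introduced here.

```python
def align_frame_count(
    num_frames: int, frame_factor: int, frame_factor_remainder: int = 0
) -> int:
    """Return the largest count m <= num_frames with
    m % frame_factor == frame_factor_remainder.

    Hypothetical helper: the real semantics in VeOmni may differ.
    """
    if frame_factor <= 0:
        raise ValueError("frame_factor must be positive")
    if not 0 <= frame_factor_remainder < frame_factor:
        raise ValueError("frame_factor_remainder must be in [0, frame_factor)")

    # Drop just enough trailing frames to hit the required remainder.
    excess = (num_frames - frame_factor_remainder) % frame_factor
    aligned = num_frames - excess

    if aligned <= 0:
        raise ValueError("not enough frames to satisfy the alignment")
    return aligned


if __name__ == "__main__":
    print(align_frame_count(20, 4, 1))
    print(align_frame_count(18, 4))
```

With `frame_factor_remainder` left at 0 the helper reduces to ordinary divisible alignment, which is the case the remainder option generalizes.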

Checkpointing & Compatibility

  • Added runtime checkpoint tensor conversion for Qwen3-MoE Transformers v5 fused expert weights (#589).
  • Added a DCP consolidation patch for HDFS FUSE compatibility and fixed duplicate checkpoint saves (#536, #595).
  • Preserved tokenizer config in the MoE merge script and fixed checkpoint writer filtering logic (#622, #593).

Testing, CI & Documentation

  • Added dummy forward and FSDP equivalence tests to improve distributed correctness coverage (#620).
  • Expanded and aligned NPU CI coverage with GPU, including additional unit and end-to-end cases (#566, #567, #597, #623).
  • Expanded the testing documentation and consolidated test helpers (#631).
  • Updated Ascend TorchCodec installation docs and added a helper installation script (#613).
  • Added a CI verifier to ensure patchgen outputs stay in sync (#559).

Breaking Changes

  • Extra Parallel is now the main abstraction for expert-style parallelism; follow-up API and documentation updates rename ep_plan to extra_parallel_plan (#429, #579).
  • Worker-side multi-source dynamic batching resume changes resume semantics and should be validated carefully before upgrading existing jobs (#603).
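
When preparing for the ep_plan to extra_parallel_plan rename, it can help to list where your own code still mentions the old name. The sketch below is a generic text scan using only the Python standard library; it is not a VeOmni tool, and the function name `find_ep_plan_references` is hypothetical. It assumes the old identifier appears literally as `ep_plan`, possibly with an underscore-joined prefix, and it only reports candidates for manual review.

```python
import re
from typing import List, Tuple

# Match "ep_plan" unless it is directly preceded by a letter or digit,
# so "model.ep_plan" and "get_ep_plan" match but "step_plan" does not.
_OLD_NAME = re.compile(r"(?<![A-Za-z0-9])ep_plan")


def find_ep_plan_references(source: str) -> List[Tuple[int, str]]:
    """Return (line_number, line_text) pairs that mention the old name."""
    hits = []
    for line_number, line in enumerate(source.splitlines(), start=1):
        if _OLD_NAME.search(line):
            hits.append((line_number, line.strip()))
    return hits


if __name__ == "__main__":
    sample = (
        "plan = model.ep_plan\n"
        "step_plan = 2\n"
        "new = model.extra_parallel_plan\n"
    )
    for line_number, text in find_ep_plan_references(sample):
        print(f"{line_number}: {text}")
```

The scan deliberately does not rewrite anything, since the notes above do not spell out which call sites change; each hit should be checked against the updated documentation.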

Full Changelog: v0.1.7...v0.1.8