torch_musa Release v2.1.0
Release Note
We are excited to announce the release of torch_musa v2.1.0, based on PyTorch v2.5.0. This release delivers optimized performance and greater flexibility across key PyTorch components on the MUSA platform.
This release adds support for AOTInductor and FSDP2, alongside updates to Memory Management and Triton-MUSA, and improves the performance of many operators. The number of operators supported in torch_musa has increased to over 930. We have also simplified MUSA integration with automatic torch_musa loading: users no longer need to call "import torch_musa" in Python scripts.
New Features
AOT Inductor
MUSA-backend support is now integrated into AOTInductor, enabling models to be ahead-of-time compiled for MUSA devices. This allows seamless inference acceleration via both C++ and Python runtimes, streamlining deployment on MUSA hardware.
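As a rough illustration of the Python workflow, the sketch below uses the stock PyTorch 2.5 entry points torch._export.aot_compile and torch._export.aot_load. It assumes that these accept the "musa" device string unchanged once torch_musa is installed, which this note does not state. TinyModel and compile_and_load are hypothetical names used only for illustration.

```python
try:
    import torch
except ImportError:  # keeps the sketch importable on machines without PyTorch
    torch = None


def compile_and_load(device: str = "musa"):
    """Ahead-of-time compile a toy model and load the resulting shared library.

    Assumption: torch_musa lets the upstream PyTorch 2.5 AOTInductor entry
    points run with device="musa"; pass "cpu" to try it without MUSA hardware.
    """
    if torch is None:
        raise RuntimeError("PyTorch is required for this sketch")

    class TinyModel(torch.nn.Module):  # hypothetical example model
        def __init__(self):
            super().__init__()
            self.fc = torch.nn.Linear(8, 4)

        def forward(self, x):
            return torch.relu(self.fc(x))

    model = TinyModel().to(device).eval()
    example = (torch.randn(2, 8, device=device),)

    with torch.no_grad():
        # Returns the path of the generated shared library (.so).
        so_path = torch._export.aot_compile(model, example)

    # Load the compiled artifact back into Python as a callable.
    runner = torch._export.aot_load(so_path, device)
    return runner, example
```

Calling runner(*example) on the returned pair would then run inference through the compiled library rather than through eager PyTorch.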
FSDP2
FSDP2, the DTensor-based per-parameter-sharding FSDP, is now supported with optimizations for Moore Threads GPUs. It enables hardware-accelerated distributed training of large models through custom sharding strategies and native mixed precision.
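A minimal sketch of per-parameter sharding with mixed precision, written against the upstream PyTorch 2.5 FSDP2 API (fully_shard and MixedPrecisionPolicy in torch.distributed._composable.fsdp). It assumes that torch_musa supports this API unchanged, that torch.distributed has already been initialized, and that the model is on a MUSA device. The shard_model helper and the dtype choices are illustrative only.

```python
def shard_model(model, layers):
    """Apply FSDP2 per-parameter sharding to each layer, then to the root module.

    Assumptions: torch.distributed is already initialized (for example via
    torchrun), the model lives on a MUSA device, and torch_musa supports the
    upstream PyTorch 2.5 FSDP2 API unchanged.
    """
    import torch
    from torch.distributed._composable.fsdp import MixedPrecisionPolicy, fully_shard

    # Native mixed precision: bf16 parameters for compute, fp32 gradient reduction.
    # The dtypes are illustrative; pick ones your hardware supports.
    mp_policy = MixedPrecisionPolicy(
        param_dtype=torch.bfloat16,
        reduce_dtype=torch.float32,
    )

    # Shard each layer separately so its parameters are gathered only when needed.
    for layer in layers:
        fully_shard(layer, mp_policy=mp_policy)

    # Shard the root last to cover any parameters not owned by a layer.
    fully_shard(model, mp_policy=mp_policy)
    return model
```

Sharding layer by layer before the root is the usual FSDP2 pattern: each call forms one communication group, so the choice of layers is where a custom sharding strategy is expressed.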
Memory Management
We are pleased to introduce a pluggable MUSA (Memory Unified System Allocator) backend, providing greater flexibility and customization for memory management in your applications.
Triton-MUSA (reland)
Reintroduces the MUSA integration with TorchInductor, now based on PyTorch 2.5 and with less device-specific code.
Enhancements
Operators
We keep adding operators and dtypes to support more types of DL models. torch_musa currently supports more than 930 operators, which is enough to deploy most DL models from both industry and academia.
- Math Ops: _masked_softmax, tril_indices, triu_indices, trace, ...
- Statistical: nanmedian, normal, huber_loss, cauchy, log_normal, ...
- NN Ops: native_batch_norm, reflection_pad, fractional_max_pool, ...
- Advanced Math: cosh, erfc, lgamma, digamma, polygamma, ...
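A few of the operators listed above can be exercised with a short smoke test. The sketch below uses only standard PyTorch calls; the operator_smoke_test helper is a hypothetical name, and passing device="musa" assumes a working torch_musa installation.

```python
try:
    import torch
except ImportError:  # keeps the sketch importable on machines without PyTorch
    torch = None


def operator_smoke_test(device: str = "cpu") -> dict:
    """Run a handful of the listed operators on the given device."""
    if torch is None:
        raise RuntimeError("PyTorch is required for this sketch")

    eye = torch.eye(3, device=device)
    idx = torch.tril_indices(3, 3, device=device)
    one = torch.tensor(1.0, device=device)
    zero = torch.tensor(0.0, device=device)

    return {
        "trace": torch.trace(eye).item(),      # sum of the diagonal
        "tril_count": idx.shape[1],            # lower-triangular entries of a 3x3 matrix
        "lgamma_1": torch.lgamma(one).item(),  # log(Gamma(1))
        "cosh_0": torch.cosh(zero).item(),
    }
```

Running it once with "cpu" and once with "musa" and comparing the two dictionaries is a quick way to check that an operator behaves the same on both backends.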
Performance
We have optimized the quantization operators and enhanced the split and chunk operators. We have also added a fused cross entropy loss implementation, which helps reduce peak memory usage. There are many more improvements, too numerous to list individually here.
Build
The MUSA backend now initializes automatically with torch, so no manual imports or environment setup are required. We have also revamped the CMake build system so that C++ projects can integrate the MUSA-accelerated Torch libraries seamlessly through modern target-based dependency management.
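On the Python side, automatic initialization means a script can select the MUSA device after importing only torch. The following is a minimal sketch, assuming torch_musa is installed alongside torch and exposes torch.musa.is_available(); the pick_device helper is a hypothetical name, and the CPU fallback is added here for illustration.

```python
import importlib.util


def pick_device() -> str:
    """Return "musa" when a MUSA device is usable, otherwise "cpu"."""
    if importlib.util.find_spec("torch") is None:
        return "cpu"  # PyTorch itself is not installed

    import torch  # note: no explicit "import torch_musa"

    # Assumption: the auto-loaded backend registers itself as torch.musa.
    musa = getattr(torch, "musa", None)
    if musa is not None and musa.is_available():
        return "musa"
    return "cpu"
```

The returned string can be passed directly to calls such as tensor.to(pick_device()).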
Enjoy.