torch_musa Release v2.1.0

@fmo-mt fmo-mt released this 17 Jul 11:40
· 38 commits to main since this release
8ee39bb

Release Note

We are excited to announce the release of torch_musa v2.1.0, based on PyTorch v2.5.0. This release delivers optimized performance and flexibility across key PyTorch components on the MUSA platform.
It adds support for AOTInductor and FSDP2, along with adaptations of our Memory Management and Triton-MUSA, and improves the performance of a number of operators. The number of supported operators in torch_musa has grown to more than 930. We've also simplified MUSA integration with automatic torch_musa loading: users no longer need to call "import torch_musa" in Python scripts.

New Features

AOT Inductor

MUSA-backend support is now integrated into AOTInductor, enabling models to be ahead-of-time compiled for MUSA devices. This allows seamless inference acceleration via both C++ and Python runtimes, streamlining deployment on MUSA hardware.

FSDP2

FSDP2 is the DTensor-based, per-parameter-sharding version of FSDP. This release supports it with Moore Threads GPU optimizations, enabling hardware-accelerated distributed training of large models through custom sharding strategies and native mixed precision.

Memory Management

We are pleased to introduce a pluggable MUSA (Memory Unified System Allocator) backend, providing greater flexibility and customization for memory management in your applications.

Triton-MUSA (reland)

Reintroduces the MUSA integration with TorchInductor, now based on PyTorch 2.5 and with less device-specific code.

Enhancements

Operators

We keep adding operators and dtypes to support more types of DL models. torch_musa currently supports more than 930 operators, which is enough to deploy most DL models from both industry and academia.

  • Math Ops: _masked_softmax, tril_indices, triu_indices, trace, ...
  • Statistical: nanmedian, normal, huber_loss, cauchy, log_normal, ...
  • NN Ops: native_batch_norm, reflection_pad, fractional_max_pool, ...
  • Advanced Math: cosh, erfc, lgamma, digamma, polygamma, ...

Performance

We've optimized the quantization operators and enhanced the split and chunk operators. We've also added a fused cross entropy loss implementation, which can help reduce peak memory usage. There are many more improvements, too numerous to list individually here.
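
For a sense of scale on why the loss can dominate peak memory: an unfused cross entropy works on a dense logits tensor with one row per token and one column per class, and its gradient has the same shape. The sketch below only estimates the size of that tensor. The function name and the workload numbers are illustrative assumptions, and it says nothing about how the fused implementation in torch_musa works internally.

```python
def logits_bytes(batch: int, seq_len: int, vocab: int, bytes_per_elem: int) -> int:
    """Size in bytes of a dense (batch * seq_len, vocab) logits tensor."""
    return batch * seq_len * vocab * bytes_per_elem


if __name__ == "__main__":
    # Hypothetical workload, not taken from the release note:
    # 8 sequences of 4096 tokens, a 128,256-entry vocabulary, fp32 logits.
    size = logits_bytes(batch=8, seq_len=4096, vocab=128_256, bytes_per_elem=4)
    print(f"logits alone: {size / 2**30:.2f} GiB")
```

Halving `bytes_per_elem` (for example, 2 for fp16 or bf16) halves the estimate, but the tensor still scales with the product of token count and vocabulary size.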

Build

The MUSA backend now initializes automatically with torch: no manual imports or environment setup are required. We also revamped the CMake build system so that C++ projects can integrate the MUSA-accelerated Torch libraries through modern target-based dependency management.
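
As a minimal sketch of what automatic loading means for a script, the helper below picks a device without ever importing torch_musa. It assumes torch_musa exposes its device module as `torch.musa` with an `is_available()` function. The helper name `pick_device` is hypothetical, and the CPU fallback is there only so the same script also runs on machines without PyTorch or without a MUSA device.

```python
def pick_device() -> str:
    """Return "musa" when a MUSA device is usable, otherwise "cpu"."""
    try:
        # Per the release note, importing torch is enough; there is no
        # explicit "import torch_musa" here.
        import torch
    except ImportError:
        return "cpu"  # PyTorch itself is not installed

    # Assumption: torch_musa registers its device module as torch.musa.
    musa = getattr(torch, "musa", None)
    if musa is not None and musa.is_available():
        return "musa"
    return "cpu"


if __name__ == "__main__":
    print(pick_device())
```

The returned string can be passed wherever PyTorch accepts a device, for example `model.to(pick_device())`.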

Enjoy.