torch_musa Release v2.0.0
Release Note
We are excited to announce the release of torch_musa v2.0.0, based on PyTorch v2.2.0.
In this release, we add support for MUSA virtual memory management, torch.compile with TorchInductor and the Triton backend, higher-performance fused modules such as SwiGLU and RoPE, and MUSAGraph for architectures greater than QY2. We have also improved the performance of many operators. The number of operators supported in torch_musa has grown to more than 760.
With these features and operators, torch_musa can support a large number of models in various fields, including the recently popular large language models. The number of supported operators and models is increasing rapidly. With torch_musa, users can easily accelerate AI applications on Moore Threads graphics cards.
This release is the result of the efforts of engineers in the Moore Threads AI Team and other departments. We sincerely hope that everyone will continue to follow our work, participate in it, and witness the fast iteration of torch_musa and Moore Threads graphics cards together.
New Features
VMM (virtual memory management)
We have implemented the ExpandableSegment memory allocator based on the MUSA VMM API. It effectively mitigates GPU memory fragmentation and reduces peak memory consumption during model training, especially in LLM training scenarios such as those using FSDP, DeepSpeed, and Megatron-LM.
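To give an intuition for why an expandable segment helps when allocation sizes keep changing, here is a toy model in plain Python. It is not the torch_musa allocator: the two function names and the one-tensor-at-a-time workload are hypothetical simplifications made up for illustration, and sizes are in arbitrary units.

```python
def reserved_with_fixed_segments(sizes):
    """Toy model: every cache miss reserves a new, separate segment.

    Each request is allocated and freed before the next one arrives.
    A cached segment can be reused only if it is large enough on its own;
    free segments are never merged with each other.
    """
    cached_segments = []
    for size in sizes:
        if not any(segment >= size for segment in cached_segments):
            cached_segments.append(size)  # reserve a brand-new segment
    return sum(cached_segments)


def reserved_with_expandable_segment(sizes):
    """Toy model: one segment whose tail grows by mapping extra pages.

    When the free segment is too small, only the shortfall is mapped,
    so the reservation never exceeds the largest request seen so far.
    """
    mapped = 0
    for size in sizes:
        mapped = max(mapped, size)  # extend the tail by the shortfall only
    return mapped


if __name__ == "__main__":
    growing_requests = [4, 6, 8, 10]  # e.g. a sequence length that keeps growing
    print(reserved_with_fixed_segments(growing_requests))      # 28
    print(reserved_with_expandable_segment(growing_requests))  # 10
```

In this simplified workload the fixed-segment policy ends up holding every intermediate size, while the expandable policy holds only the largest one. Real allocators split and coalesce blocks, so the actual gap depends on the workload.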
MUSAGraph
We have implemented the MUSAGraph interface, which is consistent with CUDAGraph. It captures a sequence of MUSA kernels into a graph and provides a mechanism to launch these kernels through a single CPU operation, which reduces launch overhead. NOTE: MUSAGraph currently supports computational logic only (no MCCL support), and it is still an experimental feature in MUSARuntime.
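Because the interface is described as consistent with CUDAGraph, a capture-and-replay flow might look like the sketch below. The names `torch.musa.MUSAGraph` and `torch.musa.graph` are assumed here by analogy with `torch.cuda.CUDAGraph` and `torch.cuda.graph`, and `capture_and_replay` is a hypothetical helper; please check the torch_musa documentation for the exact API. The warm-up is also simplified compared with the usual side-stream recipe.

```python
def capture_and_replay(model, static_input, warmup_iters=3):
    """Capture one forward pass into a graph and return a replay function.

    Imports are deferred so this sketch can be loaded on machines without
    torch_musa; calling it requires a MUSA device.
    """
    import torch
    import torch_musa  # noqa: F401  (registers the "musa" device)

    # Warm up so lazy initialization does not end up inside the capture.
    with torch.no_grad():
        for _ in range(warmup_iters):
            model(static_input)
    torch.musa.synchronize()

    # Assumed API, mirroring torch.cuda.CUDAGraph / torch.cuda.graph.
    graph = torch.musa.MUSAGraph()
    with torch.musa.graph(graph), torch.no_grad():
        static_output = model(static_input)

    def replay(new_input):
        # A graph replays fixed memory addresses, so new data is copied
        # into the captured input buffer instead of passing a new tensor.
        static_input.copy_(new_input)
        graph.replay()
        return static_output.clone()

    return replay
```

The returned `replay` function launches the whole captured kernel sequence with one call, which is where the launch-overhead saving comes from. Input shapes must stay the same as at capture time.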
torch.compile for MUSA
We have integrated the triton_musa backend into TorchInductor and implemented partial adaptations for TorchDynamo, enabling users to accelerate both model training and inference through PyTorch's torch.compile interface.
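As a minimal sketch of the intended usage, the helper below moves a model to the MUSA device and wraps it with `torch.compile`. The helper name `compile_and_run` is hypothetical, and the fallback to CPU when no MUSA device is found is an assumption made so the sketch degrades gracefully; importing torch_musa is still required.

```python
def compile_and_run(model, example_input):
    """Compile a model with torch.compile and run one inference step.

    Imports are deferred so the sketch can be loaded without torch_musa.
    """
    import torch
    import torch_musa  # noqa: F401  (registers the "musa" device)

    # Use the MUSA device when present; otherwise fall back to CPU.
    device = "musa" if torch.musa.is_available() else "cpu"
    model = model.to(device)

    # The first call triggers TorchDynamo tracing and TorchInductor codegen.
    compiled_model = torch.compile(model)
    with torch.no_grad():
        return compiled_model(example_input.to(device))
```

The first invocation is slow because compilation happens lazily; later calls with the same input shapes reuse the generated kernels.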
Fused modules & functionals
We support the custom fused modules torch.nn.RoPE, torch.nn.SwishGLU, and FusedCrossEntropy, which can be used in LLMs to accelerate training and inference.
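For readers unfamiliar with these operations, the following plain-Python reference shows the math that SwiGLU and RoPE are commonly defined by. It is not the torch_musa implementation: the function names are hypothetical, and the layout choices (gate in the first half for SwiGLU, interleaved pairs for RoPE) are assumptions that may differ from the fused kernels.

```python
import math


def silu(x):
    """SiLU (swish) activation: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


def swiglu_reference(values):
    """SwiGLU on a flat vector: silu(first half) * second half, elementwise."""
    if len(values) % 2:
        raise ValueError("SwiGLU needs an even number of features")
    half = len(values) // 2
    gate, up = values[:half], values[half:]
    return [silu(g) * u for g, u in zip(gate, up)]


def rope_reference(vec, position, base=10000.0):
    """Rotary position embedding on one vector.

    Each pair (vec[i], vec[i + 1]) for even i is rotated by the angle
    position * base ** (-i / d), where d is the vector length.
    """
    d = len(vec)
    if d % 2:
        raise ValueError("RoPE needs an even number of features")
    out = []
    for i in range(0, d, 2):
        theta = position * base ** (-i / d)
        cos_t, sin_t = math.cos(theta), math.sin(theta)
        x0, x1 = vec[i], vec[i + 1]
        out.extend([x0 * cos_t - x1 * sin_t, x0 * sin_t + x1 * cos_t])
    return out
```

A fused module computes the same result in a single kernel, which avoids materializing the intermediate tensors that the step-by-step version would create.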
FP8 support
We support FP8 matmul and distributed communication in torch_musa for architectures greater than QY2.
Enhancement
Operators
We keep adding operators and dtypes to support more types of DL models. torch_musa currently supports more than 760 operators, which is enough to deploy most DL models from the community.
Build
We support multi-arch compilation: one can build torch_musa on any arch of the MTGPU platform and then run it on other platforms.
Enjoy.