torch_musa Release v1.0.0
torch_musa Release Notes
- Highlights
- New Features
  - CUDA Kernels Porting
  - Caching Allocator
  - Device Management
  - Distributed Data Parallel Training [Experimental]
  - FP16 Inference [Experimental]
- Supported Operators
- Supported Models
- Documentation
- Dockers
Highlights
We are excited to release torch_musa v1.0.0, based on PyTorch v2.0.0. This release supports several fundamental features, including CUDA kernels porting, device management, a memory allocator, distributed data parallel training (experimental), and FP16 inference (experimental). In addition, we have adapted more than 300 operators. With these features and operators, torch_musa can support a large number of models in various fields, including the recently popular large language models. The number of supported operators and models is increasing rapidly. With torch_musa, users can easily accelerate AI applications on Moore Threads graphics cards.
This release is the result of the efforts of engineers in the Moore Threads AI Team and other departments. We sincerely hope that everyone will continue to follow our work, participate in it, and witness the fast iteration of torch_musa and Moore Threads graphics cards with us.
New Features
CUDA Kernels Porting
Thanks to the CUDA compatibility of our MUSA software stack, torch_musa can easily support CUDA-compatible modules. This enables developers to reuse CUDA kernels with little effort, which greatly speeds up operator adaptation.
Caching Allocator
The amount of memory a program requires changes constantly during execution. Frequent memory allocation and deallocation (through musaMalloc and musaFree) usually leads to high execution cost. To alleviate this issue, we implemented a caching allocator that requests memory blocks from MUSA and strategically splits and reuses these blocks without returning them to MUSA, which results in a significant performance gain.
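The caching idea can be illustrated with a minimal pure-Python sketch. It is not the torch_musa implementation: the class name `CachingAllocator`, the 512-byte rounding granularity, and the use of `bytearray` as a stand-in for an expensive backend call such as musaMalloc are all assumptions made for illustration, and block splitting is omitted for brevity.

```python
from collections import defaultdict


class CachingAllocator:
    """Toy caching allocator: freed blocks are kept for reuse, not returned to the backend."""

    def __init__(self, backend_malloc):
        # backend_malloc is a hypothetical stand-in for an expensive call such as musaMalloc.
        self._backend_malloc = backend_malloc
        self._free_blocks = defaultdict(list)  # rounded size -> cached free blocks
        self.backend_calls = 0

    @staticmethod
    def _round_up(size, granularity=512):
        # Round sizes up (granularity is an assumed value) so similar requests share blocks.
        return (size + granularity - 1) // granularity * granularity

    def malloc(self, size):
        size = self._round_up(size)
        if self._free_blocks[size]:
            return self._free_blocks[size].pop()  # cache hit: no backend call
        self.backend_calls += 1
        return self._backend_malloc(size)

    def free(self, block):
        # Keep the block in the cache instead of handing it back to the backend.
        self._free_blocks[len(block)].append(block)


allocator = CachingAllocator(backend_malloc=bytearray)
for _ in range(100):
    buf = allocator.malloc(1000)
    allocator.free(buf)
print(allocator.backend_calls)
```

In this sketch, one hundred allocate/free cycles of the same size reach the backend only once; every later request is served from the cache.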
Device Management
To manage devices, torch_musa implements three components: device streams, device events, and device generators. Device streams are used to manage and synchronize launched kernels. A device event is closely related to streams and records a specific point in the execution of a stream. Device generators are used to generate random numbers. Devices are initialized lazily, which can improve startup time, especially on multi-GPU systems.
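Lazy initialization can be sketched in a few lines of plain Python. This is only an illustration of the pattern, not torch_musa code: `LazyDeviceRuntime`, `get_context`, and the string used as a context handle are hypothetical names chosen for the example.

```python
class LazyDeviceRuntime:
    """Defers per-device setup until a device is first used."""

    def __init__(self, device_count):
        self._device_count = device_count
        self._initialized = {}  # device index -> context handle

    def _init_device(self, index):
        # Placeholder for expensive work such as creating a device context.
        return f"context-for-device-{index}"

    def get_context(self, index):
        if not 0 <= index < self._device_count:
            raise IndexError(f"invalid device index {index}")
        if index not in self._initialized:
            # First use of this device: pay the setup cost now, and only once.
            self._initialized[index] = self._init_device(index)
        return self._initialized[index]

    @property
    def initialized_devices(self):
        return sorted(self._initialized)


runtime = LazyDeviceRuntime(device_count=8)
runtime.get_context(2)
runtime.get_context(2)
print(runtime.initialized_devices)
```

Although eight devices are declared, only the device that was actually requested is set up, which is why a process that touches a single device on a multi-GPU machine can start faster.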
Distributed Data Parallel Training [Experimental]
As the number of model parameters increases, especially for large language models, distributed data parallel training becomes increasingly important. torch_musa now supports distributed data parallel training. Several important communication primitives are already supported, including send, recv, broadcast, all_reduce, reduce, all_gather, gather, scatter, reduce_scatter and barrier. The interface torch.nn.parallel.DistributedDataParallel is also supported. This module is under rapid development.
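For readers unfamiliar with these primitives, the sketch below simulates what four of them compute, using a plain Python list in which index r holds the value owned by rank r. It involves no real communication and does not use torch_musa or torch.distributed; the function bodies are simplified illustrations, and the gradient values are made up.

```python
def broadcast(values, src):
    # Every rank ends up with the value held by rank `src`.
    return [values[src] for _ in values]


def all_reduce(values):
    # Every rank ends up with the sum over all ranks.
    total = sum(values)
    return [total for _ in values]


def all_gather(values):
    # Every rank ends up with a list of all ranks' values.
    return [list(values) for _ in values]


def reduce_scatter(chunks):
    # chunks[r][i] is the i-th chunk contributed by rank r;
    # rank i receives the sum of the i-th chunks from all ranks.
    world_size = len(chunks)
    return [sum(chunks[r][i] for r in range(world_size)) for i in range(world_size)]


# Illustrative per-rank gradients for one parameter on 4 simulated ranks.
grads = [1.0, 2.0, 3.0, 6.0]
averaged = [g / len(grads) for g in all_reduce(grads)]
print(averaged)
```

The last lines mirror the core step of data parallel training: each rank computes gradients on its own shard of data, and an all_reduce followed by division by the number of ranks leaves every rank with the same averaged gradient.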
FP16 Inference [Experimental]
To speed up model inference, we currently support a series of FP16 operators, including linear, matmul, unary ops, binary ops, layernorm and most ported kernels. With this set of operators, we are able to run FP16 inference on a number of models. Please note that this feature is still experimental, so model support might be limited.
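FP16 trades precision and range for speed, which is one reason results are usually validated per model. The sketch below uses only the Python standard library to round values to IEEE 754 half precision; the helper name `to_fp16` is hypothetical and the snippet does not involve torch_musa.

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


print(to_fp16(1.0))     # exactly representable in half precision
print(to_fp16(0.1))     # close to, but not exactly, 0.1
print(to_fp16(2049.0))  # above 2048, not every integer is representable
```

Values such as 0.1 pick up a small rounding error, and above 2048 consecutive integers can no longer all be distinguished, so operations that accumulate many terms are the usual place to check accuracy first.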
Supported Operators
More than 300 operators are supported in torch_musa.
Supported Models
Many classic and popular models are already supported, including Stable Diffusion, ChatGLM, Conformer, BERT, YOLOv5, ResNet50, Swin-Transformer, MobileNetV3, EfficientNet, HRNet, TSM, FastSpeech2, UNet, T5, HiFi-GAN, Real-ESRGAN, OpenPose, and many GPT variants.
Documentation
We provide a developer guide that describes in detail how to prepare the development environment and walks through some of the development steps.
Dockers
A release Docker image and a development Docker image are now available.
[NOTE]: If you want to compile torch_musa without using the provided Docker images, please contact us by email at developers@mthreads.com to get the necessary dependencies.