Release torch_musa Release v1.0.0 · MooreThreads/torch_musa

torch_musa Release Notes

Highlights
New Features
- CUDA Kernels Porting
- Caching Allocator
- Device Management
- Distributed Data Parallel Training [Experimental]
- FP16 Inference [Experimental]
Supported Operators
Supported Models
Documentation
Dockers

Highlights

We are excited to release torch_musa v1.0.0, based on PyTorch v2.0.0. This release supports several basic and important features: CUDA kernels porting, device management, a memory allocator, distributed data parallel training (experimental), and FP16 inference (experimental). In addition, we have adapted more than 300 operators. With these features and operators, torch_musa can support a large number of models in various fields, including the recently popular large language models. The number of supported operators and models is increasing rapidly. With torch_musa, users can easily accelerate AI applications on Moore Threads graphics cards.

This release is the result of work by engineers in the Moore Threads AI Team and other departments. We sincerely hope that everyone will continue to follow our work, take part in it, and witness the fast iteration of torch_musa and Moore Threads graphics cards together with us.

New Features

CUDA Kernels Porting

Thanks to the CUDA-compatible capabilities of our MUSA software stack, torch_musa can easily support CUDA-compatible modules. This lets developers reuse CUDA kernels with little effort, which greatly speeds up operator adaptation.

Caching Allocator

The amount of memory a program requires changes constantly during execution. Frequent memory allocation and deallocation (through musaMalloc and musaFree) usually incurs a high execution cost. To alleviate this, we implemented a caching allocator that requests memory blocks from MUSA and strategically splits and reuses these blocks without returning them to MUSA, which results in a significant performance gain.
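The reuse idea can be illustrated with a toy sketch in plain Python. This is not torch_musa's allocator: the class name `ToyCachingAllocator`, the `backend_alloc` parameter (a stand-in for musaMalloc), and the size-keyed free list are all assumptions made for illustration, and block splitting is left out to keep the example short.

```python
from collections import defaultdict


class ToyCachingAllocator:
    """Hypothetical illustration: cache freed blocks by size and reuse them."""

    def __init__(self, backend_alloc):
        self._backend_alloc = backend_alloc    # stand-in for musaMalloc
        self._free_blocks = defaultdict(list)  # block size -> cached blocks
        self.backend_calls = 0                 # how often the backend was hit

    def malloc(self, size):
        cached = self._free_blocks[size]
        if cached:
            # A block of this size was freed earlier: reuse it.
            return cached.pop()
        # Nothing cached: fall through to the (expensive) backend call.
        self.backend_calls += 1
        return self._backend_alloc(size)

    def free(self, block):
        # Keep the block for later reuse instead of returning it to the backend.
        self._free_blocks[len(block)].append(block)


allocator = ToyCachingAllocator(backend_alloc=bytearray)
for _ in range(1000):
    buf = allocator.malloc(4096)
    allocator.free(buf)

print("backend allocations:", allocator.backend_calls)
```

In this loop only the first request reaches the backend; every later request is served from the cache, which is the effect the caching allocator aims for on a much larger scale.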

Device Management

To manage devices, torch_musa implements three components: device streams, device events, and device generators. Device streams manage and synchronize launched kernels. A device event records a specific point in the execution of a stream. Device generators generate random numbers. Devices are initialized lazily, which can improve startup time, especially on multi-GPU systems.
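Lazy initialization, as a general pattern, can be sketched as follows. The names `LazyDevices` and `fake_probe` are hypothetical and do not correspond to torch_musa internals; the point is only that the costly device probe runs on first use rather than at import time.

```python
class LazyDevices:
    """Hypothetical illustration: defer device discovery until first use."""

    def __init__(self, probe):
        self._probe = probe   # costly call that discovers the devices
        self._count = None    # None means "not initialized yet"

    @property
    def initialized(self):
        return self._count is not None

    def device_count(self):
        if self._count is None:
            # First use: pay the initialization cost exactly once.
            self._count = self._probe()
        return self._count


probe_calls = []


def fake_probe():
    # Stand-in for real device discovery; pretends two devices exist.
    probe_calls.append(1)
    return 2


devices = LazyDevices(fake_probe)
print("initialized at startup:", devices.initialized)
print("device count:", devices.device_count())
print("initialized after first use:", devices.initialized)
```

A program that never touches a device never pays for the probe, which is where the startup saving comes from.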

Distributed Data Parallel Training [Experimental]

As the number of model parameters increases, especially for large language models, distributed data parallel training becomes increasingly important. torch_musa now has initial support for distributed data parallel training. Several important communication primitives are already supported, including send, recv, broadcast, all_reduce, reduce, all_gather, gather, scatter, reduce_scatter and barrier. The interface torch.nn.parallel.DistributedDataParallel is also supported. This module is under rapid development.
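To show what all_reduce contributes to data parallel training, here is a single-process simulation in plain Python. It makes no torch_musa or torch.distributed calls; the helper names `all_reduce_sum` and `average_gradients` are hypothetical, and each inner list stands in for the gradient tensor held by one rank.

```python
def all_reduce_sum(per_rank):
    """Simulate all_reduce(SUM): every rank receives the element-wise sum."""
    total = [sum(values) for values in zip(*per_rank)]
    return [list(total) for _ in per_rank]


def average_gradients(per_rank_grads):
    """Simulate the gradient averaging step of data parallel training."""
    world_size = len(per_rank_grads)
    summed = all_reduce_sum(per_rank_grads)
    return [[g / world_size for g in grads] for grads in summed]


# Two simulated ranks, each with gradients computed on its own data shard.
local_grads = [
    [1.0, 2.0],  # rank 0
    [3.0, 6.0],  # rank 1
]
print(average_gradients(local_grads))  # [[2.0, 4.0], [2.0, 4.0]]
```

After the step every rank holds the same averaged gradients, so the model replicas stay in sync when each rank applies its optimizer update.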

FP16 Inference [Experimental]

To speed up model inference, we currently support a series of FP16 operators, including linear, matmul, unary ops, binary ops, layernorm, and most ported kernels. With this set of operators, we are able to run FP16 inference on a number of models. Please note that this feature is still experimental, and model support might be limited.
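FP16 trades precision for speed and memory, which is one reason model support can vary. The sketch below uses only Python's standard struct module (its "e" format is IEEE 754 half precision) to show the rounding that FP16 storage introduces. The helper name `to_fp16` is hypothetical and unrelated to torch_musa.

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


for value in (1.0, 0.5, 0.1, 1000.1):
    rounded = to_fp16(value)
    error = abs(rounded - value)
    print(f"{value!r} -> {rounded!r} (abs error {error:.3e})")
```

Values such as 1.0 and 0.5 survive unchanged, while 0.1 and 1000.1 pick up a rounding error. Whether such errors matter for a given model is an empirical question, so comparing FP16 and FP32 outputs is a reasonable check before relying on FP16 inference.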

Supported Operators

More than 300 operators are supported in torch_musa.

Supported Models

Many classic and popular models are already supported, including Stable Diffusion, ChatGLM, Conformer, Bert, YOLOV5, ResNet50, Swin-Transformer, MobileNetv3, EfficientNet, HRNet, TSM, FastSpeech2, UNet, T5, HifiGan, Real-EsrGan, OpenPose, many GPT variants and so on.

Documentation

We provide a developer guide, which describes development environment preparation and some development steps in detail.

Dockers

A release Docker image and a development Docker image are now available.

[NOTE]: If you want to compile torch_musa without using the provided Docker image, please email developers@mthreads.com to get the necessary dependencies.
