Release Note

torch_musa v2.5.0 is now available. The torch_musa version number now matches the corresponding PyTorch version. This release integrates the muSolver and muFFT libraries into torch_musa and supports UMM on unified-memory devices. We continue to improve compatibility with the latest MUSA SDK, so this release can be built with MUSA SDK 4.2.0 to 4.3.0 and later versions. The number of operators supported in torch_musa has grown to over 1000.

New Features

Support UMM for M1000

Arm architecture employs a UMA (Unified Memory Addressing) design, enabling both GPU and CPU to access a single, shared physical memory space. To optimize memory consumption during model execution on M1000, this implementation enables:

Elimination of duplicate memory allocation on GPU
Reduction of memory copy between host and device
Direct GPU access to memory originally allocated by CPU allocator

This release adds Unified Memory Management (UMM) support for the MUSA backend, which avoids GPU memory allocation in torch.load(map_location="musa"). To enable the feature, set the environment variable: export PYTORCH_MUSA_ALLOC_CONF="cpu:unified".
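The same setting can also be applied from Python. The snippet below is a minimal sketch, not the project's reference usage: it assumes the allocator reads PYTORCH_MUSA_ALLOC_CONF when torch_musa initializes, so the variable is set before the import. The helper name load_checkpoint and its fallback to CPU are hypothetical and only there so the sketch also runs on machines without torch_musa.

```python
import importlib.util
import os

# Assumption: the variable must be in place before torch_musa is imported,
# because the allocator is expected to read it during initialization.
os.environ["PYTORCH_MUSA_ALLOC_CONF"] = "cpu:unified"


def load_checkpoint(path):
    """Load a checkpoint onto the MUSA device when torch_musa is installed.

    Falls back to the CPU otherwise (hypothetical helper for illustration).
    """
    import torch

    if importlib.util.find_spec("torch_musa") is not None:
        import torch_musa  # noqa: F401  (importing registers the "musa" backend)

        # With UMM enabled, this load is expected to avoid a separate GPU allocation.
        return torch.load(path, map_location="musa")
    return torch.load(path, map_location="cpu")
```

Calling load_checkpoint with the path to a saved checkpoint then follows the torch.load(map_location="musa") path described above on an M1000 device.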

Enhancements

Operators

Support ilshift, irshift, replication_pad1d_bwd, angle, ctcLossTensor, ctcLossTensorBwd, logit, amin/amax/prod.dim_int, glu_bwd, etc.;
Support some basic sparse (CSR) operations;
Support more quantized operators;
Fix torch.norm shape error;
Support reduce_sum with uint8 dtype input and int64 dtype output;
Support tensor.is_musa() in C++ extensions;
Fix argmax/argmin with empty input;
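A few of the newly supported operators can be exercised through their standard PyTorch entry points. The following is a rough sketch under stated assumptions: pick_device and demo_new_operators are hypothetical helper names, and the check only tests whether the torch_musa package is importable, not whether a MUSA device is actually present.

```python
import importlib.util


def pick_device():
    """Return "musa" when the torch_musa package is importable, otherwise "cpu"."""
    if importlib.util.find_spec("torch_musa") is not None:
        return "musa"
    return "cpu"


def demo_new_operators():
    """Run logit, amax, and prod (reduced over one dimension) on the chosen device."""
    import torch

    device = pick_device()
    if device == "musa":
        import torch_musa  # noqa: F401  (importing registers the "musa" device)

    probs = torch.tensor([0.25, 0.5, 0.75], device=device)
    scores = torch.tensor([[1.0, 4.0], [3.0, 2.0]], device=device)
    return {
        "logit": torch.logit(probs),
        "amax": torch.amax(scores, dim=1),
        "prod": torch.prod(scores, dim=1),
    }
```

Comparing the results against a CPU run of the same calls is a simple way to sanity-check an installation.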

Performance

Optimize the performance of var/std, pad, convolution3d, layer_norm;

Functionality

Enable torch.musa.mccl.version();
Support getCurrentMUSABlasHandle and getCurrentMUSABlasLtHandle;
Optimize memory consumption of FSDP2 pipeline parallelism;

Known Issues

Complex dtype operators are not fully supported yet; some operators fall back to the CPU as a workaround.

Enjoy.
