torch_musa Release v2.5.0
Release Note
torch_musa v2.5.0 is now available. The torch_musa version number now matches the PyTorch version, the muSolver and muFFT libraries are integrated into torch_musa, and UMM is supported for Unified Memory devices. We continue to improve compatibility with the latest MUSA SDK: this release of torch_musa can be built with MUSA SDK 4.2.0 through 4.3.0 and later versions. The number of supported operators in torch_musa has grown to over 1000.
New Features
Support UMM for M1000
Arm architecture employs a UMA (Unified Memory Addressing) design, enabling both GPU and CPU to access a single, shared physical memory space. To optimize memory consumption during model execution on M1000, this implementation enables:
- Elimination of duplicate memory allocation on GPU
- Reduction of memory copy between host and device
- Direct GPU access to memory originally allocated by CPU allocator
This release adds Unified Memory Management (UMM) support for the MUSA backend, which avoids GPU memory allocation in torch.load(map_location="musa"). Enable the feature by setting the environment variable: export PYTORCH_MUSA_ALLOC_CONF="cpu:unified".
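As a minimal sketch, the same setting can be applied from Python instead of the shell. This assumes the allocator reads PYTORCH_MUSA_ALLOC_CONF when it initializes, so the variable is set before torch_musa is imported. The helper name load_on_musa is hypothetical and only for illustration.

```python
import os

# Assumption: the setting is read when the MUSA allocator initializes,
# so it is set here, before torch_musa is imported anywhere.
os.environ["PYTORCH_MUSA_ALLOC_CONF"] = "cpu:unified"


def load_on_musa(path):
    """Load a checkpoint with map_location="musa".

    Hypothetical helper; requires torch and torch_musa to be installed.
    """
    import torch
    import torch_musa  # noqa: F401  (imported so the "musa" device is available)

    return torch.load(path, map_location="musa")
```

Calling load_on_musa with a checkpoint path of your own then loads the tensors onto the MUSA device with the unified allocator configuration in effect.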
Enhancements
Operators
- Support ilshift, irshift, replication_pad1d_bwd, angle, ctcLossTensor, ctcLossTensorBwd, logit, amin/amax/prod.dim_int, glu_bwd, etc;
- Support some basic Sparse(csr) operations;
- Add support for more quantized operators;
- Fix torch.norm shape error;
- Support reduce_sum with uint8 dtype input and int64 dtype output;
- Support tensor.is_musa() in cpp extension;
- Fix argmax/min with empty input;
Performance
- Optimize performance of var/std, pad, convolution3d, layer_norm;
Functionality
- Enable torch.musa.mccl.version();
- Support getCurrentMUSABlasHandle and getCurrentMUSABlasLtHandle;
- Optimize memory consumption of FSDP2 pipeline parallelism;
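A short sketch of querying the MCCL version from Python, using the torch.musa.mccl.version() call named above. The wrapper mccl_version_or_none is hypothetical; it returns None when torch or torch_musa cannot be imported, so the snippet also runs on machines without the MUSA stack. The format of the returned version is not stated in this note and is not assumed here.

```python
def mccl_version_or_none():
    """Return torch.musa.mccl.version(), or None if the MUSA stack is missing.

    Hypothetical wrapper for illustration only.
    """
    try:
        import torch
        import torch_musa  # noqa: F401  (imported so torch.musa is available)
    except ImportError:
        return None
    return torch.musa.mccl.version()


version = mccl_version_or_none()
if version is None:
    print("torch_musa is not installed; MCCL version unavailable")
else:
    print("MCCL version:", version)
```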
Known Issues
- Complex dtype operators are not fully supported yet; some operators fall back to the CPU as a workaround.
Enjoy.