
torch_musa Release v2.5.0


@fmo-mt fmo-mt released this 21 Oct 08:05
· 18 commits to main since this release
0dbf6f1

Release Note

torch_musa v2.5.0 is now available. The torch_musa version number now matches the PyTorch version, the muSolver and muFFT libraries are integrated into torch_musa, and UMM is supported on Unified Memory devices. We have kept improving compatibility with the latest MUSA SDK, so this release can be built with MUSA SDK 4.2.0 through 4.3.0 and later versions. The number of supported operators in torch_musa has grown to over 1000.

New Features

Support UMM for M1000

The Arm architecture employs a UMA (Unified Memory Addressing) design, in which the GPU and CPU access a single, shared physical memory space. To reduce memory consumption during model execution on M1000, this release enables:

  • Elimination of duplicate memory allocation on GPU
  • Reduction of memory copy between host and device
  • Direct GPU access to memory originally allocated by CPU allocator

This release adds Unified Memory Management (UMM) support to the MUSA backend, which avoids a separate GPU memory allocation in torch.load(map_location="musa"). Enable the feature by setting the environment variable: export PYTORCH_MUSA_ALLOC_CONF="cpu:unified".
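As a minimal sketch of how this might look from Python, the snippet below sets the variable in-process and then loads a checkpoint onto the MUSA device. The helper name load_on_musa and the checkpoint path are hypothetical, and it is an assumption (not stated in this release note) that the variable must be set before torch and torch_musa are imported.

```python
import os

# Assumption: the allocator reads this variable at initialization, so it is
# set before torch / torch_musa are imported anywhere in the process.
os.environ["PYTORCH_MUSA_ALLOC_CONF"] = "cpu:unified"


def load_on_musa(checkpoint_path):
    """Load a checkpoint with map_location="musa" (hypothetical helper)."""
    # Imported lazily so the environment variable above is already in place.
    import torch
    import torch_musa  # noqa: F401  (importing registers the MUSA backend)

    return torch.load(checkpoint_path, map_location="musa")
```

Calling load_on_musa("model.pt") on an M1000 would then go through the unified-memory path described above. Exporting the variable in the shell before launching Python is equivalent.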

Enhancements

Operators

  • Support ilshift, irshift, replication_pad1d_bwd, angle, ctcLossTensor, ctcLossTensorBwd, logit, amin/amax/prod.dim_int, glu_bwd, etc;
  • Support some basic Sparse(csr) operations;
  • Support more quantized operators;
  • Fix torch.norm shape error;
  • Support reduce_sum uint8 dtype input and int64 dtype output;
  • Support tensor.is_musa() in C++ extensions;
  • Fix argmax/argmin with empty input;
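A few of the newly supported operators can be exercised with standard PyTorch calls, as in the sketch below. The function name exercise_new_ops is hypothetical, and the example assumes a working torch_musa install with a MUSA device available; the calls themselves are ordinary PyTorch APIs.

```python
def exercise_new_ops(device="musa"):
    """Run in-place shifts, logit, and amax/amin on the given device."""
    import torch

    if device == "musa":
        import torch_musa  # noqa: F401  (importing registers the MUSA backend)

    # In-place left and right shifts (ilshift / irshift).
    x = torch.tensor([1, 2, 4], dtype=torch.int32, device=device)
    x <<= 2
    x >>= 1

    # logit on probabilities strictly inside (0, 1).
    p = torch.tensor([0.25, 0.5, 0.75], device=device)
    z = torch.logit(p)

    # Reductions along one dimension.
    m = torch.rand(2, 3, device=device)
    row_max = torch.amax(m, dim=1)
    row_min = torch.amin(m, dim=1)

    return x.cpu(), z.cpu(), row_max.cpu(), row_min.cpu()
```

Passing device="cpu" runs the same calls on the CPU, which gives a reference to compare the MUSA results against.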

Performance

  • Optimize the performance of var/std, pad, convolution3d, layer_norm;

Functionality

  • Enable torch.musa.mccl.version();
  • Support getCurrentMUSABlasHandle and getCurrentMUSABlasLtHandle;
  • Optimize the memory consumption of FSDP2 pipeline parallelism;
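The MCCL version query can be wrapped so that scripts still run on machines without torch_musa. This is only a sketch: the helper name mccl_version_or_none is hypothetical, and the format of the returned value is not specified in this release note.

```python
def mccl_version_or_none():
    """Return torch.musa.mccl.version(), or None if torch_musa is missing."""
    try:
        import torch
        import torch_musa  # noqa: F401  (importing registers the MUSA backend)
    except ImportError:
        return None
    return torch.musa.mccl.version()


if __name__ == "__main__":
    print("MCCL version:", mccl_version_or_none())
```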

Known Issues

  • Complex dtype operators are not yet fully supported; as a workaround, some operators fall back to the CPU.

Enjoy.