torch_musa Release v2.7.0

@fmo-mt fmo-mt released this 20 Nov 05:57
· 14 commits to main since this release
7a6f07a

Release Note

We are excited to announce the release of torch_musa v2.7.0, based on PyTorch v2.7.1. Along with the upgrade to torch v2.7.1, this release adds new features such as Dynamic Double Casting and Distributed Checkpointing. We have also separated the torchvision kernels from torch_musa; users who want torchvision should install it from the repo that we have musified. See the README for more details.

New Features

Dynamic Double Casting

We now support dynamic casting for some operators with the float64 dtype. Previously, few operators supported float64; now one can set an environment variable, "export TORCH_USE_MUSA_DOUBLE_CAST=1", and torch_musa will use float32 as the compute dtype;
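The flag can also be set from Python rather than the shell. The following is a minimal sketch, not a reference implementation: it assumes the variable must be set before torch_musa is imported, and `float64_cumsum_on_musa` is a hypothetical helper name. The choice of `cumsum` is illustrative only; the release note does not list which float64 operators are covered.

```python
import os

# Assumption: the flag is read when torch_musa is imported or when an operator
# is dispatched, so setting it before the import is the safe order.
os.environ["TORCH_USE_MUSA_DOUBLE_CAST"] = "1"


def float64_cumsum_on_musa():
    """Run a float64 op on a MUSA device; return None if none is usable."""
    try:
        import torch
        import torch_musa  # noqa: F401  (registers the "musa" device)
    except ImportError:
        return None
    if not torch.musa.is_available():
        return None
    # cumsum is only an illustrative operator (hypothetical coverage).
    x = torch.ones(4, dtype=torch.float64, device="musa")
    return x.cumsum(0).cpu()


if __name__ == "__main__":
    print(float64_cumsum_on_musa())
```

Because the computation runs in float32, results may carry less precision than a true float64 computation, so the flag is probably best reserved for workloads that can tolerate that.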

Distributed Checkpointing

We enabled Distributed Checkpoint, including asynchronous checkpoint save, which supports saving and loading models from multiple ranks in parallel. It can significantly accelerate the saving and loading of checkpoints;

MUSAExtension 'load'

We support the "load" method for compiling MUSA extensions on the fly. This is useful for third-party libraries that can be installed on many platforms: whether the kernels are compiled during execution depends on the platform environment;

Enhancements

Operators

  • We added operators including Poisson, binomial, _standard_gamma, _sample_dirichlet, vdot, upsample (1d, 2d, 3d, with aa), flash_attention, and transformer_encoder_layer; the number of supported MUSA-specific operators is now over 1050;
  • We improved the stability of the profiler (kineto) and upgraded the musified kineto to version 2.7.0;
  • We optimized memory usage for pipeline parallelism in FSDP2;
  • We added support for more quantized operators, which can be used in our model compression toolkit (to be released soon);

Features

  • torch.compile and AOTInductor are both enhanced by the torch upgrade;
  • TF32 is enabled by default;
  • We keep improving the stability of torch_musa by fixing potential bugs in MUSA kernels;

Known Issues

  • Some FFT operators are currently worked around by offloading to the CPU; this will be fixed in the next release.

Enjoy.