Enable torch.autocast with ZeRO #6993


Merged
89 commits merged into master on Jun 19, 2025

Conversation

@tohtana (Contributor) commented Feb 3, 2025

DeepSpeed supports mixed precision training, but its behavior differs from torch.autocast. DeepSpeed maintains parameters and gradients in both FP32 and a lower precision (FP16/BF16), in the style of NVIDIA Apex AMP, and computes all modules in the lower precision, while torch.autocast keeps parameters in FP32 and computes only certain operators in the lower precision.
This leads to differences in:

  • performance: torch.autocast needs to downcast in forward/backward
  • memory usage: DeepSpeed needs more memory to keep copies of parameters and gradients in the lower precision
  • accuracy: torch.autocast has a list of operators that can safely be computed in lower precision; precision-sensitive operators (e.g. softmax) are computed in FP32

To align DeepSpeed's behavior with torch.autocast when necessary, this PR adds integration of torch.autocast with ZeRO. Here is an example of the configuration:

"torch_autocast": {
  "enabled": true,
  "dtype": "bfloat16",
  "lower_precision_safe_modules": ["torch.nn.Linear", "torch.nn.Conv2d"]
}

Each configuration item works as follows:

  • enabled: Enables the torch.autocast integration when set to true. You don't need to call torch.autocast in your code; the grad scaler is also applied inside the DeepSpeed optimizer.
  • dtype: The lower-precision dtype passed to torch.autocast. Gradients for allreduce (reduce-scatter) and parameters for allgather (ZeRO3 only) of lower_precision_safe_modules are also downcast to this dtype.
  • lower_precision_safe_modules: Downcasting for allreduce (reduce-scatter) and allgather (ZeRO3) is applied only to modules specified in this list. (The precision of PyTorch operators in forward/backward follows torch.autocast's policy, not this list.) Specify class names with their packages. If you don't set this item, DeepSpeed uses the default list: [torch.nn.Linear, torch.nn.Conv1d, torch.nn.Conv2d, torch.nn.Conv3d].

Note that only FP32 parameters are maintained when this feature is enabled. For consistency, you cannot enable fp16 or bf16 in the DeepSpeed config at the same time.
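
Below is a minimal usage sketch under these assumptions: only the `torch_autocast` block mirrors the configuration above, while the model, data loader, and surrounding config keys are hypothetical placeholders, not part of this PR.

```python
import deepspeed

# Hypothetical training setup; only the "torch_autocast" block mirrors this PR's config.
ds_config = {
    "train_micro_batch_size_per_gpu": 2,
    "zero_optimization": {"stage": 3},
    "torch_autocast": {
        "enabled": True,
        "dtype": "bfloat16",
        "lower_precision_safe_modules": ["torch.nn.Linear", "torch.nn.Conv2d"],
    },
    # Do not also enable "fp16" or "bf16"; parameters are kept in FP32.
}

model = MyModel()  # hypothetical user-defined torch.nn.Module
engine, optimizer, _, _ = deepspeed.initialize(
    model=model, model_parameters=model.parameters(), config=ds_config
)

for inputs, labels in dataloader:  # hypothetical data loader
    loss = engine(inputs, labels)  # no explicit torch.autocast context needed
    engine.backward(loss)          # grad scaling is handled by the DeepSpeed optimizer
    engine.step()
```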

tjruwase and others added 30 commits February 28, 2025 22:53
Fix #6772

---------

Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Signed-off-by: Masahiro Tanaka <mtanaka@microsoft.com>
Signed-off-by: Masahiro Tanaka <mtanaka@microsoft.com>
Signed-off-by: Masahiro Tanaka <mtanaka@microsoft.com>
Signed-off-by: Masahiro Tanaka <mtanaka@microsoft.com>
Signed-off-by: Masahiro Tanaka <mtanaka@microsoft.com>
…#6967)

- Issues with nv-sd updates, will follow up with a subsequent PR

Signed-off-by: Masahiro Tanaka <mtanaka@microsoft.com>
Signed-off-by: Masahiro Tanaka <mtanaka@microsoft.com>
Signed-off-by: Masahiro Tanaka <mtanaka@microsoft.com>
The NVIDIA Blackwell GPU generation has compute capability 10. The SM code and architecture should be `100`, but the current code generates `1.` because it expects a 2-character string.

This change modifies the logic to treat the compute capability as a string containing a `.`, splitting on it and using the resulting parts.
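
For illustration, a minimal sketch of the parsing change described above (the function and variable names are illustrative, not the actual build-script code):

```python
def arch_code(compute_capability: str) -> str:
    # Split on "." and join the parts, so "8.0" -> "80" and "10.0" -> "100",
    # instead of taking the first two characters (which turns "10.0" into "1.").
    return "".join(compute_capability.split("."))

assert arch_code("8.0") == "80"
assert arch_code("10.0") == "100"
```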

Signed-off-by: Fabien Dupont <fdupont@redhat.com>
Signed-off-by: Masahiro Tanaka <mtanaka@microsoft.com>
Signed-off-by: Olatunji Ruwase <olruwase@microsoft.com>
Signed-off-by: Logan Adams <loadams@microsoft.com>
Signed-off-by: Fabien Dupont <fdupont@redhat.com>
Co-authored-by: Fabien Dupont <fabiendupont@fabiendupont.fr>
Signed-off-by: Masahiro Tanaka <mtanaka@microsoft.com>
Signed-off-by: Masahiro Tanaka <mtanaka@microsoft.com>
Signed-off-by: Masahiro Tanaka <mtanaka@microsoft.com>
1. update intel oneAPI basekit to 2025.0
2. update torch/ipex/oneccl to 2.5

Signed-off-by: Masahiro Tanaka <mtanaka@microsoft.com>
Same as [this PR](#6922).
[affeb88](affeb88)
I noticed the CI updated the DCO check recently. Using the suggested
rebase method for sign-off would reintroduce many conflicts, so I opted
for a squash merge with sign-off instead. Thanks :)

Signed-off-by: inkcherry <mingzhi.liu@intel.com>
Signed-off-by: Masahiro Tanaka <mtanaka@microsoft.com>
Signed-off-by: Masahiro Tanaka <mtanaka@microsoft.com>
Those files have code that runs at import time, so on systems that don't support triton but have it installed, this causes issues.

In general, I think it is better to import triton when it is installed
and supported.
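
A minimal sketch of the deferred/guarded import pattern this commit describes (module and function names are placeholders, not DeepSpeed's actual layout):

```python
# Guard the triton import so that import-time side effects on systems where
# triton is installed but unsupported do not break importing this module.
try:
    import triton  # noqa: F401
    HAS_TRITON = True
except Exception:
    HAS_TRITON = False


def run_triton_kernel(*args, **kwargs):
    if not HAS_TRITON:
        raise RuntimeError("Triton is unavailable or unsupported on this system.")
    from my_triton_kernels import fused_kernel  # placeholder for the real kernel module
    return fused_kernel(*args, **kwargs)
```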

Signed-off-by: Omar Elayan <oelayan@habana.ai>
Signed-off-by: Masahiro Tanaka <mtanaka@microsoft.com>
Signed-off-by: Logan Adams <loadams@microsoft.com>
Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>
Signed-off-by: Masahiro Tanaka <mtanaka@microsoft.com>
Fix #7014
Avoid naming collision on `partition()`

---------

Signed-off-by: Olatunji Ruwase <olruwase@microsoft.com>
Signed-off-by: Masahiro Tanaka <mtanaka@microsoft.com>
Fix typos

Signed-off-by: Masahiro Tanaka <mtanaka@microsoft.com>
BUGFIX for Apple Silicon hostname
#6497

---------

Signed-off-by: Fabien Dupont <fdupont@redhat.com>
Signed-off-by: Olatunji Ruwase <olruwase@microsoft.com>
Signed-off-by: Logan Adams <loadams@microsoft.com>
Signed-off-by: inkcherry <mingzhi.liu@intel.com>
Signed-off-by: Roman Fitzjalen <romaactor@gmail.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Fabien Dupont <fabiendupont@fabiendupont.fr>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Liangliang Ma <1906710196@qq.com>
Co-authored-by: inkcherry <mingzhi.liu@intel.com>
Signed-off-by: Masahiro Tanaka <mtanaka@microsoft.com>
- Update existing workflows that use cu121 to cu124. Note that where we download the latest torch, we will now get torch 2.6 rather than torch 2.5, which was the latest version provided with CUDA 12.1.
- Note that nv-nightly is currently failing in master due to unrelated errors, so it can be ignored in this PR (nv-nightly was tested locally, where it passes with both 12.1 and 12.4).

---------

Signed-off-by: Fabien Dupont <fdupont@redhat.com>
Signed-off-by: Logan Adams <loadams@microsoft.com>
Signed-off-by: Olatunji Ruwase <olruwase@microsoft.com>
Signed-off-by: inkcherry <mingzhi.liu@intel.com>
Signed-off-by: Omar Elayan <oelayan@habana.ai>
Co-authored-by: Fabien Dupont <fabiendupont@fabiendupont.fr>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Liangliang Ma <1906710196@qq.com>
Co-authored-by: inkcherry <mingzhi.liu@intel.com>
Co-authored-by: Omar Elayan <142979319+oelayan7@users.noreply.github.com>
Signed-off-by: Masahiro Tanaka <mtanaka@microsoft.com>
This change is required to successfully build fp_quantizer extension on
ROCm.

---------

Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Signed-off-by: Masahiro Tanaka <mtanaka@microsoft.com>
cc @tjruwase @jomayeri

---------

Co-authored-by: root <root@ftqtmec25000000.taxzvufipdhelhupulxcbvr15f.ux.internal.cloudapp.net>
Signed-off-by: Masahiro Tanaka <mtanaka@microsoft.com>
Fix #7029
- Add Chinese blog for deepspeed windows
- Fix format in README.md

Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Signed-off-by: Masahiro Tanaka <mtanaka@microsoft.com>
Adding compile support for AIO library on AMD GPUs.

---------

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Signed-off-by: Masahiro Tanaka <mtanaka@microsoft.com>
Make trace cache warnings configurable, and disable them by default.

Fix #6985, #4081, #5033, #5006, #5662

---------

Signed-off-by: Olatunji Ruwase <olruwase@microsoft.com>
Signed-off-by: Masahiro Tanaka <mtanaka@microsoft.com>
Update the CUDA compute capability list for cross-compilation according to the wiki page:
https://en.wikipedia.org/wiki/CUDA#GPUs_supported

---------

Signed-off-by: Hongwei <hongweichen@microsoft.com>
Signed-off-by: Masahiro Tanaka <mtanaka@microsoft.com>
Signed-off-by: Masahiro Tanaka <mtanaka@microsoft.com>
Signed-off-by: Logan Adams <loadams@microsoft.com>
Signed-off-by: Masahiro Tanaka <mtanaka@microsoft.com>
Propagate API change.

Signed-off-by: Olatunji Ruwase <olruwase@microsoft.com>
Signed-off-by: Masahiro Tanaka <mtanaka@microsoft.com>
Signed-off-by: Masahiro Tanaka <mtanaka@microsoft.com>
@tohtana (Contributor, Author) commented Apr 22, 2025

Hi @stas00, I tried adding detection of nested autocast. This validation is called before the engine's forward pass.

I measured the default reduce_bucket_size 5e8 to consume 4GB peak memory usage when comms are in fp32, and only 1GB in bf16.

I see, I didn't know this behavior. It seems very weird that they allocate an additional buffer only for FP32, not for BF16. Perhaps this is a separate topic from this PR, but I will investigate it more when I have a chance.
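
For reference, a rough back-of-the-envelope check of those numbers (the assumption that the 4GB peak corresponds to roughly two fp32-sized buffers being alive at once is mine, not confirmed in this thread):

```python
# Rough bucket-size arithmetic for the default reduce_bucket_size.
elements = int(5e8)              # default reduce_bucket_size (number of elements)
fp32_gib = elements * 4 / 2**30  # ~1.86 GiB per fp32 bucket
bf16_gib = elements * 2 / 2**30  # ~0.93 GiB per bf16 bucket
print(f"fp32 bucket ~{fp32_gib:.2f} GiB, bf16 bucket ~{bf16_gib:.2f} GiB")
# Two fp32-sized allocations would roughly match the observed 4GB peak; one bf16
# bucket roughly matches the observed 1GB.
```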

@sfc-gh-sbekman (Contributor) commented

Thank you for looking into it, Masahiro. No problem doing it elsewhere.

Using the torch memory profiler will be very helpful for seeing the reduction memory spikes:

https://pytorch.org/blog/understanding-gpu-memory-1/ - it's very easy to set up; if you need help, please let me know.
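
A minimal sketch of the snapshot workflow described in that blog post (the training-step helper is a hypothetical placeholder; the snapshot APIs are the ones the post documents for recent PyTorch releases):

```python
import torch

# Record allocator history, run a few training steps, then dump a snapshot that can be
# opened at https://pytorch.org/memory_viz to inspect the reduction memory spikes.
torch.cuda.memory._record_memory_history(max_entries=100_000)

run_a_few_training_steps()  # hypothetical: a few forward/backward/step iterations

torch.cuda.memory._dump_snapshot("mem_snapshot.pickle")
torch.cuda.memory._record_memory_history(enabled=None)  # stop recording
```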

@sfc-gh-sbekman (Contributor) commented

Good, I can add an assertion to detect that torch.autocast is enabled outside of DeepSpeed but ds_config doesn't set torch_autocast's enabled. Or it might be better to automatically enable it.

If it has to be on and it breaks nothing then automatically enabling it is probably a better idea to help with ease of use.

@tohtana (Contributor, Author) commented May 23, 2025

Hi @stas00,
Thank you for your feedback!

If it has to be on and it breaks nothing then automatically enabling it is probably a better idea to help with ease of use.

After reviewing the design, I now feel automatically enabling it wouldn't be straightforward. This autocast feature sets some flags on parameters before the optimizer is initialized. However, we only know whether torch.autocast is enabled just before a forward pass, since with torch.autocast(...) is typically placed to wrap the forward call. Reinitializing parts of the optimizer at that point would complicate the code.

Given that, I think it's better to throw an error with an explanation.
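
A hypothetical illustration of such a check (not the actual DeepSpeed code; the function and parameter names are made up):

```python
import torch

def validate_autocast_config(ds_torch_autocast_enabled: bool) -> None:
    # If the user wrapped forward in torch.autocast but did not enable the
    # "torch_autocast" section in the DeepSpeed config, fail loudly before forward.
    if torch.is_autocast_enabled() and not ds_torch_autocast_enabled:
        raise RuntimeError(
            "torch.autocast is active outside DeepSpeed, but 'torch_autocast' is not "
            "enabled in the DeepSpeed config. Enable it in the config instead of "
            "wrapping the forward call manually."
        )
```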

@stas00 (Collaborator) commented May 23, 2025

Then assert is the way to go, Masahiro

@tohtana (Contributor, Author) commented May 24, 2025

Then assert is the way to go, Masahiro

Thank you @stas00, then can you approve this PR?

@stas00 (Collaborator) commented May 27, 2025

Hmm, I can't just hit approve; that would defeat the purpose of doing the review.

We have only discussed one small aspect of this PR, which has been resolved, but I don't know the rest of the PR, and I'm currently rushing to finish the porting of Ulysses to Hf/DS, so until that is done I won't have time to do a serious review.

@tohtana tohtana enabled auto-merge (squash) June 19, 2025 20:23
@tohtana tohtana merged commit ed5f737 into master Jun 19, 2025
12 checks passed
@tohtana tohtana deleted the tohtana/support_autocast branch June 19, 2025 21:36
tohtana added a commit that referenced this pull request Jun 22, 2025
#6993 broke many code paths in the ZeRO1/2 optimizer. This PR fixes most of the issues that PR caused. Currently we still have one error in the tests under
`unit/runtime/zero`.

```
====================================== short test summary info ======================================
FAILED test_zero.py::TestParamPartitioningSkipInit::test[dtype1] - RuntimeError: mat1 and mat2 must have the same dtype, but got Half and BFloat16
========= 1 failed, 204 passed, 66 skipped, 15 deselected, 5 warnings in 2305.03s (0:38:25) =========
```

---------

Signed-off-by: Masahiro Tanaka <mtanaka@microsoft.com>