
Add Automatic Mixed Precision (AMP) support for derivatives. #160

Merged: 8 commits merged into NVIDIA:main on Jul 23, 2024

Conversation

@Alexey-Kamenev (Collaborator)

Modulus Pull Request

Description

Training physics-informed neural networks (PINNs) with Automatic Mixed Precision (AMP) currently leads to infinite loss for several models. The cause is the higher-order derivatives, which can exceed the FP16 dynamic range (5.96e-8 to 65504). For example, training of the lid-driven cavity problem terminated around step 3000 because u__y__y at the corner exceeded the FP16 maximum of 65504.

The default AMP GradScaler tracks the model parameter gradients, but not the derivatives computed with torch.autograd.grad.
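
For context, here is a minimal PINN-style sketch of this failure mode (the tiny network and variable names are illustrative, not the actual Modulus Sym models): the derivatives produced by torch.autograd.grad inside the autocast region are reduced-precision tensors that GradScaler never inspects, so u__y__y can overflow on its own while the parameter-gradient bookkeeping looks healthy.

import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
net = torch.nn.Sequential(
    torch.nn.Linear(2, 64), torch.nn.SiLU(), torch.nn.Linear(64, 1)
).to(device)
opt = torch.optim.Adam(net.parameters())
scaler = torch.cuda.amp.GradScaler(enabled=(device == "cuda"))

xy = torch.rand(1024, 2, device=device, requires_grad=True)
with torch.autocast(device_type=device, dtype=torch.float16,
                    enabled=(device == "cuda")):
    u = net(xy)
    # Derivatives taken with torch.autograd.grad inside autocast are computed
    # in reduced precision; large second derivatives can exceed FP16's 65504.
    grad_u = torch.autograd.grad(u.sum(), xy, create_graph=True)[0]
    u__y = grad_u[:, 1:2]
    u__y__y = torch.autograd.grad(u__y.sum(), xy, create_graph=True)[0][:, 1:2]
    loss = (u__y__y ** 2).mean()

# GradScaler scales the loss and checks the *parameter* gradients only; the
# intermediate derivatives above are outside its bookkeeping.
scaler.scale(loss).backward()
scaler.step(opt)
scaler.update()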

DerivScaler

The derivatives need their own scaler because derivatives and network parameter gradients occupy different dynamic ranges.
The dynamic range of FP16 spans 2^-24 to 2^15 (about 40 powers of 2).
The ranges below differ from problem to problem and are listed only for reference (see the snippet after this list):

  • Typical weight gradient range: 2^-40 to 2^-10
  • Typical first order derivative range: 2^-10 to 2^5
  • Typical second order derivative range: 2^0 to 2^20
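
For reference, these FP16 limits can be checked directly with PyTorch (a small sketch; the subnormal minimum is computed by hand rather than read from torch.finfo):

import torch

fi = torch.finfo(torch.float16)
print(fi.max)      # 65504.0 = 2**15 * (2 - 2**-10), the overflow point hit by u__y__y
print(fi.tiny)     # 6.1035e-05 = 2**-14, smallest normal FP16 value
print(2.0 ** -24)  # ~5.96e-08, smallest subnormal, the lower end of the quoted range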

This PR adds a DerivScaler which:

  1. scales and unscales the derivatives during the forward pass in the derivative node, so that the operations performed in FP16 stay within a well-representable range;
  2. checks for INFs/NaNs when unscaling the derivatives. When INFs/NaNs are detected, the iteration is skipped and the scale value is adjusted.
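
A minimal sketch of this scale/unscale-and-check pattern (the class and method names are illustrative and do not match the actual DerivScaler API added in modulus/sym/amp.py; a real scaler would also grow only after a number of successful steps, as torch.cuda.amp.GradScaler does):

import torch

class SimpleDerivScaler:
    """Illustrative scaler for derivatives obtained via torch.autograd.grad."""

    def __init__(self, init_scale=2.0 ** 8, growth=2.0, backoff=0.5):
        self.scale = init_scale
        self.growth = growth
        self.backoff = backoff
        self.found_inf = False

    def scale_output(self, u: torch.Tensor) -> torch.Tensor:
        # Scaling the network output before autograd.grad scales every
        # derivative taken from it, keeping FP16 intermediates in range.
        return u * self.scale

    def unscale_deriv(self, deriv: torch.Tensor) -> torch.Tensor:
        # Unscale in FP32 and record whether the derivative already blew up.
        deriv = deriv.float() / self.scale
        if not torch.isfinite(deriv).all():
            self.found_inf = True
        return deriv

    def update(self) -> bool:
        # Returns True if this iteration should be skipped; backs the scale
        # off on overflow and (naively, every step here) grows it otherwise.
        skip = self.found_inf
        self.scale *= self.backoff if skip else self.growth
        self.found_inf = False
        return skip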

It supports the following features:

  • Per-derivative-order scalers (default)
  • Per-derivative-term scalers
  • Control over the scaling factors
    • Avoid a very low derivative scaling factor:
      • Use a fused activation, or disable autocast for the activation, to avoid overflow in intermediate results.
      • Use FP32 for the first layer that produces the high-order derivatives; this layer always overflows, and keeping it in FP32 does not introduce a significant performance drop.
    • Avoid the scaling factor decreasing too fast, which is useful for Fourier neural networks:
      • Enter a “recover mode” when the scaling factor drops below a predefined threshold; in this mode the scaling factor grows more frequently (sketched after this list).
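
A rough sketch of the “recover mode” rule from the last bullet (the threshold, intervals, and growth/backoff factors here are illustrative assumptions, not the values used in this PR):

def next_scale(scale, found_inf, good_steps,
               growth_interval=2000, recover_threshold=2.0 ** -8,
               recover_growth_interval=100):
    # Back off immediately on overflow, as a regular GradScaler would.
    if found_inf:
        return scale * 0.5, 0
    # In "recover mode" (scale below the threshold) grow much more frequently,
    # so the scaler climbs back up before too many steps lose precision.
    interval = recover_growth_interval if scale < recover_threshold else growth_interval
    good_steps += 1
    if good_steps >= interval:
        return scale * 2.0, 0
    return scale, good_steps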

For more details, please refer to this publication.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.
  • The CHANGELOG.md is up to date with these changes.
  • An issue is linked to this pull request.

Dependencies

@Alexey-Kamenev (Collaborator, Author)

/blossom-ci

@ktangsali (Collaborator)

/blossom-ci

Several similar /blossom-ci CI-trigger comments from @Alexey-Kamenev and @ktangsali followed.

@jinzex left a comment

LGTM.

Considering that FP16 (E5M10) has the same dynamic range as FP8 (E5M2), and that the FP8 recipe has moved to per-tensor scaling instead of global loss scaling, if we are interested in a more robust recipe, future directions might include:

  1. Recommending per_term_scaler instead of per_order_scaler by default.
  2. Per-tensor scaling for FP16, so that the full FP16 dynamic range is available to each forward and backward GEMM (see the sketch after this list). This would require FP16 GEMMs that can apply scaling factors to the input tensors A and B, and would introduce some overhead.
  3. Exploring an FP8 recipe with dynamic scaling.
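
As a rough illustration of the per-tensor idea in point 2 (a sketch only: the function name and the conservative target amax are assumptions, and a production recipe would pass the scales to the GEMM kernel and accumulate in higher precision rather than emulating it like this):

import torch

def scaled_fp16_matmul(a, b, target_amax=1.0):
    # Per-tensor scales derived from the current absolute maxima, so each
    # input is brought into a safe FP16 range before the cast.
    scale_a = target_amax / a.abs().max().clamp(min=1e-12)
    scale_b = target_amax / b.abs().max().clamp(min=1e-12)
    out_fp16 = (a * scale_a).half() @ (b * scale_b).half()
    # Undo both scales in FP32 to recover the unscaled result.
    return out_fp16.float() / (scale_a * scale_b)

# Usage sketch (FP16 matmul may require a GPU or a recent PyTorch CPU build):
a = torch.randn(64, 128) * 1e4
b = torch.randn(128, 32) * 1e-6
print(scaled_fp16_matmul(a, b).shape)  # torch.Size([64, 32])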

Review threads (outdated, resolved): modulus/sym/amp.py (two threads), modulus/sym/domain/domain.py, modulus/sym/eq/derivatives.py
@Alexey-Kamenev (Collaborator, Author)

Thank you for the review, @jinzex! I'll add your future directions suggestions to my list of items to look at for the next release.

@Alexey-Kamenev (Collaborator, Author)

/blossom-ci

@ktangsali (Collaborator) left a comment

Looks great! We should probably add some docs about the usage of this in a separate PR.

This can be enabled by setting (in config):

amp:
  enabled: true

@ktangsali ktangsali merged commit f90f2fe into NVIDIA:main Jul 23, 2024
1 check passed