
Add Automatic Mixed Precision (AMP) support for derivatives. #160

Merged: 8 commits merged into NVIDIA:main on Jul 23, 2024

Conversation

@Alexey-Kamenev (Collaborator)

Modulus Pull Request

Description

Training physics-informed neural networks (PINNs) with Automatic Mixed Precision (AMP) currently leads to infinite loss for several models. The cause is the higher-order derivatives, which can exceed the FP16 dynamic range (5.96e-8 to 65504). For example, training of the lid-driven cavity problem terminated around step 3000 because u__y__y at the corner exceeded the FP16 maximum of 65504.

The default AMP GradScaler tracks the model parameter gradients, but not the derivatives computed with torch.autograd.grad.
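
For context, here is a minimal PINN-style sketch of this failure mode (the tiny network and variable names are illustrative, not the actual Modulus Sym models): the derivatives produced by torch.autograd.grad inside the autocast region are reduced-precision tensors that GradScaler never inspects, so u__y__y can overflow on its own while the parameter-gradient bookkeeping looks healthy.

import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
net = torch.nn.Sequential(
    torch.nn.Linear(2, 64), torch.nn.SiLU(), torch.nn.Linear(64, 1)
).to(device)
opt = torch.optim.Adam(net.parameters())
scaler = torch.cuda.amp.GradScaler(enabled=(device == "cuda"))

xy = torch.rand(1024, 2, device=device, requires_grad=True)
with torch.autocast(device_type=device, dtype=torch.float16,
                    enabled=(device == "cuda")):
    u = net(xy)
    # Derivatives taken with torch.autograd.grad inside autocast are computed
    # in reduced precision; large second derivatives can exceed FP16's 65504.
    grad_u = torch.autograd.grad(u.sum(), xy, create_graph=True)[0]
    u__y = grad_u[:, 1:2]
    u__y__y = torch.autograd.grad(u__y.sum(), xy, create_graph=True)[0][:, 1:2]
    loss = (u__y__y ** 2).mean()

# GradScaler scales the loss and checks the *parameter* gradients only; the
# intermediate derivatives above are outside its bookkeeping.
scaler.scale(loss).backward()
scaler.step(opt)
scaler.update()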

DerivScaler

The derivatives need their own scaler because derivatives and network parameter gradients occupy different dynamic ranges.
The dynamic range of FP16 spans 2^-24 to 2^15 (about 40 powers of 2).
The ranges below differ from problem to problem and are listed only for reference (see the snippet after this list):

  • Typical weight gradient range: 2^-40 to 2^-10
  • Typical first order derivative range: 2^-10 to 2^5
  • Typical second order derivative range: 2^0 to 2^20
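
For reference, these FP16 limits can be checked directly with PyTorch (a small sketch; the subnormal minimum is computed by hand rather than read from torch.finfo):

import torch

fi = torch.finfo(torch.float16)
print(fi.max)      # 65504.0 = 2**15 * (2 - 2**-10), the overflow point hit by u__y__y
print(fi.tiny)     # 6.1035e-05 = 2**-14, smallest normal FP16 value
print(2.0 ** -24)  # ~5.96e-08, smallest subnormal, the lower end of the quoted range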

This PR adds a DerivScaler which:

  1. scales and unscales the derivatives during the forward pass in the derivative node, so that the operations performed in FP16 stay within a well-representable range;
  2. checks for INFs/NaNs when unscaling the derivatives. When INFs/NaNs are detected, the iteration is skipped and the scale value is adjusted.
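
A minimal sketch of this scale/unscale-and-check pattern (the class and method names are illustrative and do not match the actual DerivScaler API added in modulus/sym/amp.py; a real scaler would also grow only after a number of successful steps, as torch.cuda.amp.GradScaler does):

import torch

class SimpleDerivScaler:
    """Illustrative scaler for derivatives obtained via torch.autograd.grad."""

    def __init__(self, init_scale=2.0 ** 8, growth=2.0, backoff=0.5):
        self.scale = init_scale
        self.growth = growth
        self.backoff = backoff
        self.found_inf = False

    def scale_output(self, u: torch.Tensor) -> torch.Tensor:
        # Scaling the network output before autograd.grad scales every
        # derivative taken from it, keeping FP16 intermediates in range.
        return u * self.scale

    def unscale_deriv(self, deriv: torch.Tensor) -> torch.Tensor:
        # Unscale in FP32 and record whether the derivative already blew up.
        deriv = deriv.float() / self.scale
        if not torch.isfinite(deriv).all():
            self.found_inf = True
        return deriv

    def update(self) -> bool:
        # Returns True if this iteration should be skipped; backs the scale
        # off on overflow and (naively, every step here) grows it otherwise.
        skip = self.found_inf
        self.scale *= self.backoff if skip else self.growth
        self.found_inf = False
        return skip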

It supports the following features:

  • Per-derivative-order scalers (default)
  • Per-derivative-term scalers
  • Control over the scaling factors
    • Avoid a very low derivative scaling factor:
      • Use a fused activation, or disable autocast for the activation, to avoid overflow in intermediate results.
      • Use FP32 for the first layer that produces the high-order derivatives; this layer always overflows, and keeping it in FP32 does not introduce a significant performance drop.
    • Avoid the scaling factor decreasing too fast, which is useful for Fourier neural networks:
      • Enter a “recover mode” when the scaling factor drops below a predefined threshold; in this mode the scaling factor grows more frequently (sketched after this list).
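
A rough sketch of the “recover mode” rule from the last bullet (the threshold, intervals, and growth/backoff factors here are illustrative assumptions, not the values used in this PR):

def next_scale(scale, found_inf, good_steps,
               growth_interval=2000, recover_threshold=2.0 ** -8,
               recover_growth_interval=100):
    # Back off immediately on overflow, as a regular GradScaler would.
    if found_inf:
        return scale * 0.5, 0
    # In "recover mode" (scale below the threshold) grow much more frequently,
    # so the scaler climbs back up before too many steps lose precision.
    interval = recover_growth_interval if scale < recover_threshold else growth_interval
    good_steps += 1
    if good_steps >= interval:
        return scale * 2.0, 0
    return scale, good_steps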

For more details, please refer to this publication.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.
  • The CHANGELOG.md is up to date with these changes.
  • An issue is linked to this pull request.

Dependencies

@Alexey-Kamenev (Collaborator, Author)

/blossom-ci

@ktangsali (Collaborator)

/blossom-ci

Several similar /blossom-ci CI-trigger comments from @Alexey-Kamenev and @ktangsali followed.

@jinzex left a comment

LGTM.

Considering that FP16 (E5M10) has the same dynamic range as FP8 (E5M2), and that the FP8 recipe has moved to per-tensor scaling instead of global loss scaling, if we are interested in a more robust recipe, future directions might include:

  1. Recommending per_term_scaler instead of per_order_scaler by default.
  2. Per-tensor scaling for FP16, so that the full FP16 dynamic range is available to each forward and backward GEMM (see the sketch after this list). This would require FP16 GEMMs that can apply scaling factors to the input tensors A and B, and would introduce some overhead.
  3. Exploring an FP8 recipe with dynamic scaling.
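
As a rough illustration of the per-tensor idea in point 2 (a sketch only: the function name and the conservative target amax are assumptions, and a production recipe would pass the scales to the GEMM kernel and accumulate in higher precision rather than emulating it like this):

import torch

def scaled_fp16_matmul(a, b, target_amax=1.0):
    # Per-tensor scales derived from the current absolute maxima, so each
    # input is brought into a safe FP16 range before the cast.
    scale_a = target_amax / a.abs().max().clamp(min=1e-12)
    scale_b = target_amax / b.abs().max().clamp(min=1e-12)
    out_fp16 = (a * scale_a).half() @ (b * scale_b).half()
    # Undo both scales in FP32 to recover the unscaled result.
    return out_fp16.float() / (scale_a * scale_b)

# Usage sketch (FP16 matmul may require a GPU or a recent PyTorch CPU build):
a = torch.randn(64, 128) * 1e4
b = torch.randn(128, 32) * 1e-6
print(scaled_fp16_matmul(a, b).shape)  # torch.Size([64, 32])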

Review threads (outdated, resolved): modulus/sym/amp.py (two threads), modulus/sym/domain/domain.py, modulus/sym/eq/derivatives.py
@Alexey-Kamenev (Collaborator, Author)

Thank you for the review, @jinzex! I'll add your future directions suggestions to my list of items to look at for the next release.

@Alexey-Kamenev (Collaborator, Author)

/blossom-ci

@ktangsali (Collaborator) left a comment

Looks great! We should probably add some docs about the usage of this in a separate PR.

This can be enabled by setting (in config):

amp:
  enabled: true

@ktangsali ktangsali merged commit f90f2fe into NVIDIA:main Jul 23, 2024
1 check passed