Add Automatic Mixed Precision (AMP) support for derivatives. #160
Conversation
LGTM.
Considering that FP16 (E5M10) has the same dynamic range as FP8 (E5M2), and that the FP8 recipe has moved to per-tensor scaling instead of global loss scaling, future directions for a more robust recipe might include:
- Recommend `per_term_scaler` instead of `per_order_scaler` by default.
- Per-tensor scaling (FP16) so that we can use the full dynamic range of FP16 for each forward and backward GEMM. This will require FP16 GEMMs that can apply scaling factors to the input tensors A and B, which will introduce some overhead (see the sketch after this list).
- Explore FP8 recipe with dynamic scaling.
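To illustrate the per-tensor scaling idea, here is a minimal sketch (a hypothetical helper, not an existing Modulus or PyTorch API) that normalizes each GEMM operand by its own scale factor so the FP16 multiplication stays in range, then restores the scale in FP32:

```python
import torch

def scaled_fp16_matmul(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """Hypothetical per-tensor-scaled FP16 GEMM: scale A and B individually,
    run the GEMM in FP16, and undo both scales on the FP32 result."""
    # One scale per tensor, chosen from that tensor's own magnitude.
    s_a = a.abs().max().clamp(min=1e-12)
    s_b = b.abs().max().clamp(min=1e-12)
    c = (a / s_a).half() @ (b / s_b).half()
    return c.float() * (s_a * s_b)

# Example: operands with very different magnitudes still multiply safely.
a = torch.randn(64, 64, device="cuda") * 1e-6
b = torch.randn(64, 64, device="cuda") * 1e4
print(scaled_fp16_matmul(a, b).abs().max())
```

A production version would fold the scaling factors into the GEMM itself rather than materializing scaled copies, which is the overhead mentioned above.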
Thank you for the review, @jinzex! I'll add your future-direction suggestions to my list of items to look at for the next release.
Looks great! We should probably add some docs about the usage of this in a separate PR.
This can be enabled by setting the following in the config:

```yaml
amp:
  enabled: true
```
Modulus Pull Request
Description
Training Physics-Informed Neural Networks with Automatic Mixed Precision (AMP) currently leads to infinite loss for several models. This comes from the higher-order derivatives, which can go beyond the FP16 dynamic range (`5.96e-8` to `65504`). For example, training of the lid-driven cavity problem was terminated around 3000 steps because `u__y__y` at the corner exceeds the FP16 maximum value of `65504`. The default AMP `GradScaler` tracks the model parameter gradients but not the derivatives computed with `torch.autograd.grad`.
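For context, here is a minimal sketch of a standard AMP step for a PINN-style loss (the model and loss are placeholders, not the Modulus training loop): `GradScaler` only scales the loss that produces the parameter gradients, while the intermediate derivatives from `torch.autograd.grad` run through FP16 computations inside `autocast` and can overflow on their own.

```python
import torch

model = torch.nn.Sequential(
    torch.nn.Linear(2, 64), torch.nn.Tanh(), torch.nn.Linear(64, 1)
).cuda()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()

xy = torch.rand(1024, 2, device="cuda", requires_grad=True)

for step in range(100):
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():
        u = model(xy)
        # Higher-order derivatives for the PDE residual. The backward passes
        # below go through FP16 GEMMs, so u__y and u__y__y can exceed the
        # FP16 maximum (65504) -- GradScaler never sees these tensors.
        u__y = torch.autograd.grad(u.sum(), xy, create_graph=True)[0][:, 1:2]
        u__y__y = torch.autograd.grad(u__y.sum(), xy, create_graph=True)[0][:, 1:2]
        loss = (u__y__y ** 2).mean()  # stand-in for a PDE residual term
    # Only the loss and the resulting parameter gradients are scaled
    # and checked for infs/NaNs here.
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
```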
.DerivScaler
The derivatives need to be tracked by another scaler because the derivatives and NN parameter gradients have different dynamic ranges.
The dynamic range of FP16 is from 2^-24 to 2^15 (40 powers of 2).
The following ranges differ from problem to problem and are given for reference only.
This PR adds a `DerivScaler` that tracks and scales the derivatives computed with `torch.autograd.grad`, analogous to how the AMP `GradScaler` handles the parameter gradients.
It supports the following features:
For more details, please refer to this publication.
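As a rough conceptual sketch (a hypothetical helper, not the `DerivScaler` API added in this PR), a derivative scaler applies the loss-scaling idea to `torch.autograd.grad`: scale the quantity being differentiated so its FP16 backward pass stays in range, unscale the result in FP32, and reduce the scale when non-finite values appear.

```python
import torch

class ToyDerivScaler:
    """Hypothetical derivative scaler: scales a tensor before differentiation,
    unscales the result, and backs off the scale factor on overflow."""

    def __init__(self, init_scale: float = 2.0**8, backoff: float = 0.5):
        self._scale = init_scale
        self._backoff = backoff

    def grad(self, outputs: torch.Tensor, inputs: torch.Tensor) -> torch.Tensor:
        # Multiplying the output by the scale multiplies every derivative in
        # the chain rule by the same factor, keeping intermediate FP16 values
        # away from underflow/overflow.
        scaled = torch.autograd.grad(
            (outputs * self._scale).sum(), inputs, create_graph=True
        )[0]
        deriv = scaled.float() / self._scale  # unscale in FP32
        if not torch.isfinite(deriv).all():
            # Overflow: reduce the scale so the next attempt stays in range.
            # (A real scaler would also grow the scale after enough clean steps.)
            self._scale *= self._backoff
        return deriv
```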
Checklist
Dependencies