A temporary guide to stabilize models' training #2262
This guide needs to be re-edited as Paddle evolves.
The content of this issue has been added to the FAQ, so this issue is closed.
Currently, the convergence of many configurations in the v2 APIs has not been validated. One major challenge is that the training process is often interrupted by a fatal floating-point exception at the very beginning. We have concluded that this type of error results from overflow caused by gradient explosion during backpropagation.
A threshold for clipping the gradients of parameters can be enabled to suppress gradient explosion, but gradient clipping alone does not seem sufficient to prevent the floating-point exception. Instead, a necessary and more effective measure is to clip the error at the crucial positions with a proper threshold.
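For reference, a minimal sketch of enabling parameter-gradient clipping in the v2 API (the layer and threshold below are illustrative, not taken from the original configuration):

```python
import paddle.v2 as paddle

paddle.init(use_gpu=False, trainer_count=1)

# Clip the gradients of this layer's parameters to [-25, 25] during
# backpropagation (the threshold 25.0 is only an illustrative value).
clip_grads = paddle.attr.ParamAttr(gradient_clipping_threshold=25.0)

x = paddle.layer.data(name='x', type=paddle.data_type.dense_vector(128))
h = paddle.layer.fc(input=x,
                    size=256,
                    act=paddle.activation.Tanh(),
                    param_attr=clip_grads)
```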
The error is the gradient of the cost function with respect to the output of each layer, and it propagates backward following the chain rule. In PaddlePaddle, the threshold for error clipping can be set layer by layer via the `layer_attr` argument.
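For example, a minimal sketch with a plain fully connected layer (any layer that accepts `layer_attr` works the same way; the threshold value is illustrative):

```python
import paddle.v2 as paddle

paddle.init(use_gpu=False, trainer_count=1)

# Clip the backpropagated error (the gradient w.r.t. this layer's output)
# to [-10, 10]; the threshold 10.0 is only an illustrative value.
clip_error = paddle.attr.ExtraAttr(error_clipping_threshold=10.0)

x = paddle.layer.data(name='x', type=paddle.data_type.dense_vector(128))
h = paddle.layer.fc(input=x,
                    size=256,
                    act=paddle.activation.Tanh(),
                    layer_attr=clip_error)
```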
As long as the error is clipped after the layers that are sensitive to numerical instability, the floating-point exception can be avoided as expected. Take the seq2seq demo adapted from the machine translation chapter of PaddleBook as an example:

1. Set `log_error_clipping=True` and `log_clipping=True` in `paddle.init()` (a minimal sketch follows this list);
2. Apply error clipping to `simple_attention()`, which is sensitive because of its `softmax` computation.
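A minimal sketch of the first step; turning on these flags only adds log messages whenever a clipping threshold fires, so you can see where and how often clipping acts (the other `init` arguments are illustrative):

```python
import paddle.v2 as paddle

# Log every time error clipping / gradient clipping is triggered, which
# helps to locate the layers that suffer from numerical instability.
paddle.init(use_gpu=False,
            trainer_count=1,
            log_error_clipping=True,
            log_clipping=True)
```

For the second step, the error-clipping threshold is set through `layer_attr` on the layers inside `simple_attention()`, using the same mechanism as the earlier sketch; which layers need the threshold depends on the demo's network definition.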
Then begin to train the model. At first, a lot of error clipping messages appear in the log.
Take it easy. Usually, after several hundred batches, the training becomes stable; as the error gets smaller, the threshold rarely triggers again.
Otherwise, a larger threshold or other hyperparameters with more suitable values may be needed.
Enjoy!