
A temporary guide to stabilize models' training #2262

Closed
kuke opened this issue May 25, 2017 · 2 comments


kuke commented May 25, 2017

Currently, the convergence of many model configurations under the v2 APIs has not been validated. One major challenge is that the training process is often interrupted at the very beginning by a fatal error related to a floating-point exception. We have concluded that this type of error results from overflow caused by gradient explosion during backpropagation.

A threshold for clipping the gradients of parameters can be enabled to suppress gradient explosion. But gradient clipping alone is often not enough to prevent the floating-point exception. A necessary and more effective measure is to clip the error at crucial positions with a proper threshold. Here error means the gradient of the cost function with respect to the output of each layer, which propagates backward following the chain rule. In PaddlePaddle, the threshold for error clipping can be set layer by layer via the layer_attr argument:

layer_attr=paddle.attr.ExtraAttr(error_clipping_threshold=10.0)
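For instance, here is a minimal sketch (v2 API) of attaching this threshold to an ordinary fully connected layer; the layer, its size, and the upstream prev_layer are illustrative, not part of the original demo:

import paddle.v2 as paddle

# Any v2 layer that accepts layer_attr can carry an error-clipping
# threshold; the size and activation below are illustrative.
hidden = paddle.layer.fc(
    input=prev_layer,  # assumed upstream layer
    size=512,
    act=paddle.activation.Tanh(),
    layer_attr=paddle.attr.ExtraAttr(error_clipping_threshold=10.0))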

As long as the error is clipped after the layers that are sensitive to numerical instability, the floating-point exception can be avoided as expected. Take the seq2seq demo adapted from the machine translation chapter of PaddleBook as an example:

  • Enable logging of error clipping by setting log_error_clipping=True and log_clipping=True in paddle.init();
  • Set the threshold for error clipping in the decoder's input mixed layer, right after simple_attention(), which is numerically sensitive because of its softmax computation (see the sketch after this list).
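A sketch of both steps, adapted from the PaddleBook machine translation demo; enc_vec, enc_proj, decoder_mem, current_word, and decoder_size are assumed to be defined as in that demo:

import paddle.v2 as paddle

# Step 1: enable logging of clipping events.
paddle.init(use_gpu=False, trainer_count=1,
            log_error_clipping=True, log_clipping=True)

# Step 2: clip error in the decoder's input mixed layer, right after
# the attention computation, which is sensitive due to its softmax.
context = paddle.networks.simple_attention(
    encoded_sequence=enc_vec,
    encoded_proj=enc_proj,
    decoder_state=decoder_mem)

decoder_inputs = paddle.layer.mixed(
    size=decoder_size * 3,
    input=[paddle.layer.full_matrix_projection(input=context),
           paddle.layer.full_matrix_projection(input=current_word)],
    layer_attr=paddle.attr.ExtraAttr(error_clipping_threshold=10.0))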

Then start training the model. At first, a lot of error-clipping log messages appear:

Pass 0, Batch 50, Cost 237.458789, {'classification_error_evaluator': 0.9572649598121643}
I0525 00:46:05.145277 2043461632 Layer.cpp:363]  layer=input_recurrent@decoder_group need clipping, max error=150.321 avg error=6.00274
I0525 00:46:05.186667 2043461632 Layer.cpp:363]  layer=input_recurrent@decoder_group need clipping, max error=1453.23 avg error=57.54
I0525 00:46:05.237247 2043461632 Layer.cpp:363]  layer=input_recurrent@decoder_group need clipping, max error=3381.55 avg error=143.592
I0525 00:46:05.304813 2043461632 Layer.cpp:363]  layer=input_recurrent@decoder_group need clipping, max error=5546.9 avg error=235.026
I0525 00:46:05.364843 2043461632 Layer.cpp:363]  layer=input_recurrent@decoder_group need clipping, max error=7751.46 avg error=330.331
I0525 00:46:05.423358 2043461632 Layer.cpp:363]  layer=input_recurrent@decoder_group need clipping, max error=9669.27 avg error=351.612
I0525 00:46:05.490279 2043461632 Layer.cpp:363]  layer=input_recurrent@decoder_group need clipping, max error=11335.1 avg error=474.481
I0525 00:46:05.529522 2043461632 Layer.cpp:363]  layer=input_recurrent@decoder_group need clipping, max error=12848.3 avg error=594.606
I0525 00:46:05.577287 2043461632 Layer.cpp:363]  layer=input_recurrent@decoder_group need clipping, max error=14418.5 avg error=709.603

Take it easy

Usually after several hundred batches, training becomes stable. As the error gets smaller, the threshold is rarely triggered again.

Pass 0, Batch 340, Cost 163.044409, {'classification_error_evaluator': 0.9482758641242981}
.........
Pass 0, Batch 350, Cost 146.991077, {'classification_error_evaluator': 0.9910714030265808}
.........
Pass 0, Batch 360, Cost 167.668896, {'classification_error_evaluator': 0.93388432264328}
.........
Pass 0, Batch 370, Cost 180.562292, {'classification_error_evaluator': 0.9770992398262024}
.........
Pass 0, Batch 380, Cost 211.419653, {'classification_error_evaluator': 0.9756097793579102}
.........
Pass 0, Batch 390, Cost 155.465637, {'classification_error_evaluator': 0.9576271176338196}
.........
Pass 0, Batch 400, Cost 157.087720, {'classification_error_evaluator': 0.9473684430122375}

Otherwise, a larger threshold, or other hyper-parameters with more suitable values, may be needed.
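For example, the parameter-gradient clipping mentioned at the beginning can also be tuned through the optimizer. A hedged sketch, assuming the v2 optimizer accepts gradient_clipping_threshold as in other Paddle demos; the values are illustrative:

import paddle.v2 as paddle

# Clip parameter gradients globally; tune the threshold per model.
optimizer = paddle.optimizer.Adam(
    learning_rate=5e-4,
    gradient_clipping_threshold=25.0)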

Enjoy!


kuke commented Sep 29, 2017

This guide needs to be re-edited to keep up with the evolution of Paddle.

@lcy-seso
Contributor

The content related to this issue has been added to the FAQ, so this issue is now closed.
