
Use a unified FLAGS_check_nan_inf_level to control the result of checking for infinite values. #47672

Merged (1 commit, Nov 5, 2022)

Conversation

Contributor

@Xreki Xreki commented Nov 4, 2022

PR types

Function optimization

PR changes

Others

Describe

#47095 added FLAGS_abort_on_nan_inf and FLAGS_check_tensor_max_min to control the behavior of the tool enabled by FLAGS_check_nan_inf, to make precision debugging easier. That implementation has two drawbacks:

  1. There are too many FLAGS, and they must be combined, which makes configuration cumbersome.
  2. The scheme cannot be extended to more kinds of precision checks.

This PR removes FLAGS_abort_on_nan_inf and FLAGS_check_tensor_max_min and adds FLAGS_check_nan_inf_level as a single flag that controls the behavior of the FLAGS_check_nan_inf tool, as follows:

  1. FLAGS_check_nan_inf_level = 0: print information only for Tensors that contain NaN or Inf, and exit the process once NaN/Inf is detected. This is the default.
  2. FLAGS_check_nan_inf_level = 1: print information only for Tensors that contain NaN or Inf, but do not exit the process; training continues, which is useful for observing whether the op_type and location of NaN/Inf are the same across different iterations.
  3. FLAGS_check_nan_inf_level = 2: for float training only; additionally print when a Tensor's max/min values exceed the representable range of float16. Used for AMP precision debugging.
  4. FLAGS_check_nan_inf_level = 3: print max/min information for all Tensors. Used for comparing the precision of float and AMP training.


paddle-bot bot commented Nov 4, 2022

Your PR has been submitted. Thanks for your contribution!
Please wait for the result of CI firstly. See Paddle CI Manual for details.

Contributor

@zhangting2020 zhangting2020 left a comment


LGTM

@Xreki Xreki merged commit 54bc3b4 into PaddlePaddle:develop Nov 5, 2022
@Xreki Xreki deleted the amp/opt_check_infinite branch November 5, 2022 02:48
Xreki added a commit to Xreki/Paddle that referenced this pull request Apr 5, 2023