Use a unified FLAGS_check_nan_inf_level to control the behavior of the NaN/Inf check. #47672
PR types
Function optimization
PR changes
Others
Describe
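#47095 added FLAGS_abort_on_nan_inf and FLAGS_check_tensor_max_min to control what happens when FLAGS_check_nan_inf is enabled, which makes precision debugging easier. That implementation has two drawbacks.
This PR removes FLAGS_abort_on_nan_inf and FLAGS_check_tensor_max_min and adds FLAGS_check_nan_inf_level as a single control for the behavior of the FLAGS_check_nan_inf tool. The levels work as follows:
FLAGS_check_nan_inf_level = 0: print information only for Tensors that contain NaN or Inf, and exit the process once NaN or Inf is detected. This is the default.
FLAGS_check_nan_inf_level = 1: print information only for Tensors that contain NaN or Inf, but do not exit the process after detection; training continues. This can be used to observe whether the op_type and location of NaN or Inf are the same across iterations.
FLAGS_check_nan_inf_level = 2: for float training only. Tensors whose Max or Min value falls outside the representable range of float16 are also printed. Intended for AMP precision debugging.
FLAGS_check_nan_inf_level = 3: print the Max, Min, and related information of every Tensor. Intended for comparing precision between float and AMP training.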
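The four levels described above can be modelled with a small, self-contained sketch. This is not Paddle's implementation: `check_tensor`, `NanInfError`, and the report format are hypothetical names made up for illustration, and raising an exception stands in for exiting the process. It also assumes that levels 2 and 3 keep running after NaN or Inf is found, as level 1 does, which the description does not state explicitly.

```python
import math

FLOAT16_MAX = 65504.0  # largest finite value representable in float16


class NanInfError(RuntimeError):
    """Hypothetical stand-in for the process exit at level 0."""


def check_tensor(name, values, level=0):
    """Return the line that would be printed, or None if nothing is printed."""
    finite = [v for v in values if math.isfinite(v)]
    num_nan = sum(1 for v in values if math.isnan(v))
    num_inf = sum(1 for v in values if math.isinf(v))
    vmax = max(finite) if finite else float("nan")
    vmin = min(finite) if finite else float("nan")
    report = f"{name}: num_nan={num_nan} num_inf={num_inf} max={vmax} min={vmin}"

    if num_nan or num_inf:
        if level == 0:
            # Level 0 (default): report and stop.
            raise NanInfError(report)
        # Levels 1-3: report and keep going (assumed for levels 2 and 3).
        return report
    if level >= 3:
        # Level 3: report every tensor, healthy or not.
        return report
    if level == 2 and finite and max(abs(vmax), abs(vmin)) > FLOAT16_MAX:
        # Level 2: also report values that would overflow float16.
        return report
    return None


if __name__ == "__main__":
    print(check_tensor("fc_0.out", [1.0, float("nan")], level=1))
    print(check_tensor("fc_1.out", [1.0, 70000.0], level=2))
    print(check_tensor("fc_2.out", [1.0, 2.0], level=3))
```

In this model, running the same data at level 1 and then at level 2 shows which tensors are numerically valid in float32 but would overflow once cast to float16, which is the AMP debugging use case the description mentions.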