[PaddlePaddle Hackathon 4 No.49] Add float16 data type support for Paddle bce_loss #50930

thunder95 · 2023-02-26T09:35:43Z

PR types

Performance optimization

PR changes

OPs

Describe

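Add float16 data type support to bce_loss.

Test device: RTX 2070s

Current performance of the bce_loss forward and backward passes:

| Case No. | input_shape | fp32 (ms) | fp16 (ms) | diff (ms) | relative diff |
|---|---|---|---|---|---|
| 1 | [16, 3, 64, 64, 1] | 0.024328 | 0.0209348 | 0.003393 | fp16 faster by 16.21% |

The Chinese API documentation is updated to list fp16 support in PaddlePaddle/docs#5704.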

paddle-bot · 2023-02-26T09:35:47Z

Your PR has been submitted. Thanks for your contribution!
Please wait for the CI results first. See the Paddle CI Manual for details.

zhangting2020 · 2023-03-10T07:48:19Z

paddle/phi/kernels/gpu/bce_loss_grad_kernel.cu

+    MT x_mt = static_cast<MT>(x);
+    MT term1 = max((static_cast<MT>(one) - x_mt) * x_mt, static_cast<MT>(eps));
+    return static_cast<T>(static_cast<MT>(dout) *
+                          (x_mt - static_cast<MT>(label)) / term1);


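There is an eps issue at line 36: 1e-12 underflows to 0 when represented in fp16.

Adjusted; I am not sure whether it is acceptable to write it this way.

Could the code here be simplified? Make one and eps member variables initialized as type MT. The original constructor can then be removed.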
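
The underflow raised in this thread can be checked without a GPU. The following is a minimal sketch, not code from the PR: it uses the Python struct module, whose 'e' format is IEEE 754 half precision, to round values to fp16. The helper name to_fp16 is hypothetical.

```python
import struct


def to_fp16(value: float) -> float:
    """Round a Python float to IEEE 754 half precision and back (hypothetical helper)."""
    return struct.unpack("<e", struct.pack("<e", value))[0]


eps = 1e-12
smallest_subnormal = 2.0 ** -24  # smallest positive value fp16 can represent

print(to_fp16(eps))                 # eps is far below 2**-24, so it rounds to zero
print(to_fp16(smallest_subnormal))  # exactly representable, survives the round trip

# With eps rounded to zero, the clamp max((1 - x) * x, eps) no longer
# protects the denominator when x is exactly 0 or 1.
x = 1.0
denominator = max((1.0 - x) * x, to_fp16(eps))
print(denominator)
```

This is consistent with the diff above, which casts eps to the wider type MT and only casts the final result back to T.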

zhangting2020 · 2023-03-10T07:52:03Z

python/paddle/fluid/tests/unittests/test_bce_loss.py

@@ -279,6 +280,48 @@ def init_test_cast(self):
        self.shape = [2, 3, 20]


+@unittest.skipIf(
+    not core.is_compiled_with_cuda(), "core is not compiled with CUDA"
+)


This does not need to be added.

zhangting2020 · 2023-03-10T07:54:12Z

python/paddle/fluid/tests/unittests/test_bce_loss.py

+
+class TestBceLossOpFP16Case1(OpTest):
+    def init_test_cast(self):
+        self.shape = [20, 30, 40, 50]


This class should inherit from TestBceLossOpFP16, and there is a typo: cast -> case. The same applies to the cases below.

zhangting2020 · 2023-03-10T07:55:07Z

python/paddle/fluid/tests/unittests/test_bce_loss.py

+        place = core.CUDAPlace(0)
+        if core.is_float16_supported(place):
+            self.check_grad_with_place(
+                place, ['X'], 'Out', max_relative_error=0.5


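In addition, the whole unit test should follow the low-precision operator unit test guidelines: https://www.paddlepaddle.org.cn/documentation/docs/zh/develop/dev_guides/amp_precision/amp_test_dev_guide_cn.html

You can inherit from TestBceLossOp and make small modifications to it, which simplifies the code.

Is the relative error of the backward pass reasonable?

@zhangting2020 Could you please advise? I cannot tell what is written incorrectly; the relative error of the backward pass is consistently too large: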
AssertionError: 0.42 not less than or equal to 0.001
AssertionError: 0.81 not less than or equal to 0.001
AssertionError: 0.81 not less than or equal to 0.001

zhangting2020 · 2023-03-10T08:05:04Z

python/paddle/nn/layer/loss.py

@@ -68,7 +68,7 @@ class BCEWithLogitsLoss(Layer):
    Args:
        weight (Tensor, optional): A manual rescaling weight given to the loss of each
            batch element. If given, it has to be a 1D Tensor whose size is `[N, ]`,
-            The data type is float32, float64. Default is ``'None'``.
+            The data type is float16, float32, float64. Default is ``'None'``.


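Does this API correspond to bce_loss?

Also, an API implemented as a class usually calls the corresponding API under functional in its implementation; check the code to confirm.

The documentation of both APIs needs to be updated together.

The type check in the static graph branch needs to be updated.

A static graph fp16 unit test needs to be added: inherit from unittest and simply call the API. See the static graph unit test in #51168 for reference.

Done.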

zhangting2020 · 2023-03-13T03:20:23Z

paddle/phi/kernels/gpu/bce_loss_grad_kernel.cu

+    MT x_mt = static_cast<MT>(x);
+    MT term1 = max((static_cast<MT>(one) - x_mt) * x_mt, static_cast<MT>(eps));
+    return static_cast<T>(static_cast<MT>(dout) *
+                          (x_mt - static_cast<MT>(label)) / term1);


Could the code here be simplified? Make one and eps member variables initialized as type MT. The original constructor can then be removed.

zhangting2020 · 2023-03-13T03:22:42Z

paddle/phi/kernels/gpu/bce_loss_kernel.cu

+            static_cast<MT>(neg_100));
+    return static_cast<T>(
+        ((static_cast<MT>(label) - static_cast<MT>(one)) * term2) -
+        (static_cast<MT>(label) * term1));


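This is a similar issue to the one above. I think the original implementation can be changed: one and neg_100 are already member variables, so they can be initialized as type MT from the start.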
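
The arithmetic in the two kernel excerpts can be cross-checked on the CPU. The following is a rough sketch under stated assumptions: the diff does not show how term1 and term2 are computed in the forward kernel, so they are assumed here to be log(x) and log(1 - x) clamped from below by neg_100; the value -100 is inferred from the variable name, and 1e-12 for eps is the value quoted in the review. The function names are hypothetical.

```python
import math

NEG_100 = -100.0  # assumed value of neg_100, inferred from its name
EPS = 1e-12       # eps value quoted in the review


def bce_forward(x: float, label: float) -> float:
    """Forward expression from the excerpt; requires 0 < x < 1."""
    # Assumption: term1 and term2 are the clamped logs of x and 1 - x.
    term1 = max(math.log(x), NEG_100)
    term2 = max(math.log(1.0 - x), NEG_100)
    return (label - 1.0) * term2 - label * term1


def bce_backward(x: float, label: float, dout: float = 1.0) -> float:
    """Backward expression from the excerpt: dout * (x - label) / max((1 - x) * x, eps)."""
    term1 = max((1.0 - x) * x, EPS)
    return dout * (x - label) / term1


# Compare the analytic gradient with a central finite difference.
x, label, h = 0.3, 1.0, 1e-6
numeric = (bce_forward(x + h, label) - bce_forward(x - h, label)) / (2.0 * h)
analytic = bce_backward(x, label)
print(numeric, analytic)
```

In double precision the two values agree closely, which suggests the backward expression itself matches the forward one under these assumptions.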

zhangting2020 · 2023-03-13T03:27:13Z

python/paddle/fluid/tests/unittests/test_bce_loss.py

+
+class TestBceLossOpFP16Case2(TestBceLossOpFP16):
+    def init_test_case(self):
+        self.shape = [2, 3, 20]


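The unit tests above can be simplified further. TestBceLossOpFP16 inherits from TestBceLossOp, so TestBceLossOp can be adjusted, for example so that dtype and shape can be set when a case is initialized. That would remove a lot of redundant code.

Why is max_relative_error so large?

It is a temporary setting so that CI can run. The relative error of the backward pass is very large and I have not found the cause yet: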
AssertionError: 0.42 not less than or equal to 0.001
AssertionError: 0.81 not less than or equal to 0.001
AssertionError: 0.81 not less than or equal to 0.001
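
The structure suggested in this thread, a base case with overridable dtype and shape hooks, might look roughly like the sketch below. It uses only the standard unittest module instead of Paddle's OpTest, so every class and function name here is hypothetical, and dtype is carried only as a label rather than used for fp16 computation.

```python
import math
import random
import unittest


def bce_reference(x: float, label: float) -> float:
    """Reference BCE for one element; assumes 0 < x < 1."""
    return -(label * math.log(x) + (1.0 - label) * math.log(1.0 - x))


class TestBceLossSketch(unittest.TestCase):
    """Hypothetical base case: subclasses override only the hooks they need."""

    def init_test_dtype(self):
        self.dtype = "float64"

    def init_test_case(self):
        self.shape = [10, 10]

    def setUp(self):
        # Hooks run first, so the data below follows the subclass settings.
        self.init_test_dtype()
        self.init_test_case()
        rng = random.Random(0)
        numel = math.prod(self.shape)
        self.x = [rng.uniform(0.1, 0.8) for _ in range(numel)]
        self.label = [float(rng.randint(0, 1)) for _ in range(numel)]

    def test_loss_is_non_negative(self):
        losses = [bce_reference(x, y) for x, y in zip(self.x, self.label)]
        self.assertEqual(len(losses), math.prod(self.shape))
        self.assertTrue(all(loss >= 0.0 for loss in losses))


class TestBceLossSketchFP16(TestBceLossSketch):
    def init_test_dtype(self):
        self.dtype = "float16"


class TestBceLossSketchFP16Case2(TestBceLossSketchFP16):
    def init_test_case(self):
        self.shape = [2, 3, 20]
```

Each additional case then needs only a two-line override, which is the redundancy reduction the reviewer asks for.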

zhangting2020 · 2023-03-13T03:28:17Z

python/paddle/fluid/tests/unittests/test_bce_loss.py

+                    feed={'x': x_data, 'y': y_data}, fetch_list=[out]
+                )[0]
+                np.testing.assert_allclose(
+                    output_pd, output_np, rtol=1e-3, atol=1e-3


Does it pass with atol set to 0?

@zhangting2020 This is fine here; it also passes with atol=1e-3.
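
The question about atol matters because of how the two tolerances combine. The sketch below reimplements the per-element criterion documented for numpy.testing.assert_allclose, |actual - desired| <= atol + rtol * |desired|, in plain Python; the helper name and the sample values are hypothetical and are not taken from the test.

```python
def within_tolerance(actual: float, desired: float, rtol: float, atol: float) -> bool:
    """Per-element criterion documented for numpy.testing.assert_allclose."""
    return abs(actual - desired) <= atol + rtol * abs(desired)


desired = 1e-4  # hypothetical small reference value
actual = 9e-4   # hypothetical result, nine times too large

# With atol=1e-3 the absolute term dominates and hides the error.
print(within_tolerance(actual, desired, rtol=1e-3, atol=1e-3))
# With atol=0 only the relative term remains and the error is caught.
print(within_tolerance(actual, desired, rtol=1e-3, atol=0.0))
```

For outputs with small magnitude, a nonzero atol can therefore mask a large relative error.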

thunder95 · 2023-03-15T04:53:55Z

python/paddle/fluid/tests/unittests/test_bce_loss.py


        self.inputs = {'X': input_np, 'Label': label_np}
        self.outputs = {'Out': output_np}

    def test_check_output(self):
-        self.check_output()
+        self.check_output(check_eager=True)


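@zhangting2020 Could you explain what check_eager means and what it is used for?

It is a parameter that the unit test system added during the framework upgrade in order to test dynamic graph mode; it does not affect what the test verifies.

You need to look into the precision of the backward computation; the unit test failure indicates that the precision check does not pass.

@zhangting2020 I have investigated for a long time without finding the problem. It currently looks like numeric_grads deviates substantially from the expected value when it is computed. For example, in the mean computation, np.array([85.02881]).astype(np.float16) => 85.0, so although pos and neg differ in float, both become 85.0 in float16 and the resulting gradient is 0. If the input is set to float32 when computing the mean, a gradient value is obtained and the error shrinks from 0.42 to 0.05. I would appreciate further guidance.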
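
The effect described in this comment can be reproduced with a small sketch. It is an illustration only, not the unit test framework's code: pos is the value quoted above, while neg and delta are hypothetical, and to_fp16 is a hypothetical helper built on the struct module's half-precision 'e' format. Between 64 and 128, adjacent fp16 values are 2**-4 = 0.0625 apart, so nearby means collapse to the same number.

```python
import struct


def to_fp16(value: float) -> float:
    """Round a Python float to IEEE 754 half precision and back (hypothetical helper)."""
    return struct.unpack("<e", struct.pack("<e", value))[0]


pos = 85.02881  # value quoted in the comment above
neg = 85.01     # hypothetical mean for the opposite perturbation
delta = 0.005   # hypothetical perturbation size

# Central difference with the means kept in wider precision.
grad_wide = (pos - neg) / (2.0 * delta)
# Same difference after both means are rounded to fp16.
grad_fp16 = (to_fp16(pos) - to_fp16(neg)) / (2.0 * delta)

print(to_fp16(pos), to_fp16(neg))
print(grad_wide, grad_fp16)
```

Both means round to 85.0 in fp16, so the finite difference is exactly zero even though the wider-precision estimate is not.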

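Which part of the unit test framework does "numeric_grads deviates substantially from the expected value" refer to? Could you post a link? The unit test framework itself may be losing precision in the reference gradient; I will confirm.

@zhangting2020 Hello, after merging the latest code I found that the previous op_test.py file is gone. I had been printing the relevant outputs in that file to compare values. After I removed both atol and max_relative_error, the check unexpectedly passed, as if nothing were being checked. Was this part significantly reworked afterwards?

(1) The directory layout of some unit test files was probably adjusted; the code is now in this file: python/paddle/fluid/tests/unittests/eager_op_test.py

(2) From the behavior you describe, I am concerned that the test may fail randomly. I suggest first trying several shapes in your own development environment and running the unit test repeatedly, for example 100 times: ctest -R test_bce_loss --repeat-until-fail 100. If it passes this way, it should be fine. Running unit tests with ctest requires enabling -DWITH_TESTING=ON at build time, for example: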

cmake .. -DPY_VERSION=3.7 -DWITH_GPU=ON -DWITH_TESTING=ON -DCMAKE_BUILD_TYPE=Release -DWITH_DISTRIBUTE=OFF

@zhangting2020 I ran ctest -R test_bce_loss --repeat-until-fail 100 with the shapes [10, 10], [100, 100], [5000, 5000], and [20, 30, 40, 50]; all runs passed.

luotao1 · 2023-04-14T08:30:05Z

@thunder95 The ROCM pipeline needs to be fixed; according to its history, it has failed on every run.

thunder95 · 2023-04-14T08:56:42Z

@luotao1 Thanks, I only just noticed there is a problem here; I had not loaded the whole log when reading it.

zhangting2020

LGTM

…addlePaddle#50930)
* untracked files
* bce_loss_fp16
* remove unused files
* back max_rel_erro still big
* simplify code
* upd
* fix max_relative_error
* restart ci
* Update test_bce_loss.py
* Update test_bce_loss.py
* Update test_bce_loss.py
* Update test_bce_loss.py
* try to pass test
* restore file
* remove error value
* fix bug
---------
Co-authored-by: Zhang Ting <Douyaer2020@qq.com>

thunder95 added 9 commits February 20, 2023 16:32

untracked files

c8ae296

Merge branch 'develop' of https://github.com/PaddlePaddle/Paddle into develop

6aa02f0

Merge branch 'develop' of https://github.com/PaddlePaddle/Paddle into develop

d599110

Merge branch 'develop' of https://github.com/PaddlePaddle/Paddle into develop

264894d

Merge branch 'develop' of https://github.com/PaddlePaddle/Paddle into develop

98d1e1c

Merge branch 'develop' of https://github.com/thunder95/Paddle into develop

b958122

Merge branch 'develop' of https://github.com/PaddlePaddle/Paddle into develop

760e099

Merge branch 'develop' of https://github.com/PaddlePaddle/Paddle into develop

e16076d

bce_loss_fp16

85169ee

paddle-bot bot added contributor External developers status: proposed labels Feb 26, 2023

remove unused files

c7560fe

thunder95 mentioned this pull request Feb 26, 2023

[PaddlePaddle Hackathon Phase 4] Task overview #50629

Closed

luotao1 assigned luotao1, zhangting2020 and cloud2009 Feb 27, 2023

thunder95 added 4 commits March 2, 2023 14:29

Merge branch 'develop' of https://github.com/PaddlePaddle/Paddle into develop

085c7a6

Merge branch 'develop' of https://github.com/thunder95/Paddle into develop

b1edf68

Merge branch 'develop' of https://github.com/PaddlePaddle/Paddle into develop

f2887e5

Merge branch 'develop' of https://github.com/thunder95/Paddle into develop

6a62308

luotao1 assigned Ligoml Mar 6, 2023

Ligoml mentioned this pull request Mar 7, 2023

[PaddlePaddle Hackathon Phase 4] Task overview #51281

Closed

Merge branch 'develop' of https://github.com/PaddlePaddle/Paddle into develop

6620e88

zhangting2020 reviewed Mar 10, 2023

thunder95 added 3 commits March 10, 2023 14:43

Merge branch 'develop' of https://github.com/PaddlePaddle/Paddle into develop

e4134d9

Merge branch 'bce_loss_fp16' of https://github.com/thunder95/Paddle into bce_loss_fp16

29929ba

back max_rel_erro still big

812c917

thunder95 requested a review from zhangting2020 March 11, 2023 05:30

thunder95 mentioned this pull request Mar 11, 2023

[PaddlePaddle Hackathon 4 No.49] Add float16 data type support for Paddle bce_loss PaddlePaddle/docs#5704

Merged

zhangting2020 reviewed Mar 13, 2023

simplify code

4793749

thunder95 requested a review from zhangting2020 March 13, 2023 07:21

thunder95 and others added 9 commits March 13, 2023 14:33

Merge branch 'develop' of https://github.com/PaddlePaddle/Paddle into bce_loss_fp16

b50ec23

upd

96b73f9

fix max_relative_error

e598009

fix max_relative_error

68e544f

restart ci

82adc55

Update test_bce_loss.py

0f96491

Merge branch 'bce_loss_fp16' of https://github.com/thunder95/Paddle into bce_loss_fp16

895c3d5

Update test_bce_loss.py

2df50b2

Update test_bce_loss.py

c9d5fc2

thunder95 commented Mar 15, 2023

Zhang Ting and others added 7 commits March 15, 2023 13:45

Update test_bce_loss.py

9afb5e6

try to pass test

b9fc6df

merge

f8a3c2b

restore file

a8616ca

Merge branch 'develop' of https://github.com/PaddlePaddle/Paddle into bce_loss_fp16

73071c0

Merge branch 'develop' of https://github.com/PaddlePaddle/Paddle into bce_loss_fp16

ec9249d

remove error value

796c276

fix bug

72bd515

zhangting2020 approved these changes Apr 14, 2023

luotao1 merged commit 44e6de9 into PaddlePaddle:develop Apr 17, 2023
24 checks passed

luotao1 mentioned this pull request Apr 17, 2023

NO.49 Implement float16 data type support for the Paddle bce_loss operator #51490

Closed
