[Hackathon4 No.44] Optimize the GPU performance of the logsumexp op for Paddle #413
Conversation
## 2.1 Key modules and performance-improvement points

Concretely, shared memory can be used to fuse the kernels that contain Reduce computations, reducing the number of global-memory accesses. The fused logsumexp kernel first loads the input into shared memory: each block's shared memory holds one instance's features, of shape `(1, c)`. All subsequent intermediate results are kept in shared memory, and only the final output `out` is written back to global memory.
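The fusion described above can be sketched roughly as follows. This is a minimal illustrative kernel, not Paddle's actual implementation; the kernel name, launch configuration, and the assumption that the reduce axis is the last dimension of a `(n, c)` input are all mine.

```cuda
#include <cmath>

// Assumed launch config: one block per instance (row), grid = n,
// blockDim.x = BLOCK, dynamic shared memory = c * sizeof(float).
constexpr int BLOCK = 256;

__global__ void FusedLogsumexpKernel(const float* __restrict__ x,
                                     float* __restrict__ out, int c) {
  extern __shared__ float row[];    // one instance's features, shape (1, c)
  __shared__ float partial[BLOCK];  // per-thread reduction buffer

  const float* x_row = x + static_cast<size_t>(blockIdx.x) * c;

  // Stage the row in shared memory: one global read per element.
  for (int i = threadIdx.x; i < c; i += blockDim.x) row[i] = x_row[i];
  __syncthreads();

  // Block-wide max reduction (for numerical stability); stays on-chip.
  float m = -INFINITY;
  for (int i = threadIdx.x; i < c; i += blockDim.x) m = fmaxf(m, row[i]);
  partial[threadIdx.x] = m;
  __syncthreads();
  for (int s = blockDim.x / 2; s > 0; s >>= 1) {
    if (threadIdx.x < s)
      partial[threadIdx.x] = fmaxf(partial[threadIdx.x], partial[threadIdx.x + s]);
    __syncthreads();
  }
  m = partial[0];
  __syncthreads();

  // Block-wide sum of exp(x - max), still entirely in shared memory.
  float acc = 0.f;
  for (int i = threadIdx.x; i < c; i += blockDim.x) acc += expf(row[i] - m);
  partial[threadIdx.x] = acc;
  __syncthreads();
  for (int s = blockDim.x / 2; s > 0; s >>= 1) {
    if (threadIdx.x < s) partial[threadIdx.x] += partial[threadIdx.x + s];
    __syncthreads();
  }

  // Only the final scalar per instance is written to global memory.
  if (threadIdx.x == 0) out[blockIdx.x] = m + logf(partial[0]);
}

// Hypothetical launch for an input of shape (n, c):
//   FusedLogsumexpKernel<<<n, BLOCK, c * sizeof(float)>>>(x, out, c);
```

Note that this sketch requires the row to fit in shared memory (roughly c ≤ 12K floats on most GPUs); larger feature sizes would need a tiled or multi-pass variant.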
Just to confirm: are you planning to write the CUDA kernel yourself?
Yes. Is that an encouraged approach?
It depends on the operator type. If the op can be implemented by calling one of the framework's already well-optimized kernels, ElementwiseKernel, BroadcastKernel, or ReduceKernel, we don't recommend duplicating that work; just call those kernels directly. For a relatively complex operator like this one, writing your own kernel is encouraged. The RFC can state explicitly that you plan to write a new, optimized CUDA kernel.
I tried writing my own CUDA kernel and ran into some difficulties 😢, so I implemented a version using ReduceKernel from kps together with elementwise ops; the performance meets expectations. Would that also be acceptable?
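For reference, the reduce-plus-elementwise composition mentioned above corresponds to the standard numerically stable decomposition of logsumexp (my paraphrase of the approach, not the exact Paddle code):

```latex
m = \max_i x_i, \qquad
\operatorname{logsumexp}(x) = m + \log \sum_i e^{x_i - m}
```

Here the max and the sum are each one ReduceKernel invocation, while the subtraction, `exp`, and the final `log`-plus-add are elementwise ops, so the whole op is built from already-optimized framework kernels.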
The performance results have been updated.
@Xreki, could you please take a look?