[Hackathon4 No.44] Optimize the GPU performance of the logsumexp op for Paddle #413
Conversation
## 2.1 Key modules and performance-improvement points

Concretely, shared memory can be used to fuse the kernels that contain Reduce computations, reducing the number of global-memory accesses. The fused logsumexp kernel first loads the input into shared memory: each block's shared memory holds one instance's features, of shape `(1, c)`. All subsequent intermediate results are kept in shared memory, and only the final output `out` is written back to global memory.
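The fusion described above can be sketched roughly as follows. This is a minimal illustrative kernel, not Paddle's actual implementation; the kernel name, launch configuration, and the assumption that the reduce axis is the last dimension of a `(n, c)` input are all mine.

```cuda
#include <cmath>

// Assumed launch config: one block per instance (row), grid = n,
// blockDim.x = BLOCK, dynamic shared memory = c * sizeof(float).
constexpr int BLOCK = 256;

__global__ void FusedLogsumexpKernel(const float* __restrict__ x,
                                     float* __restrict__ out, int c) {
  extern __shared__ float row[];    // one instance's features, shape (1, c)
  __shared__ float partial[BLOCK];  // per-thread reduction buffer

  const float* x_row = x + static_cast<size_t>(blockIdx.x) * c;

  // Stage the row in shared memory: one global read per element.
  for (int i = threadIdx.x; i < c; i += blockDim.x) row[i] = x_row[i];
  __syncthreads();

  // Block-wide max reduction (for numerical stability); stays on-chip.
  float m = -INFINITY;
  for (int i = threadIdx.x; i < c; i += blockDim.x) m = fmaxf(m, row[i]);
  partial[threadIdx.x] = m;
  __syncthreads();
  for (int s = blockDim.x / 2; s > 0; s >>= 1) {
    if (threadIdx.x < s)
      partial[threadIdx.x] = fmaxf(partial[threadIdx.x], partial[threadIdx.x + s]);
    __syncthreads();
  }
  m = partial[0];
  __syncthreads();

  // Block-wide sum of exp(x - max), still entirely in shared memory.
  float acc = 0.f;
  for (int i = threadIdx.x; i < c; i += blockDim.x) acc += expf(row[i] - m);
  partial[threadIdx.x] = acc;
  __syncthreads();
  for (int s = blockDim.x / 2; s > 0; s >>= 1) {
    if (threadIdx.x < s) partial[threadIdx.x] += partial[threadIdx.x + s];
    __syncthreads();
  }

  // Only the final scalar per instance is written to global memory.
  if (threadIdx.x == 0) out[blockIdx.x] = m + logf(partial[0]);
}

// Hypothetical launch for an input of shape (n, c):
//   FusedLogsumexpKernel<<<n, BLOCK, c * sizeof(float)>>>(x, out, c);
```

Note that this sketch requires the row to fit in shared memory (roughly c ≤ 12K floats on most GPUs); larger feature sizes would need a tiled or multi-pass variant.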
Just to confirm: are you planning to write the CUDA kernel yourself?
Yes. Is that an encouraged approach?
It depends on the operator type. If the op can be implemented by calling one of the framework's already well-optimized kernels, ElementwiseKernel, BroadcastKernel, or ReduceKernel, we don't recommend duplicating that work; just call those kernels directly. For a relatively complex operator like this one, writing your own kernel is encouraged. The RFC can state explicitly that you plan to write a new, optimized CUDA kernel.
I tried writing my own CUDA kernel and ran into some difficulties 😢, so I implemented a version using ReduceKernel from kps together with elementwise ops; the performance meets expectations. Would that also be acceptable?
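For reference, the reduce-plus-elementwise composition mentioned above corresponds to the standard numerically stable decomposition of logsumexp (my paraphrase of the approach, not the exact Paddle code):

```latex
m = \max_i x_i, \qquad
\operatorname{logsumexp}(x) = m + \log \sum_i e^{x_i - m}
```

Here the max and the sum are each one ReduceKernel invocation, while the subtraction, `exp`, and the final `log`-plus-add are elementwise ops, so the whole op is built from already-optimized framework kernels.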
The performance results have been updated.
@Xreki, could you please take a look?