
[Hackathon No.38] Optimize the GPU compute performance of the deformable_conv op for Paddle #218

Merged
merged 6 commits into PaddlePaddle:master on Sep 7, 2022

Conversation

@Rayman96 (Contributor) commented Aug 23, 2022

The proposal content has been revised; thanks for reviewing.

@paddle-bot commented Aug 23, 2022

Your PR has been submitted. Thanks for your contribution!
Please check that its format and content are complete; for this, you can refer to the Template and Demo.

@@ -0,0 +1,91 @@
Poisson OP performance optimization design document
Review comment: Fix the title (it says Poisson OP, but this design doc is for deformable_conv).

Author reply: Fixed.


# 1 Background and Significance

The current GPU version of deformable_conv in Paddle is implemented with cuBLAS plus CUDA kernels. Like the original paper authors' implementation, the kernels were ported from the CPU kernels to the GPU, and the CUDA code has not received any targeted optimization.
Review comment: Typo: 讲 should be 将.

Author reply: Fixed.

To survey this OP's current performance in the PaddlePaddle framework (develop branch), the OP performance data for the various cases in [OP Benchmark](https://github.com/PaddlePaddle/benchmark/tree/master/api/tests_v2) are listed in table form (Tesla P4).

### Timing Analysis
The deformable_conv test run through the benchmark executes both the forward and backward passes, so the two times need to be separated; the table below lists the core components.
Review comment: Actually, when running the OP benchmark you can just set --backward to False; that way the data covers only the forward pass, which is easier to read.

Author reply: Got it, thanks 🙏 The data has been updated.


According to the kernel runtime analysis, 65% of the time is spent on No. 5, and the cuBLAS implementation itself leaves little room for major optimization, so optimizing the two kernels individually can hardly reach the goal.

According to the CUDA API timing analysis, 63% of the time is spent on synchronization and 26% on memory allocation. Reducing the number of thread synchronizations and the amount of data copied between host and device should yield a substantial improvement, so optimization points 1 and 2 are the main focus. Overall the execution is serial: each im2col is followed by a gemm, and only then does the next im2col/gemm pair begin.
Review comment: By "reducing data migration between memory and CUDA", do you mean data copies between the host and the device? Could you also add the API timing measurements here?

Author reply: Yes, it means data copies between the host and the device. When only the forward pass is measured, the API time distribution changes; it has been added to the timing section, and the description here has been revised accordingly. The API numbers alone do not support any obvious conclusion.

+ Optimization point 1: search for a better configuration by tuning the grid and block counts.
+ Optimization point 2: move the loop in deformable_conv_kernel_impl that multiplies pixels by weights into ModulatedDeformableIm2colGpuKernel, merging the parallel computation of col_buffer with the computation of output_3d to cut part of the data-movement overhead.
+ Optimization point 3: parallelize the (batch_size / im2col_step) iterations in deformable_conv_kernel_impl; currently each im2col_step must finish before the next step can begin, and that waiting is unnecessary.
+ Optimization point 4: separately optimize the loop in deformable_conv_kernel_impl that multiplies pixels by weights.
Review comment: Please make the end-of-sentence punctuation consistent.

Author reply: Fixed.

## 2.2 Host / Device Computation Flow
1. For optimization point 1: consider obtaining a better launch configuration via the GetGpuLaunchConfig1D method in Paddle's existing gpu_launch_config.h, or manually benchmark different BlockSize values (may offer some room for improvement).

2. For optimization point 2: pass two extra arguments into ModulatedDeformableIm2colGpuKernel on the host side; on the device side, keep computing output after col_buffer is finished (may offer large room for improvement).
Review comment: Could you describe this in more detail, e.g. with a diagram or pseudocode? This optimization seems to be the key one in the description above.

Author reply: A pseudocode description has been added.
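
The added pseudocode lives in the design document itself; as context, a rough sketch of the fusion idea (our illustration, not the author's pseudocode; DeformableIm2colBilinear, OutputIndex, and WeightIndex are hypothetical helpers stubbed here for compilability):

```cpp
// Hypothetical index helpers and bilinear sampler, stubbed for illustration.
__device__ int OutputIndex(int idx) { return idx; }  // placeholder mapping
__device__ int WeightIndex(int idx) { return idx; }  // placeholder mapping
__device__ float DeformableIm2colBilinear(const float* input,
                                          const float* offset,
                                          const float* mask, int idx) {
  // Real code would bilinearly sample input at the offset-shifted position
  // and scale by mask; a pass-through stands in for that here.
  return input[idx] * mask[idx];
}

// Rough illustration of optimization point 2 (not the PR's actual kernel):
// after a thread computes its col_buffer element, it immediately accumulates
// that element's weighted contribution to output, instead of leaving the
// multiplication to a separate host-side step.
__global__ void FusedIm2colOutputKernel(const float* input,
                                        const float* offset,
                                        const float* mask,
                                        const float* weight,
                                        float* col_buffer, float* output,
                                        int n) {
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  if (idx >= n) return;
  // Original im2col work: sample the deformed position.
  float val = DeformableIm2colBilinear(input, offset, mask, idx);
  col_buffer[idx] = val;
  // Fused-in work: multiply by the matching weight and accumulate into
  // output. atomicAdd because several col_buffer elements can map to the
  // same output element.
  atomicAdd(&output[OutputIndex(idx)], val * weight[WeightIndex(idx)]);
}
```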



3. For optimization point 3: parallelize the whole im2col_step process into a new kernel covering both the im2col and gemm steps (may offer large room for improvement).
Review comment: Same as above.

Author reply: A pseudocode description has been added.
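
One way such a restructuring could look is with cuBLAS's strided-batched gemm, which runs all (batch_size / im2col_step) multiplications in a single call. A hedged sketch, not the PR's code: ModulatedDeformableIm2colAllSteps is a hypothetical all-steps kernel variant, and the shape/stride parameters are illustrative placeholders.

```cpp
#include <cublas_v2.h>

// Hypothetical kernel covering every step's im2col work in one launch.
__global__ void ModulatedDeformableIm2colAllSteps(const float* input,
                                                  const float* offset,
                                                  const float* mask,
                                                  float* col_buffer_all);

// Sketch of optimization point 3: one launch does all steps' im2col, then
// one batched gemm computes all steps' outputs, so no step waits on the
// previous one.
void ForwardAllSteps(cublasHandle_t handle, dim3 grid, dim3 block,
                     const float* input, const float* offset,
                     const float* mask, const float* weight,
                     float* col_buffer_all, float* output_all,
                     int batch_size, int im2col_step,
                     int m, int n, int k, int lda, int ldb, int ldc,
                     long long col_stride, long long out_stride) {
  const float alpha = 1.0f, beta = 0.0f;
  int num_steps = batch_size / im2col_step;

  // 1. A single launch handles the im2col work of every step.
  ModulatedDeformableIm2colAllSteps<<<grid, block>>>(input, offset, mask,
                                                     col_buffer_all);

  // 2. One strided-batched gemm replaces the serial per-step gemms; the
  //    filter is shared by every step, hence strideB = 0.
  cublasSgemmStridedBatched(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k,
                            &alpha,
                            /*A=*/col_buffer_all, lda, col_stride,
                            /*B=*/weight, ldb, /*strideB=*/0, &beta,
                            /*C=*/output_all, ldc, out_stride, num_steps);
}
```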


## 3 Testing and Acceptance Criteria

Achieve a forward-pass speedup of more than 25%.
Review comment: Could you give the reasoning behind the estimated improvement of more than 25%?

Author reply: Fixed.

@Rayman96 (Contributor, Author) commented:

Is there currently a way to use dynamic parallelism in Paddle? I want to call blas inside a GPU kernel to compute a matrix multiplication.
Testing shows that using blas = phi::funcs::GetBlas directly inside a kernel fails with the error "calling a host function from a global function is not allowed".

@ZzSean commented Sep 1, 2022


cuBLAS functions are all called on the host side; internally you can think of them as launching CUDA kernels themselves, so you cannot call a global function from inside another global function.
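
To illustrate the point in plain CUDA (placeholder names, not Paddle code): cuBLAS can only be invoked from host code, which then enqueues device kernels on the GPU.

```cpp
#include <cublas_v2.h>

// Host-side wrapper: the correct place to call cuBLAS. The call below
// enqueues cuBLAS's own device kernels on the GPU.
void GemmOnHost(cublasHandle_t handle, int m, int n, int k,
                const float* A, const float* B, float* C) {
  const float alpha = 1.0f, beta = 0.0f;
  cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k,
              &alpha, A, m, B, k, &beta, C, m);
}

__global__ void Scale(float* data, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) data[i] *= 2.0f;
  // Calling cublasSgemm (or phi::funcs::GetBlas) here would fail to
  // compile with "calling a __host__ function from a __global__ function
  // is not allowed" -- exactly the error reported above.
}
```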

@ZzSean left a comment: LGTM

@ZzSean merged commit cc3d3fc into PaddlePaddle:master on Sep 7, 2022