optimization of max_pool3d grad #45934

Merged
merged 1 commit into PaddlePaddle:develop on Sep 20, 2022

Conversation

@s5u13b (Contributor) commented Sep 9, 2022

PR types

Performance optimization

PR changes

OPs

Describe

  • Environment:
    • V100-32G, CUDA 11.2, cuDNN 8
  • Feature:
    • Replace the div and mod operations with the fast_divmod operation (a sketch of the technique follows the table below).
    • Replace the 1D GPU launch with a 3D GPU launch.
    • Optimize the computation logic of the input-grad accumulation. Before the optimization, the GPU launch config was based on the input data, and the input grad was accumulated by traversing the corresponding indices of the output mask data, which introduced a large amount of output-index computation. After the optimization, the launch config is based on the output data, and the input grad is accumulated directly by looking up the max index of each output element; this removes the output-index computation but requires an atomic add (see the simplified kernel after the table).
    • (Config 0 is not optimized yet because Paddle calls the cuDNN kernel in config 0 of the max_pool3d benchmark.)
  • Performance (OP Benchmark):
| Paddle Kernel | Config ID | Perf Before | Perf After | Improvement | Perf of PyTorch |
| --- | --- | --- | --- | --- | --- |
| cudnn::pooling_bw_5d_kernel_max | 0 | 1779.7us | - | - | 725.72us |
| KernelMaxPool3DWithIdxGrad | 1 | 6128.1us | 677.62us | 804.3% | 725.83us |
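
For orientation, here is a minimal sketch of the fast_divmod technique referenced above: for a divisor fixed on the host (the pooling shape values known at launch time), a magic multiplier and shift are precomputed so the kernel can replace hardware integer division with a multiply-high and a shift. The struct below assumes the standard round-up magic-number scheme and is illustrative only, not Paddle's actual phi helper.

```cuda
#include <cstdint>
#include <cuda_runtime.h>

// Fast division/modulo by a divisor known on the host.
// Illustrative sketch; precondition: d >= 1, n fits in 32 bits.
struct FastDivMod {
  uint32_t divisor, multiplier, shift;

  explicit FastDivMod(uint32_t d) : divisor(d) {
    // Smallest shift with 2^shift >= d.
    for (shift = 0; shift < 32; ++shift) {
      if ((1u << shift) >= d) break;
    }
    // Round-up magic number so that n / d == (umulhi(n, m) + n) >> shift.
    uint64_t one = 1;
    multiplier =
        static_cast<uint32_t>(((one << 32) * ((one << shift) - d)) / d + 1);
  }

  __device__ __forceinline__ uint32_t Div(uint32_t n) const {
    // __umulhi returns the high 32 bits of the 64-bit product n * multiplier.
    return (__umulhi(n, multiplier) + n) >> shift;
  }

  __device__ __forceinline__ uint2 DivMod(uint32_t n) const {
    uint32_t q = Div(n);
    return make_uint2(q, n - q * divisor);  // (quotient, remainder)
  }
};
```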
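
The accumulation change can be pictured with a simplified backward kernel: the grid is sized from the output tensor, and each thread reads one output-gradient element plus the argmax index saved by the forward pass, then scatters it into the input gradient with a single atomicAdd. Everything below (names, NCDHW layout, mask semantics) is a hedged sketch, not the PR's actual KernelMaxPool3DWithIdxGrad.

```cuda
#include <cuda_runtime.h>

// mask[i] is assumed to hold the flat argmax position inside one
// D_in*H_in*W_in input slice, as recorded by the forward pass.
__global__ void MaxPool3DGradWithIdxSketch(const float* out_grad,
                                           const int* mask,
                                           float* in_grad,   // pre-zeroed
                                           int ncd,          // N * C * D_out
                                           int out_d, int out_h, int out_w,
                                           int in_dhw) {     // D_in*H_in*W_in
  int w = blockIdx.x * blockDim.x + threadIdx.x;   // W_out position
  int h = blockIdx.y * blockDim.y + threadIdx.y;   // H_out position
  int zc = blockIdx.z;                             // fused (n, c, d_out)
  if (w >= out_w || h >= out_h || zc >= ncd) return;

  int nc = zc / out_d;  // (n, c) slice; fast_divmod would replace this div
  int out_idx = (zc * out_h + h) * out_w + w;

  // One output element feeds exactly one input position, so the scatter is
  // a single atomicAdd instead of a per-input gather over pooling windows.
  atomicAdd(&in_grad[nc * in_dhw + mask[out_idx]], out_grad[out_idx]);
}

// 3D launch config derived from the output shape (assumes N*C*D_out stays
// within the 65535 gridDim.z limit; otherwise fold a loop into z).
inline void LaunchSketch(const float* out_grad, const int* mask,
                         float* in_grad, int n, int c, int out_d, int out_h,
                         int out_w, int in_dhw, cudaStream_t stream) {
  dim3 block(32, 8, 1);
  dim3 grid((out_w + block.x - 1) / block.x,
            (out_h + block.y - 1) / block.y,
            n * c * out_d);
  MaxPool3DGradWithIdxSketch<<<grid, block, 0, stream>>>(
      out_grad, mask, in_grad, n * c * out_d, out_d, out_h, out_w, in_dhw);
}
```

Note that for non-overlapping pooling windows each input position is the argmax of at most one output element, so the atomicAdds rarely contend; contention arises only when windows overlap.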

@paddle-bot-old (bot) added the contributor (External developers) label Sep 9, 2022
@JamesLim-sy (Contributor) left a comment:

LGTM. Please also add performance data for the other configs that go through the same kernel.

@JamesLim-sy JamesLim-sy merged commit 0e563da into PaddlePaddle:develop Sep 20, 2022