
CUDNN v8 Implementation of Convolution Kernels #47454

Merged
Merged 15 commits into PaddlePaddle:develop on Nov 18, 2022

Conversation

Contributor

@Tom-Zheng Tom-Zheng commented Oct 28, 2022

PR types

New features

PR changes

OPs

Describe

We are transitioning from the legacy cuDNN v7 APIs to the latest cuDNN frontend API, which is recommended for cuDNN v8 and later. The cuDNN frontend API provides an easier programming interface along with modern functionality such as autotuning and an errata filter (which blocks certain engine configs via JSON files), offering much more flexibility.
As a first step, this PR implements the cuDNN v8 APIs for the convolution operator, both forward and backward.

  • To build Paddle with the cuDNN v8 APIs, set the build option WITH_CUDNN_FRONTEND to ON. Currently, this option defaults to OFF.
  • The legacy APIs are still used by default. To enable the cuDNN v8 APIs at runtime, set the environment variable FLAGS_enable_cudnn_frontend=1.
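For illustration, the two switches above might be combined as follows (the build directory layout and `train.py` are placeholders, not part of this PR):

```shell
# Build-time: compile Paddle with the cuDNN frontend path available
cmake .. -DWITH_CUDNN_FRONTEND=ON   # other build options as usual
make -j

# Run-time: opt in to the cuDNN v8 frontend implementation
FLAGS_enable_cudnn_frontend=1 python train.py   # train.py is a placeholder
```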

@paddle-bot-old paddle-bot-old bot added the contributor External developers label Oct 28, 2022
@Tom-Zheng Tom-Zheng force-pushed the cudnnv8_convolution branch 2 times, most recently from fe24aac to fe4c630 Compare November 3, 2022 09:31
@Tom-Zheng Tom-Zheng changed the title [WIP] CUDNN v8 Implementation of Convolution Kernels CUDNN v8 Implementation of Convolution Kernels Nov 3, 2022
@Tom-Zheng Tom-Zheng marked this pull request as ready for review November 3, 2022 11:00
@onecatcn onecatcn requested a review from Xreki November 3, 2022 12:45
paddle/phi/kernels/autotune/cache_cudnn_frontend.h (outdated; resolved)
namespace phi {
namespace autotune {

class CudnnFrontendPlanCache {
Contributor

In the current version, AlgorithmsCache and CudnnAlgorithmsCacheMap share a lot of common code. #47667 attempts to rewrite them via inheritance; this can be updated in a follow-up.

}

private:
static cudnn_frontend::feature_vector_t MakeKey(
Contributor

Doesn't the cudnn frontend API itself already provide a way to compute the cache key? What is the main difference from the current ConvCacheKey?

struct ConvCacheKey {

Contributor Author

Essentially there is no big difference; it is only a difference in API. In the cudnn v7 programming model you maintain the convolution-related descriptors yourself, build your own data structures to store the various parameters, and implement the key and cache yourself. The cuDNN frontend API provides more encapsulation and an object-oriented design. In v7, this ConvCacheKey is produced by ConvertToConvCacheKey in ConvArgs. In v8 we no longer use ConvArgs to manage descriptors, but use the v8 classes instead, so the most convenient approach is to use the key implementation they already provide.
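As a rough, self-contained sketch of the idea (names are illustrative, not Paddle's actual code): in the v8 path the cache key is the operation graph's feature vector plus any extra runtime flags, rather than a hand-written ConvCacheKey struct. In the cudnn-frontend library, `feature_vector_t` is a `std::vector<int64_t>`, which is what the stand-in below mimics.

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Stand-in for cudnn_frontend::feature_vector_t (a std::vector<int64_t>).
using FeatureVector = std::vector<int64_t>;

// Build a cache key from the graph's feature vector plus flags that are
// not part of the graph itself (here: use_addto).
FeatureVector MakeKey(FeatureVector graph_features, bool use_addto) {
  graph_features.push_back(static_cast<int64_t>(use_addto));
  return graph_features;
}

// The cache then maps keys to saved engine configs; a string is used here
// as a stand-in for the opaque engine-config descriptor.
std::map<FeatureVector, std::string> plan_cache;
```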

std::make_pair(MakeKey(op_graph, use_addto), plan.GetEngineConfig()));
}

bool IsStable(const cudnn_frontend::OperationGraph& op_graph,
Contributor

I don't quite follow — what is this function for?

Contributor Author

The result of an exhaustive search may be unstable. For example, five searches might return the algos [1, 1, 0, 0, 0]. Rather than trusting the result of any single search, we provide an option to set a saturation count N: an algo is added to the cache only after it has been the best result in at least N searches. In the example above, with N = 3, algo 1 would not be added to the cache, while algo 0 would be deemed the fastest and cached. This function checks whether the saturation count has been reached; the plan is added to the cache only if it returns true.
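The saturation-count idea can be sketched as follows (a hypothetical illustration; the class and method names are not Paddle's actual implementation). Each time an exhaustive search finishes, the winning algo's count for that problem key is bumped, and the plan is cached only once some algo has won at least `saturation_count` searches.

```cpp
#include <map>
#include <string>
#include <utility>

// Hypothetical sketch of the saturation-count check described above.
class SaturationTracker {
 public:
  explicit SaturationTracker(int saturation_count)
      : saturation_count_(saturation_count) {}

  // Record one search result; return true once this algo has been the
  // best result in at least saturation_count searches for this key,
  // meaning it may now be inserted into the plan cache.
  bool RecordAndCheck(const std::string& problem_key, int best_algo) {
    int& wins = counts_[{problem_key, best_algo}];
    return ++wins >= saturation_count_;
  }

 private:
  int saturation_count_;
  // (problem key, algo) -> number of searches this algo has won.
  std::map<std::pair<std::string, int>, int> counts_;
};
```

With the [1, 1, 0, 0, 0] example and N = 3, algo 1 never saturates, while algo 0 saturates on the fifth search.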

paddle/phi/kernels/gpudnn/conv_cudnn_frontend.h (outdated; resolved)
paddle/phi/kernels/gpudnn/conv_cudnn_frontend.h (outdated; resolved)
paddle/phi/kernels/gpudnn/conv_grad_kernel.cu (outdated; resolved)
paddle/phi/kernels/gpudnn/conv_grad_kernel_impl_v8.h (outdated; resolved)
padding_common[i] = paddings[2 * i];
}
}
}
Contributor

This large block of code duplicates the v7 branch, which will make future maintenance very difficult. Please consider a finer-grained encapsulation scheme.

Contributor Author

Refactored.

@Xreki Xreki left a comment (Contributor)

Great work~

paddle/phi/kernels/autotune/cache_cudnn_frontend.h (outdated; resolved)
paddle/phi/kernels/autotune/cache_cudnn_frontend.h (outdated; resolved)
bool ret = false;
std::lock_guard<std::mutex> lock(*cache_mutex_);
auto key = op_graph.getFeatureVector();
if (map_.count(MakeKey(op_graph, use_addto)) > 0) {
Contributor

Does MakeKey need to be called repeatedly? Does that add overhead?

Contributor Author

Yes, it is called multiple times. The function definition is implicitly inline, so the overhead is acceptable.

paddle/phi/kernels/autotune/cache_cudnn_frontend.h (outdated; resolved)
paddle/phi/kernels/gpudnn/conv_cudnn_frontend.h (outdated; resolved)
paddle/phi/kernels/gpudnn/conv_grad_kernel_impl_v8.h (outdated; resolved)
paddle/phi/kernels/gpudnn/conv_grad_kernel_impl_v8.h (outdated; resolved)
paddle/phi/kernels/gpudnn/conv_grad_kernel_impl_v8.h (outdated; resolved)
paddle/phi/kernels/gpudnn/conv_grad_kernel_impl_v8.h (outdated; resolved)
paddle/phi/kernels/gpudnn/conv_grad_kernel.cu (resolved)
@Xreki (Contributor)

Xreki commented Nov 11, 2022

Also, please fix the compilation error on the ROCm platform:

[screenshot: ROCm build error]

- move functions in conv_kernel_impl_v8.h and conv_grad_kernel_impl_v8.h to conv_kernel.cu and conv_grad_kernel.cu
- add const specifier for input tensor
- add logging when plans fail to execute
- move CudnnConvBwdFilterV8 and CudnnConvBwdDataV8 to conv_cudnn_frontend.h
@Xreki Xreki left a comment (Contributor)

LGTM. I suggest further optimizing the related functionality and code in follow-ups: on one hand to keep the behavior consistent with the current cudnn v7 path, and on the other hand to make it easier to apply the Frontend API to more operators.


#include <vector>

#include "paddle/fluid/framework/convert_utils.h"
Contributor

The PHI operator library has recently been decoupling its Fluid dependencies, and convert_utils.h has just been cleaned up — see PR #48001. I suggest removing this header following the approach in that PR.

#include <vector>

#include "paddle/fluid/framework/convert_utils.h"
#include "paddle/fluid/platform/device/gpu/cuda/cudnn_desc.h"
Contributor

It is still not recommended to introduce additional Fluid headers into PHI. You could first create the same header under phi's backends/gpu/cuda directory and copy over the functions that are needed; from what I can see this file does not use many of them, so it should be easy to handle.

.setStrides(strides.size(), strides.data())
.setId(id)
.setAlignment(GetAlignment(tensor))
.setDataType(paddle::platform::ToCudnnDataType(
Contributor

Perhaps this could be changed as follows: add a new ToCudnnDataType under phi with the same logic as the Fluid one, except that the input datatype is phi's DataType rather than proto::VarType.
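A minimal sketch of that suggestion (hypothetical, not Paddle's actual code): the `cudnnDataType_t` values are taken from cudnn.h, but the enum is reproduced here so the sketch is self-contained; real code would `#include <cudnn.h>` and use phi's full `DataType` enum.

```cpp
// Minimal stand-ins so the sketch compiles without cuDNN / phi headers.
enum cudnnDataType_t {
  CUDNN_DATA_FLOAT = 0,   // values match cudnn.h
  CUDNN_DATA_DOUBLE = 1,
  CUDNN_DATA_HALF = 2,
};
enum class DataType { FLOAT32, FLOAT64, FLOAT16 };

// phi-side overload keyed on phi::DataType instead of proto::VarType.
cudnnDataType_t ToCudnnDataType(DataType t) {
  switch (t) {
    case DataType::FLOAT32: return CUDNN_DATA_FLOAT;
    case DataType::FLOAT64: return CUDNN_DATA_DOUBLE;
    case DataType::FLOAT16: return CUDNN_DATA_HALF;
  }
  return CUDNN_DATA_FLOAT;  // unreachable fallback
}
```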

@YuanRisheng (Contributor)

Let's merge this PR first; the Fluid headers will be cleaned up by the relevant colleagues.

@Xreki Xreki merged commit 14a6e67 into PaddlePaddle:develop Nov 18, 2022
@Tom-Zheng Tom-Zheng deleted the cudnnv8_convolution branch November 21, 2022 08:40
Labels: contributor (External developers), NVIDIA
5 participants