
Add SeparableConvolutionGPU kernel #2311

Merged
5 commits merged into NVIDIA:master on Oct 1, 2020

Conversation

klecki
Contributor

@klecki klecki commented Sep 30, 2020

Supports frames-first and channels-last layouts; wraps
ConvolutionGPU, applying several passes,
for 1 to 3 data axes.
Simple sanity test; the convolution itself is already tested.

Signed-off-by: Krzysztof Lecki <klecki@nvidia.com>

Why do we need this PR?

It adds a new kernel that will be used in the GaussianBlur GPU operator.

What happened in this PR?

  • What solution was applied:
Same as on the CPU: wrap the ConvolutionGpu kernel and apply several passes
  • Affected modules and functionalities:
    Kernels
  • Key points relevant for the review:
    Nothing special
  • Validation and testing:
    Simple sanity gtest
  • Documentation (including examples):
    NA

JIRA TASK: [Use DALI-1588 or NA]
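The idea the PR wraps, composing several 1-D passes instead of running one full N-D convolution, can be sketched on the CPU. This is a minimal illustration with hypothetical helper names, not DALI's ConvolutionGpu API: for a separable (outer-product) kernel, convolving rows and then columns gives the same result as the full 2-D convolution.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

using Image = std::vector<std::vector<float>>;

// Convolve each row with a 1-D window (zero padding, "same"-sized output).
Image ConvRows(const Image& in, const std::vector<float>& w) {
  int H = in.size(), W = in[0].size(), R = w.size() / 2;
  Image out(H, std::vector<float>(W, 0.f));
  for (int y = 0; y < H; y++)
    for (int x = 0; x < W; x++)
      for (int i = 0; i < (int)w.size(); i++) {
        int xx = x + i - R;
        if (xx >= 0 && xx < W) out[y][x] += w[i] * in[y][xx];
      }
  return out;
}

// Convolve each column with a 1-D window (zero padding, "same"-sized output).
Image ConvCols(const Image& in, const std::vector<float>& w) {
  int H = in.size(), W = in[0].size(), R = w.size() / 2;
  Image out(H, std::vector<float>(W, 0.f));
  for (int y = 0; y < H; y++)
    for (int x = 0; x < W; x++)
      for (int i = 0; i < (int)w.size(); i++) {
        int yy = y + i - R;
        if (yy >= 0 && yy < H) out[y][x] += w[i] * in[yy][x];
      }
  return out;
}

// Reference: full 2-D convolution with the outer-product kernel wy (x) wx.
Image Conv2D(const Image& in, const std::vector<float>& wy,
             const std::vector<float>& wx) {
  int H = in.size(), W = in[0].size(), Ry = wy.size() / 2, Rx = wx.size() / 2;
  Image out(H, std::vector<float>(W, 0.f));
  for (int y = 0; y < H; y++)
    for (int x = 0; x < W; x++)
      for (int j = 0; j < (int)wy.size(); j++)
        for (int i = 0; i < (int)wx.size(); i++) {
          int yy = y + j - Ry, xx = x + i - Rx;
          if (yy >= 0 && yy < H && xx >= 0 && xx < W)
            out[y][x] += wy[j] * wx[i] * in[yy][xx];
        }
  return out;
}
```

The two-pass route does O(H*W*(Ky+Kx)) work instead of O(H*W*Ky*Kx), which is why the kernel is worth having.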

Supports Frames-first, Channel-last, wraps
ConvolutionGPU applying several passes,
for number of data axes in [1, 3].
Simple sanity test, convolution already tested.

Signed-off-by: Krzysztof Lecki <klecki@nvidia.com>
/**
* @brief Apply convolution in all spatial axes, starting from the innermost to outermost.
* If channel axis is present, the convolution is not applied there.
* If it is marqed as sequence, the first data axis is considered as temporal axis
Contributor

Suggested change
* If it is marqed as sequence, the first data axis is considered as temporal axis
* If it is marked as sequence, the outermost dimension denotes frames and
* convolution is not applied to it.

Contributor Author

Done

Comment on lines 97 to 98
req.AddInputSet(req_inner, false);
req.AddInputSet(req_outer, false);
Contributor

Suggested change
req.AddInputSet(req_inner, false);
req.AddInputSet(req_outer, false);
req.AddInputSet(req_inner, true);
req.AddInputSet(req_outer, true);

Can't you reuse the scratchpad? It should be possible.

Contributor

I see that the SeparableConvolutionXXX kernels are the only ones using this function. And... it's bugged. It's also used incorrectly, since AddInputSet concatenates the list of outputs.

Contributor Author

Huh, I missed that append. I think it was also used somewhere else; I just wanted a way of accumulating the requirements without dealing with the internals of the KernelRequirements representation. Will fix.

Contributor Author

Done

Comment on lines 142 to 144
req.AddInputSet(req_inner, false);
req.AddInputSet(req_middle, false);
req.AddInputSet(req_outer, false);
Contributor

Suggested change
req.AddInputSet(req_inner, false);
req.AddInputSet(req_middle, false);
req.AddInputSet(req_outer, false);
req.AddInputSet(req_inner, true);
req.AddInputSet(req_middle, true);
req.AddInputSet(req_outer, true);

?

static constexpr int axes = 1;
static constexpr int sequence_axes = static_cast<int>(is_sequence);
static constexpr int channel_axes = static_cast<int>(has_channels);
static constexpr int ndim = sequence_axes + axes + channel_axes;
Contributor

It seems to be the same for all specializations; can we extract this?

Contributor Author

Do you have any idea how to make it better? I can maybe put it in some

template <int axes, bool is_sequence, bool has_channels>
struct calc_ndim {
  static constexpr int ndim =
      static_cast<int>(is_sequence) + axes + static_cast<int>(has_channels);
};

and skip the sequence_axes & channel_axes members, but I'm not sure how much better that is.

Contributor

Just an observation; if you don't see a better way, leave it as it is.

Contributor Author

Maybe a constexpr function could also work.

Contributor Author

Hmm, actually 3 of those member variables are used directly in some kernel parameter definitions or calls, so I will just leave it.
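For illustration, the constexpr-function alternative floated in this thread might look like the sketch below. This is not the code that was merged (the struct with member constants was kept), just a hedged sketch of the idea:

```cpp
#include <cassert>

// Sketch: total ndim is the number of data axes plus one optional frame
// (sequence) axis and one optional channel axis.
constexpr int calc_ndim(int axes, bool is_sequence, bool has_channels) {
  return static_cast<int>(is_sequence) + axes + static_cast<int>(has_channels);
}

static_assert(calc_ndim(1, false, false) == 1, "plain 1D data");
static_assert(calc_ndim(2, false, true) == 3, "HWC image");
static_assert(calc_ndim(3, true, true) == 5, "frames-first video with channels");
```

As the author notes, the struct form wins here because sequence_axes and channel_axes are themselves used in kernel parameter definitions, which a single function would not expose.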

Signed-off-by: Krzysztof Lecki <klecki@nvidia.com>
Signed-off-by: Krzysztof Lecki <klecki@nvidia.com>
Signed-off-by: Krzysztof Lecki <klecki@nvidia.com>
@klecki
Contributor Author

klecki commented Sep 30, 2020

!build

@dali-automaton
Collaborator

CI MESSAGE: [1666358]: BUILD STARTED

@klecki klecki mentioned this pull request Sep 30, 2020
int num_samples_;

void SetDataShape() {
TensorShape<> target_shape = {64, 64, 64};
Contributor

Use non-cubic data.

Contributor Author

Done

target_shape = shape_cat(target_shape, 3);
if (kFrames)
target_shape = shape_cat(20, target_shape);
data_shape_ = uniform_list_shape<kNdim>(1, target_shape.to_static<kNdim>());
Contributor

Use non-uniform list shape in the batch.

Contributor Author

Done

Comment on lines 110 to 122
auto req = kernel_gpu.Setup(ctx_gpu, data_shape_, window_dims_);

ScratchpadAllocator scratch_alloc;
scratch_alloc.Reserve(req.scratch_sizes);
auto scratchpad = scratch_alloc.GetScratchpad();
ctx_gpu.scratchpad = &scratchpad;

kernel_gpu.Run(ctx_gpu, out_gpu_v, in_gpu_v, window_v);

auto out_cpu_v = output_.cpu(0);
cudaDeviceSynchronize();
CUDA_CALL(cudaGetLastError());
Check(out_cpu_v, baseline_out_v);
Contributor

Run Setup and Run at least twice with different data to make sure that there's no accumulation of state in internal vectors or something like that.

Contributor Author

done

@dali-automaton
Collaborator

CI MESSAGE: [1666358]: BUILD PASSED

return result;
}

inline scratch_sizes_t GetSumScratch(const scratch_sizes_t &a, const scratch_sizes_t &b) {
Contributor

Suggested change
inline scratch_sizes_t GetSumScratch(const scratch_sizes_t &a, const scratch_sizes_t &b) {
inline scratch_sizes_t AppendScratchSize(const scratch_sizes_t &a, const scratch_sizes_t &b, int alignment = 64) {

Contributor

Also, use this function in AddInputSet - the existing implementation is buggy.

Contributor Author

done

inline scratch_sizes_t GetSumScratch(const scratch_sizes_t &a, const scratch_sizes_t &b) {
scratch_sizes_t result;
for (size_t i = 0; i < result.size(); i++) {
result[i] = a[i] + b[i];
Contributor

Suggested change
result[i] = a[i] + b[i];
result[i] = align_up(a[i], alignment) + b[i];

Contributor Author

done
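The suggested fix (align each accumulated block up before adding the next one) can be illustrated as below. The align_up helper is shown inline here for self-containment; array size 4 stands in for AllocType::Count, and both are assumptions of this sketch rather than DALI's exact definitions:

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Round n up to the next multiple of alignment (must be a power of two).
inline size_t align_up(size_t n, size_t alignment) {
  return (n + alignment - 1) & ~(alignment - 1);
}

// One size per allocation type; 4 is a placeholder for AllocType::Count.
using scratch_sizes_t = std::array<size_t, 4>;

// Sketch of the suggested AppendScratchSize: when two scratch blocks live
// in the same buffer one after another, the second one starts at an aligned
// offset, so the first size must be aligned up before the sizes are added.
inline scratch_sizes_t AppendScratchSize(const scratch_sizes_t &a,
                                         const scratch_sizes_t &b,
                                         size_t alignment = 64) {
  scratch_sizes_t result;
  for (size_t i = 0; i < result.size(); i++)
    result[i] = align_up(a[i], alignment) + b[i];
  return result;
}
```

Without the align_up, the plain a[i] + b[i] sum can under-reserve: a 100-byte first block followed by a 64-byte-aligned second block actually needs 128 + b[i] bytes, which is the bug the reviewer points at.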

@@ -25,13 +25,31 @@
namespace dali {
namespace kernels {

using scratch_sizes_t = std::array<size_t, static_cast<size_t>(AllocType::Count)>;

inline scratch_sizes_t GetMaxScratch(const scratch_sizes_t &a, const scratch_sizes_t &b) {
Contributor

nitpick

Suggested change
inline scratch_sizes_t GetMaxScratch(const scratch_sizes_t &a, const scratch_sizes_t &b) {
inline scratch_sizes_t MaxScratchSize(const scratch_sizes_t &a, const scratch_sizes_t &b) {

Contributor Author

done
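For contrast with the summing combiner, the max-combining function being renamed here serves the reuse case: when sequential passes share one scratchpad, only the larger of the two requirements has to be satisfied. A self-contained sketch (array size 4 again standing in for AllocType::Count):

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>

// One size per allocation type; 4 is a placeholder for AllocType::Count.
using scratch_sizes_t = std::array<size_t, 4>;

// When two passes run one after another and reuse the same scratchpad,
// the combined requirement is the element-wise maximum, not the sum.
inline scratch_sizes_t MaxScratchSize(const scratch_sizes_t &a,
                                      const scratch_sizes_t &b) {
  scratch_sizes_t result;
  for (size_t i = 0; i < result.size(); i++)
    result[i] = std::max(a[i], b[i]);
  return result;
}
```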

}

void FillData() {
ConstantFill(input_.cpu(), 1);
Contributor

This is not a particularly strong test...

Contributor Author

It's not; it's mostly a sanity test, as the actual testing is done in ConvolutionGpu and this is a simple wrapper. The operator brings the full load of Python tests as well.

Contributor Author

Eh, added yet another similar test.

void FillData() {
ConstantFill(input_.cpu(), 1);
for (int i = 0; i < kAxes; i++) {
ConstantFill(kernel_window_[i].cpu(), 1);
Contributor

How about filling the windows with a pattern:
1 2 3 4 3 2 1
?
With a constant input you could still calculate the reference as a product of the sums of the windows.

Contributor Author

done

Signed-off-by: Krzysztof Lecki <klecki@nvidia.com>
@klecki
Contributor Author

klecki commented Sep 30, 2020

!build

@dali-automaton
Collaborator

CI MESSAGE: [1666944]: BUILD STARTED

@dali-automaton
Collaborator

CI MESSAGE: [1666944]: BUILD PASSED

@klecki klecki merged commit 1491278 into NVIDIA:master Oct 1, 2020
@klecki klecki deleted the separable-conv-gpu branch October 1, 2020 09:30
4 participants