Add Convolution CPU kernel #1987

klecki · 2020-05-28T20:39:47Z

Signed-off-by: Krzysztof Lecki <klecki@nvidia.com>

Why we need this PR?

This is part of the Gaussian Filter Operator effort.

What happened in this PR?

What solution was applied:
A DALI kernel was implemented that convolves a kernel window with an n-dimensional tensor along a selected axis.
The selected axis is traversed using a sliding window (a cyclic buffer), adding only one new pixel for every output pixel. Because the window input is then (almost) contiguous in memory, the convolution with the kernel window should be fast even if the axis has large strides.
The cyclic window is used for the non-innermost cases to allow "in-place" computation, with the cyclic buffer memory requested through the scratchpad. It now keeps lanes: a contiguous interval of channel values that is used as the i-th window element in the convolution for all of them.
The non-in-place innermost dimension uses the flattened direct loop from @mzient; it needs an additional check for the big-window case.
Affected modules and functionalities:
Kernels
Key points relevant for the review:
Any more edge cases?
Where to apply the fix in ConvolveInnerDim; the problematic parts are mentioned in the comments there.
Validation and testing:
Gtest test added
Documentation (including examples):
A bit of docstrings

JIRA TASK: [DALI-1425]
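
A rough illustration of the sliding-window traversal described above, as a minimal sketch rather than the kernel in this PR. `ConvolveAxisSketch` is a hypothetical name, it handles a single lane of `float` values, and it zero-pads outside the axis for brevity, which is an assumption and not necessarily the border handling used here. Each output pixel pulls exactly one new input pixel into the cyclic buffer.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper, not the DALI kernel: convolves one strided axis of `in`
// into `out`, treating everything outside the axis as zero.
inline void ConvolveAxisSketch(float* out, const float* in, int64_t axis_size,
                               int64_t stride, const std::vector<float>& window) {
  assert(window.size() % 2 == 1);  // the window must have a center element
  const int diameter = static_cast<int>(window.size());
  const int radius = (diameter - 1) / 2;
  std::vector<float> ring(diameter, 0.0f);  // cyclic buffer with the current window
  int start = 0;                            // slot holding the oldest element
  // Prime the buffer for the window centered on x = 0; slot k holds x = k - radius,
  // so the first `radius` slots stay zero (left border).
  for (int k = radius; k < diameter; k++) {
    const int64_t x = k - radius;
    ring[k] = x < axis_size ? in[x * stride] : 0.0f;
  }
  for (int64_t xout = 0; xout < axis_size; xout++) {
    float acc = 0.0f;
    for (int k = 0; k < diameter; k++) {
      acc += ring[(start + k) % diameter] * window[k];
    }
    out[xout * stride] = acc;
    // Slide by one: the oldest slot is overwritten with the single new pixel.
    const int64_t xnew = xout + radius + 1;
    ring[start] = xnew < axis_size ? in[xnew * stride] : 0.0f;
    start = (start + 1) % diameter;
  }
}
```

In this sketch every read from `in` is at a position ahead of the last write to `out`, so passing the same pointer for both works as an in-place convolution.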

klecki · 2020-05-28T20:40:40Z

!build

dali-automaton · 2020-05-28T21:16:06Z

CI MESSAGE: [1354581]: BUILD STARTED

dali-automaton · 2020-05-28T22:55:27Z

CI MESSAGE: [1354581]: BUILD PASSED

klecki · 2020-05-29T15:40:40Z

!build

dali-automaton · 2020-05-29T15:45:58Z

CI MESSAGE: [1356721]: BUILD STARTED

dali-automaton · 2020-05-29T16:58:39Z

CI MESSAGE: [1356721]: BUILD PASSED

JanuszL · 2020-06-01T11:08:11Z

dali/kernels/imgproc/convolution/convolution_cpu_test.cc

+    ::testing::Types<cpw_params<1, true>, cpw_params<3, true>, cpw_params<1, false>>;
+
+TYPED_TEST_P(CyclicPixelWrapperTest, FillAndCycle) {
+  constexpr int size = 6;


Can you test if the buffer actually holds up to 6 samples?

It will prevent you from pushing more only in a debug configuration with asserts enabled. Too many PushBack calls without enough PopFront calls should be considered undefined behavior. I check that the pointers to elements map to the offsets I would expect, so as long as you don't abuse it, it works fine.
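
For readers following along, the contract discussed here could look roughly like the sketch below. `CyclicBufferSketch` is a hypothetical single-lane class over caller-provided storage; the method names echo the ones in this thread, but it is not the class from the PR. Overflow and underflow are guarded only by `assert`, so a release build performs no check.

```cpp
#include <cassert>

// Hypothetical single-lane cyclic buffer, for illustration only.
class CyclicBufferSketch {
 public:
  CyclicBufferSketch(int* storage, int capacity) : data_(storage), capacity_(capacity) {}

  void PushBack(int value) {
    assert(elements_ < capacity_);  // only checked when asserts are enabled
    data_[end_] = value;
    end_ = Wrap(end_ + 1);
    elements_++;
  }

  void PopFront() {
    assert(elements_ > 0);
    start_ = Wrap(start_ + 1);
    elements_--;
  }

  // Pointer to the idx-th element counted from the current front.
  const int* GetElement(int idx) const { return data_ + Wrap(start_ + idx); }
  int Size() const { return elements_; }
  bool Empty() const { return elements_ == 0; }

 private:
  // Valid for positions in [0, 2 * capacity_).
  int Wrap(int pos) const { return pos >= capacity_ ? pos - capacity_ : pos; }

  int* data_;
  int capacity_;
  int start_ = 0, end_ = 0, elements_ = 0;
};
```

A test in the spirit of the reply above would fill the buffer, pop one element, push one more, and compare the returned pointers against the expected offsets in the backing storage.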

klecki · 2020-06-05T19:06:28Z

!build

dali-automaton · 2020-06-05T19:11:01Z

CI MESSAGE: [1374247]: BUILD STARTED

dali-automaton · 2020-06-05T19:45:58Z

CI MESSAGE: [1374247]: BUILD FAILED

klecki · 2020-06-05T20:48:21Z

!build

dali-automaton · 2020-06-05T21:27:29Z

CI MESSAGE: [1374667]: BUILD STARTED

dali-automaton · 2020-06-05T23:25:56Z

CI MESSAGE: [1374667]: BUILD PASSED

JanuszL · 2020-06-08T16:54:49Z

dali/kernels/imgproc/convolution/convolution_cpu.h

+        out_axis[xout * channels + c] = ConvertSat<Out>(acc * scale);
+      }
+    }
+    // hotfix?


As I mentioned in the PR description, there is one case with window > image. I think I can either stop here or run through the loops below; I don't know which is better, or whether that's the only such case.

Early exit sounds better.

mzient · 2020-06-08T17:19:36Z

dali/kernels/imgproc/convolution/convolution_cpu.h

+    // this is probably not true, after we skipped the loop above
+    // x0 = axis_size - window_size + 1;
+    // xout = x0 + radius;
+    x0 = flat_x / channels;
+    xout = flat_xout / channels;
+    // we either stopped with (full window - 1) fitting before the end of image,
+    // or we reached the end?


These comments are quite confusing.

mzient · 2020-06-08T17:21:13Z

dali/kernels/imgproc/convolution/convolution_cpu.h

+    }
+    // hotfix?
+    // if (xout == axis_size) {
+    //   return;


Not return, but break.

Is it better to exit here or to let it run through all the loops?
Do you see any other edge case than the one with window > image?

Actually, not even break: this should be a continue, since there are still outer dimensions to process. However, it's an edge case and maybe it's not worth checking. In 99% of cases it just won't happen and the loops would be empty anyway, so on average there's a performance penalty from the additional if. And, to corroborate the hypothesis that this special case is a bad idea, we've both already made a mistake here (return/break instead of continue).

Yeah, I let it go through all the loops, recovering the non-flattened index.

mzient · 2020-06-09T09:41:46Z

dali/kernels/imgproc/convolution/convolution_cpu.h

+    }
+  }
+
+  int NumLanes() {


Suggested change

int NumLanes() {

int NumLanes() const noexcept {

mzient · 2020-06-09T09:44:07Z

dali/kernels/imgproc/convolution/convolution_cpu.h

+  }
+
+  int NumLanes() {
+    return std::min(num_lanes_, max_lanes);


It's already enforcing it in the constructor... is this function even necessary?

Suggested change

return std::min(num_lanes_, max_lanes);

return num_lanes_;

My intention was to hint to the compiler that whenever I use num_lanes_, it is no greater than max_lanes. But I didn't test whether that helps much.

Maybe if it got unrolled... but I don't think any compiler would be that smart.

mzient · 2020-06-09T09:44:15Z

dali/kernels/imgproc/convolution/convolution_cpu.h

+    }
+  }
+
+  int Size() {


Suggested change

int Size() {

int Size() const {

mzient · 2020-06-09T09:44:22Z

dali/kernels/imgproc/convolution/convolution_cpu.h

+    return elements_;
+  }
+
+  bool Empty() {


Suggested change

bool Empty() {

bool Empty() const {

mzient · 2020-06-09T09:44:32Z

dali/kernels/imgproc/convolution/convolution_cpu.h

+  }
+
+ private:
+  void WrapPosition(int& pos) {


Suggested change

void WrapPosition(int& pos) {

void WrapPosition(int& pos) const {

Put const everywhere I could.

klecki · 2020-06-09T11:14:24Z

!build

dali-automaton · 2020-06-09T12:23:24Z

CI MESSAGE: [1380989]: BUILD STARTED

JanuszL · 2020-06-09T13:08:36Z

dali/kernels/imgproc/convolution/convolution_cpu.h

+template <bool has_channels, typename Out, typename In, typename W, int ndim>
+void ConvolveInnerDim(Out* out, const In* in, const W* window, int window_size, int radius,
+                      const TensorShape<ndim>& shape, const TensorShape<ndim>& strides, W scale) {
+  constexpr int last_dim = has_channels ? ndim - 2 : ndim - 1;


Maybe there should be some assert based on ndim?

static_assert(ndim >= (has_channels ? 2 : 1))
something like that
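
A minimal sketch of that suggestion, assuming the check sits next to the `last_dim` computation. `LastSpatialDim` is a hypothetical helper introduced only for this illustration; it is not part of the PR.

```cpp
// Hypothetical stand-in for the dimension bookkeeping in ConvolveInnerDim.
template <bool has_channels, int ndim>
constexpr int LastSpatialDim() {
  static_assert(ndim >= (has_channels ? 2 : 1),
                "Need at least one spatial dimension besides the optional channel dimension");
  return has_channels ? ndim - 2 : ndim - 1;
}
```

With this in place, instantiating the helper for a channel-last tensor with `ndim == 1` fails at compile time instead of yielding a negative dimension index.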

dali-automaton · 2020-06-09T14:01:12Z

CI MESSAGE: [1380989]: BUILD PASSED

JanuszL · 2020-06-09T14:05:58Z

dali/kernels/imgproc/convolution/convolution_cpu.h

+    int64_t xout = 0;
+    Out* out_axis = &out[o * axis_stride];
+    const In* in_axis = &in[o * axis_stride];
+    for (; x0 < 0 && xout < axis_size; x0++, xout++) {


I would add a comment that this is left border.

JanuszL · 2020-06-09T14:07:08Z

dali/kernels/imgproc/convolution/convolution_cpu.h

+
+template <bool has_channels, typename Out, typename In, typename W, int ndim>
+void ConvolveInnerDim(Out* out, const In* in, const W* window, int window_size, int radius,
+                      const TensorShape<ndim>& shape, const TensorShape<ndim>& strides, W scale) {


What is the difference between radius and window_size?

diameter = window_size
radius = (diameter - 1) / 2

So do you need it as a function argument?

I didn't :P

JanuszL · 2020-06-09T14:09:39Z

dali/kernels/imgproc/convolution/convolution_cpu.h

+      out_axis[flat_xout] = ConvertSat<Out>(acc * scale);
+    }
+    // get back from flat coordinates
+    x0 = flat_x / channels;


Right border.

jantonguirao · 2020-06-09T12:24:36Z

dali/kernels/imgproc/convolution/convolution_cpu_test.cc

+        if (i + d >= 0 && i + d < len) {
+          out[i * stride + c] += in[(i + d) * stride + c] * window[d + r];
+        } else {
+          out[i * stride + c] +=


suggestion: I think that you could just use this and remove the if block and the condition, since the out-of-bounds check already happens inside idx_reflect_101

Done, reworked it a bit with ConvertSat and accumulator, I need it in follow-up, might as well land here.
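
To make the suggestion concrete, here is a sketch of a baseline that relies entirely on the index helper for border handling. `ReflectIdx101` and `BaselineConvolveSketch` are hypothetical names written for this note; the real test uses DALI's `idx_reflect_101` and `ConvertSat`, whose exact signatures are not shown in this thread, so a plain `float` accumulator is assumed instead.

```cpp
#include <vector>

// Hypothetical reflect-101 index mapping for an axis of length `len` (len >= 1):
// ... 2 1 | 0 1 2 ... len-1 | len-2 len-3 ...  (the edge pixel is not repeated).
inline int ReflectIdx101(int idx, int len) {
  if (len == 1) return 0;
  const int period = 2 * (len - 1);
  idx %= period;
  if (idx < 0) idx += period;
  return idx < len ? idx : period - idx;
}

// Reference 1D convolution over an interleaved (channel-last) axis.
// Assumes window.size() is odd and `out` already holds len * channels elements.
inline void BaselineConvolveSketch(std::vector<float>& out, const std::vector<float>& in,
                                   const std::vector<float>& window, int len, int channels) {
  const int r = (static_cast<int>(window.size()) - 1) / 2;
  for (int i = 0; i < len; i++) {
    for (int c = 0; c < channels; c++) {
      float acc = 0.0f;
      for (int d = -r; d <= r; d++) {
        // No explicit bounds check: the index helper covers both borders.
        acc += in[ReflectIdx101(i + d, len) * channels + c] * window[d + r];
      }
      out[i * channels + c] = acc;
    }
  }
}
```

Because the helper wraps repeatedly, the same loop also covers the case where the window is larger than the axis.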

jantonguirao · 2020-06-09T12:27:38Z

dali/kernels/imgproc/convolution/convolution_cpu_test.cc

+    Kernel kernel;
+
+    auto req = kernel.Setup(ctx, in_.shape, k_win_.num_elements());
+    // this is painful


you could always use the kernel manager

jantonguirao · 2020-06-09T12:39:08Z

dali/kernels/imgproc/convolution/convolution_cpu_test.cc

+TYPED_TEST_P(CyclicWindowWrapperTest, FillAndCycle) {
+  constexpr int size = 6;
+  constexpr int num_lanes = TypeParam::num_lanes;
+  int tmp_buffer[size * num_lanes];    // NOLINT


what is the linter complaining about here?

It thinks that a value that is not named kSomethingSomething is not constexpr and will use a VLA.

jantonguirao · 2020-06-09T12:54:47Z

dali/kernels/imgproc/convolution/convolution_cpu.h

+  }
+
+  /**
+   * @brief Drop one element consistine of `NumLanes()` lanes from the buffer, moving the start


Suggested change

* @brief Drop one element consistine of `NumLanes()` lanes from the buffer, moving the start

* @brief Drop one element consisting of `NumLanes()` lanes from the buffer, moving the start

jantonguirao · 2020-06-09T14:19:52Z

dali/kernels/imgproc/convolution/convolution_cpu.h

+   * @brief Drop one element consistine of `NumLanes()` lanes from the buffer, moving the start
+   *        element.
+   */
+  void PopElement() {


Suggested change

void PopElement() {

void PopFront() {

suggestion

jantonguirao · 2020-06-09T15:07:40Z

dali/kernels/imgproc/convolution/convolution_cpu.h

+  for (int64_t o = 0; o < outer_elements; o++) {
+    int64_t x0 = -radius;
+    int64_t xout = 0;
+    Out* out_axis = &out[o * axis_stride];


o * axis_stride could be calculated only once, or make o incremented by axis_stride

I think it's more readable that way; it's one multiply per axis, and that axis will do window_size * axis_elements multiplies anyway.

jantonguirao · 2020-06-09T15:14:38Z

dali/kernels/imgproc/convolution/convolution_cpu.h

+        out_axis[xout * channels + c] = ConvertSat<Out>(acc * scale);
+      }
+    }
+    int64_t flat_x = x0 * channels;


I'd add a note here that this central part won't take effect when the window is bigger than the extent of the axis

jantonguirao · 2020-06-09T15:23:31Z

dali/kernels/imgproc/convolution/convolution_cpu.h

+void ConvolveInplaceOuterLoop(Out* out, const In* in, const W* window,
+                              const TensorShape<ndim>& shape, const TensorShape<ndim>& strides,
+                              int diameter, In* input_window_buffer, W scale = 1) {
+  int64_t outer_elements = volume(&shape[0], &shape[axis]);


can you add a static assert to check that axis is within the valid range? Unless you are sure it is because it's checked before

Already in Kernel:

static_assert(0 <= axis && axis < (has_channels ? ndim - 1 : ndim),
              "Selected axis must be in [0, ndim) when there is no channel axis, or in [0, ndim "
              "- 1) for channel-last input");

jantonguirao · 2020-06-09T15:40:15Z

dali/kernels/imgproc/convolution/convolution_cpu.h

+  KernelRequirements Setup(KernelContext& ctx, const TensorShape<ndim>& in_shape, int window_size) {
+    KernelRequirements req;
+    ScratchpadEstimator se;
+    DALI_ENFORCE(window_size % 2 == 1,


is window_size = 1 valid?

Should be. It would basically scale the image by the single window value. It doesn't make much sense to use it, but I don't see the reason to forbid it.

jantonguirao · 2020-06-09T15:42:47Z

dali/kernels/imgproc/convolution/convolution_cpu_test.cc

+  }
+}
+
+void baseline_dot(span<int> result, span<const int> input, span<const int> window, int in_offset) {


Suggested change

void baseline_dot(span<int> result, span<const int> input, span<const int> window, int in_offset) {

void BaselineDot(span<int> result, span<const int> input, span<const int> window, int in_offset) {

nitpick: consistency with the rest of the file

I keep naming stuff with snake_case all the time, fixed.

klecki · 2020-06-10T15:17:14Z

!build

dali-automaton · 2020-06-10T15:20:38Z

CI MESSAGE: [1384404]: BUILD STARTED

dali-automaton · 2020-06-10T16:03:43Z

CI MESSAGE: [1384404]: BUILD FAILED

dali-automaton · 2020-06-10T17:59:40Z

CI MESSAGE: [1384404]: BUILD PASSED

klecki force-pushed the convolution-cpu-kernel branch from 8e2bd9c to 961d701 Compare June 5, 2020 19:06

klecki force-pushed the convolution-cpu-kernel branch from 961d701 to f01cc37 Compare June 5, 2020 20:48

klecki mentioned this pull request Jun 8, 2020

Separable convolution #2009

Merged

klecki added 2 commits June 9, 2020 11:45

Add Convolution CPU kernel

0dd4beb

Signed-off-by: Krzysztof Lecki <klecki@nvidia.com>

Rework

8ca08fa

Keep in-place, use fixed-width of lanes for non-innermost dims.
Add optimized version for innermost dim.
Fix in-place with right border so it uses cyclic buffer for border values that are already overwritten.

Signed-off-by: Krzysztof Lecki <klecki@nvidia.com>

klecki force-pushed the convolution-cpu-kernel branch from f01cc37 to 8ca08fa Compare June 9, 2020 10:01

Add const

aa119b9

Signed-off-by: Krzysztof Lecki <klecki@nvidia.com>

NVIDIA deleted a comment from klecki Jun 9, 2020

jantonguirao approved these changes Jun 9, 2020

klecki added 2 commits June 10, 2020 16:45

Review changes

934597e

Signed-off-by: Krzysztof Lecki <klecki@nvidia.com>

More updates

eef7e92

Signed-off-by: Krzysztof Lecki <klecki@nvidia.com>

JanuszL approved these changes Jun 10, 2020

klecki merged commit 05f9eec into NVIDIA:master Jun 10, 2020

klecki deleted the convolution-cpu-kernel branch June 10, 2020 18:05
