
Add GPU filter kernel #4298

Merged: 12 commits merged into NVIDIA:main on Dec 21, 2022
Conversation

@stiepan (Member) commented Sep 29, 2022

Adds a GPU filter kernel (for 2D and 3D convolutions).

Signed-off-by: Kamil Tokarski ktokarski@nvidia.com

Category:

New feature (non-breaking change which adds functionality)

Description:

Adds a GPU filter kernel with tests against a straightforward CPU baseline.

Supports the following border modes: reflect 101, reflect 1001, wrap, clamp, constant.
ROI handling is achieved by passing outputs smaller than the inputs and manipulating the anchors. The operator will use this to implement "valid" mode, where the filter lies fully within the input for every product computed.

For reasonably sized filters and channel counts, it provides an implementation that first transfers the input data to GPU shared memory; otherwise it falls back to a direct (global-memory) implementation.

Additional information:

Perf

(20 batches of random images from ImageNet, decoded once and repeated "synthetically" in the loop, batch size = 128)

For smaller kernels it outperforms our separable-conv ops.
A 2D separable convolution run simply as two launches of the new convolution outperforms the current implementation for all window sizes that the current implementation supports (see the identity below).
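For reference, the standard identity behind the two-launch approach (not code from this PR): a separable 2D filter factors as K(i, j) = k_r(i) * k_s(j), so

(I * K)(y, x) = \sum_{i,j} k_r(i)\, k_s(j)\, I(y - i, x - j) = \sum_i k_r(i) \Big( \sum_j k_s(j)\, I(y - i, x - j) \Big)

The inner sum is a 1D horizontal pass and the outer sum a 1D vertical pass over its result, which is exactly what two launches of the new kernel compute.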

It supports 2D and 3D convolutions.

Affected modules and functionalities:

Key points relevant for the review:

Tests:

  • Existing tests apply
  • New tests added
    • Python tests
    • GTests
    • Benchmark
    • Other
  • N/A

Checklist

Documentation

  • Existing documentation applies
  • Documentation updated
    • Docstring
    • Doxygen
    • RST
    • Jupyter
    • Other
  • N/A

DALI team only

Requirements

  • Implements new requirements
  • Affects existing requirements
  • N/A

REQ IDs: FILTER.01-FILTER.04, FILTER.06-FILTER.12

JIRA TASK: DALI-2949

@stiepan (Member Author) commented Sep 29, 2022

!build

@dali-automaton (Collaborator):

CI MESSAGE: [6043632]: BUILD STARTED

@dali-automaton (Collaborator):

CI MESSAGE: [6043632]: BUILD FAILED

@stiepan (Member Author) commented Sep 29, 2022

!build

@dali-automaton (Collaborator):

CI MESSAGE: [6043975]: BUILD STARTED

@dali-automaton (Collaborator):

CI MESSAGE: [6043975]: BUILD PASSED

return "wrap";
case BoundaryType::TRANSPARENT:
return "transparent";
case BoundaryType::ISOLATED:
Contributor:

This is just a flag that can be combined with other border modes.

Member Author:

Good point; it is no longer used by the kernel. The "valid" mode will be part of the operator; at the kernel level it can be achieved with the implicit ROI handling (output smaller than the input + anchor).

Comment on lines 466 to 468
case BoundaryType::ISOLATED:
RunKernelBorderRemap<filter::ROIOnlyValid>(ctx, std::move(launch_kernel));
break;
Contributor:

ROI handling (including this one) can be achieved by specifying the output as smaller than input and appropriate anchors. To just use the pixels that have full halos present, you'd go with output_size[d] = input_size[d] - kernel_size[d] + 1 and anchored at (0, 0). Note that this mechanism allows you to use totally arbitrary ROI.
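A minimal sketch of that rule (hypothetical helper, not code from this PR), assuming plain int extents per spatial dimension:

#include <vector>

struct ValidRoi {
  std::vector<int> out_size;  // output extent per spatial dim
  std::vector<int> anchor;    // anchor per spatial dim (all zeros for "valid" mode)
};

ValidRoi ValidModeRoi(const std::vector<int>& input_size,
                      const std::vector<int>& kernel_size) {
  ValidRoi roi;
  for (size_t d = 0; d < input_size.size(); d++) {
    // keep only the positions where the whole filter window lies inside the input
    roi.out_size.push_back(input_size[d] - kernel_size[d] + 1);
    roi.anchor.push_back(0);
  }
  return roi;
}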

Member Author:

Done.

* @{
*/

template <bool degenerated_extents>
Contributor:

I don't see where degenerated_extents is used

Member Author (@stiepan, Oct 3, 2022):

Oh, good catch. I first implemented the idx reflect by hand, then realized we already have that, but without this degenerated_extents optimization.

Reflecting the input with 101 goes into an infinite loop if the width is 1: you need to explicitly check for that before the loop, and it is a visible cost for smaller filters. I added this as a hint and changed the condition to if (degenerated_extents && width <= 1). Now I wonder if I should get rid of the optimization or add it to the existing utility. I am in favour of adding it to the utility.
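For illustration, a sketch of the guarded 101-reflection being discussed (hypothetical code, not the PR's utility); when the hint says extents may be degenerate, the early return is what prevents the otherwise infinite loop for size == 1:

template <bool degenerated_extents>
__host__ __device__ inline int ReflectIdx101(int idx, int size) {
  if (degenerated_extents && size <= 1)
    return 0;  // a single-element extent maps everything to index 0
  while (idx < 0 || idx >= size) {
    if (idx < 0)
      idx = -idx;                // reflect around the first element without repeating it
    if (idx >= size)
      idx = 2 * size - 2 - idx;  // reflect around the last element without repeating it
  }
  return idx;
}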

Member Author:

After all the changes it seemed to no longer make a visible difference, so I dropped the idea.


template <typename Out, typename In, typename W, bool has_channel_dim, bool has_sequence_dim>
struct Filter2dGpu {
/* It computes a corellation of the input and the filter.
Contributor:

Suggested change
/* It computes a corellation of the input and the filter.
/* It computes a correlation of the input and the filter.

Isn't it a convolution?

Member Author (@stiepan, Oct 3, 2022):

I think that, strictly speaking, in a convolution the filter is "visited" in the reverse order relative to the signal. Anyway, I saw similar remarks in some of the Python libs, so I thought such a clarifying remark makes sense here. On the other hand, maybe it only adds noise?
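For the record, the two definitions differ only in the sign of the filter offset (for a 1D filter f and signal g; the 2D/3D cases are analogous):

\text{correlation:}\quad (f \star g)[n] = \sum_k f[k]\, g[n + k]
\text{convolution:}\quad (f * g)[n] = \sum_k f[k]\, g[n - k]

They coincide for symmetric filters, which is why many libraries compute the former while calling the operation "convolution".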

Member Author:

done

*/

/**
* @brief First loads the input roi neceesary to compute the output of shape ``block_width x
Contributor:

Suggested change
* @brief First loads the input roi neceesary to compute the output of shape ``block_width x
* @brief First loads the input roi necessary to compute the output of shape ``block_width x

// over an image is costly, so try to avoid it.
BOOL_SWITCH(
has_degenerated_extents, HasDegeneratedExtents,
(using Loader = filter::InLoaderBorderRemap<filter::Reflect101<HasDegeneratedExtents>, In>;
Contributor:

Seems the template parameter is ignored in the implementation

int inner_stride,
int total_stride) const {
if (idx < 0) {
int reflect_dim_idx = (idx + 1) / inner_stride - 1;
Contributor (@jantonguirao, Oct 3, 2022):

I'd add some simple explanation here for future readers, perhaps with some example:

// First, shift by 1 towards 0, so that indices belonging to the same pixel translate to the right pixel index when dividing by the stride -  (-6, -5, -4, -3, -2, -1) + 1 -> (-5, -4, -3. -2, -1, 0)
// (-5, -4, -3, -2, -1, 0) / 3 -> (-1, -1, -1, 0, 0, 0)
// Now, shift away from 0 to revert (-1, -1, -1, 0, 0, 0) + 1 -> (-2, -2, -2, -1, -1, -1)
// We don't want to reflect the order of the channels, so we revert the order
// (-2, -1, 0, -2, -1, 0) + 3 - 1 -> (0, 1, 2, 0, 1, 2)
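A host-side rendering of the same arithmetic (illustrative only, mirroring the steps in the suggested comment; reflect-101 of the spatial coordinate is shown for the left border):

int BorderRemapInnermost(int idx, int num_channels) {
  if (idx < 0) {
    // shift by 1 towards 0 so that all indices of one pixel divide to the same value
    int reflect_dim_idx = (idx + 1) / num_channels - 1;             // negative spatial coord
    int channel_idx = (idx + 1) % num_channels + num_channels - 1;  // 0 .. num_channels - 1
    int reflected = -reflect_dim_idx;  // reflect-101 of the (small) negative coordinate
    return reflected * num_channels + channel_idx;
  }
  return idx;  // the right border is handled analogously using the total stride
}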

Contributor (@klecki) left a comment:

Mostly complaining about naming and comments, I guess nothing major, I need to take a second look tomorrow, for now looks good.

Comment on lines 35 to 77
struct ShapeDesc {
int64_t hwc;
int wc, f, h, w, c;
int filter_vol, r, s;
int filter_top_anchor, filter_left_anchor;
int in_workspace_width;
};
Contributor:

Could you provide some documentation for this?
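For reference, a best-guess annotated version of the struct (field meanings inferred from how they are used elsewhere in the kernel, not taken from the PR):

struct ShapeDesc {
  int64_t hwc;               // total sample volume h * w * c (64-bit to avoid overflow)
  int wc, f, h, w, c;        // row length (w * c), frames, height, width, channels
  int filter_vol, r, s;      // filter volume (r * s), filter height (r) and width (s)
  int filter_top_anchor, filter_left_anchor;  // filter anchor within the window
  int in_workspace_width;    // width of the input tile staged in the shm workspace
};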

Comment on lines 85 to 86
int h = h_end - h_begin;
int wc = wc_end - wc_begin;
Contributor:

It is interesting that the ROI description doesn't carry the strides of the original image. I assume they are looked up later in the shape_desc?

int inner_dim_idx = (idx + 1) % inner_stride + inner_stride - 1;
return this->border_remap(reflect_dim_idx, reflect_dim_size) * inner_stride + inner_dim_idx;
}
if (idx >= total_stride) {
Contributor:

Just leaving a comment to check how this works with the ROI - why do we only reflect when we cross the stride? Or maybe it's just a bit of weird naming.

return border_remap_strided(idx, sample_shape.w, sample_shape.c, sample_shape.wc);
}

DALI_HOST_DEV DALI_FORCEINLINE In load(const In* __restrict__ in, int y, int x,
Contributor:

I could imagine load working in "absolute" coordinates and remapping them internally in this function, unless you want to make the access explicit.

Member Author:

It is slower unfortunately.

return idx;
}

DALI_HOST_DEV DALI_FORCEINLINE In load(const In* __restrict__ in, int y, int x,
Contributor:

It looks like handling the borders inside the load would make sense here as well, as now we rely on the pass through of the functions above and detect the case in load either way. We could do it in both cases.

static_cast<int>(s),
static_cast<int>(filter_top_anchor),
static_cast<int>(filter_left_anchor),
static_cast<int>(in_workspace_width)};
Contributor:

I would love designated initializers, but it's in C++20.

int w = out_shape[num_sequence_dim + 1];
int c = has_channel_dim ? out_shape[num_sequence_dim + 2] : 1;
max_height = std::max(max_height, h);
max_width = std::max(max_width, w * c);
Contributor:

I might be highly nitpicky, but it might be clearer if we don't mix the width and w*c, and, for example, use max_row for the former?

dali/kernels/imgproc/convolution/filter_gpu.cuh (outdated; resolved)
for (int s = 0; s < sample_desc.shape.s; s++) {
int x = threadIdx.x + s * sample_desc.shape.c;
for (int r = 0; r < sample_desc.shape.r; r++) {
auto filter_coef = __ldg(filter + r * sample_desc.shape.s + s);
Contributor:

Just a random thought, did you entertain loading the filter into shared memory?

Member Author:

I did try it. The differences were not big and there was no clear winner, as it depended on the filter size.

Comment on lines 256 to 383
for (int y = 0; y < SampleDescT::lanes; y++) {
load_row(y);
}
for (int y = SampleDescT::lanes; y < SampleDescT::lanes + sample_desc.shape.r - 1; y++) {
load_row(y);
}
Contributor:

How big do we expect r to be compared to lanes? Would it make sense (and be feasible) to do something similar to:

Suggested change
for (int y = 0; y < SampleDescT::lanes; y++) {
load_row(y);
}
for (int y = SampleDescT::lanes; y < SampleDescT::lanes + sample_desc.shape.r - 1; y++) {
load_row(y);
}
for (int i = 0; i < r / lanes; i++) {
#pragma unroll
for (int y = i * lanes; y < (i + 1) * lanes; y++) {
load_row(y);
}
}
for (int y = r / lanes * lanes; y < sample_desc.shape.r - 1; y++) {
load_row(y);
}

Member Author:

I think #pragma unroll lanes should do something along those lines, but I saw no speed-ups here.

@stiepan force-pushed the filter_kernel_gpu branch 2 times, most recently from c8f3f96 to 5d8a2bb on November 21, 2022, 10:27
@stiepan (Member Author) commented Nov 23, 2022

!build

@dali-automaton (Collaborator):

CI MESSAGE: [6583354]: BUILD STARTED

@dali-automaton (Collaborator):

CI MESSAGE: [6583354]: BUILD PASSED

@stiepan (Member Author) commented Nov 23, 2022

!build

@dali-automaton (Collaborator):

CI MESSAGE: [6584316]: BUILD STARTED

@dali-automaton (Collaborator):

CI MESSAGE: [6584316]: BUILD FAILED

@dali-automaton (Collaborator):

CI MESSAGE: [6584316]: BUILD PASSED

Comment on lines 40 to 47
template <int N, typename T>
vec<N, T> rev(const vec<N, T>& v) {
vec<N, T> out;
for (int d = 0; d < N; d++) {
out[N - d - 1] = v[d];
}
return out;
}
Contributor:

This could go to vec.h. Alternatively you can simply use std::reverse(v.begin(), v.end()); (Checked it, it works)

Member Author:

As to the std::reverse, I'd like to get a copy - I use it only "for presentation purposes", i.e. in error messages.

Contributor:

This isn't even used. Please remove.

Member Author:

It will be in the operator. So let me move it to the utils. Maybe as reversed rather than rev.
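A possible shape of the moved helper (the name reversed and its final location are hypothetical); taking the argument by value yields the copy needed for error messages:

#include <algorithm>

template <int N, typename T>
vec<N, T> reversed(vec<N, T> v) {    // by value: the caller's vec stays untouched
  std::reverse(v.begin(), v.end());  // vec exposes begin()/end(), as noted above
  return v;
}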

Comment on lines 49 to 62
template <int N, typename T, typename U>
void strides(vec<N, U>& out, U& total_stride, const vec<N, T>& v) {
total_stride = 1;
for (int d = 0; d < N; d++) {
out[d] = total_stride;
total_stride *= v[d];
}
}

template <int N, typename T, typename U>
void strides(vec<N, U>& out, const vec<N, T>& v) {
U total_strides;
strides(out, total_strides, v);
}
Contributor:

We already have CalcStrides in dali/kernels/common/utils.h. I think those should work with vec.

Member Author:

It works with shape and reverses the order of the extents, which is already reversed by shape2vec.

Contributor:

Then perhaps move those to some common utils? I suggest the header I mentioned above.

Member Author:

moved to geom_utils

// (-6, -5, -4, -3, -2, -1) + 1 -> (-5, -4, -3, -2, -1, 0)
// (-5, -4, -3, -2, -1, 0) / 3 -> (-1, -1, -1, 0, 0, 0)
// Then shift back away from 0 (-1, -1, -1, 0, 0, 0) - 1 -> (-2, -2, -2, -1, -1, -1)
// Finally, with (num_channels - 1) we get the positive channels indecies
Contributor:

Suggested change
// Finally, with (num_channels - 1) we get the positive channels indecies
// Finally, with (num_channels - 1) we get the positive channel indices

Comment on lines 321 to 322
const auto& block_dim = block_setup.block_dim();
const auto& thread_idx = block_setup.thread_idx();
Contributor:

Suggested change
const auto& block_dim = block_setup.block_dim();
const auto& thread_idx = block_setup.thread_idx();
const auto& block_dim = block_setup.block_dim();
const auto& thread_idx = block_setup.thread_idx();

What's the reason behind using references here?

Member Author:

No particularly strong reason for that. It's just that I expected some of the methods to return by value and some to return a reference to the member, so const auto & seemed in place.

Contributor:

Suggested change
const auto& block_dim = block_setup.block_dim();
const auto& thread_idx = block_setup.thread_idx();
auto block_dim = block_setup.block_dim();
auto thread_idx = block_setup.thread_idx();

In device code, you should rather avoid references for such small objects.

Member Author:

done

}

/*
* The ``lanes`` paramter impacts perf in number of ways:
Contributor:

Suggested change
* The ``lanes`` paramter impacts perf in number of ways:
* The ``lanes`` parameter impacts perf in number of ways:

/*
* The ``lanes`` paramter impacts perf in number of ways:
* 1. It reduces overhead of the for loops arithmetic when iterating over the filter extents:
* We do not know the filter extents in compile time so those loops cannot be unrolled.
Contributor:

Suggested change
* We do not know the filter extents in compile time so those loops cannot be unrolled.
* We do not know the filter extents at compile time so those loops cannot be unrolled.

baseline_out_ = baseline_output_.cpu();
out_view_ = output_.gpu();
if (!T::valid_only_mode) {
kernel_gpu.Run(ctx_gpu, out_view_, in_view_, filters_view_, make_cspan(anchors_),
Contributor:

seems like you can put the kernel run outside of the if, as it's the same on both branches

Member Author:

Good catch, thanks

// Then shift back away from 0 (-1, -1, -1, 0, 0, 0) - 1 -> (-2, -2, -2, -1, -1, -1)
// Finally, with (num_channels - 1) we get the positive channels indecies
// (-2, -1, 0, -2, -1, 0) + 3 - 1 -> (0, 1, 2, 0, 1, 2)
int reflect_dim_idx = (idx + 1) / num_channels - 1;
Contributor:

A more general question. Have we considered/measured using fast_div here?

Member Author:

I was planning on checking that, but it seems there is already high register pressure in this kernel, so my first thought was that it probably won't help. But I didn't try it; I'll give it a go.

Contributor:

fast_div is only for unsigned numbers. It should still be possible to use it, though not through div_mod.

Comment on lines 169 to 170
DALI_HOST_DEV DALI_FORCEINLINE In load(const In* __restrict__ in, const ivec<axes>& coords,
const ShapeDesc<axes>& sample_shape) const {
Contributor:

A general thought: I see you are passing arguments by reference in those functions. While optimizing other kernels, we found that it generally works better to pass them by value. It'd be worth trying to see if you can squeeze out better performance by doing that.

Member Author:

That's interesting, let me check that out. My intuition was that constness + reference should perform no worse than passing by value.

template <int axes_>
struct ShapeDesc {
static constexpr int axes = axes_;

Contributor:

Alternatively, you could hold Surface<axes, T> instances, that would abstract away the multiplication by strides, etc.


protected:
template <typename T>
vec<axes, T> ShapeAsVec(const TensorShape<ndim>& shape) {
Contributor (@mzient, Nov 24, 2022):

static

Suggested change
vec<axes, T> ShapeAsVec(const TensorShape<ndim>& shape) {
static vec<axes, T> ShapeAsVec(const TensorShape<ndim>& shape) {

}

DALI_DEVICE DALI_FORCEINLINE void load_input_to_shm(const In* __restrict__ in,
const ivec2& anchored_start) const {
Contributor:

Suggested change
const ivec2& anchored_start) const {
ivec2 anchored_start) const {

Comment on lines 352 to 354
const SampleDescT& sample_desc;
const Inloader& in_loader;
const BlockSetupT& block_setup;
Contributor:

With all those references it's likely to end up reaching to memory more than you'd wish.

Comment on lines 625 to 633
with_shm_conv(
sample_desc, in_loader, block_setup, idx_workspace, in_workspace, [&](auto&& conv) {
stride_grid(block_start, grid_size, out_extents, out_strides, out_frame_stride, conv);
});

} else {
with_direct_conv(sample_desc, in_loader, block_setup, [&](auto&& conv) {
stride_grid(block_start, grid_size, out_extents, out_strides, out_frame_stride, conv);
});
Contributor (@mzient, Nov 24, 2022):

Based on our experience with optimizing stuff for H100, I would recommend splitting this kernel into two variants (and possibly launching both, if the decision is made per-sample). The use of SHM will prevent the shared memory from being used as L1 cache, which will cripple the direct variant. There should be two launches - the samples eligible for shm variant should go to shm kernel and the others should go to the direct kernel, with reduced SHM consumption.
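A hypothetical host-side sketch of that split (all names below, such as SampleDesc, RequiredShmSize, filter_shm, filter_direct and GridFor, are illustrative, not the PR's API):

void LaunchFilter(const std::vector<SampleDesc>& samples, dim3 block_dim,
                  int max_shm_per_block, cudaStream_t stream) {
  std::vector<SampleDesc> shm_group, direct_group;
  for (const auto& s : samples)  // partition samples by shared-memory eligibility
    (RequiredShmSize(s) <= max_shm_per_block ? shm_group : direct_group).push_back(s);
  if (!shm_group.empty())
    filter_shm<<<GridFor(shm_group), block_dim, MaxShmSize(shm_group), stream>>>(
        ToGpu(shm_group, stream));
  if (!direct_group.empty())
    // no dynamic shared memory requested, so it remains available as L1 cache
    filter_direct<<<GridFor(direct_group), block_dim, 0, stream>>>(
        ToGpu(direct_group, stream));
}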

Member Author:

done

DALI_DEVICE DALI_FORCEINLINE void for_each_output_point_in_log_block(const ivec2& block_start,
const ivec2& out_extents,
const Conv& conv,
const Cb&& cb) {
Contributor:

Suggested change
const Cb&& cb) {
Cb&& cb) {

or

Suggested change
const Cb&& cb) {
const Cb& cb) {

?

Member Author:

done

@stiepan (Member Author) commented Dec 15, 2022

!build

@dali-automaton (Collaborator):

CI MESSAGE: [6789928]: BUILD STARTED

@dali-automaton (Collaborator):

CI MESSAGE: [6789928]: BUILD PASSED

// the strides for the in_extents
ivec<axes> in_strides;
// The offset in shared memory that should be left
// for precomuted indices. At that offset, the
Contributor:

Suggested change
// for precomuted indices. At that offset, the
// for precomputed indices. At that offset, the

Member Author:

done

* @brief Computes the 2D or 3D (``axes_``) convolution of the input (``In``) and the
* filter (``W``). The input must have the same number of spatial dims as the filter, but
* can have extra sequence dim at the beginning (``has_sequence_dim``) and channels
* dim at the end (``has_channel_dim``). If the ``enable_roi_`` is true, the outputs and inputs
Contributor:

Isn't it the opposite?

Suggested change
* dim at the end (``has_channel_dim``). If the ``enable_roi_`` is true, the outputs and inputs
* dim at the end (``has_channel_dim``). If the ``enable_roi_`` is false, the outputs and inputs

Member Author:

done

/**
* @brief The innermost extent consists of the width and channel extents flattened.
* Thus, handling border condition for innermost extent requires extra step of computing back
* the scurrent channel and spatial position.
Contributor:

Suggested change
* the scurrent channel and spatial position.
* the current channel and spatial position.

Member Author:

done

ivec<SampleDescT::axes> start) const {
auto anchored_start = start - sample_desc_.in_shape.anchor_shift;
__syncthreads();
precompute_indices(in, anchored_start);
Contributor:

in seems unused?

Member Author (@stiepan, Dec 16, 2022):

Spot-on, I guess it is an artifact of splitting the precomputing and loading into two functions.

done

Signed-off-by: Kamil Tokarski <ktokarski@nvidia.com>
@stiepan (Member Author) commented Dec 16, 2022

!build

@dali-automaton (Collaborator):

CI MESSAGE: [6800727]: BUILD STARTED

@dali-automaton (Collaborator):

CI MESSAGE: [6800727]: BUILD PASSED

};

/**
* @brief Computes the convolution of logical block size size (logical_block_extents) accessing
Contributor:

Nitpick - this is not a "brief" description.

Member Author:

I split those comments to have a brief intro, a blank line, and the rest to follow.

return in[dot(coords, sample_shape.in_strides)];
}

In fill_value;
Contributor:

A scalar fill value is a bit underwhelming.

Member Author:

I agree it may be useful. I just thought it is not a big priority, as we already have operators with scalar-only padding. As this PR grew considerably, I'd prefer making it a follow-up.

It should not be a big change on its own - if we encounter an OOB access, we can retrieve the channel index (as we do for remapping in border reflect etc.) in the InLoader and take the fill value for that channel accordingly.
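A rough sketch of that follow-up (hypothetical, not part of this PR): the loader would hold a per-channel fill buffer and pick the value based on the channel recovered from the flattened out-of-bounds index:

template <typename In>
struct PerChannelPadLoader {
  const In* __restrict__ fill_values;  // num_channels values, resident on the GPU
  int num_channels;

  __device__ In fill(int flat_idx) const {
    // recover the channel the same way the reflect border remap does
    int c = flat_idx < 0 ? (flat_idx + 1) % num_channels + num_channels - 1
                         : flat_idx % num_channels;
    return __ldg(fill_values + c);
  }
};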

Comment on lines 379 to 383
int global_x = in_loader_.border_remap_innermost(anchored_start.x + x, sample_desc_.in_shape);
for (int y = thread_idx.y; y < sample_desc_.workspace_desc.in_extents.y; y += block_dim.y) {
int global_y = precomputed_idx_[y];
in_workspace_[dot(ivec2{x, y}, sample_desc_.workspace_desc.in_strides)] =
in_loader_.load(in, ivec2{global_x, global_y}, sample_desc_.in_shape);
Contributor:

The fact that border_remap is separated from load, but load still does some border handling (padding) seems a bit off...

Member Author:

For smaller kernels, it seems that the way border handling is done makes a visible difference in perf. Even without the precompute_idx thingy, it is faster when decoupled from loading than when it is done in the load method in the innermost loop (sadly, as I hoped the compiler would hoist what can be hoisted). And to actually handle each extent once (with the precompute helper), I don't see another way.

Comment on lines 49 to 62
template <int N, typename T, typename U>
void strides(vec<N, U>& out, U& total_stride, const vec<N, T>& v) {
total_stride = 1;
for (int d = 0; d < N; d++) {
out[d] = total_stride;
total_stride *= v[d];
}
}

template <int N, typename T, typename U>
void strides(vec<N, U>& out, const vec<N, T>& v) {
U total_strides;
strides(out, total_strides, v);
}
Contributor (@mzient, Dec 19, 2022):

How about:

Suggested change
template <int N, typename T, typename U>
void strides(vec<N, U>& out, U& total_stride, const vec<N, T>& v) {
total_stride = 1;
for (int d = 0; d < N; d++) {
out[d] = total_stride;
total_stride *= v[d];
}
}
template <int N, typename T, typename U>
void strides(vec<N, U>& out, const vec<N, T>& v) {
U total_strides;
strides(out, total_strides, v);
}
template <int N, typename T, typename U>
U strides(vec<N, U>& out, const vec<N, T>& v) {
U total_stride = 1;
for (int d = 0; d < N; d++) {
out[d] = total_stride;
total_stride *= v[d];
}
return total_stride;
}

Also: I'm not convinced this is the correct file to put this utility in. This could land in the same place that we have CalcStrides - and, indeed, we could extend CalcStrides to support different traversal directions and to produce the total volume, e.g.:

template <bool outer_first = true, typename Strides, typename Shape>
auto CalcStrides(Strides &strides, const Shape& shape) {
  using stride_t = std::remove_reference_t<decltype(strides[0])>;
  int ndim = dali::size(shape);
  resize_if_possible(strides, ndim);  // no-op if strides is a plain array or std::array
  stride_t ret = 1;
  if (outer_first) {
    for (int d = ndim - 1; d > 0; d--) {
      strides[d] = ret;
      ret *= shape[d];
    }
  } else {
    for (int d = 0; d < ndim; d++) {
      strides[d] = ret;
      ret *= shape[d];
    }
  }
  return ret;
}

Then you could use it like:

total_stride = CalcStrides<false>(strides, shape);

Member Author:

Sounds good, I'll move it.

Member Author (@stiepan, Dec 20, 2022):

CalcStrides uses int64_t for stride accumulation and it likely needs to stay that way, as strides[0], for example in transpose_gpu_impl, can be fast_div<uint64_t>, which is not happy about the *= operator. I left it that way:

template <bool outer_first = true, typename Strides, typename Shape>
DALI_HOST_DEV std::remove_reference_t<decltype(std::declval<Strides>()[0])> CalcStrides(
    Strides &strides, const Shape &shape) {
  int ndim = dali::size(shape);
  resize_if_possible(strides, ndim);  // no-op if strides is a plain array or std::array
  int64_t ret = 1;
  if (outer_first) {
    for (int d = ndim - 1; d >= 0; d--) {
      strides[d] = ret;
      ret *= shape[d];
    }
  } else {
    for (int d = 0; d < ndim; d++) {
      strides[d] = ret;
      ret *= shape[d];
    }
  }
  return ret;
}

// limitations under the License.

#ifndef DALI_KERNELS_IMGPROC_CONVOLUTION_FILTER_GPU_CUH_
#define DALI_KERNELS_IMGPROC_CONVOLUTION_FILTER_GPU_CUH_
Contributor:

I'd recommend splitting this file into:
filter_gpu.h, containing the kernel class, parameters, etc
and
filter_gpu_impl.cuh with GPU kernels.
Depending on build times, we might even decide to instantiate just some input/output combinations in dali_kernels to avoid compiling this file twice - in kernel tests and operators.
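One possible shape of that (a sketch; the Out/In/W combinations below are illustrative), using the template parameters of Filter2dGpu shown above:

// filter_gpu.h: declare the chosen combinations, so including the header does not instantiate them
extern template struct Filter2dGpu<float, uint8_t, float, true, false>;
extern template struct Filter2dGpu<float, float, float, true, false>;

// filter_gpu.cu (compiled once into dali_kernels): the single explicit instantiation
template struct Filter2dGpu<float, uint8_t, float, true, false>;
template struct Filter2dGpu<float, float, float, true, false>;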

Member Author:

Done

Signed-off-by: Kamil Tokarski <ktokarski@nvidia.com>
@stiepan (Member Author) commented Dec 20, 2022

!build

@dali-automaton (Collaborator):

CI MESSAGE: [6826816]: BUILD STARTED

@dali-automaton (Collaborator):

CI MESSAGE: [6826816]: BUILD FAILED

@stiepan (Member Author) commented Dec 20, 2022

CI MESSAGE: [6826816]: BUILD FAILED

Oh, the recent numpy issue.

Comment on lines +94 to +98
filter::filter<<<grid_setup.kernel_setup(), StaticConfigT::threadblock_size,
required_shm_size_, ctx.gpu.stream>>>(
samples_desc_dev, block_setup_provider, in_loader_provider, grid_setup,
out_shape_provider, conv_factory);
CUDA_CALL(cudaGetLastError());
Contributor:

Well, I guess it would be better to split the launch instead of redirecting all samples to the inefficient global-load use case. This can be left for a follow-up, though.

void Run(KernelContext& ctx, const TensorListView<StorageGPU, Out, ndim>& out,
const TensorListView<StorageGPU, const In, ndim>& in,
const TensorListView<StorageGPU, const W, axes>& filters,
const span<const ivec<axes>> anchors, boundary::BoundaryType border_type,
Contributor:

Suggested change
const span<const ivec<axes>> anchors, boundary::BoundaryType border_type,
const span<const ivec<axes>> anchors,
boundary::BoundaryType border_type,

I honestly thought that boundary_type could be per-sample, seeing border_type at the end and span at the beginning of the line.

Signed-off-by: Kamil Tokarski <ktokarski@nvidia.com>
@stiepan (Member Author) commented Dec 20, 2022

!build

@dali-automaton (Collaborator):

CI MESSAGE: [6831529]: BUILD STARTED

@dali-automaton (Collaborator):

CI MESSAGE: [6831529]: BUILD PASSED

@stiepan stiepan merged commit ebbe767 into NVIDIA:main Dec 21, 2022
@stiepan stiepan mentioned this pull request Jan 2, 2023
aderylo pushed a commit to zpp-dali-2022/DALI that referenced this pull request Mar 17, 2023
* Add GPU filter (convolution) kernel 
* Support 2D and 3D
* Specialize to use shm for smaller kernels

Signed-off-by: Kamil Tokarski <ktokarski@nvidia.com>
@JanuszL JanuszL mentioned this pull request Sep 6, 2023