Add optimized variant of CMN for HWC to CHW case #4972

klecki · 2023-08-02T12:54:42Z

Category: New feature, Refactoring

Description:

Add an optimized version of SliceFlipNormalize kernel for the HWC to CHW layout switch.
The kernel has several variants that allow for:

* cropping (only X-dimension is relevant, as Y-dimension is done via tiling)
* mirroring X coordinate (as required by Crop Mirror Normalize operator)
* padding channel dimension

It assumes uint8_t inputs and allows for float16 and float32 outputs.
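To make the intended semantics concrete, here is a minimal CPU reference sketch of the operation described above. It is not the DALI kernel: the function name `SliceHwc2ChwNormalizeRef`, its parameter list, the `fill` value for padded channels, and the float-only output are all assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical CPU reference (not DALI code): crop in X, optionally mirror in X,
// transpose HWC -> CHW, normalize per channel, pad the channel dimension.
std::vector<float> SliceHwc2ChwNormalizeRef(
    const std::vector<uint8_t> &in, int H, int W, int C,  // input in HWC layout
    int x0, int out_W,                                    // crop window in X
    bool mirror, int out_C,                               // out_C >= C pads channels
    const std::vector<float> &mean, const std::vector<float> &inv_std,
    float fill = 0.0f) {
  assert(in.size() == static_cast<size_t>(H) * W * C);
  assert(x0 >= 0 && x0 + out_W <= W && out_C >= C);
  assert(mean.size() >= static_cast<size_t>(C) && inv_std.size() >= static_cast<size_t>(C));

  // Padded channels keep the fill value.
  std::vector<float> out(static_cast<size_t>(out_C) * H * out_W, fill);
  for (int y = 0; y < H; y++) {
    for (int x = 0; x < out_W; x++) {
      int src_x = x0 + x;                       // cropping shifts the read position
      int dst_x = mirror ? out_W - 1 - x : x;   // mirroring flips the write position
      for (int c = 0; c < C; c++) {
        float v = in[(static_cast<size_t>(y) * W + src_x) * C + c];
        out[(static_cast<size_t>(c) * H + y) * out_W + dst_x] = (v - mean[c]) * inv_std[c];
      }
    }
  }
  return out;
}
```

For a 1x3 image with two channels `{10,20, 30,40, 50,60}`, cropping the last two columns, mirroring, and padding to three channels with `fill = -1` yields the planes `{50, 30}`, `{60, 40}` and `{-1, -1}`. A reference like this could serve as a baseline when comparing against the optimized GPU variant.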

The algorithm is described in the docstring.
Additionally, because the tiling is linear, the binary search for the tile index is ported over from the Cast kernel.
Due to time constraints I duplicated some code; it may be worth generalizing this approach
to more kernels in a follow-up PR.
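As an illustration of the tile lookup idea, the sketch below maps a linear tile index to the sample that owns it by binary-searching a prefix-sum table. The helper name `FindSampleForTile` and the `first_tile` layout are assumptions for this example; the actual Cast kernel code may be organized differently.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper (not DALI code).
// first_tile[i] is the index of the first tile belonging to sample i;
// first_tile.back() is the total number of tiles.
int FindSampleForTile(const std::vector<int64_t> &first_tile, int64_t tile_idx) {
  assert(first_tile.size() >= 2);
  assert(tile_idx >= 0 && tile_idx < first_tile.back());
  // The first entry strictly greater than tile_idx belongs to the next sample,
  // so the owning sample is the one just before it.
  auto it = std::upper_bound(first_tile.begin(), first_tile.end(), tile_idx);
  return static_cast<int>(it - first_tile.begin()) - 1;
}
```

With `first_tile = {0, 3, 3, 7}` (three tiles, then an empty sample, then four tiles), tile 2 maps to sample 0 and tile 3 maps to sample 2, so empty samples are skipped naturally. On the GPU, each block would perform this lookup with its block index instead of storing a per-tile descriptor.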

The CropMirrorNormalize operator setup is generalized to support the new and old versions
of the Slice kernel (most notably their setup).
Selection of appropriate implementation based on the inputs and parameters was added.

Testing is done via Python layer for simplicity.

The pure kernel (disregarding the setups) achieves 2.6 TB/s vs the 1.6 TB/s of the previous variant.

A simple benchmark using a DALI pipeline gives 2 TB/s for the new variant vs 1.5 TB/s for the old one. Note that here we are self-restricted by the previous iteration; overlapping the compute with training may help further.

Additional information:

Affected modules and functionalities:

New slice kernel, CropMirrorNormalize op.

Key points relevant for the review:

Kernel impl

Tests:

Existing operator tests + new Python tests focusing on the parameters used for this kernel variant.

Checklist

Documentation

DALI team only

Requirements

Implements new requirements
Affects existing requirements
N/A

REQ IDs: N/A

JIRA TASK: N/A

klecki · 2023-08-02T12:55:56Z

!build

dali-automaton · 2023-08-02T13:00:14Z

CI MESSAGE: [9209468]: BUILD STARTED

dali-automaton · 2023-08-02T13:05:11Z

CI MESSAGE: [9209468]: BUILD FAILED

klecki · 2023-08-02T13:13:10Z

!build

dali-automaton · 2023-08-02T13:15:19Z

CI MESSAGE: [9209664]: BUILD STARTED

dali-automaton · 2023-08-02T14:11:13Z

CI MESSAGE: [9209664]: BUILD FAILED

klecki · 2023-08-02T14:21:57Z

!build

dali-automaton · 2023-08-02T14:25:18Z

CI MESSAGE: [9210806]: BUILD STARTED

dali-automaton · 2023-08-02T15:32:38Z

CI MESSAGE: [9210806]: BUILD FAILED

klecki · 2023-08-03T11:26:37Z

!build

dali-automaton · 2023-08-03T11:30:07Z

CI MESSAGE: [9224809]: BUILD STARTED

klecki · 2023-08-03T11:36:55Z

!build

dali-automaton · 2023-08-03T11:40:31Z

CI MESSAGE: [9224890]: BUILD STARTED

dali-automaton · 2023-08-03T13:56:22Z

CI MESSAGE: [9224890]: BUILD PASSED

klecki · 2023-08-03T16:43:03Z

!build

dali-automaton · 2023-08-03T16:45:29Z

CI MESSAGE: [9228248]: BUILD STARTED

dali-automaton · 2023-08-03T19:04:26Z

CI MESSAGE: [9228248]: BUILD FAILED

dali-automaton · 2023-08-04T11:57:35Z

CI MESSAGE: [9228248]: BUILD PASSED

mzient · 2023-08-04T12:14:44Z

dali/kernels/slice/slice_hwc2chw_normalize_gpu.cu

+    aligned_tile[idx * 4 + 2] = in.z;
+    aligned_tile[idx * 4 + 3] = in.w;
+  }
+  int64_t processed_in_main = (left_after_prologue / 4) * 4;


One instruction instead of 4 (signed division by a power of 2 requires three instructions).

Suggested change

-  int64_t processed_in_main = (left_after_prologue / 4) * 4;
+  int64_t processed_in_main = left_after_prologue & -4;

https://godbolt.org/z/GcWvxTPoW
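A small host-side sketch of the two forms, with hypothetical function names, shows when they agree. For non-negative inputs both round down to a multiple of 4. For negative inputs they differ, because integer division truncates toward zero while the mask rounds toward negative infinity, so the replacement relies on `left_after_prologue` being non-negative.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical names, for illustration only.
// Division form: truncates toward zero before multiplying back.
int64_t RoundDownDiv(int64_t n) { return (n / 4) * 4; }

// Mask form: -4 is ...11100 in two's complement, so this clears the two lowest bits.
int64_t RoundDownMask(int64_t n) { return n & -4; }
```

Here `RoundDownDiv(7)` and `RoundDownMask(7)` both give 4, while for `-1` the division form gives 0 and the mask form gives -4.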

mzient · 2023-08-04T12:24:53Z

dali/kernels/slice/slice_hwc2chw_normalize_gpu.cu

+       idx < end_x / kStaticChannels; idx += blockDim.x, base_x += blockDim.x) {
+    // TODO(klecki): forceinline device function
+    int64_t out_offset;
+    if constexpr (enable_mirror) {


Does it help?

Yes, compiling those ifs out when we do not use them actually helped.
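For readers unfamiliar with the pattern, the following host-side sketch shows how `if constexpr` on a template flag removes the unused branch at compile time. The helper `OutputX` is hypothetical and much simpler than the offset computation in the kernel; it requires C++17.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not DALI code): the mirror flag is a template parameter,
// so each instantiation contains only the branch it needs and no runtime check.
template <bool enable_mirror>
int64_t OutputX(int64_t x, int64_t width) {
  assert(x >= 0 && x < width);
  if constexpr (enable_mirror) {
    return width - 1 - x;  // flipped horizontal position
  } else {
    return x;              // position unchanged
  }
}
```

`OutputX<true>(0, 10)` returns 9 and `OutputX<false>(3, 10)` returns 3. The cost of this approach is one kernel instantiation per flag combination, which is the usual trade-off against a runtime branch inside the inner loop.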

mzient · 2023-08-04T12:25:22Z

dali/kernels/slice/slice_hwc2chw_normalize_gpu.cu

+  for (int64_t idx = threadIdx.x + start_x / kStaticChannels, base_x = threadIdx.x;
+       idx < end_x / kStaticChannels; idx += blockDim.x, base_x += blockDim.x) {
+    int64_t out_offset;
+    if constexpr (enable_mirror) {


Does it help?

mzient · 2023-08-04T13:33:03Z

dali/kernels/slice/slice_hwc2chw_normalize_gpu.cu

+
+  float *tile_row = tile;
+
+  for (int y = y_start; y < y_end; y++) {


This looks suspicious with the inner block-strided loop. If the slice is narrow (< 4*blockDim.x), then many threads will do nothing.

We include the channel dimension here, so the narrowest slice that utilizes the whole block is about 44 pixels. It's not that narrow.

mzient · 2023-08-04T13:35:40Z

dali/operators/image/crop/new_crop_mirror_normalize.cu

+          ), DALI_FAIL(make_string("Not supported channel dimension:", channel_dim_idx_)););  // NOLINT
+        ), DALI_FAIL(make_string("Not supported number of spatial dimensions:", spatial_ndim_)););  // NOLINT
+      ), DALI_FAIL(make_string("Not supported output type:", output_type_)););  // NOLINT
+    ), DALI_FAIL(make_string("Not supported input type:", input_type_)););  // NOLINT


Nitpick:

Suggested change

-          ), DALI_FAIL(make_string("Not supported channel dimension:", channel_dim_idx_)););  // NOLINT
-        ), DALI_FAIL(make_string("Not supported number of spatial dimensions:", spatial_ndim_)););  // NOLINT
-      ), DALI_FAIL(make_string("Not supported output type:", output_type_)););  // NOLINT
-    ), DALI_FAIL(make_string("Not supported input type:", input_type_)););  // NOLINT
+          ), DALI_FAIL(make_string("Unsupported channel dimension:", channel_dim_idx_)););  // NOLINT
+        ), DALI_FAIL(make_string("Unsupported number of spatial dimensions:", spatial_ndim_)););  // NOLINT
+      ), DALI_FAIL(make_string("Unsupported output type:", output_type_)););  // NOLINT
+    ), DALI_FAIL(make_string("Unsupported input type:", input_type_)););  // NOLINT

mzient · 2023-08-04T13:37:51Z

dali/operators/image/crop/new_crop_mirror_normalize.cu

+    // const auto &req = k.Setup(ctx, sh, cargs);
+    // // k.test();
+    auto cargs = make_cspan(args);
+    auto &req = kmgr_.Setup<Kernel>(0, ctx, sh, make_cspan(args));


Suggested change

-    auto &req = kmgr_.Setup<Kernel>(0, ctx, sh, make_cspan(args));
+    auto &req = kmgr_.Setup<Kernel>(0, ctx, sh, cargs);

Use cargs here? Otherwise cargs is unused.

mzient

We need it, so ✔️; however, some parts need a follow-up.

klecki · 2023-08-07T10:31:41Z

!build

dali-automaton · 2023-08-07T10:35:15Z

CI MESSAGE: [9266334]: BUILD STARTED

dali-automaton · 2023-08-07T12:17:29Z

CI MESSAGE: [9266334]: BUILD PASSED

Add optimized Slice Hwc2Chw Normalize Mirror Pad kernel

6f7bf8d

Integrate it into current CropMirrorNormalize, generalize the setup parts of the operator to allow for selection of the optimized implementations. Signed-off-by: Krzysztof Lecki <klecki@nvidia.com>

klecki force-pushed the cmn-in-dali branch from f307287 to 6f7bf8d on August 2, 2023 14:10

klecki added 2 commits August 2, 2023 16:17

Adjust test range

ed77621

Signed-off-by: Krzysztof Lecki <klecki@nvidia.com>

Cast adjustments

156df0d

Signed-off-by: Krzysztof Lecki <klecki@nvidia.com>

klecki added 2 commits August 2, 2023 22:51

Fixup

3efe225

Signed-off-by: Krzysztof Lecki <klecki@nvidia.com>

Add the binsearch optimization

cfb6c2c

Signed-off-by: Krzysztof Lecki <klecki@nvidia.com>

Fixup and cleanup

55195a2

Signed-off-by: Krzysztof Lecki <klecki@nvidia.com>

Remove reduntant docstring section

c199f0b

Signed-off-by: Krzysztof Lecki <klecki@nvidia.com>

klecki marked this pull request as ready for review August 3, 2023 11:51

mzient reviewed Aug 3, 2023

dali/kernels/slice/slice_hwc2chw_normalize_gpu.cu (outdated)

mzient reviewed Aug 3, 2023

dali/kernels/slice/slice_hwc2chw_normalize_gpu.cu (outdated)

mzient reviewed Aug 3, 2023

dali/kernels/slice/slice_hwc2chw_normalize_gpu.cu (outdated)

Small fixes

f7903f4

Signed-off-by: Krzysztof Lecki <klecki@nvidia.com>

jantonguirao assigned jantonguirao and mzient Aug 4, 2023

mzient reviewed Aug 4, 2023

mzient approved these changes Aug 4, 2023

Optimization of tile descriptors

faaee3d

Signed-off-by: Krzysztof Lecki <klecki@nvidia.com>

jantonguirao approved these changes Aug 7, 2023

Review fixes

9a91671

Signed-off-by: Krzysztof Lecki <klecki@nvidia.com>

klecki merged commit dbb79d4 into NVIDIA:main Aug 7, 2023
5 checks passed
