
Split reduction kernels #4383

Merged: 11 commits merged into NVIDIA:main on Oct 31, 2022

Conversation

@mzient (Contributor) commented Oct 25, 2022

Signed-off-by: Michał Zientkiewicz <mzient@gmail.com>

Category:

Other (optimization, refactoring)

Description:

The reduction kernels used to select the processing function at run time. This placed contradictory requirements on the launch parameters, and it was also pessimistic about register usage. This PR splits these kernels into separate variants and distributes the work among them on the host side.
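A minimal sketch of the general idea, with invented kernel names, bodies, and thresholds (the PR's actual variants are the small/medium/large inner and middle reduction kernels shown further down): each variant is compiled as its own kernel, and the former run-time branch becomes host-side dispatch.

    #include <cstdint>
    #include <cuda_runtime.h>

    // Toy illustration only; names, bodies and thresholds are invented.

    // Before: one kernel branches at run time, so registers are allocated for
    // the worst-case path and one launch configuration must suit all paths.
    __global__ void ReduceAnyKernel(const float *in, float *out, int64_t n, int variant) {
      if (variant == 0)      { /* small-extent path */ }
      else if (variant == 1) { /* medium-extent path */ }
      else                   { /* large-extent path, highest register pressure */ }
    }

    // After: specialized kernels, each with its own register budget...
    __global__ void ReduceSmallKernel(const float *in, float *out, int64_t n)  { /* ... */ }
    __global__ void ReduceMediumKernel(const float *in, float *out, int64_t n) { /* ... */ }
    __global__ void ReduceLargeKernel(const float *in, float *out, int64_t n)  { /* ... */ }

    // ...and host-side dispatch with per-variant launch parameters.
    void LaunchReduce(const float *in, float *out, int64_t n, cudaStream_t stream) {
      if (n < 1024)
        ReduceSmallKernel<<<1, 32, 0, stream>>>(in, out, n);
      else if (n < (1 << 20))
        ReduceMediumKernel<<<64, 256, 0, stream>>>(in, out, n);
      else
        ReduceLargeKernel<<<256, 256, 0, stream>>>(in, out, n);
    }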

Additional information:

Affected modules and functionalities:

Reduction kernels

Key points relevant for the review:

Tests:

Added a test with random (and possibly large) 3D data to make sure that all cases are hit in one batch.
The existing tests were reworked for better readability.

  • Existing tests apply
  • New tests added
    • Python tests
    • GTests
    • Benchmark
    • Other
  • N/A

Checklist

Documentation

  • Existing documentation applies
  • Documentation updated
    • Docstring
    • Doxygen
    • RST
    • Jupyter
    • Other (comments; it's internal)
  • N/A

DALI team only

Requirements

  • Implements new requirements
  • Affects existing requirements
  • N/A

REQ IDs: N/A

JIRA TASK: DALI-3087

@jantonguirao jantonguirao self-assigned this Oct 25, 2022
@mzient mzient changed the title from "Split middle reduction kernels" to "Split reduction kernels" Oct 26, 2022
@mzient mzient marked this pull request as ready for review October 26, 2022 16:59
@@ -44,6 +44,27 @@ namespace kernels {
/// @brief Implementation details of reduction kernels
namespace reduce_impl {

template <typename Iterator, typename Predicate>
Contributor:

Some doxygen for multi_partition would be nice.

Contributor Author:

Actually, I could even move it to some utils.

Contributor Author:

Done.
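For context, a rough sketch of what such a multi-predicate partition can look like (an illustration under assumed semantics, not the PR's actual implementation): partition the range by the first predicate, recurse on the remaining tail with the next one, and return all split points.

    #include <algorithm>
    #include <tuple>

    // Base case: no predicates left, no split points to report.
    template <typename Iterator>
    std::tuple<> multi_partition_sketch(Iterator, Iterator) { return {}; }

    // Stable-partition the range by the first predicate, then recurse on the
    // "false" tail with the remaining predicates; the result holds one
    // boundary iterator per predicate.
    template <typename Iterator, typename Pred, typename... Preds>
    auto multi_partition_sketch(Iterator first, Iterator last, Pred pred, Preds... preds) {
      Iterator split = std::stable_partition(first, last, pred);
      return std::tuple_cat(std::make_tuple(split),
                            multi_partition_sketch(split, last, preds...));
    }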

Comment on lines +1316 to +1314
auto launch_params = [&](auto kernel, int nsamples, int shm_size, int max_block_size) {
  int preferred_block_size = max_block_size;
  int preferred_grid_size;  // unused
  CUDA_CALL(cudaOccupancyMaxPotentialBlockSize(
      &preferred_grid_size,
      &preferred_block_size,
      kernel,
      shm_size,
      max_block_size));

  dim3 block(32, preferred_block_size / 32);
  int gridx = std::max(32, 512 / nsamples);
  dim3 grid(gridx, nsamples);
  return std::make_pair(grid, block);
};
Contributor:

This could be extracted to a function (you use it for Inner and Middle launch functions)

Contributor Author:

Mixed feelings here. I think this part

      dim3 block(32, preferred_block_size / 32);
      int gridx = std::max(32, 512 / nsamples);
      dim3 grid(gridx, nsamples);

is the same mostly by coincidence. Extracting just the call to cudaOccupancyMaxPotentialBlockSize doesn't seem to make much sense.


def test_reduce_large_data():
    np.random.seed(1234)
    for device in ['gpu']:
Contributor:

Suggested change
-    for device in ['gpu']:
+    for device in ['cpu', 'gpu']:

Contributor Author:

True; I disabled the CPU tests to get to the interesting part more quickly.

typename PreprocessorBank = reduce_impl::IdentityPreprocessor<1>,
typename Postprocessor = identity>
__global__ void ReduceInnerSmallKernel(const ReduceSampleDesc<Out, In> *samples,
Reduction reduce = {}, const PreprocessorBank *pre = nullptr,
Contributor:

indentation is off here

typename PreprocessorBank = reduce_impl::IdentityPreprocessor<1>,
typename Postprocessor = identity>
__global__ void ReduceInnerMediumKernel(const ReduceSampleDesc<Out, In> *samples,
Reduction reduce = {}, const PreprocessorBank *pre = nullptr,
Contributor:

indentation is off here

typename PreprocessorBank = reduce_impl::IdentityPreprocessor<1>,
typename Postprocessor = identity>
__global__ void ReduceInnerLargeKernel(const ReduceSampleDesc<Out, In> *samples,
Reduction reduce = {}, const PreprocessorBank *pre = nullptr,
Contributor:

and here

@@ -68,6 +68,24 @@ struct UniformPreprocessorBank {
} // namespace reduce_impl


template <typename Out, typename In>
Contributor Author:

Nothing new - just moved a few lines up.

ReduceNoneKernel<Acc, StageOut, StageIn, red_t, pre_bank_t, post_t>,
num_none, 0, 256);

ReduceNoneKernel<Acc><<<grid, block, 0, ctx.stream>>>(
Contributor Author:

I know this is repetitive. I've tried hoisting the parameter setup and launch into a function, but CUDA kernels aren't first-class C++ entities and cannot be passed as a template argument - at that point they become ordinary functions, so I'd have to launch them with cudaLaunchKernel instead of <<<>>>, losing compile-time parameter checking.
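A toy illustration of that trade-off (invented kernel, not from the PR): cudaLaunchKernel takes a type-erased void *[] argument array, so the compiler cannot check the arguments against the kernel's signature the way the <<<>>> syntax does.

    #include <cuda_runtime.h>

    __global__ void ToyScaleKernel(float *data, int n) {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n)
        data[i] *= 2.0f;
    }

    void LaunchBothWays(float *data, int n, cudaStream_t stream) {
      dim3 grid(8), block(256);

      // Checked like an ordinary function call: passing the wrong argument
      // types is a compile-time error.
      ToyScaleKernel<<<grid, block, 0, stream>>>(data, n);

      // Type-erased: nothing ties `args` to the kernel's signature, so a
      // mismatch surfaces only at run time (or as silent corruption).
      void *args[] = { &data, &n };
      cudaLaunchKernel(reinterpret_cast<const void *>(&ToyScaleKernel),
                       grid, block, args, 0, stream);
    }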

mzient and others added 6 commits October 28, 2022 09:40
Signed-off-by: Michał Zientkiewicz <mzient@gmail.com>
Signed-off-by: Michał Zientkiewicz <mzient@gmail.com>
Signed-off-by: Michal Zientkiewicz <michalz@nvidia.com>
Signed-off-by: Michal Zientkiewicz <michalz@nvidia.com>
Signed-off-by: Michal Zientkiewicz <michalz@nvidia.com>
Signed-off-by: Michal Zientkiewicz <michalz@nvidia.com>
@dali-automaton (Collaborator):

CI MESSAGE: [6317450]: BUILD STARTED

@dali-automaton (Collaborator):

CI MESSAGE: [6317450]: BUILD FAILED

Signed-off-by: Michal Zientkiewicz <michalz@nvidia.com>
@dali-automaton (Collaborator):

CI MESSAGE: [6318396]: BUILD STARTED

Signed-off-by: Michal Zientkiewicz <michalz@nvidia.com>
@dali-automaton (Collaborator):

CI MESSAGE: [6318439]: BUILD STARTED

@dali-automaton (Collaborator):

CI MESSAGE: [6318439]: BUILD FAILED

Signed-off-by: Michal Zientkiewicz <michalz@nvidia.com>
@dali-automaton (Collaborator):

CI MESSAGE: [6320781]: BUILD STARTED

Signed-off-by: Michal Zientkiewicz <michalz@nvidia.com>
@dali-automaton (Collaborator):

CI MESSAGE: [6320781]: BUILD FAILED

Signed-off-by: Michal Zientkiewicz <michalz@nvidia.com>
@dali-automaton (Collaborator):

CI MESSAGE: [6322500]: BUILD STARTED

*/
template <typename Collection, typename... Predicates>
auto multi_partition(Collection &&c, Predicates &&... preds)
    -> decltype(detail::multi_partition_impl(dali::begin(c), dali::end(c),
Contributor Author:

This trailing return type serves as SFINAE.
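For readers unfamiliar with the idiom, a generic example (not DALI code) of a trailing return type acting as SFINAE: the overload silently drops out of overload resolution when the decltype expression doesn't compile.

    #include <vector>

    // Participates in overload resolution only for types that have .size();
    // for anything else the decltype expression is ill-formed and this
    // overload quietly disappears instead of causing a compile error.
    template <typename C>
    auto element_count(const C &c) -> decltype(c.size()) {
      return c.size();
    }

    // Fallback overload for everything else.
    inline long element_count(...) { return -1; }

    // element_count(std::vector<int>{1, 2, 3})  -> 3  (first overload)
    // element_count(42)                         -> -1 (fallback)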

Comment on lines +64 to +65
typename = std::tuple<
    decltype(std::declval<Predicates>()(*std::declval<Iterator>()))...>>
Contributor Author:

This idiom is a poor man's concept. It's becoming quite popular in the C++ community for use where C++20 is not yet available.
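A generic sketch of the idiom (illustrative, not the PR's exact code): a defaulted template parameter whose type is only well-formed when every predicate is callable with the iterator's element type, so mismatched arguments reject the overload up front.

    #include <tuple>
    #include <utility>

    // The anonymous defaulted parameter is well-formed only if each predicate
    // can be invoked on *it; otherwise this overload is SFINAE'd out before
    // the body is ever instantiated, giving a short, early error.
    template <typename Iterator, typename... Predicates,
              typename = std::tuple<
                  decltype(std::declval<Predicates>()(*std::declval<Iterator>()))...>>
    void check_predicates(Iterator it, Predicates &&... preds) {
      (..., static_cast<void>(preds(*it)));  // invoke each predicate, discard results
    }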

@dali-automaton (Collaborator):

CI MESSAGE: [6322500]: BUILD PASSED

@NVIDIA NVIDIA deleted a comment from dali-automaton Oct 31, 2022
* per output sample
* @param post postprocessing unary functor
*/
template <typename Out, typename In, typename PreprocessorBank, typename Postprocessor>
Contributor Author:

This is the only device function that's actually new. The previous "ReduceNone" function only worked with the innermost dimension.

__global__ void ReduceNoneKernel(Out *const *out, const In *const *in, const int64_t *lengths,
                                 PreprocessorBank *pre = nullptr,
                                 Postprocessor *post = nullptr) {
__global__ void ReduceNoneRawKernel(Out *const *out, const In *const *in, const int64_t *lengths,
Contributor Author:

"Raw", because it works directly on data pointers, skipping SampleDescs.

@mzient mzient merged commit 5491483 into NVIDIA:main Oct 31, 2022