
Split reduction kernels #4383

Merged: 11 commits merged into NVIDIA:main on Oct 31, 2022

Conversation

@mzient (Contributor) commented Oct 25, 2022

Signed-off-by: Michał Zientkiewicz <mzient@gmail.com>

Category:

Other (optimization, refactoring)

Description:

The reduction kernels used to select the processing function at run time. This placed contradictory requirements on the launch parameters, and it was also pessimistic about register usage. This PR splits these kernels into separate variants and distributes the work among them on the host side.
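A minimal sketch of the general idea, with invented kernel names, bodies, and thresholds (the PR's actual variants are the small/medium/large inner and middle reduction kernels shown further down): each variant is compiled as its own kernel, and the former run-time branch becomes host-side dispatch.

    #include <cstdint>
    #include <cuda_runtime.h>

    // Toy illustration only; names, bodies and thresholds are invented.

    // Before: one kernel branches at run time, so registers are allocated for
    // the worst-case path and one launch configuration must suit all paths.
    __global__ void ReduceAnyKernel(const float *in, float *out, int64_t n, int variant) {
      if (variant == 0)      { /* small-extent path */ }
      else if (variant == 1) { /* medium-extent path */ }
      else                   { /* large-extent path, highest register pressure */ }
    }

    // After: specialized kernels, each with its own register budget...
    __global__ void ReduceSmallKernel(const float *in, float *out, int64_t n)  { /* ... */ }
    __global__ void ReduceMediumKernel(const float *in, float *out, int64_t n) { /* ... */ }
    __global__ void ReduceLargeKernel(const float *in, float *out, int64_t n)  { /* ... */ }

    // ...and host-side dispatch with per-variant launch parameters.
    void LaunchReduce(const float *in, float *out, int64_t n, cudaStream_t stream) {
      if (n < 1024)
        ReduceSmallKernel<<<1, 32, 0, stream>>>(in, out, n);
      else if (n < (1 << 20))
        ReduceMediumKernel<<<64, 256, 0, stream>>>(in, out, n);
      else
        ReduceLargeKernel<<<256, 256, 0, stream>>>(in, out, n);
    }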

Additional information:

Affected modules and functionalities:

Reduction kernels

Key points relevant for the review:

Tests:

Added a test with random (and possibly large) 3D data to make sure that all cases are hit in one batch.
The existing tests were reworked for better readability.

  • Existing tests apply
  • New tests added
    • Python tests
    • GTests
    • Benchmark
    • Other
  • N/A

Checklist

Documentation

  • Existing documentation applies
  • Documentation updated
    • Docstring
    • Doxygen
    • RST
    • Jupyter
    • Other (comments; it's internal)
  • N/A

DALI team only

Requirements

  • Implements new requirements
  • Affects existing requirements
  • N/A

REQ IDs: N/A

JIRA TASK: DALI-3087

@jantonguirao jantonguirao self-assigned this Oct 25, 2022
@mzient mzient changed the title from "Split middle reduction kernels" to "Split reduction kernels" Oct 26, 2022
@mzient mzient marked this pull request as ready for review October 26, 2022 16:59
@@ -44,6 +44,27 @@ namespace kernels {
/// @brief Implementation details of reduction kernels
namespace reduce_impl {

template <typename Iterator, typename Predicate>
Contributor:

Some doxygen for multi_partition would be nice.

Contributor Author:

Actually, I could even move it to some utils.

Contributor Author:

Done.
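For context, a rough sketch of what such a multi-predicate partition can look like (an illustration under assumed semantics, not the PR's actual implementation): partition the range by the first predicate, recurse on the remaining tail with the next one, and return all split points.

    #include <algorithm>
    #include <tuple>

    // Base case: no predicates left, no split points to report.
    template <typename Iterator>
    std::tuple<> multi_partition_sketch(Iterator, Iterator) { return {}; }

    // Stable-partition the range by the first predicate, then recurse on the
    // "false" tail with the remaining predicates; the result holds one
    // boundary iterator per predicate.
    template <typename Iterator, typename Pred, typename... Preds>
    auto multi_partition_sketch(Iterator first, Iterator last, Pred pred, Preds... preds) {
      Iterator split = std::stable_partition(first, last, pred);
      return std::tuple_cat(std::make_tuple(split),
                            multi_partition_sketch(split, last, preds...));
    }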

Comment on lines +1316 to +1314
auto launch_params = [&](auto kernel, int nsamples, int shm_size, int max_block_size) {
  int preferred_block_size = max_block_size;
  int preferred_grid_size;  // unused
  CUDA_CALL(cudaOccupancyMaxPotentialBlockSize(
      &preferred_grid_size,
      &preferred_block_size,
      kernel,
      shm_size,
      max_block_size));

  dim3 block(32, preferred_block_size / 32);
  int gridx = std::max(32, 512 / nsamples);
  dim3 grid(gridx, nsamples);
  return std::make_pair(grid, block);
};
Contributor:

This could be extracted to a function (you use it for Inner and Middle launch functions)

Contributor Author:

Mixed feelings here. I think this part

      dim3 block(32, preferred_block_size / 32);
      int gridx = std::max(32, 512 / nsamples);
      dim3 grid(gridx, nsamples);

is the same mostly by coincidence. Extracting just the call to cudaOccupancyMaxPotentialBlockSize doesn't seem to make much sense.


def test_reduce_large_data():
    np.random.seed(1234)
    for device in ['gpu']:
Contributor:

Suggested change
-    for device in ['gpu']:
+    for device in ['cpu', 'gpu']:

Contributor Author:

True; I disabled the CPU tests to get to the interesting part more quickly.

typename PreprocessorBank = reduce_impl::IdentityPreprocessor<1>,
typename Postprocessor = identity>
__global__ void ReduceInnerSmallKernel(const ReduceSampleDesc<Out, In> *samples,
Reduction reduce = {}, const PreprocessorBank *pre = nullptr,
Contributor:

indentation is off here

typename PreprocessorBank = reduce_impl::IdentityPreprocessor<1>,
typename Postprocessor = identity>
__global__ void ReduceInnerMediumKernel(const ReduceSampleDesc<Out, In> *samples,
Reduction reduce = {}, const PreprocessorBank *pre = nullptr,
Contributor:

indentation is off here

typename PreprocessorBank = reduce_impl::IdentityPreprocessor<1>,
typename Postprocessor = identity>
__global__ void ReduceInnerLargeKernel(const ReduceSampleDesc<Out, In> *samples,
Reduction reduce = {}, const PreprocessorBank *pre = nullptr,
Contributor:

and here

@@ -68,6 +68,24 @@ struct UniformPreprocessorBank {
} // namespace reduce_impl


template <typename Out, typename In>
Contributor Author:

Nothing new - just moved a few lines up.

ReduceNoneKernel<Acc, StageOut, StageIn, red_t, pre_bank_t, post_t>,
num_none, 0, 256);

ReduceNoneKernel<Acc><<<grid, block, 0, ctx.stream>>>(
Contributor Author:

I know this is repetitive. I've tried hoisting the parameter setup and launch into a function, but CUDA kernels aren't first-class C++ entities and cannot be passed as a template argument - at that point they become ordinary functions, so I'd have to launch them with cudaLaunchKernel instead of <<<>>>, losing compile-time parameter checking.
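A toy illustration of that trade-off (invented kernel, not from the PR): cudaLaunchKernel takes a type-erased void *[] argument array, so the compiler cannot check the arguments against the kernel's signature the way the <<<>>> syntax does.

    #include <cuda_runtime.h>

    __global__ void ToyScaleKernel(float *data, int n) {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n)
        data[i] *= 2.0f;
    }

    void LaunchBothWays(float *data, int n, cudaStream_t stream) {
      dim3 grid(8), block(256);

      // Checked like an ordinary function call: passing the wrong argument
      // types is a compile-time error.
      ToyScaleKernel<<<grid, block, 0, stream>>>(data, n);

      // Type-erased: nothing ties `args` to the kernel's signature, so a
      // mismatch surfaces only at run time (or as silent corruption).
      void *args[] = { &data, &n };
      cudaLaunchKernel(reinterpret_cast<const void *>(&ToyScaleKernel),
                       grid, block, args, 0, stream);
    }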

mzient and others added 6 commits October 28, 2022 09:40
Signed-off-by: Michał Zientkiewicz <mzient@gmail.com>
Signed-off-by: Michał Zientkiewicz <mzient@gmail.com>
Signed-off-by: Michal Zientkiewicz <michalz@nvidia.com>
Signed-off-by: Michal Zientkiewicz <michalz@nvidia.com>
Signed-off-by: Michal Zientkiewicz <michalz@nvidia.com>
Signed-off-by: Michal Zientkiewicz <michalz@nvidia.com>
@dali-automaton (Collaborator):

CI MESSAGE: [6317450]: BUILD STARTED

@dali-automaton (Collaborator):

CI MESSAGE: [6317450]: BUILD FAILED

Signed-off-by: Michal Zientkiewicz <michalz@nvidia.com>
@dali-automaton (Collaborator):

CI MESSAGE: [6318396]: BUILD STARTED

Signed-off-by: Michal Zientkiewicz <michalz@nvidia.com>
@dali-automaton (Collaborator):

CI MESSAGE: [6318439]: BUILD STARTED

@dali-automaton (Collaborator):

CI MESSAGE: [6318439]: BUILD FAILED

Signed-off-by: Michal Zientkiewicz <michalz@nvidia.com>
@dali-automaton (Collaborator):

CI MESSAGE: [6320781]: BUILD STARTED

Signed-off-by: Michal Zientkiewicz <michalz@nvidia.com>
@dali-automaton (Collaborator):

CI MESSAGE: [6320781]: BUILD FAILED

Signed-off-by: Michal Zientkiewicz <michalz@nvidia.com>
@dali-automaton (Collaborator):

CI MESSAGE: [6322500]: BUILD STARTED

*/
template <typename Collection, typename... Predicates>
auto multi_partition(Collection &&c, Predicates &&... preds)
    -> decltype(detail::multi_partition_impl(dali::begin(c), dali::end(c),
Contributor Author:

This trailing return type serves as SFINAE.
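For readers unfamiliar with the idiom, a generic example (not DALI code) of a trailing return type acting as SFINAE: the overload silently drops out of overload resolution when the decltype expression doesn't compile.

    #include <vector>

    // Participates in overload resolution only for types that have .size();
    // for anything else the decltype expression is ill-formed and this
    // overload quietly disappears instead of causing a compile error.
    template <typename C>
    auto element_count(const C &c) -> decltype(c.size()) {
      return c.size();
    }

    // Fallback overload for everything else.
    inline long element_count(...) { return -1; }

    // element_count(std::vector<int>{1, 2, 3})  -> 3  (first overload)
    // element_count(42)                         -> -1 (fallback)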

Comment on lines +64 to +65
typename = std::tuple<
    decltype(std::declval<Predicates>()(*std::declval<Iterator>()))...>>
Contributor Author:

This idiom is a poor man's concept. It's becoming quite popular in the C++ community for use where C++20 is not yet available.
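A generic sketch of the idiom (illustrative, not the PR's exact code): a defaulted template parameter whose type is only well-formed when every predicate is callable with the iterator's element type, so mismatched arguments reject the overload up front.

    #include <tuple>
    #include <utility>

    // The anonymous defaulted parameter is well-formed only if each predicate
    // can be invoked on *it; otherwise this overload is SFINAE'd out before
    // the body is ever instantiated, giving a short, early error.
    template <typename Iterator, typename... Predicates,
              typename = std::tuple<
                  decltype(std::declval<Predicates>()(*std::declval<Iterator>()))...>>
    void check_predicates(Iterator it, Predicates &&... preds) {
      (..., static_cast<void>(preds(*it)));  // invoke each predicate, discard results
    }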

@dali-automaton (Collaborator):

CI MESSAGE: [6322500]: BUILD PASSED

@NVIDIA NVIDIA deleted a comment from dali-automaton Oct 31, 2022
* per output sample
* @param post postprocessing unary functor
*/
template <typename Out, typename In, typename PreprocessorBank, typename Postprocessor>
Contributor Author:

This is the only device function that's actually new. The previous "ReduceNone" function only worked with the innermost dimension.

__global__ void ReduceNoneKernel(Out *const *out, const In *const *in, const int64_t *lengths,
                                 PreprocessorBank *pre = nullptr,
                                 Postprocessor *post = nullptr) {
__global__ void ReduceNoneRawKernel(Out *const *out, const In *const *in, const int64_t *lengths,
Contributor Author:

"Raw", because it works directly on data pointers, skipping SampleDescs.

@mzient mzient merged commit 5491483 into NVIDIA:main Oct 31, 2022