Split reduction kernels #4383
Conversation
Force-pushed from 8bd64cc to e719bc4
@@ -44,6 +44,27 @@ namespace kernels {
/// @brief Implementation details of reduction kernels
namespace reduce_impl {

template <typename Iterator, typename Predicate>
Some doxygen for multi_partition would be nice.
Actually, I could even move it to some utils.
Done.
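For reference, the requested documentation might look something like the sketch below; the semantics are inferred from the name and the usage in this PR, and the exact wording and signature in the final code may differ:

```cpp
/**
 * @brief Partitions the range into consecutive groups selected by the predicates.
 *
 * Elements satisfying the first predicate are placed first, followed by the
 * elements (among the remaining ones) that satisfy the second predicate, and
 * so on; elements that satisfy no predicate form the final group.
 *
 * @param begin iterator to the beginning of the range
 * @param end   iterator one past the end of the range
 * @param preds predicates, applied in order of decreasing priority
 * @return iterators marking the end of each group
 */
template <typename Iterator, typename... Predicates>
auto multi_partition(Iterator begin, Iterator end, Predicates &&...preds);
```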
auto launch_params = [&](auto kernel, int nsamples, int shm_size, int max_block_size) {
  int preferred_block_size = max_block_size;
  int preferred_grid_size;  // unused
  CUDA_CALL(cudaOccupancyMaxPotentialBlockSize(
      &preferred_grid_size,
      &preferred_block_size,
      kernel,
      shm_size,
      max_block_size));

  dim3 block(32, preferred_block_size / 32);
  int gridx = std::max(32, 512 / nsamples);
  dim3 grid(gridx, nsamples);
  return std::make_pair(grid, block);
};
This could be extracted to a function (you use it for Inner and Middle launch functions)
Mixed feelings here. I think the fact that this part:

dim3 block(32, preferred_block_size / 32);
int gridx = std::max(32, 512 / nsamples);
dim3 grid(gridx, nsamples);

is the same is mostly a coincidence. Extracting just the call to cudaOccupancyMaxPotentialBlockSize doesn't seem to make much sense.
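For illustration, the extraction being discussed might look roughly like the sketch below; MaxPotentialBlockSize is a hypothetical name, and CUDA_CALL is the error-checking macro already used in the quoted code:

```cpp
template <typename Kernel>
int MaxPotentialBlockSize(Kernel kernel, int shm_size, int max_block_size) {
  int preferred_grid_size;  // required by the API, not used afterwards
  int preferred_block_size = max_block_size;
  CUDA_CALL(cudaOccupancyMaxPotentialBlockSize(
      &preferred_grid_size, &preferred_block_size,
      kernel, shm_size, max_block_size));
  return preferred_block_size;
}

// at each launch site, only the grid/block computation would remain:
//   dim3 block(32, MaxPotentialBlockSize(kernel, shm_size, max_block_size) / 32);
```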
def test_reduce_large_data():
    np.random.seed(1234)
    for device in ['gpu']:
Suggested change:
-    for device in ['gpu']:
+    for device in ['cpu', 'gpu']:
True; I disabled the cpu tests to get to the interesting part more quickly.
typename PreprocessorBank = reduce_impl::IdentityPreprocessor<1>,
typename Postprocessor = identity>
__global__ void ReduceInnerSmallKernel(const ReduceSampleDesc<Out, In> *samples,
    Reduction reduce = {}, const PreprocessorBank *pre = nullptr,
indentation is off here
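An aligned version might look like the sketch below; the leading template parameters are reconstructed here for completeness and may not match the actual code exactly:

```cpp
template <typename Acc, typename Out, typename In, typename Reduction,
          typename PreprocessorBank = reduce_impl::IdentityPreprocessor<1>,
          typename Postprocessor = identity>
__global__ void ReduceInnerSmallKernel(const ReduceSampleDesc<Out, In> *samples,
                                       Reduction reduce = {},
                                       const PreprocessorBank *pre = nullptr,
                                       Postprocessor *post = nullptr);
```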
typename PreprocessorBank = reduce_impl::IdentityPreprocessor<1>,
typename Postprocessor = identity>
__global__ void ReduceInnerMediumKernel(const ReduceSampleDesc<Out, In> *samples,
    Reduction reduce = {}, const PreprocessorBank *pre = nullptr,
indentation is off here
typename PreprocessorBank = reduce_impl::IdentityPreprocessor<1>,
typename Postprocessor = identity>
__global__ void ReduceInnerLargeKernel(const ReduceSampleDesc<Out, In> *samples,
    Reduction reduce = {}, const PreprocessorBank *pre = nullptr,
and here
@@ -68,6 +68,24 @@ struct UniformPreprocessorBank {
} // namespace reduce_impl


template <typename Out, typename In>
Nothing new - just moved a few lines up.
ReduceNoneKernel<Acc, StageOut, StageIn, red_t, pre_bank_t, post_t>,
num_none, 0, 256);

ReduceNoneKernel<Acc><<<grid, block, 0, ctx.stream>>>(
I know this is repetitive. I've tried hoisting the parameter setup and launch to a function, but CUDA kernels aren't first-class C++ entities and cannot be passed as a template argument - at that point they become ordinary functions, so I'd have to launch them with cudaLaunchKernel instead of <<<...>>>, losing compile-time parameter checking.
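A sketch of the rejected alternative (not code from this PR): a generic launcher has to take the kernel as an ordinary pointer and pass the arguments through an untyped array, so a mismatch between the kernel's parameters and the packed arguments would only surface at run time:

```cpp
// hypothetical generic launcher; CUDA_CALL as in the surrounding code
inline void LaunchErased(const void *kernel, dim3 grid, dim3 block,
                         void **args, cudaStream_t stream) {
  // cudaLaunchKernel accepts any function address and any void*[] - the
  // compiler cannot verify that args matches the kernel's parameter list
  CUDA_CALL(cudaLaunchKernel(kernel, grid, block, args, 0, stream));
}

// whereas the repeated <<<...>>> launches keep full compile-time checking:
//   ReduceNoneKernel<Acc><<<grid, block, 0, ctx.stream>>>(out, in, lengths);
```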
Signed-off-by: Michał Zientkiewicz <mzient@gmail.com>
Signed-off-by: Michal Zientkiewicz <michalz@nvidia.com>
Force-pushed from 1da5977 to 3e52acf
CI MESSAGE: [6317450]: BUILD STARTED
CI MESSAGE: [6317450]: BUILD FAILED
Signed-off-by: Michal Zientkiewicz <michalz@nvidia.com>
CI MESSAGE: [6318396]: BUILD STARTED
Signed-off-by: Michal Zientkiewicz <michalz@nvidia.com>
CI MESSAGE: [6318439]: BUILD STARTED
CI MESSAGE: [6318439]: BUILD FAILED
Signed-off-by: Michal Zientkiewicz <michalz@nvidia.com>
CI MESSAGE: [6320781]: BUILD STARTED
Signed-off-by: Michal Zientkiewicz <michalz@nvidia.com>
CI MESSAGE: [6320781]: BUILD FAILED
Signed-off-by: Michal Zientkiewicz <michalz@nvidia.com>
CI MESSAGE: [6322500]: BUILD STARTED
 */
template <typename Collection, typename... Predicates>
auto multi_partition(Collection &&c, Predicates &&... preds)
-> decltype(detail::multi_partition_impl(dali::begin(c), dali::end(c),
This trailing return type serves as SFINAE.
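A simplified, self-contained illustration of the same trick (not DALI code): the trailing return type names the wrapped call, so when that call is ill-formed for the given arguments, the overload silently drops out of the candidate set instead of producing a hard error:

```cpp
#include <iterator>

// participates in overload resolution only if std::begin(c) and its
// dereference are valid expressions for the given type
template <typename C>
auto first_element(C &&c) -> decltype(*std::begin(c)) {
  return *std::begin(c);
}
```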
typename = std::tuple<
    decltype(std::declval<Predicates>()(*std::declval<Iterator>()))...>>
This idiom is a poor man's concept. It's becoming quite popular in the C++ community for use where C++20 is not yet available.
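A standalone rendering of the idiom (illustrative, not the DALI code): the defaulted, anonymous template parameter is well-formed only when every predicate can be invoked on the iterator's value type, so ill-matched arguments fail substitution instead of erroring deep inside the implementation:

```cpp
#include <tuple>
#include <utility>

template <typename Iterator, typename... Predicates,
          typename = std::tuple<
              decltype(std::declval<Predicates>()(*std::declval<Iterator>()))...>>
void require_applicable(Iterator, Predicates &&...) {}

// std::vector<int> v = {1, 2, 3};
// require_applicable(v.begin(), [](int x) { return x > 0; });          // OK
// require_applicable(v.begin(), [](std::string) { return true; });     // no match
```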
CI MESSAGE: [6322500]: BUILD PASSED
 * per output sample
 * @param post postprocessing unary functor
 */
template <typename Out, typename In, typename PreprocessorBank, typename Postprocessor>
This is the only device function that's actually new. The previous "ReduceNone" function only worked with the innermost dimension.
__global__ void ReduceNoneKernel(Out *const *out, const In *const *in, const int64_t *lengths,
                                 PreprocessorBank *pre = nullptr,
                                 Postprocessor *post = nullptr) {
__global__ void ReduceNoneRawKernel(Out *const *out, const In *const *in, const int64_t *lengths,
"Raw", because it works directly on data pointers, skipping SampleDescs.
Signed-off-by: Michał Zientkiewicz <mzient@gmail.com>
Category:
Other (optimization, refactoring)
Description:
The reduction kernels used to select the processing function at run time. This places contradictory requirements on launch parameters, and it is also pessimistic about register requirements. This PR splits these kernels into separate groups and distributes the work to the different variants on the host side.
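A schematic contrast of the two approaches (an illustrative sketch, not the PR's actual kernels; SampleDesc, the thresholds, and the helper names below are made up):

```cpp
#include <cstdint>

struct SampleDesc { const float *in; float *out; int64_t n_inner; };

__device__ void ReduceSmall(const SampleDesc &) {}   // placeholder bodies
__device__ void ReduceMedium(const SampleDesc &) {}
__device__ void ReduceLarge(const SampleDesc &) {}

// Before: one kernel selects the processing function at run time, so its
// launch parameters must suit every branch and its register usage is the
// worst case of all three.
__global__ void ReduceMonolithicKernel(const SampleDesc *samples) {
  const SampleDesc &s = samples[blockIdx.y];
  if (s.n_inner < 32)
    ReduceSmall(s);
  else if (s.n_inner < 1024)
    ReduceMedium(s);
  else
    ReduceLarge(s);
}

// After: the host groups samples by size class (e.g. with multi_partition)
// and launches a dedicated kernel per group, so each variant gets its own
// block size, shared memory, and register budget.
__global__ void ReduceSmallKernel(const SampleDesc *s)  { ReduceSmall(s[blockIdx.y]); }
__global__ void ReduceMediumKernel(const SampleDesc *s) { ReduceMedium(s[blockIdx.y]); }
__global__ void ReduceLargeKernel(const SampleDesc *s)  { ReduceLarge(s[blockIdx.y]); }
```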
Additional information:
Affected modules and functionalities:
Reduction kernels
Tests:
Added a test with random (and possibly large) 3D data to make sure that all cases are hit in one batch.
The tests are reworked for better readability.
Requirements
REQ IDs: N/A
JIRA TASK: DALI-3087