Implement the new tuning API for DeviceReduce #6544
bernhardmgruber merged 44 commits into NVIDIA:main

Conversation
Found an issue with the way the accumulator type is specified in the benchmarks, which explains the regressions I currently observe when using the public tuning API: #6576
miscco left a comment:

I like the direction this is going.
griwes left a comment:

As said in one of the comments below, I do like the overall structure of this. That said, lack of pattern matching = pain.
c/parallel/src/reduce.cu (outdated):
```cpp
// convert type information to CUB arch_policies
using namespace cub::detail::reduce;

auto at = accum_type::other;
if (accum_t.type == CCCL_FLOAT32)
{
  at = accum_type::float32;
}
if (accum_t.type == CCCL_FLOAT64)
{
  at = accum_type::double32;
}

auto ot = op_type::unknown;
switch (op.type)
{
  case CCCL_PLUS:
    ot = op_type::plus;
    break;
  case CCCL_MINIMUM:
  case CCCL_MAXIMUM:
    ot = op_type::min_or_max;
    break;
  default:
    break;
}

using cub::detail::RuntimeReduceAgentPolicy;
auto reduce_policy = RuntimeReduceAgentPolicy::from_json(runtime_policy, "ReducePolicy");
auto st_policy     = RuntimeReduceAgentPolicy::from_json(runtime_policy, "SingleTilePolicy");
auto os            = offset_size::_8; // sizeof(uint64_t)
```
This should be centralized. Not just for c.parallel (so that we can avoid re-stating this over and over again in mimicry of the CUB classify calls), but also for CUB itself so that c.parallel can just do this per category (op_type, accum_type) instead of doing it per algorithm.
```cpp
  using MaxPolicy = Policy1000;
};

struct arch_policies // equivalent to the policy_hub, holds policies for a bunch of CUDA architectures
```
This is an internal type, but one that still materializes when users invoke the algorithms, right? I wonder if this should turn into a template and its data members should be turned into an environment returning those values by queries, because as is, any change to the layout would be an ABI break...
A very appealing aspect of the current design is that tuning information is expressed very simply as structs with data members, so I would love if we could keep that.
Regarding ABI breaks, we do allow those at every release. This is pointed out in our README:

> Symbols in the thrust:: and cub:: namespaces may break ABI at any time without warning.
I agree; however, it'd be nice to have the ABI break manifest as a linker error instead of being entirely silent.
Doesn't this already happen automatically, since each CCCL release will have all CUB and Thrust entities in a different inline namespace? Like, now the type is called cub::_V_300300_SM120::detail::reduce. With the next release, the 300300 changes to 300400. What more is needed?
@bernhardmgruber can we see a comparison in compile time between this approach and the new one for the DeviceReduce tests? I want to see if there is any impact (for better or worse) on compile time with the new tuning machinery.
🥳 CI Workflow Results: 🟩 Finished in 12h 31m. Pass: 100%/98 | Total: 5d 07h | Max: 5h 13m | Hits: 33%/97705
```cpp
using dispatch_reduce_t =
  DispatchReduce<arg_index_input_iterator_t,
                 accumulating_transform_out_it_t,
                 PerPartitionOffsetT,
                 ReductionOpT,
                 empty_problem_init_t,
                 per_partition_accum_t,
                 ::cuda::std::identity,
                 PolicyChainT>;
```
Critical: PolicyChainT was passed here to dispatch_reduce_t and is later picked up by dispatch_reduce_t::Dispatch. The replacement call reduce::dispatch<per_partition_accum_t> no longer carries forward this information.
Part of #6368, which was design-approved yesterday. The goal is to merge refactorings like the one here continuously, but avoid any public exposure of the tuning APIs for now. We can turn them live once we have completed the rewrite.
Quick benchmark of cub.bench.reduce.sum.base on my RTX 5090, since the SASS diff would not cover regressions in host code. LGTM.

cub.bench.reduce.sum.base and cub.bench.transform_reduce.sum.base on sm120, running 3 times (excluding 1 warmup):

before: (results not preserved in this export)

after: (results not preserved in this export)

Fixes: #6565