Transparent support for 64-bit indexing in device algorithms #212
Comments
Thank you for raising this concern @allisonvacanti. A point of reference for this discussion can be found in cupy/cupy#3309, which in turn points to #129 that you already referenced. An acceptable compromise to me is to keep using the Dispatch layer for 64-bit problem sizes.
@leofang My suggestion was to keep the Dispatch layer private, but automatically switch to 64-bit indices in the Device layer when needed. This way, there's only one entry point that users need to worry about, and they won't even have to consider their input size when calling into CUB -- we'll just do "The Right Thing" for the provided inputs. Is there a reason you would prefer to continue using the Dispatch layer instead?
I was merely summarizing my thoughts from cupy/cupy#3309. As I said it's a compromise, so if there's an alternative way (like what you suggested) it would certainly be better! 🙂 We haven't done any work using the dispatch layer yet.
Related: #215 from @RAMitchell fixed a number of index truncations in the Agent layer.
It is certainly true that adding int64_t instantiations increases compile time, and that they come with a non-trivial performance penalty. In pytorch land we are working around both of these problems by splitting inputs into chunks that can be processed with int32_t indexing. This may be hard or impossible to do for some algorithms (e.g. sort), but others (reduction, scan, compaction) lend themselves very well to this optimization. In fact, pytorch has its own wrapper around scan that allows it to use the int32_t scan from cub for arbitrarily large inputs: https://github.com/pytorch/pytorch/blob/d9e6750759b78c68e7d98b80202c67bea7ba24ec/aten/src/ATen/native/cuda/ScanKernels.cu#L474-L518
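A rough sketch of that chunking idea, using a reduction rather than a scan for simplicity (the function and buffer names here are illustrative, not from any library; error handling and the usual nullptr temp-storage size query are omitted):

#include <cub/cub.cuh>

#include <algorithm>
#include <climits>
#include <cstddef>

// Sum more than INT_MAX elements using only 32-bit-indexed cub::DeviceReduce calls:
// reduce each chunk of at most INT_MAX items into a per-chunk partial sum, then
// reduce the (small) array of partial sums. The caller is assumed to have sized
// d_temp via the usual nullptr query and allocated one d_partials slot per chunk.
cudaError_t ChunkedSum(const float* d_in, float* d_out, std::size_t num_items,
                       float* d_partials, void* d_temp, std::size_t temp_bytes)
{
  const std::size_t chunk = static_cast<std::size_t>(INT_MAX);
  std::size_t num_chunks = 0;
  for (std::size_t offset = 0; offset < num_items; offset += chunk, ++num_chunks)
  {
    const int n = static_cast<int>(std::min(chunk, num_items - offset));
    cub::DeviceReduce::Sum(d_temp, temp_bytes, d_in + offset,
                           d_partials + num_chunks, n);
  }
  // Second pass over the per-chunk results produces the final sum.
  return cub::DeviceReduce::Sum(d_temp, temp_bytes, d_partials, d_out,
                                static_cast<int>(num_chunks));
}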
Just to update the status of this issue, it's been pushed back to the 1.13-1.14 milestone. I'm working on a new performance monitoring and tuning framework for Thrust/CUB, and I'd like to have that in place before I make any large changes that could impact performance. 64-bit indexing is still one of my top priorities once I can safely start making sweeping changes like this. @ngimel Thanks for sharing that example -- it sounds like an approach that we should explore to optimize these cases.
This came up in the context of #340, making a note here. The "preprocessor macros to control 32-bit vs 64-bit vs both code paths" implementation could be something like the following (note that I've omitted error handling):

// Create cub/detail/offset_dispatch.cuh and define this CUB_OFFSET_DISPATCH
// macro:
#if defined(CUB_NO_32BIT_OFFSETS) // Always 64-bit offsets
#define CUB_OFFSET_DISPATCH(impl32, impl64, num_items) impl64
#elif defined(CUB_NO_64BIT_OFFSETS) // Always 32-bit offsets
#define CUB_OFFSET_DISPATCH(impl32, impl64, num_items) impl32
#else // Default; Runtime check + select best offsets
#define CUB_OFFSET_DISPATCH(impl32, impl64, num_items) \
do \
{ \
if (num_items <= cub::NumericTraits<std::int32_t>::Max()) \
{ \
impl32; \
} \
else \
{ \
impl64; \
} \
} while (false)
#endif
// Use CUB_OFFSET_DISPATCH in the Device layer to pick a Dispatch
// implementation:
struct DeviceRadixSort
{
template <typename KeyT>
CUB_RUNTIME_FUNCTION static cudaError_t SortKeys(...,
std::size_t num_items,
...)
{
using Dispatch32 = DispatchRadixSort<false, KeyT, NullType, std::int32_t>;
using Dispatch64 = DispatchRadixSort<false, KeyT, NullType, std::int64_t>;
CUB_OFFSET_DISPATCH(
return Dispatch32::Dispatch(...,
static_cast<std::int32_t>(num_items),
...),
return Dispatch64::Dispatch(...,
static_cast<std::int64_t>(num_items),
...),
num_items);
}
};

Other notes:
I think other libraries were using the Dispatch layer directly for this as well.
To add more context, in RAPIDS cuDF our input sizes are also limited to what a 32-bit int can index.

This requirement largely comes from the Apache Arrow spec, which historically limited the size of a column to an int32.

I'd be curious to look at how various STL implementations handle this problem in their various algorithms.
Couldn't the index type also be inferred from the type of the num_items argument? Looking at something like how the STL deduces count types from the caller's arguments, this seems feasible.

It wouldn't work for existing code, but one option would be making all CUB algorithms templated on the size type.

This is much further out, but we could also look at solving this problem with a different style of API altogether.
I'm concerned about signed/unsigned warnings in the common case of using a container (e.g. passing std::vector::size(), which is unsigned, where a signed num_items is expected).
I haven't thought this through completely, but I'm not opposed to this idea. The STL does something similar for algorithms that take a count. Since the STL handles it this way I imagine it should be fairly robust. Maybe @brycelelbach @griwes or @ericniebler have thoughts on this.

One potential issue is that if a user calls the same function with different types for the size, they'll instantiate multiple instances of what is essentially the same function. Since pretty much everything in CUB is marked inline, though, it's not clear how much that duplication would matter in practice.
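Assuming the STL precedent in question is the deduced count parameter of algorithms like std::copy_n, a minimal illustration of that mechanism:

#include <algorithm>
#include <cstdint>
#include <vector>

// The count parameter of std::copy_n is a deduced template parameter, so the type
// the caller passes (int, std::int64_t, std::size_t, ...) selects the instantiation --
// the same mechanism being proposed for num_items here. out is assumed to already
// hold at least in.size() elements.
void CopyAll(const std::vector<float>& in, std::vector<float>& out)
{
  std::copy_n(in.begin(), static_cast<std::int64_t>(in.size()), out.begin());
}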
This would be correct, but I'd prefer the template over this, mainly for the simpler API :P The "64-bit
Updating with an offline conversation, we may want to replace the macros with a per-call solution via a policy class. This would prevent surprises (like changing the behavior of Thrust/stdpar). It would also open the door for additional customizations, like tuning parameter overrides, launch bounds, etc.
Another update from an offline conversation: Currently the preferred approach is to take the num_items argument as a template parameter and infer the offset type from its type.
This is a much simpler approach than I previously proposed, and provides the user with the most control and flexibility. For each call into a CUB algorithm, the user can choose the exact behavior they desire: pass a 32-bit num_items for 32-bit offsets, or a 64-bit num_items for 64-bit offsets.
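A minimal sketch of what that could look like for a single entry point, reusing the placeholder style and the DispatchRadixSort pieces from the macro example above (the exact signature and the signed offset choice are illustrative, not a settled design):

// Hypothetical Device-layer entry point that deduces the offset width from the
// caller's num_items type instead of hard-coding int (needs <cstdint> and <type_traits>).
struct DeviceRadixSort
{
  template <typename KeyT, typename NumItemsT>
  CUB_RUNTIME_FUNCTION static cudaError_t SortKeys(...,
                                                   NumItemsT num_items,
                                                   ...)
  {
    // 32-bit offsets for 32-bit (or narrower) num_items types, 64-bit otherwise.
    using OffsetT = typename std::conditional<(sizeof(NumItemsT) <= 4),
                                              std::int32_t,
                                              std::int64_t>::type;
    using Dispatch = DispatchRadixSort<false, KeyT, NullType, OffsetT>;
    return Dispatch::Dispatch(..., static_cast<OffsetT>(num_items), ...);
  }
};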
If we properly document this as a performance consideration, users should be able to get exactly what they want out of it. Additional considerations:
Updated previous comment to incorporate additional feedback from @canonizer, @jrhemstad, and @dumerrill. Changes:
I just happened to be re-reviewing this issue and noticed this. We should be very careful about forcing things to unsigned. Using unsigned types for loops/indexing can impact performance: the compiler usually cannot unroll a loop that uses an unsigned loop variable, because it has to allow for the unsigned type to overflow, whereas it can assume a signed type will not overflow.
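A small host-side illustration of the concern (illustrative code, not CUB internals):

void ProcessEveryFourthSigned(const float* in, float* out, int n)
{
  // Signed overflow is undefined behavior, so the compiler may assume i + 4 never
  // wraps; it is free to widen i to 64 bits, unroll, or vectorize.
  for (int i = 0; i < n; i += 4)
    out[i] = in[i] * 2.0f;
}

void ProcessEveryFourthUnsigned(const float* in, float* out, unsigned n)
{
  // Unsigned arithmetic wraps by definition, so the compiler must preserve the
  // possibility that i + 4 wraps around, which can block those same transformations.
  for (unsigned i = 0; i < n; i += 4)
    out[i] = in[i] * 2.0f;
}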
I'm revisiting this conversation after a while and I wanted to double check that our current thinking is to infer the offset type from the type of num_items.

If that's the case, I understand the motivation of simplifying things to always use an unsigned type, but I am still concerned about the performance implications of always using unsigned, like I mentioned in my earlier comment.

That said, I am also sympathetic to the extra testing burden that would come from having to test with 4 offset types instead of 2. Though do we really need to explicitly test all 4 offset types? I'd think any bugs that would show up from using a signed offset type would stem from overflow that CUB can't do anything about anyways, right?
I'm going to close this in favor of NVIDIA/cccl#47, where we will distill the relevant conclusions from the discussion here.
Summary
The user-friendly cub::Device* entry points into the CUB device algorithms assume that the problem size can be indexed with a 32-bit int. As evidenced by a slew of bug reports against both CUB and Thrust, this often surprises users.

Current Workarounds
In #129, the recommendation is made to have users reach into the implementation details of CUB to directly instantiate and use the underlying cub::Dispatch* interfaces with 64-bit OffsetT types. This is what Thrust has been doing with its THRUST_INDEX_TYPE_DISPATCH macros.

Details
⚠ Note: this section is out of date and under active discussion. See the comments below for the current proposal. ⚠
Currently, the CUB device algorithms will fail if the problem size cannot be indexed using 32-bit integers. Users must reach past the public APIs and into the Dispatch layer to directly instantiate the algorithms with 64-bit index types if they want to use larger inputs. CUB's test harness does not check whether these instantiations yield correct results, and while performance is expected to suffer with 64-bit indices, this impact is not quantified.
This situation is confusing for users and fragile. Large problem sizes are not uncommon in modern HPC applications, and we should fully support, test, and evaluate the performance of these use cases.
Some concerns have been raised about increasing compile times. Since the algorithm implementations must be instantiated twice, once for each index type, the build time is expected to roughly double. This can be worked around by controlling the instantiation with preprocessor macros. Users who primarily target large datasets may want to instantiate only the 64-bit indexed path, while users who exclusively deal with smaller data can safely restrict the instantiations to the 32-bit indexed path.
Disabling either the 32-bit or 64-bit indexed path will limit the capability of the algorithm at runtime, and users should be able to detect if their problem size is inappropriate for the available indexing options. If the problem size is too large for 32-bit indexing and 64-bit indexing is disabled, the algorithm will fail gracefully with a clearly explained diagnostic. If the problem size is appropriate for 32-bit indexing but only 64-bit indexing is available, the user will be able to request that a warning be written to the CubLog diagnostic stream.
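A rough sketch of that failure path, reusing the proposed CUB_NO_64BIT_OFFSETS macro from the comments above (the macro, message, and error code are illustrative; _CubLog is CUB's existing logging helper):

// Hypothetical guard near the top of a Device-layer entry point that takes
// std::size_t num_items:
#if defined(CUB_NO_64BIT_OFFSETS)
if (num_items > static_cast<std::size_t>(cub::NumericTraits<std::int32_t>::Max()))
{
  // Fail gracefully with a diagnostic instead of silently truncating the size.
  _CubLog("Error: num_items (%llu) requires 64-bit offsets, but this build "
          "only instantiates 32-bit offset kernels.\n",
          static_cast<unsigned long long>(num_items));
  return cudaErrorInvalidValue;
}
#endif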
An additional concern is that this will require changes to the device algorithm call signatures, which exclusively use int parameters to pass problem sizes. These will need to be updated to size_t or cuda::std::int64_t. This change will be source compatible with existing usage, and is safe to do in a minor release.

Deliverables