This repository has been archived by the owner on Mar 21, 2024. It is now read-only.

Transparent support for 64-bit indexing in device algorithms #212

Closed · 7 tasks
alliepiper opened this issue Oct 13, 2020 · 17 comments
Labels
cub · helps: quda (Helps or needed by QUDA.) · P0: must have (Absolutely necessary. Critical issue, major blocker, etc.) · type: bug: functional (Does not work as intended.)

Comments

@alliepiper
Collaborator

alliepiper commented Oct 13, 2020

Summary

The user-friendly cub::Device* entry points into the CUB device algorithms assume that the problem size can be indexed with a 32-bit int. As evidenced by a slew of bug reports against both CUB and Thrust, this often surprises users.

Current Workarounds

In #129, the recommendation is made to have users reach into the implementation details of CUB to directly instantiate and use the underlying cub::Dispatch* interfaces with 64-bit OffsetT types. This is what Thrust has been doing with its THRUST_INDEX_TYPE_DISPATCH macros.
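
To make the pattern concrete, the runtime index-type dispatch looks roughly like this (a minimal sketch with hypothetical function names, not Thrust's actual macro):

#include <cstddef>
#include <cstdint>
#include <limits>

// Stand-in for a cub::Dispatch*-style entry point instantiated with OffsetT.
template <typename OffsetT>
void run_algorithm(const float* in, float* out, OffsetT num_items)
{
  for (OffsetT i = 0; i < num_items; ++i) { out[i] = in[i]; }
}

void run_algorithm_any_size(const float* in, float* out, std::size_t num_items)
{
  if (num_items <= static_cast<std::size_t>(std::numeric_limits<std::int32_t>::max()))
  {
    // Fits in 32 bits: use the (faster) 32-bit indexed instantiation.
    run_algorithm<std::int32_t>(in, out, static_cast<std::int32_t>(num_items));
  }
  else
  {
    // Otherwise fall back to the 64-bit indexed instantiation.
    run_algorithm<std::int64_t>(in, out, static_cast<std::int64_t>(num_items));
  }
}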

Details

Note: this section is out of date and under active discussion. See the comments below for the current proposal.

Currently, the CUB device algorithms will fail if the problem size cannot be indexed using 32-bit integers. Users must reach past the public APIs and into the Dispatch layer to directly instantiate the algorithms with 64-bit index types if they want to use larger inputs. CUB's test harness does not check whether these instantiations yield correct results, and while performance is expected to suffer with 64-bit indices, this impact has not been quantified.

This situation is fragile and confusing for users. Large problem sizes are not uncommon in modern HPC applications, and we should fully support, test, and evaluate the performance of these use cases.

Some concerns have been raised about increasing compile times. Since the algorithm implementations must be instantiated twice, once for each index type, the build time is expected to roughly double. This can be worked around by controlling the instantiation with preprocessor macros. Users that will primarily target large datasets may only want to instantiate the 64-bit indexed path, while users who exclusively deal with smaller data can safely restrict the instantiations to the 32-bit indexed path.

Disabling either the 32-bit or the 64-bit indexed path will limit the capability of the algorithm at runtime, and users should be able to detect when their problem size is inappropriate for the available indexing options. If the problem size is too large for 32-bit indexing and 64-bit indexing is disabled, the algorithm will fail gracefully with a clearly explained diagnostic. If the problem size is appropriate for 32-bit indexing but only 64-bit indexing is available, the user will be able to request that a warning be written to the CubLog diagnostic stream.
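
For illustration, the runtime checks could look something like this (a sketch only; the macro and function names are hypothetical, and fprintf stands in for the CubLog stream):

#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <limits>
#include <cuda_runtime_api.h>

inline cudaError_t CheckOffsetSupport(std::size_t num_items)
{
#if defined(CUB_NO_64BIT_OFFSETS)
  // Only the 32-bit path is compiled in: fail gracefully if the input is too large.
  if (num_items > static_cast<std::size_t>(std::numeric_limits<std::int32_t>::max()))
  {
    std::fprintf(stderr,
                 "CUB error: num_items (%zu) requires 64-bit indexing, "
                 "but the 64-bit indexed path is disabled.\n",
                 num_items);
    return cudaErrorInvalidValue;
  }
#elif defined(CUB_NO_32BIT_OFFSETS) && defined(CUB_WARN_WHEN_32BIT_SUFFICES)
  // Only the 64-bit path is compiled in: optionally warn when 32 bits would do.
  if (num_items <= static_cast<std::size_t>(std::numeric_limits<std::int32_t>::max()))
  {
    std::fprintf(stderr,
                 "CUB note: num_items (%zu) fits 32-bit indexing, "
                 "but only the 64-bit indexed path is instantiated.\n",
                 num_items);
  }
#endif
  return cudaSuccess;
}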

An additional concern is that this will require changes to the device algorithm call signatures, which exclusively use int parameters to pass problem sizes. These will need to be updated to size_t or cuda::std::int64_t. This change will be source compatible with existing usage, and is safe to do in a minor release.

Deliverables

  • Device algorithms will detect problem sizes at runtime and dispatch to either 32-bit or 64-bit indexed instantiations.
    • Use a mechanism similar to what Thrust does with the indexing macros.
  • Opt-out of 64-bit or 32-bit instantiations via preprocessor macros.
    • Save compile time for power users who are able to restrict problem sizes.
  • Preprocessor option to log warnings when 32-bit indexing could be used but is disabled.
  • Graceful failure with clearly explained diagnostic when 64-bit indexing is required but disabled.
  • Test all device algorithms with large problem sizes.
  • Unit test for '64-bit indexing required but disabled' diagnostic.
  • Remove dispatching code from Thrust and just call CUB device algorithms directly.
    • Makes Thrust code easier to read, write, and maintain.
@leofang
Member

leofang commented Oct 13, 2020

Thank you for raising this concern, @allisonvacanti. A point of reference for this discussion can be found in cupy/cupy#3309, which in turn points to #129 that you already referenced. An acceptable compromise to me is to keep cub::Dispatch* documented as a public API. Downstream libraries like CuPy need this kind of strong guarantee in order to move on and actually use it. We don't have the bandwidth to chase after changes in CUB if it's considered private. Thanks.

@alliepiper
Collaborator Author

@leofang My suggestion was to keep the Dispatch layer private, but automatically switch to 64-bit indices in the Device layer when needed. This way, there's only one entry point that users should need to worry about and they won't have to even consider their input size when calling into CUB -- we'll just do "The Right Thing" for the provided inputs.

Is there a reason you would prefer to continue using the Dispatch layer instead?

@leofang
Member

leofang commented Oct 13, 2020

I was merely summarizing my thoughts from cupy/cupy#3309. As I said, it's a compromise, so if there's an alternative way (like what you suggested) it would certainly be better! 🙂 We haven't done any work using the dispatch layer yet.

@alliepiper
Collaborator Author

Related: #215 from @RAMitchell fixed a number of index truncations in the Agent layer.

@alliepiper alliepiper changed the title Fix CUB device algorithms to support large (>32 bit) indices Transparent support for 64-bit indexing in Device Algorithms Nov 19, 2020
@alliepiper alliepiper changed the title Transparent support for 64-bit indexing in Device Algorithms Transparent support for 64-bit indexing in device algorithms Nov 19, 2020
@ngimel

ngimel commented Feb 8, 2021

It is certainly true that adding int64_t instantiations increases compile time, and that they come with a non-trivial performance penalty. In PyTorch land we are working around both of these problems by splitting inputs into chunks that can be processed with int32_t indexing. This may be hard or even impossible for some algorithms (e.g. sort), but others (reduction, scan, compaction) lend themselves very well to this optimization. In fact, PyTorch has its own wrapper around scan that allows it to use the int32_t scan from CUB for arbitrarily large inputs: https://github.com/pytorch/pytorch/blob/d9e6750759b78c68e7d98b80202c67bea7ba24ec/aten/src/ATen/native/cuda/ScanKernels.cu#L474-L518
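
For reference, the chunking idea for a simple reduction could look roughly like this (a sketch only -- it just sums per-chunk partials afterwards, and error checking is omitted; the PyTorch scan wrapper linked above handles the harder case of carrying results across chunks):

#include <cub/cub.cuh>
#include <cuda_runtime.h>
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Sum an input that may be too large for 32-bit indexing by reducing
// INT32_MAX-sized chunks with the 32-bit indexed cub::DeviceReduce::Sum.
float ChunkedSum(const float* d_in, std::size_t num_items)
{
  const std::size_t chunk_size = static_cast<std::size_t>(INT32_MAX);
  const std::size_t num_chunks = (num_items + chunk_size - 1) / chunk_size;

  float* d_partials = nullptr;
  cudaMalloc(&d_partials, num_chunks * sizeof(float));

  void* d_temp = nullptr;
  std::size_t temp_bytes = 0;

  for (std::size_t c = 0; c < num_chunks; ++c)
  {
    const std::size_t offset = c * chunk_size;
    const int n = static_cast<int>(std::min(chunk_size, num_items - offset));

    // Query temp storage on the first (largest) chunk, then reuse it.
    if (d_temp == nullptr)
    {
      cub::DeviceReduce::Sum(nullptr, temp_bytes, d_in + offset, d_partials + c, n);
      cudaMalloc(&d_temp, temp_bytes);
    }
    cub::DeviceReduce::Sum(d_temp, temp_bytes, d_in + offset, d_partials + c, n);
  }

  // The number of partials is small, so combine them on the host.
  std::vector<float> h_partials(num_chunks);
  cudaMemcpy(h_partials.data(), d_partials, num_chunks * sizeof(float),
             cudaMemcpyDeviceToHost);

  float total = 0.0f;
  for (float p : h_partials)
  {
    total += p;
  }

  cudaFree(d_temp);
  cudaFree(d_partials);
  return total;
}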

@alliepiper alliepiper modified the milestones: 1.12.0, 1.13.0 Feb 8, 2021
@alliepiper
Collaborator Author

Just to update the status of this issue, it's been pushed back to the 1.13-1.14 milestone. I'm working on a new performance monitoring and tuning framework for Thrust/CUB, and I'd like to have that in place before I make any large changes that could impact performance. 64-bit indexing is still one of my top priorities once I can safely start making sweeping changes like this.

@ngimel Thanks for sharing that example -- it sounds like an approach we should explore to optimize these cases.

@alliepiper alliepiper modified the milestones: 1.13.0, 1.14.0 Mar 1, 2021
@maddyscientist maddyscientist added the helps: quda Helps or needed by QUDA. label Apr 20, 2021
@alliepiper
Collaborator Author

alliepiper commented Sep 23, 2021

This came up in the context of #340, making a note here.

The "preprocessor macros to control 32-bit vs 64-bit vs both code paths" implementation could be something like the following (note that I've omitted error handling for when num_items exceeds the range of the chosen OffsetT; the actual implementation should check for this).

// Create cub/detail/offset_dispatch.cuh and define this CUB_OFFSET_DISPATCH
// macro:

#if defined(CUB_NO_32BIT_OFFSETS) // Always 64-bit offsets

#define CUB_OFFSET_DISPATCH(impl32, impl64, num_items) impl64

#elif defined(CUB_NO_64BIT_OFFSETS) // Always 32-bit offsets

#define CUB_OFFSET_DISPATCH(impl32, impl64, num_items) impl32

#else // Default; Runtime check + select best offsets

#define CUB_OFFSET_DISPATCH(impl32, impl64, num_items)                         \
  do                                                                           \
  {                                                                            \
    if (num_items <= cub::NumericTraits<std::int32_t>::Max())                  \
    {                                                                          \
      impl32;                                                                  \
    }                                                                          \
    else                                                                       \
    {                                                                          \
      impl64;                                                                  \
    }                                                                          \
  } while (false)
#endif

// Use CUB_OFFSET_DISPATCH in the Device layer to pick a Dispatch
// implementation:
struct DeviceRadixSort
{
  template <typename KeyT>
  CUB_RUNTIME_FUNCTION static cudaError_t SortKeys(...,
                                                   std::size_t num_items,
                                                   ...)
  {
    using Dispatch32 = DispatchRadixSort<false, KeyT, NullType, std::int32_t>;
    using Dispatch64 = DispatchRadixSort<false, KeyT, NullType, std::int64_t>;

    CUB_OFFSET_DISPATCH(
      return Dispatch32::Dispatch(...,
                                  static_cast<std::int32_t>(num_items),
                                  ...),
      return Dispatch64::Dispatch(...,
                                  static_cast<std::int64_t>(num_items),
                                  ...),
      num_items);
  }
};

Other notes:

  • The Device layer functions should accept std::size_t num_items.
  • Internally we'll use OffsetT = std::int64_t or std::int32_t when instantiating the Dispatch layer templates.
    • We can revisit this later, but for now stick with signed types. The current implementations are written and tested using signed types, and we'd risk introducing regressions by switching to unsigned.
  • We should leave the Device layer functions that accept int num_items in place.
    • This way users won't have to work around signed-vs-unsigned warnings in code that currently passes int.
    • Once we have all CUB device algorithms ported to support std::size_t num_items, we can deprecate all of the int num_items overloads at once (a sketch of the forwarding overload follows this list).
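
A sketch of what keeping the int overload alongside the new std::size_t overload might look like (DeviceExample and the exact Sum signature are hypothetical stand-ins, not a committed CUB API):

#include <cstddef>
#include <cuda_runtime_api.h>

struct DeviceExample
{
  // New primary overload: accepts the full range of problem sizes.
  static cudaError_t Sum(void* d_temp_storage, std::size_t& temp_storage_bytes,
                         const float* d_in, float* d_out, std::size_t num_items)
  {
    // ... deduce OffsetT and call the Dispatch layer here ...
    (void)d_temp_storage; (void)temp_storage_bytes;
    (void)d_in; (void)d_out; (void)num_items;
    return cudaSuccess;
  }

  // Existing overload, kept so current callers passing int keep compiling
  // without new signed-vs-unsigned warnings; to be deprecated later.
  static cudaError_t Sum(void* d_temp_storage, std::size_t& temp_storage_bytes,
                         const float* d_in, float* d_out, int num_items)
  {
    return Sum(d_temp_storage, temp_storage_bytes, d_in, d_out,
               static_cast<std::size_t>(num_items));
  }
};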

@cliffburdick

I think other libraries were using int64_t or long long int rather than size_t, with the thought being that 2^63 was sufficiently large to handle all cases, and the sign bit could help with sentinels or error codes. If I'm understanding correctly, is the user going to be exposed to a size_t, but internally it will be cast to a signed 64-bit type? Can the exposed API just be signed 64?

@jrhemstad
Collaborator

It is certainly true that adding int64_t instantiations increases compile time, and that they come with a non-trivial performance penalty. In PyTorch land we are working around both of these problems by splitting inputs into chunks that can be processed with int32_t indexing.

To add more context, in RAPIDS cuDF our input sizes are also limited to int32_t (but this is not without a lot of consternation from users).

This requirement largely comes from the Apache Arrow spec, which historically limited the size of a column to an int32_t.

I'd be curious to look at how the various STL implementations handle this problem in their std:: algorithms. Maybe they just always use size_t and don't worry about it.

@jrhemstad
Collaborator

jrhemstad commented Sep 23, 2021

Couldn't the index type also be inferred from std::iterator_traits<It>::difference_type? That might annoy existing users passing in raw pointers, since that will default to 64-bit indices (ptrdiff_t). CUB could provide a simple wrapper to allow using a smaller difference_type if they know they want that.

Looking at something like std::for_each_n, the size parameter type is a template, so that allows configuration and avoids any double instantiation without any macro stuff.

It wouldn't work for existing code, but one option would be making all CUB algorithms be *_n algorithms where the size type is a template.

This is much further out, but we could also look at solving this problem with a ranges-like interface for CUB algorithms. The return type of ranges::size is deduced from the static properties of the viewed range, so users could select the CUB instantiation implicitly based on the size_type of their container/range.
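
To illustrate the difference_type idea and the *_n-style templated size parameter (hypothetical names, modeled on std::for_each_n):

#include <cstddef>
#include <iterator>

// For raw pointers this is std::ptrdiff_t, i.e. 64-bit indexing by default.
template <typename It>
using default_offset_t = typename std::iterator_traits<It>::difference_type;

// As with std::for_each_n, only the Size instantiations a caller actually
// uses get compiled -- no macros, no mandatory double instantiation.
template <typename It, typename Size, typename F>
It my_for_each_n(It first, Size n, F f)
{
  for (Size i = 0; i < n; ++i, ++first)
  {
    f(*first);
  }
  return first;
}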

@alliepiper
Collaborator Author

Can the exposed API just be signed 64?

I'm concerned about signed/unsigned warnings in the common case of using a container (e.g. the std::size_t returned by thrust::device_vector<T>::size()) to produce num_items. If we want to use the extra bit for sentinels, etc., that'd be an implementation detail that shouldn't be exposed in the public API.

Could this be a template parameter?

I haven't thought this through completely, but I'm not opposed to this idea. The int case for existing users would still work, and we can use runtime checks to ensure that the passed-in type can be converted to a known-good OffsetT, erroring out if needed.

Since the STL handles it this way I imagine it should be fairly robust. Maybe @brycelelbach @griwes or @ericniebler have thoughts on this.

One potential issue is that if a user calls the same function with different types for the size, they'll instantiate multiple instances of what is essentially the same function. Since pretty much everything in CUB is marked __forceinline__, this could be surprisingly problematic. But maybe this will just motivate us to remove some of those annotations...

Couldn't the index type also be inferred from the std::iterator_traits<It>::difference_type?

This would be correct, but I'd prefer the template over this, mainly for the simpler API :P The "64-bit ptrdiff_t" issue shouldn't be a problem, since we'd check at runtime whether 32- or 64-bit indices should be used in the actual implementation based on the value of num_items.

@alliepiper
Collaborator Author

Updating from an offline conversation: we may want to replace the macros with a per-call solution via a policy class. This would prevent surprises (like changing the behavior of Thrust/stdpar).

It would also open the door for additional customizations, like tuning parameter overrides, launch bounds, etc.

@alliepiper
Collaborator Author

alliepiper commented Sep 28, 2021

Another update from an offline conversation:

Currently, the preferred approach is to take the SizeT provided by the user (that is, the inferred type of num_items) and use it to statically deduce OffsetT, adjusting for signedness, etc., as required by the algorithm's implementation:

struct DeviceRadixSort
{
  template <..., typename SizeT, ...>
  CUB_RUNTIME_FUNCTION static
  cudaError_t SortKeys(..., SizeT num_items, ...)
  {
    static_assert(std::is_integral<SizeT>::value, "SizeT must be an integral type.");
    using OffsetT = /* Either int32 or int64 depending on whether or not int32 can represent the full range of SizeT */
    using Dispatch = DispatchRadixSort<..., OffsetT>;
    return Dispatch::Dispatch(..., static_cast<OffsetT>(num_items), ...);
  }
};
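
One way the placeholder OffsetT selection in the sketch above could be spelled (again just a sketch; per the later discussion, unsigned offset types may be chosen instead):

#include <cstddef>
#include <cstdint>
#include <type_traits>

// Pick int32 when it can represent every value of SizeT, otherwise int64.
template <typename SizeT>
using offset_for_size_t = typename std::conditional<
  (sizeof(SizeT) < sizeof(std::int32_t)) ||
    (sizeof(SizeT) == sizeof(std::int32_t) && std::is_signed<SizeT>::value),
  std::int32_t,
  std::int64_t>::type;

static_assert(std::is_same<offset_for_size_t<int>, std::int32_t>::value, "");
static_assert(std::is_same<offset_for_size_t<std::uint32_t>, std::int64_t>::value, "");
static_assert(std::is_same<offset_for_size_t<std::size_t>, std::int64_t>::value, "");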

This is a much simpler approach than I previously proposed, and provides the user with the most control and flexibility. For each call into a CUB algorithm, the user can choose the exact behavior they desire:

  • Don't care about peak performance and just want decent performance, ease of use, fast(ish) compiles, and reliability? Just pass std::size_t.
  • Don't care about 64-bit support and want maximum performance for your smallish datasets? Just pass an int.
  • Want peak performance under variable conditions and don't mind paying the extra compilation cost? Explicitly use runtime dispatch.

If we properly document this as a performance consideration, users should be able to get exactly what they want out of it.

Additional considerations:

  • Avoiding unnecessary template instantiations is important.
    • Compiling CUB code is slow. For instance, Thrust's test suite takes 2-3 minutes to build for the CPU threaded backends, but 20-60 minutes for the CUB backend (depending on arch flags).
    • Consumers of CUB often instantiate our algorithms for at least a dozen types in generic template code. According to @jrhemstad, RAPIDS needs to support 25 distinct types. Other libraries I've worked on stick with the 14 built-in numeric types, and face frequent user complaints about compile times and memory requirements when building the CUB-based backends.
  • Initial benchmarks show that the overhead of using 64-bit indices for inputs that could be indexed with int32 is 10% at most, and 1-3% in most cases.
    • This isn't great, but it isn't catastrophic. Users can write their own runtime dispatch if they're concerned about this.
    • @griwes points out that this will be a perf regression for anyone who currently passes vec.size() or similar to a CUB algorithm, and it will require them to update their code to resolve this.
      • I'm not too concerned about this case. Those users currently have buggy code if they're relying on an implicit narrowing conversion, and should update their usage anyway. Their code will be more robust after we make this change.
  • How many OffsetT should we support in the deduction?
    • We'll need to explicitly add tests for each type we support, so I think two should be plenty: uint32 and uint64.
    • We've had issues with some algorithm implementations breaking when OffsetT is unsigned. As we port these algorithms to have 64-bit support, we should make them work with unsigned offsets. Otherwise we'll have a weird perf cliff where SizeT = int32_t -> OffsetT = int32_t, but SizeT = uint32_t -> OffsetT = int64_t.
    • There's no point in using less than 4 byte offset types because the hardware will promote smaller offsets to 4B.
  • We need a way to determine the maximum value of a SizeT at compile-time from host/device code.
    • std::numeric_limits<T>::max() is not usable from device code.
    • cub::NumericTraits<T>::Max() is not constexpr. (We can change this -- this is likely the short-term fix.)
    • Eventually we'll be adding a dependency on libcu++, so we can use their numeric_limits implementation when that happens. This is the best long-term fix (see the sketch after this list).
  • Since the Device APIs may be instantiated for multiple SizeT that map to the same OffsetT, we should make sure that we aren't force-inlining the Dispatch layer -- otherwise we'll bloat the binaries with multiple copies of the same Dispatch instantiations.
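
For reference, once the libcu++ dependency lands, the long-term fix could be as simple as this (a sketch; requires nvcc and libcu++'s <cuda/std/limits>):

#include <cuda/std/limits>
#include <cstdint>

// Usable in both host and device code, and constexpr, unlike std::numeric_limits
// from <limits> in device code or the current cub::NumericTraits<T>::Max().
template <typename SizeT>
__host__ __device__ constexpr SizeT MaxOfSizeT()
{
  return cuda::std::numeric_limits<SizeT>::max();
}

static_assert(MaxOfSizeT<std::int32_t>() == 2147483647, "");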

@alliepiper
Collaborator Author

Updated the previous comment to incorporate additional feedback from @canonizer, @jrhemstad, and @dumerrill.

Changes:

  • Added static_assert(std::is_integral<SizeT>::value) to Device function body.
  • Use unsigned offsets instead of signed -- since we'll be porting these algorithms incrementally, we can afford to spend some time fixing any issues that arise from the change in signedness. This avoids surprises when SizeT is uint32 vs int32 (the former would be forced to use int64 signed offsets otherwise).
  • Added note that there's no point in optimizing for sizeof(OffsetT) < 4, since hardware will promote to 4B anyway.

@jrhemstad
Collaborator

Use unsigned offsets instead of signed -- since we'll be porting these algorithms incrementally, we can afford to spend some time fixing any issues that arise from the change in signedness. This avoids surprises when SizeT is uint32 vs int32 (the former would be forced to use int64 signed offsets otherwise).

I just happened to be re-reviewing this issue and noticed this.

We should be very careful about forcing things to unsigned. Using unsigned types for loops/indexing can impact performance: the compiler usually isn't able to unroll a loop that uses an unsigned loop variable, because it has to allow for the unsigned type to wrap around, whereas it can assume a signed type will not overflow.
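
A minimal illustration of the point (how much this matters in practice depends on the compiler and the loop):

// Signed induction variable: signed overflow is undefined behavior, so the
// compiler may assume `i += 4` never wraps and can reason precisely about the
// trip count, which helps unrolling and vectorization.
void scale_signed(float* data, int n)
{
  for (int i = 0; i < n; i += 4)
  {
    data[i] *= 2.0f;
  }
}

// Unsigned induction variable: `i += 4` is defined to wrap modulo 2^32, so the
// compiler has to account for that possibility, which can block the same
// transformations.
void scale_unsigned(float* data, unsigned int n)
{
  for (unsigned int i = 0; i < n; i += 4)
  {
    data[i] *= 2.0f;
  }
}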

@jrhemstad
Collaborator

How many OffsetT should we support in the deduction?
We'll need to explicitly add tests for each type we support, so I think two should be plenty: uint32 and uint64.

I'm revisiting this conversation after a while and I wanted to double check that our current thinking is to infer the OffsetT from the user-provided iterator and we'll always statically select uint32_t or uint64_t?

If that's the case, I understand the motivation of simplifying things to always use an unsigned type, but I am still concerned about the performance implications of always using unsigned like I mentioned in my earlier comment.

That said, I am also sympathetic to the extra testing burden that would come from having to test with 4 offset types instead of 2. Though do we really need to explicitly test all 4 offset types? I'd think any bugs that would show up from using a signed offset type would stem from overflow that CUB can't do anything about anyway, right?

@jrhemstad
Collaborator

I'm going to close this in favor of NVIDIA/cccl#47, where we will distill the relevant conclusions from the discussion here.
