This repository has been archived by the owner on Mar 21, 2024. It is now read-only.

Transparent support for 64-bit indexing in device algorithms #212

Closed · 7 tasks
alliepiper opened this issue Oct 13, 2020 · 17 comments
Labels
cub · helps: quda (Helps or needed by QUDA.) · P0: must have (Absolutely necessary. Critical issue, major blocker, etc.) · type: bug: functional (Does not work as intended.)

Comments

@alliepiper
Collaborator

alliepiper commented Oct 13, 2020

Summary

The user-friendly cub::Device* entry points into the CUB device algorithms assume that the problem size can be indexed with a 32-bit int. As evidenced by a slew of bug reports against both CUB and Thrust, this often surprises users.

Current Workarounds

In #129, the recommendation is made to have users reach into the implementation details of CUB to directly instantiate and use the underlying cub::Dispatch* interfaces with 64-bit OffsetT types. This is what Thrust has been doing with its THRUST_INDEX_TYPE_DISPATCH macros.
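
To make the pattern concrete, the runtime index-type dispatch looks roughly like this (a minimal sketch with hypothetical function names, not Thrust's actual macro):

#include <cstddef>
#include <cstdint>
#include <limits>

// Stand-in for a cub::Dispatch*-style entry point instantiated with OffsetT.
template <typename OffsetT>
void run_algorithm(const float* in, float* out, OffsetT num_items)
{
  for (OffsetT i = 0; i < num_items; ++i) { out[i] = in[i]; }
}

void run_algorithm_any_size(const float* in, float* out, std::size_t num_items)
{
  if (num_items <= static_cast<std::size_t>(std::numeric_limits<std::int32_t>::max()))
  {
    // Fits in 32 bits: use the (faster) 32-bit indexed instantiation.
    run_algorithm<std::int32_t>(in, out, static_cast<std::int32_t>(num_items));
  }
  else
  {
    // Otherwise fall back to the 64-bit indexed instantiation.
    run_algorithm<std::int64_t>(in, out, static_cast<std::int64_t>(num_items));
  }
}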

Details

Note: this section is out of date and under active discussion. See the comments below for the current proposal.

Currently, the CUB device algorithms will fail if the problem size cannot be indexed using 32-bit integers. Users must reach past the public APIs and into the Dispatch layer to directly instantiate the algorithms with 64-bit index types if they want to use larger inputs. CUB's test harness does not check whether these instantiations yield correct results, and while performance is expected to suffer with 64-bit indices, this impact has not been quantified.

This situation is fragile and confusing for users. Large problem sizes are not uncommon in modern HPC applications, and we should fully support, test, and evaluate the performance of these use cases.

Some concerns have been raised about increasing compile times. Since the algorithm implementations must be instantiated twice, once for each index type, the build time is expected to roughly double. This can be worked around by controlling the instantiation with preprocessor macros. Users that will primarily target large datasets may only want to instantiate the 64-bit indexed path, while users who exclusively deal with smaller data can safely restrict the instantiations to the 32-bit indexed path.

Disabling either the 32-bit or the 64-bit indexed path will limit the capability of the algorithm at runtime, and users should be able to detect when their problem size is inappropriate for the available indexing options. If the problem size is too large for 32-bit indexing and 64-bit indexing is disabled, the algorithm will fail gracefully with a clearly explained diagnostic. If the problem size is appropriate for 32-bit indexing but only 64-bit indexing is available, the user will be able to request that a warning be written to the CubLog diagnostic stream.
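
For illustration, the runtime checks could look something like this (a sketch only; the macro and function names are hypothetical, and fprintf stands in for the CubLog stream):

#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <limits>
#include <cuda_runtime_api.h>

inline cudaError_t CheckOffsetSupport(std::size_t num_items)
{
#if defined(CUB_NO_64BIT_OFFSETS)
  // Only the 32-bit path is compiled in: fail gracefully if the input is too large.
  if (num_items > static_cast<std::size_t>(std::numeric_limits<std::int32_t>::max()))
  {
    std::fprintf(stderr,
                 "CUB error: num_items (%zu) requires 64-bit indexing, "
                 "but the 64-bit indexed path is disabled.\n",
                 num_items);
    return cudaErrorInvalidValue;
  }
#elif defined(CUB_NO_32BIT_OFFSETS) && defined(CUB_WARN_WHEN_32BIT_SUFFICES)
  // Only the 64-bit path is compiled in: optionally warn when 32 bits would do.
  if (num_items <= static_cast<std::size_t>(std::numeric_limits<std::int32_t>::max()))
  {
    std::fprintf(stderr,
                 "CUB note: num_items (%zu) fits 32-bit indexing, "
                 "but only the 64-bit indexed path is instantiated.\n",
                 num_items);
  }
#endif
  return cudaSuccess;
}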

An additional concern is that this will require changes to the device algorithm call signatures, which exclusively use int parameters to pass problem sizes. These will need to be updated to size_t or cuda::std::int64_t. This change will be source compatible with existing usage, and is safe to do in a minor release.

Deliverables

  • Device algorithms will detect problem sizes at runtime and dispatch to either 32-bit or 64-bit indexed instantiations.
    • Use a mechanism similar to what Thrust does with the indexing macros.
  • Opt-out of 64-bit or 32-bit instantiations via preprocessor macros.
    • Save compile time for power users who are able to restrict problem sizes.
  • Preprocessor option to log warnings when 32-bit indexing could be used but is disabled.
  • Graceful failure with clearly explained diagnostic when 64-bit indexing is required but disabled.
  • Test all device algorithms with large problem sizes.
  • Unit test for '64-bit indexing required but disabled' diagnostic.
  • Remove dispatching code from Thrust and just call CUB device algorithms directly.
    • Makes Thrust code easier to read, write, and maintain.
@leofang
Member

leofang commented Oct 13, 2020

Thank you for raising this concern, @allisonvacanti. A point of reference for this discussion can be found in cupy/cupy#3309, which in turn points to #129 that you already referenced. An acceptable compromise to me is to keep cub::Dispatch* documented as a public API. Downstream libraries like CuPy need this kind of strong guarantee in order to move on and actually use it. We don't have the bandwidth to chase after changes in CUB if it's considered private. Thanks.

@alliepiper
Collaborator Author

@leofang My suggestion was to keep the Dispatch layer private, but automatically switch to 64-bit indices in the Device layer when needed. This way, there's only one entry point that users should need to worry about and they won't have to even consider their input size when calling into CUB -- we'll just do "The Right Thing" for the provided inputs.

Is there a reason you would prefer to continue using the Dispatch layer instead?

@leofang
Member

leofang commented Oct 13, 2020

I was merely summarizing my thoughts from cupy/cupy#3309. As I said, it's a compromise, so if there's an alternative way (like what you suggested) it would certainly be better! 🙂 We haven't done any work using the dispatch layer yet.

@alliepiper
Collaborator Author

Related: #215 from @RAMitchell fixed a number of index truncations in the Agent layer.

@alliepiper alliepiper changed the title Fix CUB device algorithms to support large (>32 bit) indices Transparent support for 64-bit indexing in Device Algorithms Nov 19, 2020
@alliepiper alliepiper changed the title Transparent support for 64-bit indexing in Device Algorithms Transparent support for 64-bit indexing in device algorithms Nov 19, 2020
@ngimel

ngimel commented Feb 8, 2021

It is certainly true that adding int64_t instantiations increases compile time, and that they come with a non-trivial performance penalty. In PyTorch land we are working around both of these problems by splitting inputs into chunks that can be processed with int32_t indexing. This may be hard or even impossible for some algorithms (e.g. sort), but others (reduction, scan, compaction) lend themselves very well to this optimization. In fact, PyTorch has its own wrapper around scan that allows it to use the int32_t scan from CUB for arbitrarily large inputs: https://github.com/pytorch/pytorch/blob/d9e6750759b78c68e7d98b80202c67bea7ba24ec/aten/src/ATen/native/cuda/ScanKernels.cu#L474-L518
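
For reference, the chunking idea for a simple reduction could look roughly like this (a sketch only -- it just sums per-chunk partials afterwards, and error checking is omitted; the PyTorch scan wrapper linked above handles the harder case of carrying results across chunks):

#include <cub/cub.cuh>
#include <cuda_runtime.h>
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Sum an input that may be too large for 32-bit indexing by reducing
// INT32_MAX-sized chunks with the 32-bit indexed cub::DeviceReduce::Sum.
float ChunkedSum(const float* d_in, std::size_t num_items)
{
  const std::size_t chunk_size = static_cast<std::size_t>(INT32_MAX);
  const std::size_t num_chunks = (num_items + chunk_size - 1) / chunk_size;

  float* d_partials = nullptr;
  cudaMalloc(&d_partials, num_chunks * sizeof(float));

  void* d_temp = nullptr;
  std::size_t temp_bytes = 0;

  for (std::size_t c = 0; c < num_chunks; ++c)
  {
    const std::size_t offset = c * chunk_size;
    const int n = static_cast<int>(std::min(chunk_size, num_items - offset));

    // Query temp storage on the first (largest) chunk, then reuse it.
    if (d_temp == nullptr)
    {
      cub::DeviceReduce::Sum(nullptr, temp_bytes, d_in + offset, d_partials + c, n);
      cudaMalloc(&d_temp, temp_bytes);
    }
    cub::DeviceReduce::Sum(d_temp, temp_bytes, d_in + offset, d_partials + c, n);
  }

  // The number of partials is small, so combine them on the host.
  std::vector<float> h_partials(num_chunks);
  cudaMemcpy(h_partials.data(), d_partials, num_chunks * sizeof(float),
             cudaMemcpyDeviceToHost);

  float total = 0.0f;
  for (float p : h_partials)
  {
    total += p;
  }

  cudaFree(d_temp);
  cudaFree(d_partials);
  return total;
}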

@alliepiper alliepiper modified the milestones: 1.12.0, 1.13.0 Feb 8, 2021
@alliepiper
Collaborator Author

Just to update the status of this issue, it's been pushed back to the 1.13-1.14 milestone. I'm working on a new performance monitoring and tuning framework for Thrust/CUB, and I'd like to have that in place before I make any large changes that could impact performance. 64-bit indexing is still one of my top priorities once I can safely start making sweeping changes like this.

@ngimel Thanks for sharing that example -- it sounds like an approach we should explore to optimize these cases.

@alliepiper alliepiper modified the milestones: 1.13.0, 1.14.0 Mar 1, 2021
@maddyscientist maddyscientist added the helps: quda Helps or needed by QUDA. label Apr 20, 2021
@alliepiper
Collaborator Author

alliepiper commented Sep 23, 2021

This came up in the context of #340, making a note here.

The "preprocessor macros to control 32-bit vs 64-bit vs both code paths" implementation could be something like the following (note that I've omitted error handling for when num_items exceeds the range of the chosen OffsetT; the actual implementation should check for this).

// Create cub/detail/offset_dispatch.cuh and define this CUB_OFFSET_DISPATCH
// macro:

#if defined(CUB_NO_32BIT_OFFSETS) // Always 64-bit offsets

#define CUB_OFFSET_DISPATCH(impl32, impl64, num_items) impl64

#elif defined(CUB_NO_64BIT_OFFSETS) // Always 32-bit offsets

#define CUB_OFFSET_DISPATCH(impl32, impl64, num_items) impl32

#else // Default; Runtime check + select best offsets

#define CUB_OFFSET_DISPATCH(impl32, impl64, num_items)                         \
  do                                                                           \
  {                                                                            \
    if (num_items <= cub::NumericTraits<std::int32_t>::Max())                  \
    {                                                                          \
      impl32;                                                                  \
    }                                                                          \
    else                                                                       \
    {                                                                          \
      impl64;                                                                  \
    }                                                                          \
  } while (false)
#endif

// Use CUB_OFFSET_DISPATCH in the Device layer to pick a Dispatch
// implementation:
struct DeviceRadixSort
{
  template <typename KeyT>
  CUB_RUNTIME_FUNCTION static cudaError_t SortKeys(...,
                                                   std::size_t num_items,
                                                   ...)
  {
    using Dispatch32 = DispatchRadixSort<false, KeyT, NullType, std::int32_t>;
    using Dispatch64 = DispatchRadixSort<false, KeyT, NullType, std::int64_t>;

    CUB_OFFSET_DISPATCH(
      return Dispatch32::Dispatch(...,
                                  static_cast<std::int32_t>(num_items),
                                  ...),
      return Dispatch64::Dispatch(...,
                                  static_cast<std::int64_t>(num_items),
                                  ...),
      num_items);
  }
};

Other notes:

  • The Device layer functions should accept std::size_t num_items.
  • Internally we'll use OffsetT = std::int64_t or std::int32_t when instantiating the Dispatch layer templates.
    • We can revisit this later, but for now stick with signed types. The current implementations are written and tested using signed types, and we'd risk introducing regressions by switching to unsigned.
  • We should leave the Device layer functions that accept int num_items in place.
    • This way users won't have to work around signed-vs-unsigned warnings in code that currently passes int.
    • Once we have all CUB device algorithms ported to support std::size_t num_items, we can deprecate all of the int num_items overloads at once (a sketch of the forwarding overload follows this list).
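
A sketch of what keeping the int overload alongside the new std::size_t overload might look like (DeviceExample and the exact Sum signature are hypothetical stand-ins, not a committed CUB API):

#include <cstddef>
#include <cuda_runtime_api.h>

struct DeviceExample
{
  // New primary overload: accepts the full range of problem sizes.
  static cudaError_t Sum(void* d_temp_storage, std::size_t& temp_storage_bytes,
                         const float* d_in, float* d_out, std::size_t num_items)
  {
    // ... deduce OffsetT and call the Dispatch layer here ...
    (void)d_temp_storage; (void)temp_storage_bytes;
    (void)d_in; (void)d_out; (void)num_items;
    return cudaSuccess;
  }

  // Existing overload, kept so current callers passing int keep compiling
  // without new signed-vs-unsigned warnings; to be deprecated later.
  static cudaError_t Sum(void* d_temp_storage, std::size_t& temp_storage_bytes,
                         const float* d_in, float* d_out, int num_items)
  {
    return Sum(d_temp_storage, temp_storage_bytes, d_in, d_out,
               static_cast<std::size_t>(num_items));
  }
};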

@cliffburdick

I think other libraries were using int64_t or long long int rather than size_t, with the thought being that 2^63 was sufficiently large to handle all cases, and the sign bit could help with sentinels or error codes. If I'm understanding correctly, is the user going to be exposed to a size_t, but internally it will be cast to a signed 64-bit type? Can the exposed API just be signed 64?

@jrhemstad
Collaborator

It is certainly true that adding int64_t instantiations increases compile time, and that they come with a non-trivial performance penalty. In PyTorch land we are working around both of these problems by splitting inputs into chunks that can be processed with int32_t indexing.

To add more context, in RAPIDS cuDF our input sizes are also limited to int32_t (but this is not without a lot of consternation from users).

This requirement largely comes from the Apache Arrow spec, which historically limited the size of a column to an int32_t.

I'd be curious to look at how the various STL implementations handle this problem in their std:: algorithms. Maybe they just always use size_t and don't worry about it.

@jrhemstad
Collaborator

jrhemstad commented Sep 23, 2021

Couldn't the index type also be inferred from std::iterator_traits<It>::difference_type? That might annoy existing users passing in raw pointers, since that will default to 64-bit indices (ptrdiff_t). CUB could provide a simple wrapper to allow using a smaller difference_type if they know they want that.

Looking at something like std::for_each_n, the size parameter type is a template, so that allows configuration and avoids any double instantiation without any macro stuff.

It wouldn't work for existing code, but one option would be making all CUB algorithms be *_n algorithms where the size type is a template.

This is much further out, but we could also look at solving this problem with a ranges-like interface for CUB algorithms. The return type of ranges::size is deduced from the static properties of the viewed range, so users could select the CUB instantiation implicitly based on the size_type of their container/range.
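
To illustrate the difference_type idea and the *_n-style templated size parameter (hypothetical names, modeled on std::for_each_n):

#include <cstddef>
#include <iterator>

// For raw pointers this is std::ptrdiff_t, i.e. 64-bit indexing by default.
template <typename It>
using default_offset_t = typename std::iterator_traits<It>::difference_type;

// As with std::for_each_n, only the Size instantiations a caller actually
// uses get compiled -- no macros, no mandatory double instantiation.
template <typename It, typename Size, typename F>
It my_for_each_n(It first, Size n, F f)
{
  for (Size i = 0; i < n; ++i, ++first)
  {
    f(*first);
  }
  return first;
}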

@alliepiper
Collaborator Author

Can the exposed API just be signed 64?

I'm concerned about signed/unsigned warnings in the common case of using a container (e.g. the std::size_t returned by thrust::device_vector<T>::size()) to produce num_items. If we want to use the extra bit for sentinels, etc., that'd be an implementation detail that shouldn't be exposed in the public API.

Could this be a template parameter?

I haven't thought this through completely, but I'm not opposed to this idea. The int case for existing users would still work, and we can use runtime checks to ensure that the passed-in type can be converted to a known-good OffsetT, erroring out if needed.

Since the STL handles it this way I imagine it should be fairly robust. Maybe @brycelelbach @griwes or @ericniebler have thoughts on this.

One potential issue is that if a user calls the same function with different types for the size, they'll instantiate multiple instances of what is essentially the same function. Since pretty much everything in CUB is marked __forceinline__, this could be surprisingly problematic. But maybe this will just motivate us to remove some of those annotations...

Couldn't the index type also be inferred from the std::iterator_traits<It>::difference_type?

This would be correct, but I'd prefer the template over this, mainly for the simpler API :P The "64-bit ptrdiff_t" issue shouldn't be a problem, since we'd check at runtime whether 32- or 64-bit indices should be used in the actual implementation based on the value of num_items.

@alliepiper
Collaborator Author

Updating from an offline conversation: we may want to replace the macros with a per-call solution via a policy class. This would prevent surprises (like changing the behavior of Thrust/stdpar).

It would also open the door for additional customizations, like tuning parameter overrides, launch bounds, etc.

@alliepiper
Collaborator Author

alliepiper commented Sep 28, 2021

Another update from an offline conversation:

Currently, the preferred approach is to take the SizeT provided by the user (that is, the inferred type of num_items) and use it to statically deduce OffsetT, adjusting for signedness, etc., as required by the algorithm's implementation:

struct DeviceRadixSort
{
  template <..., typename SizeT, ...>
  CUB_RUNTIME_FUNCTION static
  cudaError_t SortKeys(..., SizeT num_items, ...)
  {
    static_assert(std::is_integral<SizeT>::value, "SizeT must be an integral type.");
    using OffsetT = /* Either int32 or int64 depending on whether or not int32 can represent the full range of SizeT */
    using Dispatch = DispatchRadixSort<..., OffsetT>;
    return Dispatch::Dispatch(..., static_cast<OffsetT>(num_items), ...);
  }
};
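
One way the placeholder OffsetT selection in the sketch above could be spelled (again just a sketch; per the later discussion, unsigned offset types may be chosen instead):

#include <cstddef>
#include <cstdint>
#include <type_traits>

// Pick int32 when it can represent every value of SizeT, otherwise int64.
template <typename SizeT>
using offset_for_size_t = typename std::conditional<
  (sizeof(SizeT) < sizeof(std::int32_t)) ||
    (sizeof(SizeT) == sizeof(std::int32_t) && std::is_signed<SizeT>::value),
  std::int32_t,
  std::int64_t>::type;

static_assert(std::is_same<offset_for_size_t<int>, std::int32_t>::value, "");
static_assert(std::is_same<offset_for_size_t<std::uint32_t>, std::int64_t>::value, "");
static_assert(std::is_same<offset_for_size_t<std::size_t>, std::int64_t>::value, "");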

This is a much simpler approach than I previously proposed, and provides the user with the most control and flexibility. For each call into a CUB algorithm, the user can choose the exact behavior they desire:

  • Don't care about peak performance and just want decent performance, ease of use, fast(ish) compiles, and reliability? Just pass std::size_t.
  • Don't care about 64-bit support and want maximum performance for your smallish datasets? Just pass an int.
  • Want peak performance under variable conditions and don't mind paying the extra compilation cost? Explicitly use runtime dispatch.

If we properly document this as a performance consideration, users should be able to get exactly what they want out of it.

Additional considerations:

  • Avoiding unnecessary template instantiations is important.
    • Compiling CUB code is slow. For instance, Thrust's test suite takes 2-3 minutes to build for the CPU threaded backends, but 20-60 minutes for the CUB backend (depending on arch flags).
    • Consumers of CUB often instantiate our algorithms for at least a dozen types in generic template code. According to @jrhemstad, RAPIDS needs to support 25 distinct types. Other libraries I've worked on stick with the 14 built-in numeric types, and face frequent user complaints about compile times and memory requirements when building the CUB-based backends.
  • Initial benchmarks show that the overhead of using 64-bit indices for inputs that could be indexed with int32 is 10% at most, and 1-3% in most cases.
    • This isn't great, but it isn't catastrophic. Users can write their own runtime dispatch if they're concerned about this.
    • @griwes points out that this will be a perf regression for anyone who currently passes vec.size() or similar to a CUB algorithm, and it will require them to update their code to resolve this.
      • I'm not too concerned about this case. Those users currently have buggy code if they're relying on an implicit narrowing conversion, and should update their usage anyway. Their code will be more robust after we make this change.
  • How many OffsetT should we support in the deduction?
    • We'll need to explicitly add tests for each type we support, so I think two should be plenty: uint32 and uint64.
    • We've had issues with some algorithm implementations breaking when OffsetT is unsigned. As we port these algorithms to have 64-bit support, we should make them work with unsigned offsets. Otherwise we'll have a weird perf cliff where SizeT = int32_t -> OffsetT = int32_t, but SizeT = uint32_t -> OffsetT = int64_t.
    • There's no point in using less than 4 byte offset types because the hardware will promote smaller offsets to 4B.
  • We need a way to determine the maximum value of a SizeT at compile-time from host/device code.
    • std::numeric_limits<T>::max() is not usable from device code.
    • cub::NumericTraits<T>::Max() is not constexpr. (We can change this -- this is likely the short-term fix.)
    • Eventually we'll be adding a dependency on libcu++, so we can use their numeric_limits implementation when that happens. This is the best long-term fix (see the sketch after this list).
  • Since the Device APIs may be instantiated for multiple SizeT that map to the same OffsetT, we should make sure that we aren't force-inlining the Dispatch layer -- otherwise we'll bloat the binaries with multiple copies of the same Dispatch instantiations.
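
For reference, once the libcu++ dependency lands, the long-term fix could be as simple as this (a sketch; requires nvcc and libcu++'s <cuda/std/limits>):

#include <cuda/std/limits>
#include <cstdint>

// Usable in both host and device code, and constexpr, unlike std::numeric_limits
// from <limits> in device code or the current cub::NumericTraits<T>::Max().
template <typename SizeT>
__host__ __device__ constexpr SizeT MaxOfSizeT()
{
  return cuda::std::numeric_limits<SizeT>::max();
}

static_assert(MaxOfSizeT<std::int32_t>() == 2147483647, "");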

@alliepiper
Collaborator Author

Updated the previous comment to incorporate additional feedback from @canonizer, @jrhemstad, and @dumerrill.

Changes:

  • Added static_assert(std::is_integral<SizeT>::value) to Device function body.
  • Use unsigned offsets instead of signed -- since we'll be porting these algorithms incrementally, we can afford to spend some time fixing any issues that arise from the change in signedness. This avoids surprises when SizeT is uint32 vs int32 (the former would be forced to use int64 signed offsets otherwise).
  • Added note that there's no point in optimizing for sizeof(OffsetT) < 4, since hardware will promote to 4B anyway.

@jrhemstad
Collaborator

Use unsigned offsets instead of signed -- since we'll be porting these algorithms incrementally, we can afford to spend some time fixing any issues that arise from the change in signedness. This avoids surprises when SizeT is uint32 vs int32 (the former would be forced to use int64 signed offsets otherwise).

I just happened to be re-reviewing this issue and noticed this.

We should be very careful about forcing things to unsigned. Using unsigned types for loops/indexing can impact performance: the compiler usually isn't able to unroll a loop that uses an unsigned loop variable, because it has to allow for the unsigned type to wrap around, whereas it can assume a signed type will not overflow.
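
A minimal illustration of the point (how much this matters in practice depends on the compiler and the loop):

// Signed induction variable: signed overflow is undefined behavior, so the
// compiler may assume `i += 4` never wraps and can reason precisely about the
// trip count, which helps unrolling and vectorization.
void scale_signed(float* data, int n)
{
  for (int i = 0; i < n; i += 4)
  {
    data[i] *= 2.0f;
  }
}

// Unsigned induction variable: `i += 4` is defined to wrap modulo 2^32, so the
// compiler has to account for that possibility, which can block the same
// transformations.
void scale_unsigned(float* data, unsigned int n)
{
  for (unsigned int i = 0; i < n; i += 4)
  {
    data[i] *= 2.0f;
  }
}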

@jrhemstad
Collaborator

How many OffsetT should we support in the deduction?
We'll need to explicitly add tests for each type we support, so I think two should be plenty: uint32 and uint64.

I'm revisiting this conversation after a while and I wanted to double check that our current thinking is to infer the OffsetT from the user-provided iterator and we'll always statically select uint32_t or uint64_t?

If that's the case, I understand the motivation of simplifying things to always use an unsigned type, but I am still concerned about the performance implications of always using unsigned like I mentioned in my earlier comment.

That said, I am also sympathetic to the extra testing burden that would come from having to test with 4 offset types instead of 2. Though do we really need to explicitly test all 4 offset types? I'd think any bugs that would show up from using a signed offset type would stem from overflow that CUB can't do anything about anyway, right?

@jrhemstad
Collaborator

I'm going to close this in favor of NVIDIA/cccl#47, where we will distill the relevant conclusions from the discussion here.
