Migrate cuco HLL #6666
Conversation
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.
pre-commit.ci autofix

/ok to test 6862c49
/ok to test 2fe8809
fbusato left a comment:

Just started with one file. Please propagate the suggestions and refine the implementation. After that, I will review the other files.
Resolved review threads (outdated):
- cudax/include/cuda/experimental/__cuco/detail/hyperloglog/finalizer.cuh (×3)
- cudax/include/cuda/experimental/__cuco/__hyperloglog/finalizer.cuh
Reviewed code:

```
_CCCL_API constexpr _Finalizer(int __precision_)
    : __precision{__precision_}
    , __m{1 << __precision_}
{}
```

add an assertion to check the precision range

Follow-up: the assertion is still missing
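For context, a minimal host-compilable sketch of what the requested range check could look like, using a plain `assert` in place of CCCL's `_CCCL_ASSERT`. The lower bound of 4 matches a check that appears elsewhere in this PR; the upper bound of 18 is an illustrative assumption, not taken from the PR.

```cpp
#include <cassert>

// Illustrative stand-in for the _Finalizer constructor discussed above.
// A plain assert replaces _CCCL_ASSERT; the 4..18 range is assumed.
struct Finalizer
{
  constexpr Finalizer(int precision)
      : precision_{precision}
      , m_{1 << precision}
  {
    assert(precision >= 4 && precision <= 18 && "precision out of range");
  }

  int precision_;
  int m_; // number of HLL registers: 2^precision
};
```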
Five further resolved review threads (outdated) on cudax/include/cuda/experimental/__cuco/detail/hyperloglog/finalizer.cuh
- uses cuda::contiguous_iterator instead of thrust::*
- uses `_CCCL_TRY_CUDA_API` instead of depending on `stf::cuda_safe_call`
- adds missing include thrust/raw_pointer_cast.h
Resolved review threads (outdated):
- cudax/include/cuda/experimental/__cuco/__hyperloglog/hyperloglog_impl.cuh (×6)
- cudax/include/cuda/experimental/__cuco/__hyperloglog/tuning.cuh (×2)
- cudax/include/cuda/experimental/__cuco/__hyperloglog/kernels.cuh (×2)
/ok to test 0e46dcd
Resolved review threads (outdated):
- cudax/include/cuda/experimental/__cuco/__hyperloglog/kernels.cuh (×3)
- cudax/include/cuda/experimental/__cuco/__hyperloglog/hyperloglog_impl.cuh (×4)
Resolved review thread (outdated): cudax/include/cuda/experimental/__cuco/__utility/strong_type.cuh
FYI there's a bugfix PR in cuco to match Spark's HLL behavior. We should also integrate it into this PR: NVIDIA/cuCollections#792
/ok to test 76c8417

/ok to test 5153b0b
Resolved review threads (outdated):
- cudax/include/cuda/experimental/__cuco/__hyperloglog/finalizer.cuh (×2)
- cudax/include/cuda/experimental/__cuco/__hyperloglog/hyperloglog_impl.cuh (×2)
Reviewed code:

```
//! @param hash The hash function used to hash items
_CCCL_API constexpr _HyperLogLog_Impl(::cuda::std::span<::cuda::std::byte> sketch_span, const _Hash& hash)
    : __hash{hash}
    , __precision{::cuda::std::countr_zero(
```

I don't see the assertion
Reviewed code:

```
//! @brief Adds an item to the estimator.
//!
//! @param item The item to be counted
_CCCL_DEVICE constexpr void __add(const _Tp& __item) noexcept
```

I still see a mix of them
Resolved review threads (outdated):
- cudax/include/cuda/experimental/__cuco/__hyperloglog/hyperloglog_impl.cuh
- cudax/include/cuda/experimental/__cuco/__hyperloglog/kernels.cuh (×2)
Resolved review threads on cudax/include/cuda/experimental/__cuco/__hyperloglog/hyperloglog_impl.cuh (×6, five outdated)
Reviewed code:

```
int __device = -1;
_CCCL_TRY_CUDA_API(::cudaGetDevice, "cudaGetDevice failed", &__device);
```

@davebayer I believe there are some new cudart functions for that?
Reviewed code:

```
#ifndef CUDAX_CUCO_HLL_TUNING_ARR_DECL
#  define CUDAX_CUCO_HLL_TUNING_ARR_DECL __device__ static constexpr ::cuda::std::array
#endif
```

I believe this should just be a `_CCCL_DEVICE static inline constexpr float __meow[] = {...}`.

Note that AFAIK this will fail for Windows, because it treats floating-point constants differently, so there it would need to be `_CCCL_GLOBAL_CONSTANT`.
/ok to test 72b589f

/ok to test 507d169

😬 CI Workflow Results
🟥 Finished in 15m 34s: Pass: 92%/39 | Total: 2h 34m | Max: 14m 05s | Hits: 99%/18089
Reviewed code:

```
_CCCL_API constexpr _Finalizer(int __precision_)
    : __precision{__precision_}
    , __m{1 << __precision_}
{}
```

the assertion is still missing
Reviewed code:

```
//! @tparam _Scope The scope in which operations will be performed by individual threads
//! @tparam _Hash Hash function used to hash items
template <class _Tp, ::cuda::thread_scope _Scope, class _Hash>
class _HyperLogLog_Impl
```

what is the decision on the class name?
Reviewed code:

```
  _CCCL_THROW(::std::invalid_argument{"Sketch storage has insufficient alignment"});
}

if (__precision < 4)
```

we also need to check the upper bound, at least internally with _CCCL_ASSERT
Reviewed code:

```
#include <cuda/experimental/__cuco/hash_functions.cuh>
#include <cuda/experimental/memory_resource.cuh>

#include <cooperative_groups.h>
```

we need a final decision for CG
Reviewed code:

```
// https://github.com/apache/spark/blob/6a27789ad7d59cd133653a49be0bb49729542abe/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/HyperLogLogPlusPlusHelper.scala#L43

auto const __precision_from_sd = static_cast<int>(
  ::cuda::std::ceil(2.0 * ::cuda::std::log(1.106 / __standard_deviation) / ::cuda::std::numbers::ln2));
```

is it not equivalent to log2(x)?
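For context (not part of the PR): since log(x)/ln 2 = log2(x), the quoted expression can indeed be rewritten with log2. A quick standalone check using `std::` instead of `::cuda::std::`; the standard-deviation values are illustrative.

```cpp
#include <cmath>

// The expression as written in the PR, with std:: substituted for cuda::std::.
int precision_from_sd(double sd)
{
  return static_cast<int>(std::ceil(2.0 * std::log(1.106 / sd) / std::log(2.0)));
}

// The log2 form the reviewer suggests: 2*log(x)/ln2 == 2*log2(x).
int precision_from_sd_log2(double sd)
{
  return static_cast<int>(std::ceil(2.0 * std::log2(1.106 / sd)));
}
```

Both forms agree for typical inputs, e.g. sd = 0.05 gives precision 9 either way.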
Reviewed code:

```
  __idx += __loop_stride;
}
// a single thread processes the remaining items
#if defined(CUDART_VERSION) && (CUDART_VERSION >= 12010)
```

could we use the _CCCL_CTK macro here?
Reviewed code:

```
{
  if (__other.__precision != __precision)
  {
    _CCCL_THROW(::std::invalid_argument{"Cannot merge estimators with different sketch sizes"});
```

we recently changed this to:

```diff
- _CCCL_THROW(::std::invalid_argument{"Cannot merge estimators with different sketch sizes"});
+ _CCCL_THROW(::std::invalid_argument, "Cannot merge estimators with different sketch sizes");
```
Reviewed code:

```
static constexpr auto __thread_scope = _Scope; ///< CUDA thread scope

template <::cuda::thread_scope _NewScope>
using with_scope = _HyperLogLog_Impl<_Tp, _NewScope, _Hash>; ///< Ref type with different thread scope
```

is it internal or external?

Reply: It's public-facing. The idea is to have an easy way for the user to switch to a different thread scope, e.g., when migrating an HLL sketch from global to shared memory. In this case, getting the new type is as simple as `using block_scope_type = device_scope_type::with_scope<cuda::thread_scope_block>;`.
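The rebinding pattern described in the reply can be sketched with a simplified, host-compilable stand-in for the CUDA thread-scope enum and the HLL class (the names below are illustrative, not the actual cudax types):

```cpp
#include <type_traits>

// Simplified stand-ins, just to illustrate the with_scope rebinding pattern.
enum class thread_scope { device, block, thread };

template <class T, thread_scope Scope>
struct HyperLogLogImpl
{
  static constexpr auto scope = Scope;

  // Rebind this estimator type to a different thread scope.
  template <thread_scope NewScope>
  using with_scope = HyperLogLogImpl<T, NewScope>;
};

using device_scope_type = HyperLogLogImpl<int, thread_scope::device>;
using block_scope_type  = device_scope_type::with_scope<thread_scope::block>;

static_assert(std::is_same_v<block_scope_type,
                             HyperLogLogImpl<int, thread_scope::block>>);
```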
Reviewed code:

```
friend struct _HyperLogLog_Impl;

public:
  static constexpr auto __thread_scope = _Scope; ///< CUDA thread scope
```

this is probably private
Description
closes #6506
Things to consider / TODO:
- `memory_resource` defaults to type `cuda::device_memory_pool_ref` instead of using the old allocator's. However, the user is always expected to pass a `memory_resource` object. The code could break when only the template parameter is passed and not the object.
- `sketch_kb` and `standard_deviation`: `sketch_kb` is a strong type aliased to `double` representing the size of the sketch (memory used for the data structure) in KB. Besides `sketch_kb`, there is also `standard_deviation`, a strong type aliased to `double`.