
Conversation

@srinivasyadav18 (Contributor) commented Nov 18, 2025

Description

closes #6506

Things to consider / TODO:

  • Current constructors accept a memory_resource defaulted to the type cuda::device_memory_pool_ref instead of using the old allocator's type. However, the user is always expected to pass a memory_resource object. The code could break when only the template parameter is passed and not the object.
  • Should we add strong types sketch_kb and standard_deviation? (A hedged sketch follows this list.)
  • cuco uses sketch_kb, a strong type aliased to double, to represent the size of the sketch (the memory used for the data structure) in KB.
  • Similar to sketch_kb, there is also standard_deviation, a strong type aliased to double.
  • Add merge host APIs (blocked by proper usage of memory_resources)
  • Add more tests
  • Add examples
  • Add benchmark
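
A minimal sketch of what such strong types could look like, assuming the cuco convention of a double-backed wrapper; the class names, members, and the usage line are illustrative, not a final API:

// Hypothetical strong types mirroring cuco's sketch_size_kb / standard_deviation.
class sketch_kb
{
public:
  explicit constexpr sketch_kb(double __value) noexcept
      : __value_{__value}
  {}
  constexpr operator double() const noexcept
  {
    return __value_;
  }

private:
  double __value_; // sketch size in KB
};

class standard_deviation
{
public:
  explicit constexpr standard_deviation(double __value) noexcept
      : __value_{__value}
  {}
  constexpr operator double() const noexcept
  {
    return __value_;
  }

private:
  double __value_; // target relative standard deviation of the estimate
};

// Hypothetical usage: an estimator configured from a sketch size in KB.
// distinct_count_estimator<int> __est{sketch_kb{32.0}, __stream, __mr};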

@copy-pr-bot (bot) commented Nov 18, 2025

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@cccl-authenticator-app cccl-authenticator-app bot moved this from Todo to In Progress in CCCL Nov 18, 2025
@srinivasyadav18 (Contributor, Author)

pre-commit.ci autofix

@copy-pr-bot (bot) commented Nov 20, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@srinivasyadav18 (Contributor, Author)

/ok to test 6862c49

@srinivasyadav18 srinivasyadav18 marked this pull request as ready for review November 20, 2025 03:20
@srinivasyadav18 srinivasyadav18 requested a review from a team as a code owner November 20, 2025 03:20
@cccl-authenticator-app cccl-authenticator-app bot moved this from In Progress to In Review in CCCL Nov 20, 2025

@srinivasyadav18 (Contributor, Author)

/ok to test 2fe8809


@fbusato (Contributor) left a comment

Just started with one file. Please propagate the suggestions and refine the implementation. After that, I will review the other files.

_CCCL_API constexpr _Finalizer(int __precision_)
: __precision{__precision_}
, __m{1 << __precision_}
{}
Contributor:

add an assertion to check the precision range

Contributor:

the assertion is still missing
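
A minimal sketch of the requested assertion, assuming a valid precision range of [4, 18] (4 matches the lower-bound check elsewhere in this PR; 18 is an assumed upper bound borrowed from Spark's HLL++):

_CCCL_API constexpr _Finalizer(int __precision_)
    : __precision{__precision_}
    , __m{1 << __precision_}
{
  // assumed range; the exact upper bound is a guess, not a confirmed API limit
  _CCCL_ASSERT(__precision_ >= 4 && __precision_ <= 18, "precision out of range");
}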

@github-project-automation github-project-automation bot moved this from In Review to In Progress in CCCL Nov 20, 2025
- uses cuda::contiguous_iterator instead of thrust::*
- uses `_CCCL_TRY_CUDA_API` instead of depending on `stf::cuda_safe_call`
- adds missing include thrust/raw_pointer_cast.h

@srinivasyadav18 (Contributor, Author)

/ok to test 0e46dcd


@sleeepyjack (Contributor) commented Jan 23, 2026

FYI there's a bugfix PR in cuco to match Spark's HLL behavior. We should also integrate it into this PR. NVIDIA/cuCollections#792

@srinivasyadav18 (Contributor, Author)

/ok to test 76c8417


@srinivasyadav18 (Contributor, Author)

/ok to test 5153b0b

//! @param hash The hash function used to hash items
_CCCL_API constexpr _HyperLogLog_Impl(::cuda::std::span<::cuda::std::byte> sketch_span, const _Hash& hash)
: __hash{hash}
, __precision{::cuda::std::countr_zero(
Contributor:

I don't see the assertion

//! @brief Adds an item to the estimator.
//!
//! @param item The item to be counted
_CCCL_DEVICE constexpr void __add(const _Tp& __item) noexcept
Contributor:

I still see a mix of them


Comment on lines +609 to +610
int __device = -1;
_CCCL_TRY_CUDA_API(::cudaGetDevice, "cudaGetDevice failed", &__device);
Contributor:

@davebayer I believe there are some new cudart functions for that?

Comment on lines 33 to 35
#ifndef CUDAX_CUCO_HLL_TUNING_ARR_DECL
# define CUDAX_CUCO_HLL_TUNING_ARR_DECL __device__ static constexpr ::cuda::std::array
#endif
Contributor:

I believe this should just be an `_CCCL_DEVICE static inline constexpr float __meow[] = {...}`

Note that AFAIK this will fail for Windows, because it treats floating-point constants differently, so there it would need to be _CCCL_GLOBAL_CONSTANT

See https://godbolt.org/z/5z9c63zTM
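
A sketch of the two suggested forms, with the array contents elided and __tuning_arr as a placeholder name:

// non-Windows: device-side constant array, as suggested above
_CCCL_DEVICE static inline constexpr float __tuning_arr[] = {/* ... */};

// Windows (MSVC treats floating-point constants differently), per the note above:
// _CCCL_GLOBAL_CONSTANT float __tuning_arr[] = {/* ... */};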

@srinivasyadav18 (Contributor, Author)

/ok to test 72b589f


@srinivasyadav18 (Contributor, Author)

/ok to test 507d169

@github-actions (bot)

😬 CI Workflow Results

🟥 Finished in 15m 34s: Pass: 92%/39 | Total: 2h 34m | Max: 14m 05s | Hits: 99%/18089

See results here.

_CCCL_API constexpr _Finalizer(int __precision_)
: __precision{__precision_}
, __m{1 << __precision_}
{}
Contributor:

the assertion is still missing

//! @tparam _Scope The scope in which operations will be performed by individual threads
//! @tparam _Hash Hash function used to hash items
template <class _Tp, ::cuda::thread_scope _Scope, class _Hash>
class _HyperLogLog_Impl
Contributor:

what is the decision on the class name?

_CCCL_THROW(::std::invalid_argument{"Sketch storage has insufficient alignment"});
}

if (__precision < 4)
Contributor:

we also need to check the upper bound, at least internally with _CCCL_ASSERT
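
A sketch of the combined check, reusing the throw style from the snippet above and assuming 18 as the upper bound (borrowed from Spark's HLL++, not confirmed by this PR):

if (__precision < 4 || __precision > 18)
{
  _CCCL_THROW(::std::invalid_argument{"Sketch precision out of supported range"});
}
// internal invariant on top of the host-facing check, as suggested above
_CCCL_ASSERT(__precision >= 4 && __precision <= 18, "precision out of range");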

#include <cuda/experimental/__cuco/hash_functions.cuh>
#include <cuda/experimental/memory_resource.cuh>

#include <cooperative_groups.h>
Contributor:

we need a final decision for CG

// https://github.com/apache/spark/blob/6a27789ad7d59cd133653a49be0bb49729542abe/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/HyperLogLogPlusPlusHelper.scala#L43

auto const __precision_from_sd = static_cast<int>(
::cuda::std::ceil(2.0 * ::cuda::std::log(1.106 / __standard_deviation) / ::cuda::std::numbers::ln2));
Contributor:

is it not equivalent to log2(x)?
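
It is mathematically equivalent: log(x) / ln 2 = log2(x). A sketch of the simplification, assuming ::cuda::std::log2 from <cuda/std/cmath>:

auto const __precision_from_sd =
  static_cast<int>(::cuda::std::ceil(2.0 * ::cuda::std::log2(1.106 / __standard_deviation)));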

__idx += __loop_stride;
}
// a single thread processes the remaining items
#if defined(CUDART_VERSION) && (CUDART_VERSION >= 12010)
Contributor:

could we use _CCCL_CTK macro here?
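
A sketch of that substitution, assuming the intended macro is CCCL's _CCCL_CTK_AT_LEAST version check (the exact macro name is an assumption):

#if _CCCL_CTK_AT_LEAST(12, 1) // replaces: defined(CUDART_VERSION) && (CUDART_VERSION >= 12010)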

{
if (__other.__precision != __precision)
{
_CCCL_THROW(::std::invalid_argument{"Cannot merge estimators with different sketch sizes"});
Contributor:

we recently changed this to

Suggested change
_CCCL_THROW(::std::invalid_argument{"Cannot merge estimators with different sketch sizes"});
_CCCL_THROW(::std::invalid_argument, "Cannot merge estimators with different sketch sizes");

static constexpr auto __thread_scope = _Scope; ///< CUDA thread scope

template <::cuda::thread_scope _NewScope>
using with_scope = _HyperLogLog_Impl<_Tp, _NewScope, _Hash>; ///< Ref type with different thread scope
Contributor:

is it internal or external?

Contributor:

It's public facing. The idea is to have an easy way for the user to switch to a different thread scope, e.g., when migrating an HLL sketch from global to shared memory. In this case, getting the new type is as simple as using block_scope_type = device_scope_type::with_scope<cuda::thread_scope_block>.
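
A short sketch of that rebinding, where SomeHash stands in for a concrete hasher from hash_functions.cuh:

// device-scope estimator type, rebound to block scope for shared-memory use
using device_scope_type = _HyperLogLog_Impl<int, ::cuda::thread_scope_device, SomeHash>;
using block_scope_type  = device_scope_type::with_scope<::cuda::thread_scope_block>;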

friend struct _HyperLogLog_Impl;

public:
static constexpr auto __thread_scope = _Scope; ///< CUDA thread scope
Contributor:

this is probably private


Labels: None yet
Projects: CCCL (Status: In Progress)
Development: Successfully merging this pull request may close these issues: Migrate cuco HyperLogLog to CCCL
5 participants