Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEATURE]: Add multiset host-bulk count APIs #462

Closed
PointKernel opened this issue Apr 17, 2024 · 1 comment
Closed

[FEATURE]: Add multiset host-bulk count APIs #462

PointKernel opened this issue Apr 17, 2024 · 1 comment
Assignees
Labels
helps: rapids Helps or needed by RAPIDS topic: static_multiset Issue related to the static_multiset type: feature request New feature request

Comments

@PointKernel
Copy link
Member

PointKernel commented Apr 17, 2024

Is your feature request related to a problem? Please describe.

Add multiset host-bulk count APIs

Describe the solution you'd like

The basic API must have:

  /**
   * @brief Counts the occurrences of keys in `[first, last)` contained in the multiset.
   *
   * @tparam Input Device accessible input iterator
   * 
   * @param first Beginning of the sequence of keys to count
   * @param last End of the sequence of keys to count
   * @param stream CUDA stream used for count
   * 
   * @return The sum of total occurrences of all keys in `[first, last)`
   */
  template <typename InputIt>
  size_type count(InputIt first,
                  InputIt last,
                  cuda_stream_ref stream = {}) const;

Note: There is no restriction for the iterator value_type due to heterogeneous lookup support.

More specifically, libcudf requires the below count variants

  /**
   * @brief Counts the occurrences of keys in `[first, last)` contained in the multiset.
   *
   * @tparam Input Device accessible input iterator
   * @tparam ProbeKeyEqual Binary callable
   * @tparam ProbeHash Unary hash callable
   * 
   * @param first Beginning of the sequence of keys to count
   * @param last End of the sequence of keys to count
   * @param probe_key_equal Binary function to compare two keys for equality
   * @param probe_hash Unary callable to hash a given key
   * @param stream CUDA stream used for count
   * 
   * @return The sum of total occurrences of all keys in `[first, last)`
   */
  template <typename InputIt, typename ProbeKeyEqual, typename ProbeHash>
  size_type count(InputIt first,
                  InputIt last,
                  ProbeKeyEqual probe_key_equal,
                  ProbeHash probe_hash, 
                  cuda_stream_ref stream = {}) const;

  /**
   * @brief Counts the occurrences of keys in `[first, last)` contained in the multiset.
   * 
   * @note: If a given key has no matches, its occurrence is 1.
   * 
   * @tparam Input Device accessible input iterator
   * @tparam ProbeKeyEqual Binary callable
   * @tparam ProbeHash Unary hash callable
   * 
   * @param first Beginning of the sequence of keys to count
   * @param last End of the sequence of keys to count
   * @param probe_key_equal Binary function to compare two keys for equality
   * @param probe_hash Unary callable to hash a given key
   * @param stream CUDA stream used for count
   * 
   * @return The sum of total occurrences of all keys in `[first, last)` where keys have no matches
   * are considered to have a single occurrence.
   */
  template <typename InputIt, typename ProbeKeyEqual, typename ProbeHash>
  size_type count_outer(InputIt first,
                        InputIt last,
                        ProbeKeyEqual probe_key_equal,
                        ProbeHash probe_hash, 
                        cuda_stream_ref stream = {}) const;

Describe alternatives you've considered

Can the _outer variant be represented in a more natural expression?

Could we use cub::DeviceReduce?

Additional context

  • All count APIs are synchronous thus count_ async/count_outer_async don't make sense
  • Conditional count APIs like count_if and count_outer_if that takes probe keyequal/hash are also desired. Good to have but not necessary for the current scope.
  • Hashers must be default constructible and can be constructed from an integer value (required by double hashing)take
@PointKernel PointKernel added type: feature request New feature request helps: rapids Helps or needed by RAPIDS topic: static_multiset Issue related to the static_multiset labels Apr 17, 2024
@PointKernel PointKernel added this to the static_multiset milestone Apr 17, 2024
PointKernel added a commit that referenced this issue May 22, 2024
Contributes to #462

This PR adds a `multiset::count` host-bulk API.
PointKernel added a commit that referenced this issue May 28, 2024
Closes #462, closes #488

This PR adds custom `count` APIs desired by libcudf hash join.

---------

Co-authored-by: Daniel Jünger <2955913+sleeepyjack@users.noreply.github.com>
@PointKernel
Copy link
Member Author

Closed by #490

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
helps: rapids Helps or needed by RAPIDS topic: static_multiset Issue related to the static_multiset type: feature request New feature request
Projects
None yet
Development

No branches or pull requests

1 participant