Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEATURE]: Add host-bulk insert_and_find #477

Open
PointKernel opened this issue May 7, 2024 · 5 comments
Open

[FEATURE]: Add host-bulk insert_and_find #477

PointKernel opened this issue May 7, 2024 · 5 comments
Assignees
Labels
good first issue Good for newcomers P2: Nice to have Desired, but not necessary topic: static_map Issue related to the static_map topic: static_set Issue related to the static_set type: feature request New feature request

Comments

@PointKernel
Copy link
Member

PointKernel commented May 7, 2024

Is your feature request related to a problem? Please describe.

Add host-bulk insert_and_find function for static_set and static_map

Describe the solution you'd like

/**
  * @brief Asynchronously inserts all elements in the range `[first, last)`.
  *
  * @note: For a given element `*(first + i)`, if the container doesn't already contain an element
  * with an equivalent key, inserts the element at a location pointed by `iter` and writes
  * `thrust::pair{iter, true}` to `output_begin + i`. Otherwise, finds the location of the
  * equivalent element, `iter` and writes `thrust::pair{iter, false}` to `output_begin + i`.
  * 
  * @tparam InputIt Device accessible random access input iterator
  * @tparam OutputIt Device accessible random access output iterator whose `value_type`
  * is constructible from `thrust::pair<map::iterator, bool>`
  *
  * @param first Beginning of the sequence of elements
  * @param last End of the sequence of elements
  * @param stream CUDA stream used for insert
  */
template<typename InputIt, typename OutputIt>
void insert_and_find_async (InputIt first, InputIt last, OutputIt output_begin, cuda_stream_ref stream = {});

Describe alternatives you've considered

/**
  * @brief Asynchronously inserts all elements in the range `[first, last)`.
  *
  * @note: For a given element `*(first + i)`, if the container doesn't already contain an element
  * with an equivalent key, inserts the element at a location pointed by `iter` and writes
  * `iter` to `location_begin + i` and writes `true` to `inserted_begin + i`. Otherwise, finds the location of the
  * equivalent element, `iter` and writes `iter` to `location_begin + i` and writes `false` to `inserted_begin + i`.
  * 
  * @tparam InputIt Device accessible random access input iterator
  * @tparam LocationIt Device accessible random access output iterator whose `value_type`
  * is constructible from `map::iterator` type
  * @tparam Boolt Device accessible random access output iterator whose `value_type`
  * is constructible from `bool`
  *
  * @param first Beginning of the sequence of elements
  * @param last End of the sequence of elements
  * @param stream CUDA stream used for insert
  */
template<typename InputIt, typename LocationIt, typename Boolt>
void insert_and_find_async (InputIt first, InputIt last, LocationIt location_begin, Boolt inserted_begin, cuda_stream_ref stream = {});

We can also take the location iterator and insert-result iterator separately since users might have to use zip iterator in the above design.

@PointKernel PointKernel added type: feature request New feature request topic: static_map Issue related to the static_map P2: Nice to have Desired, but not necessary topic: static_set Issue related to the static_set labels May 7, 2024
@PointKernel
Copy link
Member Author

cc @esoha-nvidia

@esoha-nvidia
Copy link

@PointKernel For correctness, can you fix the comment of the alternative described?

/**
  * @brief Asynchronously inserts all elements in the range `[first, last)`.
  *
  * @note: For a given element `*(first + i)`, if the container doesn't already contain an element
  * with an equivalent key, inserts the element at a location pointed by `iter` and writes
  * `iter` to `location_begin + i` and writes `true` to `inserted_begin + i`. Otherwise, finds the location of the
  * equivalent element, `iter` and writes `iter` to `location_begin + i` and writes `false` to `inserted_begin + i`.
  * 
  * @tparam InputIt Device accessible random access input iterator
  * @tparam LocationIt Device accessible random access output iterator whose `value_type`
  * is constructible from `map::iterator` type
  * @tparam Boolt Device accessible random access output iterator whose `value_type`
  * is constructible from `bool`
  *
  * @param first Beginning of the sequence of elements
  * @param last End of the sequence of elements
  * @param stream CUDA stream used for insert
  */
template<typename InputIt, typename LocationIt, typename Boolt>
void insert_and_find_async (InputIt first, InputIt last, LocationIt location_begin, Boolt inserted_begin, cuda_stream_ref stream = {});

@esoha-nvidia
Copy link

I prefer the alternative over the version with pairs because it will run faster. The reason is that the steps directly after this is going to be:

  • The user doing a dereference of all the location_begin.
  • A thrust::copy_if using inserted_begin.

The user will surely want to do these. If we return two arrays, like in the alternative, then the pointer to slots and the booleans will have a nice access pattern. But if we return tuples then those kernels will have a poor access pattern. This is similar to the issue of column-major versus row-major, aka AoS versus SoA.

In my testing, the fastest was to just fold those kernels directly into the insert_and_find because what else am I going to do with all these iterators and booleans! Of course I'm going to dereference and copy_if. We can do the copy_if as part of the insert_and_find using atomics to store just the true elements. I understand that this is not a convenient paradigm for the hash table, though. In one of my examples, though, I've got 300 million rows, all of which are false and only 4 true. I would not like to write false to memory nearly 300 million times just to throw all that effort away!

How about this?

/**
  * @brief Asynchronously inserts all elements in the range `[first, last)`.
  *
  * @note: For a given element `*(first + i)`, if the container doesn't already contain an element
  * with an equivalent key, inserts the element at a location pointed by `iter` and writes
  * `iter` to `indices_inserted_begin + x` where `x` is the next unwritten row of those arrays.  Also writes `x` to 
  * `reverse_lookup_begin + i`.   Also writes `iter` to `locations_begin + i`.
  * 
  * If the container already contains the element, found at `iter`, then `iter` is written to `locations_begin + i`.
  * 
  * The total number of elements written into `indices_inserted_begin` will never more than `std::distance(first, last)`.
  * It will be written to the GPU memory pointed to by `total_inserts`.
  * 
  * @param first Beginning of the sequence of elements
  * @param last End of the sequence of elements
  * @param stream CUDA stream used for insert
  */
template<typename InputIt, typename IndicesInsertedBegin, typename LocationIt>
void insert_and_find_async (InputIt first, InputIt last, IndicesInsertedBeginIt indices_inserted_begin, size_t* reverse_lookup_begin, LocationIt location_begin, size_t* total_inserts, cuda_stream_ref stream = {});

This is a useful function for groupby operations in databases. The indices_inserted_begin and total_inserts is naturally done by use atomicAdd with a combination of shared memory and global memory for speed.

@PointKernel
Copy link
Member Author

I prefer the alternative solution also but struggling so much to get better names for the output

@PointKernel PointKernel added the good first issue Good for newcomers label May 9, 2024
@srinivasyadav18
Copy link
Contributor

Hi @PointKernel. I would like to work on this issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
good first issue Good for newcomers P2: Nice to have Desired, but not necessary topic: static_map Issue related to the static_map topic: static_set Issue related to the static_set type: feature request New feature request
Projects
None yet
Development

No branches or pull requests

3 participants