[FEATURE]: Add host-bulk `insert_and_find` #477

PointKernel · 2024-05-07T02:05:00Z

Is your feature request related to a problem? Please describe.

Add host-bulk insert_and_find function for static_set and static_map

Describe the solution you'd like

/**
  * @brief Asynchronously inserts all elements in the range `[first, last)`.
  *
  * @note: For a given element `*(first + i)`, if the container doesn't already contain an element
  * with an equivalent key, inserts the element at a location pointed by `iter` and writes
  * `thrust::pair{iter, true}` to `output_begin + i`. Otherwise, finds the location of the
  * equivalent element, `iter` and writes `thrust::pair{iter, false}` to `output_begin + i`.
  * 
  * @tparam InputIt Device accessible random access input iterator
  * @tparam OutputIt Device accessible random access output iterator whose `value_type`
  * is constructible from `thrust::pair<map::iterator, bool>`
  *
  * @param first Beginning of the sequence of elements
  * @param last End of the sequence of elements
  * @param stream CUDA stream used for insert
  */
template<typename InputIt, typename OutputIt>
void insert_and_find_async (InputIt first, InputIt last, OutputIt output_begin, cuda_stream_ref stream = {});

Describe alternatives you've considered

/**
  * @brief Asynchronously inserts all elements in the range `[first, last)`.
  *
  * @note: For a given element `*(first + i)`, if the container doesn't already contain an element
  * with an equivalent key, inserts the element at a location pointed by `iter` and writes
  * `iter` to `location_begin + i` and writes `true` to `inserted_begin + i`. Otherwise, finds the location of the
  * equivalent element, `iter` and writes `iter` to `location_begin + i` and writes `false` to `inserted_begin + i`.
  * 
  * @tparam InputIt Device accessible random access input iterator
  * @tparam LocationIt Device accessible random access output iterator whose `value_type`
  * is constructible from `map::iterator` type
  * @tparam Boolt Device accessible random access output iterator whose `value_type`
  * is constructible from `bool`
  *
  * @param first Beginning of the sequence of elements
  * @param last End of the sequence of elements
  * @param stream CUDA stream used for insert
  */
template<typename InputIt, typename LocationIt, typename Boolt>
void insert_and_find_async (InputIt first, InputIt last, LocationIt location_begin, Boolt inserted_begin, cuda_stream_ref stream = {});

We can also take the location iterator and insert-result iterator separately since users might have to use zip iterator in the above design.

The text was updated successfully, but these errors were encountered:

PointKernel · 2024-05-07T02:08:32Z

cc @esoha-nvidia

esoha-nvidia · 2024-05-07T15:29:52Z

@PointKernel For correctness, can you fix the comment of the alternative described?

/**
  * @brief Asynchronously inserts all elements in the range `[first, last)`.
  *
  * @note: For a given element `*(first + i)`, if the container doesn't already contain an element
  * with an equivalent key, inserts the element at a location pointed by `iter` and writes
  * `iter` to `location_begin + i` and writes `true` to `inserted_begin + i`. Otherwise, finds the location of the
  * equivalent element, `iter` and writes `iter` to `location_begin + i` and writes `false` to `inserted_begin + i`.
  * 
  * @tparam InputIt Device accessible random access input iterator
  * @tparam LocationIt Device accessible random access output iterator whose `value_type`
  * is constructible from `map::iterator` type
  * @tparam Boolt Device accessible random access output iterator whose `value_type`
  * is constructible from `bool`
  *
  * @param first Beginning of the sequence of elements
  * @param last End of the sequence of elements
  * @param stream CUDA stream used for insert
  */
template<typename InputIt, typename LocationIt, typename Boolt>
void insert_and_find_async (InputIt first, InputIt last, LocationIt location_begin, Boolt inserted_begin, cuda_stream_ref stream = {});

esoha-nvidia · 2024-05-07T16:02:48Z

I prefer the alternative over the version with pairs because it will run faster. The reason is that the steps directly after this is going to be:

The user doing a dereference of all the location_begin.
A thrust::copy_if using inserted_begin.

The user will surely want to do these. If we return two arrays, like in the alternative, then the pointer to slots and the booleans will have a nice access pattern. But if we return tuples then those kernels will have a poor access pattern. This is similar to the issue of column-major versus row-major, aka AoS versus SoA.

In my testing, the fastest was to just fold those kernels directly into the insert_and_find because what else am I going to do with all these iterators and booleans! Of course I'm going to dereference and copy_if. We can do the copy_if as part of the insert_and_find using atomics to store just the true elements. I understand that this is not a convenient paradigm for the hash table, though. In one of my examples, though, I've got 300 million rows, all of which are false and only 4 true. I would not like to write false to memory nearly 300 million times just to throw all that effort away!

How about this?

/**
  * @brief Asynchronously inserts all elements in the range `[first, last)`.
  *
  * @note: For a given element `*(first + i)`, if the container doesn't already contain an element
  * with an equivalent key, inserts the element at a location pointed by `iter` and writes
  * `iter` to `indices_inserted_begin + x` where `x` is the next unwritten row of those arrays.  Also writes `x` to 
  * `reverse_lookup_begin + i`.   Also writes `iter` to `locations_begin + i`.
  * 
  * If the container already contains the element, found at `iter`, then `iter` is written to `locations_begin + i`.
  * 
  * The total number of elements written into `indices_inserted_begin` will never more than `std::distance(first, last)`.
  * It will be written to the GPU memory pointed to by `total_inserts`.
  * 
  * @param first Beginning of the sequence of elements
  * @param last End of the sequence of elements
  * @param stream CUDA stream used for insert
  */
template<typename InputIt, typename IndicesInsertedBegin, typename LocationIt>
void insert_and_find_async (InputIt first, InputIt last, IndicesInsertedBeginIt indices_inserted_begin, size_t* reverse_lookup_begin, LocationIt location_begin, size_t* total_inserts, cuda_stream_ref stream = {});

This is a useful function for groupby operations in databases. The indices_inserted_begin and total_inserts is naturally done by use atomicAdd with a combination of shared memory and global memory for speed.

PointKernel · 2024-05-07T16:56:10Z

I prefer the alternative solution also but struggling so much to get better names for the output

srinivasyadav18 · 2024-06-07T17:22:45Z

Hi @PointKernel. I would like to work on this issue.

PointKernel added type: feature request New feature request topic: static_map Issue related to the static_map P2: Nice to have Desired, but not necessary topic: static_set Issue related to the static_set labels May 7, 2024

PointKernel mentioned this issue May 7, 2024

[FEA] Add an overload for insert that returns the inserted (or existing) value #111

Closed

PointKernel added the good first issue Good for newcomers label May 9, 2024

PointKernel assigned srinivasyadav18 Jun 7, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[FEATURE]: Add host-bulk `insert_and_find` #477

[FEATURE]: Add host-bulk `insert_and_find` #477

PointKernel commented May 7, 2024 •

edited

PointKernel commented May 7, 2024

esoha-nvidia commented May 7, 2024

esoha-nvidia commented May 7, 2024

PointKernel commented May 7, 2024

srinivasyadav18 commented Jun 7, 2024

[FEATURE]: Add host-bulk insert_and_find #477

[FEATURE]: Add host-bulk insert_and_find #477

Comments

PointKernel commented May 7, 2024 • edited

Is your feature request related to a problem? Please describe.

Describe the solution you'd like

Describe alternatives you've considered

PointKernel commented May 7, 2024

esoha-nvidia commented May 7, 2024

esoha-nvidia commented May 7, 2024

PointKernel commented May 7, 2024

srinivasyadav18 commented Jun 7, 2024

[FEATURE]: Add host-bulk `insert_and_find` #477

[FEATURE]: Add host-bulk `insert_and_find` #477

PointKernel commented May 7, 2024 •

edited