[FEATURE]: Add host-bulk insert_and_find
#477
Comments
@PointKernel For correctness, can you fix the comment of the alternative described?

```cpp
/**
 * @brief Asynchronously inserts all elements in the range `[first, last)`.
 *
 * @note For a given element `*(first + i)`, if the container doesn't already contain an
 * element with an equivalent key, inserts the element at a location pointed to by `iter`,
 * writes `iter` to `location_begin + i`, and writes `true` to `inserted_begin + i`.
 * Otherwise, finds the location `iter` of the equivalent element, writes `iter` to
 * `location_begin + i`, and writes `false` to `inserted_begin + i`.
 *
 * @tparam InputIt Device accessible random access input iterator
 * @tparam LocationIt Device accessible random access output iterator whose `value_type`
 * is constructible from the `map::iterator` type
 * @tparam BoolIt Device accessible random access output iterator whose `value_type`
 * is constructible from `bool`
 *
 * @param first Beginning of the sequence of elements
 * @param last End of the sequence of elements
 * @param location_begin Beginning of the output sequence of element locations
 * @param inserted_begin Beginning of the output sequence of insertion results
 * @param stream CUDA stream used for insert
 */
template <typename InputIt, typename LocationIt, typename BoolIt>
void insert_and_find_async(InputIt first,
                           InputIt last,
                           LocationIt location_begin,
                           BoolIt inserted_begin,
                           cuda_stream_ref stream = {});
```
I prefer the alternative over the version with pairs because it will run faster. The reason is that the steps directly after this are going to be:

The user will surely want to do these. If we return two arrays, as in the alternative, then the pointers to slots and the booleans will have a nice access pattern. But if we return tuples, then those kernels will have a poor access pattern. This is similar to the issue of column-major versus row-major, a.k.a. AoS versus SoA. In my testing, the fastest was to just fold those kernels directly into the …

How about this?

```cpp
/**
 * @brief Asynchronously inserts all elements in the range `[first, last)`.
 *
 * @note For a given element `*(first + i)`, if the container doesn't already contain an
 * element with an equivalent key, inserts the element at a location pointed to by `iter`
 * and writes `iter` to `indices_inserted_begin + x`, where `x` is the next unwritten row
 * of that array. Also writes `x` to `reverse_lookup_begin + i` and writes `iter` to
 * `locations_begin + i`.
 *
 * If the container already contains the element, found at `iter`, then `iter` is written
 * to `locations_begin + i`.
 *
 * The total number of elements written to `indices_inserted_begin` will never be more
 * than `std::distance(first, last)`. It will be written to the GPU memory pointed to by
 * `total_inserts`.
 *
 * @param first Beginning of the sequence of elements
 * @param last End of the sequence of elements
 * @param stream CUDA stream used for insert
 */
template <typename InputIt, typename IndicesInsertedIt, typename LocationIt>
void insert_and_find_async(InputIt first,
                           InputIt last,
                           IndicesInsertedIt indices_inserted_begin,
                           size_t* reverse_lookup_begin,
                           LocationIt locations_begin,
                           size_t* total_inserts,
                           cuda_stream_ref stream = {});
```

This is a useful function for groupby operations in databases.
I also prefer the alternative solution, but I am struggling to come up with better names for the outputs.
Hi @PointKernel. I would like to work on this issue.
Is your feature request related to a problem? Please describe.
Add host-bulk `insert_and_find` function for `static_set` and `static_map`.
Describe the solution you'd like
Describe alternatives you've considered
We can also take the location iterator and the insert-result iterator separately, since users might have to use a zip iterator in the above design.