Skip to content

Conversation

@1RyanK
Copy link
Contributor

@1RyanK 1RyanK commented Apr 21, 2025

Previously, set_categories was using several groupbys to set categories for a Categorical array to some new set of categories. This was inefficient. I figured out a new way to tackle this - union the categories together (as Strings arrays), get the index of the old categories in the new ones, and then remap. I was told that a union1d function already exists, but looking into it, it did not seem to perform very well. I created my own version that ignores order. Here's how it works:

  1. Union together the data within each locale across the arrays we're doing the union over
  2. Take the hash of each string modulo numLocales to determine which locale to send each string to
  3. After every locale sends all of its strings to all the other locales (including itself), look at what everyone else has sent me
  4. Each locale creates a new set containing the strings that were sent to it (there are now no duplicates of strings across all locales)
  5. Join the data together into a new SegString object

This seems to be the fastest way to do this. Getting incredible performance out of this and major asymptotic improvement over union1d.

Closes #4244: Make an unordered set union of two Strings arrays function

@1RyanK 1RyanK force-pushed the 4244-Make_an_unordered_set_union_of_two_Strings_arrays_function branch 2 times, most recently from fd0a3ae to 5474af8 Compare April 21, 2025 18:01
@1RyanK 1RyanK marked this pull request as ready for review April 22, 2025 15:15
@1RyanK 1RyanK force-pushed the 4244-Make_an_unordered_set_union_of_two_Strings_arrays_function branch from 5474af8 to bd87800 Compare April 22, 2025 17:42
Copy link
Contributor Author

@1RyanK 1RyanK left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For what it's worth, I tried doing

        x = cast(pdarray, unique(cast(pdarray, concatenate((unique(strings[0]), unique(strings[1])), ordered=False))))
        return x

in strings.py in place of what I had. Here's the performance comparison on two arrays of size 200 each (chosen from 1,000 values) and ten locales:
unique(concatenate(unique(A), unique(B))) approach...
Time elapsed: 39.54281187057495 seconds

+----+-------+-------+--------------+-------------------+-----------------+
|    |   put |   get |   execute_on |   execute_on_fast |   execute_on_nb |
+====+=======+=======+==============+===================+=================+
|  0 |  1324 |  1942 |         1400 |                 0 |            9666 |
+----+-------+-------+--------------+-------------------+-----------------+
|  1 |  1579 |  3859 |         1458 |                 4 |             181 |
+----+-------+-------+--------------+-------------------+-----------------+
|  2 |  1449 |  3839 |         1486 |                 4 |             181 |
+----+-------+-------+--------------+-------------------+-----------------+
|  3 |  1606 |  3822 |         1494 |                 4 |             181 |
+----+-------+-------+--------------+-------------------+-----------------+
|  4 |  1452 |  3854 |         1556 |                 4 |             181 |
+----+-------+-------+--------------+-------------------+-----------------+
|  5 |  1425 |  3793 |         1480 |                 4 |             181 |
+----+-------+-------+--------------+-------------------+-----------------+
|  6 |  1590 |  3846 |         1474 |                 4 |             181 |
+----+-------+-------+--------------+-------------------+-----------------+
|  7 |  1423 |  3810 |         1500 |                 4 |             181 |
+----+-------+-------+--------------+-------------------+-----------------+
|  8 |  1560 |  3819 |         1422 |                 4 |             181 |
+----+-------+-------+--------------+-------------------+-----------------+
|  9 |  1423 |  3804 |         1508 |                 4 |             181 |
+----+-------+-------+--------------+-------------------+-----------------+

New and improved approach...
Time elapsed: 5.029598236083984 seconds

+----+-------+-------+--------------+-----------------+
|    |   put |   get |   execute_on |   execute_on_nb |
+====+=======+=======+==============+=================+
|  0 |   873 |   664 |            0 |            1395 |
+----+-------+-------+--------------+-----------------+
|  1 |    89 |  1823 |          180 |             118 |
+----+-------+-------+--------------+-----------------+
|  2 |    88 |  1822 |          180 |             119 |
+----+-------+-------+--------------+-----------------+
|  3 |    89 |  1827 |          180 |             119 |
+----+-------+-------+--------------+-----------------+
|  4 |    89 |  1836 |          180 |             119 |
+----+-------+-------+--------------+-----------------+
|  5 |    89 |  1823 |          180 |             119 |
+----+-------+-------+--------------+-----------------+
|  6 |    89 |  1825 |          180 |             118 |
+----+-------+-------+--------------+-----------------+
|  7 |    89 |  1824 |          180 |             118 |
+----+-------+-------+--------------+-----------------+
|  8 |    89 |  1815 |          180 |             118 |
+----+-------+-------+--------------+-----------------+
|  9 |    89 |  1799 |          180 |             117 |
+----+-------+-------+--------------+-----------------+

Okay, so YMMV with the time elapsed, obviously I don't have an extremely high powered computer I'm running this on, but at the very least, you can see the comm diagnostics are much better. I think this is due in large part to unique which goes through a GroupBy. Furthermore, my approach scales better because the only column that increases with bigger arrays is the get column. That's it. I'm pretty sure the current approach sees the execute_on column increase wildly, which I think can make things slow.

@1RyanK 1RyanK force-pushed the 4244-Make_an_unordered_set_union_of_two_Strings_arrays_function branch 6 times, most recently from 3c16520 to 1e85b78 Compare April 30, 2025 12:41
@1RyanK 1RyanK force-pushed the 4244-Make_an_unordered_set_union_of_two_Strings_arrays_function branch 2 times, most recently from 162c042 to b5f00e6 Compare April 30, 2025 16:03
@1RyanK 1RyanK force-pushed the 4244-Make_an_unordered_set_union_of_two_Strings_arrays_function branch 3 times, most recently from 66f560c to 5439df6 Compare May 5, 2025 17:01
Copy link
Contributor

@ajpotts ajpotts left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work! I might add some additional comments, but I would be OK merging in as is as it seems correct and is an improvement over the other functions available for this purpose. I like the comm diagnostics too :)

@1RyanK 1RyanK force-pushed the 4244-Make_an_unordered_set_union_of_two_Strings_arrays_function branch from b7b0915 to 8d52083 Compare May 6, 2025 16:27
Copy link
Contributor

@drculhane drculhane left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I may have questions for you about some of the code, but I've run a number of tests on this myself, large and small, and it passes everything. I'll keep studying what you've done, but in the meantime, I think this was good work and should be merged.

@ajpotts ajpotts added this pull request to the merge queue May 6, 2025
Merged via the queue into Bears-R-Us:master with commit aad397c May 6, 2025
25 checks passed
@1RyanK 1RyanK deleted the 4244-Make_an_unordered_set_union_of_two_Strings_arrays_function branch June 3, 2025 12:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Make an unordered set union of two Strings arrays function

3 participants