Closes #4244: Make an unordered set union of two Strings arrays function #4245

1RyanK · 2025-04-21T17:56:04Z

Previously, set_categories used several groupbys to move a Categorical array onto a new set of categories, which was inefficient. I figured out a new way to tackle this: union the categories together (as Strings arrays), get the index of each old category in the new ones, and then remap. I was told that a union1d function already exists, but when I looked into it, it did not seem to perform very well, so I created my own version that ignores order. Here's how it works:

1. Union together the data within each locale across the arrays we're doing the union over.
2. Take the hash of each string modulo numLocales to determine which locale to send that string to.
3. After every locale has sent each of its strings to its destination locale (possibly itself), each locale looks at what it has received.
4. Each locale creates a new set containing the strings that were sent to it. Equal strings hash to the same locale, so there are now no duplicates of strings across all locales.
5. Join the data together into a new SegString object.
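The steps above can be sketched in plain Python as a single-process simulation. This is only an illustration of the hash-partitioning idea, not the Chapel code from this PR: the names `stable_hash`, `split_chunks`, and `unordered_union` are hypothetical, locales are modeled as list indices, and SHA-256 stands in for whatever hash the server actually uses.

```python
import hashlib


def stable_hash(s: str) -> int:
    """Deterministic 64-bit hash (Python's built-in hash() is salted per process)."""
    digest = hashlib.sha256(s.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "little")


def split_chunks(values, num_locales):
    """Block-distribute a list over num_locales chunks, mimicking a distributed array."""
    size = -(-len(values) // num_locales)  # ceiling division
    return [values[i * size:(i + 1) * size] for i in range(num_locales)]


def unordered_union(arrays, num_locales):
    """Union of several string arrays; each array is a list of per-locale chunks."""
    # Step 1: every locale unions its own chunks across all input arrays.
    local_sets = [set() for _ in range(num_locales)]
    for chunks in arrays:
        for loc, chunk in enumerate(chunks):
            local_sets[loc].update(chunk)

    # Steps 2-3: route each string to locale stable_hash(s) % num_locales.
    inboxes = [[] for _ in range(num_locales)]
    for strings in local_sets:
        for s in strings:
            inboxes[stable_hash(s) % num_locales].append(s)

    # Step 4: each destination deduplicates what it received. Equal strings
    # always land in the same inbox, so no string survives on two locales.
    received = [set(inbox) for inbox in inboxes]

    # Step 5: join the per-locale results into one flat, unordered result.
    return [s for strings in received for s in strings]


a = split_chunks(["red", "green", "blue", "red", "cyan", "green"], 3)
b = split_chunks(["blue", "magenta", "red", "yellow"], 3)
result = unordered_union([a, b], 3)
print(sorted(result))
```

The order of `result` depends on the hash and on set iteration order, which is why the sketch sorts only for display. Giving up a defined order is what lets each locale work on its own bucket independently.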

This seems to be the fastest way to do this. I'm getting much better performance out of it, and what looks like a major asymptotic improvement over union1d.
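For the set_categories use case mentioned in the description (union the categories, find each old category's index in the union, then remap), a rough local sketch could look like the following. The function `remap_codes` and the sample data are hypothetical, a Python `set` stands in for the unordered union, and NA handling is left out entirely.

```python
def remap_codes(old_categories, codes, new_categories):
    """Re-express integer codes against the union of old and new categories."""
    # Stand-in for the unordered union: any order works, as long as it is fixed.
    merged = list(set(old_categories) | set(new_categories))
    # Index of every category in the merged list.
    position = {cat: i for i, cat in enumerate(merged)}
    # For each old category, its index in the merged categories.
    old_to_merged = [position[cat] for cat in old_categories]
    # Remap every code through that lookup table.
    return merged, [old_to_merged[c] for c in codes]


old_categories = ["low", "mid", "high"]
codes = [0, 2, 2, 1, 0]
merged, new_codes = remap_codes(old_categories, codes, ["mid", "high", "extreme"])
print([merged[c] for c in new_codes])
```

Because the codes are remapped through a lookup table, the decoded values are unchanged no matter what order the union comes back in.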

1RyanK

For what it's worth, I tried doing

        x = cast(pdarray, unique(cast(pdarray, concatenate((unique(strings[0]), unique(strings[1])), ordered=False))))
        return x

in strings.py in place of what I had. Here's the performance comparison on two arrays of size 200 each (chosen from 1,000 values) and ten locales:
unique(concatenate(unique(A), unique(B))) approach...
Time elapsed: 39.54281187057495 seconds

+----+-------+-------+--------------+-------------------+-----------------+
|    |   put |   get |   execute_on |   execute_on_fast |   execute_on_nb |
+====+=======+=======+==============+===================+=================+
|  0 |  1324 |  1942 |         1400 |                 0 |            9666 |
+----+-------+-------+--------------+-------------------+-----------------+
|  1 |  1579 |  3859 |         1458 |                 4 |             181 |
+----+-------+-------+--------------+-------------------+-----------------+
|  2 |  1449 |  3839 |         1486 |                 4 |             181 |
+----+-------+-------+--------------+-------------------+-----------------+
|  3 |  1606 |  3822 |         1494 |                 4 |             181 |
+----+-------+-------+--------------+-------------------+-----------------+
|  4 |  1452 |  3854 |         1556 |                 4 |             181 |
+----+-------+-------+--------------+-------------------+-----------------+
|  5 |  1425 |  3793 |         1480 |                 4 |             181 |
+----+-------+-------+--------------+-------------------+-----------------+
|  6 |  1590 |  3846 |         1474 |                 4 |             181 |
+----+-------+-------+--------------+-------------------+-----------------+
|  7 |  1423 |  3810 |         1500 |                 4 |             181 |
+----+-------+-------+--------------+-------------------+-----------------+
|  8 |  1560 |  3819 |         1422 |                 4 |             181 |
+----+-------+-------+--------------+-------------------+-----------------+
|  9 |  1423 |  3804 |         1508 |                 4 |             181 |
+----+-------+-------+--------------+-------------------+-----------------+

New and improved approach...
Time elapsed: 5.029598236083984 seconds

+----+-------+-------+--------------+-----------------+
|    |   put |   get |   execute_on |   execute_on_nb |
+====+=======+=======+==============+=================+
|  0 |   873 |   664 |            0 |            1395 |
+----+-------+-------+--------------+-----------------+
|  1 |    89 |  1823 |          180 |             118 |
+----+-------+-------+--------------+-----------------+
|  2 |    88 |  1822 |          180 |             119 |
+----+-------+-------+--------------+-----------------+
|  3 |    89 |  1827 |          180 |             119 |
+----+-------+-------+--------------+-----------------+
|  4 |    89 |  1836 |          180 |             119 |
+----+-------+-------+--------------+-----------------+
|  5 |    89 |  1823 |          180 |             119 |
+----+-------+-------+--------------+-----------------+
|  6 |    89 |  1825 |          180 |             118 |
+----+-------+-------+--------------+-----------------+
|  7 |    89 |  1824 |          180 |             118 |
+----+-------+-------+--------------+-----------------+
|  8 |    89 |  1815 |          180 |             118 |
+----+-------+-------+--------------+-----------------+
|  9 |    89 |  1799 |          180 |             117 |
+----+-------+-------+--------------+-----------------+

Okay, so YMMV with the time elapsed, since I'm not running this on an especially high-powered computer, but at the very least you can see that the comm diagnostics are much better. I think this is due in large part to unique, which goes through a GroupBy. Furthermore, my approach should scale better, because the only column that increases with bigger arrays is the get column. I'm pretty sure the current approach sees the execute_on column increase sharply as well, which I think can make things slow.
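To put rough numbers on that comparison, the short script below divides the old figures by the new ones. It uses only values copied from the two tables above (the locale 1 rows, picked as a representative non-root locale) and the two elapsed times; the variable names are just for illustration.

```python
# Elapsed times reported above, in seconds.
old_time, new_time = 39.54281187057495, 5.029598236083984

# Locale 1 rows from the two comm diagnostics tables above.
old_locale1 = {"put": 1579, "get": 3859, "execute_on": 1458, "execute_on_nb": 181}
new_locale1 = {"put": 89, "get": 1823, "execute_on": 180, "execute_on_nb": 118}

speedup = old_time / new_time
ratios = {name: old_locale1[name] / new_locale1[name] for name in new_locale1}

print(f"wall-clock speedup: {speedup:.2f}x")
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.1f}x fewer operations on locale 1")
```

This is a single run on one machine, so the ratios are indicative only.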

tests/numpy/string_test.py

src/ConcatenateMsg.chpl

ajpotts

Great work! I might add some additional comments, but I would be OK merging in as is as it seems correct and is an improvement over the other functions available for this purpose. I like the comm diagnostics too :)

drculhane

I may have questions for you about some of the code, but I've run a number of tests on this myself, large and small, and it passes everything. I'll keep studying what you've done, but in the meantime, I think this was good work and should be merged.

1RyanK force-pushed the 4244-Make_an_unordered_set_union_of_two_Strings_arrays_function branch 2 times, most recently from fd0a3ae to 5474af8 on April 21, 2025 18:01

1RyanK requested review from ajpotts, drculhane, e-kayrakli and jaketrookman on April 21, 2025 19:09

1RyanK marked this pull request as ready for review on April 22, 2025 15:15

1RyanK force-pushed the 4244-Make_an_unordered_set_union_of_two_Strings_arrays_function branch from 5474af8 to bd87800 on April 22, 2025 17:42

1RyanK commented Apr 22, 2025

1RyanK force-pushed the 4244-Make_an_unordered_set_union_of_two_Strings_arrays_function branch 6 times, most recently from 3c16520 to 1e85b78 on April 30, 2025 12:41

ajpotts reviewed Apr 30, 2025

tests/numpy/string_test.py (outdated, resolved)

ajpotts reviewed Apr 30, 2025

src/ConcatenateMsg.chpl (outdated, resolved)

1RyanK force-pushed the 4244-Make_an_unordered_set_union_of_two_Strings_arrays_function branch 2 times, most recently from 162c042 to b5f00e6 on April 30, 2025 16:03

ajpotts reviewed May 2, 2025

src/ConcatenateMsg.chpl (four review threads, each outdated and resolved)

1RyanK force-pushed the 4244-Make_an_unordered_set_union_of_two_Strings_arrays_function branch 3 times, most recently from 66f560c to 5439df6 on May 5, 2025 17:01

ajpotts reviewed May 5, 2025

src/ConcatenateMsg.chpl (resolved)

ajpotts approved these changes May 5, 2025

Closes Bears-R-Us#4244: Make an unordered set union of two Strings arrays function

8d52083

1RyanK force-pushed the 4244-Make_an_unordered_set_union_of_two_Strings_arrays_function branch from b7b0915 to 8d52083 on May 6, 2025 16:27

drculhane approved these changes May 6, 2025

ajpotts added this pull request to the merge queue May 6, 2025

Merged via the queue into Bears-R-Us:master with commit aad397c May 6, 2025
25 checks passed

1RyanK deleted the 4244-Make_an_unordered_set_union_of_two_Strings_arrays_function branch June 3, 2025 12:19

3 participants
