-
Notifications
You must be signed in to change notification settings - Fork 97
Closes #4244: Make an unordered set union of two Strings arrays function #4245
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Closes #4244: Make an unordered set union of two Strings arrays function #4245
Conversation
fd0a3ae to
5474af8
Compare
5474af8 to
bd87800
Compare
1RyanK
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For what it's worth, I tried doing
x = cast(pdarray, unique(cast(pdarray, concatenate((unique(strings[0]), unique(strings[1])), ordered=False))))
return x
in strings.py in place of what I had. Here's the performance comparison on two arrays of size 200 each (chosen from 1,000 values) and ten locales:
unique(concatenate(unique(A), unique(B))) approach...
Time elapsed: 39.54281187057495 seconds
+----+-------+-------+--------------+-------------------+-----------------+
| | put | get | execute_on | execute_on_fast | execute_on_nb |
+====+=======+=======+==============+===================+=================+
| 0 | 1324 | 1942 | 1400 | 0 | 9666 |
+----+-------+-------+--------------+-------------------+-----------------+
| 1 | 1579 | 3859 | 1458 | 4 | 181 |
+----+-------+-------+--------------+-------------------+-----------------+
| 2 | 1449 | 3839 | 1486 | 4 | 181 |
+----+-------+-------+--------------+-------------------+-----------------+
| 3 | 1606 | 3822 | 1494 | 4 | 181 |
+----+-------+-------+--------------+-------------------+-----------------+
| 4 | 1452 | 3854 | 1556 | 4 | 181 |
+----+-------+-------+--------------+-------------------+-----------------+
| 5 | 1425 | 3793 | 1480 | 4 | 181 |
+----+-------+-------+--------------+-------------------+-----------------+
| 6 | 1590 | 3846 | 1474 | 4 | 181 |
+----+-------+-------+--------------+-------------------+-----------------+
| 7 | 1423 | 3810 | 1500 | 4 | 181 |
+----+-------+-------+--------------+-------------------+-----------------+
| 8 | 1560 | 3819 | 1422 | 4 | 181 |
+----+-------+-------+--------------+-------------------+-----------------+
| 9 | 1423 | 3804 | 1508 | 4 | 181 |
+----+-------+-------+--------------+-------------------+-----------------+
New and improved approach...
Time elapsed: 5.029598236083984 seconds
+----+-------+-------+--------------+-----------------+
| | put | get | execute_on | execute_on_nb |
+====+=======+=======+==============+=================+
| 0 | 873 | 664 | 0 | 1395 |
+----+-------+-------+--------------+-----------------+
| 1 | 89 | 1823 | 180 | 118 |
+----+-------+-------+--------------+-----------------+
| 2 | 88 | 1822 | 180 | 119 |
+----+-------+-------+--------------+-----------------+
| 3 | 89 | 1827 | 180 | 119 |
+----+-------+-------+--------------+-----------------+
| 4 | 89 | 1836 | 180 | 119 |
+----+-------+-------+--------------+-----------------+
| 5 | 89 | 1823 | 180 | 119 |
+----+-------+-------+--------------+-----------------+
| 6 | 89 | 1825 | 180 | 118 |
+----+-------+-------+--------------+-----------------+
| 7 | 89 | 1824 | 180 | 118 |
+----+-------+-------+--------------+-----------------+
| 8 | 89 | 1815 | 180 | 118 |
+----+-------+-------+--------------+-----------------+
| 9 | 89 | 1799 | 180 | 117 |
+----+-------+-------+--------------+-----------------+
Okay, so YMMV with the time elapsed, obviously I don't have an extremely high powered computer I'm running this on, but at the very least, you can see the comm diagnostics are much better. I think this is due in large part to unique which goes through a GroupBy. Furthermore, my approach scales better because the only column that increases with bigger arrays is the get column. That's it. I'm pretty sure the current approach sees the execute_on column increase wildly, which I think can make things slow.
3c16520 to
1e85b78
Compare
162c042 to
b5f00e6
Compare
66f560c to
5439df6
Compare
ajpotts
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great work! I might add some additional comments, but I would be OK merging in as is as it seems correct and is an improvement over the other functions available for this purpose. I like the comm diagnostics too :)
b7b0915 to
8d52083
Compare
drculhane
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I may have questions for you about some of the code, but I've run a number of tests on this myself, large and small, and it passes everything. I'll keep studying what you've done, but in the meantime, I think this was good work and should be merged.
Previously,
set_categorieswas using several groupbys to set categories for a Categorical array to some new set of categories. This was inefficient. I figured out a new way to tackle this - union the categories together (as Strings arrays), get the index of the old categories in the new ones, and then remap. I was told that aunion1dfunction already exists, but looking into it, it did not seem to perform very well. I created my own version that ignores order. Here's how it works:This seems to be the fastest way to do this. Getting incredible performance out of this and major asymptotic improvement over
union1d.Closes #4244: Make an unordered set union of two Strings arrays function