Bad performance of "by" function for random queries #1988
Comments
The performance improves by a factor of 40 for queries that work on Float64 columns only, and by a factor of four for queries with mixed column types, when the following code is used:
However, this workaround is complicated, and it does not work very well when the column types are mixed.
For random queries like the ones I have come across (typical for generic applications), a fast function would make a lot of difference!
Couldn't you just pass a vector of pairs instead of a named tuple, and rename columns afterwards?
It's too bad that names are part of the type of named tuples in this case. I precisely ensured that named tuples are not used internally so that recompilation isn't needed when the names change. But the public API for specifying column names indeed relies on named tuples. Maybe we could mitigate the problem by collecting them into a vector as soon as possible so that only a very small function needs to be recompiled. I'm not sure.
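The vector-of-pairs suggestion could look roughly like the sketch below. It is not code from this thread: it assumes a recent DataFrames release, where `combine(groupby(...))` takes the place of the `by` function discussed here, and the sample data and the names `ops` and `wanted` are made up for illustration. The point is that the output names live in an ordinary vector, so they are not part of any argument's type.

```julia
using DataFrames
using Statistics: mean

# Hypothetical sample data; not taken from the issue.
df = DataFrame(g = [1, 1, 2, 2],
               x = [1.0, 2.0, 3.0, 4.0],
               y = [10, 20, 30, 40])

# A "random query" built at run time as a plain vector of pairs.
# Unlike a named tuple, the vector's type does not encode output column names.
ops = [:x => sum, :y => mean]

# Desired output names, kept separately as data (hypothetical names).
wanted = [:total_x, :mean_y]

# Aggregate with auto-generated column names, then rename afterwards.
res = combine(groupby(df, :g), ops)
rename!(res, vcat(:g, wanted))
```

Whether this actually avoids the recompilation cost would still need to be measured on the DataFrames version in use.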
@ufechner7 Can you try with DataFrames master and see whether your problem is fixed?
The following code performs badly because each call of the "by" function triggers recompilation:
Example output, second run:
Original post: https://discourse.julialang.org/t/bad-performance-of-group-by-of-dataframes-updated/30061/