-
Notifications
You must be signed in to change notification settings - Fork 51
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Tables.columntable
, Table.allocatecolumns
runs slowly for tables with large number of columns
#209
Comments
The point you raise is exactly the reason why DataFrames.jl exists. It is designed for the case of very wide tables (and also tables that should allow changing of schema but this is a separate issue, as type-stable tables do not allow changing schema in-place). So in short:
and we provide both options in the ecosystem. The typical workflow with DataFrames.jl is to use type-unstable |
Vastly improves #209. I actually thought this would trip us up sooner, but kind of forgot about it. We had some generous use of `@generated` code in a number of places, most aggregiously in `Tables.allocatecolumns` and `Tables.eachcolumn`. It simplifies the code to either just use a slow fallback, or skip the generated code entirely by using macro expansion up to a certain limit. I tried to test out a number of cases, comparing time to compile vs. resulting compiled code performance, but it can be tricky to cover a wide variety of workflows. I might try to generate a wider variety to compare benchmarks pre/post this PR.
Thanks for reporting this @OkonSamuel; I've been doing some profiling of various parts of Tables.jl code lately, so this was timely to see the bottlenecks you're running into. I've proposed #211, which vastly improves the cases you listed above, while also simplifying several uses of I agree with @bkamins that a |
…211) Vastly improves #209. I actually thought this would trip us up sooner, but kind of forgot about it. We had some generous use of `@generated` code in a number of places, most aggregiously in `Tables.allocatecolumns` and `Tables.eachcolumn`. It simplifies the code to either just use a slow fallback, or skip the generated code entirely by using macro expansion up to a certain limit. I tried to test out a number of cases, comparing time to compile vs. resulting compiled code performance, but it can be tricky to cover a wide variety of workflows. I might try to generate a wider variety to compare benchmarks pre/post this PR.
Going to close this for now, but if people are aware of other extremely slow cases, feel free to comment or open a new issue. |
Hello world,
At MLJ we noticed that for tables with large number of columns,
Tables.columntable
,Table.allocatecolumns
and in fact almost any method relying on the creation of aTuple
orNamedTuple
runs slowlyAs an example
running the above second time changing all other args but on the same table is quite fast.
Is this the norm?
I suspect this might be an issue for tables implementing
Tables.rows
and relying onTables.jl
default implementation forTables.columns
but shouldn't affect tables implementing both methods(this is assuming the implementation provided for the table doesn't requireNamedTuples)
.Since using
NamedTuples
is fast for tables with small number of columns but so slow for tables with large number of columns i wonder if we could have like an heuristic that can switch to a relying on a more efficient data structure (maybe aDict
) for tables with large number of columns.@quinnj, @bkamins, @nalimilan your thoughts would be very much appreciated.
cc. @ablaom.
The text was updated successfully, but these errors were encountered: