
Tables.columntable, Tables.allocatecolumns run slowly for tables with a large number of columns #209

Closed
OkonSamuel opened this issue Oct 20, 2020 · 3 comments

Comments

@OkonSamuel
Contributor

OkonSamuel commented Oct 20, 2020

Hello world,
At MLJ we noticed that for tables with a large number of columns, Tables.columntable, Tables.allocatecolumns, and in fact almost any method relying on the creation of a Tuple or NamedTuple run slowly.
As an example:

```julia
using Tables
X = Tables.table(randn(200, 10000));        # returns almost immediately
col_table = Tables.columntable(X);          # takes forever
```

or

```julia
using DataFrames
df = DataFrame(randn(200, 10000));          # returns almost immediately
col_table = Tables.columntable(df);         # takes about 5 mins

# Creating large NamedTuples in Julia isn't advisable:
names = tuple((Symbol(string("Column", j)) for j in 1:10000)...);
values = rand(10000);
NamedTuple{names}((values...,));            # takes forever

Tables.allocatecolumns(Tables.schema(df), 200);  # takes about 3 mins
# Running the above a second time, changing all the other arguments but
# keeping the same table, is quite fast:
Tables.allocatecolumns(Tables.schema(df), 2000); # returns immediately
```

Is this the norm?

I suspect this might be an issue for tables implementing Tables.rows and relying on Tables.jl's default implementation of Tables.columns, but it shouldn't affect tables implementing both methods (assuming the implementation provided for the table doesn't require NamedTuples).
Since NamedTuples are fast for tables with a small number of columns but very slow for tables with a large number of columns, I wonder if we could have a heuristic that switches to a more efficient data structure (maybe a Dict) for tables with a large number of columns.
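To make the Dict idea concrete, here is a minimal sketch of such a fallback (the function name `dict_columntable` is hypothetical, not a Tables.jl API). Because no 10000-name NamedTuple type is ever created, there is no per-schema compilation cost:

```julia
# Hypothetical sketch: a column store keyed by Symbol. Building a Dict is
# a purely runtime operation, so the column count never triggers compilation.
function dict_columntable(names::Vector{Symbol}, cols::Vector{<:AbstractVector})
    table = Dict{Symbol,AbstractVector}()
    for (nm, col) in zip(names, cols)
        table[nm] = col
    end
    return table
end

names = [Symbol("Column", j) for j in 1:10_000]
cols  = [randn(200) for _ in 1:10_000]
tbl   = dict_columntable(names, cols)  # returns quickly even at 10_000 columns
```

The trade-off is that column access through a Dict is no longer type-stable, which is exactly the tension discussed in the comments below.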

@quinnj, @bkamins, @nalimilan your thoughts would be very much appreciated.

cc. @ablaom.

@bkamins
Member

bkamins commented Oct 20, 2020

The point you raise is exactly the reason why DataFrames.jl exists. It is designed for the case of very wide tables (and also for tables whose schema should be allowed to change, but that is a separate issue, as type-stable tables do not allow changing the schema in place). So in short:

  • if your data is narrow you can use some type-stable table
  • if your data is wide you can use a type-unstable table

and we provide both options in the ecosystem.

The typical workflow with DataFrames.jl is to use the type-unstable DataFrame by default and fall back to a type-stable table for transformations, since they usually involve only a few columns (DataFrames.jl automatically falls back to type-stable mode in this case, so the user does not have to handle it manually).
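That workflow can be sketched without DataFrames.jl itself; below, a plain Dict stands in for the type-unstable wide table, and a small NamedTuple is built only for the two columns the transformation touches (all names here are illustrative):

```julia
# Wide, type-unstable container: cheap to build regardless of width.
wide = Dict(Symbol("Column", j) => randn(200) for j in 1:10_000)

# Narrow, type-stable view: only the columns the kernel actually needs.
narrow = (x = wide[:Column1], y = wide[:Column2])

# The kernel compiles once for this small, concrete NamedTuple type.
kernel(t) = sum(t.x) + sum(t.y)
kernel(narrow)
```

The compilation cost is paid only for the two-column NamedTuple, never for the full 10_000-column schema.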

quinnj added a commit that referenced this issue Nov 6, 2020
Vastly improves #209. I actually thought this would trip us up sooner,
but kind of forgot about it. We had some generous use of `@generated`
code in a number of places, most egregiously in
`Tables.allocatecolumns` and `Tables.eachcolumn`. It simplifies the code
to either just use a slow fallback, or skip the generated code entirely
by using macro expansion up to a certain limit. I tried to test out a
number of cases, comparing time to compile vs. resulting compiled code
performance, but it can be tricky to cover a wide variety of workflows.
I might try to generate a wider variety to compare benchmarks pre/post
this PR.
@quinnj
Member

quinnj commented Nov 6, 2020

Thanks for reporting this @OkonSamuel; I've been doing some profiling of various parts of Tables.jl code lately, so this was timely to see the bottlenecks you're running into. I've proposed #211, which vastly improves the cases you listed above, while also simplifying several uses of @generated in the code.

I agree with @bkamins that a DataFrame may be better suited to use cases with very wide tables, but I'd still like Tables.jl to try to be smarter about things, which #211 helps a lot with. That said, there will always be a high cost to calling things like Tables.columntable, since that explicitly materializes a wide NamedTuple, which, as you noted, can be prohibitive; i.e. there's not much we can do about that in Tables.jl.
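The approach #211 describes, limiting specialization to a size threshold, can be illustrated roughly as follows (the function and threshold here are hypothetical, not the actual Tables.jl internals):

```julia
const SPECIALIZE_LIMIT = 100  # illustrative threshold

function sum_each_column(cols::NamedTuple)
    if length(cols) <= SPECIALIZE_LIMIT
        # Narrow table: map over the value tuple so the compiler can
        # specialize on the concrete NamedTuple type.
        return map(sum, Tuple(cols))
    else
        # Wide table: a generic runtime loop avoids emitting unrolled code
        # specific to, say, a 10_000-name NamedTuple type.
        return [sum(c) for c in values(cols)]
    end
end

sum_each_column((a = [1, 2], b = [3, 4]))  # small case hits the fast path
```

The small-column branch keeps the zero-overhead, fully specialized path; the wide branch trades a little runtime speed for not compiling schema-specific code at all.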

@quinnj
Member

quinnj commented Nov 6, 2020

Going to close this for now, but if people are aware of other extremely slow cases, feel free to comment or open a new issue.

@quinnj quinnj closed this as completed Nov 6, 2020