
Tables.columntable, Tables.allocatecolumns run slowly for tables with a large number of columns #209

Closed
OkonSamuel opened this issue Oct 20, 2020 · 3 comments

Comments

@OkonSamuel
Contributor

OkonSamuel commented Oct 20, 2020

Hello world,
At MLJ we noticed that for tables with a large number of columns, Tables.columntable, Tables.allocatecolumns, and in fact almost any method relying on the creation of a Tuple or NamedTuple run slowly.
As an example:

```julia
using Tables
X = Tables.table(randn(200, 10000));        # returns almost immediately
col_table = Tables.columntable(X);          # takes forever
```

or

```julia
using DataFrames
df = DataFrame(randn(200, 10000));          # returns almost immediately
col_table = Tables.columntable(df);         # takes about 5 mins

# Creating large NamedTuples in Julia isn't advisable:
names = tuple((Symbol(string("Column", j)) for j in 1:10000)...);
values = rand(10000);
NamedTuple{names}((values...,));            # takes forever

Tables.allocatecolumns(Tables.schema(df), 200);  # takes about 3 mins
# Running the above a second time, changing all the other arguments but
# keeping the same table, is quite fast:
Tables.allocatecolumns(Tables.schema(df), 2000); # returns immediately
```

Is this the norm?

I suspect this might be an issue for tables implementing Tables.rows and relying on Tables.jl's default implementation of Tables.columns, but it shouldn't affect tables implementing both methods (assuming the implementation provided for the table doesn't require NamedTuples).
Since NamedTuples are fast for tables with a small number of columns but very slow for tables with a large number of columns, I wonder if we could have a heuristic that switches to a more efficient data structure (maybe a Dict) for tables with a large number of columns.
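To make the Dict idea concrete, here is a minimal sketch of such a fallback (the function name `dict_columntable` is hypothetical, not a Tables.jl API). Because no 10000-name NamedTuple type is ever created, there is no per-schema compilation cost:

```julia
# Hypothetical sketch: a column store keyed by Symbol. Building a Dict is
# a purely runtime operation, so the column count never triggers compilation.
function dict_columntable(names::Vector{Symbol}, cols::Vector{<:AbstractVector})
    table = Dict{Symbol,AbstractVector}()
    for (nm, col) in zip(names, cols)
        table[nm] = col
    end
    return table
end

names = [Symbol("Column", j) for j in 1:10_000]
cols  = [randn(200) for _ in 1:10_000]
tbl   = dict_columntable(names, cols)  # returns quickly even at 10_000 columns
```

The trade-off is that column access through a Dict is no longer type-stable, which is exactly the tension discussed in the comments below.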

@quinnj, @bkamins, @nalimilan your thoughts would be very much appreciated.

cc. @ablaom.

@bkamins
Member

bkamins commented Oct 20, 2020

The point you raise is exactly the reason why DataFrames.jl exists. It is designed for the case of very wide tables (and also for tables whose schema should be allowed to change, but that is a separate issue, as type-stable tables do not allow changing the schema in place). So in short:

  • if your data is narrow you can use some type-stable table
  • if your data is wide you can use a type-unstable table

and we provide both options in the ecosystem.

The typical workflow with DataFrames.jl is to use the type-unstable DataFrame by default and fall back to a type-stable table for transformations, since they usually involve only a few columns (DataFrames.jl automatically falls back to type-stable mode in this case, so the user does not have to handle it manually).
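That workflow can be sketched without DataFrames.jl itself; below, a plain Dict stands in for the type-unstable wide table, and a small NamedTuple is built only for the two columns the transformation touches (all names here are illustrative):

```julia
# Wide, type-unstable container: cheap to build regardless of width.
wide = Dict(Symbol("Column", j) => randn(200) for j in 1:10_000)

# Narrow, type-stable view: only the columns the kernel actually needs.
narrow = (x = wide[:Column1], y = wide[:Column2])

# The kernel compiles once for this small, concrete NamedTuple type.
kernel(t) = sum(t.x) + sum(t.y)
kernel(narrow)
```

The compilation cost is paid only for the two-column NamedTuple, never for the full 10_000-column schema.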

quinnj added a commit that referenced this issue Nov 6, 2020
Vastly improves #209. I actually thought this would trip us up sooner,
but kind of forgot about it. We had some generous use of `@generated`
code in a number of places, most egregiously in
`Tables.allocatecolumns` and `Tables.eachcolumn`. It simplifies the code
to either just use a slow fallback, or skip the generated code entirely
by using macro expansion up to a certain limit. I tried to test out a
number of cases, comparing time to compile vs. resulting compiled code
performance, but it can be tricky to cover a wide variety of workflows.
I might try to generate a wider variety to compare benchmarks pre/post
this PR.
@quinnj
Member

quinnj commented Nov 6, 2020

Thanks for reporting this @OkonSamuel; I've been doing some profiling of various parts of Tables.jl code lately, so this was timely to see the bottlenecks you're running into. I've proposed #211, which vastly improves the cases you listed above, while also simplifying several uses of @generated in the code.

I agree with @bkamins that a DataFrame may be better suited to use cases with very wide tables, but I'd still like Tables.jl to try to be smarter about things, which #211 helps a lot with. That said, there will always be a high cost to calling things like Tables.columntable, since that explicitly materializes a wide NamedTuple, which, as you noted, can be prohibitive; i.e. there's not much we can do about that in Tables.jl.
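The approach #211 describes, limiting specialization to a size threshold, can be illustrated roughly as follows (the function and threshold here are hypothetical, not the actual Tables.jl internals):

```julia
const SPECIALIZE_LIMIT = 100  # illustrative threshold

function sum_each_column(cols::NamedTuple)
    if length(cols) <= SPECIALIZE_LIMIT
        # Narrow table: map over the value tuple so the compiler can
        # specialize on the concrete NamedTuple type.
        return map(sum, Tuple(cols))
    else
        # Wide table: a generic runtime loop avoids emitting unrolled code
        # specific to, say, a 10_000-name NamedTuple type.
        return [sum(c) for c in values(cols)]
    end
end

sum_each_column((a = [1, 2], b = [3, 4]))  # small case hits the fast path
```

The small-column branch keeps the zero-overhead, fully specialized path; the wide branch trades a little runtime speed for not compiling schema-specific code at all.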

@quinnj
Member

quinnj commented Nov 6, 2020

Going to close this for now, but if people are aware of other extremely slow cases, feel free to comment or open a new issue.

@quinnj quinnj closed this as completed Nov 6, 2020