Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow stored names/types in Schema for very large schemas #241

Merged
merged 7 commits into from
Jun 23, 2021

Conversation

quinnj
Copy link
Member

@quinnj quinnj commented Jun 18, 2021

There have been a few cases of extremely wide tables where users have
run into fundamental compiler limits for lengths of tuples (as discussed
with core devs). One example is
JuliaData/CSV.jl#635. This PR proposes for
very large schemas (> 65,000 columns), to store names/types in Vector
instead of tuples with the aim to avoid breaking the runtime. The aim
here is to be as non-disruptive as possible, hence the very high
threshold for switching over to store names/types. Another goal is that
downstream packages don't break with just these changes in place. I'm
not aware of any packages testing such wide tables, but in my own
testing, I've seen issues where packages are relying on the
Tables.Schema type parameters for names/types. There's also an issue
in DataFrames where Tables.schema attempts to construct a
Tables.Schema directly instead of using the Tables.Schema(names, types) constructor. So while this PR is needed, we'll need to play
whack-a-mole with downstream packages to ensure these really wide tables
can be properly supported end-to-end. Going through those downstream
package changes, we should probably make notes of how we can clarify
Tables.jl interface docs to hopefully help future implementors do so
properly and avoid the same pitfalls.

There have been a few cases of extremely wide tables where users have
run into fundamental compiler limits for lengths of tuples (as discussed
with core devs). One example is
JuliaData/CSV.jl#635. This PR proposes for
very large schemas (> 65,000 columns), to store names/types in `Vector`
instead of tuples with the aim to avoid breaking the runtime. The aim
here is to be as non-disruptive as possible, hence the very high
threshold for switching over to store names/types. Another goal is that
downstream packages don't break with just these changes in place. I'm
not aware of any packages testing such wide tables, but in my own
testing, I've seen issues where packages are relying on the
`Tables.Schema` type parameters for names/types. There's also an issue
in DataFrames where `Tables.schema` attempts to construct a
`Tables.Schema` directly instead of using the `Tables.Schema(names,
types)` constructor. So while this PR is needed, we'll need to play
whack-a-mole with downstream packages to ensure these really wide tables
can be properly supported end-to-end. Going through those downstream
package changes, we should probably make notes of how we can clarify
Tables.jl interface docs to hopefully help future implementors do so
properly and avoid the same pitfalls.
@codecov
Copy link

codecov bot commented Jun 18, 2021

Codecov Report

Merging #241 (142da37) into main (d038805) will increase coverage by 0.16%.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##             main     #241      +/-   ##
==========================================
+ Coverage   94.59%   94.75%   +0.16%     
==========================================
  Files           7        7              
  Lines         610      629      +19     
==========================================
+ Hits          577      596      +19     
  Misses         33       33              
Impacted Files Coverage Δ
src/Tables.jl 88.27% <100.00%> (+0.77%) ⬆️
src/fallbacks.jl 97.82% <100.00%> (ø)
src/namedtuples.jl 100.00% <100.00%> (ø)
src/utils.jl 89.39% <100.00%> (+1.46%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d038805...142da37. Read the comment docs.

@bkamins
Copy link
Member

bkamins commented Jun 19, 2021

Do we test against empty schema?

@ablaom
Copy link
Contributor

ablaom commented Jun 19, 2021

Great, this may help JuliaAI/MLJBase.jl#428

@quinnj
Copy link
Member Author

quinnj commented Jun 22, 2021

Do we test against empty schema?

Looks like we don't have any explicit tests; I'll add a few.

Great, this may help JuliaAI/MLJBase.jl#428

I don't think this will affect the case mentioned in that issue specifically. The core issue here is that Julia's core Tuple/NamedTuple types hit fundamental issues when larger than 67,000 elements. Probably something about overflowing memory pages or something low-level like that. The fix I'm proposing in this PR is to avoid storing column names/types in Tuple and Tuple{} types when there are more columns than 67,000, and instead store them in Vector{Symbol} and Vector{Type}. I did add the ability to call Tables.Schema(names, types; stored=true), which would allow custom table types to always store names/types in Vectors if desired.

But none of that changes that when you try to do Tables.columntable(x) on a really wide table that it will be slow, because it's trying to generate a NamedTuple{names, types} object, and that uses Tuple/Tuple{}, so it's going to be slow for # of columns larger than 10K, and will completely break when > 67K. Which reminds me that I'm going to add a note in the docs for Tables.columntable about that limitation. These kinds of slowdowns/limitations also apply to TypedTables and JuliaDB, and is a key driving design for DataFrames.jl storing columns as Vector{AbstractVector}.

@quinnj
Copy link
Member Author

quinnj commented Jun 22, 2021

Ok, I've made Tables.columntable/Tables.rowtable/Tables.namedtupleiterator all throw ArgumentError when the input tables are too wide; that's at least better than showing weird compiler breakage. I also tested that various Tables.jl-provided tables work on really wide tables (dict tables, matrix, etc.).

quinnj added a commit to JuliaData/DataFrames.jl that referenced this pull request Jun 22, 2021
This is part of fixing errors like
JuliaData/CSV.jl#635 in addition to the
changes to support really wide tables in
JuliaData/Tables.jl#241. Luckily, there aren't
many cases I've found across Tables.jl implementations that make working
with really wide tables impossible, but this was a key place where for
really wide tables, we want the names/types to be stored as `Vector`s
instead of `Tuple`/`Tuple{}` in `Tables.Schema`. This shouldn't have any
noticeable change/affect for non-wide DataFrames and should be covered
by existing tests.
@bkamins
Copy link
Member

bkamins commented Jun 22, 2021

Great work - thank you!

get(io, :print_schema_header, true) && println(io, "Tables.Schema:")
Base.print_matrix(io, hcat(collect(names), types === nothing ? fill(nothing, length(names)) : collect(fieldtype(types, i) for i = 1:fieldcount(types))))
nms = sch.names
Base.print_matrix(io, hcat(nms isa Vector ? nms : collect(nms), sch.types === nothing ? fill(nothing, length(nms)) : collect(sch.types)))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just to be sure - are both paths tested?

quinnj added a commit to JuliaData/DataFrames.jl that referenced this pull request Jun 22, 2021
…ly (#2797)

This is part of fixing errors like
JuliaData/CSV.jl#635 in addition to the
changes to support really wide tables in
JuliaData/Tables.jl#241. Luckily, there aren't
many cases I've found across Tables.jl implementations that make working
with really wide tables impossible, but this was a key place where for
really wide tables, we want the names/types to be stored as `Vector`s
instead of `Tuple`/`Tuple{}` in `Tables.Schema`. This shouldn't have any
noticeable change/affect for non-wide DataFrames and should be covered
by existing tests.
@quinnj quinnj merged commit aab3e55 into main Jun 23, 2021
@quinnj quinnj deleted the jq/reallywidetables branch June 23, 2021 02:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants