Allow stored names/types in Schema for very large schemas #241
Conversation
There have been a few cases of extremely wide tables where users have run into fundamental compiler limits on tuple lengths (as discussed with core devs). One example is JuliaData/CSV.jl#635. This PR proposes, for very large schemas (> 65,000 columns), storing names/types in `Vector`s instead of tuples, with the aim of not breaking the runtime. The aim here is to be as non-disruptive as possible, hence the very high threshold for switching over the names/types storage. Another goal is that downstream packages don't break with just these changes in place. I'm not aware of any packages testing such wide tables, but in my own testing, I've seen issues where packages rely on the `Tables.Schema` type parameters for names/types. There's also an issue in DataFrames where `Tables.schema` attempts to construct a `Tables.Schema` directly instead of using the `Tables.Schema(names, types)` constructor. So while this PR is needed, we'll need to play whack-a-mole with downstream packages to ensure these really wide tables can be properly supported end-to-end. Going through those downstream package changes, we should probably make notes of how we can clarify the Tables.jl interface docs to hopefully help future implementors do so properly and avoid the same pitfalls.
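The storage switch described above can be sketched as follows. This is a standalone illustration under the stated threshold, not the actual Tables.jl implementation; `WIDE_THRESHOLD` and `make_schema` are hypothetical names:

```julia
# Hypothetical sketch of the threshold-based storage choice described above;
# `WIDE_THRESHOLD` and `make_schema` are illustrative names, not Tables.jl API.
const WIDE_THRESHOLD = 65_000

function make_schema(names::AbstractVector{Symbol}, types::AbstractVector)
    if length(names) > WIDE_THRESHOLD
        # very wide schema: keep names/types as Vectors so the compiler
        # never has to construct a giant tuple type
        return (names = names, types = types, stored = true)
    else
        # normal schema: tuples, so names/types can live in type parameters
        return (names = Tuple(names), types = Tuple(types), stored = false)
    end
end

narrow = make_schema([:a, :b], [Int, Float64])
wide   = make_schema([Symbol("col", i) for i in 1:70_000], fill(Int, 70_000))
```

The point of the high threshold is that the tuple path stays the default for every realistic table, so downstream code keying off type parameters is unaffected unless the schema is genuinely enormous.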
Codecov Report
```
@@            Coverage Diff             @@
##             main     #241      +/-   ##
==========================================
+ Coverage   94.59%   94.75%   +0.16%
==========================================
  Files           7        7
  Lines         610      629      +19
==========================================
+ Hits          577      596      +19
  Misses         33       33
==========================================
```
Do we test against an empty schema?
Great, this may help JuliaAI/MLJBase.jl#428.
Looks like we don't have any explicit tests; I'll add a few.
I don't think this will affect the case mentioned in that issue specifically. The core issue here is that Julia's core […] But none of that changes that when you try to do […]
Ok, I've made […]
This is part of fixing errors like JuliaData/CSV.jl#635, in addition to the changes to support really wide tables in JuliaData/Tables.jl#241. Luckily, there aren't many cases I've found across Tables.jl implementations that make working with really wide tables impossible, but this was a key place where, for really wide tables, we want the names/types to be stored as `Vector`s instead of `Tuple`/`Tuple{}` in `Tables.Schema`. This shouldn't have any noticeable effect for non-wide DataFrames and should be covered by existing tests.
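The constructor-vs-direct-construction pitfall mentioned in the PR description can be illustrated with a stripped-down stand-in for `Tables.Schema`. `MySchema` is a hypothetical mini-version for illustration, not the real type:

```julia
# Stand-in illustrating the pitfall: building the parameterized type directly
# bakes names/types into type parameters, while going through the constructor
# lets it choose Vector storage for very wide schemas.
# `MySchema` is a hypothetical mini-version, not Tables.jl's actual type.
struct MySchema{names, types}
    storednames::Union{Nothing, Vector{Symbol}}
    storedtypes::Union{Nothing, Vector{Type}}
end

# the constructor decides the representation based on width
function MySchema(names::Vector{Symbol}, types::Vector)
    if length(names) > 65_000
        # wide: leave type parameters empty, store names/types as Vectors
        return MySchema{nothing, nothing}(names, Vector{Type}(types))
    else
        # narrow: encode names/types in the type parameters, store nothing
        return MySchema{Tuple(names), Tuple{types...}}(nothing, nothing)
    end
end

narrow = MySchema([:a, :b], [Int, Float64])
wide   = MySchema([Symbol("c", i) for i in 1:70_000], fill(Int, 70_000))
```

Code that writes `MySchema{names, types}()` by hand only ever produces the narrow representation, which is exactly why constructing `Tables.Schema` directly (instead of via `Tables.Schema(names, types)`) breaks for very wide tables.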
Great work - thank you!
```diff
     get(io, :print_schema_header, true) && println(io, "Tables.Schema:")
-    Base.print_matrix(io, hcat(collect(names), types === nothing ? fill(nothing, length(names)) : collect(fieldtype(types, i) for i = 1:fieldcount(types))))
+    nms = sch.names
+    Base.print_matrix(io, hcat(nms isa Vector ? nms : collect(nms), sch.types === nothing ? fill(nothing, length(nms)) : collect(sch.types)))
```
just to be sure - are both paths tested?
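A minimal check of both display paths might look like this. It is a standalone sketch: `schema_matrix` is a hypothetical helper mirroring the `hcat` logic in the diff above, not part of Tables.jl:

```julia
# Standalone sketch of the two display paths asked about above: names may
# arrive as a Tuple (ordinary schemas) or a Vector (very wide schemas), and
# types may additionally be `nothing` (unknown). `schema_matrix` is a
# hypothetical helper mirroring the hcat logic from the diff.
function schema_matrix(names, types)
    nms = names isa Vector ? names : collect(names)
    tys = types === nothing ? fill(nothing, length(nms)) : collect(types)
    return hcat(nms, tys)
end

tuple_path  = schema_matrix((:a, :b), (Int, Float64))  # names/types as tuples
vector_path = schema_matrix([:a, :b], [Int, Float64])  # names/types as Vectors
no_types    = schema_matrix([:a, :b], nothing)         # types unknown
```

Both representations should produce an identical two-column matrix of names and types, which is what makes a shared `show` method possible.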