add Iterators.partition for DataFrameRows #3299

bkamins · 2023-03-12T20:16:01Z

@svilupp - could you please review this PR against your requirements?

@nalimilan - do you think it should be patch or should go to 1.6 release?

bkamins · 2023-03-12T20:18:20Z

@svilupp - a related decision is #3054. Would you need Tables.rows(df) and Tables.colums(df) to be a valid Tables.jl table, or it is enough for you that they are iterators?

svilupp · 2023-03-13T08:02:32Z

Wow! So fast! Thank you, @bkamins!

This fixes apache/arrow-julia#403

Missigness is now retained through Arrow serialization and the test case succeeds:

df = Tables.rowtable((; x1 =["a","b",missing,"c"], x2 = 1:4)) |> DataFrame
Arrow.write(fn, Iterators.partition(Tables.rows(df), 2); compress = nothing);
t=Arrow.Table(fn)
t[:x1]|>eltype  
# Union{Missing, String}

Out of curiosity, it's not entirely clear what makes the default fallback lose the type information given that it's views. Is it this line that fixed it?

It would be awesome if this could be a patch, because there are other partitioning use cases where people rely on the general fallback for Tables-sources Iterators.partition(Tables.rows(df),chunksize) and DataFrames are arguably the most popular source.
I think there are other packages affected as well (eg, here)

The broader context is that most basic users aren't aware that they need to partition their datasets manually to leverage threading in Arrow (which makes Arrow pretty slow in comparison with PyArrow / Polars, see here).

I hope to change that by chunking data by default (PR here), which is something that PyArrow does automatically (reference).
For the automatic chunking to work, we'd need to have a consistent chunking method for all Tables-sources (hence, the need for the above call to work). I appreciate that we could add some specializations by type but that would required heavier dependencies.

bkamins · 2023-03-13T08:02:35Z

CI error on nightly is unrelated (it is a problem with compilation time and I cannot reproduce it locally)

svilupp · 2023-03-13T08:15:04Z

@svilupp - a related decision is #3054. Would you need Tables.rows(df) and Tables.colums(df) to be a valid Tables.jl table, or it is enough for you that they are iterators?

For the Arrow.write use case, I don't think we need more than what we have (take it with a pinch of salt -- it's my first week of getting to know Arrow.jl code base).
The only Tables methods I see used are: partitions, columns (and Columns), schema (and Schema), rowcount, columnnames, getcolumn and eachcolumn.

I think rowcount is the equivalent of nrow (mentioned in the issue 3054), but it already has a pretty decent fallback

Perhaps @quinnj is better suited to provide his opinion here?

bkamins · 2023-03-13T08:18:45Z

Out of curiosity, it's not entirely clear what makes the default fallback lose the type information given that it's views.

No, the reason was that previously we relied on Julia inferring the type of elements of the collection by inspecting them (and a narrowest eltype was therefore picked). Now we pass object that carries over schema information.

Aside. The pattern:

Iterators.partition(Tables.rows(df),chunksize)

does not seem to be fully correct. The point is that Tables.rows contract does not guarantee to return a valid Tables.jl table. It only guarantees to return an AbstractRow-compatible iterator , which is a weaker requirement. However, @quinnj probably should verify this claim. However, this is the reason of your problem.

This is the situation before this PR:

julia> isnothing(Tables.schema(first(Iterators.partition(eachrow(df), 2))))
true

what the PR changes is that the above statement returns a schema of the underlying data frame (as this is doable, however, I believe that Tables.jl contract does not guarantee this).

src/abstractdataframe/iteration.jl

Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>

bkamins · 2023-04-08T15:39:11Z

Thank you!

add Iterators.partition for DataFrameRows

530f7e2

bkamins requested a review from nalimilan March 12, 2023 20:16

bkamins added the feature label Mar 12, 2023

bkamins added this to the patch milestone Mar 12, 2023

nalimilan approved these changes Apr 7, 2023

View reviewed changes

src/abstractdataframe/iteration.jl Outdated Show resolved Hide resolved

src/abstractdataframe/iteration.jl Outdated Show resolved Hide resolved

src/abstractdataframe/iteration.jl Outdated Show resolved Hide resolved

bkamins and others added 2 commits April 8, 2023 13:29

Apply suggestions from code review

8f3b475

Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>

Merge branch 'main' into bk/partition_DataFrameRows

b29825a

bkamins merged commit 23a28b1 into main Apr 8, 2023
6 of 7 checks passed

bkamins deleted the bk/partition_DataFrameRows branch April 8, 2023 15:39

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

add Iterators.partition for DataFrameRows #3299

add Iterators.partition for DataFrameRows #3299

bkamins commented Mar 12, 2023

bkamins commented Mar 12, 2023

svilupp commented Mar 13, 2023

bkamins commented Mar 13, 2023

svilupp commented Mar 13, 2023

bkamins commented Mar 13, 2023 •

edited

bkamins commented Apr 8, 2023

add Iterators.partition for DataFrameRows #3299

add Iterators.partition for DataFrameRows #3299

Conversation

bkamins commented Mar 12, 2023

bkamins commented Mar 12, 2023

svilupp commented Mar 13, 2023

bkamins commented Mar 13, 2023

svilupp commented Mar 13, 2023

bkamins commented Mar 13, 2023 • edited

bkamins commented Apr 8, 2023

bkamins commented Mar 13, 2023 •

edited