@bkamins
Member

@bkamins bkamins commented Oct 30, 2022

Having this function would be convenient in cases like https://discourse.julialang.org/t/arrow-stream-usage-clarification/89508.

This PR starts the addition of new functionality for the 1.5 release. If patches to the 1.4.x releases are needed, I will manage them in a separate branch.

@bkamins bkamins requested a review from nalimilan October 30, 2022 09:21
@bkamins bkamins added this to the 1.5 milestone Oct 30, 2022
@bkamins
Member Author

bkamins commented Oct 30, 2022

The CI failure is unrelated to this PR.

Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>
@bkamins
Member Author

bkamins commented Nov 2, 2022

Thank you!

I will hold off on merging this. First I will finalize #3213 and make a patch release (the problem that #3213 fixes is pretty common in combination with CSV.jl).

Member

@quinnj quinnj left a comment

This is a great addition!

@bkamins
Member Author

bkamins commented Nov 2, 2022

@quinnj - as you probably know, I want to add it to allow easy splitting of a data frame into record batches for Arrow.jl.

@quinnj
Member

quinnj commented Nov 2, 2022

Yeah, I saw the Discourse post. I wonder if we should add a note about Tables.partitions as well, i.e. that Arrow.write(filename, Tables.partitions(Iterators.partition(df, n))) will work. Is that the recommended invocation you have in mind?

@bkamins
Member Author

bkamins commented Nov 2, 2022

Ah - you are right. I need to change the implementation so that an Iterators.PartitionIterator is returned, as I see that Tables.partitions does not recognize an arbitrary Generator but requires that specific type. Thank you!
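To illustrate the distinction above, here is a minimal sketch. It assumes DataFrames.jl 1.5 or later (where Iterators.partition accepts a data frame) and relies on my understanding that Tables.partitions passes an Iterators.PartitionIterator through unchanged and wraps any other object in a one-element tuple. The column name `a` and the chunk size are made up for illustration.

```julia
using DataFrames, Tables

df = DataFrame(a = 1:6)   # hypothetical example data

# Iterators.partition on a data frame yields an Iterators.PartitionIterator
it = Iterators.partition(df, 2)
it_passed_through = Tables.partitions(it) === it

# A hand-written generator of row-chunk views is not recognized as partitioned;
# the generic fallback treats the whole object as a single partition
gen = (view(df, r, :) for r in Iterators.partition(1:nrow(df), 2))
gen_wrapped = Tables.partitions(gen) isa Tuple
```

Under these assumptions, only the first form is seen by Tables.jl consumers as a sequence of separate partitions.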

@bkamins
Member Author

bkamins commented Nov 2, 2022

After this change it will be enough to write Arrow.write(filename, Iterators.partition(df, n)), as Arrow.write calls Tables.partitions internally anyway.
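A minimal sketch of that invocation, assuming DataFrames.jl 1.5 or later together with Arrow.jl. The column names, the chunk size of 4 and the temporary file path are all illustrative choices, not taken from the PR.

```julia
using DataFrames, Arrow

df = DataFrame(id = 1:10, value = collect(0.5:0.5:5.0))   # hypothetical data

# Split df into consecutive chunks of at most 4 rows; each chunk is a view of df
parts = Iterators.partition(df, 4)

# Arrow.write should treat each chunk as a separate record batch
path = tempname() * ".arrow"
Arrow.write(path, parts)

# Reading the file back gives a single table spanning all record batches
df2 = DataFrame(Arrow.Table(path))
```

Because the chunks are views, partitioning should not copy the data frame before it is written.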

@quinnj
Member

quinnj commented Nov 2, 2022

> After this change it will be enough to write Arrow.write(filename, Iterators.partition(df, n)), as Arrow.write calls Tables.partitions internally anyway.

Ah yes, you're right. Nice; that's very simple/clean.

@bkamins
Member Author

bkamins commented Nov 2, 2022

OK - I have made the changes and tested that Arrow.write works as expected. Thank you!

@nalimilan - as usual (no rush) - it would be great if you could take another look at the PR.

bkamins and others added 3 commits November 23, 2022 10:49
@bkamins
Member Author

bkamins commented Nov 26, 2022

Thank you! I will merge this after we make the DataFrames.jl 1.4.4 release.

@bkamins bkamins merged commit 2bbcd57 into main Dec 2, 2022
@bkamins bkamins deleted the bk/partition branch December 2, 2022 10:34
@bkamins
Member Author

bkamins commented Dec 2, 2022

Thank you!

4 participants