Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Parquet Writer #66

Merged
merged 50 commits into from
May 23, 2020
Merged

Parquet Writer #66

merged 50 commits into from
May 23, 2020

Conversation

xiaodaigh
Copy link
Contributor

No description provided.

@xiaodaigh xiaodaigh closed this May 13, 2020
@xiaodaigh xiaodaigh reopened this May 13, 2020
Copy link
Member

@tanmaykm tanmaykm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for initiating this. This seems like a good start. Any idea why the CI has failed?
I had a quick look and have added a few suggestions.
Will be looking at it more in the next few days.

src/writer.jl Outdated Show resolved Hide resolved
README.md Show resolved Hide resolved
src/writer.jl Outdated Show resolved Hide resolved
src/writer.jl Outdated Show resolved Hide resolved
@xiaodaigh
Copy link
Contributor Author

xiaodaigh commented May 14, 2020

Thanks for initiating this. This seems like a good start. Any idea why the CI has failed?
I had a quick look and have added a few suggestions.
Will be looking at it more in the next few days.

Seems to be failing due to nonmissingtype which is not available in Base prior to Julia 1.3. Added. Hope no further issues. Missings.jl is added to deps for this reason.

I added CategoricalyArrays.jl purely so that I can use it to tell users it's not supported, please let me know if you think it's ok.

Update:
It's passing all tests now.

@xiaodaigh xiaodaigh mentioned this pull request May 14, 2020
src/writer.jl Outdated Show resolved Hide resolved
src/writer.jl Outdated Show resolved Hide resolved
src/writer.jl Outdated Show resolved Hide resolved
src/writer.jl Outdated Show resolved Hide resolved
src/writer.jl Outdated Show resolved Hide resolved
src/writer.jl Show resolved Hide resolved
src/writer.jl Outdated Show resolved Hide resolved
@xiaodaigh
Copy link
Contributor Author

Resolved conflict with master

@xiaodaigh
Copy link
Contributor Author

Broke by #81 #82

Tried to accommodate #81 but #82 is definitely a bug, as R and Python can read it OK.

test/test_writer.jl Outdated Show resolved Hide resolved
@tanmaykm
Copy link
Member

I feel we should have a write API that allows data to be written iteratively. Something like:

parwriter = ParFileWriter(path)

for chunk in table_chunks # chunk being a table itself
    write(parwriter, chunk)
end

Each iteration will add a row group to the file. And the file metadata can be updated at the end of the iteration. Looks like the underlying functions are already there here. So it should not be difficult. What do you think?

@xiaodaigh
Copy link
Contributor Author

What do you think?

I am ok with that in addition to an API for writing a full table. Cos the table_chunks have to come from somewhere, so for ppl to use it they would need a way to chunk tables, which is rarely done. I'd say 90% of use cases are where the user just provides a Table so it's most straightforward to just write it.

How about this? We merge this one first, if no objection to the write API as it is, then in a future release we write this chunk based writer (to cover what I guess is just 10% of use cases).

@xiaodaigh
Copy link
Contributor Author

xiaodaigh commented May 22, 2020

More serious issue is that tests broke by #82

src/Parquet.jl Outdated Show resolved Hide resolved
test/test_writer.jl Outdated Show resolved Hide resolved
@xiaodaigh
Copy link
Contributor Author

Merged latest changes from master and passing all tests. If there are no further changes requested from the writer, can I suggest we merge this. And implement the chunk based writer in the next release?

@tanmaykm
Copy link
Member

👍 will merge in a bit

@tanmaykm tanmaykm merged commit 6f445a8 into JuliaIO:master May 23, 2020
@tanmaykm
Copy link
Member

Oops, should have squashed and merged. Maybe I should rebase on the master, else it will become too unwieldy?

@tanmaykm
Copy link
Member

doesn't look possible now

@xiaodaigh
Copy link
Contributor Author

xiaodaigh commented May 23, 2020

Maybe I should rebase on the master, else it will become too unwieldy?

Feel free. To me it's not that big a deal. But it can be harder to pin down an issue

@xiaodaigh xiaodaigh deleted the xiaodaigh/parquet-writer branch May 23, 2020 02:49
tanmaykm pushed a commit that referenced this pull request May 24, 2020
tanmaykm pushed a commit that referenced this pull request May 24, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants