Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add parquet-support #445

Merged
merged 10 commits into from
Jul 31, 2023
Merged

add parquet-support #445

merged 10 commits into from
Jul 31, 2023

Conversation

savente93
Copy link
Contributor

@savente93 savente93 commented Jul 17, 2023

Issue addressed

Fixes #390

Explanation

Parquet is a much faster and cloud optimized tabular data format, which supports much more functionalities such as lazy loading, metadata scanning and compression. It's quite widely used in data science so aside from it being good to support it for users, it will also significantly help moving to the cloud and improving performance. There are also a lot of other libraries that support it, opening other possibilities but that is a separate issue. The implementation is very simple, pandas itself support parquet reading, so we just use that. Basically everywhere that csv is supported parquet should be supported as well. The only difference is that parquet doesn't need to be told to parse date times because that information is stored in the actual file metadata, so in some places the kwargs had to be adjusted slightly, but other than that, everything remains the same.

Checklist

  • Updated tests or added new tests
  • Branch is up to date with main
  • Tests & pre-commit hooks pass
  • Updated documentation if needed
  • Updated changelog.rst if needed

Additional Notes (optional)

As usual I forgot to update the documentation, let me just fix that right now.

@savente93 savente93 marked this pull request as ready for review July 17, 2023 09:09
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally this PR looks great! Just a quick first comment. I'd prefer not to deduct drivers based on extensions in this part of the code. If that's something we want I suggest to make a drivers extension mapping per adapter in each of the data_adapter scripts rather than here. In that case we would not have a default driver, but rather deduct the driver if not provided by the user. Also the parquet drivers is not available for RaterDataset and GeoDataset adapters.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also don't necessarily like trying to guess drivers, the reason I did it here is that that was what most of the other code did. I'll switch it over to using a driver argument, which I agree is nicer.

Do you mean with your last comment that you'd like parquet support added for those adapters or that it needs to be better documented or that that is a blocker to add this?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean that you have implemented parquet for geodataframe and dataframe adaptors, which are the correct places. For rasterdatasets and geodataframes I don't think parquet can be used and you also haven't implemented it. However in the DataCatalog you do check the file extension also in get_rasterdataset and get_geodataset methods. But as mentioned before I think it's best to not include it in any DataCatalog method, but rather at the DataAdapter side if we choose to infer the driver from the extension.

@savente93
Copy link
Contributor Author

@DirkEilander As far as I know all the issues on this one you mentioned are resolved. Could you let me know if I missed anything or else approve?

Copy link
Contributor

@DirkEilander DirkEilander left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@savente93 savente93 merged commit 50e50b5 into main Jul 31, 2023
7 checks passed
@savente93 savente93 deleted the parquet-support branch July 31, 2023 14:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support parquet as tabular data driver
2 participants