add parquet-support #445
Conversation
hydromt/data_catalog.py
Generally this PR looks great! Just a quick first comment. I'd prefer not to deduce drivers from file extensions in this part of the code. If that's something we want, I suggest making a driver-extension mapping per adapter in each of the data_adapter scripts rather than here. In that case we would not have a default driver, but rather deduce the driver if not provided by the user. Also, the parquet driver is not available for the RasterDataset and GeoDataset adapters.
I also don't necessarily like trying to guess drivers; I did it here because that's what most of the other code did. I'll switch it over to using a driver argument, which I agree is nicer.
With your last comment, do you mean that you'd like parquet support added for those adapters, that it needs to be better documented, or that this is a blocker for the PR?
I mean that you have implemented parquet for the GeoDataFrame and DataFrame adapters, which are the correct places. For the RasterDataset and GeoDataset adapters I don't think parquet can be used, and you also haven't implemented it there. However, in the DataCatalog you do check the file extension in the get_rasterdataset and get_geodataset methods. As mentioned before, I think it's best not to include this in any DataCatalog method, but rather to handle it on the DataAdapter side if we choose to infer the driver from the extension.
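For illustration, a minimal sketch of the approach suggested here: a per-adapter extension-to-driver mapping that is consulted only when the user has not passed an explicit driver. The mapping contents, function name, and error handling below are hypothetical and not part of the hydromt codebase.

```python
from os.path import splitext
from typing import Optional

# Hypothetical per-adapter mapping from file extension to driver name.
# A real implementation would live in the corresponding data_adapter script.
_EXTENSION_DRIVERS = {
    ".csv": "csv",
    ".parquet": "parquet",
    ".xlsx": "excel",
}


def resolve_driver(path: str, driver: Optional[str] = None) -> str:
    """Return the driver to use for ``path``.

    An explicitly provided driver always wins; otherwise the driver is
    deduced from the file extension, raising an error if it is unknown.
    """
    if driver is not None:
        return driver
    ext = splitext(path)[1].lower()
    try:
        return _EXTENSION_DRIVERS[ext]
    except KeyError:
        raise ValueError(
            f"Cannot deduce driver from extension {ext!r}; "
            "please provide the driver explicitly."
        )
```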
@DirkEilander As far as I know, all the issues you mentioned on this one are resolved. Could you let me know if I missed anything, or otherwise approve?
LGTM!
Issue addressed
Fixes #390
Explanation
Parquet is a much faster, cloud-optimized tabular data format that supports additional functionality such as lazy loading, metadata scanning, and compression. It is widely used in data science, so besides being generally useful for users, supporting it will also help significantly with moving to the cloud and improving performance. Many other libraries support it as well, which opens up further possibilities, but that is a separate issue. The implementation is very simple: pandas itself supports reading parquet, so we just use that. Basically, everywhere that csv is supported, parquet should be supported as well. The only difference is that parquet does not need to be told to parse datetimes, because that information is stored in the file metadata itself, so in some places the kwargs had to be adjusted slightly; other than that, everything remains the same.
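A short sketch of the kwargs difference mentioned above, using plain pandas calls; the file and column names are hypothetical:

```python
import pandas as pd

# csv stores everything as text, so datetime columns must be parsed explicitly
# via the parse_dates kwarg.
df_csv = pd.read_csv("timeseries.csv", parse_dates=["time"], index_col="time")

# parquet stores column types in the file metadata, so no parse_dates kwarg is
# needed and datetime columns come back with the correct dtype.
df_parquet = pd.read_parquet("timeseries.parquet")
```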
Checklist
Additional Notes (optional)
As usual, I forgot to update the documentation; let me fix that right now.