add parquet-support #445
Conversation
hydromt/data_catalog.py
Generally this PR looks great! Just a quick first comment. I'd prefer not to deduce drivers from file extensions in this part of the code. If that's something we want, I suggest making a driver-extension mapping per adapter in each of the data_adapter scripts rather than here. In that case we would not have a default driver, but rather deduce the driver if not provided by the user. Also, the parquet driver is not available for the RasterDataset and GeoDataset adapters.
I also don't necessarily like trying to guess drivers; I did it here because that's what most of the other code did. I'll switch it over to using a driver argument, which I agree is nicer.
With your last comment, do you mean that you'd like parquet support added for those adapters, that it needs to be better documented, or that this is a blocker for the PR?
I mean that you have implemented parquet for the GeoDataFrame and DataFrame adapters, which are the correct places. For the RasterDataset and GeoDataset adapters I don't think parquet can be used, and you also haven't implemented it there. However, in the DataCatalog you do check the file extension in the get_rasterdataset and get_geodataset methods. As mentioned before, I think it's best not to include this in any DataCatalog method, but rather to handle it on the DataAdapter side if we choose to infer the driver from the extension.
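For illustration, a minimal sketch of the approach suggested here: a per-adapter extension-to-driver mapping that is consulted only when the user has not passed an explicit driver. The mapping contents, function name, and error handling below are hypothetical and not part of the hydromt codebase.

```python
from os.path import splitext
from typing import Optional

# Hypothetical per-adapter mapping from file extension to driver name.
# A real implementation would live in the corresponding data_adapter script.
_EXTENSION_DRIVERS = {
    ".csv": "csv",
    ".parquet": "parquet",
    ".xlsx": "excel",
}


def resolve_driver(path: str, driver: Optional[str] = None) -> str:
    """Return the driver to use for ``path``.

    An explicitly provided driver always wins; otherwise the driver is
    deduced from the file extension, raising an error if it is unknown.
    """
    if driver is not None:
        return driver
    ext = splitext(path)[1].lower()
    try:
        return _EXTENSION_DRIVERS[ext]
    except KeyError:
        raise ValueError(
            f"Cannot deduce driver from extension {ext!r}; "
            "please provide the driver explicitly."
        )
```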
@DirkEilander As far as I know, all the issues you mentioned on this one are resolved. Could you let me know if I missed anything, or otherwise approve?
LGTM!
Issue addressed
Fixes #390
Explanation
Parquet is a much faster, cloud-optimized tabular data format that supports additional functionality such as lazy loading, metadata scanning, and compression. It is widely used in data science, so besides being generally useful for users, supporting it will also help significantly with moving to the cloud and improving performance. Many other libraries support it as well, which opens up further possibilities, but that is a separate issue. The implementation is very simple: pandas itself supports reading parquet, so we just use that. Basically, everywhere that csv is supported, parquet should be supported as well. The only difference is that parquet does not need to be told to parse datetimes, because that information is stored in the file metadata itself, so in some places the kwargs had to be adjusted slightly; other than that, everything remains the same.
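A short sketch of the kwargs difference mentioned above, using plain pandas calls; the file and column names are hypothetical:

```python
import pandas as pd

# csv stores everything as text, so datetime columns must be parsed explicitly
# via the parse_dates kwarg.
df_csv = pd.read_csv("timeseries.csv", parse_dates=["time"], index_col="time")

# parquet stores column types in the file metadata, so no parse_dates kwarg is
# needed and datetime columns come back with the correct dtype.
df_parquet = pd.read_parquet("timeseries.parquet")
```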
Checklist
Additional Notes (optional)
As usual, I forgot to update the documentation; let me fix that right now.