ENH: Allow to change schema and/or metadata in to_parquet #3182
Comments
How would you like this to work together with the GeoParquet metadata requirements? We need to ensure these are correct, but I'd like to avoid writing some form of validation if a user passes a custom schema. Would you create it all by yourself? Or do you just want to alter parts of it, or extend it? The example you show is tricky as we can't just use |
I don't want to change the geo metadata at all. I want to add my own keys to the metadata, with completely independent content, so a simple merge of the metadata would be fine for my use case. In the end it might have, for example, geo (unmodified), license and specification_version as keys. I think the same applies to the fields in pq_schema: the base should be what geopandas does currently, and any field the user also provides should override the corresponding field in the schema created by geopandas. In the meantime I figured out that casting the columns can more or less be used to change the field data types, so that part is not so important, but having no way to add more to the Parquet metadata is really restricting. Disclaimer: this thinking might be biased by my use case, and it's good to hear other opinions.
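To make the "simple merge" above concrete, here is a minimal sketch in plain Python. It is not geopandas code: `merge_schema_metadata` is a hypothetical helper, and the "geo" payload is a made-up placeholder. It relies only on the fact that Parquet key-value metadata, as exposed by pyarrow, maps bytes keys to bytes values.

```python
import json

GEO_KEY = b"geo"


def _to_bytes(value):
    """Encode str (or other) values as UTF-8 bytes; pass bytes through."""
    return value if isinstance(value, bytes) else str(value).encode("utf-8")


def merge_schema_metadata(generated, user):
    """Hypothetical helper: add user keys to the generated metadata.

    The "geo" key always keeps the generated value, so a user-supplied
    "geo" entry is silently ignored in this sketch.
    """
    merged = dict(generated)
    for key, value in user.items():
        key_b = _to_bytes(key)
        if key_b == GEO_KEY:
            continue  # never let user metadata replace the geo entry
        merged[key_b] = _to_bytes(value)
    return merged


# Placeholder for what the writer would generate (illustrative content only).
generated = {
    b"geo": json.dumps({"primary_column": "geometry"}).encode("utf-8"),
}
# User metadata as described in the comment above.
user = {"license": "CC-BY-4.0", "specification_version": "0.1", "geo": "ignored"}

merged = merge_schema_metadata(generated, user)
print(sorted(merged))
```

Whether a user-supplied "geo" key should be ignored, rejected with an error, or allowed to win is exactly the open design question in this thread; the sketch picks "ignore" only to keep the example short.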
We already add the GeoParquet "geo" metadata to the schema right now as well, which at the moment is created by pyarrow (see geopandas/geopandas/io/arrow.py, lines 269 to 278 at 0898e66).
So I don't think the real problem is that the user-specified schema would already contain metadata (as long as the key is not called "geo", or else it would get overwritten by geopandas). I think the main problem is what to do with the geometry column(s). If a user specifies a schema, do we assume it already contains the geometry columns as well? In that case, the user needs to know that they should (at the moment, with WKB encoding) be specified as variable-size binary. Or do we assume it is the schema for all non-geometry columns, and we add fields for the geometry columns to that schema?
Thanks! If the geo key gets overwritten, I'd assume the geometry column(s) also get overwritten, based on their name. The other reasonable alternative would be not to override anything if geo metadata or a geometry column is present, making the user fully responsible for these choices.
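The two alternatives discussed above can be compared with a small sketch. It models a schema as a plain dict of column name to type name rather than a real pyarrow schema; `merge_fields` and the `user_owns_geometry` flag are hypothetical names for illustration, not part of geopandas or pyarrow.

```python
def merge_fields(generated, user, geometry_columns, user_owns_geometry=False):
    """Hypothetical helper: override generated fields with user fields by name.

    With user_owns_geometry=False, geometry columns always keep the
    generated type. With user_owns_geometry=True, the user's definition
    wins and the user is responsible for getting it right.
    """
    merged = dict(generated)
    for name, dtype in user.items():
        if name in geometry_columns and not user_owns_geometry:
            continue  # geometry fields stay as generated
        merged[name] = dtype
    return merged


# Illustrative schemas: type names are plain strings, not pyarrow types.
generated = {"name": "string", "population": "int64", "geometry": "binary"}
user = {"population": "int32", "geometry": "string"}

strict = merge_fields(generated, user, {"geometry"})
lenient = merge_fields(generated, user, {"geometry"}, user_owns_geometry=True)

print(strict)
print(lenient)
```

In the strict variant only `population` changes; in the lenient variant the geometry field is taken from the user as well, which mirrors the "user is fully responsible" option.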
Good to know!
Is your feature request related to a problem?
I'd like to provide a custom schema to to_parquet, mostly for adding metadata, but I could also see myself tweaking the column data types.
Describe the solution you'd like
Example:
This returns an error:
(Maybe this is a bug, but felt more like a feature request.)
I'd like a way to change schema and/or metadata in to_parquet, e.g. as proposed above.
I guess this may cause some complications with regard to the existing schema created for the GeoDataFrame.
API breaking implications
I'd assume there are no breaking changes required.
Describe alternatives you've considered
Well, I guess the alternative is to copy the source code of to_parquet into my own code, or not to use geopandas.
Additional context
None