ENH: Allow to change schema and/or metadata in to_parquet #3182

m-mohr · 2024-02-09T17:47:59Z

Is your feature request related to a problem?

I'd like to provide a custom schema to to_parquet, mostly for adding metadata, but I could also see myself tweaking the column data types.

Describe the solution you'd like

Example:

  import json

  import pyarrow as pa
  from geopandas import GeoDataFrame

  # `features` and `output_file` are defined elsewhere
  pq_schema = pa.schema([
    pa.field("example", pa.string())
  ])
  pq_schema = pq_schema.with_metadata({"example": json.dumps({"foo": "bar"}).encode("utf-8")})
  data = GeoDataFrame.from_features(features, columns=["example"], index=False)
  data.to_parquet(output_file, schema=pq_schema)

This returns an error:

TypeError: ParquetWriter.__init__() got multiple values for argument 'schema'

(Maybe this is a bug, but felt more like a feature request.)
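For context, this kind of error typically appears when a wrapper already passes the table's schema to the writer positionally and then forwards the caller's keyword arguments on top. The sketch below reproduces the pattern with stand-in names only: FakeParquetWriter, write_table and try_write are hypothetical and are not the geopandas or pyarrow implementation.

```python
class FakeParquetWriter:
    """Hypothetical stand-in for a writer whose constructor takes `schema`."""

    def __init__(self, where, schema, **options):
        self.where = where
        self.schema = schema
        self.options = options


def write_table(table_schema, where, **kwargs):
    # The schema derived from the table is passed positionally,
    # and all user keyword arguments are forwarded unchanged.
    return FakeParquetWriter(where, table_schema, **kwargs)


def try_write(**kwargs):
    """Return the TypeError message, or None if the call succeeds."""
    try:
        write_table("schema-from-table", "out.parquet", **kwargs)
    except TypeError as exc:
        return str(exc)
    return None


print(try_write(compression="snappy"))   # no clash with `schema`
print(try_write(schema="user-schema"))   # `schema` is supplied twice
```

If that is indeed what happens here, supporting a user schema would need explicit handling before the keyword arguments are forwarded.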

I'd like a way to change schema and/or metadata in to_parquet, e.g. as proposed above.
I guess this may have some complications with regards to the existing schema created for the GeoDataFrame.

API breaking implications

I'd assume there are no breaking changes required.

Describe alternatives you've considered

Well, I guess the alternative is to copy the source code of to_parquet into my source code or to not use geopandas.

Additional context

None


martinfleis · 2024-02-11T13:53:15Z

How would you like this to work together with GeoParquet metadata requirements? We need to ensure these are correct, but I'd like to avoid writing some form of validation if a user passes a custom schema. Would you create it all by yourself? Or do you just want to alter parts of it, or extend it?

The example you show is tricky as we can't just use pq_schema since it is not valid for GeoParquet. I am not sure I have a clear idea of how this should work.

m-mohr · 2024-02-11T14:26:18Z

I don't want to change the geo metadata at all. I want to add my own key to the metadata with completely independent metadata. So a simple merge of the metadata would be fine for my use case. In the end it might have, for example, geo (unmodified), license and specification_version as keys.
(If the user schema contains a geo key it should probably override the geopandas geo key, but then it would certainly be up to the user to make sure this makes sense and works.)

I think similarly for the fields in pq_schema. The base should be what geopandas does currently and whatever is also available in the fields the user provides should override the corresponding fields created by geopandas schema.

I figured in the meantime that casting the columns can somewhat be used to change the field data types, so that's not so important, but that there's no way to add more to the Parquet metadata is really restricting.

Disclaimer: This thinking might be biased by my use case and it's good to hear other opinions.

jorisvandenbossche · 2024-02-12T10:16:37Z

> How would you like this to work together with GeoParquet metadata requirements?

We already add the geoparquet "geo" metadata to the schema right now as well, which at the moment is created by pyarrow in the Table.from_pandas call, and thus will have "pandas" metadata as well (to which we add the "geo" metadata):

geopandas/geopandas/io/arrow.py

Lines 269 to 278 in 0898e66

    
    df = df.to_wkb(**kwargs)

    table = Table.from_pandas(df, preserve_index=index)

    # Store geopandas specific file-level metadata
    # This must be done AFTER creating the table or it is not persisted
    metadata = table.schema.metadata
    metadata.update({b"geo": _encode_metadata(geo_metadata)})
    return table.replace_schema_metadata(metadata)

So I don't think it is really the problem that the user specified schema would already contain metadata (as long as it is not called "geo", or else that would get overwritten by geopandas).
In the above code snippet from _geopandas_to_arrow, we could pass a user specified schema to Table.from_pandas to allow the user to specify column types and metadata. And I certainly think it makes sense to let the user do that (pandas' to_parquet also does that).

I think the main problem is what to do with the geometry column(s)? If a user specifies a schema, do we assume it already contains the geometry columns as well? But in that case, the user needs to know that it should (at the moment, with WKB encoding) be specified as variable size binary. Or do we assume it is the schema for all non-geometry columns, and we add fields for the geometry columns to that schema?

m-mohr · 2024-02-12T11:09:41Z

Thanks!

If the geo key gets overwritten, I'd assume the geometry column(s) also get overwritten, based on their name.

The other reasonable alternative would be to not override if geo metadata or a geometry column is available, thus making the user fully responsible for these actions.

> And I certainly think it makes sense to let the user do that (pandas' to_parquet also does that).

Good to know!

m-mohr added the enhancement label Feb 9, 2024
