ENH: Allow to change schema and/or metadata in to_parquet #3182

Open
m-mohr opened this issue Feb 9, 2024 · 4 comments

Comments

@m-mohr

m-mohr commented Feb 9, 2024

Is your feature request related to a problem?

I'd like to provide a custom schema to to_parquet, mostly for adding metadata, but I could also see myself tweaking the column data types.

Describe the solution you'd like

Example:

  import json
  import pyarrow as pa
  from geopandas import GeoDataFrame

  pq_schema = pa.schema([
    pa.field("example", pa.string())
  ])
  pq_schema = pq_schema.with_metadata({"example": json.dumps({"foo": "bar"}).encode("utf-8")})
  data = GeoDataFrame.from_features(features, columns=["example"], index=False)
  data.to_parquet(output_file, schema=pq_schema)

This returns an error:

TypeError: ParquetWriter.__init__() got multiple values for argument 'schema'

(Maybe this is a bug, but felt more like a feature request.)

I'd like a way to change schema and/or metadata in to_parquet, e.g. as proposed above.
I guess this may have some complications with regards to the existing schema created for the GeoDataFrame.

API breaking implications

I'd assume there are no breaking changes required.

Describe alternatives you've considered

Well, I guess the alternative is to copy the source code of to_parquet into my source code or to not use geopandas.

Additional context

None

@martinfleis
Member

How would you like this to work together with the GeoParquet metadata requirements? We need to ensure these are correct, but I'd like to avoid writing some form of validation for the case where a user passes a custom schema. Would you create it all by yourself? Or do you just want to alter parts of it, or extend it?

The example you show is tricky, as we can't just use pq_schema since it is not valid for GeoParquet. I am not sure I have a clear idea of how this should work.

@m-mohr
Author

m-mohr commented Feb 11, 2024

I don't want to change the geo metadata at all. I want to add my own key to the metadata, with completely independent metadata. So a simple merge of the metadata would be fine for my use case. In the end it might, for example, have geo (unmodified), license and specification_version as keys.
(If the user schema contains a geo key, it should probably override the geopandas geo key, but then it would certainly be up to the user to make sure this makes sense and works.)
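
The merge described here could be sketched with plain dictionaries standing in for Parquet schema metadata, which pyarrow exposes as a mapping from bytes keys to bytes values. This is only an illustration: `merge_schema_metadata` and its `user_may_override_geo` flag are hypothetical names that do not exist in geopandas, and the license and version values are made up.

```python
import json


def merge_schema_metadata(generated, user, user_may_override_geo=False):
    """Merge user-supplied metadata into the generated metadata.

    Both arguments map bytes keys to bytes values. On a key clash the
    generated value wins, except for b"geo" when the flag is set.
    """
    merged = dict(user)
    for key, value in generated.items():
        if key == b"geo" and user_may_override_geo and key in user:
            continue  # the user takes responsibility for a valid "geo" entry
        merged[key] = value
    return merged


# Placeholder value: a real "geo" entry follows the GeoParquet specification.
generated = {b"geo": json.dumps({"primary_column": "geometry"}).encode("utf-8")}
user = {
    b"license": b"CC-BY-4.0",
    b"specification_version": b"1.0.0",
}

merged = merge_schema_metadata(generated, user)
print(sorted(merged))
```

With the flag left at its default, a user-supplied geo key is silently replaced by the generated one; setting it to True corresponds to the parenthetical remark above.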

I think the same should apply to the fields in pq_schema. The base should be what geopandas currently generates, and any field that the user also provides should override the corresponding field in the schema created by geopandas.

In the meantime I figured out that casting the columns can, to some extent, be used to change the field data types, so that part is less important. But having no way to add more to the Parquet metadata is really restrictive.

Disclaimer: This thinking might be biased by my use case and it's good to hear other opinions.

@jorisvandenbossche
Member

> How would you like this to work together with GeoParquet metadata requirements?

We already add the geoparquet "geo" metadata to the schema right now as well, which at the moment is created by pyarrow in the Table.from_pandas call, and thus will have "pandas" metadata as well (to which we add the "geo" metadata):

df = df.to_wkb(**kwargs)
table = Table.from_pandas(df, preserve_index=index)
# Store geopandas specific file-level metadata
# This must be done AFTER creating the table or it is not persisted
metadata = table.schema.metadata
metadata.update({b"geo": _encode_metadata(geo_metadata)})
return table.replace_schema_metadata(metadata)

So I don't think it is really the problem that the user specified schema would already contain metadata (as long as it is not called "geo", or else that would get overwritten by geopandas).
In the above code snippet from _geopandas_to_arrow, we could pass a user specified schema to Table.from_pandas to allow the user to specify column types and metadata. And I certainly think it makes sense to let the user do that (pandas' to_parquet also does that).

I think the main problem is what to do with the geometry column(s)? If a user specifies a schema, do we assume it already contains the geometry columns as well? But in that case, the user needs to know that it should (at the moment, with WKB encoding) be specified as variable size binary. Or do we assume it is the schema for all non-geometry columns, and we add fields for the geometry columns to that schema?

@m-mohr
Author

m-mohr commented Feb 12, 2024

Thanks!

If the geo key gets overwritten, I'd assume the geometry column(s) also get overwritten, based on their name.

The other reasonable alternative would be to not override if geo metadata or a geometry column is already present, thus making the user fully responsible for these choices.

> And I certainly think it makes sense to let the user do that (pandas' to_parquet also does that).

Good to know!
