
ENH: (Geo)Arrow interoperability & Arrow PyCapsule Interface #3156

Open · kylebarron opened this issue Jan 24, 2024 · 5 comments

@kylebarron (Contributor)

Is your feature request related to a problem?

In light of pandas-dev/pandas#56587, it would be great if GeoPandas supported the Arrow PyCapsule Interface for interoperability with GeoArrow, whether for reading, writing, or both.

Describe the solution you'd like

geopandas.read_parquet uses an internal _arrow_to_geopandas function, and GeoDataFrame.to_parquet uses an internal _geopandas_to_arrow function. It would be nice to have a public API for exporting to, and possibly importing from, GeoArrow data.

Now that pandas has added an __arrow_c_stream__ dunder (pandas-dev/pandas#56587), I think the best public API for exporting GeoArrow data is to add an __arrow_c_stream__ method to GeoDataFrame as well.

Exporting data is simpler than importing data because third-party data is not guaranteed to have WKB-encoded geometries and shapely does not yet support arbitrary GeoArrow geometries (ref shapely/shapely#1953).

One question is whether the data export should always use WKB, for simplicity and interoperability, or should try to use to_ragged_array where possible. In geoarrow/geoarrow-rs#477 I found that converting a GeoDataFrame to GeoArrow via to_ragged_array could be up to 4x faster than converting to WKB and then parsing that WKB.
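For concreteness, here is a minimal sketch of the two export paths, using shapely's real to_wkb and to_ragged_array APIs; the tiny GeoDataFrame and the bare pyarrow wrapping are only illustrative, not the proposed implementation:

```python
import geopandas as gpd
import pyarrow as pa
import shapely
from shapely.geometry import Point

gdf = gpd.GeoDataFrame({"name": ["a", "b"]}, geometry=[Point(0, 0), Point(1, 1)])

# Path 1: always WKB -- simple, and every GeoArrow consumer can read it.
wkb_array = pa.array(shapely.to_wkb(gdf.geometry.values))

# Path 2: native GeoArrow coordinates via ragged arrays -- skips the WKB
# round-trip, but raises when the geometry types are incompatible.
geom_type, coords, offsets = shapely.to_ragged_array(gdf.geometry.values)
```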

Importing data is more involved, due to the variety of encodings that GeoArrow supports. It may be best to leave that for a separate discussion.

API breaking implications

None.

Describe alternatives you've considered

This data conversion could be implemented in a separate library; however, I think there are good reasons to implement it here. The benefit of Arrow, and the PyCapsule Interface specifically, is that the ecosystem no longer needs N * M connectors (where every application writes direct support for every other application). E.g. in geoarrow/geoarrow-rs#477 and in lonboard I wrote special cases for GeoDataFrames, but if we implement __arrow_c_stream__ directly on GeoDataFrame, the need for special-casing GeoDataFrames for Arrow will lessen over time.

Outside of my own libraries, other beneficiaries would include GDAL. For example, if/when OSGeo/gdal#9132 and/or geopandas/pyogrio#314 are implemented, and those functions look for an __arrow_c_stream__ method on their input, you'd be able to pass a GeoDataFrame to them directly, without the user needing to think about conversions.

Additional context

@martinfleis (Member)

+1 on that. It would be a nice feature to have in 1.0 if we manage to get it in on time (by the end of March roughly).

@kylebarron (Contributor, Author)

The actual code would likely be quite simple. More challenging is coming to a consensus on a few open options, like: should the export always go through WKB?

@jorisvandenbossche (Member)

Thanks for opening the issue!

> More challenging is coming to a consensus on a few open options, like: should the export always go through WKB?

That's indeed the tricky question. I have long been thinking about improving the (Py)Arrow compatibility (but it seems I never opened an issue for this). Right now, a geopandas -> pyarrow conversion like pyarrow.table(gdf) fails because pyarrow doesn't know how to handle the geometry column (people using the pyarrow APIs with dataframes regularly bump into this, because for now we hide the conversion in our _arrow_to_geopandas helper). Even before the Arrow PyCapsule protocol, we could have fixed this long ago by implementing a GeometryArray.__arrow_array__ dunder (without the "c", so essentially the same, but returning a pyarrow object instead of a capsule).
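For illustration, a minimal sketch of what that dunder could look like with WKB output (the GeometryArray shown is a simplified stand-in, not geopandas' actual internals):

```python
import numpy as np
import pyarrow as pa
import shapely

class GeometryArray:
    """Simplified stand-in for geopandas' internal geometry array."""

    def __init__(self, geoms):
        self._data = np.asarray(geoms, dtype=object)

    def __arrow_array__(self, type=None):
        # pyarrow.array()/pyarrow.table() call this hook when they encounter
        # an unknown array-like; it must return a pyarrow Array or ChunkedArray.
        return pa.array(shapely.to_wkb(self._data), type=type or pa.binary())
```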

But one of the reasons I hesitated doing that was that I didn't know which conversion would be best to choose ... ;)

I am also thinking that, given there are multiple options, we should probably also create a public, user-facing method that does this Arrow conversion, where you can specify which type conversion you want. That way, you have an alternative for when the default in __arrow_c_stream__ isn't what you want.
For example, we could have a to_arrow() method with a geometry_encoding option, or something like that.
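As a rough sketch of how those two layers could relate (to_arrow and geometry_encoding are just the names floated in this thread, not an implemented geopandas API):

```python
import pyarrow as pa

class GeoDataFrame:  # sketch only: methods to be added to the real class
    def to_arrow(self, geometry_encoding="WKB") -> pa.Table:
        """Explicit conversion where the user picks the geometry encoding."""
        ...

    def __arrow_c_stream__(self, requested_schema=None):
        # The dunder simply exposes the default conversion over the C stream
        # interface; pyarrow.Table has supported __arrow_c_stream__ since
        # pyarrow 14.
        return self.to_arrow().__arrow_c_stream__(requested_schema)
```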

@jorisvandenbossche (Member)

@kylebarron In addition (though I don't know whether other libraries would actually make use of this), this could be a use case for requested_schema ... In theory you could check the schema, infer which columns are geometries, check their type, and, if it's not what you want, modify it and pass it back as requested_schema.
I suppose one difficulty here is also that, to be able to change a "geoarrow.wkb" type to one of the native types, you need to know the geometry type (i.e. whether they are points or linestrings or ...).
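To make that concrete, a rough sketch of a consumer-side requested_schema rewrite, assuming geometry columns are tagged with GeoArrow extension metadata on the field (request_wkb is a hypothetical helper, not an existing API):

```python
import pyarrow as pa

def request_wkb(schema: pa.Schema) -> pa.Schema:
    """Rewrite native GeoArrow fields to geoarrow.wkb in a schema request."""
    fields = []
    for field in schema:
        meta = field.metadata or {}
        if meta.get(b"ARROW:extension:name", b"").startswith(b"geoarrow."):
            field = pa.field(
                field.name, pa.binary(),
                metadata={b"ARROW:extension:name": b"geoarrow.wkb"},
            )
        fields.append(field)
    return pa.schema(fields)

# Usage: hand the rewritten schema back to the producer as a capsule, e.g.
# producer.__arrow_c_stream__(
#     requested_schema=request_wkb(schema).__arrow_c_schema__())
```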

@kylebarron (Contributor, Author)

> But one of the reasons I hesitated doing that was that I didn't know which conversion would be best to choose ... ;)

> I am also thinking that, given there are multiple options, we should probably also create a public, user-facing method that does this Arrow conversion, where you can specify which type conversion you want. That way, you have an alternative for when the default in __arrow_c_stream__ isn't what you want. For example, we could have a to_arrow() method with a geometry_encoding option, or something like that.

After a bit of reflection, I agree, and possibly we should hold off for a bit before implementing a default __arrow_c_stream__. Maybe first implement a to_arrow() method (to_geoarrow?) and then, as we see how the ecosystem progresses, advance to a default PyCapsule implementation. Or we could say that WKB is good enough for a default implementation now and (maybe?) change it later, though that would probably be a large change.

As a general question, where should the geopandas -> geoarrow conversion live? Given that any geoarrow integration is limited to shapely methods, and pyarrow is already an optional dependency, having an implementation in geopandas seems reasonable.

What about, as you said, an initial to_arrow function that defaults to WKB, with a geometry_encoding option and a wkb_fallback option? If the user passes geometry_encoding="geoarrow", it first tries to call shapely.to_ragged_array; if that errors and wkb_fallback is True, it continues with WKB encoding.
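A sketch of that fallback logic (encode_geometry and build_geoarrow_array are hypothetical names; shapely.to_ragged_array and shapely.to_wkb are the real APIs):

```python
import pyarrow as pa
import shapely

def encode_geometry(geoms, geometry_encoding="WKB", wkb_fallback=True):
    """Encode an array of shapely geometries as a pyarrow array."""
    if geometry_encoding.lower() == "geoarrow":
        try:
            geom_type, coords, offsets = shapely.to_ragged_array(geoms)
        except Exception:
            # e.g. mixed geometry types that to_ragged_array cannot represent
            if not wkb_fallback:
                raise
        else:
            # Success: build the native (nested list) GeoArrow array here.
            return build_geoarrow_array(geom_type, coords, offsets)  # hypothetical
    # Default path, and the fallback path: WKB-encoded binary.
    return pa.array(shapely.to_wkb(geoms))
```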

I've grown accustomed to working with Arrow metadata on the Field object instead of registered pyarrow extension types. I'd be content to use that sort of implementation here, which would avoid a dependency on geoarrow-pyarrow.
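For reference, that field-level tagging needs only plain pyarrow, with no registered extension type or geoarrow-pyarrow import:

```python
import pyarrow as pa

# Consumers that understand GeoArrow read "ARROW:extension:name" from the
# field metadata; everyone else just sees an ordinary binary column.
geometry_field = pa.field(
    "geometry", pa.binary(),
    metadata={"ARROW:extension:name": "geoarrow.wkb"},
)
schema = pa.schema([pa.field("name", pa.string()), geometry_field])
```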

(As an aside, does shapely.to_ragged_array error on the first incompatible geometry it finds? In some naive tests, an array of points with a polygon at index 0 seemed to error no faster than one with the polygon at the last index, but that testing wasn't scientific.)

Ideally, if/when shapely adds full geoarrow integration (shapely/shapely#1953), the wkb_fallback option could be removed and all geometry arrays converted to geoarrow.

In the longer term, options to customize the coordinate type (interleaved vs. separated) and row group size may be desired.¹

> I suppose one difficulty here is also that, to be able to change a "geoarrow.wkb" type to one of the native types, you need to know the geometry type (i.e. whether they are points or linestrings or ...).

Yeah, I'm not quite sure how that would work here.

Footnotes

  1. I hit this today in lonboard, where I implemented multithreaded reprojection across each Arrow chunk via pyproj. But that doesn't give a speedup when the input table only has a single chunk. (A rough sketch of this per-chunk approach follows below.)

    In geoarrow-rs I handle chunked conversion to GeoArrow in the ChunkedGeometryArray.from_shapely method. It's not yet implemented for the top-level from_geopandas function, though, because I haven't implemented the cross-chunk casting (e.g. if the first N rows have only Point geometries but row N + 1 has a MultiPoint, then you want to cast the first chunk to a MultiPointArray to match). Maybe there are similar considerations for attribute columns? (Maybe it would be nice if Table.from_pandas had a chunksize parameter.)
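For the reprojection pattern in footnote 1, a rough sketch; the (xs, ys)-per-chunk layout is an assumed simplification of pulling coordinates out of an Arrow table, while pyproj's Transformer and its array transform are real APIs (pyproj releases the GIL for array input, which is what lets threads overlap):

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np
from pyproj import Transformer

# Illustrative stand-in for coordinate arrays extracted per Arrow chunk.
chunks = [(np.random.rand(1_000), np.random.rand(1_000)) for _ in range(8)]

def reproject_chunk(chunk):
    # One Transformer per task: PROJ transformer objects should not be
    # shared across threads.
    transformer = Transformer.from_crs("EPSG:4326", "EPSG:3857", always_xy=True)
    xs, ys = chunk
    return transformer.transform(xs, ys)

with ThreadPoolExecutor() as pool:
    reprojected = list(pool.map(reproject_chunk, chunks))
```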
