
ENH: (Geo)Arrow interoperability & Arrow PyCapsule Interface #3156

Open · kylebarron opened this issue Jan 24, 2024 · 5 comments

@kylebarron (Contributor)

Is your feature request related to a problem?

In light of pandas-dev/pandas#56587, it would be great if GeoPandas supported the Arrow PyCapsule Interface for interoperability with GeoArrow, whether for reading, writing, or both.

Describe the solution you'd like

geopandas.read_parquet uses an internal _arrow_to_geopandas function, and GeoDataFrame.to_parquet uses an internal _geopandas_to_arrow function. It would be nice to have a public API for exporting to, and possibly importing from, GeoArrow data.

Now that pandas has added an __arrow_c_stream__ dunder (pandas-dev/pandas#56587), I think the best public API for exporting GeoArrow data is to add an __arrow_c_stream__ method to GeoDataFrame as well.

Exporting data is simpler than importing data because third-party data is not guaranteed to have WKB-encoded geometries and shapely does not yet support arbitrary GeoArrow geometries (ref shapely/shapely#1953).

One question is whether the data export should always use WKB, for simplicity and interoperability, or should try to use to_ragged_array where possible. In geoarrow/geoarrow-rs#477 I found that converting a GeoDataFrame to GeoArrow via to_ragged_array could be up to 4x faster than converting to WKB and then parsing that WKB.
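For concreteness, here is a minimal sketch of the two export paths, using shapely's real to_wkb and to_ragged_array APIs; the tiny GeoDataFrame and the bare pyarrow wrapping are only illustrative, not the proposed implementation:

```python
import geopandas as gpd
import pyarrow as pa
import shapely
from shapely.geometry import Point

gdf = gpd.GeoDataFrame({"name": ["a", "b"]}, geometry=[Point(0, 0), Point(1, 1)])

# Path 1: always WKB -- simple, and every GeoArrow consumer can read it.
wkb_array = pa.array(shapely.to_wkb(gdf.geometry.values))

# Path 2: native GeoArrow coordinates via ragged arrays -- skips the WKB
# round-trip, but raises when the geometry types are incompatible.
geom_type, coords, offsets = shapely.to_ragged_array(gdf.geometry.values)
```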

Importing data is more involved, due to the variety of encodings that GeoArrow supports. It may be best to leave that for a separate discussion.

API breaking implications

None.

Describe alternatives you've considered

This data conversion could be implemented in a separate library; however, I think there are good reasons to implement it here. The benefit of Arrow, and the PyCapsule Interface specifically, is that the ecosystem no longer needs N * M connectors (where every application writes direct support for every other application). E.g. in geoarrow/geoarrow-rs#477 and in lonboard I wrote special cases for GeoDataFrames, but if we implement __arrow_c_stream__ directly on GeoDataFrame, the need for special-casing GeoDataFrames for Arrow will lessen over time.

Outside of my own libraries, other beneficiaries would include GDAL. For example, if/when OSGeo/gdal#9132 and/or geopandas/pyogrio#314 are implemented, and those functions look for an __arrow_c_stream__ method on their input, you'd be able to pass a GeoDataFrame to them directly, without the user needing to think about conversions.

Additional context

@martinfleis (Member)

+1 on that. It would be a nice feature to have in 1.0 if we manage to get it in on time (by the end of March roughly).

@kylebarron (Contributor, Author)

The actual code would likely be quite simple. More challenging is coming to a consensus on a few open options, like: should the export always go through WKB?

@jorisvandenbossche (Member)

Thanks for opening the issue!

> More challenging is coming to a consensus on a few open options, like: should the export always go through WKB?

That's indeed the tricky question. I have long been thinking about improving the (Py)Arrow compatibility (but it seems I never opened an issue for this). Right now, a geopandas -> pyarrow conversion like pyarrow.table(gdf) fails because pyarrow doesn't know how to handle the geometry column (people using the pyarrow APIs with dataframes regularly bump into this, because for now we hide the conversion in our _arrow_to_geopandas helper). Even before the Arrow PyCapsule protocol, we could have fixed this long ago by implementing a GeometryArray.__arrow_array__ dunder (without the "c", so essentially the same, but returning a pyarrow object instead of a capsule).
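For illustration, a minimal sketch of what that dunder could look like with WKB output (the GeometryArray shown is a simplified stand-in, not geopandas' actual internals):

```python
import numpy as np
import pyarrow as pa
import shapely

class GeometryArray:
    """Simplified stand-in for geopandas' internal geometry array."""

    def __init__(self, geoms):
        self._data = np.asarray(geoms, dtype=object)

    def __arrow_array__(self, type=None):
        # pyarrow.array()/pyarrow.table() call this hook when they encounter
        # an unknown array-like; it must return a pyarrow Array or ChunkedArray.
        return pa.array(shapely.to_wkb(self._data), type=type or pa.binary())
```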

But one of the reasons I hesitated doing that was that I didn't know which conversion would be best to choose ... ;)

I am also thinking that, given there are multiple options, we should probably also create a public, user-facing method that does this Arrow conversion, where you can specify which type conversion you want. That way, you have an alternative for when the default in __arrow_c_stream__ isn't what you want.
For example, we could have a to_arrow() method with a geometry_encoding option, or something like that.
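As a rough sketch of how those two layers could relate (to_arrow and geometry_encoding are just the names floated in this thread, not an implemented geopandas API):

```python
import pyarrow as pa

class GeoDataFrame:  # sketch only: methods to be added to the real class
    def to_arrow(self, geometry_encoding="WKB") -> pa.Table:
        """Explicit conversion where the user picks the geometry encoding."""
        ...

    def __arrow_c_stream__(self, requested_schema=None):
        # The dunder simply exposes the default conversion over the C stream
        # interface; pyarrow.Table has supported __arrow_c_stream__ since
        # pyarrow 14.
        return self.to_arrow().__arrow_c_stream__(requested_schema)
```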

@jorisvandenbossche (Member)

@kylebarron In addition (though I don't know whether other libraries would actually make use of this), this could be a use case for requested_schema ... In theory you could check the schema, infer which columns are geometries, check their type, and, if it's not what you want, modify it and pass it back as requested_schema.
I suppose one difficulty here is also that, to be able to change a "geoarrow.wkb" type to one of the native types, you need to know the geometry type (i.e. whether they are points or linestrings or ...).
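To make that concrete, a rough sketch of a consumer-side requested_schema rewrite, assuming geometry columns are tagged with GeoArrow extension metadata on the field (request_wkb is a hypothetical helper, not an existing API):

```python
import pyarrow as pa

def request_wkb(schema: pa.Schema) -> pa.Schema:
    """Rewrite native GeoArrow fields to geoarrow.wkb in a schema request."""
    fields = []
    for field in schema:
        meta = field.metadata or {}
        if meta.get(b"ARROW:extension:name", b"").startswith(b"geoarrow."):
            field = pa.field(
                field.name, pa.binary(),
                metadata={b"ARROW:extension:name": b"geoarrow.wkb"},
            )
        fields.append(field)
    return pa.schema(fields)

# Usage: hand the rewritten schema back to the producer as a capsule, e.g.
# producer.__arrow_c_stream__(
#     requested_schema=request_wkb(schema).__arrow_c_schema__())
```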

@kylebarron (Contributor, Author)

> But one of the reasons I hesitated doing that was that I didn't know which conversion would be best to choose ... ;)

> I am also thinking that, given there are multiple options, we should probably also create a public, user-facing method that does this Arrow conversion, where you can specify which type conversion you want. That way, you have an alternative for when the default in __arrow_c_stream__ isn't what you want. For example, we could have a to_arrow() method with a geometry_encoding option, or something like that.

After a bit of reflection, I agree, and possibly we should hold off for a bit before implementing a default __arrow_c_stream__. Maybe first implement a to_arrow() method (to_geoarrow?) and then, as we see how the ecosystem progresses, advance to a default PyCapsule implementation. Or we could say that WKB is good enough for a default implementation now and (maybe?) change it later, though that would probably be a large change.

As a general question, where should the geopandas -> geoarrow conversion live? Given that any geoarrow integration is limited to shapely methods, and pyarrow is already an optional dependency, having an implementation in geopandas seems reasonable.

What about, as you said, an initial to_arrow function that defaults to WKB, with a geometry_encoding option and a wkb_fallback option? If the user passes geometry_encoding="geoarrow", it first tries to call shapely.to_ragged_array; if that errors and wkb_fallback is True, it continues with WKB encoding.
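A sketch of that fallback logic (encode_geometry and build_geoarrow_array are hypothetical names; shapely.to_ragged_array and shapely.to_wkb are the real APIs):

```python
import pyarrow as pa
import shapely

def encode_geometry(geoms, geometry_encoding="WKB", wkb_fallback=True):
    """Encode an array of shapely geometries as a pyarrow array."""
    if geometry_encoding.lower() == "geoarrow":
        try:
            geom_type, coords, offsets = shapely.to_ragged_array(geoms)
        except Exception:
            # e.g. mixed geometry types that to_ragged_array cannot represent
            if not wkb_fallback:
                raise
        else:
            # Success: build the native (nested list) GeoArrow array here.
            return build_geoarrow_array(geom_type, coords, offsets)  # hypothetical
    # Default path, and the fallback path: WKB-encoded binary.
    return pa.array(shapely.to_wkb(geoms))
```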

I've grown accustomed to working with Arrow metadata on the Field object instead of registered pyarrow extension types. I'd be content to use that sort of implementation here, which would avoid a dependency on geoarrow-pyarrow.
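For reference, that field-level tagging needs only plain pyarrow, with no registered extension type or geoarrow-pyarrow import:

```python
import pyarrow as pa

# Consumers that understand GeoArrow read "ARROW:extension:name" from the
# field metadata; everyone else just sees an ordinary binary column.
geometry_field = pa.field(
    "geometry", pa.binary(),
    metadata={"ARROW:extension:name": "geoarrow.wkb"},
)
schema = pa.schema([pa.field("name", pa.string()), geometry_field])
```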

(As an aside, does shapely.to_ragged_array error on the first incompatible geometry it finds? In some naive tests, an array of points with a polygon at index 0 seemed to error no faster than one with the polygon at the last index, but that testing wasn't scientific.)

Ideally, if/when shapely adds full geoarrow integration (shapely/shapely#1953), the wkb_fallback option could be removed and all geometry arrays converted to geoarrow.

In the longer term, options to customize the coordinate type (interleaved vs. separated) and row group size may be desired.¹

> I suppose one difficulty here is also that, to be able to change a "geoarrow.wkb" type to one of the native types, you need to know the geometry type (i.e. whether they are points or linestrings or ...).

Yeah, I'm not quite sure how that would work here.

Footnotes

  1. I hit this today in lonboard, where I implemented multithreaded reprojection across each Arrow chunk via pyproj. But that doesn't give a speedup when the input table only has a single chunk. (A rough sketch of this per-chunk approach follows below.)

    In geoarrow-rs I handle chunked conversion to GeoArrow in the ChunkedGeometryArray.from_shapely method. It's not yet implemented for the top-level from_geopandas function, though, because I haven't implemented the cross-chunk casting (e.g. if the first N rows have only Point geometries but row N + 1 has a MultiPoint, then you want to cast the first chunk to a MultiPointArray to match). Maybe there are similar considerations for attribute columns? (Maybe it would be nice if Table.from_pandas had a chunksize parameter.)
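For the reprojection pattern in footnote 1, a rough sketch; the (xs, ys)-per-chunk layout is an assumed simplification of pulling coordinates out of an Arrow table, while pyproj's Transformer and its array transform are real APIs (pyproj releases the GIL for array input, which is what lets threads overlap):

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np
from pyproj import Transformer

# Illustrative stand-in for coordinate arrays extracted per Arrow chunk.
chunks = [(np.random.rand(1_000), np.random.rand(1_000)) for _ in range(8)]

def reproject_chunk(chunk):
    # One Transformer per task: PROJ transformer objects should not be
    # shared across threads.
    transformer = Transformer.from_crs("EPSG:4326", "EPSG:3857", always_xy=True)
    xs, ys = chunk
    return transformer.transform(xs, ys)

with ThreadPoolExecutor() as pool:
    reprojected = list(pool.map(reproject_chunk, chunks))
```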
