
ENH: support writing and filtered reading from bbox columns in GeoParquet #3282

Merged · 43 commits · May 27, 2024

Conversation


@nicholas-ys-tan nicholas-ys-tan commented May 11, 2024

Resolves #3252

When write_bbox_covering==True, the covering.bbox entry is written to the metadata along with a per-row bbox column; by default no bbox is written. The bbox format is a tuple (xmin, ymin, xmax, ymax), the same as the output of GeoDataFrame.bounds.

When reading, a bbox can be specified and is converted to a pyarrow.compute.Expression used to filter read_parquet. The schema is checked first to ensure the file contains covering.bbox metadata, and an informative error is raised if not. As conversion of struct data to geopandas is expensive, the default is that this column is not read unless explicitly requested.

TODO:

[x] Currently still getting it to work with partitioned datasets and fsspec. It needs to read the schema/metadata early to check for the bbox column if we are going to filter by it. Also need to omit reading the bbox column as default behaviour, but feeding directories and fsspec to parquet.read_schema is not as well supported as parquet.read_table.

edit - Addressed by using pyarrow.dataset.dataset() to obtain the schema, this will work with file strings, directories, file objects and fsspec in memory. I assume this function does not actually load the data and allows efficient access to the schema. Part of me wants to also use this to change the read_table function, but perhaps that is out of scope.

[x] Figure out how to splice the filters kwarg with bbox if filters was fed in DNF format, since bbox is currently converted to a pyarrow.compute expression.

  • use parquet.filters_to_expression()

[?] Resolve whether the user should be given flexibility over the bbox column name. Edge cases to consider: what if that column name already exists, and what flexibility should the user have when reading? And how do we handle a user who reads a parquet file with a non-standard bbox column name and wants to save with that same non-standard name?

  • Currently only allows writing with bbox column name "bbox"

[?] Confirm whether it is acceptable that the read_bbox_column predicate is ignored when an explicit list of column names is fed to read_parquet.

@nicholas-ys-tan nicholas-ys-tan marked this pull request as ready for review May 12, 2024 10:02
@nicholas-ys-tan nicholas-ys-tan marked this pull request as draft May 12, 2024 13:54
@nicholas-ys-tan
Contributor Author

Realised that I have enforced bbox as the column name that contains the bbox. Will need to make read_parquet column-name agnostic, and give flexibility over what column name to_parquet can write to; probably change it from a predicate to a str input.

@nicholas-ys-tan
Contributor Author

nicholas-ys-tan commented May 14, 2024

Realised that I have enforced bbox as the column name that contains the bbox. Will need to make read_parquet column-name agnostic, and give flexibility over what column name to_parquet can write to; probably change it from a predicate to a str input.

To appropriately navigate the filtering of bbox and dropping of the bbox column by default, we need to read in the metadata and filter before running read_table. This is slightly problematic as all the metadata validation is conducted after reading in the table within _arrow_to_geopandas.

I think I actually need to start with a refactor to bring the metadata/schema reading and validation to the front in both _read_parquet and _read_feather. Then add in the logic to check for covering.bbox, filter by bbox and drop bbox column as default behaviour in between metadata validation and reading of the table.

@nicholas-ys-tan nicholas-ys-tan force-pushed the issue3252 branch 3 times, most recently from adef50e to d817d6b Compare May 15, 2024 13:42

@jorisvandenbossche jorisvandenbossche left a comment


Thanks a lot for working on this! Didn't take a detailed look yet, but just a few quick comments.

@@ -78,6 +78,9 @@ def _create_metadata(df, schema_version=None):
schema_version : {'0.1.0', '0.4.0', '1.0.0-beta.1', '1.0.0', None}
GeoParquet specification version; if not provided will default to
latest supported version.
bbox_column_name : str, default None

I am not sure if it is needed to give the user control over the exact name (although the geoparquet spec allows it). This could also be a boolean flag to enable writing it or not.

Comment on lines 270 to 276
geometry_bbox = df.bounds.rename(
OrderedDict(
[("minx", "xmin"), ("miny", "ymin"), ("maxx", "xmax"), ("maxy", "ymax")]
),
axis=1,
)
df[bbox_column_name] = geometry_bbox.to_dict("records")

I think this can be done more efficiently, avoiding going through python dictionaries. We can create a pyarrow StructArray manually (and add it as a column to the converted table) using pyarrow.StructArray.from_arrays.

Something like

bounds = df.geometry.bounds
bbox_col = pa.StructArray.from_arrays([bounds["minx"], ...], names=["xmin", ...])

Comment on lines 710 to 818
# read_metadata does not accept a filesystem keyword, so need to
# handle this manually (https://issues.apache.org/jira/browse/ARROW-16719)
if filesystem is not None:
pa_filesystem = _ensure_arrow_fs(filesystem)
with pa_filesystem.open_input_file(path) as source:
metadata = parquet.read_metadata(source).metadata

Probably fine for this PR as you just moved this around, but we're just noticing that we should be able to simplify this now, because read_metadata has had a filesystem keyword for several pyarrow releases.

@jorisvandenbossche
Member

edit - Addressed by using pyarrow.dataset.dataset() to obtain the schema, this will work with file strings, directories, file objects and fsspec in memory. I assume this function does not actually load the data and allows efficient access to the schema. Part of me wants to also use this to change the read_table function, but perhaps that is out of scope.

This sounds good. In theory we could keep a fallback to use read_metadata in case someone has a pyarrow installation without dataset enabled (we actually already have this fallback, so it's maybe just a matter of putting a try/except around the ds.dataset call)

@martinfleis martinfleis added this to the 1.0 milestone May 17, 2024

if bbox_column_name:
bounds = df.bounds
df[bbox_column_name] = StructArray.from_arrays(

I think you will have to add this column directly to the pyarrow Table (so after table = Table.from_pandas(df, preserve_index=index) below). Otherwise this line will still convert the pyarrow struct array to a numpy object-dtype array of python dicts, so it does not avoid this unnecessary costly conversion.

@jorisvandenbossche
Member

Figure out how to splice the filters kwarg with bbox if filters was fed in DNF format, since bbox is currently converted to a pyarrow.compute expression.

PyArrow provides a utility function to convert DNF-format filters to an expression: pyarrow.parquet.filters_to_expression, so you can use that and then combine both expressions.

@nicholas-ys-tan
Contributor Author

Figure out how to splice the filters kwarg with bbox if filters was fed in DNF format, since bbox is currently converted to a pyarrow.compute expression.

PyArrow provides a utility function to convert DNF-format filters to an expression: pyarrow.parquet.filters_to_expression, so you can use that and then combine both expressions.

Thanks Joris, I had just found that a couple of hours ago - the latest commit has that incorporated.

I think my last outstanding item was about the user's control of writing the bbox column name. I was wondering if we needed to account for the use case where a user may read a geoparquet file with a non-standard bbox column name, make edits and then save, but wanting to preserve the non-standard bbox column name?

If not, I can simplify this to save the bbox column via a predicate and just use bbox as the only allowable name.

@nicholas-ys-tan nicholas-ys-tan force-pushed the issue3252 branch 2 times, most recently from 88c9618 to ee3d6ff Compare May 20, 2024 14:41

@m-richards m-richards left a comment


Not a complete review but some small initial comments on docstrings and edge cases

return True


def _check_if_bbox_column_in_parquet(geo_metadata):

seems like this does not have the most descriptive name, since this appears to check if the "covering" metadata key is present?


bbox_filter = _get_parquet_bbox_filter(geo_metadata, bbox) if bbox else None

if_bbox_column_exists = _check_if_bbox_column_in_parquet(geo_metadata)

seems like this should throw if read_bbox_column is true and if_bbox_column_exists is false


@jorisvandenbossche jorisvandenbossche left a comment


Thanks for the update! Some more comments:

read_bbox_column: bool, default False
The bbox column is a struct with the minimum rectangular box that
encompasses the geometry. It is computationally expensive to read
in a struct into a GeoDataFrame. As such, it is default to not

Suggested change
in a struct into a GeoDataFrame. As such, it is default to not
into a GeoDataFrame. As such, the default is to not

(when in a GeoDataFrame, it's not really a "struct" anymore, but in practice a python dictionary at the moment (until pandas supports a native struct type), so let's just keep it general: "read into a geodataframe")

encompasses the geometry. It is computationally expensive to read
in a struct into a GeoDataFrame. As such, it is default to not
read in this column unless explictly specified as True.
If ``columns`` arguement is used and contains ``bbox``,

Suggested change
If ``columns`` arguement is used and contains ``bbox``,
If ``columns`` argument is used and contains ``bbox``,

(and +1 on letting the columns keyword override this)

]


def test_filters_format_as_expression(tmpdir, naturalearth_lowres):

Maybe parametrize the above and this one on a filters param, as I think that's the only thing that differs here? (to avoid some duplication)

nicholas-ys-tan and others added 12 commits May 27, 2024 13:31
Co-authored-by: Matt Richards <45483497+m-richards@users.noreply.github.com>
Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
@jorisvandenbossche
Member

Was discussing with @martinfleis in person here, and related to the read_covering_column=True option: given that it results in this object dtype column in pandas, it's not very useful right now. It's mostly useful for debugging, but then at that point you will want to read the file with pyarrow directly to get the different struct fields directly.
Therefore, I pushed a commit to remove that keyword for now. That also resolves the naming discussion, is in line with GDAL, and given the default is False it is always possible to add that keyword later without behaviour change.

@m-richards
Member

Was discussing with @martinfleis in person here, and related to the read_covering_column=True option: given that it results in this object dtype column in pandas, it's not very useful right now. It's mostly useful for debugging, but then at that point you will want to read the file with pyarrow directly to get the different struct fields directly. Therefore, I pushed a commit to remove that keyword for now. That also resolves the naming discussion, is in line with GDAL, and given the default is False it is always possible to add that keyword later without behaviour change.

Could we back this column as a pyarrow struct series (https://github.com/pandas-dev/pandas/pull/54977/files)? However, as you say, it's unlikely to be useful, and one can also get it by explicitly listing it in columns.

@jorisvandenbossche
Member

In theory one can indeed specify bbox in the list of columns and then ask pandas to return it as an ArrowDtype struct. But since that is not yet a default, I would still leave that up to the user.

@jorisvandenbossche
Member

I am going to be a bit unconventional and already merge this (we want to show this today at GeoPython's tutorial). But please still leave some review, there are a few follow-ups to take care of anyway:

  • ensure this is robust for the geometry column name (no hardcoded "geometry")
  • test reading a file that uses a different column name than "bbox"
  • test writing in case your geodataframe already has a bbox column

@jorisvandenbossche jorisvandenbossche merged commit 785056e into geopandas:main May 27, 2024
20 checks passed
@nicholas-ys-tan
Contributor Author

I am going to be a bit unconventional and already merge this (we want to show this today at GeoPython's tutorial). But please still leave some review, there are a few follow-ups to take care of anyway:

* ensure this is robust for the geometry column name (no hardcoded "geometry")

* test reading a file that uses a different column name than "bbox"

* test writing in case your geodataframe already has a bbox column

Thanks for your help and guidance on this one @jorisvandenbossche , @m-richards , @martinfleis

Should I open a new issue for these follow ups to tie this one off?

Good luck at the geopython tutorial today.

@martinfleis
Member

Should I open a new issue for these follow ups to tie this one off?

that may be wise, thanks!


Successfully merging this pull request may close these issues.

ENH: support writing + filtered reading from bbox columns in GeoParquet