
ENH: support writing and filtered reading from bbox columns in GeoParquet #3282

Merged · 43 commits · May 27, 2024

Conversation


@nicholas-ys-tan nicholas-ys-tan commented May 11, 2024

Resolves #3252

When write_bbox_covering==True, the covering.bbox entry is written to the metadata along with a per-row bbox column; by default no bbox is written. The bbox format is a tuple (xmin, ymin, xmax, ymax), the same as the output of GeoDataFrame.bounds.

When reading, a bbox can be specified and is converted to a pyarrow.compute.Expression used to filter read_parquet. The schema is checked first to ensure the file contains covering.bbox metadata, and an informative error is raised if not. As conversion of struct data to geopandas is expensive, the default is that this column is not read unless explicitly requested.

TODO:

[x] Currently still getting it to work with partitioned datasets and fsspec. It needs to read the schema/metadata early to check for the bbox column if we are going to filter by it. Also need to omit reading the bbox column as default behaviour, but feeding directories and fsspec to parquet.read_schema is not as well supported as parquet.read_table.

edit - Addressed by using pyarrow.dataset.dataset() to obtain the schema, this will work with file strings, directories, file objects and fsspec in memory. I assume this function does not actually load the data and allows efficient access to the schema. Part of me wants to also use this to change the read_table function, but perhaps that is out of scope.

[x] Figure out how to splice the filters kwarg with bbox if filters was fed in DNF format, since bbox is currently converted to a pyarrow.compute expression.

  • use parquet.filters_to_expression()

[?] Resolve whether the user should be given flexibility over the bbox column name. Edge cases to consider: what if that column name already exists, and what flexibility should the user have when reading? And how do we handle a user who reads a parquet file with a non-standard bbox column name and wants to save with that same non-standard name?

  • Currently only allows writing with bbox column name "bbox"

[?] Confirm whether it is acceptable that the read_bbox_column predicate is ignored when an explicit list of column names is fed to read_parquet.

@nicholas-ys-tan nicholas-ys-tan marked this pull request as ready for review May 12, 2024 10:02
@nicholas-ys-tan nicholas-ys-tan marked this pull request as draft May 12, 2024 13:54
@nicholas-ys-tan
Contributor Author

Realised that I have enforced bbox as the column name that contains the bbox. Will need to make read_parquet column-name agnostic, and give flexibility over what column name to_parquet can write to; probably change it from a predicate to a str input.

@nicholas-ys-tan
Contributor Author

nicholas-ys-tan commented May 14, 2024

Realised that I have enforced bbox as the column name that contains the bbox. Will need to make read_parquet column-name agnostic, and give flexibility over what column name to_parquet can write to; probably change it from a predicate to a str input.

To appropriately navigate the filtering of bbox and dropping of the bbox column by default, we need to read in the metadata and filter before running read_table. This is slightly problematic as all the metadata validation is conducted after reading in the table within _arrow_to_geopandas.

I think I actually need to start with a refactor to bring the metadata/schema reading and validation to the front in both _read_parquet and _read_feather. Then add in the logic to check for covering.bbox, filter by bbox and drop bbox column as default behaviour in between metadata validation and reading of the table.

@nicholas-ys-tan nicholas-ys-tan force-pushed the issue3252 branch 3 times, most recently from adef50e to d817d6b Compare May 15, 2024 13:42

@jorisvandenbossche jorisvandenbossche left a comment


Thanks a lot for working on this! Didn't take a detailed look yet, but just a few quick comments.

@@ -78,6 +78,9 @@ def _create_metadata(df, schema_version=None):
schema_version : {'0.1.0', '0.4.0', '1.0.0-beta.1', '1.0.0', None}
GeoParquet specification version; if not provided will default to
latest supported version.
bbox_column_name : str, default None

I am not sure if it is needed to give the user control over the exact name (although the geoparquet spec allows it). This could also be a boolean flag to enable writing it or not.

Comment on lines 270 to 276
geometry_bbox = df.bounds.rename(
OrderedDict(
[("minx", "xmin"), ("miny", "ymin"), ("maxx", "xmax"), ("maxy", "ymax")]
),
axis=1,
)
df[bbox_column_name] = geometry_bbox.to_dict("records")

I think this can be done more efficiently, avoiding going through python dictionaries. We can create a pyarrow StructArray manually (and add it as a column to the converted table) using pyarrow.StructArray.from_arrays.

Something like

bounds = df.geometry.bounds
bbox_col = pa.StructArray.from_arrays([bounds["minx"], ...], names=["xmin", ...])

Comment on lines 710 to 818
# read_metadata does not accept a filesystem keyword, so need to
# handle this manually (https://issues.apache.org/jira/browse/ARROW-16719)
if filesystem is not None:
pa_filesystem = _ensure_arrow_fs(filesystem)
with pa_filesystem.open_input_file(path) as source:
metadata = parquet.read_metadata(source).metadata

Probably fine for this PR as you just moved this around, but we're just noticing that we should be able to simplify this now, because read_metadata has had a filesystem keyword for several pyarrow releases.

@jorisvandenbossche
Member

edit - Addressed by using pyarrow.dataset.dataset() to obtain the schema, this will work with file strings, directories, file objects and fsspec in memory. I assume this function does not actually load the data and allows efficient access to the schema. Part of me wants to also use this to change the read_table function, but perhaps that is out of scope.

This sounds good. In theory we could keep a fallback to use read_metadata in case someone has a pyarrow installation without dataset enabled (we actually already have this fallback, so it's maybe just a matter of putting a try/except around the ds.dataset call)

@martinfleis martinfleis added this to the 1.0 milestone May 17, 2024

if bbox_column_name:
bounds = df.bounds
df[bbox_column_name] = StructArray.from_arrays(

I think you will have to add this column directly to the pyarrow Table (so after table = Table.from_pandas(df, preserve_index=index) below). Otherwise this line will still convert the pyarrow struct array to a numpy object-dtype array of python dicts, so it does not avoid this unnecessary costly conversion.

@jorisvandenbossche
Member

Figure out how to splice the filters kwarg with bbox if filters was fed in DNF format, since bbox is currently converted to a pyarrow.compute expression.

PyArrow provides a utility function to convert DNF-format filters to an expression: pyarrow.parquet.filters_to_expression, so you can use that and then combine both expressions.

@nicholas-ys-tan
Contributor Author

Figure out how to splice the filters kwarg with bbox if filters was fed in DNF format, since bbox is currently converted to a pyarrow.compute expression.

PyArrow provides a utility function to convert DNF-format filters to an expression: pyarrow.parquet.filters_to_expression, so you can use that and then combine both expressions.

Thanks Joris, I had just found that a couple of hours ago - the latest commit has that incorporated.

I think my last outstanding item was about the user's control of writing the bbox column name. I was wondering if we needed to account for the use case where a user may read a geoparquet file with a non-standard bbox column name, make edits and then save, but wanting to preserve the non-standard bbox column name?

If not, I can simplify this to save the bbox column via a predicate and just use bbox as the only allowable name.

@nicholas-ys-tan nicholas-ys-tan force-pushed the issue3252 branch 2 times, most recently from 88c9618 to ee3d6ff Compare May 20, 2024 14:41

@m-richards m-richards left a comment


Not a complete review but some small initial comments on docstrings and edge cases

return True


def _check_if_bbox_column_in_parquet(geo_metadata):

seems like this does not have the most descriptive name, since this appears to check if the "covering" metadata key is present?


bbox_filter = _get_parquet_bbox_filter(geo_metadata, bbox) if bbox else None

if_bbox_column_exists = _check_if_bbox_column_in_parquet(geo_metadata)

seems like this should throw if read_bbox_column is true and if_bbox_column_exists is false


@jorisvandenbossche jorisvandenbossche left a comment


Thanks for the update! Some more comments:

read_bbox_column: bool, default False
The bbox column is a struct with the minimum rectangular box that
encompasses the geometry. It is computationally expensive to read
in a struct into a GeoDataFrame. As such, it is default to not

Suggested change
in a struct into a GeoDataFrame. As such, it is default to not
into a GeoDataFrame. As such, the default is to not

(when in a GeoDataFrame, it's not really a "struct" anymore, but in practice a python dictionary at the moment (until pandas supports a native struct type), so let's just keep it general: "read into a geodataframe")

encompasses the geometry. It is computationally expensive to read
in a struct into a GeoDataFrame. As such, it is default to not
read in this column unless explictly specified as True.
If ``columns`` arguement is used and contains ``bbox``,

Suggested change
If ``columns`` arguement is used and contains ``bbox``,
If ``columns`` argument is used and contains ``bbox``,

(and +1 on letting the columns keyword override this)

]


def test_filters_format_as_expression(tmpdir, naturalearth_lowres):

Maybe parametrize the above and this one on a filters param, as I think that's the only thing that differs here? (to avoid some duplication)

nicholas-ys-tan and others added 12 commits May 27, 2024 13:31
Co-authored-by: Matt Richards <45483497+m-richards@users.noreply.github.com>
Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
@jorisvandenbossche
Member

Was discussing with @martinfleis in person here, and related to the read_covering_column=True option: given that it results in this object dtype column in pandas, it's not very useful right now. It's mostly useful for debugging, but then at that point you will want to read the file with pyarrow directly to get the different struct fields directly.
Therefore, I pushed a commit to remove that keyword for now. That also resolves the naming discussion, is in line with GDAL, and given the default is False it is always possible to add that keyword later without behaviour change.

@m-richards
Member

Was discussing with @martinfleis in person here, and related to the read_covering_column=True option: given that it results in this object dtype column in pandas, it's not very useful right now. It's mostly useful for debugging, but then at that point you will want to read the file with pyarrow directly to get the different struct fields directly. Therefore, I pushed a commit to remove that keyword for now. That also resolves the naming discussion, is in line with GDAL, and given the default is False it is always possible to add that keyword later without behaviour change.

Could we back this column as a pyarrow struct series (https://github.com/pandas-dev/pandas/pull/54977/files)? However, as you say, it's unlikely to be useful, and one can also get it by explicitly listing it in columns.

@jorisvandenbossche
Member

In theory one can indeed specify bbox in the list of columns and then ask pandas to return it as an ArrowDtype struct. But since that is not yet a default, I would still leave that up to the user.

@jorisvandenbossche
Member

I am going to be a bit unconventional and already merge this (we want to show this today at GeoPython's tutorial). But please still leave some review, there are a few follow-ups to take care of anyway:

  • ensure this is robust for the geometry column name (no hardcoded "geometry")
  • test reading a file that uses a different column name than "bbox"
  • test writing in case your geodataframe already has a bbox column

@jorisvandenbossche jorisvandenbossche merged commit 785056e into geopandas:main May 27, 2024
20 checks passed
@nicholas-ys-tan
Contributor Author

I am going to be a bit unconventional and already merge this (we want to show this today at GeoPython's tutorial). But please still leave some review, there are a few follow-ups to take care of anyway:

* ensure this is robust for the geometry column name (no hardcoded "geometry")

* test reading a file that uses a different column name than "bbox"

* test writing in case your geodataframe already has a bbox column

Thanks for your help and guidance on this one @jorisvandenbossche , @m-richards , @martinfleis

Should I open a new issue for these follow ups to tie this one off?

Good luck at the geopython tutorial today.

@martinfleis
Member

Should I open a new issue for these follow ups to tie this one off?

that may be wise, thanks!


Successfully merging this pull request may close these issues.

ENH: support writing + filtered reading from bbox columns in GeoParquet