Parquet: implement using and writing bounding box column, for faster spatial filtering #9185

Merged
merged 9 commits into from
Mar 17, 2024

Conversation

rouault
Member

@rouault rouault commented Feb 1, 2024

  • Implements opengeospatial/geoparquet#191 ("Introduce bounding box column definition")
  • Bounding box column is written by default (can be disabled with WRITE_COVERING_BBOX=NO layer creation option)
  • On reading, bounding box columns are hidden in the OGR layer definition, but used internally by the driver to speed up spatial filtering
  • Parquet writer: add SORT_BY_BBOX=YES/NO layer creation option. Defaults to NO

Documentation:

- .. lco:: SORT_BY_BBOX
     :choices: YES, NO
     :default: NO
     :since: 3.9

     Whether features should be sorted based on the bounding box of their
     geometries, before being written in the final file. Sorting them enables
     faster spatial filtering on reading, by grouping together spatially close
     features in the same group of rows.

     Note however that enabling this option involves creating a temporary
     GeoPackage file (in the same directory as the final Parquet file),
     and thus requires temporary storage (possibly up to several times the size
     of the final Parquet file, depending on Parquet compression) and additional
     processing time.

     The efficiency of spatial filtering depends on the ROW_GROUP_SIZE. If it
     is too large, too many features that are not spatially close will be grouped
     together. If it is too small, the file size will increase, and extra
     processing time will be necessary to browse through the row groups.

     Note also that when this option is enabled, the Arrow writing API (which
     is for example triggered when using ogr2ogr to convert from Parquet to Parquet)
     falls back to the generic implementation, which does not support advanced
     Arrow types (lists, maps, etc.).
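
To make the row-group argument above concrete, here is a minimal pure-Python model of bounding-box pruning. It is only a sketch under stated assumptions, not the driver's code: the synthetic features, the 500-feature row groups, and the grid-cell sort (a crude stand-in for the RTree ordering) are all made up for illustration.

```python
import random

def bbox_of(boxes):
    """Union of a list of (xmin, ymin, xmax, ymax) boxes."""
    return (min(b[0] for b in boxes), min(b[1] for b in boxes),
            max(b[2] for b in boxes), max(b[3] for b in boxes))

def intersects(a, b):
    """True if two (xmin, ymin, xmax, ymax) boxes overlap or touch."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def row_groups_to_scan(features, row_group_size, query):
    """Count row groups whose overall bbox intersects the query window.

    A reader that knows per-row-group bbox statistics can skip every
    other row group without decoding its features.
    """
    groups = [features[i:i + row_group_size]
              for i in range(0, len(features), row_group_size)]
    return sum(1 for g in groups if intersects(bbox_of(g), query))

random.seed(0)
# Hypothetical data: 10,000 unit squares scattered over a 1000 x 1000 area.
features = []
for _ in range(10_000):
    x, y = random.uniform(0, 999), random.uniform(0, 999)
    features.append((x, y, x + 1, y + 1))

query = (500, 500, 510, 510)

# Crude spatial ordering by 100-unit grid cell (illustrative stand-in only).
spatially_sorted = sorted(
    features, key=lambda b: (int(b[0] // 100), int(b[1] // 100)))

unsorted_scans = row_groups_to_scan(features, 500, query)
sorted_scans = row_groups_to_scan(spatially_sorted, 500, query)
print("row groups scanned, unsorted:", unsorted_scans)
print("row groups scanned, sorted:  ", sorted_scans)
```

With features in random order, nearly every row group's bounding box covers the whole area, so almost nothing can be skipped; once spatially close features share a row group, only a few row groups intersect a small query window.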

Experiments with the canonical https://storage.googleapis.com/open-geodata/linz-examples/nz-building-outlines.parquet dataset:

  • Generation of datasets:

// Organize in row groups of 65,536 features, no BBOX, no sorting

$ time ogr2ogr out_no_bbox.parquet nz-building-outlines.parquet -progress -lco WRITE_COVERING_BBOX=NO
0...10...20...30...40...50...60...70...80...90...100 - done.

real    0m4,457s

// Organize in row groups of 65,536 features, add BBOX columns, no sorting

$ time ogr2ogr out_unsorted.parquet nz-building-outlines.parquet -progress
0...10...20...30...40...50...60...70...80...90...100 - done.

real    0m5,408s

// Organize in row groups of max 65,536 features, add BBOX columns, sort using RTree

$ time ogr2ogr out_sorted.parquet nz-building-outlines.parquet -progress -lco SORT_BY_BBOX=YES
0...10...20...30...40...50...60...70...80...90...100 - done.

real    0m40,311s

// Organize in row groups of max 16,384 features, add BBOX columns, sort using RTree

$ time ogr2ogr out_sorted_16384.parquet nz-building-outlines.parquet -progress -lco SORT_BY_BBOX=YES  -lco ROW_GROUP_SIZE=16384
0...10...20...30...40...50...60...70...80...90...100 - done.

real    0m44,149s
  • File sizes:
out_no_bbox.parquet          436,475,127
out_unsorted.parquet         504,120,728
out_sorted.parquet           489,507,910
out_sorted_16384.parquet     492,760,561
  • Spatial filter selecting a single feature:
$ time ogrinfo out_no_bbox.parquet -spat 1818654 5546189 1818655 5546190 -al -so -json -noextent | jq .layers[0].featureCount
1

real    0m1,302s

$ time ogrinfo out_unsorted.parquet -spat 1818654 5546189 1818655 5546190 -al -so -json -noextent | jq .layers[0].featureCount
1

real    0m0,947s

$ time ogrinfo out_sorted.parquet -spat 1818654 5546189 1818655 5546190 -al -so -json -noextent | jq .layers[0].featureCount
1

real    0m0,278s

$ time ogrinfo out_sorted_16384.parquet -spat 1818654 5546189 1818655 5546190 -al -so -json -noextent | jq .layers[0].featureCount
1

real    0m0,183s
  • Spatial filter selecting ~ 470,000 features (out of a total of 3.2 million):
$ time ogrinfo out_no_bbox.parquet -spat 1750445 5812014 1912866 5906677 -al -so -json -noextent | jq .layers[0].featureCount
471147

real    0m1,957s

$ time ogrinfo out_unsorted.parquet -spat 1750445 5812014 1912866 5906677 -al -so -json -noextent | jq .layers[0].featureCount
471147

real    0m1,718s

$ time ogrinfo out_sorted.parquet -spat 1750445 5812014 1912866 5906677 -al -so -json -noextent | jq .layers[0].featureCount
471147

real    0m1,067s

$ time ogrinfo out_sorted_16384.parquet -spat 1750445 5812014 1912866 5906677 -al -so -json -noextent | jq .layers[0].featureCount
471147

real    0m1,021s
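
For easier comparison, the timings and file sizes reported above can be reduced to ratios. The short script below is just a convenience sketch: the numbers are copied from the runs above, while the helper names (`parse_real`, `speedups`) are made up for this illustration.

```python
import re

def parse_real(value):
    """Convert a `time` value such as '0m1,302s' to seconds.

    Accepts either a comma or a dot as the decimal separator.
    """
    match = re.fullmatch(r"(\d+)m(\d+)[,.](\d+)s", value)
    if match is None:
        raise ValueError(f"unrecognized time value: {value!r}")
    minutes, seconds, fraction = match.groups()
    return int(minutes) * 60 + float(f"{seconds}.{fraction}")

# "real" timings copied from the ogrinfo runs above.
single_feature = {
    "out_no_bbox": "0m1,302s",
    "out_unsorted": "0m0,947s",
    "out_sorted": "0m0,278s",
    "out_sorted_16384": "0m0,183s",
}
large_window = {
    "out_no_bbox": "0m1,957s",
    "out_unsorted": "0m1,718s",
    "out_sorted": "0m1,067s",
    "out_sorted_16384": "0m1,021s",
}
# File sizes in bytes, copied from the listing above.
file_sizes = {
    "out_no_bbox": 436_475_127,
    "out_unsorted": 504_120_728,
    "out_sorted": 489_507_910,
    "out_sorted_16384": 492_760_561,
}

def speedups(timings, baseline="out_no_bbox"):
    """Speed-up factor of each file relative to the baseline file."""
    base = parse_real(timings[baseline])
    return {name: base / parse_real(t) for name, t in timings.items()}

for label, timings in (("single feature", single_feature),
                       ("~470,000 features", large_window)):
    for name, ratio in speedups(timings).items():
        print(f"{label:>18} {name:<18} x{ratio:.2f}")

for name, size in file_sizes.items():
    overhead = size / file_sizes["out_no_bbox"] - 1
    print(f"{name:<18} size overhead {overhead:+.1%}")
```

These are single runs on one machine, so the ratios are indicative only.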

@rouault rouault added this to the 3.9.0 milestone Feb 1, 2024
@rouault rouault changed the title Parquet: implement using and writing bounding box colum, for faster spatial filtering Parquet: implement using and writing bounding box column, for faster spatial filtering Feb 1, 2024
@jorisvandenbossche
Contributor

Cool!
For the SORT_BY_BBOX option, one thought: currently you made it into a YES/NO option, but I can imagine that there could be multiple sort methods (I don't know if that's something you would consider adding to GDAL, though).

For example, you currently use the RTree from GeoPackage. But another option could be to use GEOS (if available in the GDAL build) to calculate the HilbertCode (GEOS >= 3.11) for each bbox value, and then use the Arrow C++ APIs to sort the data based on those values before writing to Parquet.
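
As a rough illustration of what sorting by Hilbert code means, here is a textbook Hilbert-curve index applied to bounding box centers. This is a sketch only: it is neither GEOS's HilbertCode implementation nor an Arrow sorting API, and the function names, the 16-bit grid, and the sample boxes are assumptions made for the example.

```python
def hilbert_index(order, x, y):
    """Distance along a Hilbert curve of the given order for grid cell (x, y).

    The grid is 2**order cells wide; x and y must be in [0, 2**order).
    Classic iterative formulation (quadrant lookup plus rotate/flip).
    """
    n = 1 << order
    d = 0
    s = n >> 1
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/flip the quadrant so the sub-curve has the right orientation.
        if ry == 0:
            if rx == 1:
                x = n - 1 - x
                y = n - 1 - y
            x, y = y, x
        s >>= 1
    return d

def hilbert_key(bbox, extent, order=16):
    """Hilbert index of the center of bbox, quantized within extent.

    bbox and extent are (xmin, ymin, xmax, ymax); extent must have a
    non-zero width and height.
    """
    n = 1 << order
    cx = (bbox[0] + bbox[2]) / 2
    cy = (bbox[1] + bbox[3]) / 2
    gx = int((cx - extent[0]) / (extent[2] - extent[0]) * n)
    gy = int((cy - extent[1]) / (extent[3] - extent[1]) * n)
    # Clamp so that centers on the extent boundary stay inside the grid.
    gx = max(0, min(n - 1, gx))
    gy = max(0, min(n - 1, gy))
    return hilbert_index(order, gx, gy)

# Hypothetical bounding boxes inside a 100 x 100 extent.
extent = (0.0, 0.0, 100.0, 100.0)
boxes = [(90, 90, 91, 91), (1, 1, 2, 2), (50, 10, 52, 12), (2, 3, 3, 4)]
ordered = sorted(boxes, key=lambda b: hilbert_key(b, extent))
print(ordered)
```

Sorting on such a key is a plain in-memory sort here; whether it can be done out of core depends on the sorting machinery used.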

@rouault
Member Author

rouault commented Feb 2, 2024

> For the SORT_BY_BBOX option, one thought: currently you made it into a YES/NO option, but I can imagine that there could be multiple sort methods

For the sake of simplicity, I'd prefer not to have to support several methods... We might switch to something better, but I'd see that as an implementation detail. The current implementation might be tunable. I'm not super convinced that the points where I flush row groups are ideal for bbox compactness (there is sometimes significant overlap between different row groups), but I couldn't find an obvious way to improve things.
Would Hilbert codes give better results?

> then use the Arrow C++ APIs to sort the data based on those values before writing to Parquet.

Do you have pointers to the API doc for that sorting API? But I'd assume that you have to ingest the whole file into memory? Besides the "simplicity" of re-using the GeoPackage RTree, one of its advantages is that it can work with files much larger than RAM. Users have for example run into RAM issues when generating very large FlatGeobuf files, when the driver builds its packed Hilbert R*Tree, which requires storing in RAM something like 84 bytes per feature, and thus if your number of features reaches 1 billion...
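
To put that figure in perspective, here is a back-of-the-envelope estimate. It takes the "something like 84 bytes per feature" above at face value and ignores any other memory the process needs; the function name is made up for the illustration.

```python
# Approximate per-feature cost quoted in the comment above.
BYTES_PER_FEATURE = 84

def index_ram_gib(n_features, bytes_per_feature=BYTES_PER_FEATURE):
    """Estimated RAM (in GiB) needed to hold the index entries in memory."""
    return n_features * bytes_per_feature / 2**30

for n in (3_200_000, 100_000_000, 1_000_000_000):
    print(f"{n:>13,} features -> {index_ram_gib(n):8.2f} GiB")
```

At 1 billion features this comes to roughly 78 GiB (84 GB), which is more RAM than most machines have.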

… fields and the FID column is not the first one
@coveralls
Collaborator

Coverage Status

coverage: 68.918% (+0.04%) from 68.876%
when pulling eb3d124 on rouault:parquet_bbox_field
into 1f4ead7 on OSGeo:master.

@rouault rouault merged commit c53727c into OSGeo:master Mar 17, 2024
32 checks passed