Parquet: implement using and writing bounding box column, for faster spatial filtering #9185

rouault · 2024-02-01T16:31:08Z

Implements "Introduce bounding box column definition" (opengeospatial/geoparquet#191).
The bounding box column is written by default (this can be disabled with the WRITE_COVERING_BBOX=NO layer creation option).
On reading, the bounding box column(s) are hidden in the OGR layer definition, but used internally by the driver to speed up spatial filtering.
Parquet writer: add SORT_BY_BBOX=YES/NO layer creation option. Defaults to NO.
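
To illustrate why a bounding box column can speed up spatial filtering, here is a minimal sketch. It is not the driver's code: it assumes that each row group exposes min/max statistics for the four bbox sub-fields (`xmin`, `ymin`, `xmax`, `ymax`), and the names `RowGroupStats`, `intersects` and `candidate_row_groups` are hypothetical. Under that assumption, any row group whose aggregated extent misses the filter rectangle can be skipped without decoding a single geometry.

```python
from dataclasses import dataclass
from typing import List, Tuple

BBox = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax)


@dataclass
class RowGroupStats:
    """Hypothetical per-row-group statistics of the bbox columns."""
    min_xmin: float
    min_ymin: float
    max_xmax: float
    max_ymax: float


def intersects(stats: RowGroupStats, query: BBox) -> bool:
    """True if the row group extent overlaps the query rectangle."""
    qxmin, qymin, qxmax, qymax = query
    return not (
        stats.max_xmax < qxmin or stats.min_xmin > qxmax
        or stats.max_ymax < qymin or stats.min_ymin > qymax
    )


def candidate_row_groups(groups: List[RowGroupStats], query: BBox) -> List[int]:
    """Indices of the row groups that still have to be read."""
    return [i for i, g in enumerate(groups) if intersects(g, query)]


# Illustrative values only, not taken from a real file
groups = [
    RowGroupStats(0, 0, 10, 10),
    RowGroupStats(20, 20, 30, 30),
    RowGroupStats(5, 5, 25, 25),
]
selected = candidate_row_groups(groups, (21, 21, 22, 22))
print(selected)
```

Within the row groups that remain, the per-feature bbox values can be compared against the filter in the same way before falling back to an exact geometry test.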

Documentation:

- .. lco:: SORT_BY_BBOX
     :choices: YES, NO
     :default: NO
     :since: 3.9

     Whether features should be sorted based on the bounding box of their
     geometries, before being written in the final file. Sorting them enables
     faster spatial filtering on reading, by grouping together spatially close
     features in the same group of rows.

     Note however that enabling this option involves creating a temporary
     GeoPackage file (in the same directory as the final Parquet file),
     and thus requires temporary storage (possibly up to several times the size
     of the final Parquet file, depending on Parquet compression) and additional
     processing time.

     The efficiency of spatial filtering depends on the ROW_GROUP_SIZE. If it
     is too large, too many features that are not spatially close will be grouped
     together. If it is too small, the file size will increase, and extra
     processing time will be necessary to browse through the row groups.

     Note also that when this option is enabled, the Arrow writing API (which
     is for example triggered when using ogr2ogr to convert from Parquet to Parquet)
     falls back to the generic implementation, which does not support advanced
     Arrow types (lists, maps, etc.).
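
The interaction between sorting and ROW_GROUP_SIZE described above can be explored with a toy simulation. This is only a sketch under simplifying assumptions: features are random points in the unit square, and the ordering is a plain strip sort (`strip_sort`, a hypothetical helper), not the RTree-based ordering used by this PR. It counts how many row groups have an extent that intersects a small query window, with and without sorting, for several group sizes.

```python
import random


def row_group_extents(points, group_size):
    """Extent (xmin, ymin, xmax, ymax) of each consecutive chunk of points."""
    extents = []
    for start in range(0, len(points), group_size):
        chunk = points[start:start + group_size]
        xs = [p[0] for p in chunk]
        ys = [p[1] for p in chunk]
        extents.append((min(xs), min(ys), max(xs), max(ys)))
    return extents


def groups_hit(extents, query):
    """Number of row groups whose extent intersects the query window."""
    qxmin, qymin, qxmax, qymax = query
    return sum(
        1 for (xmin, ymin, xmax, ymax) in extents
        if not (xmax < qxmin or xmin > qxmax or ymax < qymin or ymin > qymax)
    )


def strip_sort(points, strips):
    """Toy spatial ordering: vertical strips, then increasing y in each strip."""
    return sorted(points, key=lambda p: (min(int(p[0] * strips), strips - 1), p[1]))


rng = random.Random(42)
points = [(rng.random(), rng.random()) for _ in range(20000)]
query = (0.40, 0.40, 0.42, 0.42)

results = {}
for group_size in (4000, 1000, 250):
    # Aim for roughly square tiles: as many strips as groups per strip
    strips = max(1, round((len(points) / group_size) ** 0.5))
    unsorted_hits = groups_hit(row_group_extents(points, group_size), query)
    sorted_hits = groups_hit(
        row_group_extents(strip_sort(points, strips), group_size), query
    )
    results[group_size] = (unsorted_hits, sorted_hits)
    print(f"group_size={group_size}: unsorted={unsorted_hits} groups, "
          f"sorted={sorted_hits} groups "
          f"(~{sorted_hits * group_size} features to scan)")
```

In this toy setting, unsorted row groups nearly always span the whole extent, so none can be skipped, while sorted ones can. Smaller groups reduce the number of features to scan, at the cost of more row groups to go through.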

Experiments with the canonical https://storage.googleapis.com/open-geodata/linz-examples/nz-building-outlines.parquet dataset:

Generation of datasets:

// Organize in row groups of 65,536 features, no BBOX, no sorting

$ time ogr2ogr out_no_bbox.parquet nz-building-outlines.parquet -progress -lco WRITE_COVERING_BBOX=NO
0...10...20...30...40...50...60...70...80...90...100 - done.

real    0m4,457s

// Organize in row groups of 65,536 features, add BBOX columns, no sorting

$ time ogr2ogr out_unsorted.parquet nz-building-outlines.parquet -progress
0...10...20...30...40...50...60...70...80...90...100 - done.

real    0m5,408s

// Organize in row groups of max 65,536 features, add BBOX columns, sort using RTree

$ time ogr2ogr out_sorted.parquet nz-building-outlines.parquet -progress -lco SORT_BY_BBOX=YES
0...10...20...30...40...50...60...70...80...90...100 - done.

real    0m40,311s

// Organize in row groups of max 16,384 features, add BBOX columns, sort using RTree

$ time ogr2ogr out_sorted_16384.parquet nz-building-outlines.parquet -progress -lco SORT_BY_BBOX=YES  -lco ROW_GROUP_SIZE=16384
0...10...20...30...40...50...60...70...80...90...100 - done.

real    0m44,149s

File sizes:

out_no_bbox.parquet          436,475,127
out_unsorted.parquet         504,120,728
out_sorted.parquet           489,507,910
out_sorted_16384.parquet     492,760,561
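
For convenience, the relative size overhead compared to the file without bounding box columns can be derived from the figures above with a few lines of Python (the dictionary simply restates the listed sizes). It comes out at roughly 15% for the unsorted file and roughly 12 to 13% for the sorted ones.

```python
# File sizes in bytes, as listed above
sizes = {
    "out_no_bbox.parquet": 436_475_127,
    "out_unsorted.parquet": 504_120_728,
    "out_sorted.parquet": 489_507_910,
    "out_sorted_16384.parquet": 492_760_561,
}

baseline = sizes["out_no_bbox.parquet"]

# Overhead in percent relative to the file written with WRITE_COVERING_BBOX=NO
overhead = {
    name: 100.0 * (size - baseline) / baseline
    for name, size in sizes.items()
}

for name, pct in overhead.items():
    print(f"{name:<26} {pct:+.1f}%")
```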

Spatial filter selecting a single feature:

$ time ogrinfo out_no_bbox.parquet -spat 1818654 5546189 1818655 5546190 -al -so -json -noextent | jq .layers[0].featureCount
1

real    0m1,302s

$ time ogrinfo out_unsorted.parquet -spat 1818654 5546189 1818655 5546190 -al -so -json -noextent | jq .layers[0].featureCount
1

real    0m0,947s

$ time ogrinfo out_sorted.parquet -spat 1818654 5546189 1818655 5546190 -al -so -json -noextent | jq .layers[0].featureCount
1

real    0m0,278s

$ time ogrinfo out_sorted_16384.parquet -spat 1818654 5546189 1818655 5546190 -al -so -json -noextent | jq .layers[0].featureCount
1

real    0m0,183s

Spatial filter selecting ~ 470,000 features (out of a total of 3.2 million):

$ time ogrinfo out_no_bbox.parquet -spat 1750445 5812014 1912866 5906677 -al -so -json -noextent | jq .layers[0].featureCount
471147

real    0m1,957s

$ time ogrinfo out_unsorted.parquet -spat 1750445 5812014 1912866 5906677 -al -so -json -noextent | jq .layers[0].featureCount
471147

real    0m1,718s

$ time ogrinfo out_sorted.parquet -spat 1750445 5812014 1912866 5906677 -al -so -json -noextent | jq .layers[0].featureCount
471147

real    0m1,067s

$ time ogrinfo out_sorted_16384.parquet -spat 1750445 5812014 1912866 5906677 -al -so -json -noextent | jq .layers[0].featureCount
471147

real    0m1,021s
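
The speed-ups implied by these timings can be computed as below. This is a small helper sketch: `parse_real` is a hypothetical function that converts the `real` values printed by `time` (which use a comma as decimal separator in the locale used here) to seconds, and the dictionaries restate the measurements above.

```python
import re


def parse_real(value: str) -> float:
    """Convert a 'time' value such as '0m1,302s' or '0m1.302s' to seconds."""
    m = re.fullmatch(r"(\d+)m(\d+)[.,](\d+)s", value.strip())
    if m is None:
        raise ValueError(f"unrecognized time value: {value!r}")
    minutes, seconds, frac = m.groups()
    return int(minutes) * 60 + float(f"{seconds}.{frac}")


# 'real' timings reported above
single_feature = {
    "out_no_bbox.parquet": "0m1,302s",
    "out_unsorted.parquet": "0m0,947s",
    "out_sorted.parquet": "0m0,278s",
    "out_sorted_16384.parquet": "0m0,183s",
}
large_selection = {
    "out_no_bbox.parquet": "0m1,957s",
    "out_unsorted.parquet": "0m1,718s",
    "out_sorted.parquet": "0m1,067s",
    "out_sorted_16384.parquet": "0m1,021s",
}


def speedups(timings):
    """Speed-up of each file relative to the one without bbox columns."""
    base = parse_real(timings["out_no_bbox.parquet"])
    return {name: base / parse_real(t) for name, t in timings.items()}


for label, timings in (("single feature", single_feature),
                       ("~470,000 features", large_selection)):
    for name, factor in speedups(timings).items():
        print(f"{label:<18} {name:<26} x{factor:.2f}")
```

The gain is largest for the very selective filter (about 7x with sorting and 16,384-feature row groups) and more modest (a bit under 2x) when about 15% of the features are selected.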

jorisvandenbossche · 2024-02-02T15:32:28Z

Cool!
For the SORT_BY_BBOX option, one thought: currently you made it into a YES/NO option, but I can imagine that there could be multiple sort methods (I don't know if that's something you would consider adding to GDAL, though).

For example, now you used the Rtree from geopackage. But another option could be to use GEOS (if available in the GDAL build) to calculate the HilbertCode (GEOS >= 3.11) for each bbox value, and then use the Arrow C++ APIs to sort the data based on those values before writing to Parquet.
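
To make the idea concrete, here is a pure-Python sketch of sorting by Hilbert code. It is not the GEOS implementation and does not use the Arrow C++ API: `hilbert_index` is the classic iterative grid-to-curve mapping, and `bbox_hilbert_key` (a hypothetical helper) maps the center of each bbox to a cell of a 2^order x 2^order grid laid over a known, non-degenerate dataset extent.

```python
def hilbert_index(order, x, y):
    """Distance along the Hilbert curve of cell (x, y) in a 2**order grid."""
    n = 1 << order
    d = 0
    s = n >> 1
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/flip the quadrant so the sub-curve is oriented consistently
        if ry == 0:
            if rx == 1:
                x = n - 1 - x
                y = n - 1 - y
            x, y = y, x
        s >>= 1
    return d


def bbox_hilbert_key(bbox, extent, order=16):
    """Hilbert index of the center of bbox, relative to the dataset extent."""
    xmin, ymin, xmax, ymax = bbox
    exmin, eymin, exmax, eymax = extent
    n = 1 << order
    cx = (xmin + xmax) / 2.0
    cy = (ymin + ymax) / 2.0
    # Clamp so that centers lying on the upper edge stay inside the grid
    gx = min(n - 1, max(0, int((cx - exmin) / (exmax - exmin) * n)))
    gy = min(n - 1, max(0, int((cy - eymin) / (eymax - eymin) * n)))
    return hilbert_index(order, gx, gy)


# Illustrative features: (name, bbox)
extent = (0.0, 0.0, 4.0, 4.0)
features = [
    ("A", (0.0, 0.0, 1.0, 1.0)),
    ("B", (0.0, 2.0, 1.0, 3.0)),
    ("C", (1.0, 1.0, 2.0, 2.0)),
    ("D", (3.0, 0.0, 4.0, 1.0)),
]
ordered = sorted(features, key=lambda f: bbox_hilbert_key(f[1], extent))
print([name for name, _ in ordered])
```

Sorting on such a key needs the dataset extent up front, and, as written here, the keys of all features in memory.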

rouault · 2024-02-02T15:50:20Z

> For the SORT_BY_BBOX option, one thought: currently you made it into a YES/NO option, but I can imagine that there could be multiple sort methods

For the sake of simplicity, I'd prefer not to have to support several methods... We might switch to something better, but I'd see that as an implementation detail. The current implementation might be tunable. I'm not super convinced that the points where I flush row groups are ideal for bbox compactness (there is sometimes significant overlap between different row groups), but I couldn't find an obvious way to improve things.
Would Hilbert codes give better results?

> then use the Arrow C++ APIs to sort the data based on those values before writing to Parquet.

Do you have pointers to the API doc for that sorting API? But I'd assume that you have to ingest the whole file into memory? Besides the "simplicity" of re-using the GeoPackage RTree, one of its advantages is that it can work with files much larger than RAM. Users have for example run into RAM issues when generating very, very large FlatGeobuf files when the driver builds its packed Hilbert R-tree, which requires storing something like 84 bytes per feature in RAM, and thus if your number of features reaches 1 billion...
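
As a back-of-the-envelope check of that memory concern, taking the "something like 84 bytes per feature" figure at face value (it is an approximation, and `index_ram_bytes` is just an illustrative helper):

```python
BYTES_PER_FEATURE = 84  # approximate figure quoted above


def index_ram_bytes(feature_count: int) -> int:
    """Rough RAM needed to hold the in-memory index entries."""
    return feature_count * BYTES_PER_FEATURE


for count in (3_200_000, 100_000_000, 1_000_000_000):
    gib = index_ram_bytes(count) / 2**30
    print(f"{count:>13,} features -> {gib:8.2f} GiB")
```

For the ~3.2 million features of the test dataset this is a few hundred megabytes, but at 1 billion features it is about 84 GB (roughly 78 GiB), which is more RAM than most machines have.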


coveralls · 2024-03-17T15:36:31Z

coverage: 68.918% (+0.04%) from 68.876%
when pulling eb3d124 on rouault:parquet_bbox_field
into 1f4ead7 on OSGeo:master.

rouault added this to the 3.9.0 milestone Feb 1, 2024

rouault mentioned this pull request Feb 1, 2024

Introduce bounding box column definition opengeospatial/geoparquet#191

Merged

rouault changed the title ~~Parquet: implement using and writing bounding box colum, for faster spatial filtering~~ Parquet: implement using and writing bounding box column, for faster spatial filtering Feb 1, 2024

rouault added 9 commits March 17, 2024 14:50

Arrow/Parquet: GetArrowSchema(): potential fix when there are ignored fields and the FID column is not the first one

d184617

Parquet: writer: implement creation of covering.bbox struct columns. Ignored on reading side

aea729d

OGRParquetLayer::ReadNextBatch(): avoid potential assertion failure

a98c6c0

Parquet: GetStats::max(): ignore empty row groups

beaa6fd

Parquet: reader: use covering.bbox struct columns for faster spatial filtering

469fcae

OGRFeature: add SerializeToBinary() / DeserializeFromBinary()

22220cc

Add CPLDebugProgress() to display a debugging message indicating a progression

7b62950

SWIG: expose OLCFastWriteArrowBatch

6b4cc47

rouault force-pushed the parquet_bbox_field branch from 54eaeab to eb3d124 Compare March 17, 2024 14:10

rouault merged commit c53727c into OSGeo:master Mar 17, 2024
32 checks passed

Andreasox mentioned this pull request Mar 21, 2024

Compression of ADAPT serialized datasets ADAPT/Standard#127

Open
