polygonize: using the two-arm chains edgetracing algorithm #7344

Merged
12 commits merged into OSGeo:master on Mar 7, 2023

Conversation

@kikitte (Contributor) commented Mar 3, 2023:

What does this PR do?

This PR implements polygonize using the Two-Arm Chains EdgeTracing Algorithm, which does a faster, more memory-efficient, and more robust polygonize job. The algorithm is described in Junhua Teng, Fahui Wang, Yu Liu, "An Efficient Algorithm for Raster-to-Vector Data Conversion".

It is:

  • Fast
    The Two-Arm Chains EdgeTracing Algorithm is as fast as the original algorithm when processing small datasets, but it is much faster than the original algorithm on large datasets. Please check the benchmarks below for details.
  • Robust
    This algorithm handles many special cases gracefully and produces 'correct' polygon geometry without topology errors. The polygons it produces follow the right-hand rule (counterclockwise exterior rings, clockwise interior rings).
  • Easy to extend
    It is easy to extend this algorithm to do other jobs by deriving from the PolygonReceiver base class, for example finding the boundary cells of a raster.
  • Less memory usage
    This algorithm uses less memory than the original in all tests.
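
Since the new polygonizer is plugged in behind the existing GDALPolygonize() entry point (gdal_polygonize.py is a thin wrapper around it), it can also be exercised from the GDAL Python bindings. Below is a minimal usage sketch; the input/output file names and the 'DN' field are placeholders for illustration, not part of this PR:

from osgeo import gdal, ogr

src_ds = gdal.Open('input.tif')
src_band = src_ds.GetRasterBand(1)

drv = ogr.GetDriverByName('ESRI Shapefile')
dst_ds = drv.CreateDataSource('polygonized.shp')
dst_layer = dst_ds.CreateLayer('polygonized', srs=None)
dst_layer.CreateField(ogr.FieldDefn('DN', ogr.OFTInteger))

# Pass None as the mask band to polygonize every pixel, or
# src_band.GetMaskBand() to honour nodata; pass ['8CONNECTED=8']
# in the options list for 8-connectedness.
gdal.Polygonize(src_band, None, dst_layer, 0, [], callback=None)
dst_ds = None  # flush and close the output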

Benchmarks

Tested on a machine with an AMD 4800H CPU, 16 GB RAM, and a 1 TB NVMe SSD, running Arch Linux. GDAL 3.6.2 is used as the original ("Origin") algorithm.

GDEM Colorized Map (4320x2160, 26.7 MiB)

4-connected:

Algorithm | Command | Feature Count | Time (m:ss) | Peak Memory Usage (kbytes)
Origin | time -v gdal_polygonize.py -q GDEM-10km-colorized.tif origin/GDEM-10km-colorized.shp | 1,742,271 | 0:10.39 | 205,384
TwoArm EdgeTracing | time -v gdal_polygonize.py -q GDEM-10km-colorized.tif twoarm/GDEM-10km-colorized.shp | 1,742,271 | 0:10.27 | 177,828

8-connected:

Algorithm | Command | Feature Count | Time (m:ss) | Peak Memory Usage (kbytes)
Origin | time -v gdal_polygonize.py -8 -q GDEM-10km-colorized.tif origin/GDEM-10km-colorized_8c.shp | 1,585,277 | 0:10.03 | 198,956
TwoArm EdgeTracing | time -v gdal_polygonize.py -8 -q GDEM-10km-colorized.tif twoarm/GDEM-10km-colorized_8c.shp | 1,585,277 | 0:09.65 | 173,736

random_grid.tif (5000x5000, 23.9 MiB)

This file is created with the following Python script:

import numpy as np
from osgeo import gdal_array

# 5000x5000 grid of random integer values in {0, 1, 2}
xsize = 5000
ysize = 5000
arr = np.random.randint(3, size=xsize * ysize).reshape(ysize, xsize).astype(np.int8)
gdal_array.SaveArray(arr, 'random_grid.tif')

4-connected:

Algorithm | Command | Feature Count | Time (m:ss) | Peak Memory Usage (kbytes)
Origin | time -v gdal_polygonize.py -q random_grid.tif origin/random_grid.shp | 9,274,519 | 1:06.47 | 569,468
TwoArm EdgeTracing | time -v gdal_polygonize.py -q random_grid.tif twoarm/random_grid.shp | 9,274,519 | 1:07.29 | 465,776

8-connected:

Algorithm | Command | Feature Count | Time (m:ss) | Peak Memory Usage (kbytes)
Origin | time -v gdal_polygonize.py -8 -q random_grid.tif origin/random_grid_8c.shp | 2,706,688 | 0:56.10 | 324,648
TwoArm EdgeTracing | time -v gdal_polygonize.py -8 -q random_grid.tif twoarm/random_grid_8c.shp | 2,706,688 | 0:36.80 | 272,500

OR_NLCD_2011 (converted to GeoTIFF) (21959x16118, 337.6 MiB)

4-connected:

Algorithm | Command | Feature Count | Time (m:ss) | Peak Memory Usage (kbytes)
Origin | time -v gdal_polygonize.py -q nlcd_or_20111.tif origin/nlcd_or_20111.shp | 5,699,595 | 12:32.31 | 1,246,096
TwoArm EdgeTracing | time -v gdal_polygonize.py -q nlcd_or_20111.tif twoarm/nlcd_or_20111.shp | 5,699,595 | 1:56.53 | 1,018,524

8-connected:

Algorithm | Command | Feature Count | Time (m:ss) | Peak Memory Usage (kbytes)
Origin | time -v gdal_polygonize.py -8 -q nlcd_or_20111.tif origin/nlcd_or_20111_8c.shp | 1,877,971 | 33:53.34 | 1,211,828
TwoArm EdgeTracing | time -v gdal_polygonize.py -8 -q nlcd_or_20111.tif twoarm/nlcd_or_20111_8c.shp | 1,877,971 | 2:47.36 | 953,976

@rouault (Member) left a comment:

Amazing work! I've issued a pull request against your fork in kikitte#1 with various non-substantial improvements.
Are you connected/affiliated with the authors of the paper, or did you do the implementation just from it? (I've had a look at it, and although I see the relationship between your code and the paper, I also see that there are various gaps you had to fill in.)

@@ -664,9 +172,6 @@ static CPLErr GDALPolygonizeT(GDALRasterBandH hSrcBand,
eErr = GDALRasterIO(hSrcBand, GF_Read, 0, iY, nXSize, 1, panThisLineVal,
nXSize, 1, eDT, 0, 0);

if (eErr == CE_None && hMaskBand != nullptr)

Member commented:

Is the removal of the masking appropriate here? Isn't there a risk of setting a different ID in the first and second pass if both don't apply the masking?

@kikitte (Contributor Author) commented Mar 6, 2023:

This is indeed a problem; we need to mask the raster data while reading.
Since the polygonizer needs a fully labeled raster as its input (via Polygonizer::processLine), any zero area defined by the mask should be assigned a unique value, but finding that value looks like a time-consuming job.

@kikitte (Contributor Author) commented Mar 6, 2023:

The most common case is using the nodata area as the mask, so the nodata value serves as the unique value, and everything looks OK in that case.
I also think the logic of the polygonizer can be updated to take the "invalid polygon" concept into account (its polygon id is -1 if masking is applied while reading).

Member commented:

Agreed that nodata is probably the main use case, but we should make sure that it works with an arbitrary mask band, not necessarily tied to nodata. I would suggest you restore the use of the mask band in the first pass in this initial pull request, and potentially come up with improvements/optimizations in a follow-up pull request. I don't think the main performance improvements of the new algorithm are related to that.

@kikitte (Contributor Author) commented Mar 6, 2023:

I've tried to fix this problem; it should work well, but with a small increase in memory usage because the polygonizer may hold the invalid polygon for a long time. UPDATE: I found a problem (the number of output polygons changed) when running an 8-connectedness conversion.

Member commented:

Do we need an extra test case in autotest to catch those cases?

@kikitte (Contributor Author) commented Mar 7, 2023:

No, there is no problem. It was caused by my carelessness; the result is actually correct. I don't remember why I wrote the wrong feature count in the benchmark table, but I have now updated it.
Yes, we'd better add a non-nodata mask as a test case; it would help us avoid this situation in the future.

Member commented:

"Yes, we'd better add a non-nodata mask as a test case"

Do you want to add that in this pull request?

@kikitte (Contributor Author):

Yes, I've just added a test case.
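
For illustration only, a test with a non-nodata mask could look roughly like the sketch below: it builds a small in-memory raster plus a separate mask raster, polygonizes with the explicit mask band, and checks that masked pixels contribute no features. The pixel values and expected feature count here are hypothetical, not taken from the actual test added in this PR:

import numpy as np
from osgeo import gdal, ogr

# 3x3 raster with three pixel values and no nodata value set.
src_ds = gdal.GetDriverByName('MEM').Create('', 3, 3, 1, gdal.GDT_Byte)
src_band = src_ds.GetRasterBand(1)
src_band.WriteArray(np.array([[1, 1, 3],
                              [1, 2, 3],
                              [1, 1, 3]], dtype=np.uint8))

# Explicit mask band not derived from nodata: 0 = masked, 255 = valid.
# The rightmost column (value 3) is masked out.
mask_ds = gdal.GetDriverByName('MEM').Create('', 3, 3, 1, gdal.GDT_Byte)
mask_band = mask_ds.GetRasterBand(1)
mask_band.WriteArray(np.array([[255, 255, 0],
                               [255, 255, 0],
                               [255, 255, 0]], dtype=np.uint8))

ogr_ds = ogr.GetDriverByName('Memory').CreateDataSource('')
layer = ogr_ds.CreateLayer('out', srs=None)
layer.CreateField(ogr.FieldDefn('DN', ogr.OFTInteger))

gdal.Polygonize(src_band, mask_band, layer, 0, [], callback=None)

# Without the mask there would be three polygons (values 1, 2 and 3);
# with the rightmost column masked out, only two remain.
assert layer.GetFeatureCount() == 2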

@kikitte (Contributor Author) commented Mar 6, 2023:

No, I have no connections to the authors of the paper; I did the implementation just from it. The paper indeed lacks clarity in many details (though it presents a nice idea for a raster-to-vector algorithm), so I filled in those gaps during implementation on my own.

@rouault rouault merged commit 57bdd7f into OSGeo:master Mar 7, 2023
@rouault rouault added this to the 3.7.0 milestone Mar 7, 2023
@rouault (Member) commented Mar 7, 2023:

Merged. Thanks again @kikitte!
