polygonize: using the two-arm chains edgetracing algorithm #7344

Merged
12 commits merged into OSGeo:master on Mar 7, 2023

Conversation

@kikitte (Contributor) commented Mar 3, 2023:

What does this PR do?

This PR implements polygonize using the Two-Arm Chains EdgeTracing Algorithm, which does a faster, more memory-efficient, and more robust polygonize job. The algorithm is described in Junhua Teng, Fahui Wang, Yu Liu, "An Efficient Algorithm for Raster-to-Vector Data Conversion".

It is:

  • Fast
    The Two-Arm Chains EdgeTracing Algorithm is as fast as the original algorithm when processing small datasets, but it is much faster than the original algorithm on large datasets. Please check the benchmarks below for details.
  • Robust
    This algorithm handles many special cases gracefully and produces 'correct' polygon geometry without topology errors. The polygons it produces follow the right-hand rule (counterclockwise exterior rings, clockwise interior rings).
  • Easy to extend
    It is easy to extend this algorithm to do other jobs by deriving from the PolygonReceiver base class, for example finding the boundary cells of a raster.
  • Less memory usage
    This algorithm uses less memory than the original in all tests.
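
Since the new polygonizer is plugged in behind the existing GDALPolygonize() entry point (gdal_polygonize.py is a thin wrapper around it), it can also be exercised from the GDAL Python bindings. Below is a minimal usage sketch; the input/output file names and the 'DN' field are placeholders for illustration, not part of this PR:

from osgeo import gdal, ogr

src_ds = gdal.Open('input.tif')
src_band = src_ds.GetRasterBand(1)

drv = ogr.GetDriverByName('ESRI Shapefile')
dst_ds = drv.CreateDataSource('polygonized.shp')
dst_layer = dst_ds.CreateLayer('polygonized', srs=None)
dst_layer.CreateField(ogr.FieldDefn('DN', ogr.OFTInteger))

# Pass None as the mask band to polygonize every pixel, or
# src_band.GetMaskBand() to honour nodata; pass ['8CONNECTED=8']
# in the options list for 8-connectedness.
gdal.Polygonize(src_band, None, dst_layer, 0, [], callback=None)
dst_ds = None  # flush and close the output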

Benchmarks

Tested on a machine with an AMD 4800H CPU, 16 GB RAM, and a 1 TB NVMe SSD, running Arch Linux. GDAL 3.6.2 is used as the original ("Origin") algorithm.

GDEM Colorized Map (4320x2160, 26.7 MiB)

4-connected:

Algorithm | Command | Feature Count | Time (m:ss) | Peak Memory Usage (kbytes)
Origin | time -v gdal_polygonize.py -q GDEM-10km-colorized.tif origin/GDEM-10km-colorized.shp | 1,742,271 | 0:10.39 | 205,384
TwoArm EdgeTracing | time -v gdal_polygonize.py -q GDEM-10km-colorized.tif twoarm/GDEM-10km-colorized.shp | 1,742,271 | 0:10.27 | 177,828

8-connected:

Algorithm | Command | Feature Count | Time (m:ss) | Peak Memory Usage (kbytes)
Origin | time -v gdal_polygonize.py -8 -q GDEM-10km-colorized.tif origin/GDEM-10km-colorized_8c.shp | 1,585,277 | 0:10.03 | 198,956
TwoArm EdgeTracing | time -v gdal_polygonize.py -8 -q GDEM-10km-colorized.tif twoarm/GDEM-10km-colorized_8c.shp | 1,585,277 | 0:09.65 | 173,736

random_grid.tif (5000x5000, 23.9 MiB)

This file is created with the following Python script:

import numpy as np
from osgeo import gdal_array

# 5000x5000 grid of random integer values in {0, 1, 2}
xsize = 5000
ysize = 5000
arr = np.random.randint(3, size=xsize * ysize).reshape(ysize, xsize).astype(np.int8)
gdal_array.SaveArray(arr, 'random_grid.tif')

4-connected:

Algorithm | Command | Feature Count | Time (m:ss) | Peak Memory Usage (kbytes)
Origin | time -v gdal_polygonize.py -q random_grid.tif origin/random_grid.shp | 9,274,519 | 1:06.47 | 569,468
TwoArm EdgeTracing | time -v gdal_polygonize.py -q random_grid.tif twoarm/random_grid.shp | 9,274,519 | 1:07.29 | 465,776

8-connected:

Algorithm | Command | Feature Count | Time (m:ss) | Peak Memory Usage (kbytes)
Origin | time -v gdal_polygonize.py -8 -q random_grid.tif origin/random_grid_8c.shp | 2,706,688 | 0:56.10 | 324,648
TwoArm EdgeTracing | time -v gdal_polygonize.py -8 -q random_grid.tif twoarm/random_grid_8c.shp | 2,706,688 | 0:36.80 | 272,500

OR_NLCD_2011 (converted to GeoTIFF) (21959x16118, 337.6 MiB)

4-connected:

Algorithm | Command | Feature Count | Time (m:ss) | Peak Memory Usage (kbytes)
Origin | time -v gdal_polygonize.py -q nlcd_or_20111.tif origin/nlcd_or_20111.shp | 5,699,595 | 12:32.31 | 1,246,096
TwoArm EdgeTracing | time -v gdal_polygonize.py -q nlcd_or_20111.tif twoarm/nlcd_or_20111.shp | 5,699,595 | 1:56.53 | 1,018,524

8-connected:

Algorithm | Command | Feature Count | Time (m:ss) | Peak Memory Usage (kbytes)
Origin | time -v gdal_polygonize.py -8 -q nlcd_or_20111.tif origin/nlcd_or_20111_8c.shp | 1,877,971 | 33:53.34 | 1,211,828
TwoArm EdgeTracing | time -v gdal_polygonize.py -8 -q nlcd_or_20111.tif twoarm/nlcd_or_20111_8c.shp | 1,877,971 | 2:47.36 | 953,976

@rouault (Member) left a comment:

Amazing work! I've issued a pull request against your fork in kikitte#1 with various non-substantial improvements.
Are you connected/affiliated with the authors of the paper, or did you do the implementation just from it? (I've had a look at it, and although I see the relationship between your code and the paper, I also see that there are various gaps you had to fill in.)

@@ -664,9 +172,6 @@ static CPLErr GDALPolygonizeT(GDALRasterBandH hSrcBand,
eErr = GDALRasterIO(hSrcBand, GF_Read, 0, iY, nXSize, 1, panThisLineVal,
nXSize, 1, eDT, 0, 0);

if (eErr == CE_None && hMaskBand != nullptr)

Member commented:

Is the removal of the masking appropriate here? Isn't there a risk of setting a different ID in the first and second pass if both don't apply the masking?

@kikitte (Contributor Author) commented Mar 6, 2023:

This is indeed a problem; we need to mask the raster data while reading.
Since the polygonizer needs a fully labeled raster as its input (via Polygonizer::processLine), any zero area defined by the mask should be assigned a unique value, but finding that value looks like a time-consuming job.

@kikitte (Contributor Author) commented Mar 6, 2023:

The most common case is using the nodata area as the mask, so the nodata value serves as the unique value, and everything looks OK in that case.
I also think the logic of the polygonizer can be updated to take the "invalid polygon" concept into account (its polygon id is -1 if masking is applied while reading).

Member commented:

Agreed that nodata is probably the main use case, but we should make sure that it works with an arbitrary mask band, not necessarily tied to nodata. I would suggest you restore the use of the mask band in the first pass in this initial pull request, and potentially come up with improvements/optimizations in a follow-up pull request. I don't think the main performance improvements of the new algorithm are related to that.

@kikitte (Contributor Author) commented Mar 6, 2023:

I've tried to fix this problem; it should work well, but with a small increase in memory usage because the polygonizer may hold the invalid polygon for a long time. UPDATE: I found a problem (the number of output polygons changed) when running an 8-connectedness conversion.

Member commented:

Do we need an extra test case in autotest to catch those cases?

@kikitte (Contributor Author) commented Mar 7, 2023:

No, there is no problem. It was caused by my carelessness; the result is actually correct. I don't remember why I wrote the wrong feature count in the benchmark table, but I have now updated it.
Yes, we'd better add a non-nodata mask as a test case; it would help us avoid this situation in the future.

Member commented:

"Yes, we'd better add a non-nodata mask as a test case"

Do you want to add that in this pull request?

@kikitte (Contributor Author):

Yes, I've just added a test case.
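
For illustration only, a test with a non-nodata mask could look roughly like the sketch below: it builds a small in-memory raster plus a separate mask raster, polygonizes with the explicit mask band, and checks that masked pixels contribute no features. The pixel values and expected feature count here are hypothetical, not taken from the actual test added in this PR:

import numpy as np
from osgeo import gdal, ogr

# 3x3 raster with three pixel values and no nodata value set.
src_ds = gdal.GetDriverByName('MEM').Create('', 3, 3, 1, gdal.GDT_Byte)
src_band = src_ds.GetRasterBand(1)
src_band.WriteArray(np.array([[1, 1, 3],
                              [1, 2, 3],
                              [1, 1, 3]], dtype=np.uint8))

# Explicit mask band not derived from nodata: 0 = masked, 255 = valid.
# The rightmost column (value 3) is masked out.
mask_ds = gdal.GetDriverByName('MEM').Create('', 3, 3, 1, gdal.GDT_Byte)
mask_band = mask_ds.GetRasterBand(1)
mask_band.WriteArray(np.array([[255, 255, 0],
                               [255, 255, 0],
                               [255, 255, 0]], dtype=np.uint8))

ogr_ds = ogr.GetDriverByName('Memory').CreateDataSource('')
layer = ogr_ds.CreateLayer('out', srs=None)
layer.CreateField(ogr.FieldDefn('DN', ogr.OFTInteger))

gdal.Polygonize(src_band, mask_band, layer, 0, [], callback=None)

# Without the mask there would be three polygons (values 1, 2 and 3);
# with the rightmost column masked out, only two remain.
assert layer.GetFeatureCount() == 2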

@kikitte (Contributor Author) commented Mar 6, 2023:

No, I have no connections to the authors of the paper; I did the implementation just from it. The paper indeed lacks clarity in many details (though it presents a nice idea for a raster-to-vector algorithm), so I filled in those gaps during implementation on my own.

@rouault rouault merged commit 57bdd7f into OSGeo:master Mar 7, 2023
@rouault rouault added this to the 3.7.0 milestone Mar 7, 2023
@rouault (Member) commented Mar 7, 2023:

Merged. Thanks again @kikitte!
