
gdal2tiles: Fix max cache setting #2112

Merged: 1 commit merged into OSGeo:master from EvertEt:patch-1 on Jan 7, 2020

Conversation

@EvertEt (Contributor) commented Dec 16, 2019

Since GDAL_CACHEMAX is only read once, setting the environment variable here no longer has any effect on later calls (even multiprocessed ones).
By setting the cache directly with gdal.SetCacheMax, we can set the cache limit per process.
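
A sketch of the change, using the two lines quoted later in this thread (within gdal2tiles, where os, math, and gdal are already imported; nb_processes is the number of worker processes):

    # Original: writes the divided budget to the environment, which only
    # matters if GDAL has not yet read GDAL_CACHEMAX.
    # os.environ['GDAL_CACHEMAX'] = '%d' % int(gdal.GetCacheMax() / nb_processes)

    # Replacement: sets the block cache limit directly in the current process,
    # even though GDAL is already initialized.
    gdal.SetCacheMax(max(1024 * 1024, math.floor(gdal.GetCacheMax() / nb_processes)))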

What does this PR do?

Fixes gdal2tiles max cache setting

What are related issues/pull requests?

N/A

Tasklist

  • Add test case(s)
  • Review
  • Adjust for comments
  • All CI builds and checks have passed

Environment

This was discovered on Ubuntu Azure machines:

F4, 4 threads to tile: RAM 4/8 GB
F32, 20 threads to tile: SIGKILL
F32, 16 threads to tile: RAM 58/64 GB

This shows that the default GDAL_CACHEMAX of 5% is being used in full by every process instead of being split across processes.

@EvertEt (Contributor, Author) commented Dec 16, 2019

I am not familiar with your testing infrastructure. We could add a test for this by using more than 20 threads. Could anyone point me in the right direction or take a stab at this? (If so, how should I validate that the test fails before this fix?)

@rouault (Member) commented Dec 16, 2019

I'm not sure how to write a test for that. It is hard to check this behaviour without instrumentation. The gdal2tiles tests are in autotest/pyscripts/test_gdal2tiles.py.

@EvertEt (Contributor, Author) commented Dec 16, 2019

The only easy way I can think of is to write a test with 30 threads, verify it fills up RAM, and make sure it just runs with this fix.

@rouault (Member) commented Dec 16, 2019

verify it fills up RAM, and make sure it just runs with this fix.

As much as I like tests to go with pull requests, this honestly seems a bit complicated here, and I'd be concerned about portability issues in doing that check across all the OSes...

@rouault (Member) commented Dec 17, 2019

That said, are you sure your change works...? gdal.SetCacheMax() only has effect in the current process. It should not affect spawned worker processes...

@EvertEt (Contributor, Author) commented Dec 18, 2019

I am pretty sure it works. Since I am not experienced in this I can only speculate, but since the worker processes are spawned from the same Python script, I think GDAL is re-used and only initialized once.

Some debugging tests: I added print(gdal.GetCacheMax()) after the "ReadRaster Extent" log in create_base_tile. Before calling gdal2tiles.py with subprocess.check_call, I set GDAL_CACHEMAX to 31104 and nb_processes to 32.

With the fix:
GDAL_CACHEMAX: 3.26149e+10 bytes (31104 MB)
gdal.GetCacheMax(): 1019215872
3.26149e+10 / 1019215872 = 32

Without the fix:
GDAL_CACHEMAX: 3.25992e+10 bytes (31089 MB)
gdal.GetCacheMax(): 32599179264
3.25992e+10 / 32599179264 = 1

@rouault (Member) commented Dec 18, 2019

I think GDAL is re-used and only initialized once.

Is that on Linux? If so, according to https://docs.python.org/3/library/multiprocessing.html#contexts-and-start-methods, multiprocessing uses the "fork" start method by default, and thus GDAL is indeed only initialized once. But on Mac & Windows, I bet this would fail. So perhaps the fix is to have both the original code and your change!
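
A minimal sketch of the difference (a hypothetical standalone script, not part of gdal2tiles or this PR; report_cache is an illustrative name):

    import multiprocessing
    from osgeo import gdal

    def report_cache(_):
        # Under "fork", the child inherits the parent's in-memory setting.
        # Under "spawn" (the Windows default), GDAL initializes fresh in the
        # child and reads GDAL_CACHEMAX from the environment instead.
        return gdal.GetCacheMax()

    if __name__ == '__main__':
        gdal.SetCacheMax(64 * 1024 * 1024)  # 64 MB, set in the parent only
        with multiprocessing.Pool(2) as pool:
            print(pool.map(report_cache, range(2)))

With "fork", both workers should report the parent's 64 MB; with "spawn", they would report whatever GDAL_CACHEMAX (or the 5% default) gives instead.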

@EvertEt (Contributor, Author) commented Dec 18, 2019

Linux indeed. Would you be able to replicate the experiment above on a different platform?

@rouault (Member) commented Dec 18, 2019

Would you be able to replicate the experiment above on a different platform?

If I had time... but I'm not scaling very well...

@EvertEt (Contributor, Author) commented Dec 18, 2019

We can leave both in, as it won't hurt, and link the docs you mentioned?

@rouault (Member) commented Dec 18, 2019

We can leave both in, as it won't hurt, and link the docs you mentioned?

Yes, a comment explaining this weird setup would be needed. I guess the os.environ[] assignment has to be done before gdal.SetCacheMax().

Once you're satisfied with a fix, maybe @jratike80 would be willing to have a try on Windows (no obligation Jukka...)
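
A sketch of the combined setup being discussed, in the suggested order (variable names follow the gdal2tiles snippet quoted below; this is an illustration, not necessarily the exact merged code):

    # Set the environment variable first: workers started with "spawn" pick up
    # the divided budget when GDAL initializes in the child process. Note that
    # gdal.GetCacheMax() here still returns the undivided parent value.
    os.environ['GDAL_CACHEMAX'] = '%d' % int(gdal.GetCacheMax() / nb_processes)
    # Then set it directly: workers started with "fork" inherit an already
    # initialized GDAL, where only SetCacheMax() still has an effect.
    gdal.SetCacheMax(max(1024 * 1024, math.floor(gdal.GetCacheMax() / nb_processes)))

Doing it in the other order would divide the budget twice, since gdal.GetCacheMax() would already return the reduced value.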

@jratike80 (Collaborator) commented Dec 18, 2019

I hope I did this right on Windows, by commenting out either the original line or the suggested one:

    # Make sure that all processes do not consume more than GDAL_CACHEMAX
    os.environ['GDAL_CACHEMAX'] = '%d' % int(gdal.GetCacheMax() / nb_processes)
    # gdal.SetCacheMax(max(1024 * 1024, math.floor(gdal.GetCacheMax() / nb_processes)))

The original code with os.environ, run with SET GDAL_CACHEMAX=31104 and
gdal2tiles --processes=4 file1.tif delete -v
produces 8 153 726 976.

The new line with gdal.SetCacheMax() gives 32 614 907 904.

@EvertEt (Contributor, Author) commented Dec 18, 2019

Is the number the output of this? Because you need to log inside the worker processes:

I added print(gdal.GetCacheMax()) after the ReadRaster Extent log in create_base_tile

The first result seems very weird to me, as it is higher than the set value. The second one does seem correct, but it is not divided by the number of processes.
You should not need the verbose log.

Thank you for trying it out already. I will provide a better example soon if I can.

@jratike80 (Collaborator) commented:

I am not a programmer, so I need rather explicit advice. I searched for the text "ReadRaster Extent", found one occurrence in a comment and one in code, and added your snippet here:

    if options.verbose:
        print("\tReadRaster Extent: ",
              (rx, ry, rxsize, rysize), (wx, wy, wxsize, wysize))
        print(gdal.GetCacheMax())

It is around line 939.

@EvertEt (Contributor, Author) commented Dec 18, 2019

That is the correct place. (It does indeed need the verbose flag this way.)
I will investigate further why you are getting these values, since the original seems super wrong (higher) and the other one should be divided (expected, given the above).

@jratike80 (Collaborator) commented:

I checked the numbers and rewrote them with thousand separators; they should be easier to read now.

@EvertEt (Contributor, Author) commented Dec 18, 2019

Oh yes, totally my bad. It seems we do need both solutions for this to work on all platforms. I will improve the PR and check back with both. Thanks a lot!

@EvertEt (Contributor, Author) commented Dec 21, 2019

@jratike80 Could you check again? It should now behave the same in both cases.

@EvertEt (Contributor, Author) commented Jan 6, 2020

Is any more work expected on this PR from my side, @rouault? Thanks!

@rouault (Member) commented Jan 6, 2020

Is any more work expected on this PR from my side, @rouault?

Looks good to me. Let's give @jratike80 a chance to have another try when he has some time.

@rouault (Member) commented Jan 7, 2020

Merging this, as it looks reasonable and I want to pull it into the 3.0 branch as well for tomorrow's release.

rouault merged commit 870ec8f into OSGeo:master on Jan 7, 2020
EvertEt deleted the patch-1 branch on January 7, 2020 at 18:55
@EvertEt (Contributor, Author) commented Jan 8, 2020

OK, thanks! Will it be backported as well?

@rouault (Member) commented Jan 8, 2020

Yes, it was backported in time for GDAL 3.0.3 and 2.4.4 RC1.
