Memory leak with 3.8.x and gdal_translate from a large VRT #8967

Closed
trharris78 opened this issue Dec 16, 2023 · 1 comment · Fixed by #8969

trharris78 commented Dec 16, 2023

I have a task that joins a bunch of GeoTIFF tiles with a VRT, then translates the VRT to one big GeoTIFF. When I ran this task in AWS in a Docker container, the task failed due to memory usage.

I originally observed this behavior with GDAL 3.8.1. I tried it both in the official Docker image, ghcr.io/osgeo/gdal:ubuntu-small-3.8.1, and on OSX natively (not in Docker). I also tried the official 3.8.0 Docker image and it appears to happen there, but not with the official 3.7.3 image. So possibly some change introduced in 3.8.0 is causing this.

My original data was 346 separate TIFs, approximately 3 GB in total, wrapped by a VRT. I have since tried to come up with a reasonable reproduction case, but it requires a large number of input files for the ever-increasing memory usage to become obvious. I have attached one TIF, template.tif, and a simple Python script, create_test_data.py. The script duplicates the TIF and edits each copy's geotransform to make a spatially extended grid, then builds a VRT from all the tiles in the grid. The script is set up to make a grid of 20x20 tiles and takes 1-2 minutes to run on my 2019 MacBook. The simulated grid takes up about 460MB on disk.
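The attached create_test_data.py is not reproduced in this thread. As a rough sketch of the grid arithmetic it describes (not the actual script), the function below computes one geotransform per tile for a north-up template. The function name, the template origin and the pixel size are all hypothetical. The 4896-pixel tile size is inferred from the reported 97920 x 97920 output divided across a 20x20 grid.

```python
def grid_geotransforms(template_gt, tile_width, tile_height, cols, rows):
    """Return {(col, row): geotransform} for a grid of identical tiles.

    template_gt uses GDAL's six-element order:
    (x_origin, pixel_width, row_rotation, y_origin, column_rotation, pixel_height)
    where pixel_height is normally negative for north-up rasters.
    This sketch assumes a north-up template (both rotation terms are zero).
    """
    x0, px_w, rot_x, y0, rot_y, px_h = template_gt
    transforms = {}
    for row in range(rows):
        for col in range(cols):
            # Shift the origin by whole tiles; pixel size stays unchanged.
            transforms[(col, row)] = (
                x0 + col * tile_width * px_w,
                px_w,
                rot_x,
                y0 + row * tile_height * px_h,
                rot_y,
                px_h,
            )
    return transforms


# Hypothetical template: origin at (0, 0), 1-unit pixels, north-up.
template = (0.0, 1.0, 0.0, 0.0, 0.0, -1.0)
tiles = grid_geotransforms(template, 4896, 4896, cols=20, rows=20)
```

With the GDAL Python bindings, each tuple could then be written to a copy of template.tif opened in update mode via SetGeoTransform, and the copies combined with gdal.BuildVRT. Whether the attached script does it this way is an assumption.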

I am using memory-profiler to produce the plots below. The gdal_translate command runs within mprof run to collect a memory trace. The plot can be displayed with mprof plot.

Steps to reproduce:

  1. python create_test_data.py
  2. mprof run -o mprof.dat gdal_translate -co BIGTIFF=YES -co TILED=YES -co COMPRESS=DEFLATE -co NUM_THREADS=ALL_CPUS --config GDAL_TIFF_INTERNAL_MASK YES test_data.vrt test_data.tif
  3. mprof plot mprof.dat
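If memory-profiler is not available, a coarser check is possible with the Python standard library alone. The sketch below is not part of the original report: it runs a command and reads the peak resident set size of child processes from getrusage. It gives a single peak rather than a trace over time, and the helper name is hypothetical.

```python
import resource
import subprocess
import sys


def peak_child_rss_mib(cmd):
    """Run cmd to completion and return (exit_code, peak RSS in MiB).

    RUSAGE_CHILDREN reports the largest peak among all children this
    process has waited for so far, so call this once per fresh
    interpreter for a clean number.
    """
    proc = subprocess.run(cmd, check=False)
    usage = resource.getrusage(resource.RUSAGE_CHILDREN)
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return proc.returncode, usage.ru_maxrss / divisor


# Harmless stand-in command; replace it with the gdal_translate
# invocation from step 2 to measure the real workload.
exit_code, peak_mib = peak_child_rss_mib([sys.executable, "-c", "pass"])
print(f"exit code {exit_code}, peak RSS {peak_mib:.1f} MiB")
```

Comparing the reported peak between the 3.7.3 and 3.8.1 images would show the same difference as the plots, though without the growth curve.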

This first plot is from a run in the official 3.8.1 Docker image. I started the Docker container with -m 8000m to tell Docker to provide only 8GB of RAM to the container. The gdal_translate command ran for a bit but was killed once memory usage reached 8GB.

Command output:

Input file size is 97920, 97920
0...10...20...30...40...50...60.root@42186bc7788c:/data#

Memory usage:
[image: mprof memory plot for the 3.8.1 run, killed at the 8GB limit]

I thought some of the caching configuration options might help, so I set these very small values:

export GDAL_CACHEMAX=10 VSI_CACHE=FALSE VSI_CACHE_SIZE=1000000 GDAL_MAX_DATASET_POOL_SIZE=2 GDAL_MAX_DATASET_POOL_RAM_USAGE=10MB

I ran the same command, and this time it did complete, but the memory usage still increased continuously. It just didn't hit the 8GB limit I set for the Docker container:

[image: mprof memory plot for the 3.8.1 run with reduced cache settings]

Running the exact same command in the older Docker image, ghcr.io/osgeo/gdal:ubuntu-small-3.7.3, produces this much more reasonable memory profile:

[image: mprof memory plot for the 3.7.3 run]

Apologies for the zip file, GitHub wouldn't let me upload the .tif and .py separately.

attachments.zip

Edit: I should note that I did see the open issue #7908 but it didn't seem to have any applicable resolution.

@rouault rouault self-assigned this Dec 16, 2023
rouault added a commit to rouault/gdal that referenced this issue Dec 16, 2023
…Geo#8967, 3.8.0 regression)

m_abyWrkBuffer and m_abyWrkBufferMask were mistakenly put at the
VRTComplexSource level, missing that there can be a big number of
sources whose lifetime is the same as the VRT dataset, and thus it
is inappropriate to have long-lived working buffers at that level.
We can actually use one single instance of them for all sources, so
move that at the dataset level.
rouault added a commit to rouault/gdal that referenced this issue Dec 16, 2023
…Geo#8967, 3.8.0 regression)

m_abyWrkBuffer and m_abyWrkBufferMask were mistakenly put at the
VRTComplexSource level, missing that there can be a big number of
sources whose lifetime is the same as the VRT dataset, and thus it
is inappropriate to have long-lived working buffers at that level.

In master, this is changed to actually use one single instance of
them for all sources, placed at the dataset level. In that branch,
we can't do that as it would break ABI. So instead clear them after
their immediate use (like was done in GDAL < 3.8)
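The commit messages above describe three buffer lifetimes: one long-lived buffer per source (the 3.8.0 regression), one shared buffer at the dataset level (the master fix), and per-source buffers cleared right after use (the 3.8 branch fix). The toy model below is a hedged illustration of why the first pattern grows with the number of sources. It is not GDAL code, and the function and strategy names are hypothetical.

```python
def retained_after_reading(strategy, n_sources, block_bytes):
    """Return the bytes still held in working buffers after every
    source has been read once, under the given buffer strategy.

    strategy is one of:
      "per_source"      - each source keeps its own buffer alive
      "shared"          - all sources reuse one dataset-level buffer
      "clear_after_use" - per-source buffers are emptied after each read
    """
    per_source = [bytearray() for _ in range(n_sources)]
    shared = bytearray()
    for i in range(n_sources):
        buf = shared if strategy == "shared" else per_source[i]
        # Grow the buffer to the size needed for this read.
        if len(buf) < block_bytes:
            buf.extend(bytes(block_bytes - len(buf)))
        # (Pixel processing would happen here.)
        if strategy == "clear_after_use":
            buf.clear()
    return len(shared) + sum(len(b) for b in per_source)


# 400 sources, matching the 20x20 test grid; 1 KiB stands in for a block.
for name in ("per_source", "shared", "clear_after_use"):
    print(name, retained_after_reading(name, 400, 1024))
```

In this model the per-source strategy retains an amount proportional to the number of sources, while the other two stay constant, which matches the ever-increasing memory usage reported for VRTs with many tiles.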
rouault (Member) commented Dec 16, 2023

Thanks for the high-quality report (here and in 8968)! Fixes queued in #8969 (for 3.9) and #8970 (for 3.8.2).

rouault added a commit that referenced this issue Dec 16, 2023
[3.8 fix] VRTComplexSource: fix excessive RAM usage with many sources (fixes #8967, 3.8.0 regression)
rouault added a commit that referenced this issue Dec 16, 2023
VRTComplexSource: fix excessive RAM usage with many sources (fixes #8967, 3.8.0 regression)
rouault added a commit to rouault/gdal that referenced this issue Dec 19, 2023
For now, only a test case representative of the one of OSGeo#8967

This CI job is only run once per day
rouault added a commit to rouault/gdal that referenced this issue Dec 20, 2023
For now, only a test case representative of the one of OSGeo#8967

This CI job is only run once per day
ralphraul pushed a commit to 1SpatialGroupLtd/gdal that referenced this issue Mar 11, 2024
…Geo#8967, 3.8.0 regression)