vsicurl cache/concurrency issue when using multiple threads #1244

Closed
tbonfort opened this issue Jan 29, 2019 · 8 comments

@tbonfort (Member) commented:

Expected behavior and actual behavior.

Concurrently opening multiple COGs through vsicurl fails intermittently. I suspect the issue is related to a race when accessing the global vsicurl block cache.

Error messages can vary:

  • ERROR 4: `/vsicurl/http://localhost/1.tif' not recognized as a supported file format.
  • Warning 1: TIFFFetchNormalTag:IO error during reading of "JPEGTables"; tag ignored
  • Warning 1: TIFFFetchNormalTag:IO error during reading of "GeoPixelScale"; tag ignored
  • ERROR 1: TIFFReadDirectory:IO error during reading of "SampleFormat"
  • Warning 1: TIFFFetchNormalTag:IO error during reading of "GDALMetadata"; tag ignored

Steps to reproduce the problem.

Here is an "artificial" test case aimed at easily reproducing the issue. Setting CPL_VSIL_CURL_CACHE_SIZE to a large value is a temporary workaround, but the issue still occurs even then.

  • Create one COG TIFF named 0.tif and put it somewhere accessible to a local web server. The headers of my test file were roughly 50 kB:
Driver: GTiff/GeoTIFF
Files: /usr/share/nginx/html/1.tif
Size is 11188, 14536
    /*snip*/
Pixel Size = (1.186746818780522,-1.186746818780522)
Metadata:
  AREA_OR_POINT=Area
Image Structure Metadata:
  COMPRESSION=YCbCr JPEG
  INTERLEAVE=PIXEL
  SOURCE_COLOR_SPACE=YCbCr
Corner Coordinates:
/*snip*/
Band 1 Block=256x256 Type=Byte, ColorInterp=Red
  Overviews: 5594x7268, 2797x3634, 1399x1817, 700x909, 350x455
  Mask Flags: PER_DATASET 
  Overviews of mask band: 5594x7268, 2797x3634, 1399x1817, 700x909, 350x455
Band 2 Block=256x256 Type=Byte, ColorInterp=Green
  Overviews: 5594x7268, 2797x3634, 1399x1817, 700x909, 350x455
  Mask Flags: PER_DATASET 
  Overviews of mask band: 5594x7268, 2797x3634, 1399x1817, 700x909, 350x455
Band 3 Block=256x256 Type=Byte, ColorInterp=Blue
  Overviews: 5594x7268, 2797x3634, 1399x1817, 700x909, 350x455
  Mask Flags: PER_DATASET 
  Overviews of mask band: 5594x7268, 2797x3634, 1399x1817, 700x909, 350x455
  • Duplicate it for testing purposes:
for i in {1..1000}; do ln -sf 0.tif $i.tif; done
  • Make sure it is accessible from http://localhost/0.tif (and 1.tif, 2.tif, etc.), or update the provided source code to point to the correct locations.

  • Compile the following program, linking against GDAL and pthreads:

#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
#include <gdal/gdal.h>

/* Each worker opens `count` datasets picked at random from `urls`. */
int worker(int workerid, int count, char **urls, int nurls) {
    int i;
    for(i=0;i<count;i++) {
        int idx = rand() % nurls;
        GDALDatasetH  hDataset = GDALOpen( urls[idx], GA_ReadOnly );
        if (hDataset == NULL) {
            fprintf(stderr,"!!!!!!!!!!!!!!!!!!!!! worker %d failed to open %s\n",workerid,urls[idx]);
            return 1;
        }
        GDALClose(hDataset);
        //fprintf(stdout,"worker %d opened %s\n",workerid,urls[idx]);
    }
    return 0;
}


typedef struct{
    int wid;
    int count;
    char **urls;
    int nurls;
} wparams;

void *worker_thread(void *p) {
    wparams *params=(wparams*)p;
    worker(params->wid,params->count,params->urls,params->nurls);
    return NULL;
}

/* usage: ./a.out [workers] [nurls] [count] */
int main(int argc, char **argv) {
    int workers = 10;
    int nurls = 10;
    int count = 10;
    if (argc>1) {
        workers = atoi(argv[1]);
    }
    if (argc>2){
        nurls = atoi(argv[2]);
    }
    if (argc>3) {
        count = atoi(argv[3]);
    }
    GDALAllRegister();
    char **urls = (char**)malloc(nurls*sizeof(char*));
    pthread_t *tids = (pthread_t*)malloc(workers*sizeof(pthread_t));
    for(int i=0;i<nurls;i++) {
        urls[i]=malloc(80);
        sprintf(urls[i],"/vsicurl/http://localhost/%d.tif",i);
    }
    for(int i=0;i<workers;i++) {
        wparams *p=malloc(sizeof(wparams));
        p->wid=i;
        p->count=count;
        p->urls=urls;
        p->nurls=nurls;
        pthread_create( &tids[i], NULL, worker_thread, (void*)p);
    }
    for(int i=0;i<workers;i++) {
        pthread_join(tids[i],NULL);
    }
    return 0;
}
  • Using a small chunk and cache size reproduces the issue very rapidly (10 concurrent workers, each performing 100 opens over a set of 10 COGs):
GDAL_CACHEMAX=64 CPL_VSIL_CURL_ALLOWED_EXTENSIONS=.tif GDAL_DISABLE_READDIR_ON_OPEN=EMPTY_DIR GDAL_HTTP_MULTIRANGE=YES GDAL_HTTP_MAX_RETRY=4 CPL_CURL_VERBOSE=FALSE GDAL_HTTP_VERSION=2 CPL_VSIL_CURL_CHUNK_SIZE=2048 CPL_DEBUG=OFF VSI_CACHE=FALSE CPL_VSIL_CURL_CACHE_SIZE=4096  ./a.out 10 10 100
  • With a more reasonable (or default) cache size, the issue is much rarer to reproduce:
GDAL_CACHEMAX=64 CPL_VSIL_CURL_ALLOWED_EXTENSIONS=.tif GDAL_DISABLE_READDIR_ON_OPEN=EMPTY_DIR GDAL_HTTP_MULTIRANGE=YES GDAL_HTTP_MAX_RETRY=4 CPL_CURL_VERBOSE=FALSE GDAL_HTTP_VERSION=2 CPL_VSIL_CURL_CHUNK_SIZE=32768 CPL_DEBUG=OFF VSI_CACHE=FALSE CPL_VSIL_CURL_CACHE_SIZE=655360  ./a.out 100 100 1000

Operating system

Ubuntu 18.04 64 bit
Ubuntu 18.10
Alpine

GDAL version and provenance

2.4.0 locally compiled

/cc @Manuscrit

@tbonfort (Member, Author) commented:

GDAL compilation:

./configure  --disable-static \
		--enable-shared \
		--with-hide-internal-symbols \
		--with-geotiff=internal \
		--with-jpeg \
		--with-libtiff=internal \
		--with-libz=internal \
		--with-png \
		--with-proj \
		--with-proj5-api \
		--with-threads \
		--without-bsb \
		--without-cfitsio \
		--without-cryptopp \
		--with-curl \
		--without-ecw \
		--without-expat \
		--without-fme \
		--without-freexl \
		--without-geos \
		--without-gif \
		--without-gnm \
		--without-grass \
		--without-grib \
		--without-hdf4 \
		--without-hdf5 \
		--without-idb \
		--without-ingres \
		--without-jasper \
		--without-jpeg12 \
		--without-jp2mrsid \
		--without-libgrass \
		--without-libkml \
		--without-libtool \
		--without-kakadu \
		--without-mrf \
		--without-mrsid \
		--without-mysql \
		--without-netcdf \
		--without-odbc \
		--without-ogdi \
		--without-openjpeg \
		--without-pcidsk \
		--without-pcraster \
		--without-pcre \
		--without-perl \
		--without-pg \
		--without-php \
		--without-python \
		--without-qhull \
		--without-sde \
		--without-spatialite \
		--without-sqlite3 \
		--without-webp \
		--without-xerces \
		--without-xml2

@fvdbergh commented Apr 9, 2019:

I can confirm that I have also run into this bug; the problem (probably a data race) only occurs when I access the files via curl, not when accessing them directly, in both cases using a fair number of threads. I happen to be using the OpenJPEG driver to read JPEG2000 files, also with GDAL 2.4.0.

I am still looking at the source to see if there is a simple way to completely disable the CPL_VSIL_CURL cache. Or has this since been fixed in 2.5.x?

@pedros007 (Contributor) commented:

@fvdbergh To disable, perhaps you could set CPL_VSIL_CURL_NON_CACHED to /vsicurl/ or / (untested), or set CPL_VSIL_CURL_CACHE_SIZE=0?
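
For reference, a minimal sketch of that suggestion, setting the option programmatically with CPLSetConfigOption before any /vsicurl/ dataset is opened (untested, and not part of the original comment):

#include <gdal/gdal.h>
#include <gdal/cpl_conv.h>

int main(void) {
    /* Mark all /vsicurl/ paths as non-cached, per the suggestion above
       (equivalent to exporting CPL_VSIL_CURL_NON_CACHED=/vsicurl/). */
    CPLSetConfigOption("CPL_VSIL_CURL_NON_CACHED", "/vsicurl/");
    GDALAllRegister();
    /* ... open /vsicurl/ datasets from worker threads as usual ... */
    return 0;
}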

@fvdbergh commented Apr 26, 2019:

@pedros007 I tried a number of things, and eventually rummaged around in the source code. Unfortunately one cannot set the cache size to zero (the code simply resets it to 16 MB). I tried hacking the GDAL 2.4.0 code so that all URLs behave as if they belong to the CPL_VSIL_CURL_NON_CACHED set. I even tried to replace the Null_lock (the default template argument to the cache) with a std::lock_guard on a std::mutex. As a last resort, I tried sprinkling a bunch of std::lock_guards around a global std::mutex I added, but to no avail; my multiple threads kept clobbering some shared state. Trying to find the actual race condition with valgrind / drd did not work either: there were far too many false positives, so I gave up looking for the problem.

Since I had been busy reprojecting a whole bunch of files from the Sentinel 2 AWS S3 bucket (/vsis3 goes through /vsicurl), I could get away with factoring out my calls to the gdalwarp(er) into little stub programs that I ended up running in sub-processes. As one would expect, that worked perfectly, even though it is not the most elegant solution.
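
A minimal sketch of that sub-process approach (hypothetical paths and arguments, not the commenter's actual code), where each gdalwarp invocation runs in its own child process so no vsicurl state is shared between concurrent jobs:

#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Run one gdalwarp job in a child process; src/dst and the target SRS
   are placeholders. Returns the child's exit status, or -1 on failure. */
static int warp_in_subprocess(const char *src, const char *dst) {
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        /* child: replace the process image with a gdalwarp invocation */
        execlp("gdalwarp", "gdalwarp", "-t_srs", "EPSG:3857", src, dst, (char*)NULL);
        _exit(127); /* only reached if exec failed */
    }
    int status = 0;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}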

@pedros007 (Contributor) commented:

@fvdbergh Ahh, you're right:

GIntBig nCacheSize = CPLAtoGIntBig(
    CPLGetConfigOption("CPL_VSIL_CURL_CACHE_SIZE", "16384000"));
if( nCacheSize < DOWNLOAD_CHUNK_SIZE ||
    nCacheSize / DOWNLOAD_CHUNK_SIZE > INT_MAX )
{
    nCacheSize = 16384000;
}

I would expect this to work: CPL_VSIL_CURL_NON_CACHED="/vsis3/"

@Kirill888 commented:

Hi, I was pointed to this issue by @sgillies from the rasterio repo. I too have experienced a similar issue when using GDAL 2.4.0 via the rasterio library in Python with a high level of concurrency.

After reading this issue I tried playing around with various cache-related settings and found that setting VSI_CACHE='YES' seems to resolve the problem for me: failures go from "consistently happening every time I run with 32 threads" to "I tried a few times and observed no errors so far".

I'm not familiar with the codebase, but VSI_CACHE seems to be used only here:

if( CPLTestBool( CPLGetConfigOption( "VSI_CACHE", "FALSE" ) ) )
    return VSICreateCachedFile( poHandle );
else
    return poHandle;
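
Applied to the reproducer above, that workaround would amount to something like the following placed before GDALAllRegister() (CPLSetConfigOption requires cpl_conv.h; the VSI_CACHE_SIZE value is only an illustration and is not from the comment):

CPLSetConfigOption("VSI_CACHE", "TRUE");          /* wrap each /vsicurl/ handle in a VSI cache */
CPLSetConfigOption("VSI_CACHE_SIZE", "30000000"); /* optional, in bytes */
GDALAllRegister();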

@rouault rouault self-assigned this Nov 12, 2019
@rouault (Member) commented Nov 12, 2019:

Candidate fix in #2012

rouault added a commit that referenced this issue Nov 12, 2019
/vsicurl (and derived filesystems): fix concurrency issue with multithreaded reads (fixes #1244)

rouault added a commit that referenced this issue Nov 12, 2019
…hreaded reads (fixes #1244)

Kudos to @tbonfort for the easy reproducer.

rouault added a commit that referenced this issue Nov 12, 2019
…hreaded reads (fixes #1244)

Kudos to @tbonfort for the easy reproducer.

@rouault (Member) commented Nov 12, 2019:

Backported to 3.0 and 2.4 branches
