vsicurl cache/concurrency issue when using multiple threads #1244

Closed
tbonfort opened this issue Jan 29, 2019 · 8 comments

@tbonfort (Member) commented:

Expected behavior and actual behavior.

Concurrently opening multiple COGs through vsicurl fails intermittently. I suspect the issue is related to a race when accessing the global vsicurl block cache.

Error messages can vary:

  • ERROR 4: `/vsicurl/http://localhost/1.tif' not recognized as a supported file format.
  • Warning 1: TIFFFetchNormalTag:IO error during reading of "JPEGTables"; tag ignored
  • Warning 1: TIFFFetchNormalTag:IO error during reading of "GeoPixelScale"; tag ignored
  • ERROR 1: TIFFReadDirectory:IO error during reading of "SampleFormat"
  • Warning 1: TIFFFetchNormalTag:IO error during reading of "GDALMetadata"; tag ignored

Steps to reproduce the problem.

Here is an "artificial" test case aimed at easily reproducing the issue. Setting CPL_VSIL_CURL_CACHE_SIZE to a large value is a temporary workaround, but the issue still occurs even then.

  • Create one COG TIFF named 0.tif and put it somewhere accessible to a local web server. The headers of my test file were roughly 50 kB:
Driver: GTiff/GeoTIFF
Files: /usr/share/nginx/html/1.tif
Size is 11188, 14536
    /*snip*/
Pixel Size = (1.186746818780522,-1.186746818780522)
Metadata:
  AREA_OR_POINT=Area
Image Structure Metadata:
  COMPRESSION=YCbCr JPEG
  INTERLEAVE=PIXEL
  SOURCE_COLOR_SPACE=YCbCr
Corner Coordinates:
/*snip*/
Band 1 Block=256x256 Type=Byte, ColorInterp=Red
  Overviews: 5594x7268, 2797x3634, 1399x1817, 700x909, 350x455
  Mask Flags: PER_DATASET 
  Overviews of mask band: 5594x7268, 2797x3634, 1399x1817, 700x909, 350x455
Band 2 Block=256x256 Type=Byte, ColorInterp=Green
  Overviews: 5594x7268, 2797x3634, 1399x1817, 700x909, 350x455
  Mask Flags: PER_DATASET 
  Overviews of mask band: 5594x7268, 2797x3634, 1399x1817, 700x909, 350x455
Band 3 Block=256x256 Type=Byte, ColorInterp=Blue
  Overviews: 5594x7268, 2797x3634, 1399x1817, 700x909, 350x455
  Mask Flags: PER_DATASET 
  Overviews of mask band: 5594x7268, 2797x3634, 1399x1817, 700x909, 350x455
  • Duplicate it for testing purposes:
for i in {1..1000}; do ln -sf 0.tif $i.tif; done
  • Make sure it is accessible from http://localhost/0.tif (and 1.tif, 2.tif, etc.), or update the provided source code to point to the correct locations.

  • Compile the following program, linking against GDAL and pthreads:

#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
#include <gdal/gdal.h>

/* Each worker opens `count` datasets picked at random from `urls`. */
int worker(int workerid, int count, char **urls, int nurls) {
    int i;
    for(i=0;i<count;i++) {
        int idx = rand() % nurls;
        GDALDatasetH  hDataset = GDALOpen( urls[idx], GA_ReadOnly );
        if (hDataset == NULL) {
            fprintf(stderr,"!!!!!!!!!!!!!!!!!!!!! worker %d failed to open %s\n",workerid,urls[idx]);
            return 1;
        }
        GDALClose(hDataset);
        //fprintf(stdout,"worker %d opened %s\n",workerid,urls[idx]);
    }
    return 0;
}


typedef struct{
    int wid;
    int count;
    char **urls;
    int nurls;
} wparams;

void *worker_thread(void *p) {
    wparams *params=(wparams*)p;
    worker(params->wid,params->count,params->urls,params->nurls);
    return NULL;
}

/* usage: ./a.out [workers] [nurls] [count] */
int main(int argc, char **argv) {
    int workers = 10;
    int nurls = 10;
    int count = 10;
    if (argc>1) {
        workers = atoi(argv[1]);
    }
    if (argc>2){
        nurls = atoi(argv[2]);
    }
    if (argc>3) {
        count = atoi(argv[3]);
    }
    GDALAllRegister();
    char **urls = (char**)malloc(nurls*sizeof(char*));
    pthread_t *tids = (pthread_t*)malloc(workers*sizeof(pthread_t));
    for(int i=0;i<nurls;i++) {
        urls[i]=malloc(80);
        sprintf(urls[i],"/vsicurl/http://localhost/%d.tif",i);
    }
    for(int i=0;i<workers;i++) {
        wparams *p=malloc(sizeof(wparams));
        p->wid=i;
        p->count=count;
        p->urls=urls;
        p->nurls=nurls;
        pthread_create( &tids[i], NULL, worker_thread, (void*)p);
    }
    for(int i=0;i<workers;i++) {
        pthread_join(tids[i],NULL);
    }
    return 0;
}
  • Using a small chunk and cache size reproduces the issue very rapidly (10 concurrent workers, each performing 100 opens over a set of 10 COGs):
GDAL_CACHEMAX=64 CPL_VSIL_CURL_ALLOWED_EXTENSIONS=.tif GDAL_DISABLE_READDIR_ON_OPEN=EMPTY_DIR GDAL_HTTP_MULTIRANGE=YES GDAL_HTTP_MAX_RETRY=4 CPL_CURL_VERBOSE=FALSE GDAL_HTTP_VERSION=2 CPL_VSIL_CURL_CHUNK_SIZE=2048 CPL_DEBUG=OFF VSI_CACHE=FALSE CPL_VSIL_CURL_CACHE_SIZE=4096  ./a.out 10 10 100
  • With a more reasonable (or default) cache size, the issue is much rarer to reproduce:
GDAL_CACHEMAX=64 CPL_VSIL_CURL_ALLOWED_EXTENSIONS=.tif GDAL_DISABLE_READDIR_ON_OPEN=EMPTY_DIR GDAL_HTTP_MULTIRANGE=YES GDAL_HTTP_MAX_RETRY=4 CPL_CURL_VERBOSE=FALSE GDAL_HTTP_VERSION=2 CPL_VSIL_CURL_CHUNK_SIZE=32768 CPL_DEBUG=OFF VSI_CACHE=FALSE CPL_VSIL_CURL_CACHE_SIZE=655360  ./a.out 100 100 1000

Operating system

Ubuntu 18.04 64 bit
Ubuntu 18.10
Alpine

GDAL version and provenance

2.4.0 locally compiled

/cc @Manuscrit

@tbonfort (Member, Author) commented:

GDAL compilation:

./configure  --disable-static \
		--enable-shared \
		--with-hide-internal-symbols \
		--with-geotiff=internal \
		--with-jpeg \
		--with-libtiff=internal \
		--with-libz=internal \
		--with-png \
		--with-proj \
		--with-proj5-api \
		--with-threads \
		--without-bsb \
		--without-cfitsio \
		--without-cryptopp \
		--with-curl \
		--without-ecw \
		--without-expat \
		--without-fme \
		--without-freexl \
		--without-geos \
		--without-gif \
		--without-gnm \
		--without-grass \
		--without-grib \
		--without-hdf4 \
		--without-hdf5 \
		--without-idb \
		--without-ingres \
		--without-jasper \
		--without-jpeg12 \
		--without-jp2mrsid \
		--without-libgrass \
		--without-libkml \
		--without-libtool \
		--without-kakadu \
		--without-mrf \
		--without-mrsid \
		--without-mysql \
		--without-netcdf \
		--without-odbc \
		--without-ogdi \
		--without-openjpeg \
		--without-pcidsk \
		--without-pcraster \
		--without-pcre \
		--without-perl \
		--without-pg \
		--without-php \
		--without-python \
		--without-qhull \
		--without-sde \
		--without-spatialite \
		--without-sqlite3 \
		--without-webp \
		--without-xerces \
		--without-xml2

@fvdbergh commented Apr 9, 2019:

I can confirm that I have also run into this bug; the problem (probably a data race) only occurs when I access the files via curl, not when accessing them directly, in both cases using a fair number of threads. I happen to be using the OpenJPEG driver to read JPEG2000 files, also with GDAL 2.4.0.

I am still looking at the source to see if there is a simple way to completely disable the CPL_VSIL_CURL cache. Or has this since been fixed in 2.5.x?

@pedros007 (Contributor) commented:

@fvdbergh To disable, perhaps you could set CPL_VSIL_CURL_NON_CACHED to /vsicurl/ or / (untested), or set CPL_VSIL_CURL_CACHE_SIZE=0?
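
For reference, a minimal sketch of that suggestion, setting the option programmatically with CPLSetConfigOption before any /vsicurl/ dataset is opened (untested, and not part of the original comment):

#include <gdal/gdal.h>
#include <gdal/cpl_conv.h>

int main(void) {
    /* Mark all /vsicurl/ paths as non-cached, per the suggestion above
       (equivalent to exporting CPL_VSIL_CURL_NON_CACHED=/vsicurl/). */
    CPLSetConfigOption("CPL_VSIL_CURL_NON_CACHED", "/vsicurl/");
    GDALAllRegister();
    /* ... open /vsicurl/ datasets from worker threads as usual ... */
    return 0;
}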

@fvdbergh commented Apr 26, 2019:

@pedros007 I tried a number of things, and eventually rummaged around in the source code. Unfortunately one cannot set the cache size to zero (the code simply resets it to 16 MB). I tried hacking the GDAL 2.4.0 code so that all URLs behave as if they belong to the CPL_VSIL_CURL_NON_CACHED set. I even tried to replace the Null_lock (the default template argument to the cache) with a std::lock_guard on a std::mutex. As a last resort, I tried sprinkling a bunch of std::lock_guards around a global std::mutex I added, but to no avail; my multiple threads kept clobbering some shared state. Trying to find the actual race condition with valgrind / drd did not work either: there were far too many false positives, so I gave up looking for the problem.

Since I had been busy reprojecting a whole bunch of files from the Sentinel 2 AWS S3 bucket (/vsis3 goes through /vsicurl), I could get away with factoring out my calls to the gdalwarp(er) into little stub programs that I ended up running in sub-processes. As one would expect, that worked perfectly, even though it is not the most elegant solution.
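
A minimal sketch of that sub-process approach (hypothetical paths and arguments, not the commenter's actual code), where each gdalwarp invocation runs in its own child process so no vsicurl state is shared between concurrent jobs:

#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Run one gdalwarp job in a child process; src/dst and the target SRS
   are placeholders. Returns the child's exit status, or -1 on failure. */
static int warp_in_subprocess(const char *src, const char *dst) {
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        /* child: replace the process image with a gdalwarp invocation */
        execlp("gdalwarp", "gdalwarp", "-t_srs", "EPSG:3857", src, dst, (char*)NULL);
        _exit(127); /* only reached if exec failed */
    }
    int status = 0;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}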

@pedros007 (Contributor) commented:

@fvdbergh Ahh, you're right:

GIntBig nCacheSize = CPLAtoGIntBig(
    CPLGetConfigOption("CPL_VSIL_CURL_CACHE_SIZE", "16384000"));
if( nCacheSize < DOWNLOAD_CHUNK_SIZE ||
    nCacheSize / DOWNLOAD_CHUNK_SIZE > INT_MAX )
{
    nCacheSize = 16384000;
}

I would expect this to work: CPL_VSIL_CURL_NON_CACHED="/vsis3/"

@Kirill888 commented:

Hi, I was pointed to this issue by @sgillies from the rasterio repo. I too have experienced a similar issue when using GDAL 2.4.0 via the rasterio library in Python with a high level of concurrency.

After reading this issue I tried playing around with various cache-related settings and found that setting VSI_CACHE='YES' seems to resolve the problem for me: failures go from "consistently happening every time I run with 32 threads" to "I tried a few times and observed no errors so far".

I'm not familiar with the codebase, but VSI_CACHE seems to be used only here:

if( CPLTestBool( CPLGetConfigOption( "VSI_CACHE", "FALSE" ) ) )
    return VSICreateCachedFile( poHandle );
else
    return poHandle;
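
Applied to the reproducer above, that workaround would amount to something like the following placed before GDALAllRegister() (CPLSetConfigOption requires cpl_conv.h; the VSI_CACHE_SIZE value is only an illustration and is not from the comment):

CPLSetConfigOption("VSI_CACHE", "TRUE");          /* wrap each /vsicurl/ handle in a VSI cache */
CPLSetConfigOption("VSI_CACHE_SIZE", "30000000"); /* optional, in bytes */
GDALAllRegister();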

@rouault rouault self-assigned this Nov 12, 2019
@rouault (Member) commented Nov 12, 2019:

Candidate fix in #2012

rouault added a commit that referenced this issue Nov 12, 2019
/vsicurl (and derived filesystems): fix concurrency issue with multithreaded reads (fixes #1244)

rouault added a commit that referenced this issue Nov 12, 2019
…hreaded reads (fixes #1244)

Kudos to @tbonfort for the easy reproducer.

rouault added a commit that referenced this issue Nov 12, 2019
…hreaded reads (fixes #1244)

Kudos to @tbonfort for the easy reproducer.

@rouault (Member) commented Nov 12, 2019:

Backported to 3.0 and 2.4 branches
