
Large netCDF-4 file reading strategy #69

Closed
PeterWarren opened this issue Sep 19, 2016 · 8 comments

@PeterWarren

We are using THREDDS (ncWMS currently, but soon to be edal-java) to render WMS layers from large (64 GB) NetCDF-4 files. To avoid hitting out-of-memory errors we need to ensure the netCDF reading strategy is set to SCANLINE. Currently, the reading-strategy chooser (getOptimumDataReadingStrategy) only selects SCANLINE if the file type is "netCDF" or "HDF4". Our files are "NetCDF-4", so the chooser falls back to the BOUNDING_BOX reading strategy and THREDDS quickly exhausts even very large memory allocations.

To avoid this we have patched our ncWMS (THREDDS 4.6) to look for "NetCDF-4" type files and force them into SCANLINE mode. We would now like to find a more permanent solution for THREDDS 5.0 and onwards.

I have 2 proposed solutions:

  1. Add NetCDF-4 to the types that go into SCANLINE mode, as we have done in our patch. However, NetCDF-4 files can be compressed, and the comments around getOptimumDataReadingStrategy suggest that compressed files should be read with BOUNDING_BOX.
  2. Implement the TODO in the getOptimumDataReadingStrategy method to "also use the size of the grids as a deciding factor", switching to SCANLINE above a chosen size in MB or data points. Perhaps this solution is more appropriate because it addresses the real issue, which is the size of the data rather than the file type.

(1) is trivial, so I won't provide any code for it.
I had a go at implementing (2) (attached below). I assumed that all NetcdfDatasets can be treated as gridded datasets; I'm not sure whether that is safe. I estimated the size of each grid by taking the product of all its dimensions.

try (GridDataset gridDataset = getGridDataset(nc)) {   // assume a gridded dataset (possibly unsafe?)
    for (GridDatatype grid : gridDataset.getGrids()) {
        DataType dt = grid.getDataType();
        int dataPointSize = dt.getSize();              // could use the data point size in the estimate
        long totalSize = 1;
        for (Dimension dim : grid.getDimensions()) {
            totalSize *= dim.getLength();              // product of all dimensions; long accumulator avoids int overflow
        }
        /* 100M data points: 100 MB of single-byte data, or 400 MB of int32/float data */
        if (totalSize > 100L * 1024 * 1024) {
            return DataReadingStrategy.SCANLINE;
        }
    }
} catch (DataReadingException | IOException e) {
    /* Ignore the exception and fall through to choosing a reading strategy based on file type. */
}

Please let me know what you think.
NetCDF-4ReadingStratPatch.zip

Thanks

@guygriffiths
Contributor

This looks good. I've modified it slightly so that it uses the DataType size (from your code), and compares against a multiple of the maximum available memory. This will make it to the next release of ncWMS2, and should hopefully be in a subsequent TDS release.
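For anyone following along, the kind of check described might look roughly like this. This is only a sketch: the class, method names, and the one-quarter-of-heap threshold are illustrative assumptions, not the actual edal-java code.

```java
public class ReadingStrategyHeuristic {
    enum Strategy { SCANLINE, BOUNDING_BOX }

    /**
     * Pick SCANLINE when the estimated in-memory size of the data to be read
     * exceeds a fraction of the maximum heap; otherwise use BOUNDING_BOX.
     * The quarter-of-heap fraction here is illustrative only.
     */
    static Strategy choose(long estimatedBytes, long maxHeapBytes) {
        return estimatedBytes > maxHeapBytes / 4 ? Strategy.SCANLINE : Strategy.BOUNDING_BOX;
    }

    public static void main(String[] args) {
        // The JVM's maximum heap (roughly -Xmx), in bytes.
        long maxHeap = Runtime.getRuntime().maxMemory();
        // A 64 GB variable should trip the SCANLINE path on any ordinary heap.
        System.out.println(choose(64L * 1024 * 1024 * 1024, maxHeap));
    }
}
```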

@PeterWarren
Author

Thank you Guy.
While testing this I also noticed two very minor numeric overflow bugs, in DerivedStaggeredGrid.size() and RectilinearGridImpl.size(): one or both of the int operands need to be cast to long before multiplying, e.g.
return (long) xAxis.size() * (long) yAxis.size();
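The dimension lengths below are invented for illustration, but they show why the cast matters: multiplying two ints overflows before the result is widened to long.

```java
public class OverflowDemo {
    public static void main(String[] args) {
        int xSize = 100_000; // hypothetical axis lengths, large enough that
        int ySize = 100_000; // the true product (10^10) exceeds Integer.MAX_VALUE

        long wrong = xSize * ySize;        // int multiply overflows, THEN widens to long
        long right = (long) xSize * ySize; // widen first, then multiply in long arithmetic

        System.out.println(wrong); // wrapped-around garbage: 1410065408
        System.out.println(right); // 10000000000
    }
}
```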

@guygriffiths
Contributor

Great, thanks, I've fixed that one too.

@guygriffiths
Contributor

So after some testing, it turns out that this is having a very detrimental effect on displaying data from large datasets - SCANLINE is a lot slower for compressed data, and this change is picking SCANLINE for datasets which really don't need it.

I've changed the code so that only the size of the horizontal grid is taken into account. That's all that DataReadingStrategy applies to anyway, so this should give a more realistic estimate of the amount of data which needs to be read, and should only choose SCANLINE in cases where it's really necessary to avoid OutOfMemoryExceptions. Once I've confirmed that it's all working properly, would you mind testing with your dataset to make sure that SCANLINE is still chosen?
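To illustrate the difference the two estimates can make, here is a back-of-the-envelope comparison for a hypothetical 4-D float variable (all dimension lengths invented for illustration): the full-variable product is what the original heuristic measured, while a single horizontal x-y slice is closer to what one map render actually reads.

```java
public class GridSizeEstimates {
    public static void main(String[] args) {
        long t = 3650, z = 50, y = 2000, x = 4000; // hypothetical time/depth/lat/lon lengths
        int bytesPerPoint = 4;                     // 32-bit float

        long fullVariable = t * z * y * x * bytesPerPoint; // product of ALL dimensions
        long horizontalSlice = y * x * bytesPerPoint;      // just the horizontal grid

        // The full variable is terabytes; the horizontal slice is a few tens of MB,
        // so a heuristic based on the full product picks SCANLINE far too eagerly.
        System.out.println(fullVariable / (1024 * 1024) + " MB for the whole variable");
        System.out.println(horizontalSlice / (1024 * 1024) + " MB for one x-y slice");
    }
}
```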

@adamsteer

Hi Guy - do you have a compiled ncWMS jar containing your change that will work with THREDDS 4.6? If so, I'd be really interested in testing it; we see similar issues on a TDS which uses the scan-line reading modification. I'm not a Java developer, so grabbing a compiled jar is easiest - otherwise I'll try to compile one. Thanks!


@PeterWarren
Author

Thanks again Guy. I'll backport your patch into 4.6 for Adam and test the current master branch.

@adamsteer

Just catching up here - what's the best way to go about testing this patch? Is it part of any edal-java release yet (wondering if I should leap ahead to TDS 5 at this point)? And/or where can I grab a compiled ncwms.jar file containing the patch for TDS 4.x? Thanks

@guygriffiths
Contributor

@adamsteer - Yes, this will have made it into any recent edal-java release, and so should be available in the latest TDS 5 builds. @PeterWarren would be better placed to tell you whether this is in any 4.x version of TDS.
