Use the built-in HDF5 byte-range reader, if available.
re: Issue Unidata#1848

The existing Virtual File Driver built to support byte-range
read-only file access is quite old. It turns out to be extremely
slow (reason unknown at the moment).

Starting with HDF5 1.10.6, the HDF5 library has its own version
of such a file driver. The HDF5 developers have better knowledge
about building such a driver and what incantations are needed to
get good performance.

This PR modifies the byte-range code in hdf5open.c so that,
if the HDF5 ROS3 file driver is available, it is used
in preference to the driver written by the netCDF group.

Misc. Other Changes:

1. Moved all of nc4print code to ncdump to keep appveyor quiet.
DennisHeimbigner committed Sep 24, 2020
1 parent e9af580 commit f3218a2
Showing 18 changed files with 324 additions and 105 deletions.
8 changes: 8 additions & 0 deletions CMakeLists.txt
@@ -770,12 +770,20 @@ IF(USE_HDF5)
# Check to see if H5Dread_chunk is available
CHECK_LIBRARY_EXISTS(${HDF5_C_LIBRARY_hdf5} H5Dread_chunk "" HAS_READCHUNKS)

# Check to see if H5Pset_fapl_ros3 is available
CHECK_LIBRARY_EXISTS(${HDF5_C_LIBRARY_hdf5} H5Pset_fapl_ros3 "" HAS_HDF5_ROS3)

# Check to see if this is hdf5-1.10.3 or later.
IF(HAS_READCHUNKS)
SET(HDF5_SUPPORTS_PAR_FILTERS ON)
SET(ENABLE_NCDUMPCHUNKS ON)
ENDIF()

# Check to see if this is hdf5-1.10.6 or later.
IF(HAS_HDF5_ROS3)
SET(ENABLE_HDF5_ROS3 ON)
ENDIF()

IF (HDF5_SUPPORTS_PAR_FILTERS)
SET(HDF5_HAS_PAR_FILTERS TRUE CACHE BOOL "")
SET(HAS_PAR_FILTERS yes CACHE STRING "")
27 changes: 27 additions & 0 deletions NUG/nczarr.md
@@ -456,6 +456,33 @@ for all objects limited to 5 Terabytes.
2. S3 key names can be any UNICODE name with a maximum length of 1024
bytes. Note that the limit is defined in terms of bytes and not (Unicode) characters. This affects the depth to which groups can be nested because the key encodes the full path name of a group.

# Appendix D. Alternative Mechanisms for Accessing Remote Datasets

The NetCDF-C library contains an alternate mechanism for accessing data
stored in Amazon S3: the byte-range mechanism.
The idea is to treat the remote dataset as if it were a single large file.
This remote "file" can be randomly accessed using the HTTP Byte-Range header.

In the Amazon S3 context, a copy of a dataset (a netcdf-3 or netcdf-4 file)
is uploaded as a single object in some bucket. Then, using the key to this object,
the netcdf-c library can be told to treat the object as a remote file
and to use the HTTP Byte-Range protocol to access the contents of the object.
The dataset object is referenced using a URL whose trailing fragment contains
the string ````#mode=bytes````.

An examination of the test program _nc_test/test_byterange.sh_ shows simple examples
using the _ncdump_ program. One such test is specified as follows:
````
https://noaa-goes16.s3.amazonaws.com/ABI-L1b-RadC/2017/059/03/OR_ABI-L1b-RadC-M3C13_G16_s20170590337505_e20170590340289_c20170590340316.nc#mode=bytes
````

This mechanism generalizes to most servers that support byte-range access.
In particular, THREDDS servers support such access using the HttpServer access method,
as can be seen from this URL taken from the above test program.
````
https://thredds-test.unidata.ucar.edu/thredds/fileServer/irma/metar/files/METAR_20170910_0000.nc#bytes
````

# __Point of Contact__ {#nczarr_poc}

__Author__: Dennis Heimbigner<br>
3 changes: 3 additions & 0 deletions config.h.cmake.in
@@ -124,6 +124,9 @@ are set when opening a binary file on Windows. */
/* if true, build byte-range Client */
#cmakedefine ENABLE_BYTERANGE 1

/* if true, use hdf5 S3 virtual file reader */
#cmakedefine ENABLE_HDF5_ROS3 1

/* if true, enable CDF5 Support */
#cmakedefine ENABLE_CDF5 1

7 changes: 7 additions & 0 deletions configure.ac
@@ -1120,6 +1120,7 @@ fi
hdf5_parallel=no
hdf5_supports_par_filters=no
enable_szlib=no
has_ros3=no

if test "x$enable_hdf5" = xyes; then

@@ -1161,6 +1162,12 @@ if test "x$enable_hdf5" = xyes; then
# See if H5Dread_chunk is available
AC_SEARCH_LIBS([H5Dread_chunk],[hdf5_hldll hdf5_hl], [has_readchunks=yes], [has_readchunks=no])

# See if hdf5 library supports Read-Only S3 (byte-range) driver
AC_SEARCH_LIBS([H5Pset_fapl_ros3],[hdf5_hldll hdf5_hl], [has_ros3=yes], [has_ros3=no])
if test "x$has_ros3" = xyes; then
AC_DEFINE([ENABLE_HDF5_ROS3], [1], [if true, support byte-range using hdf5 virtual file driver.])
fi

# Check to see if HDF5 library is 1.10.3 or greater. If so, allows
# parallel I/O with filters. This allows zlib/szip compression to
# be used with parallel I/O, which is very helpful to HPC users.
89 changes: 88 additions & 1 deletion libhdf5/hdf5open.c
@@ -19,6 +19,10 @@
#include "H5FDhttp.h"
#endif

#ifdef ENABLE_HDF5_ROS3
#include <H5FDros3.h>
#endif

/* Mnemonic */
#define FILTERACTIVE 1

@@ -68,6 +72,10 @@ static hid_t nc4_H5Fopen(const char *filename, unsigned flags, hid_t fapl_id);
#define nc4_H5Fopen H5Fopen
#endif

#ifdef ENABLE_HDF5_ROS3
static int ros3info(NCURI* uri, char** hostportp, char** regionp);
#endif

/**
* @internal Struct to track HDF5 object info, for
* rec_read_metadata(). We get this info for every object in the
@@ -845,9 +853,44 @@ nc4_open_file(const char *path, int mode, void* parameters, int ncid)
#ifdef ENABLE_BYTERANGE
else
if(h5->http.iosp) { /* Arrange to use the byte-range driver */
/* Configure FAPL to use the byte-range file driver */
#ifdef ENABLE_HDF5_ROS3
NCURI* uri = NULL;
H5FD_ros3_fapl_t fa;
char* hostport = NULL;
char* region = NULL;
char* accessid = NULL;
char* secretkey = NULL;
ncuriparse(path,&uri);
if(uri == NULL)
BAIL(NC_EINVAL);
/* Extract auth related info */
if((ros3info(uri,&hostport,&region)))
BAIL(NC_EINVAL);
accessid = NC_rclookup("HTTP.CREDENTIALS.USER",hostport);
secretkey = NC_rclookup("HTTP.CREDENTIALS.PASSWORD",hostport);
fa.version = 1;
fa.aws_region[0] = '\0';
fa.secret_id[0] = '\0';
fa.secret_key[0] = '\0';
if(accessid == NULL || secretkey == NULL) {
/* default, non-authenticating, "anonymous" fapl configuration */
fa.authenticate = (hbool_t)0;
} else {
fa.authenticate = (hbool_t)1;
strlcat(fa.aws_region,region,H5FD_ROS3_MAX_REGION_LEN);
strlcat(fa.secret_id, accessid, H5FD_ROS3_MAX_SECRET_ID_LEN);
strlcat(fa.secret_key, secretkey, H5FD_ROS3_MAX_SECRET_KEY_LEN);
}
nullfree(region);
nullfree(hostport);
/* create and set fapl entry */
if(H5Pset_fapl_ros3(fapl_id, &fa) < 0)
BAIL(NC_EHDFERR);
#else
/* Configure FAPL to use our byte-range file driver */
if (H5Pset_fapl_http(fapl_id) < 0)
BAIL(NC_EHDFERR);
#endif
/* Open the HDF5 file. */
if ((h5->hdfid = nc4_H5Fopen(path, flags, fapl_id)) < 0)
BAIL(NC_EHDFERR);
@@ -2714,6 +2757,50 @@ rec_read_metadata(NC_GRP_INFO_T *grp)
return retval;
}

#ifdef ENABLE_HDF5_ROS3
static int
ros3info(NCURI* uri, char** hostportp, char** regionp)
{
int stat = NC_NOERR;
size_t len;
char* hostport = NULL;
char* region = NULL;
char* p;

if(uri == NULL || uri->host == NULL)
{stat = NC_EINVAL; goto done;}
len = strlen(uri->host);
if(uri->port != NULL)
len += 1+strlen(uri->port);
len++; /* nul term */
if((hostport = malloc(len)) == NULL)
{stat = NC_ENOMEM; goto done;}
hostport[0] = '\0';
strlcat(hostport,uri->host,len);
if(uri->port != NULL) {
strlcat(hostport,":",len);
strlcat(hostport,uri->port,len);
}
/* We only support path urls, not virtual urls, so the
host past the first dot must be "s3.amazonaws.com" */
p = strchr(uri->host,'.');
if(p != NULL && strcmp(p+1,"s3.amazonaws.com")==0) {
len = (size_t)((p - uri->host)-1);
region = calloc(1,len+1);
memcpy(region,uri->host,len);
region[len] = '\0';
} else /* cannot find region: use "" */
region = strdup("");
if(hostportp) {*hostportp = hostport; hostport = NULL;}
if(regionp) {*regionp = region; region = NULL;}

done:
nullfree(hostport);
nullfree(region);
return stat;
}
#endif /*ENABLE_HDF5_ROS3*/

#ifdef _WIN32

/**
2 changes: 1 addition & 1 deletion libsrc4/CMakeLists.txt
@@ -7,7 +7,7 @@
# Process these files with m4.

SET(libsrc4_SOURCES nc4dispatch.c nc4attr.c nc4dim.c nc4grp.c
nc4internal.c nc4type.c nc4var.c ncfunc.c error4.c nc4printer.c
nc4internal.c nc4type.c nc4var.c ncfunc.c error4.c
ncindex.c nc4filters.c)

add_library(netcdf4 OBJECT ${libsrc4_SOURCES})
4 changes: 2 additions & 2 deletions libsrc4/Makefile.am
@@ -10,8 +10,8 @@ libnetcdf4_la_CPPFLAGS = ${AM_CPPFLAGS}

# This is our output. The netCDF-4 convenience library.
noinst_LTLIBRARIES = libnetcdf4.la
libnetcdf4_la_SOURCES = nc4dispatch.c nc4attr.c nc4dim.c nc4grp.c \
nc4internal.c nc4type.c nc4var.c ncfunc.c error4.c nc4printer.c \
libnetcdf4_la_SOURCES = nc4dispatch.c nc4attr.c nc4dim.c nc4grp.c \
nc4internal.c nc4type.c nc4var.c ncfunc.c error4.c \
ncindex.c nc4filters.c


2 changes: 1 addition & 1 deletion nc_test/CMakeLists.txt
@@ -117,5 +117,5 @@ FILE(GLOB CUR_EXTRA_DIST RELATIVE ${CMAKE_CURRENT_SOURCE_DIR} ${CMAKE_CURRENT_SO
SET(CUR_EXTRA_DIST ${CUR_EXTRA_DIST} CMakeLists.txt Makefile.am)
SET(CUR_EXTRA_DIST ${CUR_EXTRA_DIST} test_get.m4 test_put.m4 test_read.m4 test_write.m4 ref_tst_diskless2.cdl tst_diskless5.cdl
ref_tst_diskless3_create.cdl ref_tst_diskless3_open.cdl
ref_tst_http_nc3.cdl ref_tst_http_nc4.cdl)
ref_tst_http_nc3.cdl ref_tst_http_nc4a.cdl ref_tst_http_nc4b.cdl ref_tst_http_nc4c.cdl)
ADD_EXTRA_DIST("${CUR_EXTRA_DIST}")
5 changes: 3 additions & 2 deletions nc_test/Makefile.am
@@ -104,14 +104,15 @@ test_write.m4 ref_tst_diskless2.cdl tst_diskless5.cdl \
ref_tst_diskless3_create.cdl ref_tst_diskless3_open.cdl \
run_inmemory.sh run_mmap.sh \
f03tst_open_mem.nc \
test_byterange.sh ref_tst_http_nc3.cdl ref_tst_http_nc4.cdl \
test_byterange.sh ref_tst_http_nc3.cdl \
ref_tst_http_nc4a.cdl ref_tst_http_nc4b.cdl ref_tst_http_nc4c.cdl \
CMakeLists.txt

# These files are created by the tests.
CLEANFILES = nc_test_*.nc tst_*.nc t_nc.nc large_files.nc \
quick_large_files.nc tst_diskless3_file.cdl \
tst_diskless4.cdl ref_tst_diskless4.cdl benchmark.nc \
tst_http_nc3.cdl tst_http_nc4.cdl tmp*.cdl tmp*.nc
tst_http_nc3.cdl tst_http_nc4?.cdl tmp*.cdl tmp*.nc

EXTRA_DIST += bad_cdf5_begin.nc run_cdf5.sh
if ENABLE_CDF5
93 changes: 21 additions & 72 deletions nc_test/ref_tst_http_nc3.cdl
@@ -1,77 +1,26 @@
netcdf \2004050300_eta_211 {
netcdf point {
dimensions:
record = UNLIMITED ; // (1 currently)
x = 135 ;
y = 95 ;
datetime_len = 21 ;
nmodels = 1 ;
ngrids = 1 ;
nav = 1 ;
nav_len = 100 ;
time = 3 ;
variables:
double reftime(record) ;
reftime:long_name = "reference time" ;
reftime:units = "hours since 1992-1-1" ;
double valtime(record) ;
valtime:long_name = "valid time" ;
valtime:units = "hours since 1992-1-1" ;
char datetime(record, datetime_len) ;
datetime:long_name = "reference date and time" ;
float valtime_offset(record) ;
valtime_offset:long_name = "hours from reference time" ;
valtime_offset:units = "hours" ;
int model_id(nmodels) ;
model_id:long_name = "generating process ID number" ;
char nav_model(nav, nav_len) ;
nav_model:long_name = "navigation model name" ;
int grid_type_code(nav) ;
grid_type_code:long_name = "GRIB-1 GDS data representation type" ;
char grid_type(nav, nav_len) ;
grid_type:long_name = "GRIB-1 grid type" ;
char grid_name(nav, nav_len) ;
grid_name:long_name = "grid name" ;
int grid_center(nav) ;
grid_center:long_name = "GRIB-1 originating center ID" ;
int grid_number(nav, ngrids) ;
grid_number:long_name = "GRIB-1 catalogued grid numbers" ;
grid_number:_FillValue = -9999 ;
char x_dim(nav, nav_len) ;
x_dim:long_name = "x dimension name" ;
char y_dim(nav, nav_len) ;
y_dim:long_name = "y dimension name" ;
int Nx(nav) ;
Nx:long_name = "number of points along x-axis" ;
int Ny(nav) ;
Ny:long_name = "number of points along y-axis" ;
float La1(nav) ;
La1:long_name = "latitude of first grid point" ;
La1:units = "degrees_north" ;
float Lo1(nav) ;
Lo1:long_name = "longitude of first grid point" ;
Lo1:units = "degrees_east" ;
float Lov(nav) ;
Lov:long_name = "orientation of the grid" ;
Lov:units = "degrees_east" ;
float Dx(nav) ;
Dx:long_name = "x-direction grid length" ;
Dx:units = "km" ;
float Dy(nav) ;
Dy:long_name = "y-direction grid length" ;
Dy:units = "km" ;
byte ProjFlag(nav) ;
ProjFlag:long_name = "projection center flag" ;
byte ResCompFlag(nav) ;
ResCompFlag:long_name = "resolution and component flags" ;
float Z_sfc(record, y, x) ;
Z_sfc:long_name = "Geopotential height, gpm" ;
Z_sfc:units = "gp m" ;
Z_sfc:_FillValue = -9999.f ;
Z_sfc:navigation = "nav" ;
float lon(time) ;
lon:long_name = "longitude" ;
lon:units = "degrees_east" ;
float lat(time) ;
lat:long_name = "latitude" ;
lat:units = "degrees_north" ;
float z(time) ;
z:long_name = "height above mean sea level" ;
z:units = "km" ;
z:positive = "up" ;
double time(time) ;
time:long_name = "time" ;
time:units = "days since 1970-01-01 00:00:00" ;
float data(time) ;
data:long_name = "skin temperature" ;
data:units = "Celsius" ;
data:coordinates = "time lon lat z" ;

// global attributes:
:record = "reftime, valtime" ;
:history = "2003-09-25 16:09:26 - created by gribtocdl 1.4 - 12.12.2002" ;
:title = "CMC_reg_HGT_SFC_0_ps60km_2003092500_P000.grib" ;
:Conventions = "NUWG" ;
:version = 0. ;
:featureType = "point" ;
:Conventions = "CF-1.6" ;
}
File renamed without changes.
50 changes: 50 additions & 0 deletions nc_test/ref_tst_http_nc4b.cdl
@@ -0,0 +1,50 @@
netcdf HadCRUT.4.6.0.0.median {
dimensions:
latitude = 36 ;
longitude = 72 ;
field_status_string_length = 1 ;
time = UNLIMITED ; // (2047 currently)
variables:
float latitude(latitude) ;
latitude:standard_name = "latitude" ;
latitude:long_name = "latitude" ;
latitude:point_spacing = "even" ;
latitude:units = "degrees_north" ;
latitude:axis = "Y" ;
float longitude(longitude) ;
longitude:standard_name = "longitude" ;
longitude:long_name = "longitude" ;
longitude:point_spacing = "even" ;
longitude:units = "degrees_east" ;
longitude:axis = "X" ;
float time(time) ;
time:standard_name = "time" ;
time:long_name = "time" ;
time:units = "days since 1850-1-1 00:00:00" ;
time:calendar = "gregorian" ;
time:start_year = 1850s ;
time:end_year = 2020s ;
time:start_month = 1s ;
time:end_month = 7s ;
time:axis = "T" ;
float temperature_anomaly(time, latitude, longitude) ;
temperature_anomaly:long_name = "near_surface_temperature_anomaly" ;
temperature_anomaly:units = "K" ;
temperature_anomaly:missing_value = -1.e+30f ;
temperature_anomaly:_FillValue = -1.e+30f ;
temperature_anomaly:reference_period = 1961s, 1990s ;
char field_status(time, field_status_string_length) ;
field_status:long_name = "field_status" ;

// global attributes:
:title = "HadCRUT4 near-surface temperature ensemble data - ensemble median" ;
:institution = "Met Office Hadley Centre / Climatic Research Unit, University of East Anglia" ;
:history = "Updated at 27/08/2020 14:34:42" ;
:source = "CRUTEM.4.6.0.0, HadSST.3.1.1.0" ;
:comment = "" ;
:reference = "Morice, C. P., J. J. Kennedy, N. A. Rayner, and P. D. Jones (2012), Quantifying uncertainties in global and regional temperature change using an ensemble of observational estimates: The HadCRUT4 dataset, J. Geophys. Res., doi:10.1029/2011JD017187" ;
:version = "HadCRUT.4.6.0.0" ;
:Conventions = "CF-1.0" ;
:ensemble_members = 100s ;
:ensemble_member_index = 0s ;
}