## Read Directly from the Cloud

In the previous notebook, we showed how the library can use a VOL connector to access files stored in a cloud backend through HSDS. But this is not the only way to use HDF5 in the cloud. HDF5 files stored in an S3 bucket can be read directly by the library, without any external connector or driver. This is possible through the Read-Only S3 VFD (ROS3 VFD).

### Enabling the ROS3 VFD

The ROS3 VFD is packaged with the library, and only needs to be enabled during the build process. When using CMake to build HDF5, this is done by setting the variable `HDF5_ENABLE_ROS3_VFD` to `ON`.

<div class="alert alert-block alert-warning">
Cloning and compiling take about three minutes, depending on the machine type.
</div>

In [1]:
%%bash
git clone https://github.com/HDFGroup/hdf5.git build/hdf5
mkdir -p build/hdf5/build
cd build/hdf5/build
cmake -DCMAKE_INSTALL_PREFIX=/home/vscode/.local -DHDF5_ENABLE_ROS3_VFD=ON -DBUILD_STATIC_LIBS=OFF -DBUILD_TESTING=OFF -DHDF5_BUILD_EXAMPLES=OFF -DHDF5_BUILD_TOOLS=ON -DHDF5_BUILD_UTILS=OFF ../ 2>&1 > /dev/null
make -j 4 2>&1 > /dev/null
make install 2>&1 > /dev/null
cd ../../..

fatal: destination path 'build/hdf5' already exists and is not an empty directory.


For this demonstration, we will use a copy of the `ou_process.h5` file that has been uploaded to an S3 bucket. We can use the command-line tool `h5dump` to examine this file and confirm that it is the same file we are already familiar with.

In [1]:
%%bash
/home/vscode/.local/bin/h5dump-shared -pBH --vfd-name ros3 --s3-cred="(,,)" https://s3.us-east-2.amazonaws.com/docs.hdfgroup.org/hdf5/h5/ou_process.h5

HDF5 "https://s3.us-east-2.amazonaws.com/docs.hdfgroup.org/hdf5/h5/ou_process.h5" {
SUPER_BLOCK {
   SUPERBLOCK_VERSION 0
   FREELIST_VERSION 0
   SYMBOLTABLE_VERSION 0
   OBJECTHEADER_VERSION 0
   OFFSET_SIZE 8
   LENGTH_SIZE 8
   BTREE_RANK 16
   BTREE_LEAF 4
   ISTORE_K 32
   FILE_SPACE_STRATEGY H5F_FSPACE_STRATEGY_FSM_AGGR
   FREE_SPACE_PERSIST FALSE
   FREE_SPACE_SECTION_THRESHOLD 1
   FILE_SPACE_PAGE_SIZE 4096
   USER_BLOCK {
      USERBLOCK_SIZE 0
   }
}
GROUP "/" {
   ATTRIBUTE "source" {
      DATATYPE  H5T_STRING {
         STRSIZE 41;
         STRPAD H5T_STR_NULLTERM;
         CSET H5T_CSET_ASCII;
         CTYPE H5T_C_S1;
      }
      DATASPACE  SCALAR
   }
   DATASET "dataset" {
      DATATYPE  H5T_IEEE_F64LE
      DATASPACE  SIMPLE { ( 100, 1000 ) / ( 100, 1000 ) }
      STORAGE_LAYOUT {
         CONTIGUOUS
         SIZE 800000
         OFFSET 2048
      }
      FILTERS {
         NONE
      }
      FILLVALUE {
         FILL_TIME H5D_FILL_TIME_IFSET
         VALUE  H5D_FI
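
The storage layout reported by `h5dump` (a contiguous block at offset 2048 holding 100 × 1000 little-endian doubles) is what makes direct reads from object storage practical: any element sits at a byte position that can be computed and then fetched with a ranged read. The sketch below only works through that arithmetic, assuming the row-major order HDF5 uses for contiguous datasets. The helper names are hypothetical, and the ROS3 VFD does not necessarily issue one request per element.

```python
# Values taken from the h5dump output above.
DATA_OFFSET = 2048   # OFFSET of the contiguous dataset within the file
N_COLS = 1000        # second dimension of the dataspace
ITEM_SIZE = 8        # H5T_IEEE_F64LE is an 8-byte double


def element_byte_range(row, col):
    """Return the inclusive (first, last) byte positions of dataset[row, col]."""
    first = DATA_OFFSET + (row * N_COLS + col) * ITEM_SIZE
    return first, first + ITEM_SIZE - 1


def range_header(row, col):
    """Format the byte range as the value of an HTTP Range header."""
    first, last = element_byte_range(row, col)
    return f"bytes={first}-{last}"


print(range_header(10, 100))  # bytes=82848-82855
```

The last element, `dataset[99, 999]`, ends at byte 2048 + 800000 - 1, which matches the `SIZE 800000` reported in the layout.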

Now we will use the ROS3 VFD to read the file. As with other VFDs, the ROS3 VFD is selected at file access time through a File Access Property List (FAPL), here modified via `H5Pset_fapl_ros3()`. This function also requires some S3-specific parameters in the form of an `H5FD_ros3_fapl_t` structure. Because this file is publicly accessible without any authentication, the only S3-specific detail we need to provide is the region the bucket is located in; the remaining fields set the structure version and disable authentication.

In [10]:
%%writefile src/ou_ros3.c
#include "hdf5.h"
#include <stdio.h>  /* printf */
#include <stdlib.h>
#include <string.h>

int main(void)
{
hid_t file_id;
hid_t fapl_id;
hid_t dset_id;
hid_t dspace_id;
hid_t dtype_id;
hid_t attr_id;

int rank = 2;
const hsize_t dims[] = {100, 1000};
double *read_buffer = (double*) calloc(dims[0] * dims[1], sizeof(double));

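H5FD_ros3_fapl_t s3_params;

/* Zero the structure so that unused fields (e.g. the credentials) are not left uninitialized */
memset(&s3_params, 0, sizeof(s3_params));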

const char* aws_region = "us-east-2";
strcpy(s3_params.aws_region, aws_region);

const char* file_uri = "https://s3.us-east-2.amazonaws.com/docs.hdfgroup.org/hdf5/h5/ou_process.h5";

/* The version of this ROS3 FAPL parameter structure */
s3_params.version = 1;

/* This file permits anonymous access, so authentication is disabled. */
/* For a protected S3 bucket, the secret_id and secret_key fields would also have to be set. */
s3_params.authenticate = 0;

fapl_id = H5Pcreate(H5P_FILE_ACCESS);

H5Pset_fapl_ros3(fapl_id, &s3_params);

/* Open and read from the file */
file_id = H5Fopen(file_uri, H5F_ACC_RDONLY, fapl_id);

dset_id = H5Dopen(file_id, "dataset", H5P_DEFAULT);

dspace_id = H5Screate_simple(rank, dims, NULL);

/* Read data; the memory type is the native double, and the library converts from the file type if needed */
H5Dread(dset_id, H5T_NATIVE_DOUBLE, H5S_ALL, H5S_ALL, H5P_DEFAULT, (void*) read_buffer);

for (size_t i = 0; i < 10; i++) {
    size_t row_idx = i * dims[0] / 10;
    size_t col_idx = i * dims[1] / 10;
    double element = read_buffer[row_idx * dims[1] + col_idx];
    printf("Element at index (%zu, %zu) = %lf\n", row_idx, col_idx, element);
}

free(read_buffer);
H5Dclose(dset_id);
H5Sclose(dspace_id);
H5Pclose(fapl_id);
H5Fclose(file_id);
return 0;

}

Overwriting src/ou_ros3.c


In [11]:
%%bash
g++ -I/home/vscode/.local/include -L/home/vscode/.local/lib ./src/ou_ros3.c -o ./build/ou_ros3 -lhdf5 -Wl,-rpath=/home/vscode/.local/lib
./build/ou_ros3

Element at index (0, 0) = 0.000000
Element at index (10, 100) = 0.077412
Element at index (20, 200) = 0.031822
Element at index (30, 300) = 0.050663
Element at index (40, 400) = 0.076260
Element at index (50, 500) = 0.096194
Element at index (60, 600) = 0.013299
Element at index (70, 700) = -0.030038
Element at index (80, 800) = 0.048959
Element at index (90, 900) = -0.070738


The ROS3 VFD is also transparently accessible through h5py! (Remember Tutorial 04?) The default installation of h5py does not have the ROS3 VFD enabled. To enable it, we build h5py from source against an HDF5 library that already has the ROS3 VFD enabled. Then we can access our file in the cloud by passing the same ROS3 arguments as before to `h5py.File()`.

In [5]:
%%bash
git clone https://github.com/h5py/h5py
cd h5py
HDF5_DIR=/home/vscode/.local pip install .


fatal: destination path 'h5py' already exists and is not an empty directory.


Processing /home/matthewlarson/Documents/hdf5-tutorial/h5py
  Installing build dependencies: started
  Installing build dependencies: finished with status 'done'
  Getting requirements to build wheel: started
  Getting requirements to build wheel: finished with status 'done'
  Installing backend dependencies: started
  Installing backend dependencies: finished with status 'done'
  Preparing metadata (pyproject.toml): started
  Preparing metadata (pyproject.toml): finished with status 'done'
Building wheels for collected packages: h5py
  Building wheel for h5py (pyproject.toml): started
  Building wheel for h5py (pyproject.toml): still running...
  Building wheel for h5py (pyproject.toml): finished with status 'done'
  Created wheel for h5py: filename=h5py-3.10.0-cp39-cp39-linux_x86_64.whl size=1874495 sha256=bf5eecaf84e283629c2ee4e2efdca2abefd21523c217410c175249745ffac61a
  Stored in directory: /tmp/pip-ephem-wheel-cache-yvc30wut/wheels/72/49/cf/e202bdbdc4979200d58297536c81c24a0fe7cadb
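
Since the stock h5py build lacks the driver, it can be worth confirming that the rebuilt package actually exposes it before trying to open a remote file. The sketch below relies on `h5py.registered_drivers()`; the helper name `ros3_available` and its `drivers` parameter are hypothetical and exist only so the check can be exercised without h5py installed.

```python
def ros3_available(drivers=None):
    """Return True if "ros3" is among the available h5py drivers.

    `drivers` is a hypothetical override for testing; when omitted,
    the names come from h5py.registered_drivers().
    """
    if drivers is None:
        import h5py  # imported lazily so the helper also works without h5py
        drivers = h5py.registered_drivers()
    return "ros3" in drivers


# With an explicit set of driver names (no h5py needed):
print(ros3_available({"sec2", "core", "ros3"}))  # True
print(ros3_available({"sec2", "core"}))          # False
```

Calling `ros3_available()` with no argument in the notebook should report `True` once h5py has been built against the ROS3-enabled library.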

In [7]:
import h5py

dims = [100, 1000]

s3_uri = "https://s3.us-east-2.amazonaws.com/docs.hdfgroup.org/hdf5/h5/ou_process.h5"
kwargs = {}
kwargs['mode'] = 'r'
kwargs['driver'] = "ros3"
kwargs['aws_region'] = ("us-east-2").encode("utf-8")
# kwargs['authenticate'] = 0

f = h5py.File(s3_uri, **kwargs)

dset = f["dataset"]

data = dset[:]

for i in range(10):
    row_idx = int(i * dims[0] / 10)
    col_idx = int(i * dims[1] / 10)
    print(f"dataset[{row_idx}, {col_idx}] = {data[row_idx, col_idx]}")


dataset[0, 0] = 0.0
dataset[10, 100] = 0.07741231848747207
dataset[20, 200] = 0.03182172849584258
dataset[30, 300] = 0.050662851693772944
dataset[40, 400] = 0.07626000210825314
dataset[50, 500] = 0.096193561713216
dataset[60, 600] = 0.01329928934495046
dataset[70, 700] = -0.03003752756107992
dataset[80, 800] = 0.048958998265666936
dataset[90, 900] = -0.07073801698079688
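
The file used here allows anonymous access, so only the region was passed. For a protected bucket, h5py's `ros3` driver also accepts `secret_id` and `secret_key` keywords. The helper below is a minimal sketch of assembling those arguments; `ros3_kwargs` is a hypothetical name, and encoding the values as bytes simply follows the `aws_region` handling in the cell above.

```python
def ros3_kwargs(aws_region, secret_id=None, secret_key=None):
    """Build keyword arguments for h5py.File() with the ros3 driver.

    Credentials are optional, but must be given together.
    """
    if (secret_id is None) != (secret_key is None):
        raise ValueError("secret_id and secret_key must be provided together")

    kwargs = {
        "mode": "r",
        "driver": "ros3",
        "aws_region": aws_region.encode("utf-8"),
    }
    if secret_id is not None:
        kwargs["secret_id"] = secret_id.encode("utf-8")
        kwargs["secret_key"] = secret_key.encode("utf-8")
    return kwargs


# Anonymous access, equivalent to the kwargs built by hand above:
print(ros3_kwargs("us-east-2"))
# f = h5py.File(s3_uri, **ros3_kwargs("us-east-2"))
```

In practice the credentials would come from somewhere other than the notebook itself, for example environment variables, rather than being written into the source.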


## Cloud-Optimized

While it is possible to read HDF5 files in their native format directly from the cloud, this introduces some inefficiencies. The access patterns of the library were optimized for filesystem I/O, not for cloud object stores, where every request carries noticeable latency. These concerns motivated the development of the [HDF5 Cloud-Optimized Read-Only Library (H5Coro)](https://github.com/ICESat2-SlideRule/h5coro). H5Coro is a pure Python implementation of a subset of the HDF5 specification that has been optimized for reading data out of S3. For large files, H5Coro can speed up access times by [two orders of magnitude](https://www.hdfgroup.org/wp-content/uploads/2021/05/JPSwinski_H5Coro.pdf). This is achieved primarily by using a larger cache to avoid repeated small reads to the cloud.
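
To see why a larger cache helps, the toy simulation below counts "requests" against an in-memory stand-in for a remote object. It is not H5Coro's implementation; the `CountingStore` and `BlockCache` classes and the 4 KiB block size are assumptions chosen purely for illustration.

```python
class CountingStore:
    """Stand-in for a remote object: every ranged read counts as one request."""

    def __init__(self, data):
        self.data = data
        self.requests = 0

    def read(self, offset, size):
        self.requests += 1
        return self.data[offset:offset + size]


class BlockCache:
    """Serve reads from fixed-size blocks, each fetched from the store once."""

    def __init__(self, store, block_size):
        self.store = store
        self.block_size = block_size
        self.blocks = {}

    def read(self, offset, size):
        out = bytearray()
        end = offset + size
        while offset < end:
            index = offset // self.block_size
            if index not in self.blocks:
                # One large request fills the whole block.
                self.blocks[index] = self.store.read(
                    index * self.block_size, self.block_size)
            start = offset - index * self.block_size
            take = min(end - offset, self.block_size - start)
            out += self.blocks[index][start:start + take]
            offset += take
        return bytes(out)


payload = bytes(range(256)) * 64  # 16 KiB of sample data

# 512 consecutive 8-byte reads issued directly against the store
direct = CountingStore(payload)
for off in range(0, 4096, 8):
    direct.read(off, 8)

# The same reads routed through a 4 KiB block cache
cached_store = CountingStore(payload)
cache = BlockCache(cached_store, block_size=4096)
chunks = [cache.read(off, 8) for off in range(0, 4096, 8)]

print(direct.requests, cached_store.requests)  # 512 1
```

The trade-off is that each miss transfers more bytes than were asked for, which pays off when per-request latency, rather than bandwidth, dominates the cost of a read.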