# Reading Data Directly from the Cloud

In the previous notebook, we showed how the library can use a VOL connector to access files stored in a cloud backend through HSDS. But this is not the only way to use HDF5 in the cloud. HDF5 files stored in an S3 bucket can be accessed directly from the library without any external connector or drivers. This is possible through the Read-Only S3 VFD (ROS3 VFD). 

## Enabling the ROS3 VFD

The ROS3 VFD is packaged with the library, and only needs to be enabled during the build process. When using CMake to build HDF5, this is done by setting the variable `HDF5_ENABLE_ROS3_VFD` to `ON`.

<div class="alert alert-block alert-warning">
Cloning and compiling take about three minutes, depending on the machine type.
</div>

In [None]:
%%bash
git clone https://github.com/HDFGroup/hdf5.git build/hdf5
mkdir -p build/hdf5/build
cd build/hdf5/build
cmake -DCMAKE_INSTALL_PREFIX=/home/vscode/.local -DHDF5_ENABLE_ROS3_VFD=ON -DBUILD_STATIC_LIBS=OFF -DBUILD_TESTING=OFF -DHDF5_BUILD_EXAMPLES=OFF -DHDF5_BUILD_TOOLS=ON -DHDF5_BUILD_UTILS=OFF ../ 2>&1 > /dev/null
make -j 4 2>&1 > /dev/null
make install 2>&1 > /dev/null
cd ../../..

For this demonstration, we will use a copy of the `ou_process.h5` file that has been uploaded to an S3 bucket. We can use the command line tool `h5dump` to examine this file and see that it is the same file we're familiar with.

In [None]:
%%bash
/home/vscode/.local/bin/h5dump -pBH --vfd-name ros3 --s3-cred="(,,)" https://s3.us-east-2.amazonaws.com/docs.hdfgroup.org/hdf5/h5/ou_process.h5

Now we will use the ROS3 VFD to read the file. Similarly to other VFDs, the ROS3 VFD is enabled at file access time with a File Access Property List (FAPL) modified via `H5Pset_fapl_ros3()`. This function also requires some S3-specific parameters in the form of the `H5FD_ros3_fapl_t` structure. Because this file is publicly accessible without any authentication, all we need to provide is the region it is located in.
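
For a path-style URL like the one used in this notebook, the region is also visible in the host name. The following is a minimal sketch of how such a URL breaks down into region, bucket, and object key. The helper `split_s3_url` is hypothetical (it is not part of HDF5 or h5py), and it assumes the form `https://s3.<region>.amazonaws.com/<bucket>/<key>`; other S3 URL styles are not handled.

```python
from urllib.parse import urlparse


def split_s3_url(url):
    """Split a path-style S3 URL into (region, bucket, key).

    Assumes the form https://s3.<region>.amazonaws.com/<bucket>/<key>.
    """
    parsed = urlparse(url)
    host_parts = parsed.netloc.split(".")
    if (len(host_parts) != 4 or host_parts[0] != "s3"
            or host_parts[2:] != ["amazonaws", "com"]):
        raise ValueError(f"not a path-style S3 URL: {url}")
    region = host_parts[1]
    # The first path component is the bucket; the rest is the object key
    bucket, _, key = parsed.path.lstrip("/").partition("/")
    return region, bucket, key


url = "https://s3.us-east-2.amazonaws.com/docs.hdfgroup.org/hdf5/h5/ou_process.h5"
region, bucket, key = split_s3_url(url)
print(region)  # us-east-2
print(bucket)  # docs.hdfgroup.org
print(key)     # hdf5/h5/ou_process.h5
```

The region printed here is the same string that is copied into the `aws_region` field in the program below.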

In [None]:
%%writefile src/ou_ros3.c
#include "hdf5.h"
#include <array>
#include <cassert>
#include <cstring>
#include <iomanip>
#include <iostream>
#include <string>
#include <vector>

using namespace std;

int main(void)
{
// Zero-initialize so the unused credential fields hold empty strings
H5FD_ros3_fapl_t s3_params = {};

// Set the AWS region and the URL of the file in the S3 bucket
strcpy(s3_params.aws_region, "us-east-2");
string file_uri = "https://s3.us-east-2.amazonaws.com/docs.hdfgroup.org/hdf5/h5/ou_process.h5";
// The version of this ROS3 FAPL parameter structure
s3_params.version = H5FD_CURR_ROS3_FAPL_T_VERSION;
// This file permits anonymous access, so authentication is not needed.
// The secret_id and secret_key fields would only be required for protected S3 buckets.
s3_params.authenticate = 0;

// Open and read from the file
auto fapl = H5Pcreate(H5P_FILE_ACCESS);
H5Pset_fapl_ros3(fapl, &s3_params);
auto file = H5Fopen(file_uri.c_str(), H5F_ACC_RDONLY, fapl);
auto dset = H5Dopen(file, "dataset", H5P_DEFAULT);

// Get the dataspace details from the file
auto dspace = H5Dget_space(dset);
assert(H5Sget_simple_extent_ndims(dspace) == 2);
array<hsize_t, 2> dims;
H5Sget_simple_extent_dims(dspace, dims.data(), NULL);
vector<double> read_buffer(dims[0] * dims[1]);

// Read the whole dataset; the memory type describes the in-memory buffer of doubles
H5Dread(dset, H5T_NATIVE_DOUBLE, H5S_ALL, H5S_ALL, H5P_DEFAULT, read_buffer.data());

for (size_t i = 0; i < 10; i++) {
    size_t row_idx = i * dims[0] / 10;
    size_t col_idx = i * dims[1] / 10;
    auto element = read_buffer[row_idx * dims[1] + col_idx];
    cout << "Element at index (" << row_idx << ", " << col_idx << ") = "
         << std::setprecision(16) << element << endl;
}

H5Dclose(dset);
H5Sclose(dspace);
H5Pclose(fapl);
H5Fclose(file);

return 0;
}

In [None]:
%%bash
g++ -Wall -pedantic -I/home/vscode/.local/include -L/home/vscode/.local/lib ./src/ou_ros3.c -o ./build/ou_ros3 -lhdf5 -Wl,-rpath=/home/vscode/.local/lib
./build/ou_ros3

The ROS3 VFD is also transparently accessible through h5py! (Remember Tutorial 04?) The default installation of h5py does not have the ROS3 VFD enabled. To enable it, we build h5py from source against an HDF5 library that already has the ROS3 VFD enabled. Then we can access our file in the cloud by passing the same ROS3 arguments as before to `h5py.File()`.

In [None]:
%%bash
git clone https://github.com/h5py/h5py
cd h5py
HDF5_DIR=/home/vscode/.local pip install .
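
Before opening a file, it can be useful to confirm that the freshly built h5py actually picked up ROS3 support. The sketch below is one way to check. The helper name `h5py_has_ros3` is hypothetical, and the check assumes that the configuration object returned by `h5py.get_config()` exposes a boolean `ros3` attribute; if that attribute is missing, the helper simply reports `False`.

```python
def h5py_has_ros3():
    """Return True if an importable h5py reports ROS3 support, else False."""
    try:
        import h5py
    except ImportError:
        # h5py is not installed in this environment
        return False
    # Assumption: the build configuration exposes a boolean `ros3` attribute
    return bool(getattr(h5py.get_config(), "ros3", False))


print("ROS3 driver available in h5py:", h5py_has_ros3())
```

If this prints `False` after the build above, h5py was most likely compiled against a different HDF5 installation than the one in `/home/vscode/.local`.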


In [None]:
import h5py

s3_uri = "https://s3.us-east-2.amazonaws.com/docs.hdfgroup.org/hdf5/h5/ou_process.h5"
kwargs = {}
kwargs['mode'] = 'r'
kwargs['driver'] = "ros3"
kwargs['aws_region'] = "us-east-2".encode("utf-8")
# kwargs['authenticate'] = 0

# Open the file with the ROS3 driver; the context manager closes it when done
with h5py.File(s3_uri, **kwargs) as f:
    dset = f["dataset"]
    # Take the dimensions from the dataset instead of hard-coding them
    dims = dset.shape
    data = dset[:]

for i in range(10):
    row_idx = i * dims[0] // 10
    col_idx = i * dims[1] // 10
    print(f"dataset[{row_idx}, {col_idx}] = {data[row_idx, col_idx]}")


## Cloud-Optimized

While it is possible to read HDF5 files in their native format directly from the cloud, this introduces some inefficiencies. The access patterns of the library were optimized for filesystem I/O, not for cloud access and the latency it brings. These concerns motivated the development of the [HDF5 Cloud-Optimized Read-Only Library (H5Coro)](https://github.com/ICESat2-SlideRule/h5coro). H5Coro is a pure Python implementation of a subset of the HDF5 specification that has been optimized for reading data out of S3. For large files, H5Coro can speed up access times by [two orders of magnitude](https://www.hdfgroup.org/wp-content/uploads/2021/05/JPSwinski_H5Coro.pdf). This is achieved primarily by using a larger cache to avoid repeated small reads to the cloud.
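
To illustrate why a larger cache helps, here is a small simulation that counts requests to a stand-in object store with and without block caching. This is only a sketch of the general idea, not H5Coro's actual implementation; the classes `CountingStore` and `BlockCachedReader`, the 64 KiB block size, and the read pattern are all assumptions chosen for illustration.

```python
class CountingStore:
    """Stand-in for an object store: serves byte ranges and counts requests."""

    def __init__(self, data):
        self.data = data
        self.requests = 0

    def get_range(self, offset, length):
        self.requests += 1
        return self.data[offset:offset + length]


class BlockCachedReader:
    """Serve reads from fixed-size cached blocks, fetching each block at most once."""

    def __init__(self, store, block_size):
        self.store = store
        self.block_size = block_size
        self.blocks = {}

    def read(self, offset, length):
        out = bytearray()
        end = offset + length
        while offset < end:
            index = offset // self.block_size
            if index not in self.blocks:
                # One large request fills the cache for this block
                self.blocks[index] = self.store.get_range(
                    index * self.block_size, self.block_size)
            start = offset - index * self.block_size
            take = min(end - offset, self.block_size - start)
            chunk = self.blocks[index][start:start + take]
            if not chunk:
                break  # read past the end of the data
            out += chunk
            offset += len(chunk)
        return bytes(out)


data = bytes(range(256)) * 4096  # 1 MiB of sample bytes
direct = CountingStore(data)
cached_store = CountingStore(data)
reader = BlockCachedReader(cached_store, block_size=64 * 1024)

# 1000 small reads of 64 bytes each, spread over the first 128 KiB
for i in range(1000):
    offset = i * 128
    assert reader.read(offset, 64) == direct.get_range(offset, 64)

print(direct.requests)        # 1000
print(cached_store.requests)  # 2
```

In this toy setup the 1000 small reads collapse into two block-sized requests. Against a real object store, where every request pays a network round trip, that reduction in request count is where most of the time is saved.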