# Cloud Optimized HDF: or How I Learned to Stop Worrying and Love the Format


<img src="https://i.imgflip.com/8e4hnc.jpg" width="400px"/>


<img src="https://i.imgflip.com/8e4iqw.jpg" width="400px">

## The big ol list of "ifs"

* We use the most recent versions of h5py, xarray and fsspec
* We create the HDF5 files with [cloud optimized flags](https://www.youtube.com/watch?v=rcS5vt-mKok)
  * if the files are already out there, we can repack them, consolidating the metadata and perhaps increasing the chunk sizes
* We know how to "tweak the knobs" (or at least have a fair understanding of what the I/O libraries are doing).

In [17]:
import xarray as xr
import h5py
import s3fs

fs = s3fs.S3FileSystem(anon=True)

for library in (xr, h5py, s3fs):
    print(f'{library.__name__} v{library.__version__}')

xarray v2024.1.1
h5py v3.10.0
s3fs v2023.12.2


In [12]:
# a "big" ATL03 file from the ICESat-2 mission
original_granule = "s3://its-live-data/cloud-experiments/h5cloud/atl03/big/original/ATL03_20190219140808_08110212_006_02.h5"
# the same "big" ATL03 file from the ICESat-2 mission, metadata consolidated in 8MB-size pages.
cloud_optimized = "s3://its-live-data/cloud-experiments/h5cloud/atl03/big/repacked/ATL03_20190219140808_08110212_006_02_repacked.h5"

fs.info(original_granule)

{'ETag': '"237bbd5828745b9e1a1e0ba88486e43c-835"',
 'LastModified': datetime.datetime(2024, 1, 29, 4, 48, 24, tzinfo=tzutc()),
 'size': 6997123664,
 'name': 'its-live-data/cloud-experiments/h5cloud/atl03/big/original/ATL03_20190219140808_08110212_006_02.h5',
 'type': 'file',
 'StorageClass': 'INTELLIGENT_TIERING',
 'VersionId': None,
 'ContentType': 'application/x-hdf5'}

In [13]:
fs.info(cloud_optimized)

{'ETag': '"08af0688f787f10eee1ccfb13f7eb66d-836"',
 'LastModified': datetime.datetime(2024, 1, 29, 4, 52, 44, tzinfo=tzutc()),
 'size': 7008000000,
 'name': 'its-live-data/cloud-experiments/h5cloud/atl03/big/repacked/ATL03_20190219140808_08110212_006_02_repacked.h5',
 'type': 'file',
 'StorageClass': 'INTELLIGENT_TIERING',
 'VersionId': None,
 'ContentType': 'application/x-hdf5'}
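
A quick back-of-the-envelope check on the two `size` values above shows what repacking costs in storage. The sketch below uses only the numbers printed by `fs.info`; the interpretation of "8MB" as 8,000,000 bytes is an assumption, although the repacked size happens to be an exact multiple of it. The 8 MiB block size is the one passed to fsspec further down.

```python
import math

ORIGINAL_SIZE = 6_997_123_664  # bytes, from fs.info(original_granule)
REPACKED_SIZE = 7_008_000_000  # bytes, from fs.info(cloud_optimized)
PAGE_SIZE = 8_000_000          # assumption: "8MB" pages means 8 * 10**6 bytes
BLOCK_SIZE = 8 * 1024 * 1024   # 8 MiB fsspec read block

# extra bytes introduced by padding data and metadata into fixed-size pages
overhead_bytes = REPACKED_SIZE - ORIGINAL_SIZE
overhead_pct = 100 * overhead_bytes / ORIGINAL_SIZE

# how many pages the repacked file holds, and whether it divides evenly
whole_pages, leftover = divmod(REPACKED_SIZE, PAGE_SIZE)

# how many 8 MiB blocks a reader would need to cover the whole file
blocks_to_cover_file = math.ceil(REPACKED_SIZE / BLOCK_SIZE)

print(f"repacking overhead: {overhead_bytes:,} bytes ({overhead_pct:.2f}%)")
print(f"pages: {whole_pages} (leftover bytes: {leftover})")
print(f"8 MiB blocks to cover the file: {blocks_to_cover_file}")
```

The overhead is a small fraction of a percent, so the repacked file costs almost nothing extra to store.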

In [None]:
# don't even try this outside the us-west-2 region: it will take forever (forever >= 30 minutes)
ds = xr.open_dataset(fs.open(original_granule),
                     group="/gt1l/heights",
                     engine="h5netcdf")
ds

In [None]:
# again... don't even try this outside the us-west-2 region: it will take forever (forever >= 30 minutes)
ds = xr.open_dataset(fs.open(cloud_optimized),
                     group="/gt1l/heights",
                     engine="h5netcdf")
ds

In [15]:
%%time

# this one is different! you can try this at home (cloud optimized HDF5!)

io_params = {
    "fsspec_params": {
        # "skip_instance_cache": True
        "cache_type": "blockcache",  # or "first" with enough space
        "block_size": 8 * 1024 * 1024,  # could be bigger
    },
    "h5py_params": {
        "driver_kwds": {  # only recent versions of xarray and h5netcdf pass these through correctly
            "page_buf_size": 32 * 1024 * 1024,  # only works on repacked (paged) files
            "rdcc_nbytes": 8 * 1024 * 1024,  # chunk cache size, used when reading chunks
        }
    },
}
ds = xr.open_dataset(fs.open(cloud_optimized, **io_params["fsspec_params"]),
                     group="/gt1l/heights",
                     engine="h5netcdf",
                     **io_params["h5py_params"])
ds

CPU times: user 4.16 s, sys: 3.04 s, total: 7.2 s
Wall time: 20.6 s


In [16]:
%%time

# takes about 2 minutes
ds.h_ph.mean()

CPU times: user 11 s, sys: 2.02 s, total: 13 s
Wall time: 1min 25s


<center><img src="https://i.imgflip.com/8e4kuf.jpg" width="400px"></center>