## HDF5 data in the cloud

Many large data collections are hosted in the cloud and are freely available.
For example, see https://registry.opendata.aws/

HDF5 data stored in AWS S3 can be accessed directly with h5py and s3fs, or
through HSDS (Highly Scalable Data Service) and h5pyd (an h5py-API-compatible package for HSDS).

This notebook illustrates accessing the NREL NSRDB (National Solar Radiation Database) using both h5pyd 
and h5py.

Running this notebook in Codespaces generally makes data access faster, since the bulk of
the data transfer happens on a high-speed internet backbone. The data is physically located in
the AWS us-west-2 region, so access may be somewhat faster if you select us-west when creating
the codespace.

Once the codespace environment is ready, you can start evaluating the Jupyter notebooks 
(by placing the cursor into a code cell and pressing `Ctrl+Enter` or `Shift+Enter`). 
When prompted for a Python kernel, select

```
hdf5-tutorial (Python 3.11.x) /opt/conda/envs/hdf5-tutorial/python
```

In [None]:
%matplotlib inline
USE_H5PY = False  # set to True to use h5py/hdf5lib instead
if USE_H5PY:
    import h5py
    import s3fs  # This package enables h5py to "see" S3 files as read-only POSIX files
else:
    import h5pyd as h5py  # Use the "as" syntax for code compatibility
import numpy as np
import matplotlib.pyplot as plt

In [None]:
# hsls is an h5pyd command-line utility that lists HSDS domains
# Use the --bucket option to list files from NREL's S3 bucket
# Run with the -r option to see all domains
! hsls --bucket s3://nrel-pds-hsds /nrel/nsrdb/

In [None]:
# Drill down to the conus directory.  Use -H and -v options to show the file sizes
# Downloading one of these files would take over a month with a standard
# broadband internet connection!

! hsls -H -v --bucket s3://nrel-pds-hsds /nrel/nsrdb/conus/
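
The download-time remark above is simple arithmetic. The sketch below shows the calculation with made-up numbers: the 10 TB file size and the 25 Mbit/s link speed are assumptions chosen for illustration, not values taken from the `hsls` listing, and `download_days` is a hypothetical helper.

```python
def download_days(size_bytes: float, bandwidth_mbit_s: float) -> float:
    """Days needed to transfer size_bytes over a link of bandwidth_mbit_s megabits/s."""
    bits = size_bytes * 8
    seconds = bits / (bandwidth_mbit_s * 1e6)
    return seconds / 86_400  # seconds per day


# Hypothetical values for illustration only: a 10 TB file over a 25 Mbit/s connection
days = download_days(10e12, 25)
print(f"{days:.1f} days")  # prints "37.0 days"
```

Substituting a file size from the listing and your own connection speed gives an estimate for your situation; protocol overhead and throttling are ignored here.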

In [None]:
%%time
# Open one of the nsrdb files.  Use the bucket param to get the data from NREL's S3 bucket
if USE_H5PY:
    s3 = s3fs.S3FileSystem(anon=True)
    f = h5py.File(s3.open("s3://nrel-pds-nsrdb/conus/nsrdb_conus_pv_2022.h5", "rb"), "r")
else:
    f = h5py.File("/nrel/nsrdb/conus/nsrdb_conus_2022.h5", bucket="s3://nrel-pds-hsds")

In [None]:
# attributes can be used to provide descriptions of the content
%time f.attrs['version']   

In [None]:
list(f)  # datasets under root group

In [None]:
dset = f["air_temperature"]
dset

In [None]:
# each dataset has an id
dset.id.id

In [None]:
dset.shape  # two-dimensional: time x station_index

In [None]:
# get the chunk shape
dset.chunks

In [None]:
# compute the number of bytes per chunk (about 2 MB)
np.prod(dset.chunks) * dset.dtype.itemsize   

In [None]:
# compute the number of chunks in the dataset
# (ceiling division, so partial chunks at the edges are counted too)
np.prod([-(-s // c) for s, c in zip(dset.shape, dset.chunks)])
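
When a dimension is not an exact multiple of the chunk size, the partial chunk at the edge still counts, so the chunk count calls for ceiling rather than floor division. Here is a minimal sketch of that arithmetic in plain Python. The shape and chunk values are hypothetical, not those of the NSRDB dataset, and `count_chunks` and `chunk_nbytes` are illustrative helpers, not part of h5py or h5pyd.

```python
import math


def count_chunks(shape, chunks):
    """Number of chunks needed to cover `shape`, counting partial edge chunks."""
    return math.prod((s + c - 1) // c for s, c in zip(shape, chunks))


def chunk_nbytes(chunks, itemsize):
    """Uncompressed size in bytes of one chunk."""
    return math.prod(chunks) * itemsize


# Hypothetical dataset: 1000 x 2500 elements of 2 bytes, stored in 300 x 400 chunks
shape, chunks, itemsize = (1000, 2500), (300, 400), 2

print(count_chunks(shape, chunks))                      # 4 * 7 = 28
print((shape[0] // chunks[0]) * (shape[1] // chunks[1]))  # floor division gives 3 * 6 = 18
print(chunk_nbytes(chunks, itemsize))                   # 240000
```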

In [None]:
# read one year of measurements for a given station index.
# this will require reading ~100 MB from S3
%time tseries = dset[:, 1234567]
tseries
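
Reading a single column touches much more data than the values returned, because every chunk that intersects the selection has to be read in full. The sketch below estimates that cost for a two-dimensional dataset. The shape, chunk shape, and item size are hypothetical, `column_read_cost` is an illustrative helper, and the byte count is the uncompressed size, so the actual transfer for a compressed dataset would differ.

```python
def column_read_cost(shape, chunks, itemsize):
    """Estimate (chunks touched, uncompressed bytes read) for reading one full column."""
    rows, _ = shape
    chunk_rows, chunk_cols = chunks
    # one chunk per block of rows; the column falls inside a single chunk column
    n_chunks = (rows + chunk_rows - 1) // chunk_rows
    nbytes = n_chunks * chunk_rows * chunk_cols * itemsize
    return n_chunks, nbytes


# Hypothetical dataset: 1000 x 5000 elements of 2 bytes, stored in 300 x 100 chunks
n_chunks, nbytes = column_read_cost((1000, 5000), (300, 100), 2)
useful = 1000 * 2  # bytes actually returned for the column

print(n_chunks, nbytes, nbytes // useful)  # 4 240000 120
```

In this made-up case, 120 times more data is read than is returned. A chunk shape that is long along the time axis reduces the overhead for time-series reads like this one.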

In [None]:
# get min, max, and mean values
tseries.min(), tseries.max(), tseries.mean()

In [None]:
# plot the data
x = range(len(tseries))
plt.plot(x, tseries)

In [None]:
# This dataset is actually linked from an HDF5 file in a different bucket
if USE_H5PY:
    # this property doesn't exist for h5py
    layout = None
else:
    layout = dset.id.layout
layout

In [None]:
# The HSDS domain actually maps to several different HDF5 files
# compile a list of all the files
hdf5_files = set()
if not USE_H5PY:
    for k in f:
        # use a local lookup so the `dset` variable from earlier cells is not overwritten
        layout = f[k].id.layout
        if layout and "file_uri" in layout:
            hdf5_files.add(layout["file_uri"])
hdf5_files
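
The collection logic above can be tried without a server connection. The sketch below runs it on stand-in dictionaries: the keys other than `file_uri`, the bucket name, and the file names are all invented for illustration, and `collect_file_uris` is a hypothetical helper rather than an h5pyd function.

```python
def collect_file_uris(layouts):
    """Return the set of distinct 'file_uri' values found in layout dictionaries."""
    uris = set()
    for layout in layouts:
        # skip entries with no layout information or no link to an external file
        if isinstance(layout, dict) and "file_uri" in layout:
            uris.add(layout["file_uri"])
    return uris


# Hypothetical layout dictionaries, for illustration only
layouts = [
    {"file_uri": "s3://example-bucket/a.h5", "dims": [300, 100]},
    {"file_uri": "s3://example-bucket/b.h5", "dims": [300, 100]},
    {"file_uri": "s3://example-bucket/a.h5", "dims": [300, 100]},  # duplicate URI
    {"dims": [300, 100]},  # no external file
    None,                  # no layout available
]

print(sorted(collect_file_uris(layouts)))
# ['s3://example-bucket/a.h5', 's3://example-bucket/b.h5']
```

Using a set means each source file is listed once, however many datasets link to it.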