# Kerchunk and NetCDF/HDF5: A Case Study using the National Water Model - Short Range Forecast Dataset

## Motivation

NetCDF/HDF5 is one of the most widely adopted file formats in the earth sciences, supported by much of the community as well as by scientific agencies, data centers, and university labs. A huge amount of legacy data has been generated in this format. Fortunately, using `Kerchunk`, we can read these datasets as if they were Zarr stores, without converting the original files.

## About the Dataset
The National Water Model (NWM) dataset is produced by the National Oceanic and Atmospheric Administration's (NOAA's) Office of Water Prediction. It is a forecast model of water resources, providing multiple variables across the continental United States (CONUS).
This dataset is available via the Registry of Open Data on AWS as a collection of NetCDF files that can be read without login authentication. We will demonstrate how to build a `Kerchunk` index so that this dataset can be read as if it were an ARCO (Analysis-Ready, Cloud-Optimized) dataset.



## Overview
   
Within this notebook, we will cover:

1. How to access remote NetCDF data using `Kerchunk`
1. Combining multiple `Kerchunk` reference files using `MultiZarrToZarr`
1. Reading the output with `Xarray` and `Intake`

## Prerequisites
| Concepts | Importance | Notes |
| --- | --- | --- |
| [Kerchunk Basics](../foundations/kerchunk_basics) | Required | Core |
| [Multiple Files and Kerchunk](../foundations/kerchunk_multi_file) | Required | Core |
| [Introduction to Xarray](https://foundations.projectpythia.org/core/xarray/xarray-intro.html) | Required | IO/Visualization |
| [Intake Introduction](https://projectpythia.org/intake-cookbook/notebooks/intake_introduction.html) | Recommended | IO |
- **Time to learn**: 45 minutes
---

## Imports

In [1]:
# Module Imports
import os

import fsspec
import ujson
import xarray as xr
from kerchunk.combine import MultiZarrToZarr
from kerchunk.hdf import SingleHdf5ToZarr
from tqdm import tqdm

## Create Input File List

In [2]:
# Create an `fsspec` filesystem for AWS s3.
fs = fsspec.filesystem("s3", anon=True, skip_instance_cache=True)

# Use fsspec's glob to list the first forecast hour (f001) of every
# short-range cycle; these NetCDF files feed the Kerchunk index generation.
flist = fs.glob(
    "noaa-nwm-pds/nwm.*/short_range/nwm.*.short_range.channel_rt.f001.conus.nc"
)

# For the most recent cycle, also list every available forecast hour.
last_dir = os.path.dirname(flist[-1])
last_file = os.path.basename(flist[-1]).split(".")
last_files = fs.glob(
    f"{last_dir}/{last_file[0]}.{last_file[1]}.{last_file[2]}.channel_rt.*.conus.nc"
)

# Skip the first of the last_files since it duplicates the f001 file already in flist
flist.extend(last_files[1:])

# We need to add the "s3://" prefix to the list of files
# so that fsspec will recognize that these NetCDF files are on S3. There is no "storage_
urls = ["s3://" + f for f in flist]
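
The glob patterns above rely on the bucket's key layout. As a rough sketch of what each URL encodes, the snippet below splits one key into its parts. The example URL is hypothetical (its date and cycle are made up for illustration), and `parse_nwm_url` is a helper name introduced here, not part of `fsspec` or `Kerchunk`.

```python
def parse_nwm_url(url: str) -> dict:
    """Split an NWM short-range URL into its components.

    Assumes the layout implied by the glob pattern above:
    s3://<bucket>/nwm.<date>/short_range/nwm.<cycle>.short_range.channel_rt.<fhour>.conus.nc
    """
    parts = url.split("/")
    # parts == ["s3:", "", bucket, date_dir, "short_range", filename]
    bucket, date_dir, filename = parts[2], parts[3], parts[5]
    fields = filename.split(".")
    return {
        "bucket": bucket,
        "date": date_dir.split(".")[1],
        "cycle": fields[1],
        "forecast_hour": fields[4],
    }


# Hypothetical example URL; the date and cycle are illustrative only.
example = (
    "s3://noaa-nwm-pds/nwm.20230101/short_range/"
    "nwm.t00z.short_range.channel_rt.f001.conus.nc"
)
info = parse_nwm_url(example)
print(info)
```

The same positional split (`p[3]` for the date directory, `p[5]` for the file name) is what the reference-generation step uses to name its output files.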

## Iterate through `flist` and create `Kerchunk` indices as `.json` files

In [3]:
# Keyword arguments passed to `fs.open`
so = dict(mode="rb", anon=True, default_fill_cache=False, default_cache_type="first")
output_dir = "."

# Use Kerchunk's `SingleHdf5ToZarr` class to create a `Kerchunk` index from a NetCDF file.
def gen_json(u, output_dir: str):
    with fs.open(u, **so) as infile:
        # Chunks smaller than `inline_threshold` bytes are stored inline in the references
        h5chunks = SingleHdf5ToZarr(infile, u, inline_threshold=300)
        # u looks like "s3://<bucket>/<date dir>/short_range/<file name>"
        p = u.split("/")
        date = p[3]
        fname = p[5]
        outf = f"{output_dir}/{date}.{fname}.json"
        with open(outf, "wb") as f:
            f.write(ujson.dumps(h5chunks.translate()).encode())
        return outf


# Iterate through the file list to generate one Kerchunk reference file per NetCDF file.
# The files are independent, so this loop is a good candidate for parallelizing with `Dask`.
output_files = []
for fil in tqdm(urls):
    outf = gen_json(fil, output_dir)
    output_files.append(outf)

  3%|▎         | 25/733 [03:38<1:32:16,  7.82s/it]
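
The loop above processes files one at a time, and each file takes several seconds, so this step is a natural candidate for parallel execution. The sketch below shows the general pattern with a thread pool from the Python standard library, used here as a stand-in for `Dask`. `fake_gen_json` is a hypothetical placeholder for `gen_json` so that the example runs without network access, and the URLs are likewise made up.

```python
from concurrent.futures import ThreadPoolExecutor


def fake_gen_json(url: str, output_dir: str) -> str:
    """Hypothetical stand-in for gen_json: only builds the output path."""
    p = url.split("/")
    return f"{output_dir}/{p[3]}.{p[5]}.json"


# Made-up URLs that follow the same layout as `urls` above.
demo_urls = [
    "s3://noaa-nwm-pds/nwm.20230101/short_range/"
    f"nwm.t00z.short_range.channel_rt.f{hour:03d}.conus.nc"
    for hour in range(1, 4)
]

# executor.map returns results in input order, so each output path
# lines up with the URL at the same position in demo_urls.
with ThreadPoolExecutor(max_workers=4) as executor:
    demo_outputs = list(executor.map(lambda u: fake_gen_json(u, "."), demo_urls))

print(demo_outputs)
```

With the real `gen_json`, most of the time is likely spent waiting on S3 reads, so threads are a plausible fit; `Dask` offers the same map-style pattern across processes or a cluster.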

## Combine .json `Kerchunk` reference files and write a combined `Kerchunk` index

In [None]:
# Combine the single-file `Kerchunk` reference files into one multi-file `Kerchunk` dataset
mzz = MultiZarrToZarr(output_files, concat_dims=["time"])
d = mzz.translate()

# Write the combined `Kerchunk` `.json` record
output_fname = "NWM.json"
with open(output_fname, "wb") as f:
    f.write(ujson.dumps(d).encode())
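
For orientation, a version 1 `Kerchunk` reference set is a JSON object with a `refs` mapping whose values are either inlined strings (metadata or small chunks) or `[url, offset, length]` triples pointing into the original files. The hand-built dictionary below is a simplified, hypothetical illustration of that shape. The variable name, bucket, offsets, and lengths are invented, and the `.zarray` metadata is abridged; none of it comes from the actual NWM files.

```python
import json

# Hypothetical, heavily simplified reference set (all values are invented).
demo_refs = {
    "version": 1,
    "refs": {
        ".zgroup": '{"zarr_format": 2}',
        "example_var/.zarray": '{"shape": [2, 4], "chunks": [1, 4], "dtype": "<i4"}',
        "example_var/0.0": ["s3://example-bucket/file_a.nc", 1024, 512],
        "example_var/1.0": ["s3://example-bucket/file_b.nc", 1024, 512],
    },
}

# Round-trip through JSON, mirroring how the combined record is written to disk.
loaded = json.loads(json.dumps(demo_refs))

# Strings are inlined content; lists are byte-range pointers into source files.
inline_keys = [k for k, v in loaded["refs"].items() if isinstance(v, str)]
chunk_keys = [k for k, v in loaded["refs"].items() if isinstance(v, list)]
source_files = sorted({v[0] for v in loaded["refs"].values() if isinstance(v, list)})

print(inline_keys)
print(chunk_keys)
print(source_files)
```

In this toy example, the two chunks of one variable live in two different source files, which is the situation a combined reference set describes after concatenating along `time`.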

## Load kerchunked dataset

In [None]:
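# Create an `fsspec` reference filesystem from the `Kerchunk` output.
# `remote_protocol` and `remote_options` tell it how to reach the original files on S3.
fs = fsspec.filesystem(
    "reference",
    fo=output_fname,
    remote_protocol="s3",
    remote_options={"anon": True},
)
m = fs.get_mapper("")
# The reference set has no consolidated metadata, so skip that lookup
ds = xr.open_zarr(m, consolidated=False)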

In [None]:
ds