# Kerchunk and NetCDF/HDF5: A Case Study Using the National Water Model - Short Range Forecast Dataset

## Overview
   
Within this notebook, we will cover:

1. How to access remote NetCDF data using `Kerchunk`
1. Combining multiple `Kerchunk` reference files using `MultiZarrToZarr`
1. Reading the output with `Xarray` and `Intake`

This notebook shares many similarities with the [Multi-File Datasets with Kerchunk](../foundations/kerchunk_multi_file.ipynb) notebook. If you are unsure what a block of code does, please refer to that notebook for a more detailed breakdown of each line.


## Prerequisites
| Concepts | Importance | Notes |
| --- | --- | --- |
| [Kerchunk Basics](../foundations/kerchunk_basics) | Required | Core |
| [Multiple Files and Kerchunk](../foundations/kerchunk_multi_file) | Required | Core |
| [Introduction to Xarray](https://foundations.projectpythia.org/core/xarray/xarray-intro.html) | Required | IO/Visualization |
| [Intake Introduction](https://projectpythia.org/intake-cookbook/notebooks/intake_introduction.html) | Recommended | IO |
- **Time to learn**: 45 minutes
---

## Motivation

NetCDF/HDF5 is one of the most widely adopted file formats in the earth sciences, supported by much of the community as well as by scientific agencies, data centers, and university labs. A huge amount of legacy data has been generated in this format. Fortunately, using `Kerchunk`, we can read these datasets as if they were `Zarr` stores.

## About the Dataset
The National Water Model dataset is produced by the National Oceanic and Atmospheric Administration's (NOAA's) Office of Water Prediction. It is a forecast model of water resources, providing multiple variables across the continental United States (CONUS).
This dataset is available via the Registry of Open Data on AWS as a collection of NetCDF files that do not require any login authentication. Using `Kerchunk`, we will demonstrate how to build a `Kerchunk` index so that this dataset can be read as if it were an ARCO (Analysis-Ready, Cloud-Optimized) dataset.



## Imports

In [1]:
import os

import fsspec
import ujson
import xarray as xr
from kerchunk.combine import MultiZarrToZarr
from kerchunk.hdf import SingleHdf5ToZarr
from tqdm import tqdm

## Create Input File List

Here we use `fsspec`'s glob functionality, along with the `*` wildcard operator and some string slicing, to grab a list of NetCDF files from an `s3` `fsspec` filesystem.

In [2]:
# Create an `fsspec` filesystem for AWS s3.
fs = fsspec.filesystem("s3", anon=True, skip_instance_cache=True)

# Use fsspec and glob to retrieve a list of all NetCDF files to be used in the `Kerchunk` index generation.
# This pattern matches the f001 file of every short-range run.
file_pattern = fs.glob(
    "noaa-nwm-pds/nwm.*/short_range/nwm.*.short_range.channel_rt.f001.conus.nc"
)

# For the last run in the list, also collect every channel_rt file (f001, f002, ...).
last_dir = os.path.dirname(file_pattern[-1])
last_file = os.path.basename(file_pattern[-1]).split(".")
last_files = fs.glob(
    f"{last_dir}/{last_file[0]}.{last_file[1]}.{last_file[2]}.channel_rt.*.conus.nc"
)

# Skip the first of the last_files since it's a duplicate
file_pattern.extend(last_files[1:])

# We need to add the "s3://" prefix to each entry in the list of files
# so that fsspec will recognize that these NetCDF files are on S3.
urls = ["s3://" + f for f in file_pattern]
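
The string slicing above can be checked offline on a single path. The sketch below is only an illustration: the date directory `nwm.20230101` and the run name `nwm.t00z` are assumed sample values, not entries taken from the bucket listing, and `fnmatch` stands in for the remote `fs.glob` call.

```python
import fnmatch
import os

# Assumed sample key; the date and run time are made up for illustration.
sample = (
    "noaa-nwm-pds/nwm.20230101/short_range/"
    "nwm.t00z.short_range.channel_rt.f001.conus.nc"
)

# Same slicing as in the cell above.
last_dir = os.path.dirname(sample)
last_file = os.path.basename(sample).split(".")
pattern = (
    f"{last_dir}/{last_file[0]}.{last_file[1]}.{last_file[2]}"
    ".channel_rt.*.conus.nc"
)

# A sibling file from the same run should match the rebuilt pattern.
sibling = sample.replace("f001", "f002")
matches = fnmatch.fnmatch(sibling, pattern)

print(pattern)
print(matches)
```

The first three dot-separated pieces of the file name identify the run, so the rebuilt pattern keeps the run fixed and lets the wildcard vary only the forecast-file part of the name.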

### OPTIONAL: To speed up the next section, uncomment the line in the next cell. This reduces the number of input files and speeds up the example considerably.

In [3]:
# urls = urls[0:10]

## Iterate through the `urls` list and create `Kerchunk` indices as `.json` files

In the cell below, we define a dictionary of kwargs to pass to `fsspec` and `s3fs`.
After that, we create a function named `gen_json`, which creates a `Kerchunk` index for a given `NetCDF` file and writes that index to a `.json` reference file.

In [4]:
# fsspec.open args
so = dict(mode="rb", anon=True, default_fill_cache=False, default_cache_type="first")
output_dir = "./"

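# Use Kerchunk's `SingleHdf5ToZarr` class to create a `Kerchunk` index from a NetCDF file.
def gen_json(u, output_dir: str):
    with fs.open(u, **so) as infile:
        # Chunks smaller than `inline_threshold` bytes are stored inline in the reference file.
        h5chunks = SingleHdf5ToZarr(infile, u, inline_threshold=300)
        # u looks like "s3://noaa-nwm-pds/<date directory>/short_range/<file name>",
        # so index 3 is the date directory and index 5 is the file name.
        p = u.split("/")
        date = p[3]
        fname = p[5]
        outf = f"{output_dir}/{date}.{fname}.json"
        with open(outf, "wb") as f:
            f.write(ujson.dumps(h5chunks.translate()).encode())
        return outf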


# Iterate through the URL list to generate Kerchunk reference files.
# This loop is a good candidate for parallelization with `Dask`.
output_files = []
for fil in tqdm(urls):
    outf = gen_json(fil, output_dir)
    output_files.append(outf)

100%|██████████| 708/708 [47:20<00:00,  4.01s/it]
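
Because each file is processed independently, the loop above can be run concurrently. The code comment suggests `Dask`; the sketch below swaps in the standard library's `ThreadPoolExecutor` to show the same idea without extra dependencies. The helpers `reference_path` and `run_parallel` are hypothetical names introduced here, the sample URLs use an assumed date and run times, and `reference_path` only reproduces the output-name logic of `gen_json` so that the example runs without network access.

```python
from concurrent.futures import ThreadPoolExecutor


def reference_path(u: str, output_dir: str) -> str:
    """Hypothetical stand-in for gen_json: build only the output file name."""
    p = u.split("/")
    date, fname = p[3], p[5]
    return f"{output_dir}/{date}.{fname}.json"


def run_parallel(func, items, max_workers=8):
    """Apply func to every item using a thread pool, preserving input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(func, items))


# Assumed sample URLs; the date and run times are made up for illustration.
sample_urls = [
    "s3://noaa-nwm-pds/nwm.20230101/short_range/"
    f"nwm.t{hour:02d}z.short_range.channel_rt.f001.conus.nc"
    for hour in range(3)
]

planned = run_parallel(lambda u: reference_path(u, "."), sample_urls)
print(planned[0])
```

In the notebook, the same pattern would be `run_parallel(lambda u: gen_json(u, output_dir), urls)`. Threads are a reasonable fit here because most of the time per file is likely spent waiting on S3 reads rather than on computation.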


## Combine .json `Kerchunk` reference files and write a combined `Kerchunk` index

In the following cell, we are combining all the `.json` reference files that were generated above into a single reference file and writing that file to disk.

In [5]:
# Combine single Kerchunk output reference files into a multi-file Kerchunk dataset
mzz = MultiZarrToZarr(output_files, concat_dims=["time"])
d = mzz.translate()

# Write Kerchunk .json record
output_fname = "NWM.json"
with open(output_fname, "wb") as f:
    f.write(ujson.dumps(d).encode())
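
To get a feel for what ends up in the combined file, it can help to look at the layout of a reference set. The sketch below uses a small, hand-written dictionary in the Kerchunk version 1 layout, where each key under `refs` maps either to inline content (a string) or to a `[url, offset, length]` triple pointing into an original file. The bucket, variable name `data_var`, offsets, and lengths are all made up for illustration, and `summarize_refs` is a hypothetical helper, not part of Kerchunk.

```python
import json

# Hypothetical reference set; every key, URL, offset, and length is made up.
refs = {
    "version": 1,
    "refs": {
        ".zgroup": '{"zarr_format": 2}',
        "time/0": "base64:AAAAAAAAAAA=",
        "data_var/0.0": ["s3://example-bucket/a.nc", 1024, 4096],
        "data_var/1.0": ["s3://example-bucket/b.nc", 1024, 4096],
    },
}


def summarize_refs(d):
    """Count inline entries versus entries that point at remote byte ranges."""
    inline = 0
    remote = 0
    for value in d["refs"].values():
        if isinstance(value, list):
            remote += 1
        else:
            inline += 1
    return {"inline": inline, "remote": remote}


# Round-trip through JSON, mirroring how the notebook writes the file to disk.
summary = summarize_refs(json.loads(json.dumps(refs)))
print(summary)
```

Running the same helper on the dictionary `d` returned by `mzz.translate()` should show how many chunks are stored inline and how many are read from S3 on demand, assuming `d` follows this layout.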

## Load kerchunked dataset

In [6]:
# Create an fsspec reference filesystem from the Kerchunk output.
# The referenced NetCDF files live in a public S3 bucket, so request anonymous access.
fs = fsspec.filesystem(
    "reference", fo=output_fname, remote_protocol="s3", remote_options={"anon": True}
)
m = fs.get_mapper("")
# The reference store has no consolidated metadata, so skip that lookup.
ds = xr.open_zarr(m, consolidated=False)


In [7]:
ds

Output (Dask array summaries from the dataset repr):

| Number of arrays | Shape | Chunk shape | Size per array | Chunk size | Dask graph | Data type |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | (708,) | (1,) | 5.53 kiB | 8 B | 708 chunks in 2 graph layers | object numpy.ndarray |
| 6 | (708, 2776738) | (1, 2776738) | 14.65 GiB | 21.18 MiB | 708 chunks in 2 graph layers | float64 numpy.ndarray |
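
The sizes reported in the output follow directly from the array shapes. As a quick check, assuming 8 bytes per `float64` value, the arithmetic can be reproduced as follows:

```python
# Shape values taken from the dataset repr above.
n_time = 708
n_features = 2_776_738
bytes_per_value = 8  # float64

# One chunk holds a single time step across all features.
chunk_bytes = n_features * bytes_per_value
chunk_mib = chunk_bytes / 2**20

# The full array is one such chunk per time step.
total_bytes = n_time * chunk_bytes
total_gib = total_bytes / 2**30

print(f"{chunk_mib:.2f} MiB per chunk")
print(f"{total_gib:.2f} GiB per array")
```

Each of the 708 input files therefore contributes one chunk of about 21 MiB per variable, which is why the combined dataset has 708 chunks along the `time` dimension.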
