# Virtual aggregate CESM MOM6 datasets with kerchunk

This notebook is adapted from the [work](https://github.com/lsterzinger/2022-esip-kerchunk-tutorial/blob/main/01-Create_References.ipynb) by [Lucas Sterzinger](https://lucassterzinger.com/) (an NCAR SIParCS intern in 2021).

## What is kerchunk?

From the [docs](https://fsspec.github.io/kerchunk/):

> 1. Kerchunk is a library that provides a unified way to represent a variety of chunked, compressed data formats (e.g. NetCDF/HDF5, GRIB2, TIFF, …), allowing efficient access to the data from traditional file systems or cloud object storage. 
> 2. It also provides a flexible way to create virtual datasets from multiple files. 
> 3. It does this by extracting the byte ranges, compression information and other information about the data and storing this metadata in a new, separate object. 
> 4. This means that you can create a virtual aggregate dataset over potentially many source files, for efficient, parallel and cloud-friendly in-situ access without having to copy or translate the originals.
> …
> 5. For binary storage of array data, essentially all formats involve taking blocks of in-memory C buffers and encoding/compressing them to disc, with some additional metadata describing the details of that buffer plus any other attributes. This description can be applied to a very wide variety of data formats.
> 6. The primary purpose of kerchunk is to find where these binary blocks are, and how to decode them, so that blocks from one or more files can be arranged into aggregate datasets accessed via the zarr library and the power of fsspec


In practice, we use kerchunk to generate a JSON file containing "references" to binary blocks stored elsewhere. 
The JSON file is structured to look like a [Zarr dataset](https://zarr.readthedocs.io/en/stable/).
Such a file can be interpreted as an aggregate Zarr dataset using [fsspec](https://filesystem-spec.readthedocs.io/en/latest/?badge=latest) and zarr.
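To make the structure concrete, here is a minimal hand-written sketch of such a reference set. The variable name `x`, the file path, and the byte offsets are invented for illustration; only the overall layout (a `version`, a `refs` mapping, inline strings for metadata, and `[path, offset, length]` lists for chunks) follows what kerchunk prints later in this notebook.

```python
import json

# A hand-written, hypothetical reference set for one 1-D variable "x".
# The path, offset, and length are made up for illustration.
refs = {
    "version": 1,
    "refs": {
        ".zgroup": '{"zarr_format":2}',
        "x/.zarray": json.dumps(
            {
                "chunks": [3],
                "compressor": None,
                "dtype": ">f8",
                "fill_value": None,
                "filters": None,
                "order": "C",
                "shape": [3],
                "zarr_format": 2,
            }
        ),
        "x/.zattrs": '{"_ARRAY_DIMENSIONS":["x"]}',
        "x/0": ["/path/to/some_file.nc", 1024, 24],
    },
}

# The whole structure serializes to plain JSON and back.
roundtripped = json.loads(json.dumps(refs))

# Metadata entries are stored inline as strings;
# chunk entries point at a byte range in another file.
inline = sorted(k for k, v in roundtripped["refs"].items() if isinstance(v, str))
pointers = sorted(k for k, v in roundtripped["refs"].items() if isinstance(v, list))
print("inline:", inline)
print("pointers:", pointers)
```

The keys mirror the file names a Zarr store would contain, which is why Zarr can treat the JSON as if it were a dataset.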

## Summary

We'll create a virtual aggregate Zarr dataset to represent CESM MOM6 ocean component outputs in the netCDF3 format.

Output streams for this particular simulation are:
1. `static` file with time-invariant grid variables
2. `sfc` files with daily average surface information
3. `h` files with monthly averages of full 3D fields at fixed depth levels

For analysis, we'd like the information in the `static` file to be merged into both the `h` dataset and the `sfc` dataset.
So we'll merge them using `kerchunk.combine.merge_vars`.

Then we generate aggregate datasets (JSON files) for the `h` and `sfc` datasets independently.

```{note}
These two datasets cannot be combined into a single Dataset without renaming the `time` dimension because of the different time frequency. In general, it's possible that the same variable name appears in different output streams, so merging is usually not a good idea.
```

We can use Zarr to represent both `sfc` and `h` in a single dataset using multiple [groups](https://zarr.readthedocs.io/en/stable/spec/v2.html#groups).
To do so, we generate a new JSON file that represents all output streams using a Zarr group for each stream (the `h` dataset forms one group, and the `sfc` dataset another group).
The Zarr specification for [groups](https://zarr.readthedocs.io/en/stable/spec/v2.html#groups) is quite simple, so this turns out to be easy.

We then demo reading the aggregate Dataset in two ways:
1. Individual groups using `xarray.open_dataset` with the `group` kwarg
2. All groups at once using the [datatree](https://xarray-datatree.readthedocs.io/en/latest/) library.


## Setup

In [1]:
%load_ext watermark

from glob import glob

import dask
import fsspec
import kerchunk
import ujson
import xarray as xr
from kerchunk.combine import MultiZarrToZarr
from kerchunk.netCDF3 import NetCDF3ToZarr

%watermark -iv

kerchunk: 0.1.0
xarray  : 2023.2.0
fsspec  : 2022.11.0
dask    : 2023.1.0
ujson   : 5.7.0



I requested 8 cores for my session.

In [3]:
from dask.distributed import Client

client = Client(threads_per_worker=4)
client

Client: connected to a distributed.LocalCluster (status: running, using processes)
Cluster: 2 workers, 8 threads total, 32.00 GiB memory total
Each worker: 4 threads, 16.00 GiB memory


## CESM MOM6 output

There are a large number of files. Usually we use [intake-esm](https://intake-esm.readthedocs.io/en/stable/) to catalog and access the files.
The downside is that navigating the catalog can be painful, and reading from disk involves touching many files with `xarray.open_mfdataset`.
This can take a while.

In [2]:
root = "/glade/campaign/cgd/oce/projects/pump/cesm/"
casename = "gmom.e23.GJRAv3.TL319_t061_zstar_N65.baseline.kpp.lmd.004.mixpods"

There's a lot of output here, so we'll read a subset.

In [31]:
from glob import glob

files = glob(f"{root}/{casename}/run/*mom6.*")
len(files)

The static file (`gmom.e23.GJRAv3.TL319_t061_zstar_N65.baseline.kpp.lmd.004.mixpods.mom6.static.nc`) has the grid information.

In [15]:
(staticfile,) = glob(f"{root}/{casename}/run/*static*")
print(staticfile)

/glade/campaign/cgd/oce/projects/pump/cesm//gmom.e23.GJRAv3.TL319_t061_zstar_N65.baseline.kpp.lmd.004.mixpods/run/gmom.e23.GJRAv3.TL319_t061_zstar_N65.baseline.kpp.lmd.004.mixpods.mom6.static.nc


## Simple example: generate references for the static file

kerchunk provides a [number of "backends"](https://fsspec.github.io/kerchunk/reference.html) to generate the "references" for a file format.

CESM output uses netCDF3 so we'll use `NetCDF3ToZarr`. Call `.translate` on the returned object to create a dictionary representation of the JSON file.

In [20]:
refs = NetCDF3ToZarr(staticfile)
refs.translate()

{'version': 1,
 'refs': {'.zgroup': '{"zarr_format":2}',
  'xh/.zarray': '{"chunks":[540],"compressor":null,"dtype":">f8","fill_value":null,"filters":null,"order":"C","shape":[540],"zarr_format":2}',
  'xh/0': ['/glade/campaign/cgd/oce/projects/pump/cesm//gmom.e23.GJRAv3.TL319_t061_zstar_N65.baseline.kpp.lmd.004.mixpods/run/gmom.e23.GJRAv3.TL319_t061_zstar_N65.baseline.kpp.lmd.004.mixpods.mom6.static.nc',
   23048,
   4320],
  'xh/.zattrs': '{"_ARRAY_DIMENSIONS":["xh"],"cartesian_axis":"X","long_name":"h point nominal longitude","units":"degrees_east"}',
  'yh/.zarray': '{"chunks":[458],"compressor":null,"dtype":">f8","fill_value":null,"filters":null,"order":"C","shape":[458],"zarr_format":2}',
  'yh/0': ['/glade/campaign/cgd/oce/projects/pump/cesm//gmom.e23.GJRAv3.TL319_t061_zstar_N65.baseline.kpp.lmd.004.mixpods/run/gmom.e23.GJRAv3.TL319_t061_zstar_N65.baseline.kpp.lmd.004.mixpods.mom6.static.nc',
   27368,
   3664],
  'yh/.zattrs': '{"_ARRAY_DIMENSIONS":["yh"],"cartesian_axis":"Y","

Consider the entries `xh/.zarray` and `xh/0`.

If this dataset were actually stored on disk as a Zarr `DirectoryStore`, then
- there would be a subfolder named `xh`;
- the `xh/.zarray` file would identify `xh` as an array;
- the `xh/0` file would contain all the `xh` values, stored as a single chunk.

The value associated with `xh/0` identifies a byte range in a file that contains the actual values.
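The numbers in these two entries can be checked against each other. The sketch below copies the metadata from the output above and confirms that the byte length matches the declared dtype and chunk shape. The chunk contents are fabricated (the real bytes live in the netCDF file), so only the arithmetic and the decoding step are meaningful.

```python
import json
import struct

# Metadata copied from the `xh/.zarray` entry shown above
zarray = json.loads(
    '{"chunks":[540],"compressor":null,"dtype":">f8","fill_value":null,'
    '"filters":null,"order":"C","shape":[540],"zarr_format":2}'
)
offset, length = 23048, 4320  # copied from the `xh/0` entry

# ">f8" means big-endian 8-byte floats, so one chunk of 540 values
# should occupy 540 * 8 bytes.
itemsize = int(zarray["dtype"][2:])
nvalues = zarray["chunks"][0]
expected_length = nvalues * itemsize
print(expected_length, expected_length == length)

# The byte range ends where the `yh/0` range begins (27368 in the output above).
end = offset + length
print(end)

# Stand-in for the bytes that would be read from the netCDF file:
# fabricated values 0.0, 1.0, 2.0, ...
fake_chunk = struct.pack(f">{nvalues}d", *(float(i) for i in range(nvalues)))

# With no compressor and no filters, decoding is just reinterpreting
# the raw bytes using the declared dtype.
values = struct.unpack(f">{nvalues}d", fake_chunk)
```

The same check works for `yh`: 458 values of 8 bytes each give the 3664 bytes recorded in `yh/0`.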

Wrap the `NetCDF3ToZarr(...).translate()` call in a function for reuse later.

In [21]:
def gen_ref(f):
    return NetCDF3ToZarr(f).translate()

Manipulating the JSON file can be painful. kerchunk comes with some useful pre-processors.

Here we'll use `kerchunk.combine.drop` to drop the `time` variable to avoid some problems later on.

In [22]:
# The static file with time-invariant variables has a useless `time` dimension.
# This messes up kerchunk's heuristics.
# kerchunk.combine.drop returns a function ...
drop_time = kerchunk.combine.drop("time")
staticdict = drop_time(gen_ref(staticfile))
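It can be worth checking that a preprocessing step removed what you expected. The sketch below uses a toy reference dict rather than the real `staticdict`. The helper `variables_in` is hypothetical, and the manual filtering is written out only for illustration; it is not meant to reproduce how `kerchunk.combine.drop` is implemented.

```python
def variables_in(refs):
    """Return the set of variable names present in a reference dict.

    Accepts either the full {"version": ..., "refs": {...}} layout or
    just the inner mapping of keys to references.
    """
    inner = refs["refs"] if "version" in refs else refs
    # Variable entries look like "<name>/<something>"; top-level
    # entries such as ".zgroup" contain no slash.
    return {key.split("/", 1)[0] for key in inner if "/" in key}


# Toy references with invented paths and offsets
toy = {
    "version": 1,
    "refs": {
        ".zgroup": '{"zarr_format":2}',
        "xh/.zarray": "...",
        "xh/0": ["static.nc", 100, 40],
        "time/.zarray": "...",
        "time/0": ["static.nc", 140, 8],
    },
}

before = variables_in(toy)

# Remove every key that belongs to the `time` variable
toy["refs"] = {k: v for k, v in toy["refs"].items() if not k.startswith("time/")}

after = variables_in(toy)
print(sorted(before), "->", sorted(after))
```

Running the same helper on the real references before and after `drop_time` would confirm whether `time` is gone.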

## JSONs for the `sfc` and `h` datasets

This bit generates individual JSONs for the `sfc` and `h` datasets:
1. For each `.nc` file generate references with `gen_ref`
2. Use `kerchunk.combine.MultiZarrToZarr` to consolidate to a single Zarr dataset.
3. Merge in the static dataset references using `kerchunk.combine.merge_vars`.
4. Write a new JSON file

In [155]:
def generate_json(root, casename, code, static_refs):
    """
    Generate Kerchunk references for CESM output.
    """

    from pathlib import Path

    import dask.bag
    import ujson

    # Get list of files
    flist = sorted(glob(f"{root}/{casename}/run/*mom6.{code}_*"))

    # parallelize generating references using dask.bag
    bag = dask.bag.from_sequence(flist, npartitions=len(flist)).map(gen_ref)
    dicts = bag.compute()

    # Combine multiple  Zarr references (one per file) to
    # a single aggregate reference file
    mzz = MultiZarrToZarr(dicts, concat_dims="time")

    # merge in the static variable references
    merged = kerchunk.combine.merge_vars([static_refs.copy(), mzz.translate()])

    # create the output directory if needed
    Path(f"{root}/{casename}/run/jsons/").mkdir(parents=True, exist_ok=True)

    # write the JSON
    with open(f"{root}/{casename}/run/jsons/{code}.json", "wb") as f:
        f.write(ujson.dumps(merged).encode())

In [166]:
generate_json(root, casename, code="sfc", static_refs=staticdict)

In [156]:
generate_json(root, casename, code="h", static_refs=staticdict)
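To build some intuition for the consolidation step, the sketch below shows one thing that has to happen when per-file references are stacked along the first dimension: chunk keys from later files must be renumbered so they follow the chunks of earlier files. This is a deliberately simplified illustration, not `MultiZarrToZarr` itself. The function name `concat_chunk_keys`, the variable name `sst`, and all paths and offsets are hypothetical, and the real class does considerably more (for example, it also has to update the `.zarray` shape).

```python
def concat_chunk_keys(per_file_refs, varname):
    """Renumber chunk keys of `varname` so files stack along the first dimension."""
    combined = {}
    offset = 0
    for refs in per_file_refs:
        n_first = 0
        for key, value in refs.items():
            if not key.startswith(f"{varname}/"):
                continue
            name = key.split("/", 1)[1]
            if name.startswith("."):
                # metadata such as .zarray or .zattrs, not a chunk
                continue
            idx = name.split(".")
            i = int(idx[0])
            # shift the first chunk index by the chunks already placed
            new_name = ".".join([str(i + offset)] + idx[1:])
            combined[f"{varname}/{new_name}"] = value
            n_first = max(n_first, i + 1)
        offset += n_first
    return combined


# Two toy "files": the first holds two records, the second holds one
file_a = {
    "sst/.zarray": "...",
    "sst/0.0.0": ["a.nc", 100, 40],
    "sst/1.0.0": ["a.nc", 140, 40],
}
file_b = {
    "sst/.zarray": "...",
    "sst/0.0.0": ["b.nc", 100, 40],
}

stacked = concat_chunk_keys([file_a, file_b], "sst")
for key in sorted(stacked):
    print(key, stacked[key])
```

The chunk from the second file ends up as `sst/2.0.0`, so a reader sees one array with three records even though the bytes live in two files.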



## Demo: reading a dataset

To read the dataset with Xarray, the JSON file needs to be represented as a Zarr dataset.

Use `fsspec` to do this.

In [23]:
fs = fsspec.filesystem(
    "reference",  # protocol
    fo=f"{root}/{casename}/run/jsons/sfc.json",  # json
    skip_instance_cache=True,  # skip caching, this is useful when building catalogs.
)
mapper = fs.get_mapper(root="")

The mapper is a dictionary-like object. For example, we can ask it for the `.zgroup` "file".

In [24]:
mapper[".zgroup"]

b'{"zarr_format":2}'

Magic! The zarr library asks the `mapper` for a 'file', and the `fsspec` library responds with data from the appropriate bytes stored in a file somewhere else.
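The idea behind that lookup can be sketched with a toy, read-only mapping. The class `ToyReferenceMapper` below is hypothetical and is not how `fsspec` implements its reference filesystem, which also deals with remote storage, caching, and other reference encodings. It only shows the core step: inline entries are returned directly, and pointer entries are resolved by reading a byte range from another file.

```python
import os
import tempfile
from collections.abc import Mapping


class ToyReferenceMapper(Mapping):
    """A toy, read-only stand-in for the mapper (hypothetical class)."""

    def __init__(self, refs):
        self._refs = refs

    def __getitem__(self, key):
        ref = self._refs[key]
        if isinstance(ref, str):
            # inline entry: the value itself is the content
            return ref.encode()
        # pointer entry: read `length` bytes starting at `offset`
        path, offset, length = ref
        with open(path, "rb") as f:
            f.seek(offset)
            return f.read(length)

    def __iter__(self):
        return iter(self._refs)

    def __len__(self):
        return len(self._refs)


# Fabricate a "data file": 8 header bytes, a 13-byte payload, a trailer
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"HEADER--" + b"payload-bytes" + b"--TRAILER")
    path = f.name

toy = ToyReferenceMapper({".zgroup": '{"zarr_format":2}', "x/0": [path, 8, 13]})
group_meta = toy[".zgroup"]
chunk = toy["x/0"]
print(group_meta, chunk)

os.remove(path)  # clean up the temporary file
```

Because the object behaves like a dictionary of "file name" to bytes, a library that only needs key lookups does not have to know where the bytes actually come from.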

In [26]:
xr.open_zarr(mapper, use_cftime=True, consolidated=False)

Output (xarray.Dataset repr, summarized): every variable is backed by a dask array.
- Time-varying fields: shape (9101, 458, 540), chunks (1, 458, 540), float32, 8.39 GiB each (9101 chunks of 0.94 MiB).
- Static grid fields: shape (458, 540), a single chunk, float32, 0.94 MiB each.
- Variables of shape (9101,) or (9101, 2): one time step per chunk, timedelta64[ns] or object dtype.


(^v^)  Looks like surface variables with static variables merged in.

## Combine datasets to single Zarr with groups

To create the `sfc` group, we read the `sfc.json` file and add `sfc/` to every key.

Repeat for the `h` dataset, and add a top level `.zgroup` entry.

Now we have a dict representation of a virtual Zarr dataset! Write that to a JSON file.

In [176]:
def combine_jsons_as_groups(codes):
    ZARR_GROUP_ENTRY = {".zgroup": '{"zarr_format":2}'}

    import ujson

    newrefs = {}
    for code in codes:
        # read in existing JSON references
        with open(f"{root}/{casename}/run/jsons/{code}.json", "rb") as f:
            d = ujson.loads(f.read())

        # Add a new group by renaming the keys
        newrefs.update({f"{code}/{k}": v for k, v in d["refs"].items()})

    # Add top-level .zgroup entry
    newrefs.update(ZARR_GROUP_ENTRY)

    # This is now the combined dataset
    combined = {"version": 1, "refs": newrefs}

    # write a new reference JSON file
    with open(f"{root}/{casename}/run/jsons/combined.json", "wb") as f:
        f.write(ujson.dumps(combined).encode())

In [177]:
combine_jsons_as_groups(codes=["sfc", "h"])
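The effect of the key renaming is easier to see on a small in-memory example. The sketch below applies the same prefixing to two toy reference dicts; the variable names `tos` and `thetao`, the file names, and the offsets are invented, and `prefix_refs` is a hypothetical helper.

```python
ZGROUP = '{"zarr_format":2}'


def prefix_refs(refs, group):
    """Return a copy of `refs` with every key moved under `group/`."""
    return {f"{group}/{key}": value for key, value in refs.items()}


# Toy stand-ins for the "refs" sections of sfc.json and h.json
sfc = {
    ".zgroup": ZGROUP,
    "tos/.zarray": "...",
    "tos/0.0.0": ["sfc_file.nc", 100, 40],
}
h = {
    ".zgroup": ZGROUP,
    "thetao/.zarray": "...",
    "thetao/0.0.0.0": ["h_file.nc", 200, 80],
}

# Start from the top-level group marker, then add one group per stream
combined = {".zgroup": ZGROUP}
combined.update(prefix_refs(sfc, "sfc"))
combined.update(prefix_refs(h, "h"))

for key in sorted(combined):
    print(key)
```

Note that each stream's own `.zgroup` entry becomes `sfc/.zgroup` or `h/.zgroup` after prefixing, which is what marks those paths as groups; only the top-level `.zgroup` has to be added by hand.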



## Reading the combined dataset

### Create the filesystem and mapper

In [28]:
fs = fsspec.filesystem(
    "reference",
    fo=f"{root}/{casename}/run/jsons/combined.json",
    skip_instance_cache=True,
)
mapper = fs.get_mapper(root="")

### Simple xarray.open_dataset

Specify the `group` kwarg to extract a single group.

In [12]:
xr.open_dataset(mapper, engine="zarr", group="sfc", use_cftime=True, consolidated=False)

### Using datatree

Open all groups in one go using [datatree](https://xarray-datatree.readthedocs.io/en/latest/).

In [29]:
import datatree

In [30]:
tree = datatree.open_datatree(mapper, engine="zarr", use_cftime=True, consolidated=False)
tree

In [31]:
tree["h"]

In [32]:
tree["sfc"]

## Next

1. We could even consider adding higher-level `lnd`, `atm`, `ocn` groups so that a single virtual dataset represents all output streams for all components from a single simulation.
2. In `intake-esm` terminology, a single JSON file representing an aggregate dataset could be a single asset.
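The first idea could be sketched by nesting the per-stream references one level deeper. This is an untested extension of the approach used above, not something done in this notebook: the helper `nest` is hypothetical, the toy references are invented, and the sketch writes a `.zgroup` entry at every level so that each level is explicitly marked as a group.

```python
ZGROUP = '{"zarr_format":2}'


def nest(refs, *groups):
    """Move `refs` under nested groups, adding a .zgroup marker per level."""
    prefix = "/".join(groups)
    nested = {f"{prefix}/{key}": value for key, value in refs.items()}
    # mark every level (e.g. "ocn" and "ocn/h") as a group
    for depth in range(1, len(groups) + 1):
        nested["/".join(groups[:depth]) + "/.zgroup"] = ZGROUP
    return nested


# Toy stand-in for the "refs" section of h.json
h = {
    ".zgroup": ZGROUP,
    "thetao/0.0.0.0": ["h_file.nc", 200, 80],
}

combined = {".zgroup": ZGROUP}
combined.update(nest(h, "ocn", "h"))

for key in sorted(combined):
    print(key)
```

Other components would be added the same way, for example with `nest(refs, "atm", "<stream>")`, where the stream name depends on that component's output.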