<img src="../../thumbnail.png" width=500 alt="Kerchunk Logo"></img>

# Multi-File Datasets with Kerchunk

## Overview

This notebook builds on the [Kerchunk Basics notebook](./kerchunk_basics.ipynb).

In this tutorial we will:
- Create a list of input paths for a collection of NetCDF files stored on the cloud.
- Iterate through the list of input files and create a `Kerchunk` reference `.json` for each file.
- Combine the reference `.json` files into a single combined dataset reference with the `Kerchunk` class `MultiZarrToZarr`.
- Learn how to read the combined dataset using `Xarray` and `fsspec`.


## Prerequisites
| Concepts | Importance | Notes |
| --- | --- | --- |
| [Kerchunk Basics](./kerchunk_basics.ipynb) | Required | Basic features |
| [Introduction to Xarray](https://foundations.projectpythia.org/core/xarray/xarray-intro.html) | Recommended | IO |

- **Time to learn**: 60 minutes
---

## Imports


In [14]:
import fsspec
import ujson
import xarray as xr
from kerchunk.combine import MultiZarrToZarr
from kerchunk.hdf import SingleHdf5ToZarr
from tqdm import tqdm

### Create a File Pattern from a list of input NetCDF files

Below we will create a list of input files we want `Kerchunk` to index. In the [Kerchunk Basics Tutorial](./kerchunk_basics.ipynb), we looked at a single file of downscaled climate data over Southern Alaska. In this example, we will build on that work and use `Kerchunk` to combine multiple NetCDF files of this dataset into a single virtual dataset.

In [15]:
# Initiate fsspec filesystems for reading and writing
fs_read = fsspec.filesystem("s3", anon=True, skip_instance_cache=True)
fs_write = fsspec.filesystem("")

# Retrieve list of available days in archive for the year 2060.
files_paths = fs_read.glob("s3://wrf-se-ak-ar5/ccsm/rcp85/daily/2060/*")

# glob returns paths without the protocol, so we prepend the 's3://' prefix again.
file_pattern = sorted(["s3://" + f for f in files_paths])

In [16]:
print(f"{len(file_pattern)} file paths were retrieved.")

365 file paths were retrieved.


As a quick check, it looks like we have a list of 365 file paths, which should be a year of downscaled climate data.

### Optional: If you want to examine one of the NetCDF files before creating the `Kerchunk` index, try uncommenting the code snippet below.

In [17]:
# Note: Optional piece of code to view one of the NetCDF files using Xarray and s3fs.

# import s3fs

# fs = s3fs.S3FileSystem(anon=True, default_fill_cache=False)
# with fs.open(file_pattern[0]) as fileObj:
#     ds = xr.open_dataset(fileObj)
#     print(ds)
#     print(ds.nbytes / 1e9)

## Create Kerchunk References for every file in the `file_pattern` list

Now that we have a list of NetCDF files, we can use `Kerchunk` to create reference files for each one of these. To do this, we will iterate through each file and create a reference `.json`. To speed this process up, you could use `Dask` to parallelize this.

### Note: To speed up the next section, the cell below reduces the number of input files from 365 to 7, going from a year's worth of data to a week's worth. Comment it out to process the full year.


In [18]:
### OPTIONAL SPEEDUP: comment out the next line to process all 365 files ###

file_pattern = file_pattern[0:7]

In [19]:
# fsspec.open kwargs. For details, see the "Define kwargs for `fsspec`" section of the `kerchunk_basics` notebook.
import os

so = dict(mode="rb", anon=True, default_fill_cache=False, default_cache_type="first")
output_dir = "./"


# Use Kerchunk's `SingleHdf5ToZarr` class to create a `Kerchunk` index from a NetCDF file.
def generate_json_reference(u, output_dir: str):
    with fs_read.open(u, **so) as infile:
        h5chunks = SingleHdf5ToZarr(infile, u, inline_threshold=300)
        # Drop only the file extension. str.strip(".nc") removes any leading or
        # trailing '.', 'n' and 'c' characters, which can mangle some file names.
        fname = os.path.splitext(os.path.basename(u))[0]
        # Write the reference file into `output_dir`.
        outf = os.path.join(output_dir, f"{fname}.json")
        with open(outf, "wb") as f:
            f.write(ujson.dumps(h5chunks.translate()).encode())
        return outf


# Iterate through filelist to generate Kerchunked files. Good use for `Dask`
output_files = []
for fil in tqdm(file_pattern):
    outf = generate_json_reference(fil, output_dir)
    output_files.append(outf)

 43%|████▎     | 3/7 [14:48<19:47, 296.82s/it]
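
The loop above processes one file at a time. `Dask` is one way to parallelize it. As a dependency-free illustration of the same idea, the sketch below uses the standard library's `ThreadPoolExecutor` instead of `Dask`. The helper `map_in_threads` and the stand-in worker `fake_generate_json_reference` are hypothetical names introduced here, and the example paths are made up so that the sketch runs without touching S3.

```python
from concurrent.futures import ThreadPoolExecutor


def map_in_threads(func, items, max_workers=4):
    """Apply `func` to every item using a thread pool.

    Results are returned in the same order as `items`.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(func, items))


def fake_generate_json_reference(path):
    # Hypothetical stand-in for `generate_json_reference`: it only derives
    # the output file name and does not read from S3 or write to disk.
    return path.split("/")[-1].replace(".nc", ".json")


# Made-up paths for illustration only.
example_paths = [
    "s3://example-bucket/day_001.nc",
    "s3://example-bucket/day_002.nc",
    "s3://example-bucket/day_003.nc",
]
example_outputs = map_in_threads(fake_generate_json_reference, example_paths)
print(example_outputs)
```

In this notebook, a call along the lines of `map_in_threads(lambda u: generate_json_reference(u, output_dir), file_pattern)` could replace the loop. How much time this saves, if any, depends on the network and on how the underlying HDF5 reader behaves under threads, so treat it as something to measure rather than a guaranteed speedup.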

In [None]:
output_files

## Combine `.json` `Kerchunk` reference files and write a combined `Kerchunk` index

In [None]:
# combine individual references into single consolidated reference
mzz = MultiZarrToZarr(
    output_files,
    concat_dims=["Time"],
    identical_dims=["south_north", "west_east", "interp_levels", "soil_layers_stag"],
)


multi_kerchunk = mzz.translate()
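
With default settings, the object returned by `translate()` is a plain Python dictionary, so it can be inspected before it is written to disk. The sketch below assumes the version 1 reference layout, where a top-level `"refs"` mapping holds either inline strings (metadata and small chunks) or `[url, offset, length]` lists that point at byte ranges in the original files. The helper `summarize_refs` and the toy dictionary are hypothetical and exist only to illustrate that layout.

```python
def summarize_refs(reference_set):
    """Count inline entries and byte-range entries in a reference dict."""
    refs = reference_set["refs"]
    inline = 0
    remote = 0
    source_files = set()
    for value in refs.values():
        if isinstance(value, (list, tuple)):
            # Byte-range reference: [url, offset, length]
            remote += 1
            source_files.add(value[0])
        else:
            # Inline content stored directly in the reference set
            inline += 1
    return {
        "inline": inline,
        "remote": remote,
        "source_files": sorted(source_files),
    }


# Toy reference set with made-up URLs, offsets and lengths.
toy_refs = {
    "version": 1,
    "refs": {
        ".zgroup": '{"zarr_format": 2}',
        "SNOW/0.0.0": ["s3://example-bucket/day_001.nc", 1024, 2048],
        "SNOW/1.0.0": ["s3://example-bucket/day_002.nc", 1024, 2048],
    },
}
toy_summary = summarize_refs(toy_refs)
print(toy_summary)
```

Calling `summarize_refs(multi_kerchunk)` in this notebook would give a quick check that the combined reference set points at every input file.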

## Write combined kerchunk index for future use

In [None]:
# Write kerchunk .json record
output_fname = "combined_kerchunk.json"
with open(output_fname, "wb") as f:
    f.write(ujson.dumps(multi_kerchunk).encode())
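
Because the reference file is ordinary JSON, a later session can load it back without regenerating anything. The sketch below is a minimal round trip using the standard library's `json` module in place of `ujson`, with a toy dictionary and a temporary directory so that it runs on its own. The helper name `load_references` is hypothetical.

```python
import json
import os
import tempfile


def load_references(path):
    """Read a Kerchunk reference file from disk and return it as a dict."""
    with open(path, "rb") as f:
        return json.load(f)


# Toy reference set standing in for the combined references.
toy_refs = {"version": 1, "refs": {".zgroup": '{"zarr_format": 2}'}}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "combined_kerchunk.json")
    # Write the references the same way as the cell above.
    with open(path, "wb") as f:
        f.write(json.dumps(toy_refs).encode())
    reloaded = load_references(path)

print(reloaded == toy_refs)
```

The reloaded dictionary can then be passed as the `fo` argument when creating the `reference` filesystem with `fsspec`, in the same way `multi_kerchunk` is used in the next section.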

## Open combined `Kerchunk` dataset with `fsspec` and `Xarray`

In [None]:
# open dataset as zarr object using fsspec reference file system and Xarray
fs = fsspec.filesystem(
    "reference", fo=multi_kerchunk, remote_protocol="s3", remote_options={"anon": True}
)
m = fs.get_mapper("")
ds = xr.open_dataset(m, engine="zarr", backend_kwargs=dict(consolidated=False))
ds

## Plot a slice of the dataset

In [None]:
ds.isel(Time=0).SNOW.plot()