# NetCDF
Generating virtual datasets from NetCDF files


<img src="../images/ARG.png" width=350 alt="ARG"></img>

## Overview
   
Within this notebook, we will cover:

1. How to access remote NetCDF data using `VirtualiZarr` and `Kerchunk`
1. Combining multiple virtual datasets

This notebook shares many similarities with the [multi-file virtual datasets with VirtualiZarr](./02_kerchunk_multi_file.ipynb) notebook. If you are unsure what a block of code does, refer to that notebook for a more detailed breakdown of what each line is doing.


## Prerequisites
| Concepts | Importance | Notes |
| --- | --- | --- |
| [Basics of virtual Zarr stores](../foundations/01_kerchunk_basics.ipynb) | Required | Core |
| [Multi-file virtual datasets with VirtualiZarr](../foundations/02_kerchunk_multi_file.ipynb) | Required | Core |
| [Parallel virtual dataset creation with VirtualiZarr, Kerchunk, and Dask](../foundations/03_kerchunk_dask) | Required | Core |
| [Introduction to Xarray](https://foundations.projectpythia.org/core/xarray/xarray-intro.html) | Required | IO/Visualization |
- **Time to learn**: 45 minutes
---

## Motivation

NetCDF4/HDF5 is one of the most widely adopted file formats in the earth sciences, supported by much of the community as well as by scientific agencies, data centers, and university labs. A huge amount of legacy data has been generated in this format. Fortunately, using `VirtualiZarr` and `Kerchunk`, we can read these datasets as if they were stored in an Analysis-Ready Cloud-Optimized (ARCO) format such as `Zarr`.

## About the Dataset

For this example, we will look at a weather dataset composed of multiple NetCDF files. The SMN-Arg is a WRF deterministic weather forecasting dataset created by the `Servicio Meteorológico Nacional de Argentina` that covers Argentina as well as many neighboring countries at a 4 km spatial resolution.

The model is initialized twice daily, at 00 and 12 UTC, and produces hourly forecasts of variables such as temperature, relative humidity, precipitation, and wind direction and magnitude at multiple atmospheric levels.
The output is written to NetCDF files at hourly intervals, with a maximum forecast lead time of 72 hours.

More details on this dataset can be found [here](https://registry.opendata.aws/smn-ar-wrf-dataset/).


## Flags
In the section below, set `subset_flag` to `True` (default) or `False`, depending on whether you want this notebook to process a subset of the file list or the full list. If set to `True`, only a subset of the file list is processed (recommended).

In [1]:
subset_flag = True

## Imports

In [2]:
import logging

import dask
import fsspec
import s3fs
import xarray as xr
from distributed import Client
from virtualizarr import open_virtual_dataset

## Examining a Single NetCDF File

Before we use `VirtualiZarr` to create virtual datasets for multiple files, we can load a single NetCDF file to examine it. 



In [3]:
# URL pointing to a single NetCDF file
url = "s3://smn-ar-wrf/DATA/WRF/DET/2022/12/31/00/WRFDETAR_01H_20221231_00_072.nc"

# Initialize an anonymous (unauthenticated) S3 filesystem
fs = s3fs.S3FileSystem(anon=True)
# Use Xarray to open the remote NetCDF file through a file-like object
ds = xr.open_dataset(fs.open(url), engine="h5netcdf")

In [4]:
ds

Here we see the `repr` of the `Xarray` Dataset for a single `NetCDF` file. From the output, we can tell that the Dataset dimensions are `['time','y','x']`, with `time` containing only a single step.
Later, when we use `Xarray`'s `combine_nested` functionality, we will need to know which dimension to concatenate along.
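To make the role of the concatenation dimension concrete, here is a minimal sketch that does not touch the remote data. It builds a few small in-memory Datasets, each with a single `time` step, and joins them with `combine_nested`. The variable name `T2`, the grid size, and the timestamps are made up for illustration and are not taken from the SMN-Arg files.

```python
import numpy as np
import xarray as xr


def make_step(hour):
    """Build a toy Dataset with one time step (hypothetical variable 'T2')."""
    start = np.datetime64("2022-12-31T12:00", "ns")
    return xr.Dataset(
        {"T2": (("time", "y", "x"), np.full((1, 2, 3), float(hour)))},
        coords={"time": [start + np.timedelta64(hour, "h")]},
    )


# One Dataset per forecast hour, mimicking one NetCDF file per time step
steps = [make_step(hour) for hour in range(4)]

# Concatenate along the dimension that varies from file to file
combined = xr.combine_nested(
    steps, concat_dim="time", coords="minimal", compat="override"
)
print(dict(combined.sizes))
```

Because every input has a `time` dimension of length one, the combined Dataset has one `time` entry per input, while `y` and `x` keep their original sizes.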



## Create Input File List

Here we use `fsspec`'s glob functionality with the `*` wildcard operator to grab a list of NetCDF files from an `s3` `fsspec` filesystem. We then prepend the `s3://` prefix to each path and, optionally, slice the list to a smaller subset.

In [5]:
# Initiate fsspec filesystems for reading
fs_read = fsspec.filesystem("s3", anon=True, skip_instance_cache=True)

files_paths = fs_read.glob("s3://smn-ar-wrf/DATA/WRF/DET/2022/12/31/12/*")

# glob returns paths without the protocol, so we prepend the 's3://' prefix.
files_paths = sorted(["s3://" + f for f in files_paths])


# If the subset_flag == True (default), the list of input files will be subset
# to speed up the processing
if subset_flag:
    files_paths = files_paths[0:8]
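
The order of the file list matters, because the files are later concatenated along `time` in list order. The sketch below checks that a plain `sorted()` call yields chronological order. It assumes the file names follow the pattern of the single file opened earlier (`WRFDETAR_01H_<date>_<cycle>_<lead hour>.nc`). The paths are constructed by hand for illustration rather than listed from the bucket, and `parse_wrf_name` is a hypothetical helper.

```python
from datetime import datetime, timedelta


def parse_wrf_name(path):
    """Return (initialization time, lead hour) parsed from a file name.

    Assumes names like WRFDETAR_01H_20221231_12_005.nc.
    """
    stem = path.rsplit("/", 1)[-1].rsplit(".", 1)[0]
    _, _, date, cycle, lead = stem.split("_")
    init = datetime.strptime(date + cycle, "%Y%m%d%H")
    return init, int(lead)


# Hand-built example paths in a deliberately shuffled order
prefix = "s3://smn-ar-wrf/DATA/WRF/DET/2022/12/31/12/"
example_paths = [
    f"{prefix}WRFDETAR_01H_20221231_12_{lead:03d}.nc" for lead in (10, 2, 0, 72, 9)
]

# Zero-padded lead hours make lexicographic order match chronological order
ordered = sorted(example_paths)
leads = [parse_wrf_name(p)[1] for p in ordered]
valid_times = [
    init + timedelta(hours=lead) for init, lead in map(parse_wrf_name, ordered)
]
print(leads)
```

If the lead hours were not zero-padded, `"10"` would sort before `"2"`, and the list would need to be sorted on the parsed lead hour instead.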

## Start a Dask Client

To parallelize the creation of our reference files, we will use `Dask`. For a detailed guide on how to use Dask and Kerchunk, see the Foundations notebook: [Kerchunk and Dask](../foundations/kerchunk_dask).


In [7]:
client = Client(n_workers=8, silence_logs=logging.ERROR)
client


In [7]:
def generate_virtual_dataset(file, storage_options):
    return open_virtual_dataset(
        file, indexes={}, reader_options={"storage_options": storage_options}
    )


storage_options = dict(anon=True, default_fill_cache=False, default_cache_type="first")
# Generate Dask Delayed objects
tasks = [
    dask.delayed(generate_virtual_dataset)(file, storage_options)
    for file in files_paths
]

In [8]:
# Start parallel processing
import warnings

warnings.filterwarnings("ignore")
virtual_datasets = list(dask.compute(*tasks))
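
The delayed-then-compute pattern used above can be tried without a cluster or remote files. In this sketch, `describe_file` is a hypothetical stand-in for `generate_virtual_dataset` that performs no I/O, the file names are invented, and the local threaded scheduler is used instead of the distributed `Client`.

```python
import dask


def describe_file(path, storage_options):
    """Hypothetical stand-in for generate_virtual_dataset; does no I/O."""
    return {"path": path, "anon": storage_options["anon"]}


example_files = [f"file_{i}.nc" for i in range(4)]  # invented names
example_options = dict(anon=True)

# Building the list is lazy: nothing runs until dask.compute is called
example_tasks = [
    dask.delayed(describe_file)(path, example_options) for path in example_files
]

# dask.compute returns a tuple with one result per task, in task order
results = list(dask.compute(*example_tasks, scheduler="threads"))
print([r["path"] for r in results])
```

Results come back in the same order as the task list, which is why the virtual datasets can be passed straight to a concatenation step without re-sorting.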

## Combine virtual datasets and write a Kerchunk reference JSON to store the virtual Zarr store

In the following cells, we combine all the virtual datasets that were generated above into a single virtual dataset and write it to disk as a Kerchunk reference file.

In [9]:
combined_vds = xr.combine_nested(
    virtual_datasets, concat_dim=["time"], coords="minimal", compat="override"
)
combined_vds

In [11]:
combined_vds.virtualize.to_kerchunk("ARG_combined.json", format="json")
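
To get a feel for what a Kerchunk reference JSON contains, the sketch below inspects a tiny hand-written reference in the version 1 layout, where each chunk key maps to a `[url, offset, length]` triple pointing into an original file. The bucket name, variable name, offsets, and abbreviated metadata are invented for illustration and are not the contents of `ARG_combined.json`; `summarize_refs` is a hypothetical helper.

```python
import json
from collections import Counter

# Hand-written miniature reference (all values are made up)
example_reference = {
    "version": 1,
    "refs": {
        ".zgroup": json.dumps({"zarr_format": 2}),
        "T2/.zarray": json.dumps({"shape": [2, 4, 4], "chunks": [1, 4, 4]}),
        "T2/0.0.0": ["s3://example-bucket/step_000.nc", 1000, 64],
        "T2/1.0.0": ["s3://example-bucket/step_001.nc", 1000, 64],
    },
}


def summarize_refs(reference):
    """Count chunk references per source file and total referenced bytes."""
    per_file = Counter()
    total_bytes = 0
    for value in reference["refs"].values():
        # Chunk entries are [url, offset, length]; metadata entries are strings
        if isinstance(value, list) and len(value) == 3:
            url, _offset, length = value
            per_file[url] += 1
            total_bytes += length
    return per_file, total_bytes


# Round-trip through JSON text, as if the reference had been read from disk
loaded = json.loads(json.dumps(example_reference))
per_file, total_bytes = summarize_refs(loaded)
print(dict(per_file), total_bytes)
```

The reference stores only metadata and byte ranges, so it stays small even when the original NetCDF files are large.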

## Shut down the Dask cluster

In [12]:
client.shutdown()