# Virtualizarr Useful Recipes with NASA Earthdata

#### *Author: Dean Henze, PO.DAAC*

This notebook tests the creation of virtual reference files using `kerchunk` vs `virtualizarr`. Currently, we're seeing that opening a dataset with a reference file made with `virtualizarr` takes much longer than with a reference file made with `kerchunk`.

## Import Packages
#### ***Note Zarr Version***
***Zarr version 2 is needed for the current implementation of this notebook because, as of February 2025, Zarr version 3 does not accept `FSMap` objects.***

We ran this notebook in a Python 3.12 environment. The minimal working install we used to run this notebook from a clean environment was:
```
pip install zarr==2.18.4 fastparquet==2024.5.0 xarray==2025.1.2 earthaccess==0.11.0 fsspec==2024.10.0 "dask[complete]"==2024.5.2 h5netcdf==1.3.0 ujson==5.10.0 matplotlib==3.9.2 jupyterlab jupyter-server-proxy virtualizarr==1.3.0
```
And optionally:
```
pip install coiled==1.58.0
```
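Because the notebook depends on Zarr 2.x, it can help to confirm the installed version before running anything else. The snippet below is a minimal sketch using only the standard library; `major_version` and `zarr_is_v2` are hypothetical helper names introduced here for illustration.

```python
from importlib.metadata import PackageNotFoundError, version


def major_version(version_string):
    """Return the leading integer of a version string such as '2.18.4'."""
    return int(version_string.split(".")[0])


def zarr_is_v2():
    """Return True/False for the installed zarr, or None if zarr is not installed."""
    try:
        return major_version(version("zarr")) == 2
    except PackageNotFoundError:
        return None


status = zarr_is_v2()
if status is None:
    print("zarr is not installed in this environment")
elif status:
    print("zarr 2.x found")
else:
    print("zarr is not 2.x; this notebook expects zarr 2.x")
```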

In [1]:
# Built-in packages
import os
import sys
import json

# Filesystem management 
import fsspec
import earthaccess

# Data analysis
import xarray as xr
from virtualizarr import open_virtual_dataset

# Parallel computing 
import multiprocessing
from dask import delayed
import dask.array as da
from dask.distributed import Client

# Other
import ujson
import matplotlib.pyplot as plt

In [2]:
import coiled

## Other Setup

In [3]:
xr.set_options( # display options for xarray objects
    display_expand_attrs=False,
    display_expand_coords=True,
    display_expand_data=True,
)

<xarray.core.options.set_options at 0x7fea96d30980>

## 1. Get Data File S3 endpoints in Earthdata Cloud 
The first step is to find the S3 endpoints of the files. Handling Earthdata access credentials and then finding the endpoints can be done in a number of ways (e.g. using the `requests` or `s3fs` packages), but we use the `earthaccess` package for its ease of use. We get the endpoints for all files in the OSTIA record (`OSTIA-UKMO-L4-GLOB-REP-v2.0`).

In [4]:
# Get Earthdata creds
earthaccess.login()

Enter your Earthdata Login username:  deanh808
Enter your Earthdata password:  ········


<earthaccess.auth.Auth at 0x7feaec06c590>

In [5]:
# Get AWS creds. Note that if you spend more than 1 hour in the notebook, you may have to re-run this line!!!
fs = earthaccess.get_s3_filesystem(daac="PODAAC")
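
The comment above notes that the AWS credentials may need to be refreshed after about an hour. One way to avoid re-running the cell by hand is to wrap the call in a small cache that rebuilds the object once it is older than a chosen age. This is only a sketch: `ExpiringResource` is a hypothetical helper, not part of `earthaccess`, and the 55-minute default is an assumption chosen to stay inside the one-hour window.

```python
import time


class ExpiringResource:
    """Cache an object and rebuild it once it is older than max_age_s seconds."""

    def __init__(self, factory, max_age_s=55 * 60, clock=time.monotonic):
        self._factory = factory      # zero-argument callable that builds the object
        self._max_age_s = max_age_s  # assumed lifetime; not taken from earthaccess
        self._clock = clock          # injectable clock, useful for testing
        self._value = None
        self._created = None

    def get(self):
        """Return the cached object, rebuilding it first if it has expired."""
        now = self._clock()
        if self._created is None or now - self._created >= self._max_age_s:
            self._value = self._factory()
            self._created = now
        return self._value


# Possible usage in this notebook (not executed here):
# fs_cache = ExpiringResource(lambda: earthaccess.get_s3_filesystem(daac="PODAAC"))
# fs = fs_cache.get()
```

Any dictionary built from `fs.storage_options` before a refresh would still hold the old credentials, so it would need to be rebuilt as well.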

In [6]:
# Locate OSTIA file information / metadata:
granule_info = earthaccess.search_data(
    short_name="OSTIA-UKMO-L4-GLOB-REP-v2.0",
    )

In [7]:
# Get S3 endpoints for all files:
data_s3links = [g.data_links(access="direct")[0] for g in granule_info]
data_s3links[0:3]

['s3://podaac-ops-cumulus-protected/OSTIA-UKMO-L4-GLOB-REP-v2.0/1982/001/19820101120000-UKMO-L4_GHRSST-SSTfnd-OSTIA-GLOB_REP-v02.0-fv02.0.nc',
 's3://podaac-ops-cumulus-protected/OSTIA-UKMO-L4-GLOB-REP-v2.0/1982/002/19820102120000-UKMO-L4_GHRSST-SSTfnd-OSTIA-GLOB_REP-v02.0-fv02.0.nc',
 's3://podaac-ops-cumulus-protected/OSTIA-UKMO-L4-GLOB-REP-v2.0/1982/003/19820103120000-UKMO-L4_GHRSST-SSTfnd-OSTIA-GLOB_REP-v02.0-fv02.0.nc']
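
When testing, it is often convenient to work with a subset of the record rather than all files. The file names shown above begin with a 14-digit timestamp, so one option is to filter the link list by date. The sketch below assumes every file name in the collection follows that pattern (only the three names above were checked); `granule_time` and `links_in_range` are hypothetical helpers.

```python
from datetime import datetime


def granule_time(link):
    """Parse the leading 14-digit timestamp (YYYYMMDDHHMMSS) of a granule file name."""
    basename = link.rsplit("/", 1)[-1]
    return datetime.strptime(basename[:14], "%Y%m%d%H%M%S")


def links_in_range(links, start, end):
    """Keep links whose file-name timestamp falls in the half-open interval [start, end)."""
    return [link for link in links if start <= granule_time(link) < end]


# The three links printed above, used here as example input.
prefix = "s3://podaac-ops-cumulus-protected/OSTIA-UKMO-L4-GLOB-REP-v2.0/1982/"
suffix = "120000-UKMO-L4_GHRSST-SSTfnd-OSTIA-GLOB_REP-v02.0-fv02.0.nc"
example_links = [
    prefix + "001/19820101" + suffix,
    prefix + "002/19820102" + suffix,
    prefix + "003/19820103" + suffix,
]

subset = links_in_range(example_links, datetime(1982, 1, 2), datetime(1982, 1, 4))
print(len(subset))
```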

In [8]:
# Get HTTPS endpoints for all files:
data_httpslinks = [g.data_links(access="external")[0] for g in granule_info]
data_httpslinks[0:3]

['https://archive.podaac.earthdata.nasa.gov/podaac-ops-cumulus-protected/OSTIA-UKMO-L4-GLOB-REP-v2.0/1982/001/19820101120000-UKMO-L4_GHRSST-SSTfnd-OSTIA-GLOB_REP-v02.0-fv02.0.nc',
 'https://archive.podaac.earthdata.nasa.gov/podaac-ops-cumulus-protected/OSTIA-UKMO-L4-GLOB-REP-v2.0/1982/002/19820102120000-UKMO-L4_GHRSST-SSTfnd-OSTIA-GLOB_REP-v02.0-fv02.0.nc',
 'https://archive.podaac.earthdata.nasa.gov/podaac-ops-cumulus-protected/OSTIA-UKMO-L4-GLOB-REP-v2.0/1982/003/19820103120000-UKMO-L4_GHRSST-SSTfnd-OSTIA-GLOB_REP-v02.0-fv02.0.nc']

## 2. Generate a reference file for the entire record using virtualizarr

In [9]:
reader_opts = {"storage_options": fs.storage_options} # S3 filesystem creds from previous section.

In [10]:
# Open data using the reference file, using a small wrapper function around xarray's open_dataset. 
# This will shorten code blocks in other sections. 
def opendf_withref(ref, fs_data):
    """
    "ref" is a reference file or object. "fs_data" is a filesystem with credentials to
    access the actual data files. 
    """
    storage_opts = {"fo": ref, "remote_protocol": "s3", "remote_options": fs_data.storage_options}
    fs_ref = fsspec.filesystem('reference', **storage_opts)
    m = fs_ref.get_mapper('')
    data = xr.open_dataset(
        m, engine="zarr", chunks={},
        backend_kwargs={"consolidated": False}
    )
    return data
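
For orientation, a reference file in the kerchunk layout is essentially a mapping from Zarr keys to either inline metadata or a `[url, offset, length]` byte range in the original data file. The sketch below builds a tiny hand-written example to show how a chunk key resolves to a byte range. The bucket name, offsets, and lengths are made up for illustration, and `resolve_chunk` is a hypothetical helper rather than a function from `fsspec` or `kerchunk`.

```python
import json

# A hand-written, hypothetical reference set in the kerchunk version-1 layout.
example_refs = {
    "version": 1,
    "refs": {
        ".zgroup": json.dumps({"zarr_format": 2}),  # inline metadata stored as a string
        "analysed_sst/0.0.0": ["s3://example-bucket/file.nc", 40000, 1200000],
        "analysed_sst/0.0.1": ["s3://example-bucket/file.nc", 1240000, 1180000],
    },
}


def resolve_chunk(refs, key):
    """Return (url, offset, length) for a chunk key, or None for inline entries."""
    entry = refs["refs"][key]
    if isinstance(entry, str):
        return None  # metadata or inlined data stored directly in the reference
    if len(entry) == 1:
        return entry[0], None, None  # the reference points at a whole file
    url, offset, length = entry
    return url, offset, length


print(resolve_chunk(example_refs, "analysed_sst/0.0.0"))
```

This is why `opendf_withref()` needs both the reference (`fo`) and credentials for the remote data (`remote_options`): the reference only records where the bytes live.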

### Parallelize using distributed cluster with Coiled

In [11]:
%%time

## --------------------------------------------
## Create single reference files with parallel computing using Coiled
## --------------------------------------------

# Wrap `open_virtual_dataset()` into a Coiled function and copy it to multiple VMs:
open_vds_par = coiled.function(
    region="us-west-2", spot_policy="on-demand", 
    vm_type="m6i.large", n_workers=30
    )(open_virtual_dataset)

# Begin computations:
results = open_vds_par.map(data_s3links[:], indexes={}, reader_options=reader_opts)

virtual_ds_list = []
for r in results:
    virtual_ds_list.append(r)

CPU times: user 10.6 s, sys: 801 ms, total: 11.4 s
Wall time: 8min 1s


In [12]:
open_vds_par.cluster.shutdown()
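
Coiled is listed as an optional install, so a local fallback may be useful for small tests. The sketch below maps a function over a list with a standard-library thread pool and keeps the results in input order. `map_in_threads` is a hypothetical helper, and whether `open_virtual_dataset()` runs safely and efficiently in threads is an assumption that was not tested in this notebook.

```python
from concurrent.futures import ThreadPoolExecutor


def map_in_threads(func, items, max_workers=8, **kwargs):
    """Apply func(item, **kwargs) to every item concurrently, preserving input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(func, item, **kwargs) for item in items]
        # Collect in submission order; an exception in any task is re-raised here.
        return [future.result() for future in futures]


# Possible usage on a handful of files (not executed here):
# virtual_ds_list = map_in_threads(
#     open_virtual_dataset, data_s3links[:10], indexes={}, reader_options=reader_opts
# )
```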

Using the individual references to create the combined reference is fast and does not require parallel computing.

In [13]:
%%time
# Combine the individual references along the time dimension:
virtual_ds_combined = xr.combine_nested(virtual_ds_list, concat_dim='time', coords='minimal', compat='override', combine_attrs='drop_conflicts')

CPU times: user 8 s, sys: 124 ms, total: 8.12 s
Wall time: 8.12 s
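
To see why combining is cheap, it can help to think of it as bookkeeping on chunk keys rather than on data. The sketch below is a conceptual simplification, not the actual implementation in `virtualizarr` or `kerchunk`: it assumes each file holds a single time step (chunk index 0 on the first axis) and only re-indexes chunk keys, whereas the real tools also update array shapes and coordinate variables. `concat_chunk_refs` and the example entries are hypothetical.

```python
def concat_chunk_refs(per_file_refs, variable):
    """
    Merge per-file chunk references for one variable along the first (time) axis.
    Each input maps keys like 'analysed_sst/0.i.j' to [url, offset, length]; the
    first chunk index is replaced by the position of the file in the list.
    """
    prefix = variable + "/"
    combined = {}
    for t, refs in enumerate(per_file_refs):
        for key, entry in refs.items():
            if not key.startswith(prefix):
                continue
            chunk_index = key[len(prefix):].split(".")
            if not chunk_index[0].isdigit():
                continue  # skip metadata keys such as '.zarray'
            chunk_index[0] = str(t)
            combined[prefix + ".".join(chunk_index)] = entry
    return combined


# Made-up references for two single-time-step files.
file0 = {
    "analysed_sst/.zarray": "{}",
    "analysed_sst/0.0.0": ["a.nc", 0, 10],
    "analysed_sst/0.0.1": ["a.nc", 10, 10],
}
file1 = {"analysed_sst/0.0.0": ["b.nc", 0, 12]}

print(concat_chunk_refs([file0, file1], "analysed_sst"))
```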


In [15]:
# Save in both JSON and Parquet formats:
fname_combined_json = 'OSTIA-UKMO-L4-GLOB-REP-v2.0_reffile_virtualizarr.json'
fname_combined_parq = 'OSTIA-UKMO-L4-GLOB-REP-v2.0_reffile_virtualizarr.parq'
virtual_ds_combined.virtualize.to_kerchunk(fname_combined_json, format='json')
virtual_ds_combined.virtualize.to_kerchunk(fname_combined_parq, format='parquet')

In [16]:
%%time
# Test lazy loading of the combined reference file (JSON):
data_json = opendf_withref(fname_combined_json, fs)
print(data_json)

<xarray.Dataset> Size: 11TB
Dimensions:           (time: 15340, lat: 3600, lon: 7200)
Coordinates:
  * lat               (lat) float32 14kB -89.97 -89.93 -89.88 ... 89.93 89.97
  * lon               (lon) float32 29kB -180.0 -179.9 -179.9 ... 179.9 180.0
  * time              (time) datetime64[ns] 123kB 1982-01-01T12:00:00 ... 202...
Data variables:
    analysed_sst      (time, lat, lon) float64 3TB dask.array<chunksize=(1, 1200, 2400), meta=np.ndarray>
    analysis_error    (time, lat, lon) float64 3TB dask.array<chunksize=(1, 1200, 2400), meta=np.ndarray>
    mask              (time, lat, lon) float32 2TB dask.array<chunksize=(1, 1800, 3600), meta=np.ndarray>
    sea_ice_fraction  (time, lat, lon) float64 3TB dask.array<chunksize=(1, 1800, 3600), meta=np.ndarray>
Attributes: (39)
CPU times: user 54.4 s, sys: 814 ms, total: 55.2 s
Wall time: 4min 36s


In [17]:
%%time
# Test lazy loading of the combined reference file (Parquet):
data_parq = opendf_withref(fname_combined_parq, fs)
print(data_parq)

<xarray.Dataset> Size: 11TB
Dimensions:           (time: 15340, lat: 3600, lon: 7200)
Coordinates:
  * lat               (lat) float32 14kB -89.97 -89.93 -89.88 ... 89.93 89.97
  * lon               (lon) float32 29kB -180.0 -179.9 -179.9 ... 179.9 180.0
  * time              (time) datetime64[ns] 123kB 1982-01-01T12:00:00 ... 202...
Data variables:
    analysed_sst      (time, lat, lon) float64 3TB dask.array<chunksize=(1, 1200, 2400), meta=np.ndarray>
    analysis_error    (time, lat, lon) float64 3TB dask.array<chunksize=(1, 1200, 2400), meta=np.ndarray>
    mask              (time, lat, lon) float32 2TB dask.array<chunksize=(1, 1800, 3600), meta=np.ndarray>
    sea_ice_fraction  (time, lat, lon) float64 3TB dask.array<chunksize=(1, 1800, 3600), meta=np.ndarray>
Attributes: (39)
CPU times: user 52.5 s, sys: 621 ms, total: 53.1 s
Wall time: 4min 25s
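
Since the point of this notebook is to compare open times across reference files, it may be handy to collect wall times in one place instead of reading them from separate `%%time` outputs. The following is a minimal sketch; `timed` and `timings` are hypothetical names, and the usage lines in the comments are not executed here.

```python
import time
from contextlib import contextmanager

timings = {}  # label -> elapsed wall time in seconds


@contextmanager
def timed(label, store=timings):
    """Record the wall time of the enclosed block under the given label."""
    start = time.perf_counter()
    try:
        yield
    finally:
        store[label] = time.perf_counter() - start


# Possible usage in this notebook:
# with timed("virtualizarr_json"):
#     data_json = opendf_withref(fname_combined_json, fs)
# with timed("virtualizarr_parquet"):
#     data_parq = opendf_withref(fname_combined_parq, fs)

with timed("example_sleep"):
    time.sleep(0.01)

for label, seconds in timings.items():
    print(f"{label}: {seconds:.2f} s")
```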


## 3. Generate a reference file for the entire record using kerchunk

In [18]:
from kerchunk.df import refs_to_dataframe
from kerchunk.hdf import SingleHdf5ToZarr
from kerchunk.combine import MultiZarrToZarr

In [19]:
fobjs = earthaccess.open(granule_info)

QUEUEING TASKS | :   0%|          | 0/15340 [00:00<?, ?it/s]

PROCESSING TASKS | :   0%|          | 0/15340 [00:00<?, ?it/s]

COLLECTING RESULTS | :   0%|          | 0/15340 [00:00<?, ?it/s]

In [20]:
def single_ref_earthaccess(fobj, dir_save=None):
    """
    Create a reference for a single data file. "fobj" (earthaccess.store.EarthAccessFile 
    object) is the output from earthaccess.open(), which also has the file endpoint. 
    Option to save as a JSON to directory "dir_save", with the file name of the corresponding 
    data file with ".json" appended. Otherwise the reference info is returned.
    """
    endpoint = fobj.full_name
    reference = SingleHdf5ToZarr(fobj, endpoint, inline_threshold=0).translate()
    
    if dir_save is not None:
        with open(dir_save + endpoint.split('/')[-1]+'.json', 'w') as outf:
            outf.write(ujson.dumps(reference))
    else:
        return reference, endpoint # returns both the kerchunk reference and the path to the file on podaac-ops-cumulus-protected
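
Note that `single_ref_earthaccess()` builds the output path with plain string concatenation, so `dir_save` must end with a path separator for the file to land inside that directory. A variant that works with or without the trailing slash could look like the sketch below; `reference_json_path` is a hypothetical helper, and the example endpoint is made up.

```python
import os


def reference_json_path(endpoint, dir_save):
    """Build '<dir_save>/<data file name>.json', with or without a trailing slash on dir_save."""
    return os.path.join(dir_save, os.path.basename(endpoint) + ".json")


print(reference_json_path("s3://example-bucket/some/dir/file.nc", "refs/"))
```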

In [22]:
%%time

## --------------------------------------------
## Create single reference files with parallel computing using Coiled
## --------------------------------------------

# Wrap `single_ref_earthaccess` into a Coiled function:
single_ref_earthaccess_par = coiled.function(
    region="us-west-2", spot_policy="on-demand", 
    vm_type="m6i.large", n_workers=16
    )(single_ref_earthaccess)

# Begin computations (first 365 files only):
results = single_ref_earthaccess_par.map(fobjs[:365])

# Append results as they become available:
virtual_ref_list = []
for r, ep in results:
    virtual_ref_list.append(r)

CPU times: user 2.96 s, sys: 116 ms, total: 3.08 s
Wall time: 1min 54s


In [None]:
single_ref_earthaccess_par.cluster.shutdown()

In [25]:
%%time

## --------------------------------------------
## Create combined reference file
## --------------------------------------------
kwargs_mzz = {'remote_protocol':"s3", 'remote_options':fs.storage_options, 'concat_dims':["time"]}
mzz = MultiZarrToZarr(virtual_ref_list, **kwargs_mzz)
ref_combined = mzz.translate()

refs_to_dataframe(ref_combined, "OSTIA-UKMO-L4-GLOB-REP-v2.0_reffile_kerchunk.parq")

CPU times: user 2.54 s, sys: 190 ms, total: 2.73 s
Wall time: 1min 35s
