# Virtualizarr Useful Recipes with NASA Earthdata

#### *Author: Dean Henze, PO.DAAC*

*Reference herein to any specific commercial product, process, or service by trade name, trademark, manufacturer, or otherwise, does not constitute or imply its endorsement by the United States Government or the Jet Propulsion Laboratory, California Institute of Technology.*

## Summary

This notebook goes through several functionalities of the virtualizarr package for creating virtual reference files, specifically using it with NASA Earthdata and the `earthaccess` package. It is meant to be a quick-start reference that introduces some key capabilities and characteristics of the package once a user has a high-level understanding of virtual data sets and the cloud-computing challenges they address (see references in the *Prerequisite knowledge* section below). In short, virtualizarr is a Python package to create "reference files", which can be thought of as road maps for the computer to efficiently navigate through large arrays in an Earthdata file (or any file). Once a reference file for a data set is created, it can be used to, e.g., lazy load data faster, access subsets of the data more quickly (spatially, temporally, or along any other dimension in the data set), and in some cases perform computations faster.

The functionalities of virtualizarr covered in this notebook are:

1. **Getting Data File endpoints in Earthdata Cloud**
2. **Generating reference files for 1 day, 1 year, and the entire record of a ~750 GB data set**. The data set, available on PO.DAAC, is the Level 4 global gridded 6-hourly wind product from the Cross-Calibrated Multi-Platform project (https://doi.org/10.5067/CCMP-6HW10M-L4V31). This section also covers speeding up the reference file creations using parallel computing. Reference files are saved in both JSON and PARQUET formats. The latter is an important format as it reduces the reference file size by ~30x in our tests.
3. **Combining reference files**. The ability to combine reference files together rather than having to create the combined product from scratch is important since it can save computing resources/time. This notebook explores (3.1) adding an extra day of the CCMP record to the year-long reference file created in Section 2, and (3.2) combining 2 year-long reference files together.
4. **Using the reference file to perform a basic analysis on the CCMP data set with a parallel computing cluster.** Parallel computing on both a local and distributed cluster are tested. For the local cluster, we are able to run all computations successfully. For the distributed cluster, we are only able to run computations if the reference file is first loaded fully into memory.

## Requirements, prerequisite knowledge, learning outcomes

#### Requirements to run this notebook

* Earthdata login account: An Earthdata Login account is required to access data from the NASA Earthdata system. Please visit https://urs.earthdata.nasa.gov to register and manage your Earthdata Login account.

* Compute environment: This notebook is meant to be run in the cloud (AWS instance running in us-west-2). We used an `m6i.4xlarge` EC2 instance (16 CPU's, 64 GiB memory) to complete Section 4 on parallel computing, although this is likely overkill for the other sections. At minimum we recommend a VM with 10 CPU's to make the parallel computations in Section 2.2.1 faster.

* Optional Coiled account: To run the optional sections on distributed clusters, create a Coiled account (free to sign up) and connect it to an AWS account. For more information on Coiled, setting up an account, and connecting it to an AWS account, see their website [https://www.coiled.io](https://www.coiled.io). 

#### Prerequisite knowledge

* This notebook covers virtualizarr functionality but does not present the high-level ideas behind it. For an understanding of reference files and how they are meant to enhance in-cloud access to file formats that are not cloud optimized (such as netCDF, HDF), please see e.g. [this page on virtualizarr](https://virtualizarr.readthedocs.io/en/latest/), or this [kerchunk page](https://fsspec.github.io/kerchunk/) (a package with similar functionality).

* Familiarity with the `earthaccess` and `Xarray` packages. Familiarity with directly accessing NASA Earthdata in the cloud. 

* The Cookbook notebook on [Dask basics](https://podaac.github.io/tutorials/notebooks/Advanced_cloud/basic_dask.html) is handy for those new to parallel computing and wanting to implement it in Sections 2.2.1 and 3.2.

#### Learning Outcomes

This notebook demonstrates several recipes for key virtualizarr functionalities with NASA Earthdata. It is meant to be used after the user has a high-level understanding of virtualizarr and the challenges it is trying to solve, at which point this notebook: 

* Demonstrates how to implement the package,
* Highlights several characteristics of the package which will likely be of interest for utilizing it with Earthdata in common workflows. 

## Import Packages
#### ***Note Zarr Version***
***Zarr version 2 is needed for the current implementation of this notebook, due to (as of February 2025) Zarr version 3 not accepting `FSMap` objects.***

We ran this notebook in a Python 3.12 environment. The minimal working install we used to run this notebook from a clean environment was:
```
pip install zarr==2.18.4 fastparquet==2024.5.0 xarray==2024.1.0 earthaccess==0.11.0 fsspec==2024.10.0 "dask[complete]"==2024.5.2 h5netcdf==1.3.0 ujson==5.10.0 matplotlib==3.9.2 jupyterlab jupyter-server-proxy virtualizarr==1.3.0
```
And optionally:
```
pip install coiled==1.58.0
```
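
Because the Zarr major version matters for this notebook, it can help to check it before running anything else. The snippet below is a minimal sketch using only the standard library; the helper names `major_version` and `zarr_is_v2` are hypothetical and not part of any package used here.

```python
from importlib.metadata import PackageNotFoundError, version

def major_version(version_string):
    """Return the leading integer of a version string such as '2.18.4'."""
    return int(version_string.split(".")[0])

def zarr_is_v2():
    """True if an installed zarr reports major version 2, False otherwise."""
    try:
        return major_version(version("zarr")) == 2
    except PackageNotFoundError:
        # zarr is not installed in this environment
        return False

print("zarr v2 installed:", zarr_is_v2())
```

If this prints `False`, the install command above should bring the environment in line with the one we tested.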

In [1]:
# Built-in packages
import os
import sys
import json

# Filesystem management 
import fsspec
import earthaccess

# Data analysis
import xarray as xr
from virtualizarr import open_virtual_dataset

# Parallel computing 
import multiprocessing
from dask import delayed
import dask.array as da
from dask.distributed import Client

# Other
import ujson
import matplotlib.pyplot as plt

In [2]:
import coiled

## Other Setup

In [3]:
xr.set_options( # display options for xarray objects
    display_expand_attrs=False,
    display_expand_coords=True,
    display_expand_data=True,
)

<xarray.core.options.set_options at 0x7f4516ec0380>

## 1. Get Data File S3 endpoints in Earthdata Cloud 
The first step is to find the S3 endpoints to the files. Handling access credentials to Earthdata and then finding the endpoints can be done in a number of ways (e.g. using the `requests` and `s3fs` packages), but we use the `earthaccess` package for its ease of use. We get the endpoints for all files in the CCMP record.

In [34]:
# Get Earthdata creds
earthaccess.login()

<earthaccess.auth.Auth at 0x7f45c8db2690>

In [5]:
# Get AWS creds. Note that if you spend more than 1 hour in the notebook, you may have to re-run this line!!!
fs = earthaccess.get_s3_filesystem(daac="PODAAC")
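
Since the AWS credentials may need refreshing after about an hour, one option is to wrap the filesystem creation in a small helper that rebuilds it once it is older than a chosen age. The sketch below is illustrative only: the class `RefreshingHandle` is hypothetical, the 55-minute default is an assumed safety margin, and a stand-in factory is used so the example runs without Earthdata access.

```python
import time

class RefreshingHandle:
    """Cache an object and rebuild it once it is older than max_age_s seconds."""

    def __init__(self, factory, max_age_s=55 * 60, clock=time.monotonic):
        self._factory = factory      # callable that returns a fresh object
        self._max_age_s = max_age_s  # assumed margin under the 1-hour limit
        self._clock = clock          # injectable clock, handy for testing
        self._value = None
        self._created = None

    def get(self):
        now = self._clock()
        if self._created is None or now - self._created >= self._max_age_s:
            self._value = self._factory()
            self._created = now
        return self._value

# Stand-in factory for demonstration. In this notebook it would be something like
# lambda: earthaccess.get_s3_filesystem(daac="PODAAC")
calls = []
handle = RefreshingHandle(lambda: calls.append(1) or len(calls))
handle.get()
handle.get()  # still fresh, so the factory is not called again
print("factory calls:", len(calls))
```

Code that needs the filesystem would then call `handle.get()` each time instead of holding on to a single `fs` object.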

In [6]:
# Locate CCMP file information / metadata:
granule_info = earthaccess.search_data(
    short_name="CCMP_WINDS_10M6HR_L4_V3.1",
    )

In [7]:
# Get S3 endpoints for all files:
data_s3links = [g.data_links(access="direct")[0] for g in granule_info]
data_s3links[0:3]

['s3://podaac-ops-cumulus-protected/CCMP_WINDS_10M6HR_L4_V3.1/CCMP_Wind_Analysis_19930102_V03.1_L4.nc',
 's3://podaac-ops-cumulus-protected/CCMP_WINDS_10M6HR_L4_V3.1/CCMP_Wind_Analysis_19930103_V03.1_L4.nc',
 's3://podaac-ops-cumulus-protected/CCMP_WINDS_10M6HR_L4_V3.1/CCMP_Wind_Analysis_19930105_V03.1_L4.nc']

In [8]:
# Get HTTPS endpoints for all files:
data_httpslinks = [g.data_links(access="external")[0] for g in granule_info]
data_httpslinks[0:3]

['https://archive.podaac.earthdata.nasa.gov/podaac-ops-cumulus-protected/CCMP_WINDS_10M6HR_L4_V3.1/CCMP_Wind_Analysis_19930102_V03.1_L4.nc',
 'https://archive.podaac.earthdata.nasa.gov/podaac-ops-cumulus-protected/CCMP_WINDS_10M6HR_L4_V3.1/CCMP_Wind_Analysis_19930103_V03.1_L4.nc',
 'https://archive.podaac.earthdata.nasa.gov/podaac-ops-cumulus-protected/CCMP_WINDS_10M6HR_L4_V3.1/CCMP_Wind_Analysis_19930105_V03.1_L4.nc']
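
Note that the first three links shown above are not consecutive days (19930104 is not among them), so selecting files by list position may not correspond exactly to a calendar period. As a hedged alternative, the sketch below selects links by the date embedded in the file name. The regular expression assumes the `CCMP_Wind_Analysis_YYYYMMDD_` naming seen above, and the helper names are hypothetical.

```python
import re
from datetime import date, datetime

# Assumes the file naming pattern seen in the listing above.
_DATE_RE = re.compile(r"CCMP_Wind_Analysis_(\d{8})_")

def link_date(link):
    """Parse the YYYYMMDD date embedded in a CCMP file name."""
    match = _DATE_RE.search(link)
    if match is None:
        raise ValueError(f"No date found in {link!r}")
    return datetime.strptime(match.group(1), "%Y%m%d").date()

def links_between(links, start, end):
    """Keep links whose file date falls in [start, end], inclusive."""
    return [link for link in links if start <= link_date(link) <= end]

prefix = "s3://podaac-ops-cumulus-protected/CCMP_WINDS_10M6HR_L4_V3.1/"
sample_links = [
    prefix + "CCMP_Wind_Analysis_19930102_V03.1_L4.nc",
    prefix + "CCMP_Wind_Analysis_19930103_V03.1_L4.nc",
    prefix + "CCMP_Wind_Analysis_19930105_V03.1_L4.nc",
]
print(links_between(sample_links, date(1993, 1, 3), date(1993, 1, 5)))
```

The same function could be applied to `data_s3links` to pick out, for example, all files dated within 1993.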

## 2. Generate reference files for 1 day, 1 year, and entire record

### 2.1 First day
The virtualizarr function to generate reference information is pretty compact. We use it on one file for demonstration:

In [9]:
%%time
reader_opts = {"storage_options": fs.storage_options} # S3 filesystem creds from previous section.
virtual_ds_example = open_virtual_dataset(data_s3links[0], indexes={}, reader_options=reader_opts)
print(virtual_ds_example)

<xarray.Dataset> Size: 66MB
Dimensions:    (time: 4, latitude: 720, longitude: 1440)
Coordinates:
    longitude  (longitude) float32 6kB ManifestArray<shape=(1440,), dtype=flo...
    time       (time) float32 16B ManifestArray<shape=(4,), dtype=float32, ch...
    latitude   (latitude) float32 3kB ManifestArray<shape=(720,), dtype=float...
Data variables:
    ws         (time, latitude, longitude) float32 17MB ManifestArray<shape=(...
    uwnd       (time, latitude, longitude) float32 17MB ManifestArray<shape=(...
    nobs       (time, latitude, longitude) float32 17MB ManifestArray<shape=(...
    vwnd       (time, latitude, longitude) float32 17MB ManifestArray<shape=(...
Attributes: (54)
CPU times: user 326 ms, sys: 74.5 ms, total: 401 ms
Wall time: 1.11 s


The reference can be saved to file and used to open the corresponding CCMP data file with Xarray:

In [10]:
virtual_ds_example.virtualize.to_kerchunk('virtual_ds_example.json', format='json')
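
For orientation, a kerchunk-style JSON reference is essentially a dictionary that maps Zarr chunk keys to a file URL, a byte offset, and a byte length. The sketch below builds a tiny reference by hand to show that layout. The byte offsets and lengths are made-up placeholders, not values from the CCMP files, and `chunk_entries` is a hypothetical helper.

```python
import json

# Hypothetical, hand-written reference in the kerchunk "version 1" layout.
# The offsets and lengths below are placeholders for illustration only.
url = (
    "s3://podaac-ops-cumulus-protected/CCMP_WINDS_10M6HR_L4_V3.1/"
    "CCMP_Wind_Analysis_19930102_V03.1_L4.nc"
)
example_ref = {
    "version": 1,
    "refs": {
        # Metadata entries are stored inline as JSON strings:
        ".zgroup": json.dumps({"zarr_format": 2}),
        # Each chunk key maps to [file URL, byte offset, byte length]:
        "ws/0.0.0": [url, 1000, 2048],
        "ws/1.0.0": [url, 3048, 2048],
    },
}

def chunk_entries(ref):
    """Return only the entries that point at byte ranges in data files."""
    return {k: v for k, v in ref["refs"].items() if isinstance(v, list)}

for key, (path, offset, length) in chunk_entries(example_ref).items():
    print(f"{key}: {length} bytes at offset {offset} in {path.split('/')[-1]}")
```

Opening the saved `virtual_ds_example.json` with `json.load()` and passing it to a helper like this is one way to see how many chunk references it holds.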

In [11]:
# Open data using the reference file, using a small wrapper function around xarray's open_dataset. 
# This will shorten code blocks in other sections. 
def opendf_withref(ref, fs_data):
    """
    "ref" is a reference file or object. "fs_data" is a filesystem with credentials to
    access the actual data files. 
    """
    storage_opts = {"fo": ref, "remote_protocol": "s3", "remote_options": fs_data.storage_options}
    fs_ref = fsspec.filesystem('reference', **storage_opts)
    m = fs_ref.get_mapper('')
    data = xr.open_dataset(
        m, engine="zarr", chunks={},
        backend_kwargs={"consolidated": False}
    )
    return data

In [12]:
data_example = opendf_withref('virtual_ds_example.json', fs)
print(data_example)

<xarray.Dataset> Size: 66MB
Dimensions:    (latitude: 720, longitude: 1440, time: 4)
Coordinates:
  * latitude   (latitude) float32 3kB -89.88 -89.62 -89.38 ... 89.38 89.62 89.88
  * longitude  (longitude) float32 6kB 0.125 0.375 0.625 ... 359.4 359.6 359.9
  * time       (time) datetime64[ns] 32B 1993-01-02 ... 1993-01-02T18:00:00
Data variables:
    nobs       (time, latitude, longitude) float32 17MB dask.array<chunksize=(1, 720, 1440), meta=np.ndarray>
    uwnd       (time, latitude, longitude) float32 17MB dask.array<chunksize=(1, 720, 1440), meta=np.ndarray>
    vwnd       (time, latitude, longitude) float32 17MB dask.array<chunksize=(1, 720, 1440), meta=np.ndarray>
    ws         (time, latitude, longitude) float32 17MB dask.array<chunksize=(1, 720, 1440), meta=np.ndarray>
Attributes: (54)


In [13]:
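# Also useful to note, these reference objects are lightweight. Keep in mind that sys.getsizeof()
# reports only the shallow size of the top-level object, so treat this number as a lower bound: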
print(sys.getsizeof(virtual_ds_example), "bytes")

120 bytes
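
A caveat on the number above: `sys.getsizeof()` reports only the shallow size of the object it is given, so it likely does not include the chunk manifests held inside the virtual dataset. The sketch below illustrates the difference between shallow and recursive size using a plain dictionary as a stand-in for a manifest. The helper `deep_sizeof` is hypothetical and only handles built-in containers.

```python
import sys

def deep_sizeof(obj, _seen=None):
    """Rough recursive size of built-in containers (dict, list, tuple, set)."""
    seen = set() if _seen is None else _seen
    if id(obj) in seen:  # avoid double-counting shared objects
        return 0
    seen.add(id(obj))
    size = sys.getsizeof(obj)
    if isinstance(obj, dict):
        size += sum(deep_sizeof(k, seen) + deep_sizeof(v, seen) for k, v in obj.items())
    elif isinstance(obj, (list, tuple, set, frozenset)):
        size += sum(deep_sizeof(item, seen) for item in obj)
    return size

# Stand-in for a chunk manifest: many [path, offset, length] entries.
manifest = {f"ws/{i}.0.0": ["s3://bucket/file.nc", 1000 + i, 2048] for i in range(1000)}
print("shallow:", sys.getsizeof(manifest), "bytes")
print("deep:   ", deep_sizeof(manifest), "bytes")
```

Even when counted recursively, references of this kind remain small compared with the data arrays they point to.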


### 2.2 First year
Reference information for each data file in the year is created individually, and then the combined reference file for the year can be created.

For us, reference file creation for a single file takes about 0.7 seconds, so processing a year of files would take about 4.25 minutes. One can easily accomplish this with a for-loop:

```
virtual_ds_list = [
    open_virtual_dataset(p, indexes={}, reader_options={"storage_options": fs.storage_options})
    for p in data_s3links[:365]
    ]
```

However, we speed things up using basic parallel computing. 
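
The arithmetic behind these estimates can be sketched as follows. The ideal parallel time assumes perfect scaling with no scheduling or network overhead, so it is a lower bound rather than a prediction. The function names are hypothetical, and 15 workers is the cluster size used in Section 2.2.1.

```python
def serial_minutes(n_files, seconds_per_file):
    """Estimated wall time, in minutes, when files are processed one by one."""
    return n_files * seconds_per_file / 60

def ideal_parallel_seconds(n_files, seconds_per_file, n_workers):
    """Lower bound on wall time assuming perfect scaling across workers."""
    return n_files * seconds_per_file / n_workers

print(f"serial: {serial_minutes(365, 0.7):.2f} min")
print(f"15 workers (ideal): {ideal_parallel_seconds(365, 0.7, 15):.1f} s")
```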

### 2.2.1 Method 1: parallelize using Dask local cluster
If using the suggested `m6i.4xlarge` AWS EC2 instance, there are 16 CPUs available and each should have enough memory that we can utilize nearly all of them at once (we use 15 workers below). If working on a different VM-type, change the `n_workers` in the call to `Client()` below as needed.

In [14]:
# Check how many cpu's are on this VM:
print("CPU count =", multiprocessing.cpu_count())

CPU count = 16


In [15]:
# Start up cluster and print some information about it:
client = Client(n_workers=15, threads_per_worker=1)
print(client.cluster)
print("View any work being done on the cluster here", client.dashboard_link)

LocalCluster(e5a9615d, 'tcp://127.0.0.1:42901', workers=15, threads=15, memory=60.83 GiB)
View any work being done on the cluster here https://cluster-gjspj.dask.host/jupyter/proxy/8787/status


In [16]:
%%time
# Create individual references:
open_vds_par = delayed(open_virtual_dataset)
tasks = [open_vds_par(p, indexes={}, reader_options=reader_opts) for p in data_s3links[:365]]
virtual_ds_list = list(da.compute(*tasks)) # The xr.combine_nested() function below needs a list rather than a tuple.

CPU times: user 3.76 s, sys: 685 ms, total: 4.45 s
Wall time: 25 s
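
The same fan-out pattern can also be written with the standard library, which may be handy when no Dask cluster is available. This is a swapped-in technique, not what was run above: it uses `concurrent.futures.ThreadPoolExecutor`, and a stand-in function `fake_open` takes the place of `open_virtual_dataset`. Whether threads give a comparable speed-up for real reference generation is an assumption that would need testing.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_open(link):
    """Stand-in for open_virtual_dataset(): pretend to do a short I/O-bound task."""
    time.sleep(0.01)
    return f"reference for {link}"

def map_in_threads(func, items, max_workers=15):
    """Apply func to each item concurrently; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(func, items))

# Hypothetical links, used only to exercise the pattern:
links = [f"s3://example-bucket/file_{i:03d}.nc" for i in range(30)]
results = map_in_threads(fake_open, links)
print(len(results), results[0])
```

Because `pool.map()` preserves input order, the resulting list could be passed on for concatenation along time in the same way as the Dask result.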


Using the individual references to create the combined reference is fast and does not require parallel computing.

In [17]:
%%time
# Create the combined reference
virtual_ds_combined = xr.combine_nested(virtual_ds_list, concat_dim='time', coords='minimal', compat='override', combine_attrs='drop_conflicts')

CPU times: user 128 ms, sys: 0 ns, total: 128 ms
Wall time: 126 ms


In [18]:
# Save in JSON or PARQUET format:
fname_combined_json = 'ref_combined_1year.json'
fname_combined_parq = 'ref_combined_1year.parq'
virtual_ds_combined.virtualize.to_kerchunk(fname_combined_json, format='json')
virtual_ds_combined.virtualize.to_kerchunk(fname_combined_parq, format='parquet')
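
To compare the on-disk footprint of the two formats, a small helper like the one below can be used. It is a sketch under the assumption that the Parquet reference may be written as a directory of files rather than a single file, so it sums sizes recursively. The function names are hypothetical.

```python
import os

def size_on_disk(path):
    """Total bytes for a file, or for all files under a directory."""
    if os.path.isfile(path):
        return os.path.getsize(path)
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total

def size_ratio(path_a, path_b):
    """How many times larger path_a is than path_b."""
    return size_on_disk(path_a) / size_on_disk(path_b)

# Only report sizes for reference files that actually exist in the working directory:
for name in ("ref_combined_1year.json", "ref_combined_1year.parq"):
    if os.path.exists(name):
        print(name, size_on_disk(name), "bytes")
```

With both files present, `size_ratio('ref_combined_1year.json', 'ref_combined_1year.parq')` would give the reduction factor for this data set.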

In [None]:
%%time
# Test lazy loading of the combined reference file JSON:
data_json = opendf_withref(fname_combined_json, fs)
print(data_json)

In [None]:
%%time
# Test lazy loading of the combined reference file PARQUET:
data_parq = opendf_withref(fname_combined_parq, fs)
print(data_parq)

### 2.2.2 Optional method 2: parallelize using distributed cluster with Coiled
At PO.DAAC we have been testing the third party software/package Coiled which makes it easy to spin up distributed computing clusters in the cloud. Since we suspect that Coiled may become a key member of the Cloud ecosystem for earth science researchers, this optional section is included, which can be used as an alternative to Section 2.2.1 for generating the individual reference files in parallel.

In [33]:
%%time

## --------------------------------------------
## Create single reference files with parallel computing using Coiled
## --------------------------------------------

# Wrap `open_virtual_dataset()` into coiled function and copy to multiple VM's:
open_vds_par = coiled.function(
    region="us-west-2", spot_policy="on-demand", 
    vm_type="m6i.large", n_workers=30
    )(open_virtual_dataset)

# Begin computations:
results = open_vds_par.map(data_s3links[:], indexes={}, reader_options=reader_opts)

virtual_ds_list = []
for r in results:
    virtual_ds_list.append(r)

Output()

Output()

CPU times: user 7.6 s, sys: 653 ms, total: 8.26 s
Wall time: 8min 30s


In [34]:
open_vds_par.cluster.shutdown()

Using the individual references to create the combined reference is fast and does not require parallel computing.

In [35]:
%%time
# Combining the individual references works the same as in Section 2.2.1:
virtual_ds_combined = xr.combine_nested(virtual_ds_list, concat_dim='time', coords='minimal', compat='override', combine_attrs='drop_conflicts')

CPU times: user 5.48 s, sys: 33.2 ms, total: 5.51 s
Wall time: 5.51 s


In [36]:
# Save in JSON or PARQUET format:
fname_combined_json = 'ref_combined_1year.json'
fname_combined_parq = 'ref_combined_1year.parq'
virtual_ds_combined.virtualize.to_kerchunk(fname_combined_json, format='json')
virtual_ds_combined.virtualize.to_kerchunk(fname_combined_parq, format='parquet')

In [37]:
%%time
# Test lazy loading of the combined reference file JSON:
data_json = opendf_withref(fname_combined_json, fs)
print(data_json)

<xarray.Dataset> Size: 775GB
Dimensions:    (latitude: 720, longitude: 1440, time: 46696)
Coordinates:
  * latitude   (latitude) float32 3kB -89.88 -89.62 -89.38 ... 89.38 89.62 89.88
  * longitude  (longitude) float32 6kB 0.125 0.375 0.625 ... 359.4 359.6 359.9
  * time       (time) datetime64[ns] 374kB 1993-01-02 ... 2024-12-31T18:00:00
Data variables:
    nobs       (time, latitude, longitude) float32 194GB dask.array<chunksize=(1, 720, 1440), meta=np.ndarray>
    uwnd       (time, latitude, longitude) float32 194GB dask.array<chunksize=(1, 720, 1440), meta=np.ndarray>
    vwnd       (time, latitude, longitude) float32 194GB dask.array<chunksize=(1, 720, 1440), meta=np.ndarray>
    ws         (time, latitude, longitude) float32 194GB dask.array<chunksize=(1, 720, 1440), meta=np.ndarray>
Attributes: (44)
CPU times: user 37.7 s, sys: 492 ms, total: 38.2 s
Wall time: 3min 34s


In [38]:
%%time
# Test lazy loading of the combined reference file PARQUET:
data_parq = opendf_withref(fname_combined_parq, fs)
print(data_parq)

<xarray.Dataset> Size: 775GB
Dimensions:    (latitude: 720, longitude: 1440, time: 46696)
Coordinates:
  * latitude   (latitude) float32 3kB -89.88 -89.62 -89.38 ... 89.38 89.62 89.88
  * longitude  (longitude) float32 6kB 0.125 0.375 0.625 ... 359.4 359.6 359.9
  * time       (time) datetime64[ns] 374kB 1993-01-02 ... 2024-12-31T18:00:00
Data variables:
    nobs       (time, latitude, longitude) float32 194GB dask.array<chunksize=(1, 720, 1440), meta=np.ndarray>
    uwnd       (time, latitude, longitude) float32 194GB dask.array<chunksize=(1, 720, 1440), meta=np.ndarray>
    vwnd       (time, latitude, longitude) float32 194GB dask.array<chunksize=(1, 720, 1440), meta=np.ndarray>
    ws         (time, latitude, longitude) float32 194GB dask.array<chunksize=(1, 720, 1440), meta=np.ndarray>
Attributes: (44)
CPU times: user 38.3 s, sys: 351 ms, total: 38.7 s
Wall time: 3min 17s


### 2.3 Entire record

Processing the entire record follows the exact same workflow as processing the first year in Section 2.2 (either parallelization method). The only modification required is to replace the one instance of 
```
data_s3links[:365]
```
with 
```
data_s3links[:]
```
when setting up the parallel computations (occurs once in each of Sections 2.2.1 and 2.2.2). Optionally, also change the saved file names e.g. from `ref_combined_1year.json` to `ref_combined_record.json`.

For us, processing the entire record using a local cluster on an `m6i.4xlarge` EC2 instance, with 15 workers, took about 13 minutes. Using 30 `m6i.large` VM's on a distributed cluster with Coiled took ~8.5 minutes and cost ~$0.40.

Because the virtualizarr package is so efficient at combining many individual reference files together, and because the individual references have such small in-memory requirements, the workflows in Section 2.2 are assumed to scale to tens of thousands of files and TB's of data. However, this assumption will be tested as the techniques in the notebook are applied to progressively larger data sets.

For us, lazy loading the entire record took ~3 minutes. Current work is looking into reducing this time further (expectations were high coming from kerchunk reference file generation). However, 3 minutes is still a massive upgrade. For example, compare that to an attempt at opening these same files with `Xarray` the "traditional" way with a call to `xr.open_mfdataset()`. On a smaller machine, the following line of code will either fail or take a long (possibly very long) amount of time:

In [33]:
## You can try un-commenting and running this but your notebook will probably stall or crash:
# fobjs = earthaccess.open(granule_info)
# data = xr.open_mfdataset(fobjs[:])

## 3. Combining reference files
This section demonstrates that reference files can be combined in two examples:

1. A single reference file is appended to the year-long combined reference file generated in the previous section.
2. A second year-long combined reference file is created and combined with the first year-long reference file.

In both cases, creating the final product (e.g. combining two reference files) is straightforward, ***but it currently only works for JSON*** (and the upcoming Icechunk format).

In [72]:
# In case this notebook has been running over an hour, refresh the file system and credentials:
fs = earthaccess.get_s3_filesystem(daac="PODAAC")
reader_opts = {"storage_options": fs.storage_options}

### 3.1 Adding an extra day of the CCMP record to the year-long reference file.

In [51]:
%%time
# Create reference for one CCMP file beyond the first year (index 366, i.e. the 367th file;
# the year-long reference covers indices 0-364):
vds_extraday = open_virtual_dataset(data_s3links[366], indexes={}, reader_options=reader_opts)

CPU times: user 220 ms, sys: 58.7 ms, total: 279 ms
Wall time: 765 ms


In [61]:
%%time
# Add it to the year-long reference, then save as new reference file
#ref_year1 = open_virtual_dataset('ref_combined_1year.parq', filetype='parquet')
vds_year1 = open_virtual_dataset('ref_combined_1year.json', filetype='kerchunk')
vds_appended = xr.combine_nested([vds_year1, vds_extraday], concat_dim='time', coords='minimal', compat='override', combine_attrs='drop_conflicts')

In [62]:
## Confirm time dimension has increased in new reference:
print(len(vds_year1['time']))
print(len(vds_appended['time']))

1460
1464
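
These lengths can be sanity-checked by hand, since each daily CCMP file holds four 6-hourly time steps (see the `time: 4` dimension in Section 2.1). A minimal sketch, with a hypothetical helper name:

```python
STEPS_PER_FILE = 4  # each daily CCMP file holds four 6-hourly time steps

def expected_time_steps(n_files, steps_per_file=STEPS_PER_FILE):
    """Number of time steps expected after concatenating n_files along time."""
    return n_files * steps_per_file

year1 = expected_time_steps(365)
print(year1, year1 + expected_time_steps(1))
```

The same check applies to any combined reference: the length of its time dimension should equal four times the number of files it covers.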


In [64]:
%%time
# Save to file then use to open new data set
vds_appended.virtualize.to_kerchunk('ref_appended.json', format='json')
data_appended = opendf_withref('ref_appended.json', fs)
print(data_appended)

<xarray.Dataset> Size: 24GB
Dimensions:    (latitude: 720, longitude: 1440, time: 1464)
Coordinates:
  * latitude   (latitude) float32 3kB -89.88 -89.62 -89.38 ... 89.38 89.62 89.88
  * longitude  (longitude) float32 6kB 0.125 0.375 0.625 ... 359.4 359.6 359.9
  * time       (time) datetime64[ns] 12kB 1993-01-02 ... 1994-01-06T18:00:00
Data variables:
    nobs       (time, latitude, longitude) float32 6GB dask.array<chunksize=(1, 720, 1440), meta=np.ndarray>
    uwnd       (time, latitude, longitude) float32 6GB dask.array<chunksize=(1, 720, 1440), meta=np.ndarray>
    vwnd       (time, latitude, longitude) float32 6GB dask.array<chunksize=(1, 720, 1440), meta=np.ndarray>
    ws         (time, latitude, longitude) float32 6GB dask.array<chunksize=(1, 720, 1440), meta=np.ndarray>
Attributes: (54)
CPU times: user 842 ms, sys: 31.8 ms, total: 874 ms
Wall time: 2.89 s


### 3.2 Combining two year-long combined reference files
Individual files for the second year of CCMP are created and combined into a single reference file, then this file is combined with the first year-long reference file. As before, parallel computing is used to speed up creation of the files, but this could also be accomplished with a for-loop.

In [None]:
## !!!!!!!!!!!
## This line only needs to be un-commented and run if you don't have a cluster already running
## from Section 2.2.1
## !!!!!!!!!!!

## Start up cluster:
#client = Client(n_workers=15, threads_per_worker=1)

In [36]:
%%time
# Create individual references for second year:
open_vds_par = delayed(open_virtual_dataset)
tasks = [open_vds_par(p, indexes={}, reader_options=reader_opts) for p in data_s3links[366:730]]
virtual_ds_list_year2 = list(da.compute(*tasks))

CPU times: user 4.31 s, sys: 792 ms, total: 5.11 s
Wall time: 23.9 s


In [66]:
%%time
# Combine the individual references:
vds_combined_year2 = xr.combine_nested(virtual_ds_list_year2, concat_dim='time', coords='minimal', compat='override', combine_attrs='drop_conflicts')
vds_combined_year2

CPU times: user 126 ms, sys: 0 ns, total: 126 ms
Wall time: 124 ms


In [69]:
vds_year1 = open_virtual_dataset('ref_combined_1year.json', filetype='kerchunk')
vds_2years = xr.combine_nested([vds_year1, vds_combined_year2], concat_dim='time', coords='minimal', compat='override', combine_attrs='drop_conflicts')

In [70]:
## Confirm time dimension has increased in new reference:
print(len(vds_year1['time']))
print(len(vds_2years['time']))

1460
2916


In [74]:
%%time

# Save to file then use to open new data set
vds_2years.virtualize.to_kerchunk('ref_twoyears.json', format='json')
data_2years = opendf_withref('ref_twoyears.json', fs)
print(data_2years)

<xarray.Dataset> Size: 48GB
Dimensions:    (latitude: 720, longitude: 1440, time: 2916)
Coordinates:
  * latitude   (latitude) float32 3kB -89.88 -89.62 -89.38 ... 89.38 89.62 89.88
  * longitude  (longitude) float32 6kB 0.125 0.375 0.625 ... 359.4 359.6 359.9
  * time       (time) datetime64[ns] 23kB 1993-01-02 ... 1995-01-08T18:00:00
Data variables:
    nobs       (time, latitude, longitude) float32 12GB dask.array<chunksize=(1, 720, 1440), meta=np.ndarray>
    uwnd       (time, latitude, longitude) float32 12GB dask.array<chunksize=(1, 720, 1440), meta=np.ndarray>
    vwnd       (time, latitude, longitude) float32 12GB dask.array<chunksize=(1, 720, 1440), meta=np.ndarray>
    ws         (time, latitude, longitude) float32 12GB dask.array<chunksize=(1, 720, 1440), meta=np.ndarray>
Attributes: (47)
CPU times: user 1.72 s, sys: 142 ms, total: 1.86 s
Wall time: 6.48 s


## 4. Using a reference file to analyze data with parallel computing

This section verifies that the reference file can be used to perform computations on the CCMP data, additionally verifying that parallel computing can be implemented with the computations. We try parallel computations using both a local and distributed cluster. 

The analysis will bin/average the first year of CCMP data by month, to generate a "mean seasonal cycle" (of course, one year of data isn't enough to produce a real mean seasonal cycle). The analysis uses Xarray built-in functions which naturally parallelize on Dask clusters.

In [24]:
def seasonal_cycle_regional(data_array, lat_region, lon_region):
    """
    Uses built in Xarray functions to generate a mean seasonal cycle at each grid point 
    over a specified region. Any temporal linear trends are first removed at each point 
    respectively, then data are binned and averaged by month. 
    """
    ## Subset to region:
    # The CCMP files name their spatial dimensions "latitude" and "longitude":
    da_regional = data_array.sel(latitude=slice(*lat_region), longitude=slice(*lon_region))
    
    ## Remove any linear trends:
    p = da_regional.polyfit(dim='time', deg=1) # Degree 1 polynomial fit coefficients over time for each lat, lon.
    fit = xr.polyval(da_regional['time'], p.polyfit_coefficients) # Compute linear trend time series at each lat, lon.
    da_detrend = (da_regional - fit) # xarray is smart enough to subtract along the time dim only.
    
    ## Mean seasonal cycle:
    seasonal_cycle = da_detrend.groupby("time.month").mean("time")
    return seasonal_cycle
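
To make the detrend-then-bin logic concrete, here is a pure-Python analogue applied to a single synthetic time series. It is not the Xarray code above, only a sketch of the same two steps (linear fit removal, then averaging by calendar month). It assumes a monthly series whose index 0 falls in January, and the function names are hypothetical.

```python
import math
from collections import defaultdict

def linear_fit(x, y):
    """Ordinary least-squares slope and intercept for y ~ slope * x + intercept."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, y_mean - slope * x_mean

def seasonal_cycle_1d(months_elapsed, values):
    """Detrend a monthly series, then average the residuals by calendar month (1-12)."""
    slope, intercept = linear_fit(months_elapsed, values)
    bins = defaultdict(list)
    for t, v in zip(months_elapsed, values):
        # Assumes index 0 is January, so t % 12 + 1 is the calendar month.
        bins[t % 12 + 1].append(v - (slope * t + intercept))
    return {month: sum(vals) / len(vals) for month, vals in sorted(bins.items())}

# Synthetic 10-year monthly series: linear trend plus an annual cosine (amplitude 2).
t = list(range(120))
series = [0.05 * ti + 2.0 * math.cos(2 * math.pi * ti / 12) for ti in t]
cycle = seasonal_cycle_1d(t, series)
print({m: round(v, 2) for m, v in cycle.items()})
```

The recovered cycle peaks in month 1 and bottoms out in month 7, matching the cosine that was put in, which is the behavior the Xarray function is meant to reproduce at every grid point.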

In [25]:
# Region to perform analysis over:
lat_region = (30, 45)
# CCMP longitudes run from 0 to 360, so 135W-105W corresponds to 225-255:
lon_region = (225, 255)

### 4.1 Using a local cluster
This section was run using 16 workers on an `m6i.8xlarge` EC2 instance. Any warning messages generated by the cluster are left in the output here intentionally. Note that despite the warning messages, the parallel computations complete successfully. 

In [26]:
print("CPU count =", multiprocessing.cpu_count())

CPU count = 32


In [None]:
## Local Dask Cluster
client = Client(n_workers=16, threads_per_worker=1)
print(client.cluster)
client.dashboard_link

In [44]:
# Open data ...

In [29]:
# %%time
# Perform computations:
# seasonal_cycle = seasonal_cycle_regional(<data set>, lat_region, lon_region).compute()

CPU times: user 31.7 s, sys: 7.21 s, total: 38.9 s
Wall time: 55.3 s


In [45]:
## Test plot seasonal cycle at a few gridpoint locations

# # Points to plot seasonal cycle at:
# lat_points = (38, 38, 38, 38)
# lon_points = (-123.25, -125, -128, -132)

# fig = plt.figure()
# ax = plt.axes()

# for lat, lon in zip(lat_points, lon_points):
#     scycle_point = seasonal_cycle.sel(lat=lat, lon=lon)
#     ax.plot(scycle_point['month'], scycle_point.values, 'o-')

# ax.set_title("Seasonal cycle of ... anomalies \n at four test points", fontsize=14)
# ax.set_xlabel("month", fontsize=12)
# ax.set_ylabel(r"$\Delta$... ", fontsize=12)

### 4.2 Optional: Using a distributed cluster
We use the third party software/package Coiled to spin up our distributed cluster.

In [65]:
cluster = coiled.Cluster(
    n_workers=25, 
    region="us-west-2", 
    worker_vm_types="c7g.large", # or can try "m7a.medium"
    scheduler_vm_types="c7g.large", # or can try "m7a.medium"
    ) 
client = cluster.get_client()

Output()

Output()

***Note that computations on the distributed cluster work if we fully load the reference information into memory first, but not if we just pass the path to the reference file!***
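
One way to read "load the reference into memory" is to parse the JSON reference file into a Python dictionary before handing it to the function that opens the data. The sketch below shows that step on a tiny placeholder reference written to a temporary directory. The helper `load_reference` is hypothetical, and this is an assumption about the intended approach rather than the exact code that was run.

```python
import json
import os
import tempfile

def load_reference(path):
    """Read a kerchunk-style JSON reference file into an in-memory dict."""
    with open(path) as f:
        return json.load(f)

# Demonstration with a tiny placeholder reference (not a real CCMP reference):
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo_ref.json")
    with open(demo_path, "w") as f:
        json.dump({"version": 1, "refs": {".zgroup": json.dumps({"zarr_format": 2})}}, f)
    ref_in_memory = load_reference(demo_path)

print(type(ref_in_memory).__name__, list(ref_in_memory["refs"]))
```

In this notebook the resulting dictionary could then be passed as the `ref` argument of `opendf_withref()`, for example `opendf_withref(load_reference('ref_combined_1year.json'), fs)`, since the fsspec reference filesystem accepts a dictionary as well as a file path for `fo`.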

In [46]:
%%time

##==================================================================
## Only works if the reference is loaded into memory first!!!!
##==================================================================
# Load into memory ...

CPU times: user 3 μs, sys: 0 ns, total: 3 μs
Wall time: 5.01 μs


In [47]:
# %%time
# seasonal_cycle = seasonal_cycle_regional(<data set>, lat_region, lon_region).compute()

In [48]:
## Test plot seasonal cycle at a few gridpoint locations

# # Points to plot seasonal cycle at:
# lat_points = (38, 38, 38, 38)
# lon_points = (-123.25, -125, -128, -132)

# fig = plt.figure()
# ax = plt.axes()

# for lat, lon in zip(lat_points, lon_points):
#     scycle_point = seasonal_cycle.sel(lat=lat, lon=lon)
#     ax.plot(scycle_point['month'], scycle_point.values, 'o-')

# ax.set_title("Seasonal cycle of ... anomalies \n at four test points", fontsize=14)
# ax.set_xlabel("month", fontsize=12)
# ax.set_ylabel(r"$\Delta$... ", fontsize=12)

In [49]:
##==================================================================
## Loading the data with the reference this way will lead to errors!!!!
##==================================================================
# Loading by passing the file path without loading into memory...

In [70]:
# %%time
# seasonal_cycle = seasonal_cycle_regional(<data set>, lat_region, lon_region).compute()

RuntimeError: Error during deserialization of the task graph. This frequently
occurs if the Scheduler and Client have different environments.
For more information, see
https://docs.dask.org/en/stable/deployment-considerations.html#consistent-software-environments


In [71]:
client.shutdown()