# Kerchunk Useful Recipes with NASA Earthdata

#### *Author: Dean Henze, PO.DAAC*

## Summary

This notebook walks through several features of kerchunk, specifically for use with NASA Earthdata and the `earthaccess` package. It is meant to be a quick-start reference that introduces some key capabilities and characteristics of the package. It does not go into depth on what kerchunk is, so unfamiliar users are encouraged to check out some of the references below. In short, kerchunk is a Python package that generates "reference files", which one can think of as road maps that let the computer navigate the arrays in a large data set more rapidly and efficiently. Once a reference file for a data set is created and stored in an accessible location, it allows us, for example, to lazy load data faster, access subsets of the data more quickly (spatially, temporally, or along any other dimension in the data set), and in some cases perform computations faster.
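To make the "road map" idea concrete, here is a minimal, self-contained sketch of what a reference set boils down to: a mapping from chunk keys to `[file, byte offset, byte length]` triples. The toy file, the variable name `var`, and the helper `read_chunk` are illustrative assumptions and not kerchunk code; real kerchunk references also store Zarr metadata entries alongside the chunk entries.

```python
import json
import os
import tempfile

# Toy "data file": three chunks of raw bytes stored back to back.
chunks = [b"AAAA", b"BBBBBB", b"CC"]
path = os.path.join(tempfile.mkdtemp(), "toy.bin")
with open(path, "wb") as f:
    for chunk in chunks:
        f.write(chunk)

# Reference set: chunk key -> [file, byte offset, byte length].
refs = {"version": 1, "refs": {}}
offset = 0
for i, chunk in enumerate(chunks):
    refs["refs"][f"var/{i}"] = [path, offset, len(chunk)]
    offset += len(chunk)


def read_chunk(refs, key):
    """Fetch one chunk by seeking straight to its byte range (hypothetical helper)."""
    url, start, length = refs["refs"][key]
    with open(url, "rb") as f:
        f.seek(start)
        return f.read(length)


second = read_chunk(refs, "var/1")

# The reference set is plain data, so it can be saved as JSON.
as_json = json.dumps(refs)
```

The point of the sketch is that a reader never scans the file: it looks up the key and jumps directly to the bytes it needs.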

The functionalities of kerchunk covered in this notebook are:
1. Generating a reference file in JSON format for the first year of the MUR 0.01 degree resolution sea surface temperature data set.
2. Generating a reference file in PARQUET format for the first year of the MUR 0.01 degree resolution sea surface temperature data set.
3. Adding an extra day of the MUR record to our existing reference file.
4. Using the reference file to perform a basic analysis on the MUR data set with a parallel computing cluster.

## Requirements, prerequisite knowledge, learning outcomes

#### Requirements to run this notebook

* Earthdata login account: An Earthdata Login account is required to access data from the NASA Earthdata system. Please visit https://urs.earthdata.nasa.gov to register and manage your Earthdata Login account.

* Compute environment: This notebook is meant to be run in the cloud (AWS instance running in us-west-2), recommended on a VM with at minimum

* VM type: ...


#### Prerequisite knowledge


#### Learning Outcomes

## Import Packages
```
pip install git+https://github.com/fsspec/kerchunk
pip install fastparquet xarray earthaccess coiled fsspec
```

In [53]:
import os
import fsspec
import kerchunk
from kerchunk.df import refs_to_dataframe
from kerchunk.hdf import SingleHdf5ToZarr
from kerchunk.combine import MultiZarrToZarr
import ujson
import json
import xarray as xr
import earthaccess
import coiled

## 1. Generating a reference file in JSON format for the first year of MUR 0.01 degree SST

### 1.1 Locate Data File S3 endpoints in Earthdata Cloud 
The first step is to find the S3 endpoints of the files and generate file-like objects to use with kerchunk. Handling access credentials to Earthdata and then finding the endpoints can be done in a number of ways (e.g. using the `requests` and `s3fs` packages), but we choose the `earthaccess` package for its convenience and brevity. We will get two years of MUR files, from the beginning of 2019 to the end of 2020.

In [54]:
# Get Earthdata creds
earthaccess.login()

<earthaccess.auth.Auth at 0x7eff0944cda0>
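`earthaccess.login()` can prompt for credentials interactively. To our knowledge it can also pick them up from a `~/.netrc` file, which is worth confirming in the earthaccess documentation for your version. The sketch below uses only the Python standard library to check whether a netrc file has an Earthdata Login entry; `has_earthdata_entry` is a hypothetical helper and the credentials written to the demo file are placeholders.

```python
import netrc
import os
import tempfile

EARTHDATA_HOST = "urs.earthdata.nasa.gov"


def has_earthdata_entry(netrc_path):
    """Return True if the netrc file at netrc_path has an Earthdata Login entry."""
    if not os.path.exists(netrc_path):
        return False
    try:
        entry = netrc.netrc(netrc_path).authenticators(EARTHDATA_HOST)
    except netrc.NetrcParseError:
        return False
    return entry is not None


# Demonstration with a throwaway file and placeholder credentials.
demo_path = os.path.join(tempfile.mkdtemp(), "demo_netrc")
with open(demo_path, "w") as f:
    f.write(f"machine {EARTHDATA_HOST} login my_user password my_password\n")

found = has_earthdata_entry(demo_path)
missing = has_earthdata_entry(demo_path + ".does_not_exist")
```

On a real system one would call `has_earthdata_entry(os.path.expanduser("~/.netrc"))` before running the login cell.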

In [55]:
# Get AWS creds
fs = earthaccess.get_s3fs_session(daac="PODAAC")

In [56]:
granule_info = earthaccess.search_data(
    short_name="MUR-JPL-L4-GLOB-v4.1",
    temporal=("2019-01-01", "2020-12-31"),
    )

In [43]:
# Generates the file-like objects from the files located in the previous code block:
fobjs = earthaccess.open(granule_info)

QUEUEING TASKS | :   0%|          | 0/732 [00:00<?, ?it/s]

PROCESSING TASKS | :   0%|          | 0/732 [00:00<?, ?it/s]

COLLECTING RESULTS | :   0%|          | 0/732 [00:00<?, ?it/s]

In [6]:
# Endpoints found in this attribute:
example_endpoint = fobjs[0].full_name
example_endpoint

's3://podaac-ops-cumulus-protected/MUR-JPL-L4-GLOB-v4.1/20190101090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc'
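Later cells slice `fobjs` by position (for example `fobjs[:365]` for 2019), which relies on the granules being returned in chronological order. Below is a minimal sketch of a sanity check, assuming MUR file names start with a 14-digit `YYYYMMDDhhmmss` timestamp as in the example endpoint above. `parse_endpoint` is a hypothetical helper and not part of earthaccess.

```python
from datetime import datetime
from urllib.parse import urlparse


def parse_endpoint(endpoint):
    """Split an S3 endpoint into (bucket, key, timestamp parsed from the file name)."""
    parsed = urlparse(endpoint)
    bucket = parsed.netloc
    key = parsed.path.lstrip("/")
    filename = key.rsplit("/", 1)[-1]
    # Assumption: the file name begins with a YYYYMMDDhhmmss timestamp.
    timestamp = datetime.strptime(filename[:14], "%Y%m%d%H%M%S")
    return bucket, key, timestamp


example = (
    "s3://podaac-ops-cumulus-protected/MUR-JPL-L4-GLOB-v4.1/"
    "20190101090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc"
)
bucket, key, timestamp = parse_endpoint(example)

# Check ordering and count files per year on a small list of endpoints.
endpoints = [example, example.replace("20190101", "20190102")]
timestamps = [parse_endpoint(e)[2] for e in endpoints]
is_sorted = timestamps == sorted(timestamps)
n_2019 = sum(t.year == 2019 for t in timestamps)
```

In the notebook, the same check could be run over `[f.full_name for f in fobjs]` before slicing by index.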

### 1.2 Generate kerchunk reference files for each individual file in year 2019
It is necessary to create reference files for each data file individually before they can be combined into a single reference file for the entire record. The time for generating many of these individual files can add up, so this section also covers the option to utilize parallel computing.

First, we define a small wrapper function around the kerchunk and earthaccess packages.

In [7]:
def single_ref_earthaccess(fobj):
    """
    Creates and returns a reference for a single file. "fobj" is an earthaccess.store.EarthAccessFile 
    object obtained from a call to earthaccess.open(), which also has the file endpoint.
    """
    endpoint = fobj.full_name
    reference = SingleHdf5ToZarr(fobj, endpoint, inline_threshold=0).translate()
    return reference, endpoint # returns both the kerchunk reference and the path to the file on podaac-ops-cumulus-protected

This function can immediately be used to create a reference file and open one of the MUR files:

In [10]:
%%time
# Create reference file:
reference, endpoint = single_ref_earthaccess(fobjs[0])

CPU times: user 347 ms, sys: 86.3 ms, total: 433 ms
Wall time: 2.96 s


In [57]:
%%time
# Open data using the reference file, using a small wrapper function around xarray's open_dataset 
# for a kerchunk file. This will shorten code blocks in other sections. 
def opendf_kerchunk(ref, fs):
    """
    "ref" is a kerchunk reference file or object. "fs" is a filesystem with access to the 
    actual data files. 
    """
    storage_opts = {"fo": ref, "remote_protocol": "s3", "remote_options": fs.storage_options}
    data = xr.open_dataset(
        "reference://", engine="zarr", chunks={},
        backend_kwargs={
            "storage_options": storage_opts,
             "consolidated": False
            }
        )
    return data

data = opendf_kerchunk(reference, fs)

CPU times: user 97.1 ms, sys: 7.11 ms, total: 104 ms
Wall time: 504 ms


In [17]:
%%time
# Very basic computation:
data['analysed_sst'].mean().compute().item()

CPU times: user 12.6 s, sys: 3.24 s, total: 15.8 s
Wall time: 8.66 s


287.08852469456235

**For us, reference file creation took ~4 seconds per file, so processing a year of daily files one at a time would take about *4 s x 365 ≈ 24 minutes***. A simple for-loop would accomplish this, but here we speed things up using basic parallel computing.
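A minimal sketch of the serial for-loop mentioned above. `build_refs_serial` is a hypothetical helper, and `fake_single_ref` with its `example-bucket` endpoints is a stand-in for `single_ref_earthaccess` so that the block runs without Earthdata access. It uses the standard-library `json` module rather than `ujson`.

```python
import json
import os
import tempfile


def build_refs_serial(fobjs, make_ref, out_dir):
    """Call make_ref on each file object in turn and save each reference as JSON."""
    os.makedirs(out_dir, exist_ok=True)
    saved = []
    for fobj in fobjs:
        reference, endpoint = make_ref(fobj)
        # Same naming rule as the notebook: file name with .nc replaced by .json
        name_ref = os.path.join(out_dir, endpoint.split("/")[-1].replace(".nc", ".json"))
        with open(name_ref, "w") as outf:
            json.dump(reference, outf)
        saved.append(name_ref)
    return saved


def fake_single_ref(endpoint):
    """Stand-in for single_ref_earthaccess: returns an empty reference set."""
    return {"version": 1, "refs": {}}, endpoint


demo_endpoints = [
    "s3://example-bucket/20190101090000-demo.nc",
    "s3://example-bucket/20190102090000-demo.nc",
]
saved = build_refs_serial(demo_endpoints, fake_single_ref, tempfile.mkdtemp())

# Rough serial cost at ~4 s per file for 365 files, in minutes.
serial_minutes = 4 * 365 / 60
```

In the notebook this would be called as `build_refs_serial(fobjs[:365], single_ref_earthaccess, dir_refs_indv_2019)`.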

In [13]:
## Save reference JSONs in this directory:
dir_refs_indv_2019 = './reference_jsons_individual_2019/'
!mkdir $dir_refs_indv_2019

In [14]:
%%time

## --------------------------------------------
## Create single reference files with parallel computing using Coiled
## --------------------------------------------

# Wrap `single_ref_earthaccess` into a coiled function:
single_ref_earthaccess_par = coiled.function(
    region="us-west-2", spot_policy="on-demand", 
    vm_type="m6i.large", n_workers=10
    )(single_ref_earthaccess)

# Begin computations:
results = single_ref_earthaccess_par.map(fobjs[:365])

# Save results to JSONs as they become available:
for reference, endpoint in results:
    name_ref = dir_refs_indv_2019 + endpoint.split('/')[-1].replace('.nc', '.json')
    with open(name_ref, 'w') as outf:
        outf.write(ujson.dumps(reference))


CPU times: user 4.31 s, sys: 236 ms, total: 4.55 s
Wall time: 4min 30s
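As a rough check on what the cluster bought us, the reported wall time can be compared with the serial estimate from earlier. This is back-of-the-envelope arithmetic only: the ~4 s per file figure is the estimate quoted in the text, and the wall time presumably includes cluster start-up and result transfer (an assumption).

```python
serial_estimate_s = 4 * 365      # ~4 s per file for 365 files (estimate from the text)
parallel_wall_s = 4 * 60 + 30    # wall time reported above: 4 min 30 s
n_workers = 10

speedup = serial_estimate_s / parallel_wall_s
efficiency = speedup / n_workers

print(f"speedup: {speedup:.1f}x, parallel efficiency: {efficiency:.0%}")
```

The speedup comes out at a bit more than 5x on 10 workers, so fixed overheads are a noticeable share of the total at this problem size.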


In [15]:
single_ref_earthaccess_par.cluster.shutdown()

### 1.3 Create combined reference file and use it to open the data
The computation time for this step can also be decreased with parallel computing, but in this case serial computing is used.

In [48]:
%%time

## --------------------------------------------
## Create combined reference file
## --------------------------------------------

ref_files_indv = [dir_refs_indv_2019+f for f in os.listdir(dir_refs_indv_2019) if f.endswith('.json')]
ref_files_indv.sort()

## Combined reference file
kwargs_mzz = {'remote_protocol':"s3", 'remote_options':fs.storage_options, 'concat_dims':["time"]}
mzz = MultiZarrToZarr(ref_files_indv, **kwargs_mzz)
ref_combined = mzz.translate()

# Save reference info to JSON:
with open("ref_combined_2019.json", 'wb') as outf:
    outf.write(ujson.dumps(ref_combined).encode())

CPU times: user 5.68 s, sys: 457 ms, total: 6.14 s
Wall time: 1min 23s
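Conceptually, combining along `time` means renumbering each file's chunk keys so that the leading index is the file's position on the time axis. The sketch below illustrates only that idea on toy reference dictionaries. `combine_along_time` is a hypothetical helper, the bucket name and byte ranges are made up, and the real `MultiZarrToZarr` does considerably more (for example, reading coordinate values and rewriting array metadata).

```python
def combine_along_time(single_refs):
    """Stack per-file chunk references along a leading time index (toy version)."""
    combined = {"version": 1, "refs": {}}
    for t, ref in enumerate(single_refs):
        for key, value in ref["refs"].items():
            var, _, chunk_id = key.partition("/")
            if not chunk_id or chunk_id.startswith("."):
                continue  # skip metadata keys such as ".zgroup" or "var/.zarray"
            idx = chunk_id.split(".")
            idx[0] = str(t)  # each toy file holds exactly one time step
            combined["refs"][f"{var}/{'.'.join(idx)}"] = value
    return combined


def toy_ref(filename):
    """Build a made-up single-file reference set with two chunks."""
    url = f"s3://example-bucket/{filename}"
    return {
        "version": 1,
        "refs": {
            ".zgroup": "{}",
            "analysed_sst/0.0.0": [url, 100, 50],
            "analysed_sst/0.0.1": [url, 150, 60],
        },
    }


combined = combine_along_time([toy_ref("day1.nc"), toy_ref("day2.nc")])
```

Note that no data bytes are copied: the combined reference set still points at byte ranges in the original files.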


In [49]:
%%time
# Open the portion of the MUR record corresponding to the reference file created:
data = opendf_kerchunk(json.load(open("ref_combined_2019.json")), fs)

CPU times: user 6 s, sys: 649 ms, total: 6.65 s
Wall time: 6.97 s


In [50]:
data

Output (condensed from the Dataset display; one row per array, in display order):

| Data type | Shape | Array size | Chunk shape | Chunk size | Dask chunks |
|---|---|---|---|---|---|
| float64 | (365, 17999, 36000) | 1.72 TiB | (1, 1023, 2047) | 15.98 MiB | 118260 |
| float64 | (365, 17999, 36000) | 1.72 TiB | (1, 1023, 2047) | 15.98 MiB | 118260 |
| timedelta64[ns] | (365, 17999, 36000) | 1.72 TiB | (1, 1447, 2895) | 31.96 MiB | 61685 |
| float32 | (365, 17999, 36000) | 881.06 GiB | (1, 1447, 2895) | 15.98 MiB | 61685 |
| float64 | (365, 17999, 36000) | 1.72 TiB | (1, 1447, 2895) | 31.96 MiB | 61685 |
| float64 | (365, 17999, 36000) | 1.72 TiB | (1, 1023, 2047) | 15.98 MiB | 118260 |
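The chunk counts and sizes reported in the Dataset summary above can be checked by hand: the number of chunks is the product of `ceil(dimension / chunk length)` over the dimensions, and the sizes follow from the element count times the bytes per element. `describe` is a hypothetical helper written for this check.

```python
import math


def describe(shape, chunks, itemsize):
    """Return (number of chunks, chunk size in MiB, array size in TiB)."""
    n_chunks = math.prod(math.ceil(s / c) for s, c in zip(shape, chunks))
    chunk_mib = math.prod(chunks) * itemsize / 2**20
    array_tib = math.prod(shape) * itemsize / 2**40
    return n_chunks, chunk_mib, array_tib


shape = (365, 17999, 36000)

# float64 array (8 bytes per value) with the smaller chunking
n_small, mib_small, tib = describe(shape, (1, 1023, 2047), itemsize=8)

# float64 array with the larger chunking
n_large, mib_large, _ = describe(shape, (1, 1447, 2895), itemsize=8)

print(n_small, round(mib_small, 2), round(tib, 2))
print(n_large, round(mib_large, 2))
```

Each daily file contributes 18 x 18 = 324 chunks under the smaller chunking and 13 x 13 = 169 under the larger one, which is where the totals for 365 days come from.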


The data will open quickly now that we have the reference file. Compare that to an attempt at opening these same files with `Xarray` the "traditional" way with a call to `xr.open_mfdataset()`. On a smaller machine, the following line of code will either fail or take a long (possibly very long) amount of time:

In [None]:
## You can try running this but your notebook will probably crash:
# data = xr.open_mfdataset(fobjs[:365])

## 2. Generate the same MUR reference file but in PARQUET format
Instead of re-creating all the individual reference files, this section loads the combined 2019 reference file, re-saves it in parquet format, and uses it to open the MUR data. It also demonstrates the smaller disk space required for this format.

In [58]:
ref_combined_2019 = json.load(open("ref_combined_2019.json"))

In [61]:
%%time
# Save reference info to parquet:
refs_to_dataframe(ref_combined_2019, "ref_combined_2019.parq")

CPU times: user 2.61 s, sys: 43.2 ms, total: 2.65 s
Wall time: 2.65 s


In [63]:
%%time
data = opendf_kerchunk("ref_combined_2019.parq", fs)
data

CPU times: user 15.2 ms, sys: 3.9 ms, total: 19.1 ms
Wall time: 78.2 ms


Output (condensed from the Dataset display; one row per array, in display order):

| Data type | Shape | Array size | Chunk shape | Chunk size | Dask chunks |
|---|---|---|---|---|---|
| float64 | (365, 17999, 36000) | 1.72 TiB | (1, 1023, 2047) | 15.98 MiB | 118260 |
| float64 | (365, 17999, 36000) | 1.72 TiB | (1, 1023, 2047) | 15.98 MiB | 118260 |
| timedelta64[ns] | (365, 17999, 36000) | 1.72 TiB | (1, 1447, 2895) | 31.96 MiB | 61685 |
| float32 | (365, 17999, 36000) | 881.06 GiB | (1, 1447, 2895) | 15.98 MiB | 61685 |
| float64 | (365, 17999, 36000) | 1.72 TiB | (1, 1447, 2895) | 31.96 MiB | 61685 |
| float64 | (365, 17999, 36000) | 1.72 TiB | (1, 1023, 2047) | 15.98 MiB | 118260 |


In [67]:
## Compare size of JSON vs parquet, printed in MB
    # JSON
print("JSON:", os.path.getsize("ref_combined_2019.json")/10**6, "MB")
    # parquet
size_parq = 0 
for path, dirs, files in os.walk("ref_combined_2019.parq"):
    for f in files:
        fp = os.path.join(path, f)
        size_parq += os.path.getsize(fp)
print("PARQUET:", size_parq/10**6, "MB")

JSON: 77.881403 MB
PARQUET: 2.502714 MB
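The size comparison above can be wrapped in a small helper that works for both a single file (the JSON) and a directory tree (the parquet output). `size_mb` is a hypothetical helper, demonstrated here on throwaway files; the ratio at the end uses the two sizes printed above.

```python
import os
import tempfile


def size_mb(path):
    """Size in MB (10**6 bytes) of a file, or of all files under a directory."""
    if os.path.isfile(path):
        return os.path.getsize(path) / 10**6
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total / 10**6


# Demonstration on a temporary directory containing 1.5 MB of zero bytes.
demo_dir = tempfile.mkdtemp()
os.makedirs(os.path.join(demo_dir, "sub"))
with open(os.path.join(demo_dir, "a.bin"), "wb") as f:
    f.write(b"\0" * 1_000_000)
with open(os.path.join(demo_dir, "sub", "b.bin"), "wb") as f:
    f.write(b"\0" * 500_000)
demo_size = size_mb(demo_dir)

# Ratio of the JSON and parquet sizes printed above.
ratio = 77.881403 / 2.502714
```

For this data set the parquet reference is roughly 31 times smaller than the JSON one.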


## 3. Combining reference files
This section demonstrates, with two examples, that reference files can be combined:

1. A single reference file (for the first day of 2020) is appended to the combined reference file for 2019 generated in the previous section.
2. A second year-long combined reference file is created for all of 2020 and combined with the 2019 reference file.

In both cases, a key result is that creating the final product by combining existing reference files takes much less time than creating it from scratch.

### 3.1 Adding an extra day of the MUR record to our existing reference file.

In [24]:
%%time
# Create reference file for first day in 2020:
ref_add, endpoint_add = single_ref_earthaccess(fobjs[365])

name_ref_add = endpoint_add.split('/')[-1].replace('.nc', '.json')
with open(name_ref_add, 'w') as outf:
    outf.write(ujson.dumps(ref_add))

CPU times: user 428 ms, sys: 62 ms, total: 490 ms
Wall time: 4.54 s


In [44]:
%%time

# Add it to the combined reference file:
kwargs_mzz = {'remote_protocol':"s3", 'remote_options':fs.storage_options, 'concat_dims':["time"]}
mzz = MultiZarrToZarr(["ref_combined_2019.json", name_ref_add], **kwargs_mzz)
ref_combined_add1day = mzz.translate()

# Save reference info to JSON:
with open("ref_combined_add1day.json", 'wb') as outf:
    outf.write(ujson.dumps(ref_combined_add1day).encode())

CPU times: user 4.56 s, sys: 374 ms, total: 4.94 s
Wall time: 5.23 s


**Appending an additional file does not take much time!**


In [46]:
%%time
# Open data using new reference file:
data = opendf_kerchunk(json.load(open("ref_combined_add1day.json")), fs)

CPU times: user 6.67 s, sys: 615 ms, total: 7.29 s
Wall time: 7.47 s


In [47]:
print(len(data["time"]))
data

366


Output (condensed from the Dataset display; one row per array, in display order):

| Data type | Shape | Array size | Chunk shape | Chunk size | Dask chunks |
|---|---|---|---|---|---|
| float64 | (366, 17999, 36000) | 1.73 TiB | (1, 1023, 2047) | 15.98 MiB | 118584 |
| float64 | (366, 17999, 36000) | 1.73 TiB | (1, 1023, 2047) | 15.98 MiB | 118584 |
| timedelta64[ns] | (366, 17999, 36000) | 1.73 TiB | (1, 1447, 2895) | 31.96 MiB | 61854 |
| float32 | (366, 17999, 36000) | 883.47 GiB | (1, 1447, 2895) | 15.98 MiB | 61854 |
| float64 | (366, 17999, 36000) | 1.73 TiB | (1, 1447, 2895) | 31.96 MiB | 61854 |
| float64 | (366, 17999, 36000) | 1.73 TiB | (1, 1023, 2047) | 15.98 MiB | 118584 |


### 3.2 Combining two year-long combined reference files
Individual files for 2020 are created and combined into a single reference file, then this file is combined with the 2019 reference file. As before, parallel computing is used to speed up creation of the files, but this could also be accomplished with a for-loop. 

In [28]:
## Save reference JSONs in this directory:
dir_refs_indv_2020 = './reference_jsons_individual_2020/'
!mkdir $dir_refs_indv_2020

In [29]:
%%time

## --------------------------------------------
## Create single reference files for 2020 with parallel computing using Coiled
## --------------------------------------------

# Wrap `single_ref_earthaccess` into a coiled function:
single_ref_earthaccess_par = coiled.function(
    region="us-west-2", spot_policy="on-demand", 
    vm_type="m6i.large", n_workers=10
    )(single_ref_earthaccess)

# Begin computations:
results = single_ref_earthaccess_par.map(fobjs[365:])

# Save results to JSONs as they become available:
for reference, endpoint in results:
    name_ref = dir_refs_indv_2020 + endpoint.split('/')[-1].replace('.nc', '.json')
    with open(name_ref, 'w') as outf:
        outf.write(ujson.dumps(reference))


CPU times: user 3.23 s, sys: 264 ms, total: 3.49 s
Wall time: 4min 34s


In [30]:
single_ref_earthaccess_par.cluster.shutdown()

In [31]:
%%time

## --------------------------------------------
## Create combined reference file for 2020
## --------------------------------------------

ref_files_indv = [dir_refs_indv_2020+f for f in os.listdir(dir_refs_indv_2020) if f.endswith('.json')]
ref_files_indv.sort()

## Combined reference file
kwargs_mzz = {'remote_protocol':"s3", 'remote_options':fs.storage_options, 'concat_dims':["time"]}
mzz = MultiZarrToZarr(ref_files_indv, **kwargs_mzz)
ref_combined = mzz.translate()

# Save reference info to JSON:
with open("ref_combined_2020.json", 'wb') as outf:
    outf.write(ujson.dumps(ref_combined).encode())

CPU times: user 7.27 s, sys: 763 ms, total: 8.04 s
Wall time: 1min 45s


In [51]:
%%time

## --------------------------------------------
## Create combined reference file for 2019 and 2020
## --------------------------------------------

kwargs_mzz = {'remote_protocol':"s3", 'remote_options':fs.storage_options, 'concat_dims':["time"]}
mzz = MultiZarrToZarr(["ref_combined_2019.json", "ref_combined_2020.json"], **kwargs_mzz)
ref_combined_2years = mzz.translate()

# Save reference info to JSON:
with open("ref_combined_2019-2020.json", 'wb') as outf:
    outf.write(ujson.dumps(ref_combined_2years).encode())

CPU times: user 8.21 s, sys: 748 ms, total: 8.96 s
Wall time: 9.18 s


***Note the large difference in computation time between creating the 2020 combined reference file from the individual reference files (1 min 45 s) and combining the two year-long reference files for 2019 and 2020 (about 9 s). The latter is much faster!***
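To put a number on that difference, the wall times reported by `%%time` can be converted to seconds and compared. `wall_seconds` is a hypothetical helper that only understands the minutes-and-seconds strings seen in this notebook (it does not handle milliseconds or hours).

```python
import re


def wall_seconds(text):
    """Convert a wall-time string such as '1min 45s' or '9.18 s' to seconds."""
    minutes = re.search(r"(\d+)\s*min", text)
    seconds = re.search(r"([\d.]+)\s*s\b", text)
    total = 0.0
    if minutes:
        total += 60 * int(minutes.group(1))
    if seconds:
        total += float(seconds.group(1))
    return total


# Wall times copied from the cell outputs above.
from_individual = wall_seconds("1min 45s")   # combining 2020 individual references
from_yearly = wall_seconds("9.18 s")         # combining the two year-long references

speedup = from_individual / from_yearly
print(f"Combining two year-long references was about {speedup:.0f}x faster")
```

The comparison is between single runs, so the exact factor will vary from one run and machine to another.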

In [34]:
%%time
# Open data using new reference file:
data = opendf_kerchunk(json.load(open("ref_combined_2019-2020.json")), fs)

CPU times: user 13.6 s, sys: 1.68 s, total: 15.3 s
Wall time: 15.6 s


In [36]:
print(len(data["time"]))
data

732


Output (condensed from the Dataset display; one row per array, in display order):

| Data type | Shape | Array size | Chunk shape | Chunk size | Dask chunks |
|---|---|---|---|---|---|
| float64 | (732, 17999, 36000) | 3.45 TiB | (1, 1023, 2047) | 15.98 MiB | 237168 |
| float64 | (732, 17999, 36000) | 3.45 TiB | (1, 1023, 2047) | 15.98 MiB | 237168 |
| timedelta64[ns] | (732, 17999, 36000) | 3.45 TiB | (1, 1447, 2895) | 31.96 MiB | 123708 |
| float32 | (732, 17999, 36000) | 1.73 TiB | (1, 1447, 2895) | 15.98 MiB | 123708 |
| float64 | (732, 17999, 36000) | 3.45 TiB | (1, 1447, 2895) | 31.96 MiB | 123708 |
| float64 | (732, 17999, 36000) | 3.45 TiB | (1, 1023, 2047) | 15.98 MiB | 237168 |
