# Test 3: Appending reference files together.

#### Two tests:

1. Add a single reference file to an existing combined reference file. This would be useful if PO.DAAC were maintaining a combined reference for a forward-streaming data set. Rather than recreating the entire reference file from scratch, can we append the new file to the existing combined reference?
2. Concatenate multiple combined reference files of similar size. This would be useful if PO.DAAC hosted combined reference files for, e.g., each year of a record, so that users could combine those references as needed.

All work was performed in the cloud, using earthaccess to locate file objects and access endpoints. The data set used is MUR 0.01 degree.

#### Results

1. A single MUR reference file was successfully appended, and the resulting combined reference file works. Appending the file took about 2 seconds, confirming that updating a reference file (e.g. for a forward-streaming collection) is feasible. Performance (e.g. accessing, subsetting, taking the mean) of the appended combined reference file was the same as for a single reference file and for a combined reference file built directly from the individual files with no appending.
2. Five combined MUR reference files were created, one for each of the first five years of the MUR record. Combining those five reference files into a single five-year reference file took about 20 seconds. The Kerchunk documentation confirms that this "tree reduction" method of creating the end product (that is, creating several yearly references, then creating the multi-year reference) is more efficient.
3. Performance (e.g. accessing, subsetting, taking the mean) of the five-year reference file is the same as for a single-year reference file. This confirms that, for example, PO.DAAC could produce one reference file per year and the user could combine them with only a small amount of overhead.

## Install packages

To install kerchunk, use:
```
!pip install git+https://github.com/fsspec/kerchunk
```

In [1]:
import os
import fsspec
import kerchunk
from kerchunk.hdf import SingleHdf5ToZarr
from kerchunk.combine import MultiZarrToZarr
import ujson
import xarray as xr
import earthaccess
import coiled

In [2]:
earthaccess.login()
shortname = "MUR-JPL-L4-GLOB-v4.1"
granule_info = earthaccess.search_data(
    short_name=shortname,
    #temporal=("2019-01-01", "2019-12-31"),
    count=(365*5)
    )


Granules found: 8102


In [3]:
fobjs = earthaccess.open(granule_info)

Opening 1825 granules, approx size: 614.38 GB
using endpoint: https://archive.podaac.earthdata.nasa.gov/s3credentials



# 1. Add a single reference file to an existing combined reference file
First, a combined reference file for the first 180 days of MUR data is created. Then, a single reference file for the 181st day is created, and we attempt to append it to the combined ref file. The results are tested/validated.
## 1.1 Create a combined reference file for first 180 days of MUR 0.01 degree data

In [3]:
## Store reference JSONs in these directories:
dir_refs_indv = './reference_jsons_individual/'
dir_refs_comb = './reference_jsons_combined/'

In [7]:
!mkdir $dir_refs_indv
!mkdir $dir_refs_comb

In [8]:
def single_ref_earthaccess(fobj):
    """
    Inputs
    ------
    fobj: earthaccess.store.EarthAccessFile object
        Obtained from a call to earthaccess.open().
    """
    endpoint = fobj.full_name
    reference = SingleHdf5ToZarr(fobj, endpoint, inline_threshold=0).translate()
    return reference, endpoint # returns both the kerchunk reference and the path to the file on podaac-ops-cumulus-protected

In [34]:
%%time

## --------------------------------------------
## Create single reference files with parallel computing using Coiled
## --------------------------------------------

# Wrap `single_ref_earthaccess` into a coiled function:
single_ref_earthaccess_par = coiled.function(
    region="us-west-2", spot_policy="on-demand", 
    vm_type="m6i.large", n_workers=50
    )(single_ref_earthaccess)

# Begin computations:
fobjs_process = fobjs[:180]
results = single_ref_earthaccess_par.map(fobjs_process)

# Save results to JSONs as they become available:
for reference, endpoint in results:
    name_ref = dir_refs_indv + endpoint.split('/')[-1].replace('.nc', '.json')
    with open(name_ref, 'w') as outf:
        outf.write(ujson.dumps(reference))

single_ref_earthaccess_par.cluster.shutdown()

CPU times: user 3.11 s, sys: 198 ms, total: 3.31 s
Wall time: 1min 33s


In [5]:
## List of all single ref files created
ref_files_indv = [dir_refs_indv+f for f in os.listdir(dir_refs_indv) if f.endswith('.json')]
ref_files_indv.sort()
ref_files_indv[:5]

['./reference_jsons_individual/20190101090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.json',
 './reference_jsons_individual/20190102090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.json',
 './reference_jsons_individual/20190103090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.json',
 './reference_jsons_individual/20190104090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.json',
 './reference_jsons_individual/20190105090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.json']
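
Before combining, it can be useful to confirm that no daily file is missing from the sorted list. The sketch below is an illustrative check and not part of the original workflow. It assumes each file name begins with a `YYYYMMDD` date stamp, as in the listing above, and the helper name `find_date_gaps` is hypothetical.

```python
import os
from datetime import datetime, timedelta

def find_date_gaps(paths):
    """Return the dates missing between the first and last file in `paths`.

    Assumes every base name starts with a YYYYMMDD date stamp.
    """
    dates = sorted(
        datetime.strptime(os.path.basename(p)[:8], "%Y%m%d").date() for p in paths
    )
    present = set(dates)
    missing = []
    day = dates[0]
    while day <= dates[-1]:
        if day not in present:
            missing.append(day)
        day += timedelta(days=1)
    return missing

# Three file names following the pattern above, with 3 January deliberately left out:
example = [
    "./reference_jsons_individual/20190101090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.json",
    "./reference_jsons_individual/20190102090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.json",
    "./reference_jsons_individual/20190104090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.json",
]
gaps = find_date_gaps(example)
print(gaps)  # the one missing date, 2019-01-03
```

In this notebook the same call could be made as `find_date_gaps(ref_files_indv)`; an empty list would indicate a gap-free daily sequence.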

In [6]:
## Get AWS creds
fs = earthaccess.get_s3fs_session(daac="PODAAC")

In [7]:
%%time

## --------------------------------------------
## Create combined reference file
## --------------------------------------------

## Combined reference file
mzz = MultiZarrToZarr(
    ref_files_indv,
    remote_protocol="s3",
    remote_options=fs.storage_options,
    concat_dims=["time"], 
    )
ref_combined = mzz.translate()

# Save reference info to JSON:
name_refcombined = dir_refs_comb + shortname + "_combined.json"
with open(name_refcombined, 'wb') as outf:
    outf.write(ujson.dumps(ref_combined).encode())

CPU times: user 7.1 s, sys: 753 ms, total: 7.85 s
Wall time: 2min 5s


In [8]:
%%time

## --------------------------------------------
## Test combined reference file
## --------------------------------------------

name_refcombined = dir_refs_comb + shortname + "_combined.json"
data = xr.open_dataset(
    "reference://", engine="zarr", chunks={},
    backend_kwargs={
        "storage_options": {
            "fo": name_refcombined,
            "remote_protocol": "s3",
            "remote_options": fs.storage_options
            },
        "consolidated": False
        }
)
data

CPU times: user 1.35 s, sys: 132 ms, total: 1.48 s
Wall time: 1.8 s


Dask array summaries from the dataset repr (variable names were not preserved in this export):

| Array bytes | Chunk bytes | Array shape | Chunk shape | Dask chunks | Data type |
|---|---|---|---|---|---|
| 1.72 TiB | 15.98 MiB | (365, 17999, 36000) | (1, 1023, 2047) | 118260 | float64 |
| 1.72 TiB | 15.98 MiB | (365, 17999, 36000) | (1, 1023, 2047) | 118260 | float64 |
| 1.72 TiB | 31.96 MiB | (365, 17999, 36000) | (1, 1447, 2895) | 61685 | timedelta64[ns] |
| 881.06 GiB | 15.98 MiB | (365, 17999, 36000) | (1, 1447, 2895) | 61685 | float32 |
| 1.72 TiB | 31.96 MiB | (365, 17999, 36000) | (1, 1447, 2895) | 61685 | float64 |
| 1.72 TiB | 15.98 MiB | (365, 17999, 36000) | (1, 1023, 2047) | 118260 | float64 |


## 1.2 Create an additional single reference file and try to append it to the combined one

In [19]:
ref_add, endpoint_add = single_ref_earthaccess(fobjs[180])

name_ref_add = dir_refs_indv + endpoint_add.split('/')[-1].replace('.nc', '.json')
with open(name_ref_add, 'w') as outf:
    outf.write(ujson.dumps(ref_add))

In [21]:
%%time

## --------------------------------------------
## Try to concatenate single ref file to the combined one
## --------------------------------------------

mzz = MultiZarrToZarr(
    [name_refcombined, name_ref_add],
    remote_protocol="s3",
    remote_options=fs.storage_options,
    concat_dims=["time"], 
    )
ref_combined_testing = mzz.translate()

# Save reference info to JSON:
name_testing = dir_refs_comb + shortname + "_TESTING_combined.json"
with open(name_testing, 'wb') as outf:
    outf.write(ujson.dumps(ref_combined_testing).encode())

CPU times: user 2.08 s, sys: 168 ms, total: 2.24 s
Wall time: 2.35 s
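
One quick way to confirm that the append added a time step is to compare the leading dimension recorded in the reference metadata before and after. The sketch below is illustrative only: it assumes the kerchunk version 1 reference layout (a top-level `"refs"` mapping whose `<variable>/.zarray` entry holds the Zarr array metadata, usually as a JSON string), and the names `leading_dim` and `toy_reference` are hypothetical. The toy dictionaries stand in for real references.

```python
import json

def leading_dim(reference, var):
    """Length of the first dimension of `var` according to its .zarray metadata."""
    meta = reference["refs"][f"{var}/.zarray"]
    if isinstance(meta, str):  # metadata is usually stored as a JSON string
        meta = json.loads(meta)
    return meta["shape"][0]

def toy_reference(n_time):
    """Minimal stand-in for a combined reference holding `n_time` time steps.

    Real references also contain chunk entries of the form [url, offset, length].
    """
    zarray = {"shape": [n_time, 17999, 36000], "chunks": [1, 1023, 2047]}
    return {"version": 1, "refs": {"analysed_sst/.zarray": json.dumps(zarray)}}

before, after = toy_reference(180), toy_reference(181)
grew_by = leading_dim(after, "analysed_sst") - leading_dim(before, "analysed_sst")
print(grew_by)  # 1
```

Under the same assumption, `leading_dim(ref_combined_testing, "analysed_sst")` would be expected to exceed `leading_dim(ref_combined, "analysed_sst")` by one.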


In [38]:
%%time

## --------------------------------------------
## Test combined reference file
## --------------------------------------------

name_refcombined = dir_refs_comb + shortname + "_TESTING_combined.json"
data = xr.open_dataset(
    "reference://", engine="zarr", chunks={},
    backend_kwargs={
        "storage_options": {
            "fo": name_refcombined,
            "remote_protocol": "s3",
            "remote_options": fs.storage_options
            },
        "consolidated": False
        }
)
data

CPU times: user 1.2 s, sys: 60.2 ms, total: 1.26 s
Wall time: 1.39 s


Dask array summaries from the dataset repr (variable names were not preserved in this export):

| Array bytes | Chunk bytes | Array shape | Chunk shape | Dask chunks | Data type |
|---|---|---|---|---|---|
| 873.82 GiB | 15.98 MiB | (181, 17999, 36000) | (1, 1023, 2047) | 58644 | float64 |
| 873.82 GiB | 15.98 MiB | (181, 17999, 36000) | (1, 1023, 2047) | 58644 | float64 |
| 873.82 GiB | 31.96 MiB | (181, 17999, 36000) | (1, 1447, 2895) | 30589 | timedelta64[ns] |
| 436.91 GiB | 15.98 MiB | (181, 17999, 36000) | (1, 1447, 2895) | 30589 | float32 |
| 873.82 GiB | 31.96 MiB | (181, 17999, 36000) | (1, 1447, 2895) | 30589 | float64 |


## 1.3 Test/validate performance of appended ref file

#### First, test that the last file was appended correctly.
For the dataset opened with the appended reference file, take the last timestamp (corresponding to the appended file), and perform some computations on those data. Confirm that the same results are obtained if the netCDF file for that appended day is opened directly and computations performed.

In [8]:
data_appended = xr.open_dataset(
    "reference://", engine="zarr", chunks={},
    backend_kwargs={
        "storage_options": {
            "fo": dir_refs_comb + shortname + "_TESTING_combined.json",
            "remote_protocol": "s3",
            "remote_options": fs.storage_options
            },
        "consolidated": False
        }
)

In [13]:
%%time
## Subset to region and take mean:
data_appended['analysed_sst'].sel(time=data_appended["time"][-1], lat=slice(-10,10), lon=slice(-20,0)).mean().compute()

CPU times: user 192 ms, sys: 59.1 ms, total: 251 ms
Wall time: 446 ms


In [17]:
## Compare against the original single netCDF file:
data_single_compare = xr.open_dataset(fobjs[180])

In [18]:
%%time
data_single_compare['analysed_sst'].sel(lat=slice(-10,10), lon=slice(-20,0)).mean().compute()

CPU times: user 226 ms, sys: 53.6 ms, total: 279 ms
Wall time: 510 ms
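
The two cells above each return a mean, and the comparison is made by eye. A programmatic check could look like the sketch below. The helper name `means_agree` and the tolerance are assumptions, and the numbers in the example are placeholders, not results from this notebook.

```python
import math

def means_agree(mean_from_reference, mean_from_netcdf, rel_tol=1e-6):
    """True if two scalar means match to within a relative tolerance."""
    return math.isclose(
        float(mean_from_reference), float(mean_from_netcdf), rel_tol=rel_tol
    )

# Placeholder numbers for illustration only; they are not results from this notebook.
print(means_agree(300.125, 300.125))  # True
print(means_agree(300.125, 300.5))    # False
```

In the notebook, the two arguments would be the computed means from `data_appended` and `data_single_compare`.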


#### Next, test that the appended reference file performs as well as the reference file before appending (i.e. the one created from the first 180 days)

In [None]:
## Open dataset with ref file for original, first 180 days:
data_original = xr.open_dataset(
    "reference://", engine="zarr", chunks={},
    backend_kwargs={
        "storage_options": {
            "fo": dir_refs_comb + shortname + "_combined.json",
            "remote_protocol": "s3",
            "remote_options": fs.storage_options
            },
        "consolidated": False
        }
)

**Subset to temporal and spatial region, then compute mean**

In [27]:
%%time
data_appended['analysed_sst'].sel(time=data_appended["time"][:20], lat=slice(-20,20), lon=slice(-20,10)).mean().compute()

CPU times: user 12.1 s, sys: 523 ms, total: 12.6 s
Wall time: 5.78 s


In [26]:
%%time
data_original['analysed_sst'].sel(time=data_original["time"][:20], lat=slice(-20,20), lon=slice(-20,10)).mean().compute()

CPU times: user 12.1 s, sys: 628 ms, total: 12.7 s
Wall time: 6.04 s


# 2. Concatenating yearly combined reference files into one large combined reference
Combined reference files for each of the first 5 years of MUR data are created (one combined ref file per year). Then the speed at which these yearly ref files can be combined, and the performance of the resulting output, is tested. 

#### Create all individual ref files for first five years

In [10]:
%%time

## --------------------------------------------
## Create single reference files with parallel computing using Coiled
## --------------------------------------------

# Wrap `single_ref_earthaccess` into a coiled function:
single_ref_earthaccess_par = coiled.function(
    region="us-west-2", spot_policy="on-demand", 
    vm_type="t4g.large", n_workers=100
    )(single_ref_earthaccess)

# Begin computations:
fobjs_process = fobjs[:365*5]
results = single_ref_earthaccess_par.map(fobjs_process)

# Save results to JSONs as they become available:
for reference, endpoint in results:
    name_ref = dir_refs_indv + endpoint.split('/')[-1].replace('.nc', '.json')
    with open(name_ref, 'w') as outf:
        outf.write(ujson.dumps(reference))

single_ref_earthaccess_par.cluster.shutdown()

CPU times: user 12.9 s, sys: 671 ms, total: 13.6 s
Wall time: 4min 24s


In [11]:
ref_files_indv = [dir_refs_indv+f for f in os.listdir(dir_refs_indv) if f.endswith('.json')]
ref_files_indv.sort()
ref_files_indv[:5]

['./reference_jsons_individual/20020601090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.json', './reference_jsons_individual/20020602090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.json', './reference_jsons_individual/20020603090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.json', './reference_jsons_individual/20020604090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.json', './reference_jsons_individual/20020605090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.json']


#### Create combined ref files for each year

In [None]:
## Get AWS creds
fs = earthaccess.get_s3fs_session(daac="PODAAC")

In [17]:
%%time

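for i in range(1,6):
    # Combine the i-th block of 365 daily reference files along the time dimension:
    mzz = MultiZarrToZarr(
        ref_files_indv[365*(i-1):365*(i)],
        remote_protocol="s3",
        remote_options=fs.storage_options,
        concat_dims=["time"],
        )
    ref_combined = mzz.translate()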
    
     # Save reference info to JSON:
    fname = dir_refs_comb + shortname + "_year" + str(i).zfill(2) + "_combined.json"
    with open(fname, 'wb') as outf:
        outf.write(ujson.dumps(ref_combined).encode())
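
The loop above slices fixed blocks of 365 files. Because the record starts on 2002-06-01 (see the listing above), each "year" is a 365-day block rather than a calendar year, and leap days shift the block boundaries. If calendar-year files were wanted instead, the inputs could be grouped by the year in the file name. The sketch below is an alternative, not what this notebook ran. The helper name `group_by_year` and the example file names are hypothetical.

```python
import os
from itertools import groupby

def year_of(path):
    """Calendar year taken from the first four characters of the file name."""
    return os.path.basename(path)[:4]

def group_by_year(paths):
    """Map each calendar year to the list of files whose names start with it."""
    ordered = sorted(paths, key=os.path.basename)  # groupby needs sorted input
    return {year: list(files) for year, files in groupby(ordered, key=year_of)}

# Hypothetical file names that straddle a year boundary:
example = [
    "./refs/20030101090000-example.json",
    "./refs/20021231090000-example.json",
    "./refs/20030102090000-example.json",
]
groups = group_by_year(example)
print({year: len(files) for year, files in groups.items()})
```

Each list in `groups` could then be passed to `MultiZarrToZarr` in place of the 365-file slice.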

#### Create a single combined ref file for all five years
Create the ref file from the yearly ref files, rather than from all the individual (daily) ref files.

In [6]:
## Use only the yearly combined reference files as inputs:
ref_files_comb = [dir_refs_comb+f for f in os.listdir(dir_refs_comb) if f.endswith('_combined.json') and '_year' in f]
ref_files_comb.sort()

In [23]:
%%time

mzz = MultiZarrToZarr(
    ref_files_comb,
    remote_protocol="s3",
    remote_options=fs.storage_options,
    concat_dims=["time"], 
    )
ref_combined = mzz.translate()

# Save reference info to JSON:
fname = dir_refs_comb + shortname + "_allyears" "_combined.json"
with open(fname, 'wb') as outf:
    outf.write(ujson.dumps(ref_combined).encode())

CPU times: user 17.6 s, sys: 1.52 s, total: 19.1 s
Wall time: 19.1 s
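
The two-stage approach used here (daily references combined into yearly ones, yearly ones combined into a multi-year one) is a tree reduction. The sketch below shows the general pattern with a generic `combine` callable. It is illustrative only: `tree_reduce` and `concat_lists` are hypothetical names, and list concatenation stands in for `MultiZarrToZarr(...).translate()`.

```python
def tree_reduce(items, combine, batch_size):
    """Combine `items` in batches, then combine the batch results, until one remains."""
    if batch_size < 2:
        raise ValueError("batch_size must be at least 2")
    level = list(items)
    if not level:
        raise ValueError("items must not be empty")
    while len(level) > 1:
        # Each pass replaces every batch of inputs with one combined result.
        level = [
            combine(level[i:i + batch_size])
            for i in range(0, len(level), batch_size)
        ]
    return level[0]

def concat_lists(batch):
    """Stand-in for combining references: concatenate the lists in `batch`."""
    merged = []
    for part in batch:
        merged.extend(part)
    return merged

daily = [[day] for day in range(10)]   # ten single-day stand-ins
combined = tree_reduce(daily, concat_lists, batch_size=4)
print(combined)  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```

In this notebook the first level corresponds to batches of 365 daily files and the second level to the five yearly files.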


#### Test results
Compare computation time for the reference file from one year to the reference file for all five years. The computation is the mean of a temporal/spatial subset.

In [10]:
%%time

data_allyears = xr.open_dataset(
    "reference://", engine="zarr", chunks={},
    backend_kwargs={
        "storage_options": {
            "fo": dir_refs_comb + shortname + "_allyears" "_combined.json",
            "remote_protocol": "s3",
            "remote_options": fs.storage_options
            },
        "consolidated": False
        }
)
data_allyears

CPU times: user 14 ms, sys: 7.31 ms, total: 21.3 ms
Wall time: 196 ms


Dask array summaries from the dataset repr (variable names were not preserved in this export):

| Array bytes | Chunk bytes | Array shape | Chunk shape | Dask chunks | Data type |
|---|---|---|---|---|---|
| 8.60 TiB | 15.98 MiB | (1825, 17999, 36000) | (1, 1023, 2047) | 591300 | float64 |
| 8.60 TiB | 15.98 MiB | (1825, 17999, 36000) | (1, 1023, 2047) | 591300 | float64 |
| 4.30 TiB | 15.98 MiB | (1825, 17999, 36000) | (1, 1447, 2895) | 308425 | float32 |
| 8.60 TiB | 31.96 MiB | (1825, 17999, 36000) | (1, 1447, 2895) | 308425 | float64 |


In [9]:
%%time

data_oneyear = xr.open_dataset(
    "reference://", engine="zarr", chunks={},
    backend_kwargs={
        "storage_options": {
            "fo": ref_files_comb[3],
            "remote_protocol": "s3",
            "remote_options": fs.storage_options
            },
        "consolidated": False
        }
)
data_oneyear

CPU times: user 17 ms, sys: 3.51 ms, total: 20.5 ms
Wall time: 88.7 ms


Dask array summaries from the dataset repr (variable names were not preserved in this export):

| Array bytes | Chunk bytes | Array shape | Chunk shape | Dask chunks | Data type |
|---|---|---|---|---|---|
| 1.72 TiB | 15.98 MiB | (365, 17999, 36000) | (1, 1023, 2047) | 118260 | float64 |
| 1.72 TiB | 15.98 MiB | (365, 17999, 36000) | (1, 1023, 2047) | 118260 | float64 |
| 881.06 GiB | 15.98 MiB | (365, 17999, 36000) | (1, 1447, 2895) | 61685 | float32 |
| 1.72 TiB | 31.96 MiB | (365, 17999, 36000) | (1, 1447, 2895) | 61685 | float64 |


In [12]:
%%time
data_oneyear['analysed_sst'].sel(time=data_oneyear["time"][:100], lat=slice(-20,20), lon=slice(-20,10)).mean().compute()

CPU times: user 58.4 s, sys: 2.37 s, total: 1min
Wall time: 52.5 s


In [13]:
%%time
data_allyears['analysed_sst'].sel(time=data_oneyear["time"][:100], lat=slice(-20,20), lon=slice(-20,10)).mean().compute()

CPU times: user 57.3 s, sys: 2.86 s, total: 1min
Wall time: 48 s
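
The timings above come from a single `%%time` run each, and wall times of this kind vary from run to run. Repeating each computation and keeping the shortest time gives a steadier comparison. The sketch below is one way to do that with the standard library. The name `best_wall_time` is hypothetical and the workload is a stand-in.

```python
import time

def best_wall_time(func, repeats=3):
    """Run `func` several times and return the shortest wall time in seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        timings.append(time.perf_counter() - start)
    return min(timings)

# Stand-in workload; in the notebook this would be a lambda wrapping a `.compute()` call.
elapsed = best_wall_time(lambda: sum(range(100_000)))
print(f"{elapsed:.4f} s")
```

Repeated runs may also benefit from caching, so the first and later timings are worth inspecting separately if they differ widely.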
