# Test 6: Saving reference file in parquet format

## Summary

For larger datasets, the combined reference file in JSON format can become large; e.g. for MUR 0.01 degree, the reference file for the entire dataset is estimated at 1-2 GB. This notebook compares the reference file size in JSON format vs. parquet, and verifies that the parquet file works as expected.

The reference files are created using kerchunk, and earthaccess is used to access NASA Earthdata. First, daily reference files are created in JSON format for the first 5 years of the record. Then, yearly combined reference files are created in both JSON and parquet formats, and their size and performance are compared. Lastly, a 5-year combined reference file is created from the yearly reference files, once for JSON and once for parquet, and the comparison is repeated.

## Results

* Parquet reference files were created successfully and worked with xarray.
* JSON and parquet computation performance is about the same (roughly 1 minute wall time for the test computation in each case), and the parquet files are about 30x smaller on disk.
* Creating the 5-year combined ref file from the yearly ref files took about 3x as long for parquet as for JSON (about 55 seconds vs. 16 seconds).
* Parallel computing with parquet ...

## Questions

* Since a parquet reference is a directory structure rather than a single file, how would a data store work with this? Especially if we wanted to stream the contents of the reference file from our buckets to a local machine.

## Install packages

To install kerchunk, we used:
```
!pip install git+https://github.com/fsspec/kerchunk
```
The fastparquet package was also needed to save in parquet format:
```
!pip install fastparquet
```
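
Before running the cells below, it can help to confirm that the required packages are importable. The following is a minimal sketch using only the standard library; the helper name `missing_packages` is hypothetical and not part of kerchunk or earthaccess.

```python
import importlib.util

def missing_packages(names):
    """Return the subset of top-level package `names` that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]

# Top-level packages imported in this notebook, plus fastparquet for parquet output.
required = ["fsspec", "kerchunk", "ujson", "xarray", "earthaccess", "coiled", "fastparquet"]
print("Missing packages:", missing_packages(required))
```

An empty list means every package was found; any names printed would need to be installed first.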

In [19]:
import os
import fsspec
import kerchunk
from kerchunk.df import refs_to_dataframe
from kerchunk.hdf import SingleHdf5ToZarr
from kerchunk.combine import MultiZarrToZarr
import ujson
import xarray as xr
import earthaccess
import coiled

In [3]:
earthaccess.login()
shortname = "MUR-JPL-L4-GLOB-v4.1"
granule_info = earthaccess.search_data(
    short_name=shortname,
    #temporal=("2019-01-01", "2019-12-31"),
    count=(365*5)
    )

Enter your Earthdata Login username:  deanh808
Enter your Earthdata password:  ········


Granules found: 8102


In [4]:
fobjs = earthaccess.open(granule_info)

Opening 1825 granules, approx size: 614.38 GB
using endpoint: https://archive.podaac.earthdata.nasa.gov/s3credentials



## 1. Create all individual ref files for first five years

In [20]:
## Store reference JSONs in these directories:
dir_refs_indv = './reference_jsons_individual/'
dir_refs_comb = './reference_jsons_combined/'

In [6]:
!mkdir $dir_refs_indv
!mkdir $dir_refs_comb

In [7]:
def single_ref_earthaccess(fobj):
    """
    Inputs
    ------
    fobj: earthaccess.store.EarthAccessFile object
        Obtained from a call to earthaccess.open().
    """
    endpoint = fobj.full_name
    reference = SingleHdf5ToZarr(fobj, endpoint, inline_threshold=0).translate()
    return reference, endpoint  # the kerchunk reference and the path to the file on podaac-ops-cumulus-protected

In [8]:
%%time

## --------------------------------------------
## Create single reference files with parallel computing using Coiled
## --------------------------------------------

# Wrap `single_ref_earthaccess` into a Coiled function:
single_ref_earthaccess_par = coiled.function(
    region="us-west-2", spot_policy="on-demand", 
    vm_type="t4g.large", n_workers=100
    )(single_ref_earthaccess)

# Begin computations:
fobjs_process = fobjs[:365*5]
results = single_ref_earthaccess_par.map(fobjs_process)

# Save results to JSONs as they become available:
for reference, endpoint in results:
    name_ref = dir_refs_indv + endpoint.split('/')[-1].replace('.nc', '.json')
    with open(name_ref, 'w') as outf:
        outf.write(ujson.dumps(reference))

single_ref_earthaccess_par.cluster.shutdown()

CPU times: user 10.8 s, sys: 584 ms, total: 11.4 s
Wall time: 5min 4s
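
The output name of each individual reference file is derived from the granule's endpoint. A minimal sketch of that mapping with `posixpath`, assuming `fobj.full_name` is an S3-style path ending in `.nc`; the helper `ref_name_from_endpoint` and the example bucket prefix are illustrative and not taken from earthaccess.

```python
import posixpath

def ref_name_from_endpoint(endpoint, out_dir):
    """Map a granule endpoint ending in '.nc' to a JSON reference path in out_dir."""
    base = posixpath.basename(endpoint)
    stem, ext = posixpath.splitext(base)
    if ext != ".nc":
        raise ValueError(f"expected a .nc granule, got: {base}")
    return posixpath.join(out_dir, stem + ".json")

# Illustrative endpoint (the bucket/prefix layout is an assumption).
endpoint = (
    "s3://podaac-ops-cumulus-protected/MUR-JPL-L4-GLOB-v4.1/"
    "20020601090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc"
)
ref_path = ref_name_from_endpoint(endpoint, "./reference_jsons_individual/")
print(ref_path)
```

Unlike `str.replace('.nc', '.json')`, `splitext` only touches the final suffix, so a `.nc` substring elsewhere in the name would be left alone.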


In [9]:
ref_files_indv = [dir_refs_indv+f for f in os.listdir(dir_refs_indv) if f.endswith('.json')]
ref_files_indv.sort()
ref_files_indv[:5]

['./reference_jsons_individual/20020601090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.json',
 './reference_jsons_individual/20020602090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.json',
 './reference_jsons_individual/20020603090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.json',
 './reference_jsons_individual/20020604090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.json',
 './reference_jsons_individual/20020605090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.json']

## 2. Create yearly combined ref files in JSON and parquet then compare

### 2.1 Create ref files

In [21]:
fs = earthaccess.get_s3fs_session(daac="PODAAC")

In [14]:
%%time

for i in range(1,6):
    # Combine the i-th block of 365 daily reference files:
    mzz = MultiZarrToZarr(
        ref_files_indv[365*(i-1):365*(i)],
        remote_protocol="s3",
        remote_options=fs.storage_options,
        concat_dims=["time"],
        )
    ref_combined = mzz.translate()

    # Save reference info to JSON:
    fname_json = dir_refs_comb + shortname + "_year" + str(i).zfill(2) + "_combined.json"
    with open(fname_json, 'wb') as outf:
        outf.write(ujson.dumps(ref_combined).encode())

    # Save reference info to parquet:
    fname_parq = dir_refs_comb + shortname + "_year" + str(i).zfill(2) + "_combined.parq"
    refs_to_dataframe(ref_combined, fname_parq)

CPU times: user 37.6 s, sys: 2.13 s, total: 39.7 s
Wall time: 7min 50s
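
The loop above takes fixed windows of 365 consecutive files per "year" rather than calendar years, so a leap day inside the record shifts the later window boundaries by a day. A minimal sketch of the windowing logic on its own; `fixed_windows` is a hypothetical helper and the file names are stand-ins.

```python
def fixed_windows(items, size):
    """Split `items` into consecutive windows of `size` elements (the last may be shorter)."""
    return [items[start:start + size] for start in range(0, len(items), size)]

# Stand-in names for the 1825 daily reference files.
files = [f"day{n:04d}.json" for n in range(365 * 5)]
yearly = fixed_windows(files, 365)

for i, window in enumerate(yearly, start=1):
    print(f"year{i:02d}: {window[0]} .. {window[-1]} ({len(window)} files)")
```

Each element of `yearly` matches the slice `ref_files_indv[365*(i-1):365*i]` used in the loop.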


In [22]:
# Note: os.listdir() returns entries in arbitrary order, so this list is not sorted by year.
refs_json_1year_combined = [
    dir_refs_comb+f for f in os.listdir(dir_refs_comb)
    if f.endswith(".json")
    ]
refs_json_1year_combined

['./reference_jsons_combined/MUR-JPL-L4-GLOB-v4.1_year05_combined.json',
 './reference_jsons_combined/MUR-JPL-L4-GLOB-v4.1_year01_combined.json',
 './reference_jsons_combined/MUR-JPL-L4-GLOB-v4.1_year04_combined.json',
 './reference_jsons_combined/MUR-JPL-L4-GLOB-v4.1_year03_combined.json',
 './reference_jsons_combined/MUR-JPL-L4-GLOB-v4.1_year02_combined.json']

In [23]:
# Note: os.listdir() returns entries in arbitrary order, so this list is not sorted by year.
refs_parq_1year_combined = [
    dir_refs_comb+f for f in os.listdir(dir_refs_comb)
    if f.endswith(".parq")
    ]
refs_parq_1year_combined

['./reference_jsons_combined/MUR-JPL-L4-GLOB-v4.1_year02_combined.parq',
 './reference_jsons_combined/MUR-JPL-L4-GLOB-v4.1_year05_combined.parq',
 './reference_jsons_combined/MUR-JPL-L4-GLOB-v4.1_year01_combined.parq',
 './reference_jsons_combined/MUR-JPL-L4-GLOB-v4.1_year03_combined.parq',
 './reference_jsons_combined/MUR-JPL-L4-GLOB-v4.1_year04_combined.parq']
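
`os.listdir` returns entries in arbitrary order, which is why the two lists above are not ordered by year and why index `[0]` points to a different year in each. A minimal sketch of a sorted listing; `list_refs` is a hypothetical helper and the temporary directory stands in for `dir_refs_comb`.

```python
import os
import tempfile

def list_refs(directory, suffix):
    """Return sorted paths in `directory` whose names end with `suffix`."""
    return sorted(
        os.path.join(directory, name)
        for name in os.listdir(directory)
        if name.endswith(suffix)
    )

# Demonstrate on a temporary directory that stands in for dir_refs_comb.
with tempfile.TemporaryDirectory() as tmp:
    for year in (5, 1, 4, 3, 2):
        open(os.path.join(tmp, f"year{year:02d}_combined.json"), "w").close()
        os.mkdir(os.path.join(tmp, f"year{year:02d}_combined.parq"))
    json_names = [os.path.basename(p) for p in list_refs(tmp, ".json")]
    parq_names = [os.path.basename(p) for p in list_refs(tmp, ".parq")]

print(json_names)
print(parq_names)
```

With sorted lists, the same index refers to the same year in both the JSON and parquet listings.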

In [18]:
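## Compare size of JSON vs parquet.
## Note: element [0] may be a different year in each list
## (the listings above show year05 first for JSON and year02 first for parquet).
# JSON (single file)
print(os.path.getsize(refs_json_1year_combined[0])/10**6) # in MB
# parquet (directory tree, so sum over all files)
size_parq = 0
for path, dirs, files in os.walk(refs_parq_1year_combined[0]):
    for f in files:
        fp = os.path.join(path, f)
        size_parq += os.path.getsize(fp)
print(size_parq/10**6) # in MB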

59.152846
1.845972
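
The size comparison above treats the JSON reference as a single file and the parquet reference as a directory tree. A minimal sketch of one helper that handles both cases; `path_size_mb` is a hypothetical name, and the temporary files below are stand-ins for real reference files.

```python
import os
import tempfile

def path_size_mb(path):
    """Size in MB (10**6 bytes) of a file, or of all files under a directory."""
    if os.path.isfile(path):
        return os.path.getsize(path) / 10**6
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total / 10**6

# Self-check on temporary data: one 2000-byte file, one directory holding 3000 bytes.
with tempfile.TemporaryDirectory() as tmp:
    single = os.path.join(tmp, "refs.json")
    with open(single, "wb") as f:
        f.write(b"x" * 2000)
    nested = os.path.join(tmp, "refs.parq", "var")
    os.makedirs(nested)
    for i, n in enumerate((1000, 2000)):
        with open(os.path.join(nested, f"refs.{i}.parq"), "wb") as f:
            f.write(b"x" * n)
    size_file = path_size_mb(single)
    size_dir = path_size_mb(os.path.join(tmp, "refs.parq"))

print(size_file, size_dir)
```

In the notebook, the same helper could be applied to an entry of `refs_json_1year_combined` and an entry of `refs_parq_1year_combined`.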


### 2.2 Test and compare ref files

In [31]:
%%time

## JSON:
data_from_json = xr.open_dataset(
    "reference://", engine="zarr", chunks={},
    backend_kwargs={
        "storage_options": {
            "fo": refs_json_1year_combined[0],
            "remote_protocol": "s3",
            "remote_options": fs.storage_options
            },
        "consolidated": False
        }
)
data_from_json

CPU times: user 1.13 s, sys: 96.5 ms, total: 1.23 s
Wall time: 1.43 s


Dataset repr (dask arrays of the four data variables):

| Bytes | Shape | Chunk shape | Chunk bytes | Dask graph | Data type |
|---|---|---|---|---|---|
| 1.72 TiB | (365, 17999, 36000) | (1, 1023, 2047) | 15.98 MiB | 118260 chunks in 2 graph layers | float64 numpy.ndarray |
| 1.72 TiB | (365, 17999, 36000) | (1, 1023, 2047) | 15.98 MiB | 118260 chunks in 2 graph layers | float64 numpy.ndarray |
| 881.06 GiB | (365, 17999, 36000) | (1, 1447, 2895) | 15.98 MiB | 61685 chunks in 2 graph layers | float32 numpy.ndarray |
| 1.72 TiB | (365, 17999, 36000) | (1, 1447, 2895) | 31.96 MiB | 61685 chunks in 2 graph layers | float64 numpy.ndarray |
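
The chunk counts and array sizes reported in the repr above follow directly from the array shape and the chunk shape. A minimal worked check, assuming edge chunks are counted as whole chunks; `n_chunks` is a hypothetical helper, not a dask function.

```python
import math

def n_chunks(shape, chunks):
    """Number of chunks needed to tile `shape` with blocks of size `chunks`."""
    return math.prod(math.ceil(s / c) for s, c in zip(shape, chunks))

shape = (365, 17999, 36000)
count_small = n_chunks(shape, (1, 1023, 2047))   # 365 * 18 * 18
count_large = n_chunks(shape, (1, 1447, 2895))   # 365 * 13 * 13
print(count_small, count_large)

# Uncompressed size of one float64 variable (8 bytes per value), in TiB.
size_tib = math.prod(shape) * 8 / 2**40
print(round(size_tib, 2))
```

These reproduce the 118260 and 61685 chunk counts and the 1.72 TiB size shown for the float64 variables.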


In [33]:
%%time
data_from_json['analysed_sst'].sel(time=data_from_json["time"][:100], lat=slice(-20,20), lon=slice(-20,10)).mean().compute()

CPU times: user 51.8 s, sys: 16.8 s, total: 1min 8s
Wall time: 1min 5s


In [36]:
%%time

## parquet:
data_from_parq = xr.open_dataset(
    "reference://", engine="zarr", chunks={},
    backend_kwargs={
        "storage_options": {
            "fo": refs_parq_1year_combined[0],
            "remote_protocol": "s3",
            "remote_options": fs.storage_options
            },
        "consolidated": False
        }
)
data_from_parq

CPU times: user 19.9 ms, sys: 6.18 ms, total: 26.1 ms
Wall time: 292 ms


Dataset repr (dask arrays of the four data variables):

| Bytes | Shape | Chunk shape | Chunk bytes | Dask graph | Data type |
|---|---|---|---|---|---|
| 1.72 TiB | (365, 17999, 36000) | (1, 1023, 2047) | 15.98 MiB | 118260 chunks in 2 graph layers | float64 numpy.ndarray |
| 1.72 TiB | (365, 17999, 36000) | (1, 1023, 2047) | 15.98 MiB | 118260 chunks in 2 graph layers | float64 numpy.ndarray |
| 881.06 GiB | (365, 17999, 36000) | (1, 1447, 2895) | 15.98 MiB | 61685 chunks in 2 graph layers | float32 numpy.ndarray |
| 1.72 TiB | (365, 17999, 36000) | (1, 1447, 2895) | 31.96 MiB | 61685 chunks in 2 graph layers | float64 numpy.ndarray |


In [37]:
%%time
data_from_parq['analysed_sst'].sel(time=data_from_parq["time"][:100], lat=slice(-20,20), lon=slice(-20,10)).mean().compute()

CPU times: user 50.9 s, sys: 15.1 s, total: 1min 5s
Wall time: 1min 8s


## 3. Create reference files for the entire 5 years in JSON and parquet, then compare

### 3.1 Create ref files

In [24]:
%%time

## JSON
mzz = MultiZarrToZarr(
    refs_json_1year_combined,
    remote_protocol="s3",
    remote_options=fs.storage_options,
    concat_dims=["time"], 
    )
ref_combined = mzz.translate()

# Save reference info to JSON:
fname = dir_refs_comb + shortname + "_allyears_combined.json"
with open(fname, 'wb') as outf:
    outf.write(ujson.dumps(ref_combined).encode())

CPU times: user 14.5 s, sys: 1.36 s, total: 15.9 s
Wall time: 15.8 s


In [25]:
%%time

## parquet
mzz = MultiZarrToZarr(
    refs_parq_1year_combined,
    remote_protocol="s3",
    remote_options=fs.storage_options,
    concat_dims=["time"], 
    )
ref_combined = mzz.translate()

# Save reference info to parquet:
fname = dir_refs_comb + shortname + "_allyears_combined.parq"
refs_to_dataframe(ref_combined, fname)

CPU times: user 53.6 s, sys: 2.33 s, total: 56 s
Wall time: 54.7 s
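
The "about 3x" figure quoted in the results comes from the two wall times reported above. A small worked check; note that each value is a single run, so the ratio is indicative only.

```python
# Wall times reported by the two cells above, in seconds.
wall_json = 15.8
wall_parq = 54.7

ratio = wall_parq / wall_json
print(f"parquet / JSON wall time: {ratio:.1f}x")
```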


In [26]:
## Compare size of JSON vs parquet
# JSON (single file)
print(os.path.getsize(dir_refs_comb + shortname + "_allyears_combined.json")/10**6) # in MB
# parquet (directory tree, so sum over all files)
size_parq = 0
for path, dirs, files in os.walk(dir_refs_comb + shortname + "_allyears_combined.parq"):
    for f in files:
        fp = os.path.join(path, f)
        size_parq += os.path.getsize(fp)
print(size_parq/10**6) # in MB

296.994684
9.22168
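
The two sizes printed above give the JSON-to-parquet size ratio, and they can be scaled to the full record as a rough check on the 1-2 GB estimate in the summary. The sketch below assumes that reference size grows linearly with the number of granules, which is an assumption and not something this notebook measures.

```python
# Sizes printed above for the 5-year (1825-granule) combined references, in MB.
json_mb = 296.994684
parq_mb = 9.22168
n_refs = 365 * 5   # granules covered by the combined reference
n_total = 8102     # "Granules found" reported by earthaccess.search_data

size_ratio = json_mb / parq_mb
print(f"JSON / parquet size ratio: {size_ratio:.1f}")

# Assumption: reference size scales linearly with the number of granules.
json_full_gb = json_mb / n_refs * n_total / 1000
parq_full_mb = parq_mb / n_refs * n_total
print(f"Estimated full-record JSON reference: {json_full_gb:.2f} GB")
print(f"Estimated full-record parquet reference: {parq_full_mb:.0f} MB")
```

Under that assumption the full-record JSON reference comes out at roughly 1.3 GB, in line with the 1-2 GB estimate, and the parquet reference at roughly 41 MB.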


### 3.2 Test and compare reference files

In [29]:
%%time

data_allyears_json = xr.open_dataset(
    "reference://", engine="zarr", chunks={},
    backend_kwargs={
        "storage_options": {
            "fo": dir_refs_comb + shortname + "_allyears_combined.json",
            "remote_protocol": "s3",
            "remote_options": fs.storage_options
            },
        "consolidated": False
        }
)
data_allyears_json

CPU times: user 16 ms, sys: 4.03 ms, total: 20 ms
Wall time: 203 ms


Dataset repr (dask arrays of the four data variables):

| Bytes | Shape | Chunk shape | Chunk bytes | Dask graph | Data type |
|---|---|---|---|---|---|
| 8.60 TiB | (1825, 17999, 36000) | (1, 1023, 2047) | 15.98 MiB | 591300 chunks in 2 graph layers | float64 numpy.ndarray |
| 8.60 TiB | (1825, 17999, 36000) | (1, 1023, 2047) | 15.98 MiB | 591300 chunks in 2 graph layers | float64 numpy.ndarray |
| 4.30 TiB | (1825, 17999, 36000) | (1, 1447, 2895) | 15.98 MiB | 308425 chunks in 2 graph layers | float32 numpy.ndarray |
| 8.60 TiB | (1825, 17999, 36000) | (1, 1447, 2895) | 31.96 MiB | 308425 chunks in 2 graph layers | float64 numpy.ndarray |


In [31]:
%%time
data_allyears_json['analysed_sst'].sel(time=data_allyears_json["time"][:100], lat=slice(-20,20), lon=slice(-20,10)).mean().compute()

CPU times: user 52.1 s, sys: 2.47 s, total: 54.6 s
Wall time: 1min 5s


In [30]:
%%time

data_allyears_parq = xr.open_dataset(
    "reference://", engine="zarr", chunks={},
    backend_kwargs={
        "storage_options": {
            "fo": dir_refs_comb + shortname + "_allyears_combined.parq",
            "remote_protocol": "s3",
            "remote_options": fs.storage_options
            },
        "consolidated": False
        }
)
data_allyears_parq

CPU times: user 25.4 ms, sys: 3.87 ms, total: 29.3 ms
Wall time: 346 ms


Dataset repr (dask arrays of the four data variables):

| Bytes | Shape | Chunk shape | Chunk bytes | Dask graph | Data type |
|---|---|---|---|---|---|
| 8.60 TiB | (1825, 17999, 36000) | (1, 1023, 2047) | 15.98 MiB | 591300 chunks in 2 graph layers | float64 numpy.ndarray |
| 8.60 TiB | (1825, 17999, 36000) | (1, 1023, 2047) | 15.98 MiB | 591300 chunks in 2 graph layers | float64 numpy.ndarray |
| 4.30 TiB | (1825, 17999, 36000) | (1, 1447, 2895) | 15.98 MiB | 308425 chunks in 2 graph layers | float32 numpy.ndarray |
| 8.60 TiB | (1825, 17999, 36000) | (1, 1447, 2895) | 31.96 MiB | 308425 chunks in 2 graph layers | float64 numpy.ndarray |


In [32]:
%%time
data_allyears_parq['analysed_sst'].sel(time=data_allyears_parq["time"][:100], lat=slice(-20,20), lon=slice(-20,10)).mean().compute()

CPU times: user 52.3 s, sys: 2.16 s, total: 54.5 s
Wall time: 57.8 s
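
The wall times above come from single runs and differ by several seconds between otherwise similar computations, so repeating each measurement gives a better sense of the spread. A minimal sketch using `time.perf_counter`; `time_repeated` is a hypothetical helper, and the lambda is a stand-in for the `.mean().compute()` call.

```python
import statistics
import time

def time_repeated(func, repeats=3):
    """Call `func` `repeats` times and return the list of wall times in seconds."""
    times = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        times.append(time.perf_counter() - start)
    return times

# Stand-in workload; in the notebook this would wrap the .mean().compute() call.
times = time_repeated(lambda: sum(range(100_000)), repeats=3)
print(f"mean {statistics.mean(times):.4f} s, spread {max(times) - min(times):.4f} s")
```

Repeated runs against remote data may benefit from caching, so the first run is worth reporting separately.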


## 4. Testing parquet reference file with parallel computing

Test on both local and distributed clusters.