<img src="https://raw.githubusercontent.com/EO-College/cubes-and-clouds/main/icons/cnc_3icons_process_circle.svg"
     alt="Cubes & Clouds logo"
     style="float: center; margin-right: 10px;" />

<img src="https://raw.githubusercontent.com/pangeo-data/pangeo.io/refs/heads/main/public/Pangeo-assets/pangeo_logo.png"
     alt="Pangeo logo"
     style="float: center; margin-right: 10px;" />

# 2.4 Data chunking with Pangeo

## Data chunking

<div class="alert alert-info">
<i class="fa-question-circle fa" style="font-size: 22px;color:#666;"></i> <b>Overview</b>
    <br>
    <br>
    <b>Questions</b>
    <ul>
        <li>Why does chunking matter?</li>
        <li>How can I read datasets by chunks to optimize memory usage?</li>
    </ul>
    <b>Objectives</b>
    <ul>
        <li>Learn about chunking</li>
        <li>Learn about Zarr</li>
    </ul>
</div>

## Context

As explained in 2.4_formats_and_performance, when dealing with large data files or collections, it is often impossible to load all the data you want to analyze into the RAM of a single computer at once. This is a situation the Pangeo ecosystem is well suited to. We learned about lazy loading in 2.3_data_access. Xarray can also work lazily on data __chunks__, i.e. pieces of a whole dataset. By reading a dataset in __chunks__, we can process our data piece by piece on a single computer, or on a distributed computing cluster (e.g. cloud or HPC) using Dask.

How these __chunks__ are processed in a parallel environment to scale your computation is discussed in [2.4 dask](./dask.ipynb). In this notebook, you will grasp the essence of a __chunk__ through various exercises.

When we process our data piece by piece, it is easier to have our input or output data also stored in __chunks__. As introduced in 2.4 formats_and_performance.md, [Zarr](https://zarr.readthedocs.io/en/stable/) is a cloud-native data format and the reference choice in the Pangeo ecosystem for storing our `Xarray` multi-dimensional datasets in __chunks__.

## Data
Let's start again with the same sample data retrieval method from the Sentinel-2 STAC collection as described in Exercise 2.3 Data Access Lazy Loading with Pangeo.  

The analysis is very similar to what we did in previous episodes, but we will use data from a larger geographic area to show how the approach scales.


## Load Libraries

In [1]:
import pystac_client
import geopandas as gpd
from shapely.geometry import mapping
import stackstac
import warnings
import xarray as xr
import numpy as np
import rioxarray as rio
warnings.filterwarnings("ignore")

In [2]:
%%time
aoi = gpd.read_file('../assets/catchment_outline.geojson', crs="EPSG:4326")
aoi_geojson = mapping(aoi.iloc[0].geometry)
URL = "https://earth-search.aws.element84.com/v1"
catalog = pystac_client.Client.open(URL)
items = catalog.search(
    intersects=aoi_geojson,
    collections=["sentinel-2-l2a"],
    datetime="2019-02-01/2019-04-28"
).item_collection()
sentinel2_l2a = stackstac.stack(items)

CPU times: user 649 ms, sys: 36.3 ms, total: 685 ms
Wall time: 10.8 s


In [3]:
sentinel2_l2a

| | Array | Chunk |
|---|---|---|
| Bytes | 5.42 TiB | 8.00 MiB |
| Shape | (101, 32, 20982, 10980) | (1, 1, 1024, 1024) |
| Dask graph | 746592 chunks in 3 graph layers | 746592 chunks in 3 graph layers |
| Data type | float64 numpy.ndarray | float64 numpy.ndarray |


## What is a __chunk__

If you look carefully at `sentinel2_l2a`, you will see that the `xarray.DataArray` is backed by a `dask.array` with a chunk size of `(1, 1, 1024, 1024)`. Loading the full data would bring an array of dimensions `(101, 32, 20982, 10980)`, made of 746 592 chunks and 5.42 TiB in total, into the computer's RAM.
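
The figures reported in the representation can be checked by hand. The sketch below assumes 8-byte `float64` values and uses illustrative names (`count_chunks`, `MIB`, `TIB`) that are not part of the notebook; it recomputes the chunk size, the total size and the number of chunks from the two shapes alone.

```python
import math

# Shapes taken from the representation of sentinel2_l2a above
array_shape = (101, 32, 20982, 10980)   # (time, band, y, x)
chunk_shape = (1, 1, 1024, 1024)
ITEMSIZE = 8                            # bytes per float64 value
MIB = 2**20
TIB = 2**40


def count_chunks(shape, chunks):
    """Number of chunks needed to tile `shape`, counting partial edge chunks."""
    return math.prod(math.ceil(n / c) for n, c in zip(shape, chunks))


chunk_mib = math.prod(chunk_shape) * ITEMSIZE / MIB
total_tib = math.prod(array_shape) * ITEMSIZE / TIB
n_chunks = count_chunks(array_shape, chunk_shape)

print(f"chunk size : {chunk_mib:.2f} MiB")
print(f"total size : {total_tib:.2f} TiB")
print(f"chunk count: {n_chunks}")
```

The count uses a ceiling division because the chunks on the last row and column of each image are smaller than 1024 pixels but still count as chunks.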

We can see the `dask.array` information by clicking the icon circled in blue in the image below.

![Dask.array](../assets/datasize.png)

By clicking the triangle icon circled in red, we can see detailed information on the `xarray.DataArray`, such as Coordinates, Indexes and Attributes.

When you create an Xarray object with `stackstac`, the STAC collection is turned into a lazy, chunked `xarray.DataArray` that is backed by Dask.

The size and shape of the chunks define how Dask parallelises the work, so picking a good chunk size has a significant effect on performance.

This is where understanding and using chunking correctly comes into play.

In our case, for the moment, we used stackstac without specifying the chunk size explicitly. The dataset is composed of chunks of 8 MiB, each containing 1 time step, 1 band, and 1024 x 1024 pixels in the x and y directions.

![chunk_original](../assets/chunk_original.png)

If the chunk size is too small, the workflow is divided into too many small pieces, which creates too much communication and scheduling overhead.
If the chunk size is too big, a chunk may not fit in the available memory and the workflow may die.

The right size of chunk depends on your computation and the machine you use.

Here, 8 MiB is very small compared to the RAM usually available. For example, Dask's default array chunk size is 128 MiB.

In [4]:
import dask
dask.config.get('array.chunk-size')

'128MiB'

## Modifying chunks

Let's try to modify our chunk size.

To modify the chunks of an existing `xarray.DataArray`, we can use the `chunk` method.
We know that we only need 3 bands to compute the snow index, so we select only `green`, `swir16` and `scl` to simplify our example.

We would like each time step to be in a separate chunk, all band information to be kept in one chunk, and Dask to compute the chunk size along the x and y coordinates.

In [5]:
sentinel2_l2a = sentinel2_l2a.sel(band=['green', 'swir16', 'scl']).chunk(
    chunks={'time': 1, 'band': 3, 'x': 'auto', 'y': 'auto'}
)
sentinel2_l2a

| | Array | Chunk |
|---|---|---|
| Bytes | 520.09 GiB | 96.00 MiB |
| Shape | (101, 3, 20982, 10980) | (1, 3, 2048, 2048) |
| Dask graph | 6666 chunks in 5 graph layers | 6666 chunks in 5 graph layers |
| Data type | float64 numpy.ndarray | float64 numpy.ndarray |

(Output truncated: the representation also lists every coordinate variable as a small Dask array, chunked as `(1,)` for the 101-element coordinates along `time` and as a single `(3,)` chunk for the coordinates along `band`.)

If you look at the representation above, you'll see that each chunk is bigger along the x and y coordinates, and that we have far fewer chunks (6666) than in the previous example. A chunk size of 96 MiB is already more manageable than the small 8 MiB chunks.

Note from the chunk size that the `auto` option computed 2048 as the optimal chunk size for both y and x, given that we keep the chunk sizes of time and band at 1 and 3 respectively.
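
The value 2048 is consistent with the 128 MiB target reported by `dask.config` earlier. The following back-of-the-envelope sketch is not Dask's actual heuristic: it assumes `float64` values, square chunks in y and x, and an edge length kept to a multiple of the original 1024. All names are illustrative.

```python
import math

LIMIT = 128 * 2**20      # Dask's default 'array.chunk-size' of 128 MiB, in bytes
ITEMSIZE = 8             # bytes per float64 value
TIME, BAND = 1, 3        # chunk lengths we fixed explicitly
BASE_EDGE = 1024         # edge of the chunks originally produced by stackstac

# Largest square edge whose chunk still fits under the limit
max_edge = math.isqrt(LIMIT // (ITEMSIZE * TIME * BAND))

# Round down to a multiple of the original chunk edge
edge = (max_edge // BASE_EDGE) * BASE_EDGE

chunk_mib = TIME * BAND * edge * edge * ITEMSIZE / 2**20
n_chunks = 101 * math.ceil(20982 / edge) * math.ceil(10980 / edge)

print(max_edge, edge, chunk_mib, n_chunks)
```

Under these assumptions the estimate gives an edge of 2048, chunks of 96 MiB and 6666 chunks, which matches the representation above.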


<div class="alert alert-warning">
    <i class="fa-check-circle fa" style="font-size: 22px;color:#666;"></i> <b>Go Further</b>
    <br>
    <br>
    You can try different ways of specifying chunks.
    <ul>
        <li> chunks = -1 -> the entire array will be used as a single chunk</li>
        <li> chunks = {'x':-1, 'y': 1000} -> chunks spanning the entire <i>x</i> dimension, but split every 1000 values along the <i>y</i> dimension</li>
        <li> chunks = {'x':-1, 'y': 'auto'} -> Xarray relies on Dask to choose an ideal size for the <i>y</i> dimension according to the preferred chunk sizes</li>
        <li> chunks = { 'x':-1 ,'y':"500MiB" } -> Xarray chooses the size along <i>y</i> to match a specific memory target, expressed in MiB</li>
        <li> chunks = ( 1, 3, 12048,2048) -> specifies the chunk size for each dimension, in dimension order</li>
    </ul>
</div>
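
To see what an explicit specification does to one dimension, the hypothetical helper below mimics how an integer or `-1` chunk length splits a dimension into blocks. It is a simplified illustration in plain Python and does not cover `'auto'` or memory targets such as `"500MiB"`, which Dask resolves itself.

```python
def block_lengths(dim_length, chunk):
    """Lengths of the blocks obtained when splitting one dimension.

    `chunk` is a positive integer, or -1 to keep the whole dimension
    in a single block.
    """
    if chunk == -1:
        return (dim_length,)
    full, remainder = divmod(dim_length, chunk)
    return (chunk,) * full + ((remainder,) if remainder else ())


# Dimension lengths of the Sentinel-2 array used in this notebook
sizes = {"y": 20982, "x": 10980}
spec = {"x": -1, "y": 1000}

blocks = {dim: block_lengths(sizes[dim], spec[dim]) for dim in spec}

print(len(blocks["x"]), blocks["x"][-1])   # number of blocks along x, last block
print(len(blocks["y"]), blocks["y"][-1])   # number of blocks along y, last block
```

With this specification, x stays in one block of 10980 values while y is split into 21 blocks, the last one holding the remaining 982 values.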

## Defining the chunks at the creation of the Xarray object

We can define the chunk size when we create the object.  
This is usually done with Xarray using the `chunks` kwarg when opening a file with `xr.open_dataset` or `xr.open_mfdataset`, if you create the Xarray object from local files.  
In our snow index example, we create the Xarray object with stackstac. As stackstac's default `chunksize` is 1024 for the x and y dimensions, that is the chunk size we got. We can pass the `chunksize` option to stackstac to make it bigger.


In [6]:
%%time
sentinel2_l2a = stackstac.stack(
    items,
    assets=['green', 'swir16', 'scl'],
    chunksize=(1, 3, 2048, 2048),
)
sentinel2_l2a

CPU times: user 27.1 ms, sys: 3.87 ms, total: 31 ms
Wall time: 30.1 ms


| | Array | Chunk |
|---|---|---|
| Bytes | 520.09 GiB | 96.00 MiB |
| Shape | (101, 3, 20982, 10980) | (1, 3, 2048, 2048) |
| Dask graph | 6666 chunks in 3 graph layers | 6666 chunks in 3 graph layers |
| Data type | float64 numpy.ndarray | float64 numpy.ndarray |


## So, why chunks?

As explained in 2.4_formats_and_performance, chunks are mandatory for accessing files or datasets that are bigger than a single computer's memory. If all the data has to be accessed, it can be done sequentially (e.g. chunks are processed one after the other).
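
As a minimal illustration of sequential processing, the sketch below computes a mean by visiting one chunk at a time and keeping only a running sum and count. It uses a plain Python list as a stand-in for the dataset, and `iter_chunks` and `chunked_mean` are hypothetical helpers, not Xarray or Dask functions.

```python
def iter_chunks(values, chunk_size):
    """Yield consecutive slices of `values` of at most `chunk_size` items."""
    for start in range(0, len(values), chunk_size):
        yield values[start:start + chunk_size]


def chunked_mean(values, chunk_size):
    """Mean computed by accumulating a partial sum and count per chunk."""
    total, count = 0.0, 0
    for chunk in iter_chunks(values, chunk_size):
        total += sum(chunk)
        count += len(chunk)
    return total / count


data = list(range(1, 1001))          # stand-in for a large dataset
result = chunked_mean(data, chunk_size=64)
print(result)
```

In a real workflow each chunk would be read lazily from storage instead of being sliced from a list already in memory, but the accumulation pattern is the same.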

Moreover, chunks allow for distributed processing and so increased speed for your data analysis, as seen in the next episode.

### Chunks and files

Xarray's chunking possibilities also rely on the underlying input or output file format. Most modern file formats allow a dataset or a single file to be stored in chunks. NetCDF4 uses chunks when storing a file on disk through its use of HDF5. Any read of data from a NetCDF4 file leads to loading at least one chunk of this file. So when Xarray reads one of its own chunks, as defined in the `open_dataset` call, it takes advantage of the native file chunking and does not have to read the entire file.


Yet, it is really important to note that __Xarray chunks and file chunks are not necessarily the same__. It is however a really good idea to configure Xarray chunks so that they align well with the input file's chunks (ideally, each Xarray chunk should contain one or several whole input file chunks).
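
To make the alignment argument concrete, the sketch below counts how many file chunks must be read to serve one Xarray chunk along a single dimension. The file chunk length of 1024 and the helper `file_chunks_touched` are assumptions chosen for illustration.

```python
FILE_CHUNK = 1024   # assumed chunk length of the file along one dimension


def file_chunks_touched(start, length, file_chunk=FILE_CHUNK):
    """Number of file chunks overlapping the half-open range [start, start + length)."""
    first = start // file_chunk
    last = (start + length - 1) // file_chunk
    return last - first + 1


# Aligned: an Xarray chunk of 2048 values starting on a file chunk boundary
aligned = file_chunks_touched(start=2048, length=2048)

# Misaligned: the third Xarray chunk when chunks are 1500 values long
misaligned = file_chunks_touched(start=3000, length=1500)

# Values actually read from the file versus values requested
aligned_ratio = aligned * FILE_CHUNK / 2048
misaligned_ratio = misaligned * FILE_CHUNK / 1500

print(aligned, aligned_ratio)
print(misaligned, round(misaligned_ratio, 2))
```

In the aligned case every value read is used. In the misaligned case about twice as much data is read as requested, and neighbouring Xarray chunks read some of the same file chunks again.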

## Zarr storage format

This brings us to our next subject, [Zarr](https://zarr.readthedocs.io/en/stable/).

If our original dataset is already chunked and can be accessed in a way that matches its actual byte layout on disk, we won't need to load the entire dataset every time, and our data analysis, even when working on the entire dataset, will be greatly optimized.

Let's convert our intermediate data into Zarr format so that we can learn what it is. We can keep the data as a DataArray or convert it into a Dataset before storing it.

We start again by loading data using stackstac, but this time we go on to the next steps, clipping the data and computing the snow index, and then try to save those intermediate results in a Zarr file.


## Load data using stackstac (with specific chunk) 

In [7]:
%%time
aoi = gpd.read_file("../assets/catchment_outline.geojson", crs="EPSG:4326")
aoi_geojson = mapping(aoi.iloc[0].geometry)
URL = "https://earth-search.aws.element84.com/v1"
catalog = pystac_client.Client.open(URL)
items = catalog.search(
    intersects=aoi_geojson,
    collections=["sentinel-2-l2a"],
    datetime="2019-02-01/2019-04-28"
).item_collection()
ds = stackstac.stack(
    items,
    assets=['green', 'swir16', 'scl'],
    chunksize=(1, 3, 1024, 1024),
)

CPU times: user 231 ms, sys: 15.2 ms, total: 246 ms
Wall time: 8.36 s


## Computing the snow index


In [8]:
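# Select the bands needed for the snow index and the cloud mask
green = ds.sel(band='green')
swir = ds.sel(band='swir16')
scl = ds.sel(band='scl')
# Normalized Difference Snow Index
ndsi = (green - swir) / (green + swir)
# Binarize: 1 where NDSI > 0.42, 0 elsewhere; NaN values are left unchanged
snow = xr.where((ndsi > 0.42) & ~np.isnan(ndsi), 1, ndsi)
snowmap = xr.where((snow <= 0.42) & ~np.isnan(snow), 0, snow)
# Flag SCL classes 3, 8 and 9 (cloud shadow and clouds) with the value 2
mask = np.logical_not(scl.isin([8, 9, 3]))
snow_cloud = xr.where(mask, snowmap, 2)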
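
The same decision rules can be written for a single pixel, which makes the meaning of the output values explicit. This plain-Python sketch uses a hypothetical function `classify_pixel`; treating a zero denominator as NaN is an assumption that mirrors the 0/0 case in array arithmetic.

```python
import math

SNOW_THRESHOLD = 0.42          # NDSI threshold used in the cell above
CLOUD_CLASSES = {3, 8, 9}      # SCL values masked in the cell above


def classify_pixel(green, swir16, scl):
    """Return 2 for cloud, 1 for snow, 0 for no snow, NaN when NDSI is undefined."""
    if scl in CLOUD_CLASSES:
        return 2
    denominator = green + swir16
    if denominator == 0 or math.isnan(denominator):
        return math.nan
    ndsi = (green - swir16) / denominator
    return 1 if ndsi > SNOW_THRESHOLD else 0


print(classify_pixel(green=0.8, swir16=0.1, scl=11))   # high NDSI, clear sky
print(classify_pixel(green=0.2, swir16=0.3, scl=4))    # negative NDSI, clear sky
print(classify_pixel(green=0.8, swir16=0.1, scl=9))    # cloudy pixel
```

The array version in the cell above applies these rules to every pixel of every chunk at once, and only when the result is actually computed.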

## Clip the data

In [9]:
aoi_utm32 = aoi.to_crs(epsg=32632)
geom_utm32 = aoi_utm32.iloc[0]['geometry']
snow_cloud.rio.write_crs("EPSG:32632", inplace=True)
snow_cloud.rio.set_nodata(np.nan, inplace=True)
snow_cloud = snow_cloud.rio.clip([geom_utm32])

## Let's save the intermediate result of a few days in Zarr format

In [10]:
snow_cloud_small=snow_cloud.isel(time=slice(0,3))
snow_cloud_small

| | Array | Chunk |
|---|---|---|
| Bytes | 201.11 MiB | 8.00 MiB |
| Shape | (3, 3341, 2630) | (1, 1024, 1024) |
| Dask graph | 36 chunks in 30 graph layers | 36 chunks in 30 graph layers |
| Data type | float64 numpy.ndarray | float64 numpy.ndarray |


### Before saving, we can test another chunk shape


In [11]:
snow_cloud_small=snow_cloud_small.chunk(chunks = {'x':'auto', 'y': 'auto'}).to_dataset(name='data')
snow_cloud_small

Dask summary of the `data` variable (the remaining arrays in the dataset each have shape (3,) and a single chunk):

|  | Array | Chunk |
|---|---|---|
| Bytes | 201.11 MiB | 67.04 MiB |
| Shape | (3, 3341, 2630) | (1, 3341, 2630) |
| Dask graph | 3 chunks in 31 graph layers | |
| Data type | float64 numpy.ndarray | |
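The numbers Dask reports for the two layouts (36 chunks of 8 MiB with `(1, 1024, 1024)`, 3 chunks of about 67 MiB with `'auto'`) follow from simple arithmetic on the array shape. The sketch below reproduces them in plain Python; `chunk_summary` is a hypothetical helper written for this illustration, not part of Xarray or Dask, and it assumes 8 bytes per value (float64).

```python
import math


def chunk_summary(shape, chunks, itemsize=8):
    """Return (number of chunks, bytes per full chunk, total bytes)."""
    # Number of chunks along each axis, rounding up for partial edge chunks.
    per_axis = [math.ceil(n / c) for n, c in zip(shape, chunks)]
    n_chunks = math.prod(per_axis)
    chunk_bytes = math.prod(chunks) * itemsize
    total_bytes = math.prod(shape) * itemsize
    return n_chunks, chunk_bytes, total_bytes


MIB = 1024 ** 2
shape = (3, 3341, 2630)  # (time, y, x) of snow_cloud_small

for chunks in [(1, 1024, 1024), (1, 3341, 2630)]:
    n, cb, tb = chunk_summary(shape, chunks)
    print(f"{chunks}: {n} chunks, {cb / MIB:.2f} MiB per chunk, {tb / MIB:.2f} MiB total")
```

Fewer, larger chunks mean fewer tasks and fewer files on disk, but each chunk has to fit comfortably in memory, so the right trade-off depends on the machine and on the access pattern.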


### Then clean the attributes and save to Zarr

In [12]:
%%time

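def remove_attrs(obj, to_remove):
    # Return a copy of obj whose attrs exclude the keys listed in to_remove
    new = obj.copy()
    new.attrs = {k: v for k, v in obj.attrs.items() if k not in to_remove}
    return new

def encode(obj):
    # Drop object-dtype coordinates and the "spec" and "transform" attributes before writing
    object_coords = [name for name, coord in obj.coords.items() if coord.dtype.kind == "O"]
    return obj.drop_vars(object_coords).pipe(remove_attrs, ["spec", "transform"])


# mode='w' overwrites any existing store at this path
snow_cloud_small.pipe(encode).to_zarr('test.zarr', mode='w')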

CPU times: user 2.6 s, sys: 351 ms, total: 2.95 s
Wall time: 10.3 s


<xarray.backends.zarr.ZarrStore at 0x7f9be8094040>
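Zarr format 2 stores attributes as JSON (the `.zattrs` files), so attribute values that cannot be serialized to JSON have to be removed or converted before writing. The sketch below shows one way to find such attributes; it assumes that `spec` and `transform` are dropped above for this reason. `json_safe_attrs` and `FakeRasterSpec` are hypothetical names, and the attribute values are made up for the illustration.

```python
import json


def json_safe_attrs(attrs):
    """Split attrs into a JSON-serializable dict and a list of rejected keys."""
    kept, dropped = {}, []
    for key, value in attrs.items():
        try:
            json.dumps(value)
        except (TypeError, ValueError):
            dropped.append(key)
        else:
            kept[key] = value
    return kept, dropped


class FakeRasterSpec:
    """Hypothetical stand-in for an attribute value that JSON cannot encode."""


# Illustrative attributes only; not the real attributes of snow_cloud_small.
attrs = {"crs": "epsg:32632", "resolution": 20.0, "spec": FakeRasterSpec()}
kept, dropped = json_safe_attrs(attrs)
print("kept:", kept)
print("dropped:", dropped)
```

Running a check like this on `obj.attrs` before calling `to_zarr` can help decide which keys to pass to `remove_attrs`.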

<div class="alert alert-warning">
    <i class="fa-check-circle fa" style="font-size: 22px;color:#666;"></i> <b>Exercise</b>
    <br>
    <ul>
        <li>What about saving the data in NetCDF format?</li>
        <li>You can explore the Zarr store you just created using <code>ls -la test.zarr</code> and <code>ls -la test.zarr/data</code></li>
        <li>You can explore the Zarr metadata file with <code>cat test.zarr/.zmetadata</code></li>
        <li>Did you find the <b>chunks</b> we defined previously in your Zarr store?</li>
    </ul>
</div>

## Let's compare how the Zarr and NetCDF files are stored

In [13]:
xr.open_zarr('test.zarr').to_netcdf('test.nc')

In [14]:
!du -sh test.zarr/ test.nc

1.7M	test.zarr/
202M	test.nc


In [15]:
!ls  -la test.zarr/

total 188
drwxrwxr-x 39 6ecd4b8f-1e28-4f75-8d0a-6a67d88aa718 users  6144 Jan 31 06:19 .
drwxr-xr-x  5 6ecd4b8f-1e28-4f75-8d0a-6a67d88aa718 users  6144 Jan 31 06:19 ..
-rw-rw-r--  1 6ecd4b8f-1e28-4f75-8d0a-6a67d88aa718 users     2 Jan 31 06:19 .zattrs
-rw-rw-r--  1 6ecd4b8f-1e28-4f75-8d0a-6a67d88aa718 users    24 Jan 31 06:19 .zgroup
-rw-rw-r--  1 6ecd4b8f-1e28-4f75-8d0a-6a67d88aa718 users 23291 Jan 31 06:19 .zmetadata
drwxrwxr-x  2 6ecd4b8f-1e28-4f75-8d0a-6a67d88aa718 users  6144 Jan 31 06:19 band
drwxrwxr-x  2 6ecd4b8f-1e28-4f75-8d0a-6a67d88aa718 users  6144 Jan 31 06:19 constellation
drwxrwxr-x  2 6ecd4b8f-1e28-4f75-8d0a-6a67d88aa718 users  6144 Jan 31 06:19 created
drwxrwxr-x  2 6ecd4b8f-1e28-4f75-8d0a-6a67d88aa718 users  6144 Jan 31 06:19 data
drwxrwxr-x  2 6ecd4b8f-1e28-4f75-8d0a-6a67d88aa718 users  6144 Jan 31 06:19 earthsearch:boa_offset_applied
drwxrwxr-x  2 6ecd4b8f-1e28-4f75-8d0a-6a67d88aa718 users  6144 Jan 31 06:19 earthsearch:payload_id
drwxrwxr-x  2 6ecd4b8f-1e28-4f75-8d0

In [16]:
!ls  -la test.zarr/data

total 1132
drwxrwxr-x  2 6ecd4b8f-1e28-4f75-8d0a-6a67d88aa718 users   6144 Jan 31 06:19 .
drwxrwxr-x 39 6ecd4b8f-1e28-4f75-8d0a-6a67d88aa718 users   6144 Jan 31 06:19 ..
-rw-rw-r--  1 6ecd4b8f-1e28-4f75-8d0a-6a67d88aa718 users    369 Jan 31 06:19 .zarray
-rw-rw-r--  1 6ecd4b8f-1e28-4f75-8d0a-6a67d88aa718 users    653 Jan 31 06:19 .zattrs
-rw-rw-r--  1 6ecd4b8f-1e28-4f75-8d0a-6a67d88aa718 users 366508 Jan 31 06:19 0.0.0
-rw-rw-r--  1 6ecd4b8f-1e28-4f75-8d0a-6a67d88aa718 users 462125 Jan 31 06:19 1.0.0
-rw-rw-r--  1 6ecd4b8f-1e28-4f75-8d0a-6a67d88aa718 users 309608 Jan 31 06:19 2.0.0
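The file sizes in this listing explain most of the gap between `test.zarr` (1.7M) and `test.nc` (202M). A rough estimate of the compression ratio can be worked out from the listing alone, as sketched below. The likely cause, stated here as an assumption, is that the chunks were written with a compressor while `to_netcdf` was called without any compression settings, and that the clipped area outside the AOI is filled with NaN and compresses very well.

```python
# File sizes in bytes, copied from the `ls -la test.zarr/data` listing.
chunk_file_sizes = {"0.0.0": 366508, "1.0.0": 462125, "2.0.0": 309608}

# Uncompressed size of one chunk: 1 x 3341 x 2630 float64 values.
raw_chunk_bytes = 1 * 3341 * 2630 * 8

stored_bytes = sum(chunk_file_sizes.values())
raw_bytes = raw_chunk_bytes * len(chunk_file_sizes)
ratio = raw_bytes / stored_bytes

print(f"stored: {stored_bytes / 1024**2:.2f} MiB, raw: {raw_bytes / 1024**2:.2f} MiB")
print(f"overall compression ratio: about {ratio:.0f}x")

# Per-chunk ratios differ because each time step contains different data.
for name, size in chunk_file_sizes.items():
    print(f"{name}: about {raw_chunk_bytes / size:.0f}x")
```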


In [18]:
!cat test.zarr/.zmetadata | head -n 30

{
    "metadata": {
        ".zattrs": {},
        ".zgroup": {
            "zarr_format": 2
        },
        "band/.zarray": {
            "chunks": [],
            "compressor": null,
            "dtype": "<U6",
            "fill_value": null,
            "filters": null,
            "order": "C",
            "shape": [],
            "zarr_format": 2
        },
        "band/.zattrs": {
            "_ARRAY_DIMENSIONS": []
        },
        "constellation/.zarray": {
            "chunks": [],
            "compressor": null,
            "dtype": "<U10",
            "fill_value": null,
            "filters": null,
            "order": "C",
            "shape": [],
            "zarr_format": 2
        },
        "constellation/.zattrs": {


The main characteristics of the Zarr format are the following:

- Every chunk of a Zarr array is stored as a single file (see the `0.0.0`, `1.0.0` and `2.0.0` files in `ls -la test.zarr/data`).
- Each data array in a Zarr dataset has two metadata files:
  - `.zattrs` holds the general metadata (attributes) of the dataset or data array.
  - `.zarray` records the array's shape, data type, compressor and chunk shape, from which the chunk file names follow.
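Because chunk files are named after their position in the chunk grid, the list of files to expect can be predicted from the shape and chunk shape stored in `.zarray`. The sketch below does this for a Zarr format 2 store with the default `.` separator; `expected_chunk_keys` is a hypothetical helper written for this illustration, and it assumes every chunk is actually written to the store.

```python
import itertools
import math


def expected_chunk_keys(shape, chunks, separator="."):
    """List the chunk keys for an array with this shape and chunk shape."""
    # One index range per dimension, rounding up for partial edge chunks.
    grid = [range(math.ceil(n / c)) for n, c in zip(shape, chunks)]
    return [separator.join(map(str, idx)) for idx in itertools.product(*grid)]


# Chunking used for test.zarr: one chunk per time step.
print(expected_chunk_keys((3, 3341, 2630), (1, 3341, 2630)))
# ['0.0.0', '1.0.0', '2.0.0']

# The original (1, 1024, 1024) chunking would have produced many more files.
print(len(expected_chunk_keys((3, 3341, 2630), (1, 1024, 1024))))
# 36
```

The first result matches the three files listed in `test.zarr/data`.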


## Conclusion

Understanding chunking is key to optimizing your data analysis when dealing with large datasets. In this exercise, we learned how to optimize data access time and memory use by starting from the native file chunks loaded by stackstac and then instructing Xarray to change the chunk shape. Computing on large datasets can be very slow on a single machine, so to save time you may need to parallelize your computations. This is what you will learn in the next exercise with Dask.