<img src="https://raw.githubusercontent.com/EO-College/cubes-and-clouds/main/icons/cnc_3icons_process_circle.svg"
     alt="Cubes & Clouds logo"
     style="float: center; margin-right: 10px;" />

<img src="https://raw.githubusercontent.com/pangeo-data/pangeo.io/refs/heads/main/public/Pangeo-assets/pangeo_logo.png"
     alt="Pangeo logo"
     style="float: center; margin-right: 10px;" />

# 2.4 Data chunking with Pangeo

## Data chunking

<div class="alert alert-info">
<i class="fa-question-circle fa" style="font-size: 22px;color:#666;"></i> <b>Overview</b>
    <br>
    <br>
    <b>Questions</b>
    <ul>
        <li>What is chunking and why does it matter?</li>
        <li>How can we utilize chunking to make our processing more efficient?</li>
    </ul>
    <b>Objectives</b>
    <ul>
        <li>Explore chunking of data</li>
        <li>Learn about the Zarr file format</li>
    </ul>
</div>

## Context

As explained in Section 2.4 - Formats and Performance, when working with large data files or collections, it is often impractical to load the entire dataset into the memory of a single computer at once. This is where the Pangeo ecosystem is particularly useful. In Section 2.3 - Data Access, we discussed the concept of lazy loading. Xarray enables lazy processing of data in __chunks__, meaning the dataset is divided into manageable pieces. By reading and processing the data in these chunks, we can efficiently handle large datasets on a single computer or scale the processing to a distributed computing cluster using Dask (e.g., on the cloud or high-performance computing environments).

How these chunks are processed in parallel to scale our computation is discussed in [2.4 dask](./dask.ipynb). In this notebook, you will explore the concept of chunks through various exercises.

Processing data piece by piece is more efficient when both our input and output data are also stored in chunks. As introduced in Section 2.4 - Formats and Performance, [Zarr](https://zarr.readthedocs.io/en/stable/) is a cloud-native data format and serves as the reference library in the Pangeo ecosystem for storing Xarray multi-dimensional datasets in __chunks__.

## Data
We'll begin with the same sample data retrieval method from the Sentinel-2 STAC collection, as described in Exercise 2.3 - Data Access: Lazy Loading with Pangeo.

The analysis will be similar to our previous exercises, but this time, we’ll use a larger spatial extent to demonstrate the scalability.


We start by copying the data files needed to complete the exercise, using the following shell command:

In [None]:
!cp -r ${DATA_PATH%/*/*}/notebooks/cubes-and-clouds/lectures/2.4_formats_and_performance/exercises/assets $HOME/

## Load Libraries

In [1]:
import pystac_client
import geopandas as gpd
from shapely.geometry import mapping
import stackstac
import warnings
import xarray as xr
import numpy as np
import rioxarray as rio
warnings.filterwarnings("ignore")

## Load Sentinel-2 data

In [2]:
aoi = gpd.read_file('./assets/catchment_outline.geojson', crs="EPSG:4326")
aoi_geojson = mapping(aoi.iloc[0].geometry)
URL = "https://earth-search.aws.element84.com/v1"
catalog = pystac_client.Client.open(URL)
items = catalog.search(
    intersects=aoi_geojson,
    collections=["sentinel-2-l2a"],
    datetime="2019-02-01/2019-04-28",
    query= {"proj:epsg": dict(eq=32632)}
).item_collection()
sentinel2_l2a = stackstac.stack(items,assets=["red","nir"])
sentinel2_l2a

| | Array | Chunk |
|---|---|---|
| Bytes | 350.16 GiB | 8.00 MiB |
| Shape | (102, 2, 20982, 10980) | (1, 1, 1024, 1024) |
| Dask graph | 47124 chunks in 3 graph layers | |
| Data type | float64 numpy.ndarray | |


## What is a __Chunk__?

If you closely examine the `sentinel2_l2a` dataset, you'll notice that the `xarray.DataArray` is backed by a `dask.array` with a chunk size of `(1, 1, 1024, 1024)`. The full array has shape `(102, 2, 20982, 10980)` and is split into 47,124 chunks. Loading all of it at once would require 350.16 GiB of RAM.
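The figures in this summary can be checked by hand. The following sketch recomputes them from the shape, the chunk shape, and the 8-byte `float64` item size; the variable names are our own and not part of any library.

```python
import math

shape = (102, 2, 20982, 10980)      # (time, band, y, x), as reported for the array
chunk_shape = (1, 1, 1024, 1024)    # chunk shape, as reported for the array
itemsize = 8                        # bytes per float64 value

# Number of chunks along each dimension; the last chunk on an axis may be partial.
chunks_per_dim = [math.ceil(n / c) for n, c in zip(shape, chunk_shape)]
n_chunks = math.prod(chunks_per_dim)

total_gib = math.prod(shape) * itemsize / 2**30
chunk_mib = math.prod(chunk_shape) * itemsize / 2**20

print(chunks_per_dim, n_chunks)
print(f"{total_gib:.2f} GiB in total, {chunk_mib:.2f} MiB per full chunk")
```

The y and x axes are not exact multiples of 1024, so the chunks at the edges are smaller than 8 MiB.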

You can view the `dask.array` information by clicking the blue-circled icon in the image below.

![Dask.array](https://raw.githubusercontent.com/EO-College/cubes-and-clouds/refs/heads/main/lectures/2.4_formats_and_performance/exercises/assets/datasize.png)

By clicking the red-circled triangle icon, you'll see detailed information about the `xarray.DataArray`, including its Coordinates, Indexes, and Attributes.

When creating an `Xarray` object using `stackstac`, we can easily convert a STAC collection into a lazily-loaded, chunked `xarray.DataArray` backed by Dask.

The size and shape of the chunks determine the level of parallelization performed by Dask. Therefore, selecting an appropriate chunk size can significantly impact the performance of your computation.

This is where understanding and effectively using chunking becomes crucial.


In our case, so far, we used `stackstac` without specifying the `chunksize` explicitly. The dataset is therefore composed of 8 MiB chunks, each containing 1 time step, 1 band, and 1024 x 1024 pixels in the x and y directions.

![chunk_original](https://raw.githubusercontent.com/EO-College/cubes-and-clouds/refs/heads/main/lectures/2.4_formats_and_performance/exercises/assets/chunk_original.png)

If the chunk size is too small, our workflow will be divided into too many tiny pieces, which can lead to excessive communication and increased distribution overhead.

On the other hand, if the chunk size is too large, there may not be enough memory available to handle the workload, causing the workflow to fail.

The optimal chunk size depends on both your computation and the machine you're using.

For example, 8 MiB is relatively small compared to the typical RAM size available. Dask’s default array chunk size, for instance, is 128 MiB.


In [None]:
import dask
dask.config.get('array.chunk-size')
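To make the trade-off between chunk size and chunk count concrete, here is a small sketch in plain Python. It assumes square spatial chunks of shape `(1, 1, edge, edge)` on the array used in this notebook; `chunk_stats` is a hypothetical helper written for this illustration.

```python
import math

SHAPE = (102, 2, 20982, 10980)  # (time, band, y, x) of the array in this notebook
ITEMSIZE = 8                    # bytes per float64 value

def chunk_stats(edge):
    """Return (number of chunks, MiB per full chunk) for (1, 1, edge, edge) chunks."""
    time, band, y, x = SHAPE
    n_chunks = time * band * math.ceil(y / edge) * math.ceil(x / edge)
    chunk_mib = edge * edge * ITEMSIZE / 2**20
    return n_chunks, chunk_mib

stats = {edge: chunk_stats(edge) for edge in (256, 1024, 4096)}
for edge, (n_chunks, chunk_mib) in stats.items():
    print(f"edge={edge:5d}  chunks={n_chunks:7d}  chunk size={chunk_mib:6.1f} MiB")
```

Multiplying the chunk edge by four divides the number of tasks Dask has to schedule by roughly sixteen, at the cost of sixteen times more memory per chunk.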

## Modifying chunks

Let's try to modify our chunk size.

To modify the chunks of an existing `xarray.DataArray`, we can use the `chunk` method.
We only need 2 bands to compute the Normalized Difference Vegetation Index (NDVI), so we select only `red` and `nir` to simplify our example.

We would like each time step to sit in its own chunk, both bands to be kept together in one chunk, and Dask to compute the chunk sizes along the x and y dimensions.

In [None]:
def NDVI(data):
    red = data.sel(band="red")
    nir = data.sel(band="nir")
    ndvi = (nir - red)/(nir + red)
    return ndvi

ndvi_xr = NDVI(sentinel2_l2a)
ndvi_xr
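As a quick sanity check of the formula used in `NDVI`, the sketch below applies it to a few hand-picked reflectance values. The numbers are hypothetical and chosen only for illustration; they do not come from the Sentinel-2 data.

```python
import numpy as np

# Hypothetical reflectance values, chosen only for illustration.
red = np.array([0.10, 0.30, 0.25])
nir = np.array([0.50, 0.30, 0.05])

# Same expression as in the NDVI function, applied element by element.
ndvi = (nir - red) / (nir + red)
print(ndvi.round(3))
```

For non-negative reflectances the result always lies between -1 and 1. Because the expression is element-wise, Dask can evaluate it independently on each chunk.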

In [None]:
sentinel2_l2a = sentinel2_l2a.sel(
    band=['red','nir']).chunk(
    chunks={'time': 1, 'band':2, 'x':'auto','y':'auto'})
sentinel2_l2a

If you look into the details of the representation above, you'll see that the chunks are larger along x and y, and that we have fewer chunks than before.

Note from the chunk size that the `auto` option chose chunk sizes for y and x that bring each chunk close to Dask's preferred chunk size, given that we fixed the chunk sizes of time and band to 1 and 2 respectively.
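The same rechunking can be tried without downloading anything. The sketch below uses a hypothetical stand-in array of zeros with the same dimension names; its shape is made up for illustration and nothing is computed, since the array stays lazy.

```python
import dask.array as da
import xarray as xr

# Hypothetical stand-in for the Sentinel-2 stack: lazy zeros, no data is downloaded.
arr = xr.DataArray(
    da.zeros((4, 2, 4096, 4096), chunks=(1, 1, 1024, 1024), dtype="float64"),
    dims=("time", "band", "y", "x"),
)
n_before = arr.data.npartitions  # number of chunks before rechunking

# One time step per chunk, both bands together, Dask picks the y and x sizes.
rechunked = arr.chunk({"time": 1, "band": 2, "y": "auto", "x": "auto"})
n_after = rechunked.data.npartitions

print(n_before, n_after)
print(rechunked.chunks)
```

The exact y and x sizes chosen by `auto` depend on the Dask version and on the `array.chunk-size` setting, so no specific output is shown here.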


<div class="alert alert-warning">
    <i class="fa-check-circle fa" style="font-size: 22px;color:#666;"></i> <b>Go Further</b>
    <br>
    <br>
    You can try different ways of specifying chunks.
    <ul>
        <li> chunks = -1 -> the entire array will be used as a single chunk</li>
        <li> chunks = {'x':-1, 'y': 1000} -> chunks spanning the entire <i>x</i> dimension, but split every 1000 values along the <i>y</i> dimension</li>
        <li> chunks = {'x':-1, 'y': 'auto'} -> Xarray relies on Dask to choose a suitable size along the <i>y</i> dimension, based on Dask's preferred chunk size</li>
        <li> chunks = { 'x':-1 ,'y':"500MiB" } -> Xarray seeks a chunk size that meets a specific memory target, here expressed in MiB</li>
        <li> chunks = (1, 2, 2048, 2048) -> specifying the chunk size for each dimension, in dimension order</li>
    </ul>
</div>
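Two of these options give fully predictable results and are easy to try on a small array. The sketch below uses a hypothetical array of zeros and assumes Dask is installed, since `chunk` needs it to build the chunked array.

```python
import numpy as np
import xarray as xr

# Hypothetical small array; Dask must be installed for .chunk() to work.
arr = xr.DataArray(np.zeros((2500, 300)), dims=("y", "x"))

single = arr.chunk(-1)                  # one chunk holding the whole array
rows = arr.chunk({"x": -1, "y": 1000})  # full x extent, y split every 1000 values

print(single.chunks)
print(rows.chunks)
```

The `chunks` attribute lists the chunk lengths along each dimension, so the last entry along `y` is the shorter remainder of 500 rows.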

## Defining the chunk size at the creation with Xarray

We can also define the chunk size when we create the object.  
With Xarray, this is usually done with the `chunks` keyword argument when opening local files with `xr.open_dataset` or `xr.open_mfdataset`.  

In our NDVI example, we create the Xarray object with `stackstac`. Since `stackstac`'s default `chunksize` is 1024 for the x and y dimensions, that is the chunk size we obtained. We can pass the `chunksize` option to `stackstac` to make it larger.



In [None]:
%%time
sentinel2_l2a = stackstac.stack(items,
                                assets=['red','nir'],
                                chunksize=( 1, 2, 2048,2048)
)
sentinel2_l2a

## So, why chunks?

As explained in **Section 2.4 - Formats and Performance**, chunks are mandatory for accessing files or datasets that are larger than a single computer's memory. If all the data has to be accessed, it can be done sequentially (i.e., chunks are processed one after the other).

Moreover, chunks allow for distributed processing and increased speed for your data analysis, as seen in the next section.
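The sequential case can be sketched with plain NumPy: a reduction is computed block by block, so only one block needs to be held at a time. `chunked_mean` is a hypothetical helper, and the in-memory array merely stands in for data that would normally be read from disk one chunk at a time.

```python
import numpy as np

def chunked_mean(array, chunk_rows):
    """Mean of a 2-D array, accumulated one block of rows at a time."""
    total = 0.0
    count = 0
    for start in range(0, array.shape[0], chunk_rows):
        block = array[start:start + chunk_rows]  # only this block is processed
        total += float(block.sum())
        count += block.size
    return total / count

# Stand-in data; in practice each block would be loaded from a file.
data = np.arange(12_000, dtype="float64").reshape(120, 100)
result = chunked_mean(data, chunk_rows=32)
print(result, data.mean())
```

Dask applies the same idea automatically, and can also run the per-block steps in parallel.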

### Chunks and files

Xarray chunking capabilities also depend on the underlying input or output file format used. Most modern file formats allow datasets or single files to be stored using chunks. For example, **NetCDF4** uses chunks when storing a file on the disk via HDF5. Any read of data from a NetCDF4 file will load at least one chunk of that file. So when reading one of its chunks as defined in the `open_dataset` call, Xarray will take advantage of native file chunking and won't need to read the entire file.

However, it is important to note that **Xarray chunks and file chunks are not necessarily the same**. It is a good practice to configure Xarray chunks so that they align well with the input file format chunks (ideally, Xarray chunks should contain one or several input file chunks).
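A simple way to reason about alignment is to check that each Xarray chunk length is a whole multiple of the corresponding file chunk length. The helper `is_aligned` and the file chunk shape below are hypothetical and only illustrate the rule.

```python
def is_aligned(xarray_chunks, file_chunks):
    """True if every Xarray chunk length is a whole multiple of the file chunk length."""
    return all(xc % fc == 0 for xc, fc in zip(xarray_chunks, file_chunks))

file_chunks = (1, 512, 512)  # hypothetical on-disk chunk shape (time, y, x)

aligned = is_aligned((1, 2048, 2048), file_chunks)     # 2048 = 4 * 512
misaligned = is_aligned((1, 1000, 1000), file_chunks)  # 1000 is not a multiple of 512

print(aligned, misaligned)
```

With misaligned chunks, a single Xarray chunk straddles several file chunks that are only partly needed, so the same file chunk may be read and decompressed more than once.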


## Zarr storage format

This brings us to our next subject: [Zarr](https://zarr.readthedocs.io/en/stable/).

If we can have our original dataset already 'chunked' and accessed in an optimized way according to its actual byte storage on disk, we won't need to load the entire dataset every time. This can greatly optimize our data analysis, even when working with the entire dataset.

Let's convert our intermediate data into Zarr format so that we can learn what it is. We can keep the data as a `DataArray` or convert it into a `Dataset` before storing it.

We start again by loading data using `stackstac`, but this time, we proceed to the next step: clipping the data and computing the NDVI. Then, let's try to save those intermediate results in a Zarr file.


## Data Loading

Load data using stackstac (with specific chunk size)

In [None]:
aoi = gpd.read_file("./assets/catchment_outline.geojson", crs="EPSG:4326")
aoi_geojson = mapping(aoi.iloc[0].geometry)
URL = "https://earth-search.aws.element84.com/v1"
catalog = pystac_client.Client.open(URL)
items = catalog.search(
    intersects=aoi_geojson,
    collections=["sentinel-2-l2a"],
    datetime="2019-02-01/2019-04-28",
    query= {"proj:epsg": dict(eq=32632)}
).item_collection()
ds = stackstac.stack(items,
                     assets=['red','nir'],
                     chunksize=( 1, 2, 1024,1024)
)

## NDVI computation

Compute the NDVI as in 2.3 Data Access - Reduce

In [None]:
def NDVI(data):
    red = data.sel(band="red")
    nir = data.sel(band="nir")
    ndvi = (nir - red)/(nir + red)
    return ndvi

ndvi_xr = NDVI(ds)

## Spatial clipping

Restrict the data to the area of interest from the loaded polygon

In [None]:
aoi_utm32 = aoi.to_crs(epsg=32632)
geom_utm32 = aoi_utm32.iloc[0]['geometry']
ndvi_xr.rio.write_crs("EPSG:32632", inplace=True)
ndvi_xr.rio.set_nodata(np.nan, inplace=True)
ndvi_xr = ndvi_xr.rio.clip([geom_utm32])

## Save to Zarr

Select just a few days, to reduce the amount of data for this example

In [None]:
ndvi_small = ndvi_xr.isel(time=slice(0,3))
ndvi_small

Before saving, we can modify the chunk shape

In [None]:
ndvi_small = ndvi_small.chunk(chunks = {'x':'auto', 'y': 'auto'}).to_dataset(name='data')
ndvi_small

Then clean attributes that might create issues while writing and save to Zarr

In [None]:
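%%time

def remove_attrs(obj, to_remove):
    # Return a copy of obj without the attributes listed in to_remove
    new = obj.copy()
    new.attrs = {k: v for k, v in obj.attrs.items() if k not in to_remove}
    return new

def encode(obj):
    # Drop coordinates with object dtype, which may not serialize cleanly,
    # then remove the attributes that can cause problems when writing
    object_coords = [name for name, coord in obj.coords.items() if coord.dtype.kind == "O"]
    return obj.drop_vars(object_coords).pipe(remove_attrs, ["spec", "transform"])

ndvi_small.pipe(encode).to_zarr('test.zarr',mode='w')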

<div class="alert alert-warning">
    <i class="fa-check-circle fa" style="font-size: 22px;color:#666;"></i> <b>Exercise</b>
    <br>
    <ul>
        <li>What about saving the data in NetCDF format?</li>
        <li>You can explore the Zarr store you just created using <code>ls -la test.zarr</code> and <code>ls -la test.zarr/data</code></li>
        <li>You can explore the Zarr metadata file with <code>cat test.zarr/.zmetadata</code></li>
        <li>Did you find the <b>chunks</b> we defined previously in your Zarr store?</li>
    </ul>
</div>

## Compare Zarr and netCDF
Let's compare how the Zarr and NetCDF files are stored.  
We read our sample dataset from Zarr and store it as netCDF:

In [None]:
xr.open_zarr('test.zarr').to_netcdf('test.nc')

Compare the disk space used by the two formats:

In [None]:
!du -sh test.zarr/ test.nc

List the content of the Zarr directory:

In [None]:
!ls  -la test.zarr/

List the content of the Zarr data directory:

In [None]:
!ls  -la test.zarr/data

Print the content of the Zarr metadata file

In [None]:
!cat test.zarr/.zmetadata | head -n 30

### Zarr format main characteristics

- Every chunk of a Zarr dataset is stored as a single file (see the files named by chunk index, such as `0.0.0`, in `ls -al test.zarr/data`)
- Each data array in a Zarr dataset has two metadata files:
  - `.zattrs` for general dataset or data array metadata
  - `.zarray` describing how the data array is laid out: its shape, chunk shape, data type, and compression

## Conclusion

Understanding chunking is key to optimizing your data analysis when dealing with large datasets. In this exercise, we learned how to optimize data access time and memory usage by using the native file chunks loaded by stackstac and instructing Xarray to modify the chunks. Computing on large datasets can be very slow on a single machine, and to save time we may need to parallelize our computations. This is what you will learn in the next exercise with Dask.