# Dask and chunking in the real world

The [data formats section](../data/data-chunking.md) introduces NetCDF chunking, and the previous chapter on [computations](computations.ipynb) introduces dask and shows how to visualise chunks and how they can affect data analysis. This section illustrates some examples of functions that use chunking to make data analysis more efficient and to solve memory issues.

```{note}
Dask has a comprehensive but accessible [blog introducing chunks](https://blog.dask.org/2021/11/02/choosing-dask-chunk-sizes), including how to choose an optimal chunk size in dask and how to align chunks to the original file chunks.
```

In [1]:
## This is setup for the plots later on in the notebook - on the website this
## cell (and the cells making the diagrams) is hidden by default, using the 'hide-input' cell tag

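import matplotlib
import matplotlib.patches  # imported explicitly because draw_chunks uses matplotlib.patches.Rectangle
import matplotlib.pyplot as plt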
import numbers
import numpy

def draw_chunks(ax, size = (10, 8), nchunks = (5, 2), chunk_size = None, chunk_color = None):
    """
    Draw a chunk diagram
    
    Args:
        ax:          matplotlib.pyplot axis to draw on
        size:        size of the array (x, y)
        nchunks:     number of chunks (x, y)
        chunk_size:  size of each chunk (x, y) (default size/nchunks)
        chunk_color: colour of each chunk (array with shape nchunks)
    """
    
    spacing = 0.1
    
    if chunk_size is None:
        chunk_size = (None, None)
        
    if chunk_color is None:
        chunk_color = numpy.full(nchunks, 'wheat')
    else:
        chunk_color = numpy.asarray(chunk_color)
        
    # Fill in None values
    chunk_size = tuple(chunk_size[i] if chunk_size[i] is not None else size[i] / nchunks[i]
                        for i in range(2))
    
    if isinstance(chunk_size[0], numbers.Number):
        xsize = numpy.full(nchunks[0], chunk_size[0]) - spacing
    else:
        xsize = numpy.asarray(chunk_size[0]) - spacing
        
    if isinstance(chunk_size[1], numbers.Number):
        ysize = numpy.full(nchunks[1], chunk_size[1]) - spacing
    else:
        ysize = numpy.asarray(chunk_size[1]) - spacing

                        
    # Chunk cell centre
    xc = (numpy.arange(nchunks[0], dtype='f') + 0.5) * (size[0] / nchunks[0])
    yc = (numpy.arange(nchunks[1], dtype='f') + 0.5) * (size[1] / nchunks[1])
    
    for ii in range(nchunks[0]):
        for jj in range(nchunks[1]):
            box = matplotlib.patches.Rectangle((xc[ii] - xsize[ii]/2,
                                                yc[jj] - ysize[jj]/2),
                                               xsize[ii],
                                               ysize[jj], 
                                               facecolor=chunk_color[ii,jj], edgecolor='black')
            
            ax.add_patch(box)
            
    ax.set_xbound(0, size[0])
    ax.set_ylim(0, size[1])
    ax.set_xticks([])
    ax.set_yticks([])
    ax.set_frame_on(False)
    

## Rechunking

To rechunk a dataset is to read it in and write it back out again with a chunk layout that is optimised for analysis along a different dimension. For example, you might have a dataset that is optimised for reading lat-lon slices, but you want to create a time-series climatology.

Rechunking may also combine multiple input files. Say a dataset has one file per model day that contains all of its variables. For a time-series analysis you may want to swap this to one file per variable per model year, to reduce the number of files that need to be opened in the analysis.

<div class="alert alert-info" role="alert">
<b>Resources</b>

 - **Xarray** The [encoding](http://xarray.pydata.org/en/stable/user-guide/io.html#chunk-based-compression) argument to `to_netcdf()` can specify file chunking. To combine multiple files use [open_mfdataset](http://xarray.pydata.org/en/stable/generated/xarray.open_mfdataset.html) or [concat](http://xarray.pydata.org/en/stable/generated/xarray.concat.html)
 - [**Rechunker**](https://github.com/pangeo-data/rechunker) A Python library for rechunking files in *Zarr* format
 - **NCO** There are several arguments to specify output chunking, see e.g. `ncks --help | grep cnk`. To combine input files along time see `ncrcat`
</div>

## Simple function to retrieve file chunks

[This blog](https://climate-cms.org/posts/2021-07-29-coarsen_climatology.html) includes a simple function to retrieve the chunks of a NetCDF file.

## Effects of chunks on parallel computations with dask

This [parallel training](https://coecms-training.github.io/parallel/dask-intro.html) has many references to chunks and their effects on computation in its dask and case studies sections.
Any more examples of `map_blocks` would be very welcome, as they are hard to find.
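
To give a rough feel for how chunk size affects a computation, the sketch below builds the same array with two different chunk shapes and compares the number of chunks, the size of each chunk and the size of the task graph for a simple sum. The array shape and chunk shapes are arbitrary values chosen for illustration.

```python
import dask.array as da
import numpy as np

# The same 1000 x 1000 float64 array with two hypothetical chunk shapes
small = da.ones((1000, 1000), chunks=(100, 100))
large = da.ones((1000, 1000), chunks=(500, 500))

for name, arr in [("small chunks", small), ("large chunks", large)]:
    chunk_bytes = int(np.prod(arr.chunksize)) * arr.dtype.itemsize
    n_tasks = len(arr.sum().__dask_graph__())
    print(f"{name}: blocks={arr.numblocks}, chunks={arr.npartitions}, "
          f"bytes per chunk={chunk_bytes}, tasks in sum graph={n_tasks}")

# Keep the task counts for comparison
n_tasks_small = len(small.sum().__dask_graph__())
n_tasks_large = len(large.sum().__dask_graph__())
```

Many small chunks give dask more opportunities to run in parallel but add scheduling overhead for every task, while a few large chunks reduce that overhead at the cost of more memory per task.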

## Using map_blocks

Dask provides the `dask.array.map_blocks()` function, which runs a function on every chunk of an array.
The last section of [this blog](https://climate-cms.org/posts/2021-11-24-api.html?highlight=chunk#pure-dask-advanced) shows an example of how to use `map_blocks()`.
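
As a minimal sketch of the idea (not the example from the blog), the code below applies a function to each chunk independently. The helper `remove_block_mean` and the input array are hypothetical.

```python
import dask.array as da
import numpy as np


def remove_block_mean(block):
    """Subtract the mean of a single chunk from that chunk (receives a NumPy array)."""
    return block - block.mean()


# 12 values split into 3 chunks of 4 elements each
x = da.arange(12, chunks=4, dtype="f8")

# The function is called once per chunk; the output keeps the same chunk structure
y = da.map_blocks(remove_block_mean, x, dtype=x.dtype)

result = y.compute()
print(result)
```

Because the function only ever sees one chunk at a time, the mean that is removed is the mean of each chunk rather than of the whole array. This is why `map_blocks` suits operations that are independent across chunks, and why the chunk layout has to match what the function expects.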

## Using dask delayed

Dask ...

## Using dask futures

Dask ...