# Rebinning and How to Use Dask

## Fundamentals of Dask

In this session we are going to use the example of rebinning VISP data to discuss how the Dask array backing the `Dataset` object works.
Dask is a Python package for larger-than-memory and parallel computation. It provides an array-like object where data is only loaded and processed on demand.
`Dataset` uses Dask to track the data arrays, which it stores in the `Dataset.data` attribute.

To demonstrate this, let's load our VISP dataset from yesterday and slice it to a more manageable size again.

In [None]:
import dkist
import dkist.net
import matplotlib.pyplot as plt

ds = dkist.Dataset.from_directory('/home/drew/sunpy/data/VISP/AGLKO/')
ds = ds[0, 520:720, :, 1000:1500]

This Dask object behaves in many ways just like a numpy array.
For instance, it can be indexed and sliced in the same way.

In [None]:
ds.data[:, :, :200]

And it has many of the same methods for calculating useful properties of the data, such as `min()`, `max()`, `sum()`, etc.
These are in fact just wrappers around the numpy functions themselves, so they behave in the same way.
For example, to find the sum over the spatial dimensions of our cropped data to make a spectrum, we could do:

In [None]:
ds.data.sum(axis=(0,2))

What you will notice when you call these functions is that they don't return a number as you might expect.
Instead they give us a Dask array which represents the result of that calculation.
This is because Dask delays the actual calculation of the value until you explicitly tell it to do it using the `compute()` method.

In [None]:
spectrum = ds.data.sum(axis=(0,2)).compute()
plt.plot(spectrum)
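
If you don't have the VISP data to hand, the same behaviour can be seen on a small synthetic array. This is only a sketch: `cube` is a made-up stand-in for `ds.data` with two spatial axes and one wavelength axis, and its shape and chunk sizes are arbitrary.

```python
import dask.array as da
import numpy as np

# Made-up stand-in for ds.data: (space, wavelength, space), filled with ones
cube = da.ones((20, 50, 30), chunks=(20, 50, 10))

# Summing returns another Dask array; no values have been calculated yet
lazy_spectrum = cube.sum(axis=(0, 2))
print(type(lazy_spectrum).__name__, lazy_spectrum.shape)

# compute() runs the calculation and hands back a plain numpy array
spectrum = lazy_spectrum.compute()
print(type(spectrum).__name__, spectrum.shape)
```

The first `print` reports a Dask `Array` and the second a numpy `ndarray`, both with one value per wavelength point.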

A benefit of this is that, since each operation returns another Dask array, we can do more calculations with the result, and those are also delayed.
This means that you can string together any number of calculations and only perform the costly step of getting the actual answer once.
So if we want to find the location of the lowest value in this spectrum, we can do:

In [None]:
spectrum = ds.data.sum(axis=(0, 2))
wl_idx = spectrum.argmin()
wl_idx = wl_idx.compute()
wl = ds.wcs.pixel_to_world(0, wl_idx, 0)[1]
wl
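
When more than one result is needed from the same delayed array, `dask.compute()` can evaluate them together so that shared steps (here the sum) are not repeated. The sketch below uses made-up data with a dip placed at wavelength index 12; the array and its values are illustrative assumptions, not VISP data.

```python
import dask
import dask.array as da
import numpy as np

# Made-up line profile: flat continuum with a dip at index 12
profile = np.ones(40)
profile[12] = 0.2

# Repeat the profile over two small spatial axes: (space, wavelength, space)
data = np.broadcast_to(profile[None, :, None], (8, 40, 6)).copy()
cube = da.from_array(data, chunks=(4, 40, 3))

# Build up the delayed calculations; nothing is evaluated here
spectrum = cube.sum(axis=(0, 2))
lazy_idx = spectrum.argmin()
lazy_min = spectrum.min()

# Evaluate both results in a single pass
idx, lowest = dask.compute(lazy_idx, lazy_min)
print(int(idx), float(lowest))
```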

When performing these operations, Dask breaks up the array into chunks, and operations will generally be faster and use less memory when they need to read fewer chunks.
In the case of a `Dataset`, these chunks are aligned with the files, so each chunk essentially consists of the array stored in one FITS file.
In the future we may break down a FITS file into more chunks, so the whole array does not always have to be read.
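
The chunk layout of any Dask array can be inspected through its `chunks` and `numblocks` attributes, so `ds.data.chunks` shows how a `Dataset` is split up. As a rough sketch of the one-chunk-per-file layout, the example below builds a made-up stack of five 100 by 200 frames with one chunk per frame; the sizes are assumptions chosen for illustration.

```python
import dask.array as da

# Made-up stack of 5 "files", each holding a 100 x 200 array, one chunk per file
frames = da.zeros((5, 100, 200), chunks=(1, 100, 200))
print(frames.chunks)     # chunk sizes along each axis
print(frames.numblocks)  # number of chunks along each axis

# Picking a single frame only involves one chunk (one "file")
one_frame = frames[2]
print(one_frame.numblocks)

# A small cutout across every frame still needs all five chunks
cutout = frames[:, :10, :10]
print(cutout.numblocks)
```

Even though `cutout` contains far fewer values than `one_frame`, it needs every chunk, which is why slicing along the file axis tends to be the cheaper operation.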

## Rebinning with `NDCube`