# Dask Arrays


<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Dask-Arrays" data-toc-modified-id="Dask-Arrays-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Dask Arrays</a></span><ul class="toc-item"><li><span><a href="#Learning-Objectives" data-toc-modified-id="Learning-Objectives-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Learning Objectives</a></span></li><li><span><a href="#Core-Concepts" data-toc-modified-id="Core-Concepts-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>Core Concepts</a></span><ul class="toc-item"><li><span><a href="#Create-dask.array-object" data-toc-modified-id="Create-dask.array-object-1.2.1"><span class="toc-item-num">1.2.1&nbsp;&nbsp;</span>Create <code>dask.array</code> object</a></span></li><li><span><a href="#Compute-result" data-toc-modified-id="Compute-result-1.2.2"><span class="toc-item-num">1.2.2&nbsp;&nbsp;</span>Compute result</a></span></li><li><span><a href="#Task-Graph" data-toc-modified-id="Task-Graph-1.2.3"><span class="toc-item-num">1.2.3&nbsp;&nbsp;</span>Task Graph</a></span></li><li><span><a href="#Exercise-1:-Compute-the-mean" data-toc-modified-id="Exercise-1:-Compute-the-mean-1.2.4"><span class="toc-item-num">1.2.4&nbsp;&nbsp;</span>Exercise 1: Compute the mean</a></span></li></ul></li><li><span><a href="#Bigger-Calculation" data-toc-modified-id="Bigger-Calculation-1.3"><span class="toc-item-num">1.3&nbsp;&nbsp;</span>Bigger Calculation</a></span></li><li><span><a href="#Reduction" data-toc-modified-id="Reduction-1.4"><span class="toc-item-num">1.4&nbsp;&nbsp;</span>Reduction</a></span></li><li><span><a href="#Going-Further" data-toc-modified-id="Going-Further-1.5"><span class="toc-item-num">1.5&nbsp;&nbsp;</span>Going Further</a></span></li></ul></li></ul></div>

## Learning Objectives

- Understand core concepts behind dask arrays


## Core Concepts

A dask array looks and feels a lot like a numpy array. However, a dask array doesn't directly hold any data. Instead, it symbolically represents the computations needed to generate the data. Nothing is computed until the numerical values are actually needed. This mode of operation is called "lazy"; it allows one to build up complex, large calculations symbolically before turning them over to the scheduler for execution.

If we want to create a numpy array of all ones, we do it like this:

In [None]:
import numpy as np

In [None]:
shape = (100, 320, 384)
ones_np = np.ones(shape)
ones_np

### Create `dask.array` object

Now let's create the same array using dask's array interface.

In [None]:
import dask.array as da

In [None]:
ones = da.ones(shape)

This worked. Because we didn't tell dask how to split up the array, it automatically created an array with a single chunk:

In [None]:
ones

A crucial difference with dask is that:

- we can specify the chunks argument. "Chunks" describes how the array is split up over many small pieces.
- we can perform large computations by performing many smaller computations


![dask-array-black-text](../../../assets/dask-array-black-text.svg)

source: [Dask Array Documentation](https://docs.dask.org/en/latest/array.html)

There are several ways to specify chunks. In this tutorial, we will use a block shape.

In [None]:
chunk_shape = (20, 320, 384)
ones = da.ones(shape, chunks=chunk_shape)
ones

Notice that we just see a symbolic representation of the array, including its shape, dtype, and chunksize. No data has been generated yet. 
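As a hedged sketch of the "several ways to specify chunks" mentioned above, the snippet below builds the same array with a few different chunk specifications and reads back the chunk metadata, which is available without generating any data. The variable names (`by_block`, `by_axis`, `explicit`) are illustrative and not part of the original tutorial.

```python
import dask.array as da

shape = (100, 320, 384)

# Block shape: every chunk is (20, 320, 384), as in the cell above.
by_block = da.ones(shape, chunks=(20, 320, 384))

# -1 means "do not split this axis"; only axis 0 is chunked here.
by_axis = da.ones(shape, chunks=(20, -1, -1))

# Explicit chunk sizes per axis; chunks need not be uniform.
explicit = da.ones(shape, chunks=((50, 30, 20), (320,), (384,)))

# Chunk metadata is plain Python data; reading it triggers no computation.
print(by_block.chunks)     # chunk sizes along each axis
print(by_block.numblocks)  # number of chunks along each axis
print(explicit.chunks)
```

The first two specifications describe the same layout; the third shows that chunk sizes along an axis only need to add up to the axis length.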

### Compute result

Dask.array objects are lazily evaluated.  Operations build up a graph of blocked tasks to execute.  

When we call `.compute()` on a dask array, the computation is triggered and the result is returned as a numpy array.


In [None]:
ones.compute()
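A quick way to check this, sketched here under the assumption that the array is small enough to fit in memory, is to compare the types before and after computing. The name `result` is introduced only for this illustration.

```python
import numpy as np
import dask.array as da

ones = da.ones((100, 320, 384), chunks=(20, 320, 384))

result = ones.compute()  # runs the task graph and returns the values

print(type(ones))    # still a lazy dask array
print(type(result))  # an in-memory numpy array
print(result.shape, result.dtype)
```

Note that `.compute()` does not modify `ones`; it returns a new numpy array and leaves the dask array lazy.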

### Task Graph


![dask-dag](../../../assets/dask-dag.gif)

When we work with dask, it builds up a graph of blocked tasks to execute. This task graph is also known as a **Directed Acyclic Graph (DAG)**. A DAG is a collection of all the tasks you want to run, organized in a way that reflects their relationships and dependencies.

Dask allows us to visualize the task graph that gets executed when the computation is triggered. To see what this graph looks like, we call `.visualize()` on a dask object:

In [None]:
ones.visualize()

Our array has five chunks. To generate it, dask calls `np.ones` five times and then concatenates the results into one array.
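Conceptually, and only as a rough numpy analogy rather than dask's actual implementation, the graph for `ones` does something like the following. The names `blocks` and `assembled` are made up for this sketch.

```python
import numpy as np

chunk_shape = (20, 320, 384)

# One np.ones call per chunk (five chunks along axis 0) ...
blocks = [np.ones(chunk_shape) for _ in range(5)]

# ... then the pieces are joined along the chunked axis.
assembled = np.concatenate(blocks, axis=0)

print(assembled.shape)
```

The difference is that dask records these steps as tasks and runs them only when the values are requested.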

Rather than immediately loading a dask array (which puts all the data into RAM), it is more common to want to reduce the data somehow. For example:

In [None]:
sum_of_ones = ones.sum()
sum_of_ones

In [None]:
sum_of_ones.visualize()

Here we see dask's strategy for finding the sum. This simple example illustrates the beauty of dask: **it automatically designs an algorithm appropriate for custom operations with our data.**
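The idea behind this strategy can be sketched by hand: reduce each chunk on its own, then combine the small per-chunk results. The snippet below is a simplified illustration, assuming the same five-chunk layout as above; the real graph may include additional intermediate aggregation steps, and `partial_sums` and `manual_total` are hypothetical names.

```python
import numpy as np
import dask.array as da

ones = da.ones((100, 320, 384), chunks=(20, 320, 384))

# Step 1: sum each chunk independently (these could run in parallel).
partial_sums = [np.ones((20, 320, 384)).sum() for _ in range(5)]

# Step 2: combine the small per-chunk results into the final answer.
manual_total = sum(partial_sums)

dask_total = ones.sum().compute()
print(manual_total, dask_total)
```

Only one chunk needs to be in memory per task, which is why the same approach scales to arrays larger than RAM.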

If we make our operation more complex, the graph gets more complex.

In [None]:
# Compute standard deviation 
complex_calculation = (ones * ones[::-1, ::-1]).std()
complex_calculation.visualize()

### Exercise 1: Compute the mean 

Now that we've seen the simple example above, try a slightly different problem: compute the mean of the `ones` array along the vertical axis (`axis=1`) and the horizontal axis (`axis=2`).


In [None]:
# %load solutions/02_dask_arrays_mean.py

## Bigger Calculation

The examples above were toy examples; the data (98 MB) is nowhere near big enough to warrant the use of dask.

We can make it a lot bigger!

In [None]:
bigshape = (500, 2400, 3600)
chunk_shape = (10, 1200, 1800)
big_ones = da.ones(bigshape, chunks=chunk_shape)
big_ones

This dataset is 35 GB, rather than MB! This is greater than the amount of available RAM on a decent personal computer. Nevertheless, dask has no problem working on it.
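The sizes quoted in this section can be checked with a little arithmetic, assuming the default `float64` dtype (8 bytes per element). The helper `nbytes` below is a hypothetical function written for this illustration, not a dask API.

```python
from math import prod

ITEMSIZE = 8  # bytes per float64 element (assumed default dtype)


def nbytes(shape):
    """Size in bytes of a float64 array with the given shape."""
    return prod(shape) * ITEMSIZE


small = nbytes((100, 320, 384))      # the toy array
big = nbytes((500, 2400, 3600))      # big_ones
one_chunk = nbytes((10, 1200, 1800)) # a single chunk of big_ones

print(f"small array: {small / 1e6:.1f} MB")      # 98.3 MB
print(f"big array:   {big / 1e9:.2f} GB")        # 34.56 GB
print(f"one chunk:   {one_chunk / 1e6:.1f} MB")  # 172.8 MB
print(f"chunk count: {big // one_chunk}")        # 200
```

So although the full array is about 35 GB, each task only has to handle a chunk of roughly 173 MB.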

<div class="alert alert-block alert-warning">

Do not call `big_ones.visualize()` on this array! We repeat: do not visualize the graph for this array, because it is too big and may cause your notebook server to crash!

</div>

When doing a big calculation, dask also has some tools to help us understand what is happening under the hood.

In [None]:
from dask.diagnostics import ProgressBar

In [None]:
big_calc = (big_ones * big_ones[::-1, ::-1]).std()

with ProgressBar():
    result = big_calc.compute()
result
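Besides `ProgressBar`, `dask.diagnostics` also provides a `Profiler` that records each executed task. The sketch below assumes the default local scheduler (no distributed client) and uses a deliberately small stand-in array, `small_ones`, which is not part of the original tutorial, so that it runs quickly.

```python
import dask.array as da
from dask.diagnostics import Profiler

# A deliberately small stand-in for big_ones so the example runs quickly.
small_ones = da.ones((50, 240, 360), chunks=(10, 120, 180))
small_calc = (small_ones * small_ones[::-1, ::-1]).std()

with Profiler() as prof:
    result = small_calc.compute()

# Each entry records one executed task with its start and end time.
total_task_time = sum(t.end_time - t.start_time for t in prof.results)
print(result)
print(len(prof.results), "tasks,", round(total_task_time, 3), "s of task time")
```

Because every element is one, the standard deviation should come out as zero, which makes for an easy sanity check on the result.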

## Reduction

All the usual numpy methods work on dask arrays. You can also apply numpy functions directly to a dask array, and the result will stay lazy.

In [None]:
big_ones_reduce = (np.cos(big_ones)**2).mean(axis=(1, 2))
big_ones_reduce
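To see that the result really stays lazy, here is a minimal sketch on a small stand-in array (`small_ones`, an assumption made for this example so that it runs instantly):

```python
import numpy as np
import dask.array as da

small_ones = da.ones((10, 8, 8), chunks=(5, 8, 8))

# Applying a numpy ufunc to a dask array returns another dask array.
lazy = (np.cos(small_ones) ** 2).mean(axis=(1, 2))
print(type(lazy), lazy.shape)

# Values only exist after compute(); every entry equals cos(1)**2.
values = lazy.compute()
print(values)
```

Reducing over axes 1 and 2 leaves a one-dimensional result with one value per index along axis 0.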

Plotting also triggers computation, since we need the actual values.

In [None]:
from matplotlib import pyplot as plt
%matplotlib inline

In [None]:
plt.plot(big_ones_reduce)

## Going Further 

* [Documentation](http://dask.readthedocs.io/en/latest/array.html)
* [API reference](http://dask.readthedocs.io/en/latest/array-api.html)

<div class="alert alert-block alert-success">
  <p>Previous: <a href="01_overview.ipynb">Overview: Dask</a></p>
  <p>Next: <a href="03_distributed.ipynb">Distributed</a></p>
</div>