!["Anaconda"](img/anaconda-logo.png)
<br>
*Copyright Continuum 2012-2016 All Rights Reserved.*

# Dask: Graph Foundations

<img src="img/fail-case.gif" width=40% align="right">

*Dask is a way to represent computations as dictionaries, and then analyze and execute them.*

* Dask supports parallel computing.  Internally it executes graphs of tasks with data dependencies.  
* In this section we talk about what these graphs look like and how to construct them.  
* We finish with exercises manually building graphs that use basic Pandas functionality.  
* This is straightforward but somewhat tedious.  We'll automate it in future sections.

***You can safely skip this section if you don't care about how dask works internally.***

**Related Documentation**

*  [Dask graph specification](http://dask.pydata.org/en/latest/spec.html)
*  [Discussion on custom graphs](http://dask.pydata.org/en/latest/custom-graphs.html)

## Table of Contents
* [Dask: Graph Foundations](#Dask:-Graph-Foundations)
* [Normal Programming](#Normal-Programming)
	* [Make functions](#Make-functions)
	* [Call functions in code](#Call-functions-in-code)
* [Computation as a data structure](#Computation-as-a-data-structure)
* [Delayed Evaluation in Python](#Delayed-Evaluation-in-Python)
	* [Example 1: `eval()`](#Example-1:-eval%28%29)
	* [Example 2: `lambda`](#Example-2:-lambda)
	* [Example 3: `functools`](#Example-3:-functools)
	* [Dask delays evaluation](#Dask-delays-evaluation)
* [Defining dask graphs](#Defining-dask-graphs)
* [Executing dask graphs](#Executing-dask-graphs)
* [Analyzing and Visualizing Graphs](#Analyzing-and-Visualizing-Graphs)
* [Exercise 1: `read_csv`](#Exercise-1:-read_csv)
	* [Data prep](#Data-prep)
	* [File reads](#File-reads)
	* [Construct a dask graph](#Construct-a-dask-graph)
	* [Solution](#Solution)
	* [Execute your dask graph](#Execute-your-dask-graph)
* [Exercise 2: Sum of amounts](#Exercise-2:-Sum-of-amounts)
	* [Solution](#Solution)
* [Conclusion](#Conclusion)


# Normal Programming

Normally we write functions and then use those functions in linear code.  

The Python interpreter executes this code from the top down.

## Make functions

In [None]:
def inc(x):
    return x + 1

def add(x, y):
    return x + y

## Call functions in code

In [None]:
a = 1
b = inc(a)

x = 10
y = inc(x)

z = add(b, y)
z

Even though some of this work could have happened in parallel, Python went ahead and executed one line after the other sequentially.

If we want to execute code in parallel then we need to stop Python from taking control.

# Computation as a data structure

Instead of writing normal code we store the stages of the computation above as a Python dictionary where each function call becomes a Python tuple.

This is going to look a little strange but we'll have the entire computation stored in a Python data structure that we can manipulate with *other* Python code.

In [None]:
dsk = {'a': 1, 
       'b': (inc, 'a'),
       
       'x': 10,
       'y': (inc, 'x'),
       
       'z': (add, 'b', 'y')}

In [None]:
type(dsk)

We call a dictionary that looks like this a *dask graph*.  ***A dask graph is just a dictionary.***
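Because the graph is a plain dictionary, ordinary Python code can inspect it. The following is a minimal sketch, not part of dask's API: `is_task` is a hypothetical helper introduced here, and it applies the simple rule that a task is a tuple whose first element is callable. The functions and graph are repeated so the block runs on its own.

```python
def inc(x):
    return x + 1

def add(x, y):
    return x + y

dsk = {'a': 1,
       'b': (inc, 'a'),
       'x': 10,
       'y': (inc, 'x'),
       'z': (add, 'b', 'y')}

def is_task(value):
    """Hypothetical helper: a task is a non-empty tuple with a callable first element."""
    return isinstance(value, tuple) and len(value) > 0 and callable(value[0])

# Split the graph into literal data and deferred function calls
data_keys = sorted(k for k, v in dsk.items() if not is_task(v))
task_keys = sorted(k for k, v in dsk.items() if is_task(v))

print(data_keys)  # ['a', 'x']
print(task_keys)  # ['b', 'y', 'z']
```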

# Delayed Evaluation in Python

Representing function calls as tuples containing a function and its arguments may seem strange, but you are already familiar with the style.

## Example 1: `eval()`

In [None]:
# Sometimes we defer computations with strings
x = 15
y = 30
z = "x + y"
eval(z)

# The variable `z` stores a string that is a valid Python expression
# We call eval to evaluate `z` and obtain the answer.

## Example 2: `lambda`

In [None]:
# Sometimes we defer computations with a lambda

x = 15
y = 30
z = lambda: x + y
z

In [None]:
# z delays the execution of x + y until we call z()
# This is very similar to (add, 'x', 'y')
z()

## Example 3: `functools`

In [None]:
# Sometimes we use functools.partial

import functools
z = functools.partial(add, x, y)
z

In [None]:
z()

## Dask delays evaluation

In [None]:
# Dask delays evaluation with tuples
z = (add, x, y)
z
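A tuple like this does nothing until something unpacks it and calls the function. Here is a minimal sketch of that step, with `add`, `x`, and `y` redefined so the block runs on its own:

```python
def add(x, y):
    return x + y

x = 15
y = 30
z = (add, x, y)       # nothing is computed yet

func, *args = z       # split the tuple into the callable and its arguments
result = func(*args)  # now the call actually happens
print(result)         # 45
```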

# Defining dask graphs

To be fully explicit, here is the definition of a dask graph, taken from the [dask documentation](http://dask.pydata.org/en/latest/spec.html):

A **dask graph** is a dictionary mapping data-keys to values or tasks.

```python
{'x': 1,
 'y': 2,
 'z': (add, 'x', 'y'),
 'w': (sum, ['x', 'y', 'z'])}
```

A **key** can be any hashable value that is not a task.

```python
'x'
('x', 2, 3)
```

A **task** is a tuple with a callable first element. Tasks represent atomic units of work meant to be run by a single worker.

```python
(add, 'x', 'y')
```

We represent a task as a `tuple` such that the *first element is a callable function* (like `add`), and the succeeding elements are *arguments* for that function.

An **argument** to a task may be one of the following:

1. Any key present in the dask graph, like `'x'`
2. Any other value, like `1`, to be interpreted literally
3. Other tasks, like `(inc, 'x')`
4. A list of arguments, like `[1, 'x', (inc, 'x')]`

So all of the following are valid tasks:

```python
(add, 1, 2)
(add, 'x', 2)
(add, (inc, 'x'), 2)
(sum, [1, 2])
(sum, ['x', (inc, 'x')])
(np.dot, np.array([...]), np.array([...]))
```
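To make the four argument rules concrete, here is a rough sketch of a recursive evaluator. It is not dask's implementation: `evaluate` is a hypothetical name, and the sketch ignores caching, parallelism, and error handling.

```python
def inc(x):
    return x + 1

def add(x, y):
    return x + y

def evaluate(dsk, arg):
    """Hypothetical helper: resolve one task argument against the graph `dsk`."""
    # Rule 3 (and tasks in general): a tuple with a callable first element
    if isinstance(arg, tuple) and arg and callable(arg[0]):
        func, *args = arg
        return func(*(evaluate(dsk, a) for a in args))
    # Rule 4: a list of arguments is resolved element by element
    if isinstance(arg, list):
        return [evaluate(dsk, a) for a in arg]
    # Rule 1: a key present in the graph stands for that key's value
    try:
        if arg in dsk:
            return evaluate(dsk, dsk[arg])
    except TypeError:
        pass  # unhashable values cannot be keys
    # Rule 2: anything else is taken literally
    return arg

dsk = {'x': 1,
       'y': 2,
       'z': (add, 'x', 'y'),
       'w': (sum, ['x', 'y', 'z'])}

print(evaluate(dsk, 'z'))                   # 3
print(evaluate(dsk, 'w'))                   # 6
print(evaluate(dsk, (add, (inc, 'x'), 2)))  # 4
```

The order of the checks matters: tasks are tested before keys because a key may itself be a tuple, such as `('x', 2, 3)`.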

# Executing dask graphs

The dask library contains functions to execute these dictionaries in parallel with multiple threads or multiple processes.

In [None]:
from dask.threaded import get
get(dsk, 'z')  # Execute in multiple threads

In [None]:
from dask.multiprocessing import get
get(dsk, 'z')  # Execute in multiple processes

So as long as you're willing to write code in this funny way with dictionaries, dask will run your separate functions in parallel.
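To see where the parallelism comes from, note that `'b'` and `'y'` in `dsk` do not depend on each other. The sketch below uses the standard library's `concurrent.futures` rather than dask, purely to illustrate the idea. Here the independence is hard-coded, whereas a dask scheduler works it out from the dictionary itself.

```python
from concurrent.futures import ThreadPoolExecutor

def inc(x):
    return x + 1

def add(x, y):
    return x + y

with ThreadPoolExecutor(max_workers=2) as pool:
    # 'b' and 'y' are independent, so both can be submitted at once
    future_b = pool.submit(inc, 1)
    future_y = pool.submit(inc, 10)
    # 'z' needs both results, so it has to wait for them
    z = add(future_b.result(), future_y.result())

print(z)  # 13
```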

# Analyzing and Visualizing Graphs

Because our computation is just a dictionary, we can write arbitrary functions that analyze it.  A simple yet common operation is to render the computation as a visual graph.

In [None]:
# Requires that you have pydot and graphviz installed
# !conda install graphviz pydot

In [None]:
# If you don't want to install graphviz, feel free to skip this cell!
from dask.dot import dot_graph
dot_graph(dsk)
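Visualization is only one kind of analysis. As another illustration, the sketch below extracts each key's direct dependencies and derives a valid execution order using `graphlib` from the standard library (Python 3.9 or later). The helper `dependencies` is a hypothetical name, and it handles only flat tasks whose arguments are string keys or literals, which is all that `dsk` contains.

```python
from graphlib import TopologicalSorter

def inc(x):
    return x + 1

def add(x, y):
    return x + y

dsk = {'a': 1,
       'b': (inc, 'a'),
       'x': 10,
       'y': (inc, 'x'),
       'z': (add, 'b', 'y')}

def dependencies(dsk, key):
    """Hypothetical helper: keys that the task stored under `key` refers to directly."""
    value = dsk[key]
    if isinstance(value, tuple) and value and callable(value[0]):
        return {arg for arg in value[1:] if isinstance(arg, str) and arg in dsk}
    return set()  # literal data depends on nothing

deps = {key: dependencies(dsk, key) for key in dsk}

# static_order() yields every key after all of the keys it depends on
order = list(TopologicalSorter(deps).static_order())

print(deps['z'])  # the set containing 'b' and 'y'
print(order)      # one valid order; 'z' always comes last
```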

That's it
----------

The rest of this tutorial contains fancy ways to construct and execute dask graphs.  We won't make any more by hand after this notebook.  

If you'd like to learn more, read the [dask graph spec](http://dask.pydata.org/en/latest/spec.html).

# Exercise 1: `read_csv`

As an exercise we'll parallelize some basic Pandas code by rewriting it as a dask graph.  

* This will be a little tedious but should give us speed-ups right away.  
* In future sections we'll learn how dask submodules like `dask.dataframe` automate this work for us.

There are three CSV files in your `tmp` directory (the data prep cell below creates them if they don't exist).

* We want to count the total number of rows across all of these CSV files.
* In normal Python we solve this problem in the following way...

## Data prep

In [None]:
from src.dask_prep import accounts_csvs  # Prep data if it doesn't exist
accounts_csvs(3, 1000000, 500)

## File reads

In [None]:
import pandas as pd
import os
filenames = [os.path.join('tmp', 'accounts.%d.csv' % i) for i in [0, 1, 2]]
filenames

In [None]:
!ls -lh tmp/accounts*

In [None]:
pd.read_csv(filenames[0], nrows=5)  # a sample of the first file

In [None]:
%%time 
a = pd.read_csv(filenames[0])
b = pd.read_csv(filenames[1])
c = pd.read_csv(filenames[2])

na = len(a)
nb = len(b)
nc = len(c)

total = sum([na, nb, nc])
total

## Construct a dask graph

Construct a dask graph/dictionary for this computation.

Just as we turned code that looks like 

```python
y = f(x)
```

into dictionaries like 

```python
{'y': (f, 'x')}
```

We can transform the above calls to `pd.read_csv`, `len`, and `sum` into a dictionary of tuples

```python
dsk = {'a': (pd.read_csv, filenames[0]),
       'b': ...,
       ...
       'total': ...}
```

In [None]:
# Enter your solution here: should just be a dictionary!
dsk = {  }



## Solution

In [None]:
%load solutions/Foundations-01.py


## Execute your dask graph

We execute dask graphs with the `get` functions.  There is one `get` function for multi-threading and another for multi-processing.  Each takes two arguments:

    get(dsk, output_key)

Run the following cells and see how each get function performs.  Why is there a difference?

In [None]:
from dask.threaded import get
%time get(dsk, 'total')

In [None]:
from dask.multiprocessing import get
%time get(dsk, 'total')

# Exercise 2: Sum of amounts

As a slightly more complex example we'll compute the sum of the `amount` column in each CSV file and then add up these sums to get the total amount over all CSV files.

To make the graph construction slightly more challenging we'll use Python for loops rather than write out every entry by hand.

In normal sequential code we might execute the following:

In [None]:
sums = list()
for fn in filenames:
    df = pd.read_csv(fn)
    sums.append(df.amount.sum())
total = sum(sums)
total


Now create the same computation as a dask graph.

We suggest writing a small function that computes the sum of the `amount` column of a dataframe and using this function in your dask graph.

In [None]:
def amount_sum(df):
    return df.amount.sum()

```python
dsk = dict()

for fn in filenames:
    dsk[...] = ...
...
```
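If the loop pattern is unfamiliar, here is the same idea on an unrelated toy problem, so as not to give away the exercise: square several numbers and add up the results. The key names (`'square-0'` and so on) and the small `resolve` helper are hypothetical. `resolve` stands in for a dask `get` function only so that the block runs without dask.

```python
def square(n):
    return n * n

numbers = [2, 3, 4]

dsk = dict()
for i, n in enumerate(numbers):
    dsk['square-%d' % i] = (square, n)  # one task per input
dsk['total'] = (sum, ['square-%d' % i for i in range(len(numbers))])

def resolve(dsk, arg):
    """Hypothetical stand-in for a get function; handles only what this graph uses."""
    if isinstance(arg, list):
        return [resolve(dsk, a) for a in arg]
    if isinstance(arg, str) and arg in dsk:
        return resolve(dsk, dsk[arg])
    if isinstance(arg, tuple) and arg and callable(arg[0]):
        return arg[0](*(resolve(dsk, a) for a in arg[1:]))
    return arg  # a literal value

print(sorted(dsk))            # ['square-0', 'square-1', 'square-2', 'total']
print(resolve(dsk, 'total'))  # 29
```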

## Solution

In [None]:
%load solutions/Foundations-02.py

In [None]:
get(dsk, 'total')

# Conclusion

We've learned how dask graphs represent computations and how we can execute these computations with dask schedulers / `get` functions.  We've made a few of these dictionaries by hand.  It's straightforward but perhaps tiresome.

In the next sections we'll play with systems that generate these dictionaries for us.
