Exercises - Dask Delayed
========================

**Author:** Steffen Schober



## Motivation



We start with a simple example:



In [None]:
from time import sleep


def inc(x):
    sleep(1)
    return x + 1


def add(x, y):
    sleep(1)
    return x + y

In [None]:
%%time
# This takes three seconds to run because we call each
# function sequentially, one after the other
x = inc(1)
y = inc(2)
z = add(x, y)

The running time could clearly be improved
if `inc(1)` and `inc(2)` ran in parallel, since neither depends on the other.
Let's implement this with Dask.



## Dask delayed



First some imports.



In [None]:
import dask
from dask import delayed

### First example



To make a function lazy, we wrap the Python function with `dask.delayed`:



In [None]:
%%time

x = delayed(inc)(1)
y = delayed(inc)(2)
z = delayed(add)(x, y)

Note that no computation has been performed so far; only
the task graph has been created. The following requires `graphviz` to be installed:



In [None]:
z.visualize()

To trigger the computation we call the method `compute`:



In [None]:
%%time
# This actually runs our computation using a local thread pool
z.compute()
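To check that wrapping a function really defers its execution, the following minimal sketch records every actual call in a list. The `calls` list and the simplified `inc` without `sleep` are assumptions made for illustration; they are not part of the exercise code.

```python
from dask import delayed

calls = []  # hypothetical log of every real execution of inc


def inc(x):
    calls.append(x)
    return x + 1


lazy = delayed(inc)(1)       # only builds a task, inc is not called yet
calls_before = list(calls)   # snapshot taken before any computation
result = lazy.compute()      # now inc actually runs

print(type(lazy).__name__, calls_before, calls, result)
```

The snapshot taken before `compute()` is empty, while the log contains the argument once the computation has been triggered.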

### Second example



Here is another example, using the `delayed` decorator:



In [None]:
import numpy as np
from numpy import random


@delayed
def func1(x):
    # processing item x takes a random time between 0 and 1 second
    duration = random.rand()
    sleep(duration)
    # return the processed item
    return 2 * x
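The decorator form and the explicit wrapping used in the first example are two spellings of the same idea. A small sketch, using the hypothetical functions `triple` and `triple_lazy` (not part of the exercise), might look like this:

```python
import dask
from dask import delayed


def triple(x):
    return 3 * x


@delayed
def triple_lazy(x):
    return 3 * x


a = delayed(triple)(5)   # wrap the plain function at the call site
b = triple_lazy(5)       # the decorated function is lazy on every call

results = dask.compute(a, b)
print(results)
```

Wrapping at the call site keeps the original function usable eagerly, whereas the decorator makes every call lazy.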

Before you execute the next cell, make a guess for the processing time:



In [None]:
%%time
[func1(i).compute() for i in range(10)]

Maybe not what you expected: each `compute()` call blocks until its task has finished, so the tasks still run one after the other.

Here is how to run the tasks in parallel:



In [None]:
%%time
dask.compute(*[func1(i) for i in range(10)])

This should be much faster, because all tasks are handed to the scheduler in a single call.
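To make the difference measurable without random noise, the sketch below replaces the random duration with a fixed delay. The names `slow_double`, `DELAY` and `N_TASKS` are assumptions introduced for illustration, and the thread count is set explicitly so that the outcome does not depend on the number of CPU cores.

```python
import time

import dask
from dask import delayed

DELAY = 0.2    # fixed delay in seconds (assumption, replaces the random one)
N_TASKS = 4


@delayed
def slow_double(x):
    # hypothetical stand-in for func1 with a predictable duration
    time.sleep(DELAY)
    return 2 * x


tasks = [slow_double(i) for i in range(N_TASKS)]

# One compute() call per task: each call blocks until its task is done.
start = time.perf_counter()
one_by_one = [t.compute() for t in tasks]
time_one_by_one = time.perf_counter() - start

# A single dask.compute() call hands all tasks to the scheduler at once.
start = time.perf_counter()
all_at_once = dask.compute(*tasks, scheduler="threads", num_workers=N_TASKS)
time_all_at_once = time.perf_counter() - start

print(one_by_one, f"{time_one_by_one:.2f}s")
print(list(all_at_once), f"{time_all_at_once:.2f}s")
```

Both variants return the same values; only the elapsed time differs, roughly `N_TASKS * DELAY` for the first and about one `DELAY` for the second.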



## Tasks



### Parallelizing a for-loop



In the example below we iterate through a list of inputs. If an input is even, we want to call `double`. If it is odd, we want to call `inc`. This `is_even` decision has to be made immediately (not lazily) so that our graph-building Python code can proceed.



In [None]:
def double(x):
    sleep(1)
    return 2 * x


def is_even(x):
    return not x % 2


data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
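Why the even/odd decision cannot be lazy can be seen in a short sketch: an `if` statement needs a concrete `True` or `False`, which a `Delayed` object does not provide. The variable names `lazy_flag`, `eager_flag` and `branching_worked` are chosen for illustration only.

```python
from dask import delayed


def is_even(x):
    return not x % 2


lazy_flag = delayed(is_even)(4)   # a Delayed object, its value is not known yet

try:
    if lazy_flag:                 # Python asks for a truth value here
        pass
    branching_worked = True
except TypeError as err:
    branching_worked = False
    print("cannot branch on a lazy value:", err)

# Calling is_even directly yields a plain bool that an if statement can use.
eager_flag = is_even(4)
print(eager_flag)
```

Since `is_even` is cheap, calling it eagerly while building the graph costs almost nothing.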

In [None]:
%%time
# Sequential code

results = []
for x in data:
    if is_even(x):
        y = double(x)
    else:
        y = inc(x)
    results.append(y)

total = sum(results)
print(total)

**Task**: parallelize the sequential code above using `dask.delayed`.
You will need to delay some functions, but not all.



### Reading data



We start by preparing some data.

1.  Make sure that `prep.py` is in the same directory as this notebook.
2.  Create a directory `data` and run the following cell:



In [None]:
%run prep.py -d accounts

In [None]:
import os

import pandas as pd

filenames = [os.path.join("data", "accounts.%d.csv" % i) for i in [0, 1, 2]]
filenames

In [None]:
%%time

# normal, sequential code
a = pd.read_csv(filenames[0])
b = pd.read_csv(filenames[1])
c = pd.read_csv(filenames[2])

na = len(a)
nb = len(b)
nc = len(c)

total = sum([na, nb, nc])
print(total)

**Task**: Recreate this computation as a task graph by applying the `delayed` function to the original Python code.
The three functions you want to delay are `pd.read_csv`, `len` and `sum`.

