# Dask Schedulers

In this notebook we demonstrate how to set up different dask schedulers.
* A few words about dask schedulers
* Dask Schedulers on a single machine
    * local threads
    * local processes
    * single thread
* Apply scheduler options to weather station data
* Distributed schedulers (local)

-----

- Authors: NCI Virtual Research Environment Team
- Keywords: Dask, Dataframe, schedulers
- Creation Date: 2020-May
- Lineage/Reference: This tutorial is adapted from the [dask-tutorial](https://github.com/dask/dask-tutorial). Note that ``dask`` is a rapidly evolving library, so information in this tutorial may be out of date at the time of viewing.

----

## Schedulers

In the previous notebooks, we used `dask.delayed` and `dask.dataframe` to parallelise computations.
This work built a *task graph* instead of executing immediately, with each *task* representing a function to call on some data. The full *graph* shows the relationship between all of the different tasks.

When we wanted the actual result, we called `.compute()` or `.load()`, which handed the task graph off to a *scheduler*.

**Schedulers are responsible for running a task graph and producing a result**.

![](https://raw.githubusercontent.com/dask/dask-org/master/images/grid_search_schedule.gif)
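To make the idea of "running a task graph" concrete, here is a minimal sketch in plain Python. It is not Dask's implementation: the dictionary layout only loosely imitates Dask's dict-based graphs, and `toy_get` is a hypothetical single-threaded scheduler written for illustration.

```python
from operator import add, mul

# A toy task graph: each key maps either to a literal value or to a
# task tuple of the form (function, *arguments).
graph = {
    "x": 1,
    "y": 2,
    "sum": (add, "x", "y"),      # depends on x and y
    "result": (mul, "sum", 10),  # depends on sum; 10 is a literal
}


def toy_get(graph, key, cache=None):
    """Hypothetical scheduler: compute `key`, resolving dependencies first."""
    cache = {} if cache is None else cache
    if key in cache:
        return cache[key]
    node = graph[key]
    if isinstance(node, tuple) and callable(node[0]):
        func, *args = node
        # Arguments that name other graph keys are computed first;
        # anything else is passed through unchanged.
        resolved = [
            toy_get(graph, a, cache) if isinstance(a, str) and a in graph else a
            for a in args
        ]
        value = func(*resolved)
    else:
        value = node
    cache[key] = value
    return value


answer = toy_get(graph, "result")
print(answer)  # 30
```

A real scheduler does the same bookkeeping but also decides which independent tasks can run at the same time, and on which thread, process, or machine.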

First, there are single machine schedulers that execute things in parallel using threads or processes (or synchronously for debugging). These are what we've used up until now. Second, there's the `dask.distributed` scheduler, which is newer and has more features than the single machine scheduler.

In this notebook we'll first talk about the different schedulers. Then we'll use the `dask.distributed` scheduler in more depth.

### Local Schedulers

Dask separates computation description (task graphs) from execution (schedulers). This allows you to write code once, and run it locally or scale it out across a cluster.

Dask has two families of task schedulers:

- Single machine scheduler: This scheduler provides basic features on a local process or thread pool. This scheduler was made first and is the default. It is simple and cheap to use, although it can only be used on a single machine/node and does not scale.

- Distributed scheduler: This scheduler is more sophisticated, offers more features, but also requires a bit more effort to set up. It can run locally or distributed across a cluster.

For different computations you may find better performance with particular scheduler settings. This lesson helps you understand how to choose between and configure different schedulers, and provides guidelines on which one might be more appropriate.

#### Local Threads

```python
import dask
dask.config.set(scheduler='threads')  # overwrite default with threaded scheduler
```

The threaded scheduler executes computations with a local `multiprocessing.pool.ThreadPool`. It is lightweight and requires no setup. It introduces very little task overhead (around 50$\mu$s per task) and, because everything occurs in the same process, it incurs no costs to transfer data between tasks. However, due to Python’s Global Interpreter Lock (GIL), this scheduler only provides parallelism when your computation is dominated by non-Python code, as is primarily the case when operating on numeric data in NumPy arrays, Pandas DataFrames, or using any of the other C/C++/Cython based projects in the ecosystem.

The threaded scheduler is the default choice for Dask Array, Dask DataFrame, and Dask Delayed. However, if your computation is dominated by processing pure Python objects like strings, dicts, or lists, then you may want to try one of the process-based schedulers below (we currently recommend the distributed scheduler on a local machine).
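The claim that threads incur no data-transfer cost can be checked with the standard library alone. The sketch below uses `multiprocessing.pool.ThreadPool` directly rather than Dask; `big_list` and `identity_of` are hypothetical names used only for illustration. Every worker thread sees the very same object, so nothing is copied.

```python
from multiprocessing.pool import ThreadPool

big_list = list(range(1_000_000))


def identity_of(obj):
    # id() identifies the object in this process's memory.
    return id(obj)


with ThreadPool(4) as pool:
    # Hand the same list to four tasks running on worker threads.
    ids = pool.map(identity_of, [big_list] * 4)

same_object = all(i == id(big_list) for i in ids)
print(same_object)  # True
```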

#### Local Processes

```python
import dask.multiprocessing
dask.config.set(scheduler='processes')  # overwrite default with multiprocessing scheduler
```

The multiprocessing scheduler executes computations with a local `multiprocessing.Pool`. It is lightweight to use and requires no setup. Every task and all of its dependencies are shipped to a local process, executed, and then the result is shipped back to the main process. This means that it is able to bypass issues with the GIL and provide parallelism even on computations that are dominated by pure Python code, such as those that process strings, dicts, and lists.

However, moving data to remote processes and back can introduce performance penalties, particularly when the data being transferred between processes is large. The multiprocessing scheduler is an excellent choice when workflows are relatively linear, and so do not involve significant inter-task data transfer, and when inputs and outputs are both small, like filenames and counts.
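A rough way to see why input and output size matters: anything sent to another process has to be serialized into bytes and rebuilt on the other side. The sketch below uses the standard `pickle` module as a stand-in for whatever serializer Dask uses internally (an assumption), and the filename is a made-up example.

```python
import pickle

small_input = "IDCJAC0009_070014_1800_Data.csv"  # hypothetical filename
large_input = list(range(1_000_000))             # stand-in for a big intermediate result

# Size of the byte stream that would have to cross the process boundary.
small_bytes = len(pickle.dumps(small_input))
large_bytes = len(pickle.dumps(large_input))
print(small_bytes, large_bytes)

# The receiving side gets an equal copy, not the original object.
copy = pickle.loads(pickle.dumps(large_input))
print(copy == large_input, copy is large_input)  # True False
```

Passing a filename and letting each process read its own file keeps the transferred payload tiny, which is the pattern the multiprocessing scheduler handles best.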

#### Single Thread

```python
import dask
dask.config.set(scheduler='synchronous')  # overwrite default with single-threaded scheduler
```

The single-threaded synchronous scheduler executes all computations in the local thread with no parallelism at all. This is particularly valuable for debugging and profiling, which are more difficult when using threads or processes.

For example, when using IPython or Jupyter notebooks, the `%debug`, `%pdb`, or `%prun` magics will not work well with the parallel Dask schedulers (they were not designed to be used in a parallel computing context). If you run into an exception and want to step into the debugger, rerun your computation under the single-threaded scheduler, where these tools function properly.

The schedulers above are all *local*: they run only on a single machine. We experimented with these in the Dask_02 lesson. The scheduler can be selected in a few different ways:

- By providing a `scheduler=` keyword argument to `compute`:

```python
max_rain.compute(scheduler='processes')
# or 
max_rain.compute(scheduler='synchronous')
```

- Using `dask.config.set` (which replaced the older `dask.set_options`):

```python
import dask

# Use multiprocessing in this block only
with dask.config.set(scheduler='processes'):
    max_rain.compute()
# Use the single-threaded scheduler globally
dask.config.set(scheduler='synchronous')
```

Here we repeat a simple dataframe computation from the previous section using the different schedulers:

In [1]:
import os
import dask.dataframe as dd

In [2]:
# Provide path to the ACT rainfall data used in Dask_05
filename = os.path.join('/g/data/dk92/notebooks/demo_data/', 'Weather_Stations_ACT','IDCJAC0009_*_*','IDCJAC0009*.csv')
df = dd.read_csv(filename)
# rename column headers
df.columns = ['code','station','year','month','day','rainfall','period','quality']
# Maximum rainfall
max_rain=df.rainfall.max()

In [3]:
max_rain

dd.Scalar<series-..., dtype=float64>

In [4]:
%time _ = max_rain.compute()  # this uses threads by default

CPU times: user 2.79 s, sys: 1.34 s, total: 4.13 s
Wall time: 2.82 s


In [5]:
import dask.multiprocessing
%time _ = max_rain.compute(scheduler='processes')  # this uses processes

CPU times: user 473 ms, sys: 102 ms, total: 575 ms
Wall time: 3.07 s


In [6]:
%time _ = max_rain.compute(scheduler='synchronous')  # This uses a single thread

CPU times: user 2.03 s, sys: 606 ms, total: 2.63 s
Wall time: 2.84 s


By default the threaded and multiprocessing schedulers use the same number of workers as cores. You can change this using the `num_workers` keyword in the same way that you specified `scheduler` above:

```python
max_rain.compute(scheduler='processes', num_workers=2)
```
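The selection mechanisms above can be tried without the rainfall data. The following is a small sketch that assumes `dask` is installed; `square` and `total` are hypothetical stand-ins for `max_rain`. It deliberately avoids the process-based scheduler so that it also runs as a plain script without a `__main__` guard.

```python
import dask
from dask import delayed


@delayed
def square(x):
    return x * x


# Build a small task graph lazily: nothing is computed yet.
total = delayed(sum)([square(i) for i in range(5)])

# 1. Choose the scheduler (and worker count) for a single call.
r1 = total.compute(scheduler="threads", num_workers=2)

# 2. Choose the scheduler for everything inside a block.
with dask.config.set(scheduler="synchronous"):
    r2 = total.compute()

print(r1, r2)  # 30 30
```

The result is the same whichever scheduler runs the graph; only how and where the tasks execute changes.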

To see how many cores you have on your computer, you can use `multiprocessing.cpu_count`:

In [7]:
from multiprocessing import cpu_count
cpu_count()

8

### Some Questions to Consider:

- How much speedup is possible for this task (hint: look at the graph)?
- Given how many cores are on this machine, how much faster could the parallel schedulers be than the single-threaded scheduler?
- How much faster was using threads over a single thread? Why does this differ from the optimal speedup?
- Why is the multiprocessing scheduler slower here, even though it reports far less CPU time in the main process?

---

## In what cases would you want to use one scheduler over another?

See the Dask documentation on single-machine schedulers: http://dask.pydata.org/en/latest/setup/single-machine.html

---

## Distributed Scheduler

The `dask.distributed` system is composed of a single centralized scheduler and many worker processes. [Deploying](http://dask.pydata.org/en/latest/setup.html) a remote Dask cluster involves some additional effort. But doing things locally just involves creating a `Client` object, which lets you interact with the "cluster" (local threads or processes on your machine). For more information see [here](http://dask.pydata.org/en/latest/setup/single-distributed.html).

In [None]:
from dask.distributed import Client

# Setup a local cluster.
# By default this sets up 1 worker per core
client = Client()
client.cluster

Creating a `Client` makes it the default scheduler. Any calls to `.compute` will then use the cluster your `client` is attached to (see http://dask.pydata.org/en/latest/scheduling.html for how to specify which scheduler to use).

In [9]:
%time max_rain.compute()

CPU times: user 615 ms, sys: 69.5 ms, total: 685 ms
Wall time: 2.49 s


322.1

#### Some Questions to Consider

- How does this compare to the optimal parallel speedup?
- Why is this faster than the threaded scheduler?

---

### Exercise

Run the following computations while looking at the diagnostics page (dask dashboard). In each case what is taking the most time?

In [10]:
import os
import dask
import pandas as pd
filename = os.path.join('/g/data/dk92/notebooks/demo_data/', 'Weather_Stations_ACT','IDCJAC0009_*_*','IDCJAC0009*.csv')
import dask.dataframe as dd
ddf = dd.read_csv(filename)
ddf.columns = ['code','station','year','month','day','rainfall','period','quality']

In [11]:
### 1.) How many rows are in our dataset?
len(ddf)

1754661

In [12]:
### 2.) In total, on how many days were records taken?
period = pd.read_parquet('/g/data/dk92/notebooks/demo_data/ACT_weather.parquet', columns=['period'], engine='pyarrow')
period.count()

period    418018
dtype: int64

In [13]:
### 3.) How many non-null `period` records does each weather station have?
ddf.groupby("station").period.count().compute()

station
42010       41
70000    10222
70011    11574
70014     8377
70015     6168
         ...  
70354     1010
71073     4099
72011     2606
72157      377
73148      243
Name: period, Length: 125, dtype: int64

In [14]:
### 4.) What was the average rainfall from each station?
ddf.groupby("station").rainfall.mean().compute()

station
42010    1.372022
70000    1.767808
70011    1.760142
70014    1.689061
70015    1.791328
           ...   
70354    1.790016
71073    2.526155
72011    2.510485
72157    2.541190
73148    1.709147
Name: rainfall, Length: 125, dtype: float64

### Close the client

Before moving on to the next exercise, make sure to close your client or stop this kernel.

In [15]:
client.close()

### More...

The distributed scheduler is more sophisticated than the single machine schedulers. It can compute asynchronously, and also provides an API similar to that of `concurrent.futures`. For further information, see the docs: http://distributed.readthedocs.io/en/latest/.
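For readers unfamiliar with `concurrent.futures`, the sketch below shows the submit/result pattern using only the standard library; `increment` is a hypothetical example function. The comments note the corresponding `dask.distributed.Client` methods, which are not run here.

```python
from concurrent.futures import ThreadPoolExecutor


def increment(x):
    return x + 1


with ThreadPoolExecutor(max_workers=2) as executor:
    # submit() returns immediately with a future; result() blocks until done.
    # dask.distributed's Client.submit follows the same pattern.
    future = executor.submit(increment, 10)
    single = future.result()

    # executor.map yields results directly. Client.map instead returns a
    # list of futures, which can be collected with Client.gather.
    many = list(executor.map(increment, range(3)))

print(single, many)  # 11 [1, 2, 3]
```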

## Reference

https://docs.dask.org/en/latest/scheduling.html