# Distributed Computation on Many Machines

## Outcomes

- an overview of the options for distributed compute in Python in 2022,
- a demonstration of an AWS/Dask/Coiled/Prefect stack that distributes compute over a cluster on EC2.


## Why distribute compute over many machines?

A single machine has a size limit: you cannot go beyond the largest instance your provider offers (for example, the largest instance on EC2).

Many small machines can be cheaper than the largest single machine, and together offer more capacity.

Modern distributed compute platforms tolerate the failure of individual workers - a single EC2 instance has no such fallback.

## Ecosystems

Spark:

- accessing Scala code with Python bindings,
- Databricks is a modern way to run Spark.

[Ray](https://docs.ray.io/en/latest/index.html) & [Dask](https://docs.dask.org/en/stable/):

- distributed compute frameworks,
- DAGs for computation.

TensorFlow & PyTorch:

- multi-GPU training,
- accessing C++ code with Python bindings.

Plus more - Celery, fanning work out over many AWS Lambda functions...
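To make "DAGs for computation" concrete, here is a minimal sketch using only the standard library. It is loosely modelled on the way Dask describes a task graph (a dictionary that maps keys to values or to tasks, where a task is a tuple of a callable followed by its arguments). The `get` function and the graph keys are hypothetical names for illustration; this is not Dask's scheduler, which also runs independent tasks in parallel.

```python
from operator import add


def inc(x):
    return x + 1


# Each key maps to a plain value or to a task: (callable, *arguments).
# Arguments that are themselves keys of the graph are dependencies.
graph = {
    "a": 1,
    "b": (inc, "a"),
    "c": (inc, "b"),
    "total": (add, "b", "c"),
}


def get(graph, key, cache=None):
    """Evaluate `key`, computing each dependency at most once (hypothetical helper)."""
    cache = {} if cache is None else cache
    if key in cache:
        return cache[key]
    node = graph[key]
    if isinstance(node, tuple) and node and callable(node[0]):
        func, *args = node
        # Resolve arguments that name other nodes; pass literals through unchanged.
        resolved = [
            get(graph, arg, cache) if isinstance(arg, str) and arg in graph else arg
            for arg in args
        ]
        result = func(*resolved)
    else:
        result = node
    cache[key] = result
    return result


result = get(graph, "total")
print(result)
```

Because the graph is plain data, a framework can inspect it before running anything, which is what lets a scheduler decide which tasks can run at the same time.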


## Our focus

A stack of Dask / Coiled / Prefect / EC2.

This requires two accounts: an AWS account and a Coiled account. A Prefect account is optional.


## Dask

Dask is an execution framework - one scheduler is responsible for assigning many tasks to many workers.

<center><img src="../assets/dask.png" alt="Drawing" style="width: 600px;"/></center>

While Dask is a core part of this stack (it gives us concurrent computation, both parallelism and async), we will not write any low-level Dask (or Dask DataFrame) code.
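The scheduler/worker split can be sketched on one machine with the standard library. This is a stand-in rather than Dask code: `ThreadPoolExecutor` plays the role of the scheduler plus its workers, and `square` is a hypothetical task. Dask's futures interface follows a similar submit-then-collect pattern, but across processes and machines.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def square(x):
    """A stand-in for one unit of work (one task)."""
    return x * x


# The executor hands each submitted task to whichever worker thread is free.
with ThreadPoolExecutor(max_workers=4) as pool:
    futures = {pool.submit(square, n): n for n in range(8)}
    # Results arrive in completion order, so key them by their input.
    results = {futures[future]: future.result() for future in as_completed(futures)}

ordered = [results[n] for n in range(8)]
print(ordered)
```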


## Coiled

<center><img src="../assets/many-machine/f1.png" alt="Drawing" style="width: 600px;"/></center>

Coiled manages the AWS infrastructure for running Dask clusters on EC2:

- it turns a `requirements.txt` into a *software environment* - a Docker image built with `pip install`.
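As a rough sketch of the software environment step, the helper below reads a `requirements.txt`-style string into a list of package specifiers and passes them to `coiled.create_software_environment`. The function names, the example pins and the environment name are assumptions made for illustration, and the Coiled call needs the `coiled` package and a logged-in account, so it is wrapped in a function that is not called here.

```python
def parse_requirements(text):
    """Return the package specifiers found in a requirements.txt-style string."""
    specs = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if line and not line.startswith("-"):  # skip blanks and pip options such as -r
            specs.append(line)
    return specs


def build_environment(name, requirements_path="requirements.txt"):
    """Ask Coiled to build a software environment (assumes `coiled login` was run)."""
    import coiled  # lazy import so the parsing helper works without Coiled installed

    with open(requirements_path) as handle:
        specs = parse_requirements(handle.read())
    coiled.create_software_environment(name=name, pip=specs)
    return specs


# Hypothetical file content, used only to exercise the parser.
example = """
# pinned to match the cluster
dask==2022.6.0
distributed==2022.6.0
pandas  # unpinned
"""
specs = parse_requirements(example)
print(specs)
```

Pinning versions in `requirements.txt` keeps the packages on the cluster in step with those on your local machine.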


## Prefect

Prefect acts as a wrapper around Dask. It offers more functionality than just Dask execution:

- scheduling,
- monitoring,
- intelligent re-execution of pipelines (aka back-filling).

Prefect 2.0 is currently in beta (not yet production ready) - it is the version we use here.


# Prefect & Dask on a Single Machine

Let's start by running the program from the last exercise of the previous notebook:

In [1]:
%%timeit -n 1 -r 1
!python ../src/naive.py

13.3 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
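Outside a notebook, the same single-run measurement that `%%timeit -n 1 -r 1` gives can be approximated with the standard library. This is a sketch: `time_command` is a hypothetical helper, and the command below is a trivial stand-in for `python ../src/naive.py`.

```python
import subprocess
import sys
import time


def time_command(command):
    """Run `command` once and return the wall-clock time in seconds."""
    start = time.perf_counter()
    subprocess.run(command, check=True)  # raises if the command exits non-zero
    return time.perf_counter() - start


# Stand-in command; swap in [sys.executable, "../src/naive.py"] to time the script.
elapsed = time_command([sys.executable, "-c", "pass"])
print(f"{elapsed:.2f} s")
```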


Now try a naive Prefect flow running on a local Dask cluster:

In [2]:
%%timeit -n 1 -r 1
!python ../src/naive_dask_prefect.py

23:25:57.837 | INFO    | prefect.engine - Created flow run 'micro-quail' for flow 'main'
23:25:57.838 | INFO    | prefect.task_runner.dask - Creating a new Dask cluster with `distributed.deploy.local.LocalCluster`
23:25:59.658 | INFO    | prefect.task_runner.dask - The Dask dashboard is available at http://127.0.0.1:8787/status
23:26:01.538 | INFO    | Flow run 'micro-quail' -  downloading http://www.nemweb.com.au/Data_Archive/Wholesale_Electricity/MMSDM/2021/MMSDM_2021_01/MMSDM_Historical_Data_SQLLoader/DATA/PUBLIC_DVD_DISPATCH_UNIT_SCADA_202101010000.zip
23:26:01.844 | INFO    | Flow run 'micro-quail' - Created task run 'download-ccd6cdb6-0' for task 'download'
23:26:02.312 | INFO    | Flow run 'micro-quail' - Submitted task run 'download-ccd6cdb6-0' for execution.
23:26:02.312 | INFO    | Flow run 'micro-quail' -  processing http://www.nemweb.com.au/Data_Archive/Wholesale_Electricity/MMSDM/2021/MMSDM_2021_01/MMSDM_Historical_Data_SQLLoader/DATA/PUBLIC_DVD_DISPATCH_UNIT_SCADA_2021010

Now let's use Prefect with `asyncio`:

In [3]:
%%timeit -n 1 -r 1
!python ../src/async_prefect.py

23:26:13.498 | INFO    | prefect.engine - Created flow run 'crouching-pig' for flow 'main'
23:26:13.502 | INFO    | prefect.task_runner.dask - Creating a new Dask cluster with `distributed.deploy.local.LocalCluster`
23:26:15.276 | INFO    | prefect.task_runner.dask - The Dask dashboard is available at http://127.0.0.1:8787/status
23:26:17.163 | INFO    | Flow run 'crouching-pig' -  downloading http://www.nemweb.com.au/Data_Archive/Wholesale_Electricity/MMSDM/2021/MMSDM_2021_01/MMSDM_Historical_Data_SQLLoader/DATA/PUBLIC_DVD_DISPATCH_UNIT_SCADA_202101010000.zip
23:26:17.473 | INFO    | Flow run 'crouching-pig' - Created task run 'download-ccd6cdb6-0' for task 'download'
23:26:17.938 | INFO    | Flow run 'crouching-pig' - Submitted task run 'download-ccd6cdb6-0' for execution.
23:26:17.938 | INFO    | Flow run 'crouching-pig' -  processing http://www.nemweb.com.au/Data_Archive/Wholesale_Electricity/MMSDM/2021/MMSDM_2021_01/MMSDM_Historical_Data_SQLLoader/DATA/PUBLIC_DVD_DISPATCH_UNIT_SCA
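The benefit of `asyncio` for a download-then-process pipeline can be sketched without Prefect. Everything here is an assumption for illustration: `download` simulates network latency with `asyncio.sleep`, the URLs are placeholders, and `process` is a trivial stand-in. It is not the content of `async_prefect.py`.

```python
import asyncio
import time


async def download(url):
    """Pretend to fetch `url`; the sleep stands in for waiting on the network."""
    await asyncio.sleep(0.3)
    return f"data from {url}"


async def process(payload):
    """Stand-in for parsing the downloaded file."""
    return len(payload)


async def pipeline(url):
    return await process(await download(url))


async def main(urls):
    # gather runs the pipelines concurrently: while one waits, others make progress.
    return await asyncio.gather(*(pipeline(url) for url in urls))


urls = [f"https://example.com/file-{i}.zip" for i in range(5)]  # placeholder URLs
start = time.perf_counter()
sizes = asyncio.run(main(urls))
elapsed = time.perf_counter() - start
print(sizes, f"{elapsed:.2f} s")
```

Run one after another, the five simulated downloads would take about 1.5 seconds. Run concurrently, they take roughly as long as a single download.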

# Prefect & Dask Running on a Coiled Cluster (Many Machines)

<center><img src="../assets/many-machine/f2.png" alt="Drawing" style="width: 600px;"/></center>

This requires a few accounts to get set up:

- AWS account - cluster will run on EC2,
- Coiled account - adds & manages AWS infrastructure needed for a Dask cluster.

Stack:

- EC2,
- Dask,
- Prefect,
- Coiled.
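The Dask and Coiled parts of this stack could be wired together roughly as below. This is a sketch under stated assumptions, not the content of `dask_coiled_prefect.py`: the helper names, the cluster name and the software environment name are hypothetical, the Prefect layer is left out, and `connect` is not called here because it needs the `coiled` package, a Coiled login and an AWS account.

```python
def cluster_options(n_workers=4, software=None, name="many-machines"):
    """Collect the keyword arguments for the cluster (hypothetical helper)."""
    options = {"name": name, "n_workers": n_workers}
    if software is not None:
        options["software"] = software  # name of a Coiled software environment
    return options


def connect(options):
    """Start a Coiled cluster on EC2 and return a Dask client connected to it."""
    import coiled  # lazy imports: only needed when a cluster is actually started
    from dask.distributed import Client

    cluster = coiled.Cluster(**options)
    return Client(cluster)


# Placeholder environment name; replace with one built in your own account.
options = cluster_options(n_workers=8, software="my-account/my-environment")
print(options)
```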

Example of running on a Coiled cluster:

In [1]:
%%timeit -n 1 -r 1
!python ../src/dask_coiled_prefect.py

Creating new software environment
Creating new ecr build
STEP 1: FROM coiled/default:sha-6b4e896
STEP 2: COPY environment.yml environment.yml
--> Using cache 715f431717389d9e2bbe2078c5d019fd29c5293c30a365b9bf87ea9ad21e9f3e
--> 715f4317173
--> Using cache 661e893e3cae54adf001e0665a01de463f3564fe632ee5ea91f90b696d835131
STEP 3: RUN conda env update -n coiled -f environment.yml     && rm environment.yml     && conda clean --all -y     && echo "conda activate coiled" >> ~/.bashrc
--> 661e893e3ca
STEP 4: ENV PATH /opt/conda/envs/coiled/bin:$PATH
--> 8e5159d1bd9
--> Using cache 8e5159d1bd97f7cfb7cb455f2c934ec4c74139efa8d1d34c111818b03c89cd00
--> 539154a4062
--> Using cache 539154a40629d3406444a41a4ee44d84aeb160dfe9f86333537c67f3f23a18f0
STEP 5: SHELL ["conda", "run", "-n", "coiled", "/bin/bash", "-c"]
STEP 6: COPY requirements.txt requirements.txt
--> 7e09dc82cfd
STEP 7: RUN pip install -r requirements.txt     && rm requirements.txt
Downloading Babel-2.10.3-py3-none-any.whl (9.5 MB)
━━━━━━━━

# Setting up the AWS/Dask/Coiled/Prefect stack

## AWS Setup

The prerequisite is an AWS account.

First, set up a new IAM user (below I call this user `coiled`) with programmatic access (access key + secret key) - remember to download or copy your credentials to CSV!

We will use this user to manage & run the Coiled cluster on EC2.

Next, create the IAM policies & AWS infrastructure that let you run Dask clusters in your AWS account.

[Coiled AWS setup](https://docs.coiled.io/user_guide/aws-cli.html). 

[Coiled IAM policies](https://docs.coiled.io/user_guide/aws_reference.html) - one is for setting up the IAM user (you don't need it if you are using credentials with admin access):

- create 2 IAM policies, `coiled-setup` & `coiled-ongoing`, from JSON,
- attach the policies to your IAM user.
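If you prefer to script the "attach policies" step rather than click through the console, a sketch with `boto3` might look like this. It assumes the two policies already exist as customer-managed policies in your account and that your local AWS credentials are allowed to modify IAM. The helper names and the account ID are placeholders, and `attach_policies` is not called here.

```python
def policy_arn(account_id, policy_name):
    """Build the ARN of a customer-managed IAM policy."""
    return f"arn:aws:iam::{account_id}:policy/{policy_name}"


def attach_policies(user_name, account_id, policy_names=("coiled-setup", "coiled-ongoing")):
    """Attach each named policy to the IAM user (needs boto3 and AWS credentials)."""
    import boto3  # lazy import so the ARN helper runs without boto3 installed

    iam = boto3.client("iam")
    arns = [policy_arn(account_id, name) for name in policy_names]
    for arn in arns:
        iam.attach_user_policy(UserName=user_name, PolicyArn=arn)
    return arns


# 123456789012 is a placeholder account ID.
arn = policy_arn("123456789012", "coiled-setup")
print(arn)
```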


## Coiled account setup

Create a Coiled account at https://cloud.coiled.io/signup, then add your credentials in *Cloud Provider*.

Or do the same thing via the shell:

<center><img src="../assets/many-machine/f3.png" alt="Drawing" style="width: 600px;"/></center>

```shell
$ pip install coiled
# paste your Coiled API token when prompted
$ coiled login
# connect Coiled to your AWS account
$ coiled setup aws
```

I wasn't sure how to configure the `region` through the browser *Cloud Provider* page.


## Login to Coiled in your local environment

Tested on Python 3.8.12.

Create a Coiled API token at https://cloud.coiled.io/profile:

```shell
$ pip install coiled
#  use token here
$ coiled login
```

Now you can run the Dask example:

In [6]:
!python ../src/dask_coiled.py


+-------------+----------------+----------------+----------------+
| Package     | client         | scheduler      | workers        |
+-------------+----------------+----------------+----------------+
| dask        | 2022.8.0       | 2022.6.0       | 2022.6.0       |
| distributed | 2022.8.0       | 2022.6.0       | 2022.6.0       |
| lz4         | None           | 4.0.0          | 4.0.0          |
| msgpack     | 1.0.4          | 1.0.3          | 1.0.3          |
| numpy       | 1.23.2         | 1.21.6         | 1.21.6         |
| pandas      | 1.4.3          | 1.4.2          | 1.4.2          |
| python      | 3.8.13.final.0 | 3.9.13.final.0 | 3.9.13.final.0 |
| toolz       | 0.12.0         | 0.11.2         | 0.11.2         |
+-------------+----------------+----------------+----------------+
Notes: 
-  msgpack: Variation is ok, as long as everything is above 0.6
Dask Dashboard: http://3.88.37.65:8787
                    x         y
name                           
Alice     1047.46660
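The table in the output above is Dask reporting that package versions differ between the client (your machine) and the scheduler and workers (the cluster). One way to narrow that gap is to pin the cluster's software environment to what is installed locally. The sketch below lists local versions with the standard library; `local_versions` and `as_pins` are hypothetical helpers, and the package list is copied from the table.

```python
from importlib import metadata


def local_versions(packages):
    """Map each package name to its installed version, or None if not installed."""
    versions = {}
    for package in packages:
        try:
            versions[package] = metadata.version(package)
        except metadata.PackageNotFoundError:
            versions[package] = None
    return versions


def as_pins(versions):
    """Turn {name: version} into requirements.txt-style pins, skipping missing packages."""
    return [f"{name}=={version}" for name, version in versions.items() if version is not None]


packages = ["dask", "distributed", "lz4", "msgpack", "numpy", "pandas", "toolz"]
versions = local_versions(packages)
pins = as_pins(versions)
print("\n".join(pins))
```

The printed pins can be pasted into the `requirements.txt` that Coiled uses to build the software environment. The Python version itself is set by the base image rather than by `pip`, so it has to be matched separately.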

## Optional - Adding Prefect Cloud

<center><img src="../assets/many-machine/f4.png" alt="Drawing" style="width: 600px;"/></center>


```shell
# authenticate first, then select the workspace
$ prefect cloud login -k $YOUR_PREFECT_API_KEY

$ prefect cloud workspace set --workspace "adamgreenadgefficiencycom/kiwipycon-tutorial"
```


## Exercise

1. Set up this Dask/Coiled stack on an EC2 cluster,
2. Add Prefect Cloud.