<a href="https://www.dask.org/" target="_blank">
<img src="http://dask.readthedocs.io/en/latest/_images/dask_horizontal.svg"
     align="right"
     width="30%"
     alt="Dask logo">
</a>

# Scalability

In this notebook, we explain how Dask scales from multi-core local machines to large distributed clusters in the cloud for large-scale data analytics.

**Introduction**

Dask employs the **client-server model** to map computations to multiple cores in a single machine or distributed clusters. 

* On the client side, a Python application or notebook sends tasks (computations) to a **Dask Cluster**.
* A Dask Cluster is composed of a **scheduler** and **workers**.
* The scheduler receives tasks (computations) and decides which worker will perform each task.
* The workers perform computations and store results or share them with other workers.

<center>
<img src="img/dask-cluster.png" width="80%"/>
</center>
<center>
<a href="https://tutorial.dask.org/00_overview.html" target="_blank" width="30%"> Reference: Dask Tutorial Documentation </a>
</center>
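
The roles described above can be sketched in a few lines. This is a minimal illustration rather than part of the original tutorial: the helper `square` is hypothetical, and `processes=False` is an assumption made here so that the scheduler and the workers run as threads inside the current Python process.

```python
from dask.distributed import Client, LocalCluster


def square(n):
    """Toy task (hypothetical) that a worker will execute."""
    return n * n


# Cluster side: one scheduler plus two workers, all inside this process.
cluster = LocalCluster(n_workers=2, threads_per_worker=1, processes=False)

# Client side: connect to the scheduler and send it a task.
client = Client(cluster)
future = client.submit(square, 7)  # the scheduler picks a worker for the task
result = future.result()           # block until the worker returns the value
print(result)

# Release the resources once the work is done.
client.close()
cluster.close()
```

The client only holds a `future` until `result()` is called; the value itself is computed and stored on a worker.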

**Content**

1. Cluster in a Local Machine.
2. Cluster in a High Performance Computing System.
3. Cluster in a Cloud Computing System.

## 1. Cluster in a Local Machine

The local cluster is the best option for researchers who are **just learning Dask** or **just starting a large-scale data analysis**. This configuration lets you run preliminary tests before deploying the solution on large supercomputers or cloud computing infrastructure.

You can define a local cluster in two ways [2]:

* Implicitly: Dask creates a default local cluster for you.
* Explicitly: you define the local cluster yourself using the Dask library.

### 1.1. Implicit cluster definition

In the implicit mode, the user doesn't have to define the cluster. Once the user defines a computation and calls the `compute` method, Dask runs it on a default single-machine scheduler, which plays the role of a default local cluster. By default it uses all of the computer's available cores. More information can be found in [2]. Let's take a look.

__1. Import required libraries, define required variables and functions__

In [None]:
import dask.array as da

In [None]:
x = da.random.random((10000,10000), chunks=(1000,1000))
x

__2. Visualize the computations to be performed per array chunk.__

_Hint: These computations will be performed on the cores available in your computer._

In [None]:
# x.visualize() # not working yet
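
While the graph drawing is unavailable (`visualize` typically needs the optional Graphviz dependency), the chunk layout can still be inspected directly from the array's attributes. A small sketch, reusing the same array definition as above:

```python
import dask.array as da

x = da.random.random((10000, 10000), chunks=(1000, 1000))

print(x.chunksize)    # shape of a single chunk
print(x.numblocks)    # number of chunks along each axis
print(x.npartitions)  # total number of chunks
```

Nothing is computed here: these attributes only describe how the array is split, which determines how many independent pieces of work the scheduler can hand out.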

__3. Compute the result__

_Hint: Dask submits all the computations to the scheduler, which is then in charge of distributing them among the available computing cores._

In [None]:
x.compute() # This uses Dask default local cluster

__4. Get `compute` method documentation for available parameters.__

_Hint: Advanced users can choose to execute their computations using `threads` or `processes` by setting the scheduler parameter, `.compute(scheduler=)`. They can also run their computations sequentially by setting the scheduler option to `synchronous`._

In [None]:
help(x.compute)
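
As a rough illustration of the `scheduler` parameter, the sketch below runs the same task graph with two of the single-machine schedulers. The array of ones is an assumption made here so that the result is deterministic; it is not part of the original notebook.

```python
import dask.array as da

# A small deterministic array so every scheduler yields the same total.
ones = da.ones((2000, 2000), chunks=(500, 500))
total = ones.sum()

results = {}
for name in ("threads", "synchronous"):
    # Same task graph, different single-machine scheduler.
    results[name] = total.compute(scheduler=name)

print(results)
```

`scheduler="processes"` is accepted in the same way. In a standalone script it generally needs to run under an `if __name__ == "__main__":` guard, which is why it is left out of this sketch.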

### 1.2. Explicit cluster definition

Advanced users often prefer to define the cluster themselves, i.e., explicitly, because this gives them more flexibility in configuring it. For example, they can define how many **workers** they want to use and how many `cores` and how much `memory` each worker should have. In addition, when you define the cluster yourself, you are able to access the **Dask Dashboard**.

In the explicit mode, you can define a cluster in one of two ways:
* By defining the client
* By defining the cluster and the client
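
To make the configuration options concrete, here is a hedged sketch of an explicitly sized local cluster. The sizes are illustrative only, and `processes=False` is an assumption made here to keep the workers inside the current process for a quick test.

```python
from dask.distributed import Client, LocalCluster

# Illustrative sizes only: 2 workers, 2 threads each, about 1 GiB per worker.
cluster = LocalCluster(
    n_workers=2,
    threads_per_worker=2,
    memory_limit="1GiB",
    processes=False,  # assumption: keep workers in-process for a quick test
)
client = Client(cluster)

client.wait_for_workers(2)
threads_per_worker = client.nthreads()  # maps worker address -> thread count
print(threads_per_worker)
print(client.dashboard_link)  # URL of the Dask Dashboard

client.close()
cluster.close()
```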

#### 1.2.1. By defining the client

__1. Import required libraries, define required variables and functions__

In [None]:
import dask
from dask.distributed import Client

__2. Create a cluster by defining the client__

_Hint: Creating the client implies creating a cluster._

_Note: **Be careful** not to execute the cell below multiple times, since each run creates a new cluster._

_Activities:_ 

1. Run the cell below.
2. Use the option `Launch dashboard in JupyterLab`; this will display the Dask Dashboard.
3. In the Dashboard, change the IP address from `127.0.0.1` to `192.5.87.217`.
4. The Dashboard has plenty of options to evaluate the performance of our cluster when performing a computation. Take a look at some of the options, for instance `cluster`.

In [None]:
client = Client()
client

__3. Perform some computation__

_Hint: Use the Dashboard to see what is happening while your computation is taking place._

In [None]:
# generate random timeseries of data
df = dask.datasets.timeseries("2000", "2005", partition_freq="2w").persist()

# perform a groupby with an aggregation
df.groupby("name").aggregate({"x": "sum", "y": "max"}).compute()

__4. Shut down the cluster__

_Hint: This step is **MANDATORY**. Once you finish using a cluster, you must shut it down to release the computing resources it was using._

In [None]:
# Shutdown the cluster 
client.shutdown()

__5. Close the connection between the client and the cluster__

In [None]:
client.close()

#### 1.2.2. By defining the cluster and the client

__1. Import required libraries, define required variables and functions__

In [None]:
import dask
from dask.distributed import Client, LocalCluster

__2. Create a Dask cluster__

In [None]:
cluster = LocalCluster()  # Launches a scheduler and workers locally
cluster

__3. Create a Dask Client and connect the client to the Dask cluster__

In [None]:
client = Client(cluster)  # Connect to distributed cluster and override default
client

__4. Perform some computation__

_Hint: Use the Dashboard to see what is happening while your computation is taking place._

In [None]:
# generate random timeseries of data
df = dask.datasets.timeseries("2000", "2005", partition_freq="2w").persist()

# perform a groupby with an aggregation
df.groupby("name").aggregate({"x": "sum", "y": "max"}).compute()

__5. Shut down the cluster__

_Hint: This step is **MANDATORY**. Once you finish using a cluster, you must shut it down to release the computing resources it was using._

In [None]:
cluster.close()

__6. Close the connection between the client and the cluster__

In [None]:
client.close()
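
Because forgetting the shutdown step leaves resources allocated, one option is to let Python handle it with context managers. The following is a sketch under the assumption that a short-lived, in-process cluster (`processes=False`) is enough for the task at hand.

```python
from dask.distributed import Client, LocalCluster

with LocalCluster(n_workers=2, processes=False) as cluster, Client(cluster) as client:
    # Both objects are closed automatically when the block exits.
    total = client.submit(sum, [1, 2, 3]).result()

print(total)
```

This form suits scripts and batch jobs. In an interactive notebook session, the explicit steps above are usually more convenient because the cluster stays alive across cells.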

## 2. Cluster in a High Performance Computing System

A High Performance Computing (HPC) system is intended for researchers with a clear research objective and an understanding of how to use Dask to scale their experiments beyond the capabilities of a single computer. Depending on how Dask is configured on the HPC system, it can bring significant advantages in communication-intensive computations.

__1. Import required libraries, define required variables and functions__

In [None]:
import dask
from dask_jobqueue import SLURMCluster
from dask.distributed import Client

__2. Create a Dask cluster__

_Hint: Here you define how many `cores` and how much `memory` every Dask worker will have._

In [None]:
cluster = SLURMCluster(
    cores=1,
    memory="2 GB"
)

__3. Deploy workers for your Dask cluster__

_Hint: Each worker will run as a Slurm job._

In [None]:
cluster.scale(jobs=4)

__4. Check that the Dask workers were deployed in the High Performance Computing System.__

In [None]:
!squeue -u `whoami`

__5. Adjust the number of workers according to the workload__

_Hint: Instead of scaling manually, you can make the cluster adaptive, so that it automatically launches and kills workers based on load [7]._

In [None]:
cluster.adapt(maximum_jobs=20)
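
The adaptive mechanism can also be tried without access to an HPC system. In the sketch below, a local in-process cluster stands in for Slurm; this is an assumption made for illustration. Note that `LocalCluster.adapt` takes `minimum` and `maximum` counted in workers, whereas the call above uses `maximum_jobs`.

```python
from dask.distributed import Client, LocalCluster

# Assumption: a local, in-process cluster stands in for the HPC system.
cluster = LocalCluster(n_workers=1, processes=False)
cluster.adapt(minimum=1, maximum=4)  # keep between 1 and 4 workers

client = Client(cluster)
total = client.submit(sum, range(10)).result()
print(total)

client.close()
cluster.close()
```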

__6. Create a Dask Client and connect the client to the Dask cluster in the High Performance Computing System__

_Activities:_ 

1. Run the cell below.
2. Use the option `Launch dashboard in JupyterLab`; this will display the Dask Dashboard.
3. The Dashboard has plenty of options to evaluate the performance of our cluster when performing a computation. Take a look at some of the options, for instance `cluster`.

In [None]:
client = Client(cluster)  # Connect to distributed cluster and override default
client

__7. Perform some computation__

_Hint: Use the Dashboard to see what is happening while your computation is taking place._

In [None]:
# generate random timeseries of data
df = dask.datasets.timeseries("2000", "2005", partition_freq="2w").persist()

# perform a groupby with an aggregation
df.groupby("name").aggregate({"x": "sum", "y": "max"}).compute()

__8. Shut down the cluster__

_Hint: This step is **MANDATORY**. Once you finish using a cluster, you must shut it down to release the computing resources it was using._

In [None]:
cluster.close()

__9. Close the connection between the client and the cluster__

In [None]:
client.close()

<a href="https://www.dask.org/" target="_blank">
<img src="img/coiled.svg"
     align="right"
     width="30%"
     alt="Coiled logo">
</a>


## 3. Cluster in a Cloud Computing System

Easy, secure, and efficient computing with Dask and Coiled. Coiled manages your cloud resources.

<center>
<img src="img/coiled-cloud.png" width="60%">
</center>

__1. Authenticate in the Coiled cloud__

Hint: For authentication you will need a token, which will be provided by the tutor. Replace `[ASK_YOUR_TUTOR_FOT_THE_TOKEN]` with the token.

1. Open a terminal.
2. Update the token in the command below and execute it in the terminal.

```bash
coiled login --token [ASK_YOUR_TUTOR_FOT_THE_TOKEN] --account dask-tutorials
```

3. Use the following command to deploy the tutorial in Coiled

```bash
coiled notebook start --software dask-tutorial-hpc-2023
```
4. Once you have finished, press Ctrl+C to close the connection.

__2. Import required libraries, define required variables and functions__

In [None]:
import dask
import coiled
from dask.distributed import Client

__3. Create a Dask cluster using the Coiled API__

_Hint: Replace `...` with your user name, for example `ss0`._

In [None]:
cluster = coiled.Cluster(name=..., n_workers=5)

__4. Create a Dask Client and connect the client to the Dask cluster in Coiled__

_Activities:_ 

1. Run the cell below.
2. Use the option `Launch dashboard in JupyterLab`; this will display the Dask Dashboard.
3. The Dashboard has plenty of options to evaluate the performance of our cluster when performing a computation. Take a look at some of the options, for instance `cluster`.

In [None]:
client = Client(cluster)
client

__5. Run some computation__

_Hint: Use the Dashboard to see what is happening while your computation is taking place._

In [None]:
# generate random timeseries of data
df = dask.datasets.timeseries("2000", "2005", partition_freq="2w").persist()

# perform a groupby with an aggregation
df.groupby("name").aggregate({"x": "sum", "y": "max"}).compute()

__6. Shut down the cluster__

_Hint: This step is **MANDATORY**. Once you finish using a cluster, you must shut it down to release the computing resources it was using._

In [None]:
cluster.shutdown()

__7. Close the connection between the client and the cluster__

In [None]:
client.close()

# References

1. Deploy Dask Clusters - https://docs.dask.org/en/stable/deploying.html
2. Single-Machine Scheduler - https://www.devdoc.net/python/dask-2.23.0-doc/setup/single-machine.html
3. Getting Started with Dask: A Dask Setup Guide - https://www.youtube.com/watch?v=TQM9zIBzNBo&t=82s
4. Scheduler Overview - https://docs.dask.org/en/stable/scheduler-overview.html
5. JupyterLab Extension: How to Integrate Dask Dashboards & JupyterLab in 5 Minutes - https://www.youtube.com/watch?v=EX_voquHdk0&t=293s
6. Dask on HPC Introduction - https://www.youtube.com/watch?v=FXsgmwpRExM&t=9s
7. dask_jobqueue.SLURMCluster - https://jobqueue.dask.org/en/latest/generated/dask_jobqueue.SLURMCluster.html
8. Configure Dask-Jobqueue - https://jobqueue.dask.org/en/latest/configuration-setup.html