<a href="https://www.dask.org/" target="_blank">
<img src="http://dask.readthedocs.io/en/latest/_images/dask_horizontal.svg"
     align="right"
     width="30%"
     alt="Dask logo">
</a>

# Performance

In this notebook, we demonstrate how to use the **Dask Dashboard** to monitor aspects such as **communication** and **load balance**, which are critical to the performance of scientific and data analysis applications.

**Content**

1. Cluster in a High Performance Computing System.

## 1. Cluster in a High Performance Computing System

* High Performance Computing (HPC) systems are **tightly coupled, custom, specialized computers**. Their primary objective is to **accelerate numerical analysis at large scale**. In recent years, however, supercomputers have been adapted to support both numerical and data analysis workloads.
* Dask can be deployed on an HPC system to perform large-scale data analysis. Depending on how Dask is configured on the system, it can bring significant advantages in communication-intensive computations.

| Diagram | High Performance Computing System (El Capitan) |
|---|---|
| <img src="img/hpc_diagram.png" width="600px"> | <img src="img/hpc_el_capitan.jpeg" width="600px"> |

__1. Import required libraries, define required variables and functions__

In [None]:
import dask
from dask_jobqueue import SLURMCluster
from dask.distributed import Client

__2. Create a Dask cluster__

_Hint: Here you define how many `cores` and how much `memory` each Dask worker will have._

In [None]:
cluster = SLURMCluster(
    cores=1,
    memory="2 GB",
    scheduler_options={"dashboard_address": ":0"}
)

__3. Create a Dask Client and connect the client to the Dask cluster in the High Performance Computing System__

_Activities:_ 

1. Run the cell below.
2. Use the `Launch dashboard in JupyterLab` option to display the Dask Dashboard.

In [None]:
client = Client(cluster)  # Connect to distributed cluster and override default
client

__4. Deploy workers for your Dask cluster__

_Hint: Each worker will run as a Slurm job._

In [None]:
cluster.scale(jobs=4)

__5. Perform COMPUTATION 1__

_Hint: This is an example of a computation that is **load balanced**, with **low communication overhead** and **low processor idle time**. Let's see how to identify these aspects in the Dashboard._

In [None]:
%%time 

import dask.array as da

x = da.random.random((10000,10000,10), chunks=(1000,1000,5))
y = da.random.random((10000,10000,10), chunks=(1000,1000,5))
z = (da.arcsin(x) + da.arccos(y)).sum(axis=(1,2))
z.compute()
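
To interpret what the Dashboard shows for this computation, it can help to estimate how many chunks each array has and how large each chunk is. The sketch below does that arithmetic in plain Python from the shape and chunk sizes used above. The helper name `chunk_stats` is hypothetical, and the 8 bytes per element is an assumption based on the default float64 dtype.

```python
import math


def chunk_stats(shape, chunks, itemsize=8):
    """Return (number of chunks, bytes per full chunk, total bytes).

    itemsize=8 assumes float64 elements (an assumption, not read from Dask).
    """
    # Number of chunks along each axis, rounding up for a partial last chunk.
    chunks_per_axis = [math.ceil(s / c) for s, c in zip(shape, chunks)]
    n_chunks = math.prod(chunks_per_axis)
    chunk_bytes = math.prod(chunks) * itemsize
    total_bytes = math.prod(shape) * itemsize
    return n_chunks, chunk_bytes, total_bytes


n_chunks, chunk_bytes, total_bytes = chunk_stats((10000, 10000, 10), (1000, 1000, 5))
print(n_chunks)            # 200 chunks per array
print(chunk_bytes / 1e6)   # 40.0 (MB per chunk)
print(total_bytes / 1e9)   # 8.0 (GB per array)
```

With 200 equally sized chunks per array, each worker should receive many tasks of similar cost, which is consistent with the balanced behavior this computation is meant to illustrate.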

__6. Perform COMPUTATION 2__

_Hint: This is an example of a computation that is **not load balanced**, with **higher communication overhead** and **high processor idle time**. Let's see how to identify these aspects in the Dashboard._

__7. Perform COMPUTATION 3__

In [None]:
%%time 

import dask.array as da

# Good performance: chunks of 1_000_000 elements
# Bad performance: chunks of 100 elements
x = da.random.random(10_000_000, chunks=(1_000_000,))
z = x.sum().compute()
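
The two chunk sizes named in the comments differ mainly in how many chunks, and therefore tasks, they produce. The following is a rough count in plain Python using a hypothetical helper, `n_chunks_1d`. It assumes 8-byte float64 elements and counts only the chunks themselves; the real task graph also contains the reduction tasks.

```python
import math


def n_chunks_1d(n_elements, chunk_size):
    """Number of chunks needed to cover a 1-D array of n_elements."""
    return math.ceil(n_elements / chunk_size)


n = 10_000_000
for chunk_size in (1_000_000, 100):
    chunks = n_chunks_1d(n, chunk_size)
    # Assumes each element is an 8-byte float64.
    print(f"chunk size {chunk_size}: {chunks} chunks of {chunk_size * 8} bytes")
```

A chunk size of 1,000,000 gives 10 chunks of 8 MB each, while a chunk size of 100 gives 100,000 chunks of 800 bytes each. In the second case, scheduling overhead per task is likely to dominate the actual summation work, which is what the Dashboard should make visible.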

__8. Shut down the cluster__

_Hint: This is **MANDATORY**. Once you finish using a cluster, you must shut it down to release the computing resources it was using._

In [None]:
cluster.close()

__9. Close the connection between the client and the cluster__

In [None]:
client.close()

# [Index](0.Introduction.ipynb)