<img src="https://raw.githubusercontent.com/NCAR/dask-tutorial/main/images/NCAR-contemp-logo-blue.png"
     width="750px"
     alt="NCAR logo"
     style="vertical-align:middle;margin:30px 0px"/>


# Dask Overview

**ESDS dask tutorial | 06 February, 2023**  

Brian Vanderwende and Negin Sobhani  
Computational & Information Systems Lab (CISL)  
[vanderwb@ucar.edu](mailto:vanderwb@ucar.edu) and [negins@ucar.edu](mailto:negins@ucar.edu)


---------

**In this tutorial, you will learn:**

* Dask basics and components of Dask


## Introduction

Complex data structures enable data science in Python. For example:
* [NumPy arrays](https://numpy.org/doc/stable/)
* [Pandas series and dataframes](https://pandas.pydata.org/)
* [Xarray arrays](https://docs.xarray.dev/)

*But datasets are getting larger all the time! What if my dataset is too big to fit into memory, or my analysis takes too long to complete?*

## Introducing Dask
### What is Dask?

<img src="https://docs.dask.org/en/latest/_images/dask_horizontal.svg"
     width="500px"
     alt="Dask logo"
     style="vertical-align:middle;margin:30px 0px"/>

* Dask is an open-source Python library for parallel and distributed computing that scales the existing Python ecosystem.

* Dask was developed to scale Python packages such as NumPy, Pandas, and Xarray to multi-core machines and distributed clusters when datasets exceed memory. It scales from a single laptop to thousand-node HPC clusters and the cloud.

* Dask increases the size of possible work from *fits-in-memory* to *fits-on-disk* (sometimes doing it faster) via distributed parallelism. 


<div class="alert alert-block alert-warning" markdown="1">

<b>NOTE:</b> **Dask should only be used when necessary as it incurs overhead.**
<p>Avoid Dask if you can easily:</p>
<ul>
    <li> Speed up your code with use of compiled routines in libraries like NumPy</li>
    <li> Profile and optimize your serial code to minimize bottlenecks</li>
    <li> Read in a subset of data to gain the insight you need</li>

</ul>

<img src="https://raw.githubusercontent.com/NCAR/dask-tutorial/main/images/dask_twitter.png"
     width="500px"/>


</div>


### Dask is composed of two main parts

#### 1.  Dask Collections

A Dask *collection* is the fundamental thing we wish to parallelize. 
Dask features different levels of collection types:

* ##### High-level collections 
Dask provides high-level Array, Bag, and DataFrame collections that mimic NumPy, lists, and pandas but can operate in parallel on datasets that don’t fit into memory.

    Most of the time, you will probably use one of the following *high-level* (big) data structures:

| Collection | Serial | Dask |
|-|-|-|
| Arrays | numpy.array | dask.array.from_array |
| Dataframes | pandas.read_csv | dask.dataframe.read_csv |
| Unstructured | [1,2,3] | dask.bag.from_sequence([1,2,3]) |
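
The serial and Dask columns of the table can be compared side by side in a minimal sketch. The array size, chunk size, and partition count below are arbitrary choices for illustration, and `scheduler="threads"` is passed explicitly only to keep the example self-contained on a single machine.

```python
import numpy as np
import dask.array as da
import dask.bag as db

# Serial: a NumPy array held entirely in memory.
np_values = np.arange(1_000_000)

# Dask: the same data split into four chunks of 250,000 elements each.
dask_values = da.from_array(np_values, chunks=250_000)

# Operations on a Dask collection are lazy; compute() triggers the work.
lazy_mean = dask_values.mean()
mean_value = lazy_mean.compute()

# Serial: a plain Python list. Dask: a bag split into two partitions.
bag = db.from_sequence([1, 2, 3], npartitions=2)
doubled = bag.map(lambda v: v * 2).compute(scheduler="threads")

print(dask_values.numblocks, mean_value, doubled)
```

Until `compute()` is called, `lazy_mean` is only a description of the work to do, which is what lets Dask process one chunk at a time instead of loading everything at once.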

* ##### Low-level collections
Dask also features two low-level collection types: `delayed` and `futures`. These collections give users finer control to build custom parallel and distributed computations. 
    * **delayed** - run any arbitrary Python function using Dask task parallelism (think looped function calls)
    * **futures** - similar to delayed but allows for concurrency on the client (think backgrounded processes)



![Dask Collections](https://tutorial.dask.org/_images/high_vs_low_level_coll_analogy.png)



#### 2. Dynamic Task Scheduling
When a computation is submitted, the Dask distributed scheduler assigns its tasks to the workers in a Dask cluster. We can think of the Dask scheduler as our task orchestrator. 

A Dask cluster consists of the following:

* **scheduler** : A scheduler creates and manages task graphs and distributes tasks to workers.

* **workers** : A worker is typically a separate Python process on either the local host or a remote machine. A Dask cluster usually consists of many workers. Basically, a worker is a Python interpreter that performs work on a subset of our dataset.  

* **client** : A client is a local object that points to the scheduler (often local but not always). 


![Dask Distributed Cluster](https://tutorial.dask.org/_images/distributed-overview.png)
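
The three components can be seen together in a small sketch that also shows the `futures` interface mentioned earlier. It assumes the `distributed` package is installed. The worker counts, the in-process setting (`processes=False`), and the disabled dashboard are choices made only to keep the demonstration lightweight, and `increment` is a hypothetical helper.

```python
from dask.distributed import Client, LocalCluster


def increment(x):
    """Hypothetical stand-in for a unit of real work."""
    return x + 1


# Assumption: a tiny cluster on the local machine, for demonstration only.
# processes=False runs the workers as threads inside this Python process;
# dashboard_address=None disables the dashboard.
cluster = LocalCluster(
    n_workers=2,
    threads_per_worker=1,
    processes=False,
    dashboard_address=None,
)

# The client is a local object that points at the cluster's scheduler.
client = Client(cluster)

# submit() returns a future immediately; the scheduler assigns the task
# to one of the workers in the background.
future = client.submit(increment, 41)

# result() blocks until the worker has finished and returns the value.
result = future.result()

client.close()
cluster.close()
print(result)
```

On an HPC system the `LocalCluster` would typically be replaced by a cluster object backed by the job scheduler, while the client code stays the same.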

### Why Dask?

#### Familiar Interface

In the geosciences, researchers often use Python libraries such as Xarray, NumPy, Pandas, and Scikit-Learn to analyze their simulations and observations. Many atmospheric and oceanographic datasets consist of multi-dimensional arrays of numerical data, such as temperature sampled on a regular latitude, longitude, depth, and time grid. When researchers apply their analyses to larger datasets, however, they often find that the tools they developed do not scale beyond a single machine. 

Dask collections such as Dask Array and Dask DataFrame provide APIs that are largely compatible with NumPy and Pandas. This means Dask can parallelize Pandas, Xarray, and NumPy workflows with minimal code rewriting. 
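
As a rough illustration of this compatibility, the sketch below applies one unchanged function to a NumPy array and to a Dask array. The function `column_anomaly`, the random test data, and the chunk shape are made up for this example.

```python
import numpy as np
import dask.array as da


def column_anomaly(values):
    """Subtract each column's mean, using only calls shared by NumPy and Dask."""
    return values - values.mean(axis=0)


rng = np.random.default_rng(seed=0)
np_data = rng.random((1000, 4))

# NumPy: evaluated eagerly, entirely in memory.
np_result = column_anomaly(np_data)

# Dask: the same function, unchanged. The result stays lazy until compute().
dask_data = da.from_array(np_data, chunks=(250, 4))
dask_result = column_anomaly(dask_data).compute()

print(np.allclose(np_result, dask_result))
```

Not every NumPy or Pandas operation has a Dask equivalent, so it is worth checking the Dask documentation before porting a larger workflow.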

#### Flexibility
Dask provides several tools that help with data analysis on large datasets. For example, you can wrap your function with the `dask.delayed` decorator so that calls to it become lazy tasks that Dask can run in parallel. 
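
A minimal sketch of the decorator in use follows. The functions `load`, `process`, and `combine` are hypothetical stand-ins for the steps of a real workflow, and the threaded scheduler is chosen only so the example runs on a single machine.

```python
import dask


@dask.delayed
def load(i):
    # Hypothetical stand-in for reading one file or chunk of data.
    return list(range(i * 3, i * 3 + 3))


@dask.delayed
def process(chunk):
    # Reduce one chunk to a single number.
    return sum(chunk)


@dask.delayed
def combine(partials):
    # Merge the per-chunk results.
    return max(partials)


# Calling the decorated functions only builds a task graph; nothing runs yet.
partials = [process(load(i)) for i in range(4)]
largest = combine(partials)

# compute() executes the graph; independent load/process tasks can run in parallel.
result = largest.compute(scheduler="threads")
print(result)
```

Because the four `load` and `process` chains do not depend on each other, Dask is free to run them concurrently before `combine` gathers the results.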

Dask also interfaces easily with popular HPC resource managers and job queueing systems such as PBS, SLURM, and SGE. 


#### Scale up and scale down
Dask scales well from a single machine (laptop) to HPC clusters. 
This makes it easy for users to prototype their workflows on their local machines and seamlessly transition to a cluster when needed.  


#### Responsiveness
Dask provides rapid feedback and an interactive dashboard to keep users informed of how the computation is progressing. This helps users identify and resolve potential issues earlier. 



## How to follow this Tutorial:

## Useful Links

*  Reference
    *  [Docs](https://dask.org/)
    *  [Examples](https://examples.dask.org/)
    *  [Code](https://github.com/dask/dask/)
    *  [Blog](https://blog.dask.org/)
*  Ask for help
    *   [`dask`](http://stackoverflow.com/questions/tagged/dask) tag on Stack Overflow, for usage questions
    *   [github issues](https://github.com/dask/dask/issues/new) for bug reports and feature requests
    *   [discourse forum](https://dask.discourse.group/) for general, non-bug, questions and discussion
    *   Attend a live tutorial

---
## Addendum: Using JupyterLab on HPC systems

Ideally, you have access to JupyterHub, which provides a web portal to notebooks, terminals, and Dask Distributed dashboards. If not, you will need to create SSH tunnels to forward the port for Jupyter *and, if desired, the Dask dashboard.*  

**Remote System**
```
conda activate my-dask-env
jupyter lab --no-browser --port 8888  # --port is optional; 8888 is the default
```
**Local System**
```
ssh -N -L 8888:localhost:8888 remote.hpc.system.edu
```
Then, navigate to `http://localhost:8888` in your browser and sign into JupyterLab. Once you start a distributed Dask cluster, you will have the port number of its dashboard (8787 by default if unoccupied).  

Fortunately, you do not need to forward the Dask dashboard port, as Jupyter can proxy it for you (this relies on the `jupyter-server-proxy` extension being installed). You can instead use the following URL in your browser:
```
http://localhost:<jupyter_port>/proxy/<cluster_port>/status
```