# Welcome to Dask

<img src="https://docs.dask.org/en/latest/_images/dask_horizontal.svg" align="right" width="30%" alt="Dask logo">


Wouldn't it be nice if you could keep writing simple Python code and scale it up **from a single core of your laptop to a massive computing cluster**? Welcome to Dask.

Dask enables computations on:
- Larger-than-memory data
- More than one core in parallel
- More than one machine in parallel

...all using simple Python code.

## What is [Dask](https://www.dask.org/)?

Today, we will see two main components of Dask:
* Dask Collections/API:
    * High-level (Arrays & Dataframes)
    * Low-level (Delayed & Futures)
* Distributed: to create and manage clusters


### Collections/API

Dask provides **multi-core** and **distributed, parallel** execution on **larger-than-memory** datasets.

We can think of Dask's APIs (also called collections) at a high and a low level:

<center>
<img src="images/high_vs_low_level_coll_analogy.png" width="75%" alt="High vs Low level clothes analogy" style="background-color:white;">
</center>
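
To make the two levels concrete, here is a minimal sketch (not part of the tutorial notebooks) that computes one result with a high-level collection, `dask.array`, and one with a low-level collection, `dask.delayed`. The helper functions `increment` and `add` are hypothetical names chosen for illustration, and the array sizes are arbitrary.

```python
import dask
import dask.array as da

# High-level collection: a Dask array of one million ones, split into 4 chunks.
x = da.ones(1_000_000, chunks=250_000)
lazy_total = x.sum()                  # lazy: only builds a task graph
total_value = lazy_total.compute()    # runs the graph and returns a NumPy scalar


# Low-level collection: wrap ordinary Python functions with dask.delayed.
@dask.delayed
def increment(i):
    return i + 1


@dask.delayed
def add(a, b):
    return a + b


lazy_sum = add(increment(1), increment(2))  # still lazy, nothing has run yet
delayed_value = lazy_sum.compute()          # executes the three tasks

print(total_value, delayed_value)
```

In both cases nothing is executed until `.compute()` is called; the high-level API builds the task graph for you, while the low-level API lets you build it from your own functions.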

### Distributed

Most of the time when you use Dask, you will be using a distributed scheduler. A Dask cluster is structured as follows:

<center>
<img src="images/distributed-overview.png" width="75%" alt="Distributed overview" style="background-color:white;">
</center>
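
As a rough illustration of the client, scheduler, and worker roles, the sketch below starts a small in-process cluster and submits work to it. The settings are assumptions chosen to keep the example lightweight (threads instead of worker processes, dashboard disabled); `square` and `run_demo` are hypothetical names, not part of the tutorial.

```python
from dask.distributed import Client, LocalCluster


def square(n):
    return n * n


def run_demo():
    # Assumed settings: one in-process worker with two threads and no dashboard.
    with LocalCluster(
        processes=False,
        n_workers=1,
        threads_per_worker=2,
        dashboard_address=None,
    ) as cluster:
        with Client(cluster) as client:
            # The client sends tasks to the scheduler, which assigns them to workers.
            futures = client.map(square, range(5))
            # Futures can be passed as arguments to further tasks.
            total = client.submit(sum, futures)
            return client.gather(futures), total.result()


if __name__ == "__main__":
    squares, total = run_demo()
    print(squares, total)
```

With the dashboard left enabled (the default), `client.dashboard_link` gives the address of the diagnostics UI.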

## Prepare

#### 1. Clone this repository


    git clone git@github.com:Nollde/dask-tutorial.git

Then install the necessary packages.

#### 2. Create a conda environment

In the main repository directory, run:


    conda env create -f env.yml
    conda activate dask-tutorial
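
To check that the activated environment can see the packages you expect, a small sanity check like the following may help. The package list is an assumption (the contents of `env.yml` are not shown here), so adjust it to match your environment; `missing_packages` is a hypothetical helper.

```python
import importlib.util


def missing_packages(names):
    """Return the names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Assumed package list; edit it to match env.yml.
expected = ["dask", "distributed", "numpy", "pandas"]
missing = missing_packages(expected)
print("missing:", missing if missing else "none")
```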

## Tutorial Structure

Each section is a Jupyter notebook. There's a mixture of text, code, and exercises.

0. [Overview](00_overview.ipynb) - Dask's place in the universe.

1. [Dataframe](01_dataframe.ipynb) - parallelized operations on many pandas DataFrames spread across your cluster.

2. [Array](02_array.ipynb) - blocked NumPy-like functionality with a collection of NumPy arrays spread across your cluster.

3. [Delayed](03_delayed.ipynb) - the single-function way to parallelize general Python code.

4. [Futures](04_futures.ipynb) - non-blocking results that compute asynchronously.

5. [Distributed](05_distributed.ipynb) - Dask's scheduler for clusters, with details of how to view the UI.

6. Conclusion & Beyond Dask