<center>
    <a href="https://dask.org/">
    <img src="./images/dask_horizontal_white_no_pad.svg" alt="Dask Logo" style="background-color:black;" width="900" height="1200"></a>
    <br/>
    <sup id="backref_IMG1"><a href="#ref_IMG1">1</a></sup>
</center>

# What is [Dask](https://docs.dask.org/en/latest/)?

> *Dask is a flexible library for parallel computing in Python*.

According to the official documentation, Dask is composed of two parts:

- **Dynamic task scheduling** optimized for computation, and in particular for interactive computational workloads.
- **“Big Data” collections** like parallel arrays, dataframes, and lists that extend common interfaces like NumPy, Pandas, or Python iterators to larger-than-memory or distributed environments. These parallel collections run on top of dynamic task schedulers.
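To make the relationship between the two parts concrete, here is a minimal sketch using a Dask Bag, the collection that extends Python iterators. The data and variable names are illustrative assumptions; the point is only that the collection is lazy, is backed by a task graph, and is evaluated when a scheduler runs that graph.

```python
import dask.bag as db

# A Dask Bag behaves like a Python iterable split into partitions.
numbers = db.from_sequence(range(10), npartitions=2)

# Building the expression is lazy: nothing is computed at this point.
sum_of_even_squares = (
    numbers.filter(lambda n: n % 2 == 0)
    .map(lambda n: n * n)
    .sum()
)

# The collection is backed by a task graph that a scheduler will execute.
n_tasks = len(dict(sum_of_even_squares.__dask_graph__()))

# compute() hands the graph to a scheduler (threads here) and returns
# an ordinary Python value.
result = sum_of_even_squares.compute(scheduler="threads")
print(result, n_tasks)
```

The `scheduler="threads"` argument is passed explicitly here only to keep the sketch in a single process; other schedulers can run the same graph.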

# [Why Dask?](https://docs.dask.org/en/latest/why.html#links-and-more-information)

Python's popularity as a language for data science is evident. It owes this to its simplicity and gentle learning curve, and to powerful libraries such as [NumPy](https://numpy.org/), [Pandas](https://pandas.pydata.org/), and [scikit-learn](https://scikit-learn.org/stable/), which make manipulating and visualizing data much easier.

Nevertheless, all these libraries have a potentially huge problem for data scientists working with **Big Data** and **High Performance Computing (HPC)**: they are largely designed to run on a **single core** of a single machine, with the data held in memory.

This becomes a problem when the data no longer fits in the RAM of the local machine, or when the computation needs more than one core to finish in a reasonable time.

# [Why Dask?](https://docs.dask.org/en/latest/why.html#links-and-more-information)

[Dask](https://dask.org/) was created to solve this problem. It distributes data across the cores of a machine and provides ways to scale Pandas, scikit-learn, and NumPy workflows natively with minimal rewriting, so you don’t have to learn an entirely new language or drastically change the way you write your code.

The key concept here is that:

> Dask aims to be a parallel computing library that works by breaking large computations down into smaller ones and distributing them through a task scheduler and task workers. It has been designed to extend NumPy, Pandas, and scikit-learn implementations without forcing data scientists to drastically change their code.

What makes Dask so great is **_its ease of integration into existing Python code_**.
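As a rough illustration of how little rewriting is involved, the sketch below runs the same column-wise computation once with NumPy and once with a Dask array. The array shape, chunk size, and variable names are arbitrary assumptions chosen for the example.

```python
import numpy as np
import dask.array as da

rng = np.random.default_rng(seed=0)
data = rng.random((1_000, 50))

# Plain NumPy: evaluated eagerly on the whole in-memory array.
numpy_result = (data - data.mean(axis=0)).std(axis=0)

# Dask: the same expressions on a chunked array, evaluated lazily
# and only materialized when compute() is called.
dask_data = da.from_array(data, chunks=(250, 50))
dask_result = (dask_data - dask_data.mean(axis=0)).std(axis=0).compute()

print(np.allclose(numpy_result, dask_result))
```

Apart from wrapping the input and calling `compute()`, the expression is unchanged. Here the array already fits in memory, so the example shows only the API similarity, not a performance gain.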

You can explore in more detail the reasons [why people choose to adopt Dask](https://docs.dask.org/en/latest/why.html#links-and-more-information).

# [Why Dask?](https://docs.dask.org/en/latest/why.html#links-and-more-information)

## Key remarks

### Dask scales out to Clusters

Dask figures out how to break up large computations and route parts of them efficiently onto distributed hardware. Dask is routinely run on thousand-machine clusters to process hundreds of terabytes of data efficiently within secure environments.

- Dask has utilities and documentation on how to deploy in-house, on the cloud, or on HPC super-computers.

- It supports encryption and authentication using TLS/SSL certificates.

- It is resilient, handling the failure of worker nodes gracefully, and elastic, taking advantage of new nodes added on-the-fly.

# [Why Dask?](https://docs.dask.org/en/latest/why.html#links-and-more-information)

## Key remarks

### Dask scales down to Single Computers

Dask can enable efficient parallel computations on single machines by leveraging their multi-core CPUs and streaming data efficiently from disk.

> It can run on a distributed cluster, but it doesn’t have to.

- Dask allows you to swap out the cluster for single-machine schedulers which are surprisingly lightweight, require no setup, and can run entirely within the same process as the user’s session.

- To avoid excess memory use, Dask is good at finding ways to evaluate computations with a low memory footprint when possible by pulling in chunks of data from disk, doing the necessary processing, and throwing away intermediate values as quickly as possible.

Both capabilities above require no configuration and no setup, meaning that:

- Adding Dask to a single-machine computation adds very little cognitive overhead.
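A minimal sketch of swapping single-machine schedulers, assuming only the `scheduler=` keyword accepted by `compute()`. The functions `square` and `total` are made up for illustration.

```python
import dask


@dask.delayed
def square(n):
    return n * n


@dask.delayed
def total(values):
    return sum(values)


# Build one lazy graph: five independent squares feeding a single sum.
graph = total([square(n) for n in range(5)])

# Run the same graph on a thread pool...
threaded = graph.compute(scheduler="threads")

# ...and on the single-threaded scheduler, which runs in the calling
# thread and is convenient for debugging.
synchronous = graph.compute(scheduler="synchronous")

print(threaded, synchronous)
```

No cluster, configuration file, or background service is involved. Both runs happen inside the current Python process.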

# [Why Dask?](https://docs.dask.org/en/latest/why.html#links-and-more-information)

## Key remarks

### Dask supports Complex Applications

Some parallel computations are simple and just apply the same routine onto many inputs without any kind of coordination. These are simple to parallelize with any system.

Somewhat more complex computations can be expressed with the _map-shuffle-reduce_ pattern popularized by **Hadoop** and **Spark**. This is often sufficient to do most data cleaning tasks, database-style queries, and some lightweight machine learning algorithms.

However

> More complex parallel computations exist which do not fit into these paradigms, and so are difficult to perform with traditional big-data technologies. These include more advanced algorithms for statistics or machine learning, time series or local operations, or bespoke parallelism often found within the systems of large enterprises.

# [Why Dask?](https://docs.dask.org/en/latest/why.html#links-and-more-information)

## Key remarks

### Dask supports Complex Applications

Dask helps to resolve these situations by exposing low-level APIs to its internal task scheduler which is capable of executing very advanced computations.

This gives engineers within the institution the ability to build their own parallel computing system using the same engine that powers Dask’s arrays, DataFrames, and machine learning algorithms, but now with the institution’s own custom logic.

> This allows engineers to keep complex business logic in-house while still relying on Dask to handle network communication, load balancing, resilience, diagnostics, etc.
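To give a feel for what "low-level" means here, the sketch below hands a hand-written task graph directly to the threaded scheduler. It assumes the dictionary-and-tuple graph format from Dask's graph specification; the key names are invented for the example, and higher-level interfaces such as `dask.delayed` are usually more convenient.

```python
from operator import add, mul

from dask.threaded import get

# A graph is a plain dict: each key names a result, and a tuple means
# "call the first element with the remaining elements as arguments".
# String arguments that match a key refer to that key's result.
graph = {
    "x": 3,
    "y": 4,
    "x_squared": (mul, "x", "x"),
    "y_squared": (mul, "y", "y"),
    "sum_of_squares": (add, "x_squared", "y_squared"),
}

# The scheduler resolves dependencies and runs independent tasks in parallel.
result = get(graph, "sum_of_squares")
print(result)
```

Custom logic plugs in by replacing `add` and `mul` with any Python callables; the scheduler only needs the dependency structure.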

# [Why Dask?](https://docs.dask.org/en/latest/why.html#links-and-more-information)

## Key remarks

### Dask Delivers Responsive Feedback

Because everything happens remotely, interactive parallel computing can be frustrating for users. They don’t have a good sense of how computations are progressing, what might be going wrong, or which parts of their code they should focus on for performance. The added distance between a user and their computation can drastically affect how quickly they are able to identify and resolve bugs and performance problems, which in turn increases their time to solution.

Dask keeps users informed and content with a suite of helpful diagnostic and investigative tools including the following:

- A real-time and responsive dashboard that shows current progress, communication costs, memory use, and more, updated every 100ms

- A statistical profiler installed on every worker that polls each thread every 10ms to determine which lines in your code are taking up the most time across your entire computation

- An embedded IPython kernel in every worker and the scheduler, allowing users to directly investigate the state of their computation with a pop-up terminal

- The ability to reraise errors locally, so that they can use the traditional debugging tools to which they are accustomed, even when the error happens remotely
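The last point can be sketched on a single machine. The example below uses the local threaded scheduler as a stand-in for remote workers, which is an assumption made to keep the snippet self-contained; the function `inverse` is hypothetical. The exception raised inside a task surfaces in the calling code, where ordinary `try`/`except` and debuggers apply.

```python
import dask


@dask.delayed
def inverse(x):
    # Raises ZeroDivisionError inside the task when x == 0.
    return 1 / x


# Successful tasks return their values as usual.
values = dask.compute(*[inverse(x) for x in (1, 2, 4)], scheduler="threads")

# A failing task re-raises its exception in the caller's thread.
caught = None
try:
    inverse(0).compute(scheduler="threads")
except ZeroDivisionError as exc:
    caught = exc

print(values, type(caught).__name__)
```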

# How does it work?

<center>
    <img src="./images/dask-overview.svg" alt="Dask Overview" width="500" height="600">
    <br/>
    <sup id="backref_IMG2"><a href="#ref_IMG2">2</a></sup>
</center>

# Dask use cases

Dask use cases can be roughly divided into the following two categories:

1. **Large NumPy/Pandas/Lists** with Dask Array, Dask DataFrame, and Dask Bag, to analyze large datasets with familiar techniques. This is similar to [Spark](https://spark.apache.org) or big array libraries.

2. **Custom task scheduling**. You submit a graph of functions that depend on each other for custom workloads. This is similar to [Airflow](https://airflow.apache.org) or [Celery](https://docs.celeryproject.org/en/stable/).

Most people today approach Dask assuming it is a framework like Spark, designed for the first use case around large collections of uniformly shaped data. However, many of the more productive and novel use cases fall into the second category where Dask is used to parallelize custom workflows.

You can also look for [real-world applications](https://stories.dask.org/en/latest/) in which people end up using both sides of Dask to achieve novel results.
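The second category can be sketched with `dask.delayed`, which turns ordinary function calls into a graph of dependent tasks. The `load`, `clean`, and `report` functions and the source names are hypothetical placeholders, not part of Dask.

```python
import dask


@dask.delayed
def load(name):
    # Hypothetical stand-in for reading a file or querying a service.
    return list(range(len(name)))


@dask.delayed
def clean(records):
    # Keep only the even records.
    return [r for r in records if r % 2 == 0]


@dask.delayed
def report(*cleaned):
    # Combine the results of every upstream branch.
    return {"sources": len(cleaned), "rows": sum(len(c) for c in cleaned)}


sources = ["alpha", "beta", "gamma"]

# Each source gets its own load -> clean branch; report depends on all of them.
summary = report(*[clean(load(s)) for s in sources])

result = summary.compute()
print(result)
```

The three branches are independent, so a scheduler is free to run them in parallel before the final `report` step.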

# References

- [Dask Documentation](https://docs.dask.org/en/latest/)
- [Dask: An Introduction and Tutorial](https://gongster.medium.com/dask-an-introduction-and-tutorial-b42f901bcff5)

# Image references

<a id="ref_IMG1">1</a>: Taken from [Dask](https://dask.org/)
    [↩](#backref_IMG1)
    
<a id="ref_IMG2">2</a>: Taken from [Dask](https://dask.org/)
    [↩](#backref_IMG2)