# Dask for Scalable Computing in Python

<img src="../../../assets/dask_horizontal.svg" align="right" width="30%">

<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Dask-for-Scalable-Computing-in-Python" data-toc-modified-id="Dask-for-Scalable-Computing-in-Python-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Dask for Scalable Computing in Python</a></span><ul class="toc-item"><li><span><a href="#Learning-Objectives" data-toc-modified-id="Learning-Objectives-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Learning Objectives</a></span></li><li><span><a href="#What-is-&quot;Big-Data&quot;?" data-toc-modified-id="What-is-&quot;Big-Data&quot;?-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>What is "Big Data"?</a></span></li><li><span><a href="#What-is-Dask?" data-toc-modified-id="What-is-Dask?-1.3"><span class="toc-item-num">1.3&nbsp;&nbsp;</span>What is Dask?</a></span><ul class="toc-item"><li><span><a href="#Dask-Components" data-toc-modified-id="Dask-Components-1.3.1"><span class="toc-item-num">1.3.1&nbsp;&nbsp;</span>Dask Components</a></span></li></ul></li><li><span><a href="#Going-Futher" data-toc-modified-id="Going-Futher-1.4"><span class="toc-item-num">1.4&nbsp;&nbsp;</span>Going Further</a></span></li></ul></li></ul></div>

## Learning Objectives

- Define what we mean by "Big Data"
- Get an overview of Dask and its components



## What is "Big Data"?


The term "big data" is heavily hyped, ill-defined, and often overused. To avoid it, we'll use a three-tiered definition throughout this tutorial to describe datasets of different sizes.


<div class="alert alert-block alert-info">
The boundaries between the following thresholds are a bit fuzzy and depend on how powerful your computer is. <b>The significance lies more in the different orders of magnitude than in hard size limits</b>.
</div>



**An _opinionated_ tiered definition of dataset sizes**

| Dataset Type   | Size range | Fits in RAM? | Fits on local disk? |
| -------------- | ---------- | ------------ | ------------------- |
| Small dataset  | < 2-4 GB   | ✔            | ✔                   |
| Medium dataset | < 2 TB     | ✖            | ✔                   |
| Large dataset  | > 2 TB     | ✖            | ✖                   |


- Small datasets:
    - Fit comfortably in RAM, leaving memory to spare for manipulation and transformations.
    - Usually no more than 2-4 GB in size.
    - Complex operations like aggregation can be done on these datasets without paging (spilling intermediate results to disk).
    - NumPy, Pandas, Xarray, and Scikit-learn are the best tools for the job.
    - Throwing more sophisticated tools at these datasets is not only **overkill**; it can also be counterproductive, adding unnecessary layers of complexity and overhead that hurt performance.

- Medium datasets:
    - Cannot be held entirely in RAM but fit comfortably on a single computer's local disk.
    - Typically range in size from 10 GB to 2 TB.
    - The same toolset used to analyze small datasets can be used to analyze medium datasets, but at a significant performance penalty: these tools must rely on paging to avoid out-of-memory errors.
    - Are large enough that it makes sense to introduce parallelism (multithreading, multiprocessing) to cut down processing time.

- Large datasets:
    - Fit neither in RAM nor in a single computer's persistent storage.
    - Typically above 2 TB in size, and can reach into petabytes and beyond.
        - Once you have many TB of data to analyze, you are definitely in the realm of "big data".
        - By this definition, many datasets we regularly confront in Earth science are in this category.
    - NumPy, Pandas, and Xarray alone are not suitable for datasets of this size, because they were not inherently built to operate on distributed datasets.
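
The three tiers above can be sketched as a small helper that compares a dataset's size against a machine's RAM and local disk. This is only an illustration: the function name `classify_dataset` and the `ram_headroom` fraction (how much of RAM a dataset may occupy while still leaving "memory to spare") are assumptions made for this sketch, not part of any library.

```python
GB = 10**9   # decimal gigabyte
TB = 10**12  # decimal terabyte


def classify_dataset(size_bytes, ram_bytes, disk_bytes, ram_headroom=0.25):
    """Return 'small', 'medium', or 'large' for a dataset of size_bytes.

    ram_headroom is a hypothetical rule of thumb: a dataset counts as
    'small' only if it uses at most this fraction of RAM, so that there
    is memory left over for intermediate results.
    """
    if size_bytes <= ram_headroom * ram_bytes:
        return "small"   # fits comfortably in RAM
    if size_bytes <= disk_bytes:
        return "medium"  # does not fit in RAM, fits on local disk
    return "large"       # fits neither in RAM nor on local disk


# Example machine (assumed for illustration): 16 GB of RAM, 2 TB of local disk
for size in (3 * GB, 100 * GB, 5 * TB):
    print(size, classify_dataset(size, ram_bytes=16 * GB, disk_bytes=2 * TB))
```

With a 16 GB machine and a headroom of 0.25, the "small" cutoff lands at 4 GB, which is consistent with the 2-4 GB range in the table. On a larger workstation the same dataset may move down a tier, which is why the boundaries are described as fuzzy.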

    

## What is Dask?

- Dask is a tool that helps us easily extend our familiar Python data analysis toolset (NumPy, Pandas, Xarray, Scikit-learn, etc.) to medium and large datasets, i.e. datasets that can't fit in our computer's RAM.
- In many cases, Dask also allows us to speed up our analysis by using multiple CPU cores.
- Dask can help us work more efficiently on our laptop, and it can also help us scale up our analysis to thousand-node clusters on HPC and cloud platforms.
- **Most importantly, Dask is almost invisible to the user, meaning that you can focus on your science rather than the details of parallel computing.**
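
As a small preview of what "almost invisible" means in practice, the sketch below writes the same computation twice, once with NumPy and once with `dask.array`. The array shape and chunk size are arbitrary values chosen for illustration, and the example assumes both `numpy` and `dask` are installed.

```python
import numpy as np
import dask.array as da

# NumPy version: the whole array lives in RAM at once.
x_np = np.ones((1000, 1000))
result_np = (x_np + x_np.T).mean(axis=0)

# Dask version: the same expression, but the array is split into 250x250 chunks.
x_da = da.ones((1000, 1000), chunks=(250, 250))
lazy_result = (x_da + x_da.T).mean(axis=0)  # builds a task graph, nothing is computed yet

# .compute() executes the graph and returns an ordinary NumPy array.
result_da = lazy_result.compute()

print(x_da.numblocks)                      # (4, 4): a 4 x 4 grid of chunks
print(np.allclose(result_np, result_da))   # True
```

The only differences from the NumPy code are the `chunks` argument and the final `.compute()` call. Because each chunk can be processed independently, Dask is free to work on several chunks at once and never needs the full array in memory at the same time.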


### Dask Components 

Dask consists of several different components and APIs, which can be categorized into three layers: 

- The scheduler
- Low-level APIs
- High-level APIs


<img src="../../../assets/dask-components.jpeg" align="right" width="100%">
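
As a rough preview of how the three layers relate, the sketch below builds a tiny task graph with the low-level `dask.delayed` interface, runs it on two different schedulers, and then does a comparable calculation through the high-level `dask.array` interface. The functions `square` and `add` are made-up examples, and the snippet assumes `dask` and `numpy` are installed.

```python
import dask.array as da
from dask import delayed


# Low-level API: wrap ordinary Python functions so that calls become lazy tasks.
@delayed
def square(x):
    return x * x


@delayed
def add(a, b):
    return a + b


# Calling the wrapped functions only records tasks in a graph.
total = add(square(3), square(4))

# Scheduler: the same graph can be executed by different schedulers.
result_threads = total.compute(scheduler="threads")      # thread pool
result_sync = total.compute(scheduler="synchronous")     # single thread, handy for debugging

# High-level API: dask.array generates a similar graph for us behind the scenes.
array_sum = da.arange(10, chunks=5).sum().compute()

print(result_threads, result_sync, array_sum)  # 25 25 45
```

The graph describes what to compute, while the scheduler decides where and when each task runs. Switching schedulers changes how the work is executed, not the result.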





<div class="alert alert-block alert-info">

It is probably easiest to illustrate what these mean through examples, so in the next notebook, we will jump right in. 

</div>

## Going Further

- [The Dask Documentation](http://dask.readthedocs.io/en/latest/)
- [The Dask GitHub Site](https://github.com/dask/dask)

<div class="alert alert-block alert-success">
  <p>Next: <a href="02_dask_arrays.ipynb">Dask Arrays</a></p>
</div>