# Dask Basics
---
**High level collections:**  Dask provides high-level Array, Bag, and DataFrame
collections that mimic NumPy, lists, and Pandas but can operate in parallel on
datasets that don't fit into memory.  Dask's high-level collections are
alternatives to NumPy and Pandas for large datasets.

`dask.delayed` is the only function that you will need to convert functions for use with Dask.

In [None]:
from time import sleep

data = [1, 2, 3, 4, 5, 6, 7, 8]

def inc(x):
    sleep(1)  # wait 1 sec
    return x+1

In [None]:
%%time
results = []

for x in data:
    y = inc(x)
    results.append(y)
    
total = sum(results)

This is exactly what we expect. The list has 8 elements and we have to wait 1 sec for each of them, so the loop takes about 8 sec.

### Parallelize with the `dask.delayed` decorator

Those increment calls *could* run in parallel, because they are totally **independent** of one another.

`dask.delayed` is a function that does two things. If you call `delayed` on another function, it makes that function lazy (it doesn't run immediately). If you call `delayed` on data, it makes everything that touches the data lazy, so that work also runs later.
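
A minimal sketch of both behaviours, plus the equivalent decorator form. The functions `double` and `triple` and the input values are made up for illustration.

```python
from dask import delayed

def double(x):
    return 2 * x

# 1. Wrapping a function: the call is recorded, not executed.
lazy_call = delayed(double)(21)

# 2. Wrapping data: operations on the wrapped value are recorded too.
lazy_data = delayed(10)
lazy_sum = lazy_data + 5

# The decorator form is equivalent to wrapping the function by hand.
@delayed
def triple(x):
    return 3 * x

lazy_triple = triple(4)

# Nothing has run so far; .compute() triggers the actual work.
values = (lazy_call.compute(), lazy_sum.compute(), lazy_triple.compute())
print(values)
```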

So, we'll change the code a little bit. Instead of executing the commands immediately, we'll make them lazy.

In [None]:
from dask import delayed

# All this does is build a graph (no calculation yet)
results = []  # start from an empty list; the previous cell left integers in it

for x in data:
    y = delayed(inc)(x)
    results.append(y)
    
total = delayed(sum)(results)

print(total)

Then we let them compute later. Behind the scenes, Dask schedules the tasks so that independent ones run in parallel. We'll see the time drop to about 2 sec; the exact figure depends on how many workers are available.

In [None]:
%%time
z = total.compute()
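
How much time is saved depends on how many tasks can run at once. The sketch below runs the same graph with one worker and with four. It assumes the threaded scheduler, and the helper `slow_inc` (with a shorter sleep than the 1 sec above) is hypothetical.

```python
from time import perf_counter, sleep

from dask import delayed

def slow_inc(x):
    sleep(0.1)  # shorter than the 1 sec above, to keep the sketch quick
    return x + 1

# Build the graph once: eight independent increments feeding one sum.
lazy_total = delayed(sum)([delayed(slow_inc)(x) for x in range(1, 9)])

timings = {}
answers = {}
for workers in (1, 4):
    start = perf_counter()
    # Same graph, different number of threads executing it.
    answers[workers] = lazy_total.compute(scheduler="threads", num_workers=workers)
    timings[workers] = perf_counter() - start

print(answers)
print(timings)  # fewer workers means more waiting
```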

We can see the Dask computational graph by calling `.visualize()`. The picture shows which tasks Dask can run in parallel.

In [None]:
print('result =',z)
total.visualize()

# Dask Array
---
Dask Array provides parallel, larger-than-memory arrays using blocked algorithms.

### Blocked algorithm
A blocked algorithm operates on a large dataset by breaking it up into many small blocks, and relies on the same task scheduling that powers `dask.delayed` to process those blocks in parallel.
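
The idea can be sketched without Dask at all. The hypothetical helper `blocked_sum` below uses plain NumPy and runs the blocks one after another; it only illustrates the split-then-combine structure, not Dask's implementation.

```python
import numpy as np

def blocked_sum(array, block_size):
    """Sum a 1-D array by summing fixed-size blocks, then combining."""
    partial_sums = []
    for start in range(0, len(array), block_size):
        block = array[start:start + block_size]  # one small block
        partial_sums.append(block.sum())         # independent of other blocks
    return sum(partial_sums)                     # combine step

values = np.arange(1_000)
total = blocked_sum(values, block_size=100)
print(total)
```

Because each partial sum depends only on its own block, those steps are the ones a scheduler could run in parallel.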

Create a `dask.array` object with the `da.from_array` function by passing

1.  `data`: Any object that supports NumPy-style slicing
2.  `chunks`: The shape of each small block

In [None]:
import dask.array as da
import numpy as np

# Build a 100 x 100 NumPy array first, then split it into 10 x 10 blocks
x = da.from_array(np.array([range(100) for i in range(100)]), chunks=(10, 10))
x
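
To check how `chunks` split an array, a few attributes can be inspected before anything is computed. This is a small sketch; the name `grid` and the `np.arange` input are chosen only for illustration.

```python
import numpy as np
import dask.array as da

grid = da.from_array(np.arange(100 * 100).reshape(100, 100), chunks=(10, 10))

print(grid.shape)        # overall shape of the array
print(grid.chunksize)    # shape of the largest block
print(grid.numblocks)    # number of blocks along each axis
print(grid.npartitions)  # total number of blocks

# A single block can be pulled out and computed on its own.
first_block = grid.blocks[0, 0].compute()
print(first_block.shape)
```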

We can do pretty much everything we would do with a normal NumPy array. The only difference is that the result is lazy and won't be computed immediately.

In [None]:
x.sum()

In [None]:
x.sum().compute()
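
Lazy operations can also be chained before a single `.compute()` call. The sketch below, with made-up names and a tiny array, compares a chained Dask expression against the same expression in NumPy.

```python
import numpy as np
import dask.array as da

np_grid = np.arange(12).reshape(3, 4)
lazy_grid = da.from_array(np_grid, chunks=(2, 2))

# Each step only extends the task graph; no arithmetic happens yet.
lazy_result = (lazy_grid + 1).T.mean(axis=1)

# The same expression in NumPy runs immediately.
eager_result = (np_grid + 1).T.mean(axis=1)

computed = lazy_result.compute()
print(np.allclose(computed, eager_result))
```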

## Compare with NumPy

In [None]:
%%time
x = np.random.normal(10, 0.1, size=(20000, 20000))
print(f"This matrix takes up {x.nbytes / 1e9} GB")
z = x.mean(axis=0)[::100]

In [None]:
%%time
x = da.random.normal(10, 0.1, size=(20000, 20000), 
                     chunks=(1000, 1000))
print(f"This matrix takes up {x.nbytes / 1e9} GB")
z = x.mean(axis=0)[::100].compute()

# Dask DataFrame
---
Dask dataframes look and feel like Pandas dataframes, but they run on the same infrastructure that powers `dask.delayed`. Pandas is great for tabular datasets that fit in memory. Because it uses blocked algorithms, Dask becomes useful when the dataset you want to analyze is larger than your machine's RAM.

The `dask.dataframe` module implements a blocked parallel `DataFrame` object that mimics a large subset of the Pandas `DataFrame` API. One Dask `DataFrame` is composed of many in-memory pandas `DataFrame`s separated along the index. One operation on a Dask `DataFrame` triggers many pandas operations on the constituent pandas `DataFrame`s in a way that is mindful of potential parallelism and memory constraints.

In [None]:
import pandas as pd
import dask.dataframe as dd
import warnings
warnings.filterwarnings('ignore')

<div class='alert alert-info'>
    <h3>Unlike Pandas</h3>
    <code>pandas.read_csv</code> reads the entire file before inferring datatypes. <code>dask.dataframe.read_csv</code> reads only a sample from the beginning of the file to infer datatypes, and then enforces those inferred datatypes when reading all partitions. Problems can arise, for example, when most values at the beginning of the file are <code>NaN</code>. <br>
    &#8594; Specifying dtypes directly using the <code>dtype</code> keyword is the recommended solution.
    </div>

In [None]:
df_dask = dd.read_csv("/kaggle/input/flight-delays/flights.csv")

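# Without explicit dtypes, the dtypes guessed from the sample may not fit the data
try:
    df_dask.head()
except Exception as err:
    print(f"Got an error: {type(err).__name__}")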

Let's take a first look at how Pandas and Dask perform on a 564 MB dataset.

In [None]:
%%time
df_pd = pd.read_csv("/kaggle/input/flight-delays/flights.csv", 
                    dtype={'SCHEDULED_TIME':float,
                           'TAIL_NUMBER':str})

x = df_pd.groupby('ORIGIN_AIRPORT').DEPARTURE_DELAY.mean()

In [None]:
%%time
df_dask = dd.read_csv("/kaggle/input/flight-delays/flights.csv",
                     dtype={'SCHEDULED_TIME':float,
                           'TAIL_NUMBER':str})

y = df_dask.groupby('ORIGIN_AIRPORT').DEPARTURE_DELAY.mean().compute()

<div class='alert alert-warning'>
    <h3>Dask is always lazy</h3>
As with <code>dask.delayed</code>, we always need to call <code>.compute()</code> because the result is a lazy object.
    </div>
We can see how much parallelism the Dask scheduler provides by calling <code>.visualize()</code>.

In [None]:
x = df_dask.DEPARTURE_DELAY.max()
print(x.compute())
x.visualize()

### Examples of Pandas-like operations

In [None]:
# Number of rows
len(df_dask)

In [None]:
# Print all columns names
df_dask.columns

In [None]:
# Count cancelled flights
len(df_dask[df_dask.CANCELLED==1])

In [None]:
# Number of non-cancelled flights departing from each airport
x = df_dask[df_dask.CANCELLED==0].groupby(by='ORIGIN_AIRPORT').ORIGIN_AIRPORT.count()
x.compute()

In [None]:
# Average departure delay from each airport
x = df_dask.groupby(by='ORIGIN_AIRPORT').DEPARTURE_DELAY.mean()
x.compute()

# Pandas or Dask
---
**If your data fits in memory then you should definitely use Pandas**, as it is more fully featured than `dask.dataframe`.  The `dask.dataframe` module gives you a solution to operate on datasets that don't fit comfortably in memory.

`dask.dataframe` only really becomes meaningful when Pandas breaks with 

    MemoryError:  ...
    
Furthermore, the *distributed scheduler* allows the same dataframe expressions to be executed across a cluster. To enable massive "big data" processing, one could run data ingestion functions such as `read_csv` on data held in storage accessible to every worker node (e.g., Amazon S3). Because most operations begin by selecting only some columns and by transforming and filtering the data, only relatively small amounts of data need to be communicated between the machines.

`dask.dataframe` operations use `pandas` operations internally.  Generally they run at about the same speed.

`dask.dataframe` only covers a small but well-used portion of the Pandas API.
There are two reasons for this limitation:

1.  The Pandas API is *huge*
2.  Some operations are genuinely hard to do in parallel (e.g. sort)

Additionally, some important operations like `set_index` work, but are slower
than in Pandas because they include substantial shuffling of data, and may write out to disk.