# Introduction to Dask

In this notebook, we'll learn how to use [Dask](https://dask.org) for reading data from Azure.

The core [dask](https://github.com/dask/dask) and [distributed](https://github.com/dask/distributed) libraries are small and focused. Many tools, some built by the Dask organization and most not, use Dask for parallel or distributed processing. Some of the most useful for data science include:

- [dask/adlfs](https://github.com/dask/adlfs)
- [dask/dask-ml](https://github.com/dask/dask-ml)
- [pydata/xarray](https://github.com/pydata/xarray)
- [microsoft/lightgbm](https://github.com/microsoft/lightgbm)
- [dmlc/xgboost](https://github.com/dmlc/xgboost)
- [rapidsai/cudf](https://github.com/rapidsai/cudf)
- [rapidsai/cuml](https://github.com/rapidsai/cuml)

## Install required packages

In [None]:
!pip install --upgrade dask distributed bokeh adlfs fsspec fastparquet pyarrow python-snappy lz4

## Get AML Workspace

You can use the AML workspace to retrieve datastores and key vaults, which let you access data credentials securely.

In [None]:
from azureml.core import Workspace

ws = Workspace.from_config()
ws

## Create a distributed client

The [client](https://distributed.dask.org/en/latest/client.html) is the primary entry point for parallel processing with Dask. Calling it without arguments creates a local distributed scheduler that uses all the CPU cores on your machine. This can speed up processing of larger in-memory dataframes, and it also enables computations on data that does not fit in memory.

When your local machine isn't powerful enough, you can provision a larger VM in Azure; the M series offers 100+ CPUs and terabytes of RAM. If this still isn't powerful enough, you can create a distributed Dask cluster on most hardware. See [the Dask setup guide](https://docs.dask.org/en/latest/setup.html) for details.

If you still need acceleration, [RAPIDS](https://github.com/rapidsai) extends the PyData APIs to GPUs.

**Make sure you check out the dashboard!**

In [None]:
from distributed import Client

c = Client()
print(c)
c
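
If you would rather not claim every core, the worker layout can be set explicitly. The following is a minimal sketch, assuming a single local machine. `worker_layout` and `start_local_client` are hypothetical helpers introduced for illustration, while `n_workers`, `threads_per_worker`, and `dashboard_link` come from `distributed`.

```python
import os


def worker_layout(total_cores, threads_per_worker=2):
    """Return how many workers fit on `total_cores` at the given thread count."""
    return max(1, total_cores // threads_per_worker)


def start_local_client(threads_per_worker=2):
    """Start a local cluster sized from the visible CPU count (hypothetical helper)."""
    # Imported here so the sizing helper above works without distributed installed.
    from distributed import Client

    n_workers = worker_layout(os.cpu_count() or 1, threads_per_worker)
    return Client(n_workers=n_workers, threads_per_worker=threads_per_worker)


# Usage (this starts worker processes, so run it once per session):
# c = start_local_client()
# print(c.dashboard_link)  # URL of the dashboard mentioned above
```

Fewer, multi-threaded workers tend to suit NumPy and Pandas workloads that release the GIL. More single-threaded workers can help with pure-Python code.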

## Reading cloud data

Reading data from the cloud is as easy as reading it locally! Sorta!

### Pandas

You can read directly into Pandas from most cloud storage. The notable exception is Azure storage, which is missing from the URL schemes listed in the [`pandas.read_csv` documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html?highlight=read_csv#pandas-read-csv):

> Any valid string path is acceptable. The string could be a URL. Valid URL schemes include http, https, ftp, s3, gs, and file.

### Pythonic Filesystem for Azure

Fortunately, similar protocols have been developed for Azure storage in the [ADLFS](https://github.com/dask/adlfs) package, including:

- `az` or `abfs` for Azure Data Lake Storage Gen2 (ADLSv2) and Blob
- `adl` for Azure Data Lake Storage Gen1 (ADLSv1)

These protocols are registered with the `fsspec` package. You can use them directly in Dask and convert the result to Pandas for now.
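
To see how these URLs are laid out, the sketch below splits an `az://` URL with the standard library. It only illustrates the shape of the URL (protocol, container, path within the container) and is not how `fsspec` or `adlfs` parse it internally. `split_az_url` is a hypothetical helper.

```python
from urllib.parse import urlsplit


def split_az_url(url):
    """Split an az:// or abfs:// URL into (protocol, container, blob path)."""
    parts = urlsplit(url)
    return parts.scheme, parts.netloc, parts.path.lstrip("/")


print(split_az_url("az://datasets/iris.csv"))
```

Note that the storage account is not part of the URL. It is passed separately through `storage_options`.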


In [None]:
import pandas as pd
import dask.dataframe as dd

from adlfs import AzureBlobFileSystem

In [None]:
# for single files in public blobs, you can use the https protocol and read directly into Pandas
df = pd.read_csv(
    "https://azuremlexamples.blob.core.windows.net/datasets/iris.csv"
)
df.head()

In [None]:
# the same with Dask
df = dd.read_csv(
    "https://azuremlexamples.blob.core.windows.net/datasets/iris.csv"
)
df.head()

In [None]:
# alternative syntax in Dask
storage_options = {"account_name": "azuremlexamples"}
df = dd.read_csv("az://datasets/iris.csv", storage_options=storage_options)
df.head()

## Why use Dask?

All three reads did the same thing, so why use Dask? It helps in a few scenarios:

- reading multiple files 
- reading private data from Azure with credentials
- reading directly into GPUs (with [cuDF](https://github.com/rapidsai/cudf))

You can also use the classes implemented in `adlfs` to query for files, depending on permissions.

To provide your own credentials, refer to the `adlfs` documentation for details. Generally, you can retrieve credentials from the workspace's datastore:

```python
import dask.dataframe as dd
from azureml.core import Workspace

ws = Workspace.from_config()
ds = ws.get_default_datastore()  # or a named one: ws.datastores["my-datastore-name"]

storage_options = {
    "account_name": ds.account_name,
    "account_key": ds.account_key,
}

df = dd.read_parquet(
    f"az://{ds.container_name}/path/to/data/*.parquet",
    storage_options=storage_options,
)
```
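
When no datastore is available, one alternative is to build `storage_options` from environment variables. This is a minimal sketch: the environment variable names and the `storage_options_from_env` helper are assumptions made for illustration, while `account_name`, `account_key`, and `sas_token` are arguments accepted by `adlfs`.

```python
import os


def storage_options_from_env(env=None):
    """Build an adlfs-style storage_options dict from environment variables."""
    env = os.environ if env is None else env
    # The environment variable names below are illustrative assumptions.
    mapping = {
        "account_name": "AZURE_STORAGE_ACCOUNT_NAME",
        "account_key": "AZURE_STORAGE_ACCOUNT_KEY",
        "sas_token": "AZURE_STORAGE_SAS_TOKEN",
    }
    # Keep only the options whose variable is set to a non-empty value.
    options = {key: env[var] for key, var in mapping.items() if env.get(var)}
    if "account_name" not in options:
        raise KeyError("AZURE_STORAGE_ACCOUNT_NAME is not set")
    return options


# A public container needs only the account name.
print(storage_options_from_env({"AZURE_STORAGE_ACCOUNT_NAME": "azureopendatastorage"}))
```

Keeping secrets out of the notebook itself means the file can be shared without leaking an account key.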

The basics are demonstrated below on public data.

In [None]:
color = "green"
container_name = "nyctlc"
storage_options = {"account_name": "azureopendatastorage"}

In [None]:
fs = AzureBlobFileSystem(**storage_options)
fs

In [None]:
fs.ls(container_name)

In [None]:
fs.ls(f"{container_name}/{color}")

In [None]:
fs.ls(f"{container_name}/{color}/puYear=2016/")

In [None]:
files = fs.glob(f"{container_name}/{color}/puYear=2016/puMonth=12/*.parquet")
files = [f"az://{file}" for file in files]
files[-5:]
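
The directory names in these listings follow a `key=value` layout (`puYear=2016`, `puMonth=12`). The sketch below pulls those values out of a path with plain string handling. `partition_values` is a hypothetical helper and the file name is invented; only the directory layout comes from the listing.

```python
def partition_values(path):
    """Extract key=value directory segments from a blob path."""
    values = {}
    for segment in path.split("/"):
        key, sep, value = segment.partition("=")
        if sep:  # only segments that actually contain "="
            values[key] = value
    return values


# The file name here is made up; the directories match the listing above.
example = "nyctlc/green/puYear=2016/puMonth=12/part-0000.parquet"
print(partition_values(example))
```

This can be handy for filtering the result of `fs.glob` by year or month before handing the files to Dask.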

In [None]:
%%time
ddf = (
    dd.read_parquet(files, storage_options=storage_options)
    .repartition(npartitions=8)
    .persist()
)
ddf
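
The `npartitions=8` above is a fixed choice. One rough way to pick a value is to divide the dataset size by a target partition size. The sketch below assumes a target of about 100 MB per partition, which is a commonly cited rule of thumb and not a requirement. `suggest_npartitions` is a hypothetical helper.

```python
import math


def suggest_npartitions(total_bytes, target_partition_bytes=100_000_000):
    """Return a partition count that keeps partitions near the target size."""
    return max(1, math.ceil(total_bytes / target_partition_bytes))


print(suggest_npartitions(750_000_000))
```

For a Dask DataFrame, `total_bytes` could come from `ddf.memory_usage(index=True, deep=True).sum().compute()`.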

In [None]:
%%time
len(ddf)

In [None]:
%%time
# repeated on purpose: compare this timing with the previous cell
len(ddf)

In [None]:
ddf.info()

In [None]:
import matplotlib.pyplot as plt

plt.style.use("dark_background")

ddf["tipAmount"].compute().hist(
    figsize=(16, 8), bins=256, range=(0.1, 20),
)

In [None]:
df = ddf.compute()
df.info()

In [None]:
%%time
df.describe()

In [None]:
%%time
gbs = round(df.memory_usage(index=True, deep=True).sum() / 1e9, 2)
print(f"df is {gbs} GB")

In [None]:
%%time
gbs = round(ddf.memory_usage(index=True, deep=True).sum().compute() / 1e9, 2)
print(f"ddf is {gbs} GB")