# Introduction to Dask

In this notebook, we'll learn how to use [Dask](https://dask.org) for reading and processing data from Azure.

## Install required packages

In [None]:
!pip install --upgrade dask distributed dask-sql bokeh adlfs fsspec fastparquet pyarrow python-snappy lz4 "pandas>=1.2.0"

## Get AML Workspace

You can use the Azure Machine Learning (AML) workspace to retrieve datastores and key vaults, which give you secure access to data credentials.

In [None]:
from azureml.core import Workspace

ws = Workspace.from_config()
ws

## Create a distributed client

The [client](https://distributed.dask.org/en/latest/client.html) is the primary entry point for parallel processing with Dask. Calling it without arguments creates a local distributed scheduler that uses all the CPU cores on your machine. This can speed up processing of large in-memory dataframes, and it also allows computations on data that does not fit in memory.

When your local machine isn't powerful enough, you can provision a larger VM in Azure - the M series has 100+ CPUs and TBs of RAM. If this still isn't powerful enough, you can create a distributed Dask cluster on most hardware - see [the Dask setup guide](https://docs.dask.org/en/latest/setup.html) for details.

If you still need acceleration, [RAPIDS](https://github.com/rapidsai) extends the PyData APIs to GPUs.

**Make sure you check out the dashboard!**

In [None]:
from distributed import Client

# initialize local client
c = Client()

# print Python objects
print(c)
print(c.dashboard_link)

# display the notebook widget
c
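The defaults used above can be overridden when the machine is shared or when a smaller footprint is wanted. The following is a minimal sketch, assuming a recent `distributed` release; the worker and thread counts are illustrative values, not recommendations, and `square` is a hypothetical helper.

```python
from distributed import Client


def square(x):
    """Hypothetical helper used only to give the workers something to do."""
    return x * x


# Start an in-process local cluster with an explicit size.
# processes=False keeps the workers in this Python process, and
# dashboard_address=None skips starting the dashboard for this short demo.
client = Client(
    n_workers=2,
    threads_per_worker=1,
    processes=False,
    dashboard_address=None,
)

# Submit work to the cluster and gather a single result.
futures = client.map(square, range(5))
total = client.submit(sum, futures).result()
print(total)

client.close()
```

For the rest of this notebook, the default `Client()` created above is all that is needed.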

## Reading cloud data

Reading data from the cloud is easy! The ``fsspec`` and ``adlfs`` packages installed above implement various cloud protocols for Python, including ``az`` for Blob and ADLSv2 and ``adl`` for ADLSv1.

### Public Data

Public data can simply be read via ``https``.


In [None]:
account_name = "azuremlexamples"
container_name = "datasets"

storage_options = {"account_name": account_name}

In [None]:
data_uri = f"https://{account_name}.blob.core.windows.net/{container_name}/iris.csv"
data_uri

In [None]:
import pandas as pd

df = pd.read_csv(data_uri)
df

Alternatively, we can use the ``az`` protocol and pass in ``storage_options``:

In [None]:
data_uri = f"az://{container_name}/iris.csv"
data_uri

In [None]:
import pandas as pd

df = pd.read_csv(data_uri, storage_options=storage_options)
df
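The ``https`` and ``az`` forms used above carry the same information: the storage account name, the container, and the blob path. As a rough illustration, the sketch below converts one form to the other using only the standard library. The function `https_to_az` is a hypothetical helper written for this notebook, not part of `adlfs` or `fsspec`, and it assumes the usual `<account>.blob.core.windows.net` host layout.

```python
from urllib.parse import urlsplit


def https_to_az(url):
    """Split a Blob Storage HTTPS URL into an az:// URI and storage options.

    Hypothetical helper; assumes a host of the form
    "<account>.blob.core.windows.net" and a path of "/<container>/<blob>".
    """
    parts = urlsplit(url)
    # The first label of the host name is the storage account.
    account_name = parts.hostname.split(".")[0]
    # The path holds the container followed by the blob path.
    container_and_blob = parts.path.lstrip("/")
    return f"az://{container_and_blob}", {"account_name": account_name}


uri, options = https_to_az(
    "https://azuremlexamples.blob.core.windows.net/datasets/iris.csv"
)
print(uri)
print(options)
```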

### Private Data

Passing in storage options allows for reading private data. For instance, you can easily retrieve the information from an Azure ML Datastore:

```python
from azureml.core import Workspace

ws = Workspace.from_config()
ds = ws.get_default_datastore() # ws.datastores["my-datastore-name"]

container_name = ds.container_name
storage_options = {
    "account_name": ds.account_name,
    "account_key": ds.account_key,
}
```

In [None]:
from adlfs import AzureBlobFileSystem

ds = ws.get_default_datastore()

container_name = ds.container_name
storage_options = {
    "account_name": ds.account_name,
    "account_key": ds.account_key,
}

fs = AzureBlobFileSystem(**storage_options)
fs

In [None]:
fs.ls(f"{container_name}")
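Storage options do not have to come from a datastore. The sketch below builds the same kind of dictionary from environment variables, so that no secret is written into the notebook. The helper `storage_options_from_env` and the variable names it reads are assumptions made for this example; `account_key` and `sas_token` are keyword arguments accepted by `AzureBlobFileSystem`.

```python
import os


def storage_options_from_env(env=None):
    """Build storage options from environment variables.

    Hypothetical helper; the variable names below are assumptions
    chosen for this example.
    """
    env = os.environ if env is None else env
    options = {"account_name": env["AZURE_STORAGE_ACCOUNT_NAME"]}
    # Prefer an account key when present, otherwise fall back to a SAS token.
    if env.get("AZURE_STORAGE_ACCOUNT_KEY"):
        options["account_key"] = env["AZURE_STORAGE_ACCOUNT_KEY"]
    elif env.get("AZURE_STORAGE_SAS_TOKEN"):
        options["sas_token"] = env["AZURE_STORAGE_SAS_TOKEN"]
    return options


# Demonstrate with a stand-in mapping instead of real credentials.
example_options = storage_options_from_env(
    {
        "AZURE_STORAGE_ACCOUNT_NAME": "myaccount",
        "AZURE_STORAGE_SAS_TOKEN": "sv=placeholder",
    }
)
print(sorted(example_options))
```

The resulting dictionary can be passed as `storage_options` in the same way as the datastore-derived one above.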

## Pythonic Filesystem

In the previous section, we used [ADLFS](https://github.com/dask/adlfs) to initialize a Pythonic filesystem and perform operations.

The cells below demonstrate some basic operations for reading and manipulating data in Python.

In [None]:
color = "green"
container_name = "nyctlc"
storage_options = {"account_name": "azureopendatastorage"}

In [None]:
fs = AzureBlobFileSystem(**storage_options)
fs

In [None]:
fs.ls(f"{container_name}")

In [None]:
fs.ls(f"{container_name}/{color}")

In [None]:
fs.ls(f"{container_name}/{color}/puYear=2016/")

In [None]:
files = fs.glob(f"{container_name}/{color}/puYear=2016/puMonth=12/*.parquet")
files = [f"az://{file}" for file in files]
files[-5:]
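The listings above show that the dataset is laid out in `key=value` directories such as `puYear=2016/puMonth=12`. As a small sketch of how that layout can be read back from a path, the hypothetical helper below collects those segments into a dictionary. The file name in the example is made up for illustration.

```python
def parse_partitions(path):
    """Return the key=value directory segments of a path as a dict.

    Hypothetical helper for illustration; values are kept as strings.
    """
    partitions = {}
    for segment in path.split("/"):
        key, sep, value = segment.partition("=")
        if sep:  # only segments that contain "=" are partition directories
            partitions[key] = value
    return partitions


# The file name here is invented; only the directory layout matches the listing.
example_path = "az://nyctlc/green/puYear=2016/puMonth=12/part-0000.parquet"
partitions = parse_partitions(example_path)
print(partitions)
```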

In [None]:
import dask.dataframe as dd

ddf = (
    dd.read_parquet(files, storage_options=storage_options)
    .repartition(npartitions=8)
    .persist()
)
ddf

In [None]:
%%time
len(ddf)

In [None]:
%%time
len(ddf)

In [None]:
ddf.info()

In [None]:
import matplotlib.pyplot as plt

plt.style.use("dark_background")

ddf["tipAmount"].compute().hist(
    figsize=(16, 8),
    bins=256,
    range=(0.1, 20),
)

In [None]:
df = ddf.compute()
df.info()

In [None]:
%%time
df.describe()

In [None]:
%%time
gbs = round(df.memory_usage(index=True, deep=True).sum() / 1e9, 2)
print(f"df is {gbs} GBs")

In [None]:
%%time
gbs = round(ddf.memory_usage(index=True, deep=True).sum().compute() / 1e9, 2)
print(f"ddf is {gbs} GBs")