## Scaling your workflow with Dask

This notebook walks through examples of using Dask to scale common workflows on tabular data.

Dask works either on a single machine (executing in parallel using threads) or on a cluster of many machines. The examples here will run on a cluster, just for fun, but Dask is also useful for working with larger-than-memory datasets on a single machine.

When using Dask on a cluster, the typical pattern is to:

1. Create a Dask Cluster (using one of our many [deployment options](https://docs.dask.org/en/latest/setup.html) that talks to the resource manager)
2. Connect to the cluster with a local Client
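
If you don't have a managed cluster at hand, the same two steps can be tried on a single machine. The following is a minimal sketch, assuming `distributed` is installed; `LocalCluster` stands in for whichever deployment option you choose, and the names `local_cluster` and `local_client` and the worker counts are arbitrary choices for illustration.

```python
from distributed import Client, LocalCluster

# Step 1: create a cluster. LocalCluster runs the scheduler and workers on
# this machine; processes=False keeps everything inside one process.
local_cluster = LocalCluster(
    n_workers=2,
    threads_per_worker=1,
    processes=False,
    dashboard_address=None,  # do not start the diagnostics dashboard
)

# Step 2: connect a Client. Work submitted through it runs on the cluster.
local_client = Client(local_cluster)
future = local_client.submit(sum, [1, 2, 3])
result = future.result()
print(result)

# Release the workers when finished.
local_client.close()
local_cluster.close()
```

Swapping `LocalCluster` for another cluster class should leave step 2 unchanged, since the `Client` only needs something to connect to.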

In [3]:
# This example uses Coiled, a for-profit company that will
# manage Dask deployments for you. You could also do it yourself
# and use one of
# * dask_ssh.SSHCluster()
# * dask_yarn.YarnCluster()
# * dask_jobqueue.PBSCluster()
# * dask_kubernetes.KubeCluster()
# * dask_gateway.GatewayCluster()
# * dask_cloudprovider.FargateCluster()
# * dask_cloudprovider.AzureMLCluster()
# * dask_saturn.SaturnCluster()
# * ...

import coiled
cluster = coiled.Cluster(account="tomaugspurger")
cluster

Creating Cluster. This takes about a minute ...

Exception: Page not found (404)
Request Method: PATCH
Request URL: https://beta.coiled.io/api/v1/tomaugspurger/cluster/None/scale/
The current path, api/v1/tomaugspurger/cluster/None/scale/, didn't match any of the URL patterns in cloud.urls.


Once we have a cluster (Coiled, PBS, Kubernetes, or otherwise), connect to it with a `Client`. After this, all Dask-backed operations run on the cluster.

In [None]:
from distributed import Client

client = Client(cluster)
client

In [None]:
import pandas as pd

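# Nullable unsigned 8-bit integers keep these small-valued columns compact
# while still allowing missing values.
dtype = {
    "payment_type": "UInt8",
    "VendorID": "UInt8",
    "passenger_count": "UInt8",
    "RatecodeID": "UInt8",
}

# Read only the first 1,000 rows with pandas to explore the data locally.
df = pd.read_csv(
    "s3://nyc-tlc/trip data/yellow_tripdata_2019-01.csv",
    dtype=dtype,
    parse_dates=["tpep_pickup_datetime", "tpep_dropoff_datetime"],
    nrows=1000,
)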
df

In [None]:
counts = df.passenger_count.value_counts()
counts.sort_index().plot.bar(rot=0, width=1, color='k');

In [None]:
df.groupby("passenger_count").tip_amount.mean()

Dask DataFrame mimics the pandas API, so *many* of the
APIs you're familiar with will work with Dask. There are often
some Dask-specific keywords as well, reflecting the fact that
parallel and distributed computing has its own set of concerns.

Dask's readers typically accept a list of URLs / files, or a globstring indicating a list of files to read in.

In [None]:
import dask.dataframe as dd

ddf = dd.read_csv(
    "s3://nyc-tlc/trip data/yellow_tripdata_2019-*.csv",
    dtype=dtype,
    parse_dates=["tpep_pickup_datetime", "tpep_dropoff_datetime"],
    storage_options={"anon": True},
    blocksize="16 MiB",
)
ddf

A few things to note:

1. We reused the `dtype` and `parse_dates` options, just like before
2. We have additional Dask-specific options like `blocksize`
3. The result returned almost instantly
4. The values in the table aren't shown, just the structure (column names and dtypes)

Dask's high-level collections like `dask.dataframe` are lazy. They do just enough work to propagate metadata for operations, such as the type of the output, the column names, and the dtypes.

In [None]:
dask_counts = ddf.passenger_count.value_counts()
dask_counts

In [None]:
ax = (
    dask_counts
    .compute()
    .sort_index()
    .plot.bar(rot=0, width=1, color='k')
);

In [None]:
ddf = ddf.persist()

In [None]:
dask_counts = ddf.passenger_count.value_counts()
ax = (
    dask_counts
    .compute()
    .sort_index()
    .plot.bar(rot=0, width=1, color='k')
);

In [None]:
ddf.groupby("passenger_count").tip_amount.mean().compute()

In [None]:
ddf.fare_amount.quantile([0.25, 0.5, .75]).compute()

In [None]:
(ddf.fare_amount + ddf.tip_amount).head()

In [None]:
ddf.RatecodeID.isna().mean().compute()

### Dask is familiar

<img width="40%" src="https://docs.dask.org/en/latest/_images/dask-dataframe.svg"/>

We saw earlier that `dask.dataframe` mimics the pandas API: we could use the same keywords to get the same behavior. But perhaps more importantly, Dask feels familiar because it uses pandas to do the DataFrame operations. A Dask `value_counts` is just a bunch of pandas `value_counts` calls, one per partition, plus a bit of logic to combine the results. Likewise, a Dask Array is a bunch of NumPy arrays with some logic for working with them in parallel.
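
The "bunch of pandas calls plus a combine step" idea can be imitated with pandas alone. This is an illustrative sketch, not Dask's actual implementation; the series values and the two-way split are made up.

```python
import pandas as pd

s = pd.Series([1, 2, 1, 3, 1, 2, 2, 1], name="passenger_count")

# Split the data into "partitions" (here, two contiguous slices).
partitions = [s.iloc[:4], s.iloc[4:]]

# Step 1: ordinary pandas value_counts on each partition independently.
partials = [part.value_counts() for part in partitions]

# Step 2: combine by summing the counts that share the same value.
combined = pd.concat(partials).groupby(level=0).sum()

# The combined result matches running pandas on the whole series at once.
expected = s.value_counts()
print(combined.to_dict() == expected.to_dict())
```

Step 1 is the part that can run in parallel, since each partition is counted without looking at the others.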

In [None]:
import geopandas

zones = geopandas.read_file("./taxi_zones")
zones.head()

In [None]:
zones.plot();

In [None]:
center = zones.geometry.centroid.to_crs(crs="EPSG:4326")
zones["lng"] = center.x
zones["lat"] = center.y
# for memory savings
zones['borough'] = zones['borough'].astype('category')

In [None]:
df[['PULocationID', 'DOLocationID']].head()

In [None]:
zones[['LocationID', 'borough', 'lat', 'lng']].rename(
    columns=lambda x: f"DO{x}"
)

In [None]:
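# Attach drop-off zone details; the renamed columns share only DOLocationID with df
df2 = pd.merge(
    df,
    zones[['LocationID', 'borough', 'lat', 'lng']].rename(
        columns=lambda x: f"DO{x}"
    ),
    on="DOLocationID",
)
# Attach pick-up zone details the same way, joining on PULocationID
df3 = pd.merge(
    df2,
    zones[['LocationID', 'borough', 'lat', 'lng']].rename(
        columns=lambda x: f"PU{x}"
    ),
    on="PULocationID",
)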
df3.head()

In [None]:
import numpy as np


def gcd(lat1, lng1, lat2, lng2):
    '''
    Calculate great circle distance.
    http://www.johndcook.com/blog/python_longitude_latitude/

    Parameters
    ----------
    lat1, lng1, lat2, lng2: float or array of float

    Returns
    -------
    distance:
      distance from ``(lat1, lng1)`` to ``(lat2, lng2)`` in kilometers.
    '''
    # python2 users will have to use ascii identifiers
    ϕ1 = np.deg2rad(90 - lat1)
    ϕ2 = np.deg2rad(90 - lat2)

    θ1 = np.deg2rad(lng1)
    θ2 = np.deg2rad(lng2)

    cos = (np.sin(ϕ1) * np.sin(ϕ2) * np.cos(θ1 - θ2) +
           np.cos(ϕ1) * np.cos(ϕ2))
    arc = np.arccos(cos)
    return arc * 6373

In [None]:
gcd(df3.PUlat, df3.PUlng, df3.DOlat, df3.DOlng)
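
A quick way to sanity-check a distance function like this is to feed it points whose separation is known. The sketch below restates the same formula under a hypothetical name, `great_circle_km`, with ASCII identifiers so that it runs on its own. The test points are made up.

```python
import numpy as np


def great_circle_km(lat1, lng1, lat2, lng2, radius_km=6373):
    # Same spherical law of cosines formula as gcd, written with ASCII names.
    phi1 = np.deg2rad(90 - lat1)
    phi2 = np.deg2rad(90 - lat2)
    theta1 = np.deg2rad(lng1)
    theta2 = np.deg2rad(lng2)
    cos = (np.sin(phi1) * np.sin(phi2) * np.cos(theta1 - theta2)
           + np.cos(phi1) * np.cos(phi2))
    return np.arccos(cos) * radius_km


# One degree of latitude along a meridian should equal radius * pi / 180.
one_degree = great_circle_km(0.0, 0.0, 1.0, 0.0)
print(np.isclose(one_degree, 6373 * np.pi / 180))

# Array inputs are handled elementwise, as with the DataFrame columns.
lats = np.array([0.0, 10.0, 40.0])
distances = great_circle_km(lats, 0.0, lats + 1.0, 0.0)
print(distances)
```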

In [None]:
# Same joins as the pandas version, now against the Dask DataFrame
ddf2 = dd.merge(
    ddf,
    zones[['LocationID', 'borough', 'lat', 'lng']].rename(
        columns=lambda x: f"DO{x}"
    ),
    on="DOLocationID",
)
ddf3 = dd.merge(
    ddf2,
    zones[['LocationID', 'borough', 'lat', 'lng']].rename(
        columns=lambda x: f"PU{x}"
    ),
    on="PULocationID",
)
ddf3.head()

In [None]:
distance = gcd(ddf3.PUlat, ddf3.PUlng, ddf3.DOlat, ddf3.DOlng)
distance.head()

In [None]:
distance.quantile([0.1, 0.25, 0.5, 0.75, .9]).compute()

## Parallelizing Custom Code

Not all problems fit in the big array / big dataframe model. There's often some bespoke munging that needs to be done before data can be loaded into an array or dataframe. `dask.delayed` helps here.

In [None]:
from time import sleep

def inc(x):
    sleep(1)
    return x + 1

def add(x, y):
    sleep(1)
    return x + y

In [None]:
%%time
# This takes three seconds to run because we call each
# function sequentially, one after the other

x = inc(1)
y = inc(2)
z = add(x, y)

In [None]:
import dask

In [None]:
%%time
# This returns immediately; all it does is build a task graph

x = dask.delayed(inc)(1)
y = dask.delayed(inc)(2)
z = dask.delayed(add)(x, y)

In [None]:
%%time
# This actually runs the computation in parallel
z.compute()

In [None]:
z

In [None]:
z.visualize()

In [None]:
data = [1, 2, 3, 4, 5, 6, 7, 8]

In [None]:
%%time
results = []
for x in data:
    y = inc(x)
    results.append(y)
    
total = sum(results)

In [None]:
%%time
results = []
for x in data:
    y = dask.delayed(inc)(x)
    results.append(y)
    
total = sum(results)

In [None]:
%time total.compute()

In [None]:
total.visualize(rankdir="LR")

In [None]:
client.restart()

### Privacy Example

https://github.com/capeprivacy/cape-python/blob/master/examples/tutorials/credit/mask_credit_data_in_pandas.ipynb

In [1]:
import matplotlib.pyplot as plt
import pandas as pd

import cape_privacy as cape
from cape_privacy.pandas import dtypes
from cape_privacy.pandas import transformations as tfms

In [55]:
url = "https://raw.githubusercontent.com/capeprivacy/cape-python/master/examples/tutorials/credit/data/credit_with_pii.csv"
credit = pd.read_csv(url, parse_dates=["Application_date"])
credit.head()

Unnamed: 0,Name,City,Street_address,Salary,Application_date,Age,Sex,Job,Housing,Saving accounts,Checking account,Credit amount,Duration,Purpose,Risk
0,Robert Fitzgerald,Greenmouth,349 Alexander Coves Apt. 799,36964.79,2018-08-03,67,male,2,own,,little,1169,6,radio/TV,good
1,Daniel Kim,North Jessica,349 Jesse Park Suite 888,87884.39,2018-08-07,22,female,2,own,little,moderate,5951,48,radio/TV,bad
2,Frederick Jordan,Lake Erika,389 Graham Drive Suite 973,41157.73,2018-05-14,49,male,1,own,little,,2096,12,education,good
3,Tara Rojas,East Jessica,01579 Ramirez Drives Apt. 587,36214.8,2018-04-30,45,male,2,free,little,little,7882,42,furniture/equipment,good
4,Gail Donovan,Randyport,467 Christopher Well,41353.49,2018-05-27,53,male,2,free,little,little,4870,24,car,bad


In [56]:
tokenize_name = tfms.Tokenizer()
tokenize_sex = tfms.Tokenizer(max_token_len=10)
perturb_age = tfms.NumericPerturbation(dtype=dtypes.Integer, min=-5, max=5)
perturb_application_date = tfms.DatePerturbation(frequency="DAY", min=-3, max=3)
redact_location = tfms.ColumnRedact(columns=["Street_address", "City"])
round_salary = tfms.NumericRounding(dtype=dtypes.Float, precision=-3)

def mask(df):
    df = df.copy()
    name = tokenize_name(df['Name'])
    sex = tokenize_sex(df['Sex'])
    age = perturb_age(df['Age'])
    application_date = perturb_application_date(df['Application_date'])
    salary = round_salary(df['Salary'])
    df = redact_location(df)
    
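    # Note: the last keyword is spelled `Salar`, so the rounded values land in
    # a new `Salar` column and the original `Salary` column is left unmasked.
    # Use `Salary=salary` to overwrite the original column instead.
    return df.assign(
        Name=name, Sex=sex, Age=age, Application_date=application_date,
        Salar=salary
    )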

caped_df = mask(credit)
caped_df.head()

Unnamed: 0,Name,Salary,Application_date,Age,Sex,Job,Housing,Saving accounts,Checking account,Credit amount,Duration,Purpose,Risk,Salar
0,c7ca19babf7be43b63fbf0cd98c5ed0d20321f811829de...,36964.79,2018-07-31,68,98439e2dc2,2,own,,little,1169,6,radio/TV,good,37000.0
1,1ff894ac2d0fd34b6d5d7a24867da33406858f0b87c93b...,87884.39,2018-08-06,21,e926a69cbb,2,own,little,moderate,5951,48,radio/TV,bad,88000.0
2,b0dc4a975f0118a960784582b5908481a16e98808e72fc...,41157.73,2018-05-13,49,98439e2dc2,1,own,little,,2096,12,education,good,41000.0
3,f319b90d2ce720fe8035b15db87426bcd09874ef699019...,36214.8,2018-04-30,48,98439e2dc2,2,free,little,little,7882,42,furniture/equipment,good,36000.0
4,d5ff340dcb6e3d3f6e9c7a0d284c0ac0205a4d7b0d0cd3...,41353.49,2018-05-27,53,98439e2dc2,2,free,little,little,4870,24,car,bad,41000.0


In [57]:
tokenize_name(credit.Name)

0      c7ca19babf7be43b63fbf0cd98c5ed0d20321f811829de...
1      1ff894ac2d0fd34b6d5d7a24867da33406858f0b87c93b...
2      b0dc4a975f0118a960784582b5908481a16e98808e72fc...
3      f319b90d2ce720fe8035b15db87426bcd09874ef699019...
4      d5ff340dcb6e3d3f6e9c7a0d284c0ac0205a4d7b0d0cd3...
                             ...                        
995    a7564f309651e72bbf6661f84781d8f75a6ba74775f289...
996    c25a9f04b15095330f9acca6cf4d2cc71a8facf503c757...
997    9e9c6d3d9934d790abb4a5d0a4f383a111856a6987176b...
998    648f4294e2db1adf660debed0b78514a0dfede5ca926f1...
999    58a6bcafe8c6df370068e652fdf73979c82346a38d3d24...
Name: Name, Length: 1000, dtype: object

In [58]:
import dask.dataframe as dd

# Split the pandas DataFrame into 4 partitions
dcredit = dd.from_pandas(credit, npartitions=4)
dcredit

npartitions=4,Name,City,Street_address,Salary,Application_date,Age,Sex,Job,Housing,Saving accounts,Checking account,Credit amount,Duration,Purpose,Risk
0,object,object,object,float64,datetime64[ns],int64,object,int64,object,object,object,int64,int64,object,object
250,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
500,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
750,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
999,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...


In [59]:
dcredit.Name.map_partitions(tokenize_name)

Dask Series Structure:
npartitions=4
0      object
250       ...
500       ...
750       ...
999       ...
Name: Name, dtype: object
Dask Name: Tok, 12 tasks

In [75]:
dcape_df = dcredit.map_partitions(mask)
dcape_df

npartitions=4,Name,Salary,Application_date,Age,Sex,Job,Housing,Saving accounts,Checking account,Credit amount,Duration,Purpose,Risk,Salar
0,object,float64,datetime64[ns],int64,object,int64,object,object,object,int64,int64,object,object,float32
250,...,...,...,...,...,...,...,...,...,...,...,...,...,...
500,...,...,...,...,...,...,...,...,...,...,...,...,...,...
750,...,...,...,...,...,...,...,...,...,...,...,...,...,...
999,...,...,...,...,...,...,...,...,...,...,...,...,...,...


In [76]:
dcape_df.head()

Unnamed: 0,Name,Salary,Application_date,Age,Sex,Job,Housing,Saving accounts,Checking account,Credit amount,Duration,Purpose,Risk,Salar
0,c7ca19babf7be43b63fbf0cd98c5ed0d20321f811829de...,36964.79,2018-08-05,70,98439e2dc2,2,own,,little,1169,6,radio/TV,good,37000.0
1,1ff894ac2d0fd34b6d5d7a24867da33406858f0b87c93b...,87884.39,2018-08-09,26,e926a69cbb,2,own,little,moderate,5951,48,radio/TV,bad,88000.0
2,b0dc4a975f0118a960784582b5908481a16e98808e72fc...,41157.73,2018-05-12,49,98439e2dc2,1,own,little,,2096,12,education,good,41000.0
3,f319b90d2ce720fe8035b15db87426bcd09874ef699019...,36214.8,2018-04-27,45,98439e2dc2,2,free,little,little,7882,42,furniture/equipment,good,36000.0
4,d5ff340dcb6e3d3f6e9c7a0d284c0ac0205a4d7b0d0cd3...,41353.49,2018-05-29,57,98439e2dc2,2,free,little,little,4870,24,car,bad,41000.0


In [77]:
dcape_df.compute()

KeyError: 0

## Scalable Machine Learning with Dask-ML



In [82]:
import pandas as pd
from cape_privacy.pandas import transformations as tfms

perturb_application_date = tfms.DatePerturbation(frequency="DAY", min=-3, max=3)
s = pd.Series(pd.date_range('2000', periods=12), index=list(range(1, 13)))
s

1    2000-01-01
2    2000-01-02
3    2000-01-03
4    2000-01-04
5    2000-01-05
6    2000-01-06
7    2000-01-07
8    2000-01-08
9    2000-01-09
10   2000-01-10
11   2000-01-11
12   2000-01-12
dtype: datetime64[ns]

In [83]:
perturb_application_date(s)

KeyError: 0
