# Parallelize code with Dask Delayed

In this notebook we demonstrate:

* A few words about Pandas
* Reading CSV files using Delayed
* A reading data example
* Sequential code: Mean CO3 Per Core
* Parallelize the sequential code using Dask delayed
---

- Authors: NCI Virtual Research Environment Team
- Keywords: Dask, Delayed, Pandas, DataFrame
- Create Date: 2020-April
---

### A few words about Pandas

Pandas is an open source library providing high-performance, easy-to-use data structures and data analysis tools. Pandas is particularly suited to the analysis of tabular data, i.e. data that can go into a table. In other words, if you can imagine the data in an Excel spreadsheet, then Pandas is the tool for the job.

Pandas provides tools for reading and writing data between in-memory data structures and different formats: CSV and text files, Microsoft Excel, SQL databases, and the fast HDF5 format.
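
As a small illustration of this round trip between a file format and an in-memory structure, the sketch below parses CSV text into a DataFrame and writes it back out. The column names and values are made up for the example, and `io.StringIO` stands in for a file on disk.

```python
import io

import pandas as pd

# Hypothetical CSV content; io.StringIO lets pandas treat a string like a file.
csv_text = "depth_cm,age_ka\n815,51.9\n853,54.6\n916,60.7\n"

# Text format -> in-memory DataFrame
df = pd.read_csv(io.StringIO(csv_text))

# In-memory DataFrame -> text format (index=False omits the row labels)
round_trip = df.to_csv(index=False)

print(df.shape)
print(round_trip.splitlines()[0])
```

The same pattern applies to the other formats mentioned above through the corresponding `read_*` and `to_*` functions, some of which need optional dependencies to be installed.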

Python with Pandas is in use in a wide variety of academic and commercial domains, including finance, neuroscience, economics, statistics, advertising, web analytics, and more.

Choose from the following two options to create a client:

In [1]:
# If you run this notebook on your local computer or on NCI's VDI instance, you can create a local cluster
from dask.distributed import Client
client = Client()
print(client)

Perhaps you already have a cluster running?
Hosting the HTTP server on port 39867 instead
  http_address["port"], self.http_server.port


<Client: 'tcp://127.0.0.1:35125' processes=4 threads=8, memory=33.56 GB>


In [1]:
# If you run this notebook on Gadi under the pangeo environment, you can connect to a cluster using the scheduler.json file
from dask.distributed import Client, LocalCluster
client = Client(scheduler_file='scheduler.json')
print(client)

<Client: 'tcp://10.6.21.68:8773' processes=8 threads=48, memory=202.48 GB>


<div class="alert alert-info">
<b>Warning: Please make sure you specify the correct path to the scheduler.json file within your environment.</b>  
</div>

<div class="alert alert-warning">
<b>NOTE:</b> If you run this notebook on your local computer, make sure that your local computer has multiple cores. Otherwise, your parallel code won't perform any better than sequential code! 
</div>
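
One quick way to check this is to ask Python how many logical cores it can see. This is a minimal sketch using only the standard library; the reported number is the machine's logical core count, which may be larger than what a batch job is actually allowed to use.

```python
import os

# os.cpu_count() can return None when the count cannot be determined,
# so fall back to 1 in that case.
n_cores = os.cpu_count() or 1

print(f"Logical cores visible to Python: {n_cores}")
if n_cores < 2:
    print("Only one core: expect little or no speed-up from parallel code.")
```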

Starting the Dask Client also provides a dashboard, which is useful for gaining insight into the computation. The link to the dashboard becomes visible when you create the Client. We recommend keeping the dashboard open on one side of your screen and your notebook on the other, which is useful for learning purposes.

## Scale up CSV file reading using `delayed`

We will apply `delayed` to a real data processing task, albeit a simple one.

Consider reading three CSV files (found in `/g/data/dk92/notebooks/demo_data/CSV`) with `pd.read_csv` and then measuring their total length. We will consider how you would do this with ordinary Python code, then build a graph for this process using delayed, and finally execute this graph using Dask. The Dask execution provides a handy speed-up factor of more than two (there are only three inputs to parallelize over).

In [2]:
import pandas as pd
import os
from glob import glob
from dask import delayed
import numpy

filenames = sorted(glob('/g/data/dk92/notebooks/demo_data/CSV/*.csv'))
filenames

['/g/data/dk92/notebooks/demo_data/CSV/csvfile1.csv',
 '/g/data/dk92/notebooks/demo_data/CSV/csvfile2.csv',
 '/g/data/dk92/notebooks/demo_data/CSV/csvfile3.csv']

In [3]:
%%time

# normal, sequential code
a = pd.read_csv(filenames[0])
b = pd.read_csv(filenames[1])
c = pd.read_csv(filenames[2])

na = len(a)
nb = len(b)
nc = len(c)

total = sum([na, nb, nc])
print(total)

27
CPU times: user 14.3 ms, sys: 0 ns, total: 14.3 ms
Wall time: 151 ms


Your task is to build the task graph for this computation by applying the `delayed` function to the original Python code. The three functions you want to delay are `pd.read_csv`, `len` and `sum`.

In [4]:
%%time

# delayed version of the sequential code: this only builds the task graph, nothing is read yet
delayed_read_csv = delayed(pd.read_csv)
a = delayed_read_csv(filenames[0])
b = delayed_read_csv(filenames[1])
c = delayed_read_csv(filenames[2])

delayed_len = delayed(len)
na = delayed_len(a)
nb = delayed_len(b)
nc = delayed_len(c)

delayed_sum = delayed(sum)

total = delayed_sum([na, nb, nc])
total

CPU times: user 1.2 ms, sys: 587 µs, total: 1.79 ms
Wall time: 1.45 ms


Delayed('sum-c2bd7ac4-e9cc-4468-ad76-9f6f495cc8e6')
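
The cell above finishes in about a millisecond because it only records what should be done; no file is read until `compute()` is called. The sketch below makes that visible with a stand-in function that logs each real execution. The function `load` and the `calls` list are hypothetical helpers for this illustration, and `scheduler="synchronous"` is used so the work runs in the current process even when a distributed client is active.

```python
from dask import delayed

calls = []  # records every real execution of `load`


def load(name):
    """Stand-in for an expensive step such as pd.read_csv (hypothetical)."""
    calls.append(name)
    return [1, 2, 3]


# Building the graph: `load` is not called yet.
lazy_rows = delayed(load)("csvfile1.csv")
lazy_len = delayed(len)(lazy_rows)
calls_before = list(calls)

# Executing the graph: now `load` runs exactly once.
n_rows = lazy_len.compute(scheduler="synchronous")
calls_after = list(calls)

print(calls_before, calls_after, n_rows)
```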

In [5]:
%time print(total.compute())

27
CPU times: user 6.17 ms, sys: 2.59 ms, total: 8.76 ms
Wall time: 41.9 ms


Next, repeat this using loops rather than writing out all the variables.

In [6]:
# concise version
csvs = [delayed(pd.read_csv)(fn) for fn in filenames]
lens = [delayed(len)(csv) for csv in csvs]
total = delayed(sum)(lens)
%time print(total.compute())

27
CPU times: user 5.9 ms, sys: 0 ns, total: 5.9 ms
Wall time: 38.7 ms
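
If you do not have access to `/g/data`, the same pattern can be tried on throwaway files. The following sketch writes three small, made-up CSV files to a temporary directory and counts their rows with the loop-based version; the file names and contents are assumptions for this example only.

```python
import tempfile
from pathlib import Path

import pandas as pd
from dask import delayed

with tempfile.TemporaryDirectory() as tmp:
    # Write three small, made-up CSV files with 2, 3 and 4 data rows.
    toy_paths = []
    for i, n_rows in enumerate([2, 3, 4], start=1):
        path = Path(tmp) / f"toy{i}.csv"
        pd.DataFrame({"value": range(n_rows)}).to_csv(path, index=False)
        toy_paths.append(str(path))

    # Same pattern as the concise version above.
    frames = [delayed(pd.read_csv)(p) for p in sorted(toy_paths)]
    lengths = [delayed(len)(frame) for frame in frames]
    # Compute inside the `with` block, while the files still exist.
    toy_total = delayed(sum)(lengths).compute()

print(toy_total)
```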


## Real example

### Inspect Data

We will use the supplementary data of the paper **Sequestration of carbon in the deep Atlantic during the last glaciation** by Yu *et al.* published in Nature Geoscience, 2016, doi:10.1038/ngeo2657.

I downloaded the data and reorganized it into several CSV files saved under `/g/data/dk92/notebooks/demo_data/Nature_geo_csv`. This dataset includes lab measurements of pH (i.e., CO$_{3}$, in $\mu$mol/kg), oxygen isotopes, carbon isotopes, and CaCO$_{3}$ in sediments at different depths of the Ocean Deep Drilling (ODD) cores in the Atlantic Ocean. The naming convention for those files is `coreID-measurements.csv`.

In [7]:
import os
sorted(os.listdir('/g/data/dk92/notebooks/demo_data/Nature_geo_csv'))

['.DS_Store',
 'EW9209-2JPC-PH.csv',
 'MD01-2446-O-C.csv',
 'MD01-2446-PH.csv',
 'MD95-2039-CaCO3.csv',
 'MD95-2039-O-C.csv',
 'MD95-2039-PH.csv',
 'RC13-228-O-C.csv',
 'RC13-228-PH.csv',
 'RC13-229-O-C.csv',
 'RC13-229-PH.csv',
 'RC16-59-PH.csv',
 'TNO57-21-PH.csv']

#### Read one file with pandas.read_csv and compute the mean pH value of a core.

We can use `pandas.read_csv()` to access the CSV files.

In [8]:
import pandas as pd
# skip the first two lines
# line1: core name
# line2: units of the measurement in each column
df = pd.read_csv("/g/data/dk92/notebooks/demo_data/Nature_geo_csv/TNO57-21-PH.csv",skiprows=2)
df.head()

Unnamed: 0,top,btm,mid,age,Cw B/Ca,CO3
0,815,816,815.5,51.9,123.4,83.3
1,853,854,853.5,54.6,128.8,88.0
2,916,917,916.5,60.7,114.0,75.0
3,925,926,925.5,61.4,113.3,74.5
4,936,937,936.5,62.3,111.3,72.7
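
To see what `skiprows=2` does without touching the real files, here is a minimal sketch on made-up text that mimics the layout described in the comments above (a core name line, a units line, then the column names). The core name, units, and values are placeholders, not data from the paper.

```python
import io

import pandas as pd

# Made-up file content in the assumed layout:
# line 1 = core name, line 2 = units, line 3 = column names.
raw = (
    "Core: EXAMPLE-01\n"
    "cm,cm,ka,umol/kg\n"
    "top,btm,age,CO3\n"
    "815,816,51.9,83.3\n"
    "853,854,54.6,88.0\n"
)

# skiprows=2 drops the two preamble lines, so line 3 becomes the header.
toy_df = pd.read_csv(io.StringIO(raw), skiprows=2)

print(list(toy_df.columns))
print(toy_df["CO3"].mean())
```

Without `skiprows`, pandas would try to use the core name line as the header, and the column names would be wrong.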


In [9]:
# What is the schema?
df.dtypes

top          int64
btm          int64
mid        float64
age        float64
Cw B/Ca    float64
CO3        float64
dtype: object

In [10]:
# Get the mean value of each column
df.mean()

top        1092.583333
btm        1093.604167
mid        1093.093750
age          73.637500
Cw B/Ca     125.895833
CO3          85.506250
dtype: float64

In [11]:
# We can get a single column as a Series using Python's getitem syntax on the DataFrame object.
df['CO3']

# or specify one column to get the mean of that data series only
df.CO3.mean()

# Find the number of data points
import numpy as np
np.size(df['CO3'])

48
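
Only the last expression of a notebook cell is displayed, which is why the cell above shows just the number of data points. The sketch below stores each result in its own variable and also contrasts two ways of counting. The values are made up, and the missing entry is included only to show the difference; whether the real files contain missing values is not stated here.

```python
import numpy as np
import pandas as pd

# Made-up values; the NaN marks a missing measurement.
toy = pd.DataFrame({"CO3": [83.3, 88.0, np.nan, 74.5]})

# Both spellings return the same Series.
same_column = toy["CO3"].equals(toy.CO3)

n_all = np.size(toy["CO3"])          # counts every row, including NaN
n_valid = int(toy["CO3"].count())    # counts only non-missing values
co3_mean = toy["CO3"].mean()         # NaN is skipped by default

print(same_column, n_all, n_valid, co3_mean)
```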

### Sequential code: Mean CO$_{3}$ Per Core

The cells above compute the mean CO$_{3}$ for a single core. Here we extend that to all cores using a sequential for loop.

In [12]:
from glob import glob
filenames = sorted(glob('/g/data/dk92/notebooks/demo_data/Nature_geo_csv/*-PH.csv'))
filenames

['/g/data/dk92/notebooks/demo_data/Nature_geo_csv/EW9209-2JPC-PH.csv',
 '/g/data/dk92/notebooks/demo_data/Nature_geo_csv/MD01-2446-PH.csv',
 '/g/data/dk92/notebooks/demo_data/Nature_geo_csv/MD95-2039-PH.csv',
 '/g/data/dk92/notebooks/demo_data/Nature_geo_csv/RC13-228-PH.csv',
 '/g/data/dk92/notebooks/demo_data/Nature_geo_csv/RC13-229-PH.csv',
 '/g/data/dk92/notebooks/demo_data/Nature_geo_csv/RC16-59-PH.csv',
 '/g/data/dk92/notebooks/demo_data/Nature_geo_csv/TNO57-21-PH.csv']

In [13]:
%%time
means = []
counts = []
for fn in filenames:
    # Read in file
    df = pd.read_csv(fn,skiprows=2)
    
    # Get the mean CO3 for each core
    mean_CO3_each = df.CO3.mean()

    # Count how many data points in each core
    count = np.size(df['CO3'])

    # Save the intermediates
    means.append(mean_CO3_each)
    counts.append(count)

# Combine intermediates to get the overall mean CO3 and the total number of data points
mean_CO3 = np.mean(means)
n_dpoints = sum(counts)

CPU times: user 31.7 ms, sys: 8.61 ms, total: 40.4 ms
Wall time: 119 ms


In [14]:
means

[92.66666666666667,
 97.8157894736842,
 106.03571428571429,
 90.16,
 80.31818181818181,
 94.51515151515152,
 85.50625000000002]

In [15]:
mean_CO3
n_dpoints

263
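
Note that `mean_CO3` as computed above is the plain average of the per-core means, so every core counts equally regardless of how many samples it has. If a mean over all pooled measurements is wanted instead, the collected counts can serve as weights. This is a sketch with hypothetical numbers, not the values from the files above.

```python
import numpy as np

# Hypothetical per-core results (not the values from the files above).
toy_means = [90.0, 100.0]
toy_counts = [10, 30]

# What the loop above does: every core counts equally.
mean_of_means = np.mean(toy_means)

# Alternative: weight each core by its number of data points,
# which equals the mean over all pooled measurements.
pooled_mean = np.average(toy_means, weights=toy_counts)

print(mean_of_means, pooled_mean)
```

Which of the two is appropriate depends on the question being asked; the rest of this notebook keeps the unweighted version.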

### Parallelize the code above

Use `dask.delayed` to parallelize the code above. 

Note that methods and attribute access on delayed objects work automatically, so if you have a delayed object you can perform normal arithmetic, slicing, and method calls on it and it will produce the correct delayed calls.

```
x = delayed(np.arange)(10)
y = (x + 1)[::2].sum()  # everything here was delayed
```
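
A runnable version of this snippet, with the imports it needs and an explicit `compute()` call, might look as follows.

```python
import numpy as np
from dask import delayed
from dask.delayed import Delayed

x = delayed(np.arange)(10)
y = (x + 1)[::2].sum()   # still a Delayed object; nothing has run yet

is_lazy = isinstance(y, Delayed)
result = y.compute()     # runs arange -> add -> slice -> sum

print(is_lazy, result)
```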

Calling the `.compute()` method works well when you have a single output. When you have multiple outputs you might want to use the `dask.compute` function:

```
x = delayed(np.arange)(10)
y = x ** 2
min_, max_ = compute(y.min(), y.max())
min_, max_
(0, 81)
```
This way Dask can share the intermediate values (like `y = x**2`).
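
The effect of sharing can be checked by counting how often the intermediate step really runs. In this sketch, `square` and the `calls` list are hypothetical helpers, and `scheduler="synchronous"` keeps execution in the current process so the counter is updated locally. With a single `compute` call the squaring step is expected to run once; with two separate `.compute()` calls it runs once per call.

```python
import numpy as np
from dask import compute, delayed

calls = []  # one entry per real execution of `square`


def square(values):
    calls.append("square")
    return values ** 2


x = delayed(np.arange)(10)
y = delayed(square)(x)

# One call to dask.compute: both results share the same `y` task.
min_, max_ = compute(y.min(), y.max(), scheduler="synchronous")
calls_shared = len(calls)

# Two independent .compute() calls: `y` is recomputed each time.
calls.clear()
separate_min = y.min().compute(scheduler="synchronous")
separate_max = y.max().compute(scheduler="synchronous")
calls_separate = len(calls)

print(min_, max_, calls_shared, calls_separate)
```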

Your goal is to parallelize the code above (which has been copied below) using `dask.delayed`. You may also want to visualize some of the computation to see if you’re performing it correctly. This is just one way of using `delayed`; there are several other ways to use it.

In [16]:
from dask import compute
from dask import delayed

In [17]:
%%time

means = []
counts = []
for fn in filenames:
    # Read in file
    df = delayed(pd.read_csv)(fn,skiprows=2)
    
    # Get the mean CO3 for each core
    mean_CO3_each = df.CO3.mean()

    # Count how many data points in each core
    count = np.size(df['CO3'])

    # Save the intermediates
    means.append(mean_CO3_each)
    counts.append(count)

# Compute the intermediates
means, counts = compute(means, counts)

# Combine intermediates to get the overall mean CO3 and the total number of data points
mean_CO3 = np.mean(means)
n_dpoints = sum(counts)

CPU times: user 28.4 ms, sys: 1.29 ms, total: 29.6 ms
Wall time: 181 ms


In [18]:
mean_CO3

92.43110767991409
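
As one of the other ways of using `delayed`, the per-file work can be wrapped in a single function decorated with `@delayed`, so that each file becomes one task. The sketch below is an illustration under assumptions: `summarize` is a hypothetical helper, and the two files are made-up stand-ins written to a temporary directory in the same three-line-preamble layout used earlier.

```python
import tempfile
from pathlib import Path

import pandas as pd
from dask import compute, delayed


@delayed
def summarize(path):
    """Read one core file and return (mean CO3, number of rows)."""
    df = pd.read_csv(path, skiprows=2)
    return df["CO3"].mean(), len(df)


with tempfile.TemporaryDirectory() as tmp:
    # Two made-up files: name line, units line, header line, then data.
    toy_data = {"A-PH.csv": [80.0, 90.0], "B-PH.csv": [100.0, 110.0, 120.0]}
    for name, values in toy_data.items():
        lines = ["core " + name, "umol/kg", "CO3"] + [str(v) for v in values]
        (Path(tmp) / name).write_text("\n".join(lines) + "\n")

    toy_files = sorted(str(p) for p in Path(tmp).glob("*-PH.csv"))
    tasks = [summarize(fn) for fn in toy_files]  # one task per file
    results = compute(*tasks)                     # run them together

toy_means = [float(mean) for mean, _ in results]
toy_counts = [count for _, count in results]
print(toy_means, toy_counts)
```

Compared with delaying `pd.read_csv` alone, this keeps the graph smaller (one task per file) at the cost of less fine-grained visibility in the dashboard.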

### Close the client

Before moving on to the next exercise, make sure to close your client or stop this kernel.

In [19]:
client.close()

### Summary

This example shows how Pandas can process multiple tabular datasets efficiently with the help of Dask's `delayed` feature.

## Reference

https://tutorial.dask.org