# Dask is Much Faster with Parquet Files than HDF5
Michael Wood-Vasey  
*Last Run*: 2018-10-17  

Dask is roughly 10x faster with Parquet files than with HDF5 when retrieving a few specific columns.  Computing the average g-r color across 6.9 million rows takes about 5 seconds.  This improvement makes sense given the column-store nature of Parquet.

This is a successor to the Dask+HDF5 tests in [object_catalog_performance_dask.ipynb](object_catalog_performance_dask.ipynb),
where Dask took 70-300 seconds to calculate an average g-r color.
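
The speedup implied by these numbers can be checked with a few lines of arithmetic. This is only a sketch using the rounded timings quoted in this introduction (70-300 seconds for HDF5, about 5 seconds for Parquet); the variable names are illustrative.

```python
# Timings quoted in this notebook, in seconds.
hdf5_seconds = (70.0, 300.0)   # range reported for the Dask+HDF5 tests
parquet_seconds = 5.0          # approximate time quoted for Parquet

# Ratio of HDF5 time to Parquet time at each end of the HDF5 range.
speedups = [t / parquet_seconds for t in hdf5_seconds]
print(f"Parquet speedup over HDF5: {speedups[0]:.0f}x to {speedups[1]:.0f}x")
```

With these rounded inputs the ratio spans 14x to 60x, so "10x" is a conservative summary.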

We specifically look at three different scheduling options:

1. `threads`
2. `synchronous` (simple serial)
3. `dask.distributed`

For this relatively simple aggregation, the `synchronous` scheduler, a simple serial approach, was the fastest of the single-machine schedulers on the hive-formatted file (5.5 s versus 6.5 s for `threads`).  This suggests that reading dominates the run time and that multithreading the read does not improve performance.

### Future Work
1. Explore more complicated analyses, such as binning to make a color-color or color-magnitude density.  Such a problem might distribute more work across nodes and benefit from `dask.distributed`.
2. These tests were run against both a 'hive'-formatted Parquet store and a 'simple' single-file Parquet store.

### Logistics

This notebook was developed and run in the NERSC JupyterLab environment.
I specified `pyarrow` as the Parquet engine because that was the installation I could get working on NERSC with the snappy compression library.

In [1]:
import os

from bokeh.io import output_notebook

import dask
import dask.dataframe as dd
from dask.distributed import Client, LocalCluster, progress

import numpy as np

In [2]:
# Load Bokeh into the Notebook
output_notebook()

In [3]:
tract = 4850

data_dir = '/global/projecta/projectdirs/lsst/global/in2p3/Run1.1/summary'

datafile_tract = os.path.join(data_dir, 'dpdd_object_tract_%d.parquet' % tract)
datafile_all_hive = os.path.join(data_dir, 'dpdd_object.parquet')
datafile_all_simple = os.path.join(data_dir, 'dpdd_object_simple.parquet')

In [4]:
# Specify the columns we need.  This allows for significant performance advantages
# particularly when reading a column-based storage format such as Parquet.
columns_to_read = ['mag_g', 'mag_r']
parquet_engine = 'pyarrow'
read_parquet_kws = {'columns': columns_to_read, 'engine': parquet_engine}

da_df_tract = dd.read_parquet(datafile_tract, **read_parquet_kws)
da_df = dd.read_parquet(datafile_all_hive, **read_parquet_kws)
da_df_simple = dd.read_parquet(datafile_all_simple, **read_parquet_kws)
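
As a rough back-of-the-envelope illustration of why selecting columns matters, the sketch below compares how many uncompressed bytes a reader must touch when it needs only two columns. The 6.9 million rows come from this notebook; the total column count (`N_COLUMNS_TOTAL = 100`) and the 8-byte value width are hypothetical placeholders, not properties of the actual catalog, and Parquet compression and metadata are ignored.

```python
N_ROWS = 6_900_000        # row count quoted in this notebook
N_COLUMNS_TOTAL = 100     # hypothetical; the real catalog's column count is not given here
N_COLUMNS_NEEDED = 2      # mag_g and mag_r
BYTES_PER_VALUE = 8       # assumes uncompressed 64-bit floats

def bytes_touched(n_rows, n_columns, bytes_per_value=BYTES_PER_VALUE):
    """Uncompressed size of n_columns columns of n_rows values each."""
    return n_rows * n_columns * bytes_per_value

# A row-oriented layout interleaves all columns, so whole rows are scanned.
row_oriented = bytes_touched(N_ROWS, N_COLUMNS_TOTAL)
# A column-oriented layout lets the reader fetch only the selected columns.
column_oriented = bytes_touched(N_ROWS, N_COLUMNS_NEEDED)

print(f"row-oriented:    {row_oriented / 1e6:.0f} MB")
print(f"column-oriented: {column_oriented / 1e6:.0f} MB")
print(f"ratio:           {row_oriented / column_oriented:.0f}x")
```

Under these assumptions the ratio is simply the total column count divided by the number of columns requested; the real gain depends on the actual schema, compression, and file layout.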

Specify the computation to be done.

In [5]:
df_tract = np.mean(da_df_tract['mag_g'] - da_df_tract['mag_r'])
df = np.mean(da_df['mag_g'] - da_df['mag_r'])
df_simple = np.mean(da_df_simple['mag_g'] - da_df_simple['mag_r'])

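Time the actual computation:

The single tract is about one-twentieth the size of the full catalog.  In the timings below it takes about 0.2 s, compared with 6.8 s for the full hive-formatted file and 0.7 s for the full simple file.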

In [6]:
%timeit df_tract.compute()

198 ms ± 7.77 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [7]:
%timeit df.compute()

6.79 s ± 132 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [8]:
%timeit df_simple.compute()

683 ms ± 59.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


Use the `threads` scheduler, which is Dask's default for dataframes.

In [9]:
with dask.config.set(scheduler='threads'):
    df = np.mean(da_df['mag_g'] - da_df['mag_r'])
    %timeit df.compute()

6.51 s ± 90 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [10]:
with dask.config.set(scheduler='threads'):
    df_simple = np.mean(da_df_simple['mag_g'] - da_df_simple['mag_r'])
    %timeit df_simple.compute()

537 ms ± 20.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


Use the `synchronous` scheduler from Dask.

In [11]:
with dask.config.set(scheduler='synchronous'):
    df = np.mean(da_df['mag_g'] - da_df['mag_r'])
    %timeit df.compute()

5.51 s ± 123 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [12]:
with dask.config.set(scheduler='synchronous'):
    df_simple = np.mean(da_df_simple['mag_g'] - da_df_simple['mag_r'])
    %timeit df_simple.compute()

563 ms ± 37.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


## Can I Explicitly Set Up a Local Dask Cluster to Go Faster?

Not with a single worker, but yes with several.

The single-machine `threads` and `synchronous` schedulers run in 5-7 seconds.

A `LocalCluster` with `n_workers=1` takes about 10 seconds.
As we increase the number of workers, we approach and then surpass the performance of the `threads` and `synchronous` schedulers, hitting a floor of about 2.2 seconds at 8 workers.

In [13]:
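# Try local clusters of increasing size; each worker runs 2 threads.
for n in (1, 4, 8, 16):
    cluster = LocalCluster(n_workers=n, threads_per_worker=2)
    # Creating a Client makes this cluster the default scheduler for compute().
    client = Client(cluster)
    display(client)

    df = np.mean(da_df['mag_g'] - da_df['mag_r'])
    %timeit _ = df.compute()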

Client: Scheduler: tcp://127.0.0.1:45634, Dashboard: http://127.0.0.1:8787/status
Cluster: Workers: 1, Cores: 2, Memory: 16.91 GB


10.3 s ± 167 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


Client: Scheduler: tcp://127.0.0.1:38401, Dashboard: http://127.0.0.1:39105/status
Cluster: Workers: 4, Cores: 8, Memory: 67.62 GB


3.52 s ± 74 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


Client: Scheduler: tcp://127.0.0.1:41381, Dashboard: http://127.0.0.1:45932/status
Cluster: Workers: 8, Cores: 16, Memory: 135.24 GB


2.24 s ± 55.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


Client: Scheduler: tcp://127.0.0.1:42253, Dashboard: http://127.0.0.1:38587/status
Cluster: Workers: 16, Cores: 32, Memory: 270.49 GB


2.56 s ± 113 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
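
To make the scaling easier to read, the sketch below turns the mean `%timeit` values reported in this notebook into speedup factors, relative both to the `synchronous` scheduler (5.51 s) and to the single-worker cluster. It only restates numbers already shown here; the variable names are illustrative.

```python
# Mean %timeit results from this notebook, in seconds.
synchronous_seconds = 5.51
cluster_seconds = {1: 10.3, 4: 3.52, 8: 2.24, 16: 2.56}  # n_workers -> time

for n_workers, seconds in cluster_seconds.items():
    vs_sync = synchronous_seconds / seconds      # >1 means faster than synchronous
    vs_one = cluster_seconds[1] / seconds        # scaling relative to 1 worker
    print(f"{n_workers:2d} workers: {seconds:5.2f} s, "
          f"{vs_sync:.2f}x vs synchronous, {vs_one:.2f}x vs 1 worker")

# Worker count with the shortest run time among those tested.
best_n = min(cluster_seconds, key=cluster_seconds.get)
print(f"fastest configuration tested: {best_n} workers")
```

The single-worker cluster is slower than the `synchronous` scheduler, while going from 8 to 16 workers gives no further gain, which is consistent with the floor noted above.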
