## Performance Tests of Object Catalog Access
Owners: **Michael Wood-Vasey [@wmwv](https://github.com/LSSTDESC/DC2_Repo/issues/new?body=@wmwv)**  
Last Run: **2018-07-31**

Assess the performance of data manipulations of the Object catalogs using GCRCatalogs and Pandas.

### Summary
1. GCRCatalogs initializes+loads the full catalog in 330 seconds.
2. Pandas loads the full catalog in 130 seconds.
3. A trimmed file consisting only of the columns exposed through the DPDD is 10 times smaller than the full file.
4. [Init+]Load times for both GCRCatalogs and Pandas are 10-20 times faster using the trim files.
5. Once data are loaded, GCRCatalogs takes 1 second to compute the average color.  Pandas takes 2 milliseconds.
6. Analyses with the HDF5 `format='table'` are 1-2 times slower than HDF5 `format='fixed'` files when there's no caching.
7. When caching is allowed, HDF5 `format='table'` files don't get cached and are 30 times slower than HDF5 `format='fixed'` files.
8. Using the GCR iterator is 30% faster when data are cached.
9. A preliminary Dask analysis is conducted here, but a fuller development of Dask will be done in issue #237.

### Discussion
1. We should create trimmed file versions for future releases.
2. The `format='table'` does not work with GCR's caching.  This may be in the reader in GCRCatalogs, or it may be something more fundamental in how the caching is set up in GCR.
3. Pandas creates a compact in-memory representation, mapping the flags to the bools that they are.  This leads to a factor of 10 reduction in the memory footprint of the Pandas object relative to the on-disk files, such that it easily fits in memory (8 GB).
4. Dask may prove particularly useful once the Pandas data frames no longer fit in memory.  See Dask work (#237).

### Original Charge
1. Identify trivial, moderate, and worst-case use case examples.
2. Measure performance on
    1. a single patch
    2. a single tract
    3. the full dataset
3. Record data sizes of each of the above (single patch, single tract, full dataset).
4. Determine if performance considerations mean we should generate a static file that contains a restricted set of columns.
5. Look again into using the full table functionality to write HDF5 files so that they can be read by column efficiently. This was previously not possible because of an error trying to write the thousands of columns in our full coadd catalogs. This is #158.


In [None]:
##to use user's gcr-catalogs
#import sys
#sys.path.insert(0, '../../gcr-catalogs/')

In [1]:
import os

import numpy as np

## How fast is GCRCatalogs?

In [2]:
import GCRCatalogs

If you want to use the GCR reader outside of the NERSC environment, you can override the `base_dir`.

In [3]:
config = {}

trim_config = config.copy()
trim_config['filename_pattern'] = r'trim_merged_tract_\d+\.hdf5$'
table_trim_config = config.copy()
table_trim_config['filename_pattern'] = r'table_trim_merged_tract_\d+\.hdf5$'

trim_onetract_config = config.copy()
trim_onetract_config['filename_pattern'] = r'trim_merged_tract_4850\.hdf5$'
table_trim_onetract_config = config.copy()
table_trim_onetract_config['filename_pattern'] = r'table_trim_merged_tract_4850\.hdf5$'
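
As a minimal sketch of how such overrides might be assembled and checked, the helper below builds a config dict and tests which file names a pattern would select. The helper name `make_config` and the directory path are hypothetical, and the use of `re.match` (anchored at the start of the name) is an assumption about how the reader applies `filename_pattern`, not something confirmed here.

```python
import re

def make_config(base_dir=None, filename_pattern=None):
    """Build a dict of reader overrides, skipping options left as None."""
    config = {}
    if base_dir is not None:
        config['base_dir'] = base_dir
    if filename_pattern is not None:
        config['filename_pattern'] = filename_pattern
    return config

# Hypothetical local directory; replace with wherever the HDF5 files live.
local_config = make_config(
    base_dir='/path/to/object_catalog',
    filename_pattern=r'trim_merged_tract_\d+\.hdf5$',
)

# Check which file names the pattern would select.
# Assumption: the pattern is applied with re.match (anchored at the start).
names = [
    'merged_tract_4850.hdf5',
    'trim_merged_tract_4850.hdf5',
    'table_trim_merged_tract_4850.hdf5',
]
pattern = re.compile(local_config['filename_pattern'])
selected = [name for name in names if pattern.match(name)]
print(selected)
```

Under that assumption the `trim_` pattern selects only the `trim_` file, not the full or `table_trim_` files.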

### Time loading of GCRCatalogs 

Loading the GCR Catalog is, in principle, just the initialization of the catalog.  In practice the GCRCatalogs reader does need to read through all of the metadata in the HDF5 files to figure out what's in there and available.  The onetract version reads a 7.4 GB file that should fit in memory.  The full Run 1.1p is 78 GB, which does not fit in the memory of an average desktop, although it could potentially fit in the memory of various high-memory shared nodes.  This difference in size is conveniently roughly a factor of 10.  We should naively expect that operations will take 10x longer when all of the data fit in memory, and potentially 10-100x longer when they do not.  The range is estimated across a variety of usage patterns.

The trim files are 1/10 of the size of the full files due to the removal of 90% of the columns.  The `load_catalog` call doesn't load the data, but does need to open and touch each file to read the metadata.  This metadata reading step is only about 1.6 times faster for the trim files than for the full files (for example, 512 ms versus 838 ms for one tract).
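
One way to record the on-disk sizes being compared is to sum the sizes of the files that match each filename pattern. The sketch below is illustrative only: `total_size_bytes` is a hypothetical helper, and it is demonstrated on small dummy files in a temporary directory rather than on the real catalogs.

```python
import os
import re
import tempfile

def total_size_bytes(directory, filename_pattern):
    """Sum the on-disk size of files in 'directory' whose names match the pattern."""
    regex = re.compile(filename_pattern)
    total = 0
    for name in os.listdir(directory):
        path = os.path.join(directory, name)
        if os.path.isfile(path) and regex.match(name):
            total += os.path.getsize(path)
    return total

# Demonstrate on a throwaway directory with two small dummy files.
with tempfile.TemporaryDirectory() as tmp:
    for name, n_bytes in [('merged_tract_4850.hdf5', 1000),
                          ('trim_merged_tract_4850.hdf5', 100)]:
        with open(os.path.join(tmp, name), 'wb') as f:
            f.write(b'\0' * n_bytes)
    full_size = total_size_bytes(tmp, r'merged_tract_\d+\.hdf5$')
    trim_size = total_size_bytes(tmp, r'trim_merged_tract_\d+\.hdf5$')

print(full_size / trim_size)
```

Pointing the same function at the real catalog directory would give the full-to-trim size ratio directly.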

In [4]:
%%timeit
gc_onetract = GCRCatalogs.load_catalog('dc2_coadd_run1.1p_tract4850', config)

838 ms ± 34.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [5]:
%%timeit
gc_onetract_trim = GCRCatalogs.load_catalog('dc2_coadd_run1.1p_tract4850', trim_onetract_config)

512 ms ± 13 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [6]:
gc_onetract = GCRCatalogs.load_catalog('dc2_coadd_run1.1p_tract4850', config)
gc_onetract_trim = GCRCatalogs.load_catalog('dc2_coadd_run1.1p_tract4850', trim_onetract_config)
gc_onetract_table_trim = GCRCatalogs.load_catalog('dc2_coadd_run1.1p_tract4850', table_trim_onetract_config)

In [7]:
%%timeit
gc = GCRCatalogs.load_catalog('dc2_coadd_run1.1p', config)

11.4 s ± 213 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [8]:
%%timeit
gc_trim = GCRCatalogs.load_catalog('dc2_coadd_run1.1p', trim_config)

7.01 s ± 89 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [9]:
%%timeit
gc_table_trim = GCRCatalogs.load_catalog('dc2_coadd_run1.1p', table_trim_config)

12.7 s ± 168 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


Note above that loading the table version of the trim catalogs takes about as long as loading the full files.

In [10]:
# Here we actually run and save the objects for later use.
# In the %%timeit runs above, the resulting bound names and objects are discarded.
gc = GCRCatalogs.load_catalog('dc2_coadd_run1.1p', config)
gc_trim = GCRCatalogs.load_catalog('dc2_coadd_run1.1p', trim_config)
gc_table_trim = GCRCatalogs.load_catalog('dc2_coadd_run1.1p', table_trim_config)

### Time calculation using loaded GCRCatalogs objects

Summary:
  * The trim catalog files are 10-30 times faster to process than the full catalog files.
  * The 'table' format trim catalog files take 1-2 times as long to process as the trim catalog files.
  * The full catalog files take 330 seconds to compute the average.

In [11]:
def compute_mean_color_slow(catalog):
    """Compute the mean g-r color of all objects in the 'catalog'.
    
    This is a trivial performance case.
    This isn't particularly immediately interesting, but it's a simple arithmetic operation between two columns.
    """
    average_gmr = (catalog['mag_g'] - catalog['mag_r']).mean()
    return average_gmr

In [12]:
def compute_mean_color_slow_native(catalog):
    """Compute the mean g-r color of all objects in the 'catalog' using native g_mag.
    
    This is a trivial performance case.
    This isn't particularly immediately interesting, but it's a simple arithmetic operation between two columns.
    """
    average_gmr = (catalog['g_mag'] - catalog['r_mag']).mean()
    return average_gmr

In [13]:
def compute_mean_color_faster(catalog):
    """Compute the mean g-r color of all objects in the 'catalog'.
    
    This is a trivial performance case.
    This isn't particularly immediately interesting, but it's a simple arithmetic operation between two columns.
    """
    data = catalog.get_quantities(['mag_g', 'mag_r'])
    average_gmr = (data['mag_g'] - data['mag_r']).mean()
    return average_gmr

In [14]:
def compute_mean_color_faster_iter(catalog):
    """Compute the mean g-r color of all objects in the 'catalog' using iterator.
    
    This is a trivial performance case.
    This isn't particularly immediately interesting, but it's a simple arithmetic operation between two columns.
    """
    sum_gmr = count = 0
    for data in catalog.get_quantities(['mag_g', 'mag_r'], return_iterator=True):
        sum_gmr += (data['mag_g'] - data['mag_r']).sum()
        count += len(data['mag_g'])
    return sum_gmr / count

Below we clear the in-memory cache of `GCRCatalogs` by calling the `clear_cache()` method on the loaded catalog object, to reset it between performance tests.  It's harder to control the underlying caching of GPFS and the kernel filesystem memory.

The average color calculation is 18 times faster with the trim files for one tract (853 ms versus 14.9 s).  For the full set of files, the timings recorded below show about a factor of 30 (roughly 10 s versus 5min 33s).

There's no particular difference in elapsed time between the `compute_mean_color_slow`, `compute_mean_color_faster` and `compute_mean_color_faster_iter` functions.
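
The iterator variant accumulates a running sum and count, so only one chunk has to be held in memory at a time while giving the same mean as the whole-array version. The sketch below checks that equivalence on synthetic NumPy arrays, which stand in for the catalog columns; the chunking with `np.array_split` is an illustrative assumption, not how GCR divides the data.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
mag_g = rng.normal(24.0, 1.0, size=10_000)
mag_r = rng.normal(23.5, 1.0, size=10_000)

# Whole-array version, as in compute_mean_color_faster.
mean_all = (mag_g - mag_r).mean()

# Chunked version, as in compute_mean_color_faster_iter.
sum_gmr = 0.0
count = 0
for g_chunk, r_chunk in zip(np.array_split(mag_g, 7), np.array_split(mag_r, 7)):
    sum_gmr += (g_chunk - r_chunk).sum()
    count += len(g_chunk)
mean_chunked = sum_gmr / count

print(np.isclose(mean_all, mean_chunked))
```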

In [104]:
%%timeit
gc_onetract.clear_cache()
compute_mean_color_slow(gc_onetract)

14.9 s ± 1.44 s per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [98]:
%%timeit
gc_onetract.clear_cache()
compute_mean_color_slow_native(gc_onetract)

18.2 s ± 2.15 s per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [16]:
%%timeit
gc_onetract_trim.clear_cache()
compute_mean_color_slow(gc_onetract_trim)

853 ms ± 53.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [16]:
%%timeit
gc_onetract_table_trim.clear_cache()
compute_mean_color_slow(gc_onetract_table_trim)

853 ms ± 53.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [17]:
%%timeit
gc.clear_cache()
compute_mean_color_slow(gc)

5min 33s ± 5.75 s per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [18]:
%%timeit
gc_onetract.clear_cache()
compute_mean_color_faster(gc_onetract)

18.9 s ± 1.85 s per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [22]:
%%timeit
gc_trim.clear_cache()
compute_mean_color_faster(gc_trim)

10.4 s ± 409 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [47]:
%%timeit
gc_table_trim.clear_cache()
compute_mean_color_faster(gc_table_trim)

The slowest run took 5.54 times longer than the fastest. This could mean that an intermediate result is being cached.
1min ± 45 s per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [20]:
%%timeit
gc.clear_cache()
compute_mean_color_faster(gc)

5min 42s ± 26 s per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [19]:
%%timeit
gc_onetract.clear_cache()
compute_mean_color_faster_iter(gc_onetract)

15.4 s ± 1.36 s per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [23]:
%%timeit
gc_trim.clear_cache()
compute_mean_color_faster_iter(gc_trim)

11.3 s ± 1.19 s per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [48]:
%%timeit
gc_table_trim.clear_cache()
compute_mean_color_faster_iter(gc_table_trim)

26.3 s ± 2.04 s per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [21]:
%%timeit
gc.clear_cache()
compute_mean_color_faster_iter(gc)

5min 28s ± 8.33 s per loop (mean ± std. dev. of 7 runs, 1 loop each)


### Time subsequent access to other columns after caching

Next we look at the performance of GCRCatalogs after it has already read through the data once for some other columns, and then time access to a different set of columns.

While users do care about the timing of their first access, what they really care about is repeated access time.
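
The distinction between first-access and repeated-access timing can be illustrated with a generic cold-versus-warm measurement. In this sketch `slow_column_read` is a hypothetical stand-in that sleeps to mimic an expensive read and uses `functools.lru_cache` to mimic caching; it is not the GCR reader and says nothing about how GCR implements its cache.

```python
import time
from functools import lru_cache

def time_call(func, *args, repeat=3):
    """Return a list of wall-clock durations (seconds) for repeated calls."""
    durations = []
    for _ in range(repeat):
        start = time.perf_counter()
        func(*args)
        durations.append(time.perf_counter() - start)
    return durations

@lru_cache(maxsize=None)
def slow_column_read(name):
    """Stand-in for an expensive column read; NOT the GCR reader."""
    time.sleep(0.05)
    return name

durations = time_call(slow_column_read, 'mag_g')
first, later = durations[0], durations[1:]
print('first call: %.4f s, later calls: %s' % (first, ['%.6f' % d for d in later]))
```

Only the first call pays the simulated read cost; the later calls return the cached result.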

In [20]:
# Give chance to fill cache
_ = gc.get_quantities(['ra', 'dec'])

In [21]:
%%timeit
compute_mean_color_faster(gc)

781 ms ± 76 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [26]:
%%timeit
compute_mean_color_faster_iter(gc)

541 ms ± 135 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [22]:
# Give chance to fill cache
_ = gc_trim.get_quantities(['ra', 'dec'])

In [23]:
%%timeit
compute_mean_color_faster(gc_trim)

366 ms ± 42.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [27]:
%%timeit
compute_mean_color_faster_iter(gc_trim)

238 ms ± 13 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [24]:
# Give chance to fill cache
_ = gc_table_trim.get_quantities(['ra', 'dec'])

In [25]:
%%timeit
compute_mean_color_faster(gc_table_trim)

23.9 s ± 1.03 s per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [28]:
%%timeit
compute_mean_color_faster_iter(gc_table_trim)

24.3 s ± 2.27 s per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [None]:
%%prun
compute_mean_color_faster_iter(gc_table_trim)

## How fast is Pandas?

In [1]:
import os
import pandas as pd

tract = 4850

datafile_pattern = 'merged_tract_{:d}.hdf5'
datafile_pattern_trim = 'trim_' + datafile_pattern
datafile_pattern_table_trim = 'table_trim_' + datafile_pattern

datafile_basename = datafile_pattern.format(tract)
datafile_basename_trim = datafile_pattern_trim.format(tract)
datafile_basename_table_trim = datafile_pattern_table_trim.format(tract)

base_dir = '/global/projecta/projectdirs/lsst/global/in2p3/Run1.1/object_catalog'

datafile = os.path.join(base_dir, datafile_basename)
datafile_trim = os.path.join(base_dir, datafile_basename_trim)
datafile_table_trim = os.path.join(base_dir, datafile_basename_table_trim)

key_prefix = 'coadd'
nx, ny = 8, 8
patches = ['%d%d' % (i, j) for i in range(nx) for j in range(ny)]  # Note '%d%d' instead of '%d,%d'
patch = patches[0]
key = '%s_%d_%s' % (key_prefix, tract, patch)

### Time loading of catalog using Pandas

Reading just one patch of the tract.

In [3]:
%%timeit
df_onepatch = pd.read_hdf(datafile, key=key)

43.9 ms ± 834 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [4]:
%%timeit
df_onepatch_trim = pd.read_hdf(datafile_trim, key=key)

12.1 ms ± 411 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [5]:
%%timeit
df_onepatch_table_trim = pd.read_hdf(datafile_table_trim, key=key)

22.3 ms ± 504 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


Now we'll load all of the patches in the tract

In [2]:
def load_tract_into_pandas(datafile, tract, key_prefix='coadd'):
    """Load all of the patches in one tract into Pandas
    
    Returns None if no data was successfully loaded.
    """
    nx, ny = 8, 8
    patches = ['%d%d' % (i, j) for i in range(nx) for j in range(ny)]  # Note '%d%d' instead of '%d,%d'

    dfs = []
    for patch in patches:
        key = '%s_%d_%s' % (key_prefix, tract, patch)
        try:
            df = pd.read_hdf(datafile, key=key)
        except (KeyError, OSError):  # patch key missing from the file, or file unreadable
            continue
        dfs.append(df)

    if dfs:
        df = pd.concat(dfs, ignore_index=True)
    else:
        df = None

    return df
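
The key scheme used in the loader can be checked without touching any file. The sketch below wraps it in a hypothetical helper, `patch_keys`, that reproduces the same `'%s_%d_%s'` construction for an 8x8 grid of patches.

```python
def patch_keys(tract, key_prefix='coadd', nx=8, ny=8):
    """Return the HDF5 keys for every patch in an nx-by-ny tract."""
    return ['%s_%d_%d%d' % (key_prefix, tract, i, j)
            for i in range(nx) for j in range(ny)]

keys = patch_keys(4850)
print(len(keys), keys[0], keys[-1])
```

For tract 4850 this yields 64 keys, running from `coadd_4850_00` to `coadd_4850_77`.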

In [9]:
%%timeit
df_onetract = load_tract_into_pandas(datafile, tract)

54.4 s ± 1.78 s per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [8]:
%%timeit
df_onetract_trim = load_tract_into_pandas(datafile_trim, tract)

1.61 s ± 24.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [9]:
%%timeit
df_onetract_table_trim = load_tract_into_pandas(datafile_table_trim, tract)

2.27 s ± 180 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [28]:
df_onetract = load_tract_into_pandas(datafile, tract)
df_onetract_trim = load_tract_into_pandas(datafile_trim, tract)
df_onetract_table_trim = load_tract_into_pandas(datafile_table_trim, tract)

In [73]:
%%timeit
df_onetract['g_mag'] - df_onetract['r_mag']

2.11 ms ± 42.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [74]:
%%timeit
(df_onetract['g_mag'] - df_onetract['r_mag']).mean()

4.09 ms ± 97.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [75]:
%%timeit
df_onetract_trim['g_mag'] - df_onetract_trim['r_mag']

1.56 ms ± 30.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


In [76]:
%%timeit
(df_onetract_trim['g_mag'] - df_onetract_trim['r_mag']).mean()

3.38 ms ± 74.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [77]:
%%timeit
df_onetract_table_trim['g_mag'] - df_onetract_table_trim['r_mag']

332 µs ± 5.78 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


In [78]:
%%timeit
(df_onetract_table_trim['g_mag'] - df_onetract_table_trim['r_mag']).mean()

2.45 ms ± 17 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


Now load all 20 tracts (many are only partially complete)

In [3]:
fullpath_datafile_pattern = os.path.join(base_dir, datafile_pattern)
fullpath_datafile_pattern_trim = os.path.join(base_dir, datafile_pattern_trim)
fullpath_datafile_pattern_table_trim = os.path.join(base_dir, datafile_pattern_table_trim)

In [4]:
def load_all_tracts_into_pandas(
    datafile_pattern,
    tracts=(5066, 5065, 5064, 5063, 5062,
            4852, 4851, 4850, 4849, 4848,
            4640, 4639, 4638, 4637, 4636,
            4433, 4432, 4431, 4430, 4429),
    verbose=False,
    **kwargs):
    dfs = []
    for t in tracts:
        datafile = datafile_pattern.format(t)
        if verbose:
            print(datafile)
        df = load_tract_into_pandas(datafile, t, **kwargs)
        if df is not None:
            dfs.append(df)
    
    return pd.concat(dfs, ignore_index=True)

In [56]:
%%timeit
df = load_all_tracts_into_pandas(fullpath_datafile_pattern)

2min 13s ± 7.65 s per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [57]:
%%timeit
df_trim = load_all_tracts_into_pandas(fullpath_datafile_pattern_trim)

5.03 s ± 137 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [58]:
%%timeit
df_table_trim = load_all_tracts_into_pandas(fullpath_datafile_pattern_table_trim)

4.19 s ± 187 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [44]:
df = load_all_tracts_into_pandas(fullpath_datafile_pattern)

In [5]:
df_trim = load_all_tracts_into_pandas(fullpath_datafile_pattern_trim)

In [6]:
df_table_trim = load_all_tracts_into_pandas(fullpath_datafile_pattern_table_trim)

In [79]:
%%timeit
df['g_mag'] - df['r_mag']

332 µs ± 10.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


In [80]:
%%timeit
(df['g_mag'] - df['r_mag']).mean()

2.63 ms ± 50.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [7]:
%%timeit
df_trim['g_mag'] - df_trim['r_mag']

607 µs ± 15.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


In [8]:
%%timeit
(df_trim['g_mag'] - df_trim['r_mag']).mean()

2.72 ms ± 34.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [9]:
%%timeit
df_table_trim['g_mag'] - df_table_trim['r_mag']

611 µs ± 6.48 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


In [10]:
%%timeit
(df_table_trim['g_mag'] - df_table_trim['r_mag']).mean()

2.81 ms ± 74.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [None]:
%%prun
(df_table_trim['g_mag'] - df_table_trim['r_mag']).mean()

These calculation times are nearly identical for the three different catalogs.  This is consistent with what each operation logically requires: reading two columns and subtracting them.

Note that subtracting the arrays takes only hundreds of microseconds.  Aggregating the result takes several times longer (up to about 10 times), although that is still only a few milliseconds.

If we inspect the Pandas objects, we can see that one significant savings is that Pandas has explicitly loaded the flags as `bool`.  This immediately leads to a factor of 10 memory savings over the 80 GB of the HDF5 files.

In [88]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 719228 entries, 0 to 5023
Columns: 2454 entries, base_Blendedness_abs_child_xx to z_slot_Shape_yy
dtypes: bool(1177), float32(67), float64(1181), int32(21), int64(8)
memory usage: 7.4 GB


In [89]:
df_trim.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 719228 entries, 0 to 5023
Columns: 111 entries, i_base_PsfFlux_fluxSigma to z_base_SdssShape_xy
dtypes: bool(22), float32(3), float64(84), int64(2)
memory usage: 500.7 MB


In [90]:
df_table_trim.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 719228 entries, 0 to 5023
Columns: 111 entries, i_base_PsfFlux_fluxSigma to z_base_SdssShape_xy
dtypes: bool(22), float32(3), float64(84), int64(2)
memory usage: 500.7 MB


So while it takes about 130 seconds to load the full set of catalogs, the resulting Pandas object is only ~8 GB and so easily fits in memory.
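
The memory figures reported by `df.info()` can be cross-checked from the row count and the number of columns of each dtype. This is a back-of-the-envelope sketch: `frame_bytes` is a hypothetical helper, it assumes fixed-width columns plus an 8-byte index entry per row, and it assumes that the "GB" and "MB" printed by Pandas are powers of 1024.

```python
BYTES_PER_VALUE = {'bool': 1, 'float32': 4, 'float64': 8, 'int32': 4, 'int64': 8}

def frame_bytes(n_rows, dtype_counts, index_bytes_per_row=8):
    """Estimate DataFrame memory from column counts per dtype plus an int64 index."""
    bytes_per_row = sum(BYTES_PER_VALUE[dtype] * n_cols
                        for dtype, n_cols in dtype_counts.items())
    return n_rows * (bytes_per_row + index_bytes_per_row)

n_rows = 719228
full = frame_bytes(n_rows, {'bool': 1177, 'float32': 67, 'float64': 1181,
                            'int32': 21, 'int64': 8})
trim = frame_bytes(n_rows, {'bool': 22, 'float32': 3, 'float64': 84, 'int64': 2})

print(round(full / 2**30, 1))  # 7.4
print(round(trim / 2**20, 1))  # 500.7
```

Under these assumptions the estimate reproduces the 7.4 GB and 500.7 MB shown above, and each `bool` flag costs 1 byte per row against 8 bytes for a `float64` value.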

## How fast is Dask?

This section on Dask is very preliminary and will be improved upon in future work.

In [31]:
import dask as da
import dask.dataframe as dd

from dask.distributed import Client
client = Client(processes=False)

In [32]:
tract = 4850

base_dir = '/global/projecta/projectdirs/lsst/global/in2p3/Run1.1/summary'

datafile = os.path.join(base_dir, 'table_trim_merged_tract_%d.hdf5' % tract)
datafile_pattern = os.path.join(base_dir, 'table_trim_merged_tract_*.hdf5')

### Time loading of catalog using Dask

In [33]:
da_df = dd.read_hdf(datafile, key='coadd_*', mode='r')
da_df_all = dd.read_hdf(datafile_pattern, key='coadd_*', mode='r')

In [34]:
df2 = np.mean(da_df['g_mag'] - da_df['r_mag'])
df2_all = np.mean(da_df_all['g_mag'] - da_df_all['r_mag'])

### Time computation using Dask

In [35]:
%%timeit
df2.compute()

2.91 s ± 87.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [36]:
%%timeit
df2_all.compute()

1min 3s ± 1.76 s per loop (mean ± std. dev. of 7 runs, 1 loop each)


Note that the full Dask calculation takes 20 times longer than the onetract calculation for a data volume only 10 times larger.  We're potentially hitting memory limits, but there might also be some inefficiencies.

Dask takes about 1 minute to do the color average using the `table_trim_` files, while GCRCatalogs takes about 10-11 seconds using the `trim_` files.

In [37]:
import os
print(os.getenv('OMP_NUM_THREADS'))

4


### Does it matter if we're on Lustre (SCRATCH)?

No.

In [38]:
base_dir = '/global/cscratch1/sd/wmwv/DC2/Run1.1p/summary'

datafile_lustre = os.path.join(base_dir, 'table_trim_merged_tract_%d.hdf5' % tract)
datafile_pattern_lustre = os.path.join(base_dir, 'table_trim_merged_tract_*.hdf5')

In [39]:
da_df_lustre = dd.read_hdf(datafile_lustre, key='coadd_*', mode='r')
da_df_all_lustre = dd.read_hdf(datafile_pattern_lustre, key='coadd_*', mode='r')

In [40]:
df2_lustre = np.mean(da_df_lustre['g_mag'] - da_df_lustre['r_mag'])
df2_all_lustre = np.mean(da_df_all_lustre['g_mag'] - da_df_all_lustre['r_mag'])

In [41]:
%%timeit
df2_lustre.compute()

The slowest run took 5.24 times longer than the fastest. This could mean that an intermediate result is being cached.
5.43 s ± 4.82 s per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [42]:
%%timeit
df2_all_lustre.compute()

1min 28s ± 12.5 s per loop (mean ± std. dev. of 7 runs, 1 loop each)


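Times on the Lustre file system (about 5 s for one tract and 88 s for all tracts) seem broadly comparable to those on GPFS (about 3 s and 63 s), given the large run-to-run scatter of the Lustre timings.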
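
A rough way to judge whether two timings differ is to express their difference in units of the combined standard deviation. The sketch below applies this to the `%%timeit` means and standard deviations recorded above; `difference_in_sigma` is a hypothetical helper, and with only 7 runs per measurement the result is a crude indicator rather than a formal test.

```python
import math

def difference_in_sigma(mean_a, std_a, mean_b, std_b):
    """Difference of two means in units of their combined standard deviation."""
    return (mean_a - mean_b) / math.sqrt(std_a**2 + std_b**2)

# (mean, std) in seconds from the %%timeit output: Lustre first, then GPFS.
one_tract = difference_in_sigma(5.43, 4.82, 2.91, 0.0879)
all_tracts = difference_in_sigma(88.0, 12.5, 63.0, 1.76)

print(round(one_tract, 1), round(all_tracts, 1))
```

By this measure the one-tract timings differ by about half a standard deviation and the all-tract timings by about two.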

In [43]:
import dask
dask.config.set(scheduler='threads')
print(dask.config.config)

{'distributed': {'worker': {'memory': {'target': False, 'spill': False, 'pause': 0.8, 'terminate': 0.95}, 'multiprocessing-method': 'forkserver', 'use-file-locking': True, 'profile': {'interval': '10ms', 'cycle': '1000ms'}}, 'version': 2, 'scheduler': {'allowed-failures': 3, 'bandwidth': 100000000, 'default-data-size': 1000, 'transition-log-length': 100000, 'work-stealing': True, 'worker-ttl': None}, 'client': {'heartbeat': '5s'}, 'comm': {'compression': 'auto', 'default-scheme': 'tcp', 'socket-backlog': 2048, 'recent-messages-log-length': 0, 'timeouts': {'connect': '10s', 'tcp': '30s'}}, 'dashboard': {'link': 'http://{host}:{port}/status', 'export-tool': False}, 'admin': {'tick': {'interval': '20ms', 'limit': '3s'}, 'log-length': 10000, 'log-format': '%(name)s - %(levelname)s - %(message)s', 'pdb-on-err': False}}, 'jobqueue': {'slurm': {'cores': 64, 'memory': '128GB', 'processes': 2, 'queue': 'debug', 'walltime': '00:10:00', 'job-extra': ['-C haswell', '-L SCRATCH, cscratch1']}}, 'arr

In [44]:
%%timeit
df2_lustre.compute()

3.09 s ± 662 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


### Does it matter if we're on the burst buffer?