## Performance Tests of Object Catalog Access
Owners: **Michael Wood-Vasey [@wmwv](https://github.com/LSSTDESC/DC2_Repo/issues/new?body=@wmwv)**  
Last Run: **2018-08-01**

Assess the performance of data manipulations of the Object catalogs using GCRCatalogs and Pandas.

Many thanks to Yao-Yuan Mao [@yymao] for key feedback and GCRCatalogs improvements.

### Summary

This exploratory exercise compared the performance of different data storage formats for the Object catalogs and led to some improvements in the performance of reading them.

1. GCRCatalogs initializes+loads and Pandas loads the full Run 1.1 object catalog in the same amount of time: ~300 seconds.
2. A set of trimmed files consisting only of the columns exposed through the DPDD is 10 times smaller than the full files.
3. [Init+]Load times for both GCRCatalogs and Pandas are 20 times faster using these trim files.
4. Once data are loaded/cached, the difference between using the GCRCatalogs and Pandas interfaces is mostly the difference between `numpy` (which the code written below uses with GCRCatalogs) and `numexpr` (which Pandas uses internally).  This difference results in Pandas being about 2x faster.
5. Using the GCR iterator to chunk through the data is 30-100% faster when data are cached than asking for all of a column without using the iterator.  Without caching, there's no significant difference.
6. Using the HDF5 `format='table'` format instead of `format='fixed'` is 50-80% slower when loading the data.
7. Explore using the Cori Burst Buffer.

### Conclusions
1. We should create trimmed-file object catalog versions for future releases.

### Further work 
1. Should the flat files store NaNs in the columns for missing filters to make the data model identical across all tracts+patches?
2. Try using GCR through numexpr to fully match what Pandas is doing?
3. Provide the GCRCatalogs reader the schema up front so it doesn't have to open all of the files on initialization:
[gcr-catalogs/#169](https://github.com/LSSTDESC/gcr-catalogs/issues/169)  
[DC2-production/#167](https://github.com/LSSTDESC/DC2-production/issues/243)
4. Dask work here is preliminary and should continue.  The full Pandas data frame tests fail here on Cori when other users are on the Jupyter dev node because they exceed the available memory.  Dask may prove particularly useful in these situations where the Pandas data frames no longer fit in memory.  See Dask work (#237).
5. Measure the performance of batch jobs that get an entire node or multiple nodes.  The testing reported here was done on the Jupyter-dev execution node, which has some contention for resources.
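
As an illustration of the NaN-filling option raised in item 1, the sketch below pads a patch that lacks one filter so that every patch shares the same columns. The column names (`objectId`, `mag_u`), the toy values, and the `schema` list are hypothetical and are not the actual catalog data model.

```python
import numpy as np
import pandas as pd

# Two toy patches; the second has no u-band coverage.
patch_a = pd.DataFrame({'objectId': [1, 2],
                        'mag_u': [24.1, 23.7],
                        'mag_g': [23.5, 23.0]})
patch_b = pd.DataFrame({'objectId': [3], 'mag_g': [22.8]})

# Hypothetical common schema shared by every tract+patch.
schema = ['objectId', 'mag_u', 'mag_g']

# reindex adds any column missing from a patch and fills it with NaN,
# so all patches end up with an identical set of columns.
uniform = [patch.reindex(columns=schema) for patch in (patch_a, patch_b)]
catalog = pd.concat(uniform, ignore_index=True)
print(catalog)
```

With identical columns in every patch, readers would not need to special-case patches that lack a filter, at the cost of storing the extra NaN values.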

### Original Charge
1. Identify trivial, moderate, and worst-case use case examples.
2. Measure performance on
    1. single patch
    2. a single tract
    3. the full dataset
3. Record data sizes of each of the above (single patch, single tract, full dataset).
4. Determine if performance considerations mean we should generate a static file that contains a restricted set of columns.
5. Look into again using full tables functionality to write HDF5 files so that they can be read by column efficiently. This was previously not possible because of an error trying to write the thousands of columns in our full coadd catalogs. This is #158

### Logistics:

1. This notebook was run through the JupyterHub NERSC interface available here: https://jupyter-dev.nersc.gov. To set up your NERSC environment, please follow the instructions available here: https://confluence.slac.stanford.edu/display/LSSTDESC/Using+Jupyter-dev+at+NERSC
2. The full notebook takes several hours to run.  Thus we save the outputs here to document the performance.
3. The full Pandas load tends to get killed on the Cori Jupyter node due to memory usage.


In [1]:
## To use user's gcr-catalogs, uncomment the following
## These were useful in testing improvements to GCRCatalogs in the performance testing here.
#import sys
#sys.path.insert(0, '../../gcr-catalogs/')

In [2]:
import os

import numpy as np

## How fast is GCRCatalogs?

In [3]:
import GCRCatalogs

If you want to use the GCR reader outside of the NERSC environment, you can override the `base_dir`.

In [4]:
config = {}

trim_config = config.copy()
trim_config['filename_pattern'] = r'trim_merged_tract_\d+\.hdf5$'
table_trim_config = config.copy()
table_trim_config['filename_pattern'] = r'table_trim_merged_tract_\d+\.hdf5$'

trim_onetract_config = config.copy()
trim_onetract_config['filename_pattern'] = r'trim_merged_tract_4850\.hdf5$'
table_trim_onetract_config = config.copy()
table_trim_onetract_config['filename_pattern'] = r'table_trim_merged_tract_4850\.hdf5$'
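
As a quick sanity check on the `filename_pattern` values above, the following sketch shows which file names each regular expression selects. The file names follow the naming scheme used later in this notebook. The use of `re.match` (anchored at the start of the name) is an assumption about how the reader applies the pattern, and the `select_files` helper and the `base_dir` path are hypothetical.

```python
import re

# File names following the naming scheme used in this notebook.
filenames = [
    'merged_tract_4850.hdf5',
    'trim_merged_tract_4850.hdf5',
    'table_trim_merged_tract_4850.hdf5',
    'trim_merged_tract_4851.hdf5',
]


def select_files(pattern, names):
    """Return the names matched by `pattern` (hypothetical helper)."""
    regex = re.compile(pattern)
    # Assumption: the pattern is anchored at the start of the file name.
    return [name for name in names if regex.match(name)]


# A config that points the reader at a local copy of the data.
# The path is a placeholder, not a real location.
local_config = {'base_dir': '/path/to/local/object_catalog'}
local_trim_config = local_config.copy()
local_trim_config['filename_pattern'] = r'trim_merged_tract_\d+\.hdf5$'

trim_files = select_files(local_trim_config['filename_pattern'], filenames)
onetract_files = select_files(r'table_trim_merged_tract_4850\.hdf5$', filenames)
print(trim_files)
print(onetract_files)
```

Under this assumption the `trim_` pattern does not pick up the `table_trim_` files, because the match starts at the beginning of the file name.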

### Time loading of GCRCatalogs 

Loading the GCR Catalog is, in principle, just the initialization of the catalog.  In practice the GCRCatalogs reader does need to read through all of the metadata in the HDF5 files to figure out what's in there and available (see Further work above).  The onetract version reads a 7.4 GB file that should fit in memory.  The full Run 1.1p is 78 GB, which does not fit in the average desktop memory.  This size could potentially fit in the memory of various high-memory shared nodes, but is too large to fit in the available memory of the shared Jupyter execution node on NERSC.  This difference in size is conveniently roughly a factor of 10.  We should naively expect that operations will take 10x longer when all of the data fit in memory, and potentially 10-100x longer when the data do not.

The trim files are 1/10 of the size of the full files due to the removal of 90% of the columns.  The `load_catalog` doesn't load the data, but does need to open and touch each file to read the metadata.  This metadata reading step is only about 2 times faster for the trim files than for the full files.

In [5]:
%%timeit
gc_onetract = GCRCatalogs.load_catalog('dc2_coadd_run1.1p_tract4850', config)

Exception ignored in: <object repr() failed>
Traceback (most recent call last):
  File "/global/common/software/lsst/common/miniconda/current/lib/python3.6/site-packages/tables/group.py", line 273, in __del__
    self._v_pathname in self._v_file._node_manager.registry and
AttributeError: 'File' object has no attribute '_node_manager'

1.34 s ± 17.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


Note that the above generates a lot of exceptions in the file reading the first time.  These are fine for the integrity of the data and the reading but do suggest that we should clean something up about the reading.

In [6]:
%%timeit
gc_onetract_trim = GCRCatalogs.load_catalog('dc2_coadd_run1.1p_tract4850', trim_onetract_config)

899 ms ± 13 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [7]:
gc_onetract = GCRCatalogs.load_catalog('dc2_coadd_run1.1p_tract4850', config)
gc_onetract_trim = GCRCatalogs.load_catalog('dc2_coadd_run1.1p_tract4850', trim_onetract_config)
gc_onetract_table_trim = GCRCatalogs.load_catalog('dc2_coadd_run1.1p_tract4850', table_trim_onetract_config)

In [8]:
%%timeit
gc = GCRCatalogs.load_catalog('dc2_coadd_run1.1p', config)

14.2 s ± 138 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [9]:
%%timeit
gc_trim = GCRCatalogs.load_catalog('dc2_coadd_run1.1p', trim_config)

8.75 s ± 190 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [10]:
%%timeit
gc_table_trim = GCRCatalogs.load_catalog('dc2_coadd_run1.1p', table_trim_config)

16.2 s ± 520 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


Note above that the table version of the trim catalogs takes slightly **more** time than reading the full file.

In [11]:
# Here we actually run and save the objects for later use.
# In the %%timeit runs above, the resulting bound names and objects are discarded.
gc = GCRCatalogs.load_catalog('dc2_coadd_run1.1p', config)
gc_trim = GCRCatalogs.load_catalog('dc2_coadd_run1.1p', trim_config)
gc_table_trim = GCRCatalogs.load_catalog('dc2_coadd_run1.1p', table_trim_config)

### Time calculation using loaded GCRCatalogs objects

Summary:
  * The trim catalog files are 10-30 times faster to process than the full catalog files.
  * The 'table' format trim catalog files take 1-2 times as long to process as the trim catalog files.
  * The full catalog files take 330 seconds to compute the average.

In [12]:
def compute_mean_color_slow(catalog):
    """Compute the mean g-r color of all objects in the 'catalog'.
    
    This is a trivial performance case.
    This isn't particularly immediately interesting, but it's a simple arithmetic operation between two columns.
    """
    average_gmr = (catalog['mag_g'] - catalog['mag_r']).mean()
    return average_gmr

In [13]:
def compute_mean_color_slow_native(catalog):
    """Compute the mean g-r color of all objects in the 'catalog' using native g_mag.
    
    This is a trivial performance case.
    This isn't particularly immediately interesting, but it's a simple arithmetic operation between two columns.
    """
    average_gmr = (catalog['g_mag'] - catalog['r_mag']).mean()
    return average_gmr

In [14]:
def compute_mean_color_faster(catalog):
    """Compute the mean g-r color of all objects in the 'catalog'.
    
    This is a trivial performance case.
    This isn't particularly immediately interesting, but it's a simple arithmetic operation between two columns.
    """
    data = catalog.get_quantities(['mag_g', 'mag_r'])
    average_gmr = (data['mag_g'] - data['mag_r']).mean()
    return average_gmr

In [15]:
def compute_mean_color_faster_iter(catalog):
    """Compute the mean g-r color of all objects in the 'catalog' using iterator.
    
    This is a trivial performance case.
    This isn't particularly immediately interesting, but it's a simple arithmetic operation between two columns.
    """
    sum_gmr = count = 0
    for data in catalog.get_quantities(['mag_g', 'mag_r'], return_iterator=True):
        sum_gmr += (data['mag_g'] - data['mag_r']).sum()
        count += len(data['mag_g'])
    return sum_gmr / count
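
The iterator version accumulates a running sum and count, so it should give the same mean as the all-at-once version without ever holding a full column in memory. The sketch below checks this on synthetic data; the `iter_chunks` helper and the random magnitudes are stand-ins for illustration and are not part of GCRCatalogs.

```python
import numpy as np

# Synthetic magnitudes standing in for two catalog columns.
rng = np.random.default_rng(0)
mag_g = rng.normal(24.0, 1.0, size=100_000)
mag_r = rng.normal(23.5, 1.0, size=100_000)
columns = {'mag_g': mag_g, 'mag_r': mag_r}


def iter_chunks(columns, chunk_size):
    """Yield dicts of column slices, mimicking an iterator over files."""
    n_rows = len(next(iter(columns.values())))
    for start in range(0, n_rows, chunk_size):
        yield {name: col[start:start + chunk_size]
               for name, col in columns.items()}


# Accumulate the sum and count chunk by chunk.
sum_gmr = 0.0
count = 0
for data in iter_chunks(columns, chunk_size=7_919):
    sum_gmr += (data['mag_g'] - data['mag_r']).sum()
    count += len(data['mag_g'])
chunked_mean = sum_gmr / count

# Compute the same quantity from the full columns at once.
full_mean = (mag_g - mag_r).mean()
print(chunked_mean, full_mean)
```

The two results agree up to floating-point rounding, since the order of summation differs between the two approaches.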

Below we clear the in-memory cache of `GCRCatalogs` by calling the `clear_cache()` method on the loaded catalog object, to reset for each performance test.  It is harder to control the underlying caching by GPFS and the kernel filesystem, and we do not attempt that in this notebook.

In [16]:
%%timeit
gc_onetract.clear_cache()
compute_mean_color_slow(gc_onetract)

22.5 s ± 3.97 s per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [17]:
%%timeit
gc_onetract.clear_cache()
compute_mean_color_slow_native(gc_onetract)

17.2 s ± 1.85 s per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [18]:
%%timeit
gc_onetract_trim.clear_cache()
compute_mean_color_slow(gc_onetract_trim)

1.12 s ± 25.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [19]:
%%timeit
gc_onetract_table_trim.clear_cache()
compute_mean_color_slow(gc_onetract_table_trim)

1.39 s ± 131 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [20]:
%%timeit
gc.clear_cache()
compute_mean_color_slow(gc)

5min 33s ± 20.7 s per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [21]:
%%timeit
gc_onetract.clear_cache()
compute_mean_color_faster(gc_onetract)

18.2 s ± 1.7 s per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [22]:
%%timeit
gc_trim.clear_cache()
compute_mean_color_faster(gc_trim)

11.3 s ± 329 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [23]:
%%timeit
gc_table_trim.clear_cache()
compute_mean_color_faster(gc_table_trim)

20.8 s ± 3.31 s per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [24]:
%%timeit
gc.clear_cache()
compute_mean_color_faster(gc)

5min 40s ± 3.42 s per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [25]:
%%timeit
gc_onetract.clear_cache()
compute_mean_color_faster_iter(gc_onetract)

18.6 s ± 1.3 s per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [26]:
%%timeit
gc_trim.clear_cache()
compute_mean_color_faster_iter(gc_trim)

11.4 s ± 364 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [27]:
%%timeit
gc_table_trim.clear_cache()
compute_mean_color_faster_iter(gc_table_trim)

19 s ± 3.1 s per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [28]:
%%timeit
gc.clear_cache()
compute_mean_color_faster_iter(gc)

5min 12s ± 32.2 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
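
The ratios quoted in the summary at the top of this subsection can be recomputed from the uncached timings recorded above. The sketch below copies a subset of the `%%timeit` means by hand; the dictionary keys are labels chosen here for illustration. The timings come from a shared node, so the ratios are only indicative.

```python
# Mean wall-clock times in seconds, copied from the %%timeit outputs above.
timings = {
    ('onetract_full', 'slow'): 22.5,
    ('onetract_trim', 'slow'): 1.12,
    ('onetract_table_trim', 'slow'): 1.39,
    ('all_full', 'faster'): 5 * 60 + 40,
    ('all_trim', 'faster'): 11.3,
    ('all_table_trim', 'faster'): 20.8,
}


def ratio(numerator, denominator):
    """Return how many times longer `numerator` took than `denominator`."""
    return timings[numerator] / timings[denominator]


# Full files versus trim files.
full_vs_trim_onetract = ratio(('onetract_full', 'slow'), ('onetract_trim', 'slow'))
full_vs_trim_all = ratio(('all_full', 'faster'), ('all_trim', 'faster'))

# 'table' format trim files versus 'fixed' format trim files.
table_vs_trim_onetract = ratio(('onetract_table_trim', 'slow'), ('onetract_trim', 'slow'))
table_vs_trim_all = ratio(('all_table_trim', 'faster'), ('all_trim', 'faster'))

print(f'full/trim:  one tract {full_vs_trim_onetract:.1f}x, all tracts {full_vs_trim_all:.1f}x')
print(f'table/trim: one tract {table_vs_trim_onetract:.2f}x, all tracts {table_vs_trim_all:.2f}x')
```
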


### Time subsequent access to other columns after caching

Next we will look at the performance of GCRCatalogs after having read through the data once for some other column, and then computing the performance on reading another column.

While the user does care about the first experience timing (the tests above), they really care about repeated access time which is what we will proceed to test here.
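
To make the distinction between first access and repeated access concrete, here is a toy model of a reader that keeps a file's contents in memory after the first read. It is only a sketch of the general idea: the `ToyCachedReader` class, its simulated delay, and its data are invented for illustration and do not describe how GCRCatalogs implements its cache.

```python
import time


class ToyCachedReader:
    """Toy reader that caches a whole table after the first access."""

    def __init__(self, read_delay=0.05):
        self._table = None          # cached file contents, if any
        self._read_delay = read_delay
        self.disk_reads = 0

    def _load_table(self):
        time.sleep(self._read_delay)  # pretend this is slow file I/O
        self.disk_reads += 1
        values = list(range(1000))
        return {name: values for name in ('ra', 'dec', 'mag_g', 'mag_r')}

    def get(self, column):
        if self._table is None:
            self._table = self._load_table()
        return self._table[column]

    def clear_cache(self):
        self._table = None


def timed(func, *args):
    """Return the wall-clock time taken by func(*args)."""
    start = time.perf_counter()
    func(*args)
    return time.perf_counter() - start


reader = ToyCachedReader()
warm_up = timed(reader.get, 'ra')     # pays the simulated I/O cost
cached = timed(reader.get, 'mag_g')   # another column, served from memory
reader.clear_cache()
cold = timed(reader.get, 'mag_g')     # pays the I/O cost again
print(f'warm_up={warm_up:.3f}s cached={cached:.6f}s cold={cold:.3f}s')
```

In this toy model, reading one column first makes later reads of other columns fast, which is the pattern the cells below test by reading `ra` and `dec` before timing the color calculation.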

In [29]:
# Give chance to fill cache
_ = gc.get_quantities(['ra', 'dec'])

In [30]:
%%timeit
compute_mean_color_faster(gc)

609 ms ± 129 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [31]:
%%timeit
compute_mean_color_faster_iter(gc)

409 ms ± 113 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [32]:
# Give chance to fill cache
_ = gc_trim.get_quantities(['ra', 'dec'])

In [33]:
%%timeit
compute_mean_color_faster(gc_trim)

348 ms ± 3.18 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [34]:
%%timeit
compute_mean_color_faster_iter(gc_trim)

287 ms ± 21.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [35]:
# Give chance to fill cache
_ = gc_table_trim.get_quantities(['ra', 'dec'])

In [36]:
%%timeit
compute_mean_color_faster(gc_table_trim)

183 ms ± 3.56 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [37]:
%%timeit
compute_mean_color_faster_iter(gc_table_trim)

89.9 ms ± 14 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [38]:
%%prun
compute_mean_color_faster_iter(gc_table_trim)

 

## How fast is Pandas?

In [2]:
import os
import pandas as pd

tract = 4850

datafile_pattern = 'merged_tract_{:d}.hdf5'
datafile_pattern_trim = 'trim_' + datafile_pattern
datafile_pattern_table_trim = 'table_trim_' + datafile_pattern

datafile_basename = datafile_pattern.format(tract)
datafile_basename_trim = datafile_pattern_trim.format(tract)
datafile_basename_table_trim = datafile_pattern_table_trim.format(tract)

base_dir = '/global/projecta/projectdirs/lsst/global/in2p3/Run1.1/object_catalog'

datafile = os.path.join(base_dir, datafile_basename)
datafile_trim = os.path.join(base_dir, datafile_basename_trim)
datafile_table_trim = os.path.join(base_dir, datafile_basename_table_trim)

key_prefix = 'coadd'
nx, ny = 8, 8
patches = ['%d%d' % (i, j) for i in range(nx) for j in range(ny)]  # Note '%d%d' instead of '%d,%d'
patch = patches[0]
key = '%s_%d_%s' % (key_prefix, tract, patch)

### Time loading of catalog using Pandas

Reading just one patch of the tract.

In [2]:
%%timeit
df_onepatch = pd.read_hdf(datafile, key=key)

50.4 ms ± 856 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [3]:
%%timeit
df_onepatch_trim = pd.read_hdf(datafile_trim, key=key)

13.6 ms ± 796 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [4]:
%%timeit
df_onepatch_table_trim = pd.read_hdf(datafile_table_trim, key=key)

25.3 ms ± 1.38 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


Now we'll load all of the patches in the tract

In [3]:
def load_tract_into_pandas(datafile, tract, key_prefix='coadd'):
    """Load all of the patches in one tract into Pandas
    
    Returns None if no data was successfully loaded.
    """
    nx, ny = 8, 8
    patches = ['%d%d' % (i, j) for i in range(nx) for j in range(ny)]  # Note '%d%d' instead of '%d,%d'

    dfs = []
    for patch in patches:
        key = '%s_%d_%s' % (key_prefix, tract, patch)
        try:
            df = pd.read_hdf(datafile, key=key)
        except (KeyError, OSError):
            # Skip patches that are missing from the file (or files that cannot be opened).
            continue
        dfs.append(df)

    if dfs:
        df = pd.concat(dfs, ignore_index=True)
    else:
        df = None

    return df

In [6]:
%%timeit
df_onetract = load_tract_into_pandas(datafile, tract)

52.1 s ± 3.67 s per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [7]:
%%timeit
df_onetract_trim = load_tract_into_pandas(datafile_trim, tract)

1.74 s ± 112 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [8]:
%%timeit
df_onetract_table_trim = load_tract_into_pandas(datafile_table_trim, tract)

2.6 s ± 136 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [6]:
df_onetract = load_tract_into_pandas(datafile, tract)
df_onetract_trim = load_tract_into_pandas(datafile_trim, tract)
df_onetract_table_trim = load_tract_into_pandas(datafile_table_trim, tract)

In [7]:
%%timeit
df_onetract['g_mag'] - df_onetract['r_mag']

3.63 ms ± 44.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [8]:
%%timeit
(df_onetract['g_mag'] - df_onetract['r_mag']).mean()

6.14 ms ± 116 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [9]:
%%timeit
df_onetract_trim['g_mag'] - df_onetract_trim['r_mag']

2.7 ms ± 185 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [10]:
%%timeit
(df_onetract_trim['g_mag'] - df_onetract_trim['r_mag']).mean()

5.02 ms ± 55.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [11]:
%%timeit
df_onetract_table_trim['g_mag'] - df_onetract_table_trim['r_mag']

750 µs ± 42.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


In [12]:
%%timeit
(df_onetract_table_trim['g_mag'] - df_onetract_table_trim['r_mag']).mean()

3.27 ms ± 142 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


Now load all 20 tracts (many are only partially complete)

In [4]:
fullpath_datafile_pattern = os.path.join(base_dir, datafile_pattern)
fullpath_datafile_pattern_trim = os.path.join(base_dir, datafile_pattern_trim)
fullpath_datafile_pattern_table_trim = os.path.join(base_dir, datafile_pattern_table_trim)

In [5]:
def load_all_tracts_into_pandas(
    datafile_pattern,
    tracts=(5066, 5065, 5064, 5063, 5062,
            4852, 4851, 4850, 4849, 4848,
            4640, 4639, 4638, 4637, 4636,
            4433, 4432, 4431, 4430, 4429),
    verbose=False,
    **kwargs):
    dfs = []
    for t in tracts:
        datafile = datafile_pattern.format(t)
        if verbose:
            print(datafile)
        df = load_tract_into_pandas(datafile, t, **kwargs)
        if df is not None:
            dfs.append(df)
    
    return pd.concat(dfs, ignore_index=True)

Actually loading the full set into Pandas tends to get killed for exceeding memory when there's high memory pressure on cori19.nersc.gov (the current node running Jupyter).  
Thus the following cell tends to never run:

In [None]:
%%timeit -n 1 -r 1
df = load_all_tracts_into_pandas(fullpath_datafile_pattern)

In [16]:
%%timeit -n 1 -r 1
df_trim = load_all_tracts_into_pandas(fullpath_datafile_pattern_trim)

1min 39s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)


In [15]:
%%timeit -n 1
df_table_trim = load_all_tracts_into_pandas(fullpath_datafile_pattern_table_trim)

1min 26s ± 8.62 s per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [None]:
df = load_all_tracts_into_pandas(fullpath_datafile_pattern)

In [6]:
df_trim = load_all_tracts_into_pandas(fullpath_datafile_pattern_trim)

In [7]:
df_table_trim = load_all_tracts_into_pandas(fullpath_datafile_pattern_table_trim)

In [None]:
%%timeit
df['g_mag'] - df['r_mag']

In [None]:
%%timeit
(df['g_mag'] - df['r_mag']).mean()

In [8]:
%%timeit
df_trim['g_mag'] - df_trim['r_mag']

9.36 ms ± 317 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [9]:
%%timeit
(df_trim['g_mag'] - df_trim['r_mag']).mean()

36.1 ms ± 796 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [10]:
%%timeit
df_table_trim['g_mag'] - df_table_trim['r_mag']

8.45 ms ± 166 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [11]:
%%timeit
(df_table_trim['g_mag'] - df_table_trim['r_mag']).mean()

35.5 ms ± 310 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [12]:
%%prun
(df_table_trim['g_mag'] - df_table_trim['r_mag']).mean()

 

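These calculation times are essentially the same for the catalogs that could be loaded here (the `trim_` and `table_trim_` versions; the cells for the full catalog were not run).  This is consistent with the logical need; each operation is reading two columns and subtracting them.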

Note that subtracting the arrays takes only ten milliseconds.  Aggregating the result takes 3-4 times longer, although we're still only at tens of milliseconds.

In [None]:
df.info()

In [13]:
df_trim.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6892380 entries, 0 to 6892379
Columns: 111 entries, base_Blendedness_abs_flux to z_modelfit_CModel_fluxSigma
dtypes: bool(10), float32(3), float64(84), int64(2), object(12)
memory usage: 5.2+ GB


In [14]:
df_table_trim.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6892380 entries, 0 to 6892379
Columns: 111 entries, base_Blendedness_abs_flux to z_modelfit_CModel_fluxSigma
dtypes: bool(10), float32(3), float64(84), int64(2), object(12)
memory usage: 5.2+ GB
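
The reported memory usage can be cross-checked from the dtype counts in the `info()` output. The sketch below assumes that each `object` column is counted as an 8-byte pointer per row and that the total is reported in units of 1024^3 bytes.

```python
n_rows = 6892380

# (number of columns, bytes per value) for each dtype in the info() output.
# Assumption: object columns are counted as 8-byte pointers only; the
# strings they point to are not included, hence the "+" in the report.
dtype_counts = {
    'bool': (10, 1),
    'float32': (3, 4),
    'float64': (84, 8),
    'int64': (2, 8),
    'object': (12, 8),
}

n_columns = sum(count for count, _ in dtype_counts.values())
bytes_per_row = sum(count * size for count, size in dtype_counts.values())
total_bytes = bytes_per_row * n_rows
total_gib = total_bytes / 1024**3

print(n_columns, bytes_per_row, round(total_gib, 1))
# prints: 111 806 5.2
```

If the full catalog has roughly ten times as many columns (the trim files drop about 90% of them), the same estimate gives on the order of 50 GB, which is consistent with the full load being killed on a shared node.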


## How fast is Dask?

This section on Dask is very preliminary and will be improved upon in future work.

In [38]:
import numpy as np

import dask as da
import dask.dataframe as dd

from dask.distributed import Client
client = Client(processes=False)

In [39]:
tract = 4850

base_dir = '/global/projecta/projectdirs/lsst/global/in2p3/Run1.1/summary'

datafile = os.path.join(base_dir, 'table_trim_merged_tract_%d.hdf5' % tract)
datafile_pattern = os.path.join(base_dir, 'table_trim_merged_tract_*.hdf5')

### Time loading of catalog using Dask

In [51]:
da_df_onetract = dd.read_hdf(datafile, key='coadd_*', mode='r')
da_df = dd.read_hdf(datafile_pattern, key='coadd_*', mode='r')

In [52]:
df2_onetract = (da_df_onetract['g_mag'] - da_df_onetract['r_mag']).mean()
df2 = (da_df['g_mag'] - da_df['r_mag']).mean()

### Time computation using Dask

In [53]:
%%timeit
df2_onetract.compute()

2.77 s ± 129 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [54]:
%%timeit
df2.compute()

54.6 s ± 1.05 s per loop (mean ± std. dev. of 7 runs, 1 loop each)


Note that the full Dask calculations take 20 times longer than the onetract version.  This is consistent with the respective times Pandas took to load the `table_trim_` files in our Pandas run above.

Dask takes 55 seconds to do the color average using the `table_trim_` files, while GCRCatalogs takes ~12 seconds using the `trim_` files and ~19 seconds using the `table_trim_` files.

This was a naive default configuration with two threads.  Further work is called for.

In [45]:
import os
print(os.getenv('OMP_NUM_THREADS'))

2


### Does it matter if we're on Lustre (SCRATCH)?

No.

In [46]:
base_dir = '/global/cscratch1/sd/wmwv/DC2/Run1.1p/summary'

datafile_onetract_lustre = os.path.join(base_dir, 'table_trim_merged_tract_%d.hdf5' % tract)
datafile_pattern_lustre = os.path.join(base_dir, 'table_trim_merged_tract_*.hdf5')

In [47]:
da_df_onetract_lustre = dd.read_hdf(datafile_onetract_lustre, key='coadd_*', mode='r')
da_df_lustre = dd.read_hdf(datafile_pattern_lustre, key='coadd_*', mode='r')

In [48]:
df2_onetract_lustre = np.mean(da_df_onetract_lustre['g_mag'] - da_df_onetract_lustre['r_mag'])
df2_lustre = np.mean(da_df_lustre['g_mag'] - da_df_lustre['r_mag'])

In [49]:
%%timeit
df2_onetract_lustre.compute()

3.52 s ± 1.32 s per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [50]:
%%timeit
df2_lustre.compute()

1min 6s ± 6.97 s per loop (mean ± std. dev. of 7 runs, 1 loop each)


Time on the Lustre file system for the (onetract, all) data set of (3.5, 66) sec seems about the same as on GPFS (2.8, 54.6) sec.