## Performance Tests of Coadd Catalog Access

Test the performance of data manipulations of the static coadd catalogs.

1. Identify trivial, moderate, and worst-case use case examples.
2. Measure performance on
    a. a single patch
    b. a single tract
    c. the full dataset
3. Record data sizes of each of the above a, b, c.
4. Determine if performance considerations mean we should generate a static file that contains a restricted set of columns.
5. Look again into using the full tables functionality to write HDF5 files so that they can be read efficiently by column. This was previously not possible because of an error when trying to write the thousands of columns in our full coadd catalogs. This is #158.
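
For step 3, the file-level sizes can be recorded with a small helper. This is a minimal sketch using only the standard library; the function name `total_size_gb` is hypothetical, and it only measures whole files (per-patch sizes inside an HDF5 file would need a different approach).

```python
import glob
import os


def total_size_gb(pattern):
    """Return (number of files, total size in GB) for files matching a glob pattern."""
    paths = sorted(glob.glob(pattern))
    total_bytes = sum(os.path.getsize(path) for path in paths)
    return len(paths), total_bytes / 1e9


# Example usage (base_dir is wherever the HDF5 files live):
# n_files, size_gb = total_size_gb(os.path.join(base_dir, 'trim_merged_tract_*.hdf5'))
```

Because a glob pattern has to match the whole file name, `'trim_merged_tract_*.hdf5'` does not pick up the `table_trim_` files.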


In [1]:
import os

import numpy as np

import GCRCatalogs

If you want to use the GCR reader outside of the NERSC environment, you can override the `base_dir` in the configuration.
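
As a sketch of what such an override might look like, assuming the reader accepts a `base_dir` key in the configuration dictionary passed to `load_catalog`. The path below is a hypothetical placeholder, not a real location.

```python
# Hypothetical local path; replace with the directory that holds the HDF5 files.
local_base_dir = '/path/to/dc2/Run1.1p/summary'

config = {'base_dir': local_base_dir}

# Derived configurations inherit the override and add their own keys.
trim_config = dict(config, filename_pattern=r'trim_merged_tract_\d+\.hdf5')
```

Building the derived dictionaries from `config` keeps the override in one place, which is the same pattern the next cell uses with `config.copy()`.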

In [2]:
config = {}

trim_config = config.copy()
trim_config['filename_pattern'] = r'trim_merged_tract_\d+\.hdf5'
table_trim_config = config.copy()
table_trim_config['filename_pattern'] = r'table_trim_merged_tract_\d+\.hdf5'

trim_onetract_config = config.copy()
trim_onetract_config['filename_pattern'] = 'trim_merged_tract_4850.hdf5'
table_trim_onetract_config = config.copy()
table_trim_onetract_config['filename_pattern'] = 'table_trim_merged_tract_4850.hdf5'
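
To see which files each `filename_pattern` selects, the patterns can be tried against a few file names. This sketch assumes the reader applies the pattern with `re.match` (anchored at the start of the name), which is not confirmed here; the file names are illustrative.

```python
import re

# Patterns copied from the configuration cell above.
trim_pattern = r'trim_merged_tract_\d+\.hdf5'
table_trim_pattern = r'table_trim_merged_tract_\d+\.hdf5'

# Illustrative file names for one tract.
filenames = [
    'merged_tract_4850.hdf5',
    'trim_merged_tract_4850.hdf5',
    'table_trim_merged_tract_4850.hdf5',
]


def matching(pattern, names):
    """Return the names that `pattern` matches starting at the first character."""
    regex = re.compile(pattern)
    return [name for name in names if regex.match(name)]


trim_matches = matching(trim_pattern, filenames)
table_trim_matches = matching(table_trim_pattern, filenames)

# An unanchored search would also pick up the table_trim file for the trim pattern.
trim_search = [name for name in filenames if re.search(trim_pattern, name)]
```

Under this assumption each pattern selects exactly one of the three files. Note also that the one-tract patterns leave the `.` unescaped, so it matches any character rather than only a literal dot.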

In [5]:
%%timeit
gc_onetract = GCRCatalogs.load_catalog('dc2_coadd_run1.1p_tract4850', config)

807 ms ± 20.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


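The trim files are 1/10 of the size of the full files.  `load_catalog` doesn't load the data, but it does need to open and touch each file to read the metadata.  As a result, loading is only about 1.6 times faster for the one-tract trim catalog (503 ms vs. 807 ms) and about 4 times faster for the full trim catalog (1.29 s vs. 4.92 s).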

In [6]:
%%timeit
gc_onetract_trim = GCRCatalogs.load_catalog('dc2_coadd_run1.1p_tract4850', trim_onetract_config)

503 ms ± 20.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [7]:
gc_onetract = GCRCatalogs.load_catalog('dc2_coadd_run1.1p_tract4850', config)
gc_onetract_trim = GCRCatalogs.load_catalog('dc2_coadd_run1.1p_tract4850', trim_onetract_config)
gc_onetract_table_trim = GCRCatalogs.load_catalog('dc2_coadd_run1.1p_tract4850', table_trim_onetract_config)

In [7]:
%%timeit
gc = GCRCatalogs.load_catalog('dc2_coadd_run1.1p', config)

4.92 s ± 140 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [8]:
%%timeit
gc_trim = GCRCatalogs.load_catalog('dc2_coadd_run1.1p', trim_config)

1.29 s ± 36.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [12]:
gc = GCRCatalogs.load_catalog('dc2_coadd_run1.1p', config)
gc_trim = GCRCatalogs.load_catalog('dc2_coadd_run1.1p', trim_config)
gc_table_trim = GCRCatalogs.load_catalog('dc2_coadd_run1.1p', table_trim_config)

Loading the GCR catalog is, in principle, just the initialization of the catalog.  In practice, the GCRCatalogs reader does need to read through all of the metadata in the HDF5 files to figure out what is available.  The onetract version reads a 7.4 GB file that should fit in memory.  The full Run 1.1p is 78 GB, which does not fit in the memory of an average desktop, though it could potentially fit in the memory of various high-memory shared nodes.  This difference in size is conveniently roughly a factor of 10, so we should naively expect full-dataset operations to take roughly 10 times longer than single-tract ones.

We can control the memory caching within GCR by clearing the cache before each performance test.  It's harder to control the underlying caching done by GPFS and the kernel's filesystem cache.
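
The cells below clear the cache inside the `%%timeit` body, so the cost of `clear_cache()` is included in the reported time. A minimal sketch of an alternative that excludes the reset from the measurement is shown here; `time_with_reset` is a hypothetical helper, not part of GCRCatalogs, and it does nothing about the filesystem-level caches.

```python
import statistics
import time


def time_with_reset(func, reset, repeats=7):
    """Time `func()` `repeats` times, calling `reset()` before each run.

    The reset call is excluded from the measured time.
    Returns (mean, sample standard deviation) in seconds; needs repeats >= 2.
    """
    durations = []
    for _ in range(repeats):
        reset()  # e.g. catalog.clear_cache()
        start = time.perf_counter()
        func()
        durations.append(time.perf_counter() - start)
    return statistics.mean(durations), statistics.stdev(durations)


# Example usage with the objects defined in this notebook:
# time_with_reset(lambda: compute_mean_color_slow(gc_onetract), gc_onetract.clear_cache)
```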

In [10]:
def compute_mean_color_slow(catalog):
    """Compute the mean g-r color of all objects in the 'catalog'.
    
    This is a trivial performance case.
    This isn't particularly immediately interesting, but it's a simple arithmetic operation between two columns.
    """
    average_gmr = np.mean(catalog['mag_g'] - catalog['mag_r'])
    return average_gmr

In [11]:
def compute_mean_color_faster(catalog):
    """Compute the mean g-r color of all objects in the 'catalog'.
    
    This is a trivial performance case.
    This isn't particularly immediately interesting, but it's a simple arithmetic operation between two columns.
    """
    data = catalog.get_quantities(['mag_g', 'mag_r'])
    average_gmr = np.mean(data['mag_g'] - data['mag_r'])
    return average_gmr

In [12]:
def compute_mean_color_faster_iter(catalog):
    """Compute the mean g-r color of all objects in the 'catalog' using iterator.
    
    This is a trivial performance case.
    This isn't particularly immediately interesting, but it's a simple arithmetic operation between two columns.
    """
    sum_gmr = count = 0
    for data in catalog.get_quantities(['mag_g', 'mag_r'], return_iterator=True):
        sum_gmr += np.sum(data['mag_g'] - data['mag_r'])
        count += len(data['mag_g'])
    return sum_gmr / count
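
The iterator version relies on the fact that a mean can be accumulated chunk by chunk as a running sum and a running count. The sketch below checks this on synthetic magnitudes with NumPy; the arrays and the choice of 7 chunks are made up for illustration and stand in for the per-file chunks an iterator might yield.

```python
import numpy as np

# Synthetic magnitudes standing in for catalog columns.
rng = np.random.default_rng(0)
mag_g = rng.normal(24.0, 1.0, size=10_000)
mag_r = rng.normal(23.5, 1.0, size=10_000)

# Mean computed with everything in memory at once.
full_mean = np.mean(mag_g - mag_r)

# Same mean accumulated over chunks, as in compute_mean_color_faster_iter.
sum_gmr = 0.0
count = 0
for g_chunk, r_chunk in zip(np.array_split(mag_g, 7), np.array_split(mag_r, 7)):
    sum_gmr += np.sum(g_chunk - r_chunk)
    count += len(g_chunk)
chunked_mean = sum_gmr / count
```

The two results agree to floating-point precision, while the chunked version only ever holds one chunk of the difference array in memory.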

In [13]:
# def compute_stellar_locus():

The average color calculation is about 5 times faster with the trim files for one tract (826 ms vs. 4.66 s), using the slowest, most naive way to access the quantities.

In [14]:
%%timeit
gc_onetract.clear_cache()
compute_mean_color_slow(gc_onetract)

4.66 s ± 985 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [15]:
%%timeit
gc_onetract_trim.clear_cache()
compute_mean_color_slow(gc_onetract_trim)

826 ms ± 40.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [16]:
%%timeit
gc.clear_cache()
compute_mean_color_slow(gc)

3min 6s ± 660 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [17]:
%%timeit
gc_onetract.clear_cache()
compute_mean_color_faster(gc_onetract)

4.31 s ± 1.11 s per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [18]:
%%timeit
gc_onetract.clear_cache()
compute_mean_color_faster_iter(gc_onetract)

3.64 s ± 91.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [19]:
%%timeit
gc.clear_cache()
compute_mean_color_faster(gc)

3min 43s ± 43 s per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [20]:
%%timeit
gc.clear_cache()
compute_mean_color_faster_iter(gc)

3min 12s ± 1.58 s per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [34]:
%%timeit
gc_trim.clear_cache()
compute_mean_color_faster(gc_trim)

12.2 s ± 2.34 s per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [None]:
%%timeit
gc_trim.clear_cache()
compute_mean_color_faster_iter(gc_trim)
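
To compare the timings above numerically, the `%%timeit` output lines can be converted to seconds. This is a small parsing sketch, assuming the output format shown in this notebook; `to_seconds` and `parse_timeit` are hypothetical helper names.

```python
import re

# Seconds per unit for the units that appear in %%timeit output here.
_UNIT_SECONDS = {'ns': 1e-9, '\u00b5s': 1e-6, 'us': 1e-6, 'ms': 1e-3, 's': 1.0, 'min': 60.0}
_TOKEN = re.compile(r'(\d+(?:\.\d+)?)\s*(ns|\u00b5s|us|ms|min|s)')


def to_seconds(text):
    """Convert a duration such as '3min 6s' or '826 ms' to seconds."""
    tokens = _TOKEN.findall(text)
    if not tokens:
        raise ValueError('no duration found in %r' % text)
    return sum(float(value) * _UNIT_SECONDS[unit] for value, unit in tokens)


def parse_timeit(line):
    """Split '<mean> ± <std> per loop (...)' into (mean, std) in seconds."""
    mean_text, rest = line.split('±', 1)
    std_text = rest.split('per loop')[0]
    return to_seconds(mean_text), to_seconds(std_text)


# Speedup of the one-tract trim catalog over the full one-tract catalog (slow method).
speedup = to_seconds('4.66 s') / to_seconds('826 ms')
```

With the values copied from the cells above, `speedup` comes out at about 5.6.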

## Compare to using Pandas to read the data files directly.

In [8]:
import os
import pandas as pd

tract = 4850
datafile_basename = 'merged_tract_%d.hdf5' % tract
datafile_basename_trim = 'trim_' + datafile_basename
datafile_basename_table_trim = 'table_trim_' + datafile_basename

base_dir = gc_onetract_trim.base_dir

datafile = os.path.join(base_dir, datafile_basename)
datafile_trim = os.path.join(base_dir, datafile_basename_trim)
datafile_table_trim = os.path.join(base_dir, datafile_basename_table_trim)

key_prefix = 'coadd'
nx, ny = 8, 8
patches = ['%d%d' % (i, j) for i in range(nx) for j in range(ny)]  # Note '%d%d' instead of '%d,%d'
patch = patches[0]
key = '%s_%d_%s' % (key_prefix, tract, patch)
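
The key construction above can be wrapped in a small function to list every patch key of a tract. This is a sketch that follows the `'%d%d'` naming used in this notebook; `patch_keys` is a hypothetical helper.

```python
def patch_keys(tract, nx=8, ny=8, key_prefix='coadd'):
    """Return the HDF5 keys for every patch of a tract, e.g. 'coadd_4850_00'."""
    return ['%s_%d_%d%d' % (key_prefix, tract, i, j)
            for i in range(nx) for j in range(ny)]


keys = patch_keys(4850)
```

For an 8 x 8 tract this gives 64 keys, from `'coadd_4850_00'` to `'coadd_4850_77'`. The concatenated `'%d%d'` form is only unambiguous while both patch indices stay below 10.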

Reading just one patch of the tract.

In [9]:
%%timeit
df = pd.read_hdf(datafile, key=key)

47.5 ms ± 1.74 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [10]:
%%timeit
df_trim = pd.read_hdf(datafile_trim, key=key)

11.1 ms ± 548 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [11]:
%%timeit
df_table_trim = pd.read_hdf(datafile_table_trim, key=key)

17.8 ms ± 1.49 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)


Now we'll load all of the patches in the tract.

In [12]:
def load_tract_into_pandas(datafile, tract, key_prefix='coadd'):
    """Read every available patch of `tract` from `datafile` into one DataFrame."""
    nx, ny = 8, 8
    patches = ['%d%d' % (i, j) for i in range(nx) for j in range(ny)]  # Note '%d%d' instead of '%d,%d'

    dfs = []
    for patch in patches:
        key = '%s_%d_%s' % (key_prefix, tract, patch)
        try:
            df = pd.read_hdf(datafile, key=key)
        except KeyError:
            # Not every patch is present in the file; skip the missing ones.
            continue
        dfs.append(df)

    df = pd.concat(dfs)
    return df

In [13]:
%%timeit
df_trim = load_tract_into_pandas(datafile_trim, tract=tract)

1.82 s ± 94.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [14]:
%%timeit
df_table_trim = load_tract_into_pandas(datafile_table_trim, tract=tract)

3.59 s ± 1.51 s per loop (mean ± std. dev. of 7 runs, 1 loop each)


## Can we use Dask?

In [15]:
import dask
import dask.dataframe as dd

In [16]:
tract = 4850

base_dir = gc_onetract_trim.base_dir

datafile = os.path.join(base_dir, 'table_trim_merged_tract_%d.hdf5' % tract)
datafile_pattern = os.path.join(base_dir, 'table_trim_merged_tract_*.hdf5')

In [18]:
da_df = dd.read_hdf(datafile, key='coadd_*', mode='r')
da_df_all = dd.read_hdf(datafile_pattern, key='coadd_*', mode='r')

In [19]:
help(dd.read_hdf)

Help on function read_hdf in module dask.dataframe.io.hdf:

read_hdf(pattern, key, start=0, stop=None, columns=None, chunksize=1000000, sorted_index=False, lock=True, mode='a')
    Read HDF files into a Dask DataFrame
    
    Read hdf files into a dask dataframe. This function is like
    ``pandas.read_hdf``, except it can read from a single large file, or from
    multiple files, or from multiple keys from the same file.
    
    Parameters
    ----------
    pattern : string, list
        File pattern (string), buffer to read from, or list of file
        paths. Can contain wildcards.
    key : group identifier in the store. Can contain wildcards
    start : optional, integer (defaults to 0), row number to start at
    stop : optional, integer (defaults to None, the last row), row number to
        stop at
    columns : list of columns, optional
        A list of columns that if not None, will limit the return
        columns (default is None)
    chunksize : positive integer, optio

In [20]:
df2 = np.mean(da_df['g_mag'] - da_df['r_mag'])
df2_all = np.mean(da_df_all['g_mag'] - da_df_all['r_mag'])

In [21]:
%%timeit
df2.compute()

2.06 s ± 57.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [23]:
%%timeit
df2_all.compute()

51.7 s ± 20.7 s per loop (mean ± std. dev. of 7 runs, 1 loop each)


Note that the full Dask calculation takes about 25 times longer than the onetract calculation (51.7 s vs. 2.06 s) for a data volume about 10 times larger.
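
The comparison between time scaling and size scaling can be made explicit with the numbers quoted in this notebook. The sketch assumes the `table_trim_` files scale between one tract and the full dataset in the same proportion as the full files (7.4 GB and 78 GB), which is not measured here.

```python
# Values copied from this notebook.
size_onetract_gb = 7.4    # full one-tract file
size_full_gb = 78.0       # full Run 1.1p dataset
dask_onetract_s = 2.06    # mean Dask time, one tract
dask_full_s = 51.7        # mean Dask time, all tracts

size_ratio = size_full_gb / size_onetract_gb
time_ratio = dask_full_s / dask_onetract_s

# How much slower than linear-in-size scaling the full calculation is.
excess = time_ratio / size_ratio
```

This gives a size ratio of about 10.5, a time ratio of about 25, and so an excess of roughly 2.4 over linear scaling. The full-dataset timing has a scatter of about 20 s, so the excess is only loosely constrained.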

Dask takes ~52 seconds (51.7 s ± 20.7 s in the run above) to do the color average using the `table_trim_` files, while GCRCatalogs takes ~12 seconds using the `trim_` files.

In [22]:
import os
print(os.getenv('OMP_NUM_THREADS'))

4


Things to try:

1. Put on SCRATCH (Lustre)
2. Put on ??? (burst buffer)

In [25]:
base_dir = '/global/cscratch1/sd/wmwv/DC2/Run1.1p/summary'

datafile_lustre = os.path.join(base_dir, 'table_trim_merged_tract_%d.hdf5' % tract)
datafile_pattern_lustre = os.path.join(base_dir, 'table_trim_merged_tract_*.hdf5')

In [26]:
da_df_lustre = dd.read_hdf(datafile_lustre, key='coadd_*', mode='r')
da_df_all_lustre = dd.read_hdf(datafile_pattern_lustre, key='coadd_*', mode='r')

In [27]:
df2_lustre = np.mean(da_df_lustre['g_mag'] - da_df_lustre['r_mag'])
df2_all_lustre = np.mean(da_df_all_lustre['g_mag'] - da_df_all_lustre['r_mag'])

In [28]:
%%timeit
df2_lustre.compute()

2.02 s ± 86.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [29]:
%%timeit
df2_all_lustre.compute()

42.4 s ± 23.2 s per loop (mean ± std. dev. of 7 runs, 1 loop each)


Time on the Lustre file system (2 s for one tract, 42 s for the full dataset) seems about the same as on GPFS (2 s and 50 s).
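
A rough check of "about the same" is to compare the difference in mean times with the run-to-run scatter reported by `%%timeit`. This is a back-of-the-envelope sketch that treats the two reported standard deviations as independent and adds them in quadrature; it is not a formal significance test.

```python
import math

# Full-dataset Dask timings copied from the cells above (seconds).
gpfs_mean, gpfs_std = 51.7, 20.7
lustre_mean, lustre_std = 42.4, 23.2

difference = gpfs_mean - lustre_mean
combined_std = math.hypot(gpfs_std, lustre_std)  # sqrt(gpfs_std**2 + lustre_std**2)
n_sigma = difference / combined_std
```

The difference of about 9 s is roughly 0.3 times the combined scatter of about 31 s, so these runs do not distinguish the two file systems.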

In [32]:
import dask
dask.config.set(scheduler='threads')
print(dask.config.config)

{'array': {'chunk-size': '128MiB', 'rechunk-threshold': 4}, 'scheduler': 'threads'}


In [33]:
%%timeit
df2_lustre.compute()

2.26 s ± 599 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
