## Performance Tests of Coadd Catalog Access

Test the performance of data manipulations of the static coadd catalogs.

1. Identify trivial, moderate, and worst-case example use cases.
2. Measure performance on
    a. a single patch
    b. a single tract
    c. the full dataset
3. Record the data sizes for each of a, b, and c above.
4. Determine whether performance considerations mean we should generate a static file that contains a restricted set of columns.
5. Look again into using the full tables functionality to write HDF5 files so that they can be read efficiently by column. This was previously not possible because of an error when trying to write the thousands of columns in our full coadd catalogs. This is #158.
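
For the data-size bookkeeping in item 3, a minimal sketch along the following lines may be enough. The helper name `total_size_gb` is hypothetical (it is not part of GCRCatalogs), and it assumes all catalog files sit directly in one directory and can be selected by a regular expression on the filename.

```python
import os
import re


def total_size_gb(directory, filename_pattern):
    """Sum the sizes of files in `directory` whose names fully match a regex.

    Returns the total in GB (1 GB = 1e9 bytes). Hypothetical helper for
    illustration; subdirectories are not searched.
    """
    regex = re.compile(filename_pattern)
    total_bytes = 0
    for name in os.listdir(directory):
        path = os.path.join(directory, name)
        # fullmatch keeps 'trim_merged_...' files out of a 'merged_...' pattern.
        if os.path.isfile(path) and regex.fullmatch(name):
            total_bytes += os.path.getsize(path)
    return total_bytes / 1e9
```

Once `base_dir` is set, something like `total_size_gb(base_dir, r'merged_tract_\d+\.hdf5')` and `total_size_gb(base_dir, r'trim_merged_tract_\d+\.hdf5')` would give the full and trim totals for comparison.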


In [1]:
import os

import numpy as np

import GCRCatalogs

If you want to use the GCR reader outside of the NERSC environment, you can override the `base_dir`.

In [2]:
base_dir = '/Users/wmwv/DC2/coadd_catalog'
config = {'base_dir': base_dir}
trim_config = config.copy()
trim_config['filename_pattern'] = r'trim_merged_tract_\d+\.hdf5'
trim_onetract_config = config.copy()
trim_onetract_config['filename_pattern'] = 'trim_merged_tract_4850.hdf5'

In [3]:
%%timeit
gc_onetract = GCRCatalogs.load_catalog('dc2_coadd_run1.1p_tract4850', config)

455 ms ± 5.77 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


The trim files are 1/10 of the size of the full files.

In [4]:
%%timeit
gc_onetract = GCRCatalogs.load_catalog('dc2_coadd_run1.1p_tract4850', trim_onetract_config)

129 ms ± 1.86 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [5]:
gc_onetract = GCRCatalogs.load_catalog('dc2_coadd_run1.1p_tract4850', config)
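# Note: this passes trim_config (a pattern matching every tract), not trim_onetract_config as in the timing cell above.
gc_onetract_trim = GCRCatalogs.load_catalog('dc2_coadd_run1.1p_tract4850', trim_config)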

In [6]:
%%timeit
gc = GCRCatalogs.load_catalog('dc2_coadd_run1.1p', config)

5.38 s ± 8.89 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [7]:
%%timeit
gc_trim = GCRCatalogs.load_catalog('dc2_coadd_run1.1p', trim_config)

1.69 s ± 76.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [8]:
gc = GCRCatalogs.load_catalog('dc2_coadd_run1.1p', config)
gc_trim = GCRCatalogs.load_catalog('dc2_coadd_run1.1p', trim_config)

Loading the GCR Catalog is, in principle, just the initialization of the catalog. In practice, the GCRCatalog reader does need to read through all of the metadata in the HDF5 files to figure out what's in there and available. The onetract version is reading a 7.4 GB file that should fit in memory. The full Run 1.1p is 78 GB, which does not fit in the memory of an average desktop. This size could potentially fit in the memory of various high-memory shared nodes. This difference in size is conveniently roughly a factor of 10. We should expect

We can control the memory caching within GCR by clearing the cache before each performance test. It's harder to control the underlying caching done by GPFS and the kernel's filesystem cache.
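
One way to keep the cache reset out of the measured time is a small timing helper. This is only a sketch: `time_with_cold_cache` is a hypothetical name, and it assumes only that the catalog object has the `clear_cache()` method used in the cells below. It resets GCR's own cache and does nothing about filesystem or kernel caching.

```python
import time


def time_with_cold_cache(func, catalog, repeats=3):
    """Time func(catalog) several times, clearing the GCR cache before each run.

    The clear_cache() call happens outside the timed region, so only the
    work done by `func` is measured. Returns a list of wall-clock seconds.
    """
    timings = []
    for _ in range(repeats):
        catalog.clear_cache()  # reset GCR's in-memory cache (not timed)
        start = time.perf_counter()
        func(catalog)
        timings.append(time.perf_counter() - start)
    return timings
```

For example, `time_with_cold_cache(compute_mean_color_slow, gc_onetract)` would return three timings that exclude the cost of clearing the cache, whereas the `%%timeit` cells below include it.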

In [9]:
def compute_mean_color_slow(catalog):
    """Compute the mean g-r color of all objects in the 'catalog'.

    This is a trivial performance case.
    It isn't particularly interesting in itself, but it's a simple arithmetic operation between two columns.
    Each column is requested separately via catalog[...].
    """
    average_gmr = np.mean(catalog['mag_g'] - catalog['mag_r'])
    return average_gmr

In [10]:
def compute_mean_color_faster(catalog):
    """Compute the mean g-r color of all objects in the 'catalog'.

    This is a trivial performance case.
    It isn't particularly interesting in itself, but it's a simple arithmetic operation between two columns.
    Both columns are requested in a single get_quantities call.
    """
    data = catalog.get_quantities(['mag_g', 'mag_r'])
    average_gmr = np.mean(data['mag_g'] - data['mag_r'])
    return average_gmr

In [11]:
def compute_mean_color_faster_iter(catalog):
    """Compute the mean g-r color of all objects in the 'catalog' using an iterator.

    This is a trivial performance case.
    It isn't particularly interesting in itself, but it's a simple arithmetic operation between two columns.
    The data are processed one chunk at a time, accumulating a running sum and count.
    """
    sum_gmr = count = 0
    for data in catalog.get_quantities(['mag_g', 'mag_r'], return_iterator=True):
        sum_gmr += np.sum(data['mag_g'] - data['mag_r'])
        count += len(data['mag_g'])
    return sum_gmr / count
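
The running sum and count above give the same mean as the whole-array version only if every magnitude is finite. A single NaN in either band makes the result NaN. The sketch below is a variant that skips non-finite colors. It assumes that some magnitudes may be NaN, which is an assumption about the data rather than something measured here. The function name is hypothetical, and each chunk is taken to be a dict of arrays keyed by column name, as in the loop above.

```python
import numpy as np


def chunked_finite_mean_color(chunks, band1='mag_g', band2='mag_r'):
    """Mean of (band1 - band2) over an iterable of chunks, skipping non-finite values.

    Each chunk is a dict-like mapping of column name to array.
    Returns NaN if no finite colors are found.
    """
    total = 0.0
    count = 0
    for data in chunks:
        color = np.asarray(data[band1]) - np.asarray(data[band2])
        good = np.isfinite(color)  # drops NaN and +/-inf
        total += color[good].sum()
        count += int(good.sum())
    return total / count if count else float('nan')


# Toy chunks for illustration only; the third object has a NaN g magnitude.
toy_chunks = [
    {'mag_g': np.array([24.0, 23.0, np.nan]), 'mag_r': np.array([23.5, 22.0, 21.0])},
    {'mag_g': np.array([22.0]), 'mag_r': np.array([21.5])},
]
toy_mean = chunked_finite_mean_color(toy_chunks)
```

With a catalog, the chunks could come from `catalog.get_quantities(['mag_g', 'mag_r'], return_iterator=True)`.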

In [12]:
# def compute_stellar_locus():

In [13]:
%%timeit
gc_onetract.clear_cache()
compute_mean_color_slow(gc_onetract)

5.71 s ± 24.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [14]:
%%timeit
gc_onetract_trim.clear_cache()
compute_mean_color_slow(gc_onetract_trim)

12.4 s ± 186 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [15]:
%%timeit
gc.clear_cache()
compute_mean_color_slow(gc)

2min 27s ± 3.56 s per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [16]:
%%timeit
gc_onetract.clear_cache()
compute_mean_color_faster(gc_onetract)

13.2 s ± 2.62 s per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [None]:
%%timeit
gc_onetract.clear_cache()
compute_mean_color_faster_iter(gc_onetract)

6.3 s ± 148 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [None]:
%%timeit
gc.clear_cache()
compute_mean_color_faster(gc)

In [None]:
%%timeit
gc.clear_cache()
compute_mean_color_faster_iter(gc)

In [None]:
len(gc['mag_g'])

## Compare to using Pandas to read the data files directly.

In [None]:
import os
import pandas as pd

tract = 4850
datafile_basename = 'merged_tract_%d.hdf5' % tract
datafile = os.path.join(base_dir, datafile_basename)

key_prefix = 'coadd'
nx, ny = 8, 8
patches = ['%d%d' % (i, j) for i in range(nx) for j in range(ny)]  # Note '%d%d' instead of '%d,%d'
patch = patches[0]
key = '%s_%d_%s' % (key_prefix, tract, patch)

In [None]:
%%timeit
df = pd.read_hdf(datafile, key=key)

In [None]:
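def load_tract_into_pandas(tract, key_prefix='coadd'):
    """Read every available patch of one tract and concatenate them into one DataFrame."""
    # Build the filename from the requested tract rather than relying on the global 'datafile'.
    datafile = os.path.join(base_dir, 'merged_tract_%d.hdf5' % tract)

    nx, ny = 8, 8
    patches = ['%d%d' % (i, j) for i in range(nx) for j in range(ny)]  # Note '%d%d' instead of '%d,%d'

    dfs = []
    for patch in patches:
        key = '%s_%d_%s' % (key_prefix, tract, patch)
        try:
            df = pd.read_hdf(datafile, key=key)
        except KeyError:
            # Skip patches that are not present in the file.
            continue
        dfs.append(df)

    return pd.concat(dfs)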

In [None]:
%%timeit
df = load_tract_into_pandas(tract=tract)

In [None]:
df = load_tract_into_pandas(tract=tract)

## Can we use Dask?

In [None]:
import dask as da
import dask.dataframe as dd

In [None]:
da_df = dd.read_hdf(datafile, key=key)

Ha, ha. Dask says not to use the HDF5 'fixed' format, and it won't talk to us until we use 'table'.
```
TypeError: 
This HDFStore is not partitionable and can only be use monolithically with
pandas.  In the future when creating HDFStores use the ``format='table'``
option to ensure that your dataset can be parallelized
```