## Performance Tests of Coadd Catalog Access

Test the performance of data manipulations on the static coadd catalogs.

1. Identify trivial, moderate, and worst-case use case examples.
2. Measure performance on
    a. a single patch
    b. a single tract
    c. the full dataset
3. Record data sizes of each of the above a, b, c.
4. Determine whether performance considerations mean we should generate a static file that contains a restricted set of columns.
5. Look again into using the full tables functionality to write HDF5 files so that they can be read efficiently by column. This was previously not possible because of an error when trying to write the thousands of columns in our full coadd catalogs. This is #158.
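The measurements in step 2 can also be taken outside of IPython's `%%timeit`. The following is a minimal sketch using only the standard library; `time_call` and its `setup` hook are hypothetical names introduced here for illustration, and the toy workload merely stands in for a real catalog operation.

```python
import statistics
import time


def time_call(func, repeats=7, setup=None):
    """Return (mean, stdev) wall-clock seconds over `repeats` calls of `func`.

    `setup`, if given, runs before each timed call and is excluded from the
    timing. It is a hypothetical hook for resetting state such as a cache.
    """
    durations = []
    for _ in range(repeats):
        if setup is not None:
            setup()
        start = time.perf_counter()
        func()
        durations.append(time.perf_counter() - start)
    stdev = statistics.stdev(durations) if repeats > 1 else 0.0
    return statistics.mean(durations), stdev


# Toy workload standing in for a catalog operation.
mean_s, stdev_s = time_call(lambda: sum(range(100_000)), repeats=5)
print(f"{mean_s * 1e3:.2f} ms ± {stdev_s * 1e3:.2f} ms per call")
```

Passing a `setup` callable that resets any in-memory cache would make each repeat a cold measurement as far as that cache is concerned; filesystem-level caching would still be outside its control.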


In [1]:
import GCRCatalogs

Use the GCR reader outside of the NERSC environment.
To do so, we register the catalogs from the YAML configuration files that live here.

In [2]:
custom_config_files = ('dc2_coadd_run1.1p_serenity', 'dc2_coadd_run1.1p_tract4850_serenity')
for name in custom_config_files:
    custom_config_file = '%s.yaml' % name
    config = GCRCatalogs.register.load_yaml(custom_config_file)
    GCRCatalogs.available_catalogs[name] = config

In [3]:
%%timeit
gc_onetract = GCRCatalogs.load_catalog('dc2_coadd_run1.1p_tract4850_serenity')

529 ms ± 4.88 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [4]:
gc_onetract = GCRCatalogs.load_catalog('dc2_coadd_run1.1p_tract4850_serenity')

In [5]:
%%timeit
gc = GCRCatalogs.load_catalog('dc2_coadd_run1.1p_serenity')

6.18 s ± 71 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [6]:
gc = GCRCatalogs.load_catalog('dc2_coadd_run1.1p_serenity')

Loading the GCR catalog is, in principle, just the initialization of the catalog. In practice, the GCRCatalogs reader needs to read through all of the metadata in the HDF5 files to figure out what is available. The one-tract version reads a 7.4 GB file, which should fit in memory. The full Run 1.1p is 78 GB, which does not fit in the memory of an average desktop, although it could potentially fit in the memory of various high-memory shared nodes. Conveniently, the difference in size is roughly a factor of 10. We should expect 

We can control the memory caching within GCR to clear the cache to reset for performance tests.  It's harder to control the underlying caching of the GPFS and kernel filesystem memory.
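As a quick check on the "roughly a factor of 10" statement, the short sketch below compares the ratio of the two data sizes with the ratio of the two catalog load times. The numbers are the ones quoted earlier in this notebook; the variable names are introduced here only for illustration.

```python
# Sizes and load times quoted earlier in this notebook.
size_onetract_gb = 7.4
size_full_gb = 78.0
load_onetract_s = 0.529
load_full_s = 6.18

size_ratio = size_full_gb / size_onetract_gb
load_ratio = load_full_s / load_onetract_s

print(f"size ratio:      {size_ratio:.1f}")
print(f"load-time ratio: {load_ratio:.1f}")
```

The size ratio comes out at about 10.5 and the load-time ratio at about 11.7, so the catalog initialization time scales approximately with the data size for these two cases.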

In [7]:
import numpy as np


def compute_mean_color_slow(catalog):
    """Compute the mean g-r color of all objects in the 'catalog'.
    
    This is a trivial performance case.
    It isn't particularly interesting in itself, but it's a simple arithmetic operation between two columns.
    """
    # Each catalog[...] lookup fetches its column separately.
    average_gmr = np.mean(catalog['mag_g'] - catalog['mag_r'])
    return average_gmr

In [8]:
def compute_mean_color_faster(catalog):
    """Compute the mean g-r color of all objects in the 'catalog'.
    
    This is a trivial performance case.
    It isn't particularly interesting in itself, but it's a simple arithmetic operation between two columns.
    """
    # Fetch both columns in a single call; get_quantities returns a dict keyed by column name.
    data = catalog.get_quantities(['mag_g', 'mag_r'])
    average_gmr = np.mean(data['mag_g'] - data['mag_r'])
    return average_gmr
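Both functions above hold entire columns in memory, which is a concern for the full 78 GB catalog. The sketch below shows one way to compute the same mean by accumulating over chunks instead. `mean_color_chunked` is a hypothetical helper, and the toy chunks stand in for real catalog data. With GCR, the chunks could presumably come from `catalog.get_quantities(['mag_g', 'mag_r'], return_iterator=True)`, assuming the installed GCR version supports that argument. Unlike `np.mean` in the functions above, this sketch skips non-finite colors, which is an assumption about the desired behavior.

```python
import numpy as np


def mean_color_chunked(chunks, band1='mag_g', band2='mag_r'):
    """Mean of band1 - band2 over an iterable of dict-like column chunks.

    Non-finite differences (for example NaN magnitudes) are skipped.
    Returns NaN if no finite values are found.
    """
    total = 0.0
    count = 0
    for chunk in chunks:
        color = np.asarray(chunk[band1]) - np.asarray(chunk[band2])
        good = np.isfinite(color)
        total += float(color[good].sum())
        count += int(good.sum())
    return total / count if count else float('nan')


# Toy chunks standing in for per-file pieces of a catalog.
toy_chunks = [
    {'mag_g': np.array([21.0, 22.5]), 'mag_r': np.array([20.5, 21.5])},
    {'mag_g': np.array([23.0, np.nan]), 'mag_r': np.array([22.0, 20.0])},
]
print(mean_color_chunked(toy_chunks))
```

Only one chunk is resident at a time, so peak memory is set by the largest chunk rather than by the full catalog.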

In [9]:
# def compute_stellar_locus():

In [10]:
%%timeit
compute_mean_color_slow(gc_onetract)

37.3 ms ± 504 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [11]:
%%timeit
compute_mean_color_slow(gc)

The slowest run took 182.96 times longer than the fastest. This could mean that an intermediate result is being cached.
10.9 s ± 23.4 s per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [11]:
%%timeit
compute_mean_color_faster(gc_onetract)



In [11]:
%%timeit
compute_mean_color_faster(gc)



In [12]:
len(gc['mag_g'])

6892380

Compare to using Pandas to read the data files directly.

In [20]:
import os
import pandas as pd

tract = 4850
datafile_basename = 'merged_tract_%d.hdf5' % tract
datafile = os.path.join(gc.base_dir, datafile_basename)

key_prefix = 'coadd'
nx, ny = 8, 8
patches = ['%d%d' % (i, j) for i in range(nx) for j in range(ny)]  # Note '%d%d' instead of '%d,%d'
patch = patches[0]
key = '%s_%d_%s' % (key_prefix, tract, patch)

In [21]:
%%timeit
df = pd.read_hdf(datafile, key=key)

33.1 ms ± 764 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [27]:
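def load_tract_into_pandas(tract, key_prefix='coadd'):
    """Read every available patch of `tract` from its HDF5 file into one DataFrame."""
    # Build the file name from `tract` rather than relying on the global `datafile`.
    datafile = os.path.join(gc.base_dir, 'merged_tract_%d.hdf5' % tract)
    nx, ny = 8, 8
    patches = ['%d%d' % (i, j) for i in range(nx) for j in range(ny)]  # Note '%d%d' instead of '%d,%d'

    dfs = []
    for patch in patches:
        key = '%s_%d_%s' % (key_prefix, tract, patch)
        try:
            df = pd.read_hdf(datafile, key=key)
        except KeyError:
            # Skip patches that have no table in the file.
            continue
        dfs.append(df)

    df = pd.concat(dfs)
    return df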

In [28]:
%%timeit
df = load_tract_into_pandas(tract=tract)

43.3 s ± 1.06 s per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [29]:
df = load_tract_into_pandas(tract=tract)