# Appendix: summary of performance using Apache Spark (Apache Parquet and FITS)

Author: **Julien Peloton [@JulienPeloton](https://github.com/JulienPeloton)**  
Last Run: **2018-11-22**  
See also: [issue/249](https://github.com/LSSTDESC/DC2-production/issues/249)

This notebook summarises the performance of data manipulations on the DC2 object catalogs with Apache Spark.
We focus on the Parquet and FITS formats.

The notebook is intended to be executed with the `desc-pyspark` kernel with **32 threads** (one full Cori Haswell node). See [LSSTDESC/desc-spark](https://github.com/LSSTDESC/desc-spark#working-at-nersc-jupyterlab) for more information.

In [None]:
from pyspark.sql import SparkSession

# Initialise our Spark session
spark = SparkSession.builder.getOrCreate()

## Dataset

We focus on "One Tract" (OT) and "All Tract" (AT) catalogs.

In [None]:
import os

base_dir = '/global/projecta/projectdirs/lsst/global/in2p3/Run1.1/summary'
print("Data will be read from: \n", base_dir)

# Path to data
parq_4850_OT = os.path.join(base_dir, 'dpdd_object_tract_4850_hive.parquet')
fits_4850_OT = os.path.join(base_dir, 'dpdd_object_tract_4850*.fits')

parq_hive_AT = os.path.join(base_dir, 'dpdd_object.parquet')
parq_simp_AT = os.path.join(base_dir, 'dpdd_object_simple.parquet')
fits_simp_AT = os.path.join(base_dir, 'dpdd_object_tract_*.fits')
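
The FITS paths above are glob patterns, so each one may expand to several files. As a minimal sketch (the helper name `summarise_glob` and the dummy files in the demo are assumptions made for illustration, not part of the original notebook), the number of matched files and their total size can be checked before reading them with Spark:

```python
import glob
import os
import tempfile


def summarise_glob(pattern):
    """Return (number of files, total size in bytes) matched by a glob pattern.

    Only regular files are counted, so a pattern matching directories
    (e.g. a partitioned Parquet dataset) would report zero files.
    """
    paths = [p for p in glob.glob(pattern) if os.path.isfile(p)]
    total = sum(os.path.getsize(p) for p in paths)
    return len(paths), total


# Demo on throw-away dummy files (hypothetical names and sizes).
with tempfile.TemporaryDirectory() as tmp:
    for tract in (4849, 4850, 4851):
        name = "dpdd_object_tract_{}.fits".format(tract)
        with open(os.path.join(tmp, name), "wb") as f:
            f.write(b"\0" * 2880)

    pattern = os.path.join(tmp, "dpdd_object_tract_*.fits")
    n_files, n_bytes = summarise_glob(pattern)
    print(n_files, "files,", n_bytes, "bytes")
```

In the notebook itself, one would presumably call `summarise_glob(fits_4850_OT)` or `summarise_glob(fits_simp_AT)` instead of the temporary pattern.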

## Results

Here is a summary of the results. The configuration for this run was:

- "One Tract" (OT) and "All Tract" (AT) catalogs
- 1 full Cori compute node (32 cores).


Numbers should be read as orders of magnitude: they vary slightly from run to run, depending in part on the load of the machine, since JupyterLab is a shared resource.
Details can be found below.

No cache:

| Data set            | #Rows (size GB) | Load time | statistics* |
|---------------------|-----------------|-------------|-----------------|
| Parquet (OT)        | 719,228 (0.43)   | 192 ms ± 66.5 ms      |      189 ms ± 84 ms     |
| FITS (OT)           | 719,228 (0.57)   | 3.74 s ± 107 ms      |      3.67 s ± 123 ms     |
| Parquet (AT, Hive)  | 6,892,380 (4.5) | 726 ms ± 117 ms      |      990 ms ± 521 ms     |
| Parquet (AT, Simple)| 6,892,380 (3.6) | 210 ms ± 28.2 ms      |      459 ms ± 46.5 ms     |
| FITS (AT)           | 6,892,380 (5.4) | 25.7 s ± 308 ms      |      24.4 s ± 1.24 s     |

_*statistics_ means computing: number of elements, mean, stddev, min, max.
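
To make the five quantities concrete, here is a plain-Python sketch of the same summary for the difference of two columns. It is only an illustration on made-up magnitudes, not the Spark code path: the function name `describe_diff` is hypothetical, and it assumes the sample standard deviation and the skipping of missing values, which is my understanding of what Spark's `describe()` reports.

```python
import statistics


def describe_diff(col_1, col_2):
    """Summarise col_1 - col_2: count, mean, stddev, min, max.

    Rows where either value is missing (None) are skipped.
    """
    diff = [a - b for a, b in zip(col_1, col_2)
            if a is not None and b is not None]
    return {
        "count": len(diff),
        "mean": statistics.mean(diff),
        "stddev": statistics.stdev(diff),  # sample standard deviation
        "min": min(diff),
        "max": max(diff),
    }


# Made-up magnitudes, for illustration only.
mag_g = [24.0, 23.5, 25.0, 22.0]
mag_r = [23.0, 23.0, 24.0, 21.5]

summary = describe_diff(mag_g, mag_r)
for name, value in summary.items():
    print(name, value)
```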

With cache (putting the data in cache adds an overhead of less than 1 second):

| Data set            | #Rows (size GB) | Load time | statistics |
|---------------------|-----------------|-------------|-----------------|
| Parquet (OT)        | 719,228 (0.43)   | 393 ms ± 86.2 ms      | 111 ms ± 45.4 ms |
| FITS (OT)           | 719,228 (0.57)   | 312 ms ± 59.2 ms      | 99.5 ms ± 49.4 ms |
| Parquet (AT, Hive)  | 6,892,380 (4.5) | 215 ms ± 102 ms      | 351 ms ± 150 ms |
| Parquet (AT, Simple)| 6,892,380 (3.6) | 181 ms ± 74.1 ms      | 391 ms ± 52.9 ms |
| FITS (AT)           | 6,892,380 (5.4) | 2.78 s ± 1.37 s      | 3.05 s ± 600 ms |

Remarks:

- Parquet is much faster than FITS. Possible reasons include the columnar (Parquet) versus row-based (FITS) layout, or a better-optimised Spark connector for Parquet. I believe this is related to what is seen with Dask between Parquet and HDF5.
- Note, however, that the size on disk of the datasets varies: 4.5 GB for Parquet Hive, 3.6 GB for Parquet simple, and 5.4 GB for FITS.
- For FITS, the number of input files matters. It is always better to have a small number of large files than many small files.
- Once the data is in cache, everything is very fast.
- Given the small volume of data, most results below one second are dominated by Spark overhead and noise rather than actual computation (which is why a simple count can sometimes be slower than computing full statistics).
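
To put a rough number on "much faster", the sketch below converts the uncached load times from the table above into seconds and takes FITS/Parquet ratios. The helper `to_seconds` is a hypothetical name introduced here. Only the mean is used, so the ratios inherit the run-to-run scatter and should be read as orders of magnitude.

```python
import re

# Conversion factors for the units printed by %timeit.
_UNITS = {"ns": 1e-9, "us": 1e-6, "µs": 1e-6, "ms": 1e-3, "s": 1.0}


def to_seconds(timing):
    """Convert the mean of a '<mean> <unit> ± <std> <unit>' string to seconds."""
    match = re.match(r"\s*([0-9.]+)\s*(ns|us|µs|ms|s)\b", timing)
    if match is None:
        raise ValueError("Cannot parse timing: {!r}".format(timing))
    value, unit = match.groups()
    return float(value) * _UNITS[unit]


# Uncached load times, copied from the table above.
load_no_cache = {
    "Parquet (OT)": "192 ms ± 66.5 ms",
    "FITS (OT)": "3.74 s ± 107 ms",
    "Parquet (AT, Hive)": "726 ms ± 117 ms",
    "Parquet (AT, Simple)": "210 ms ± 28.2 ms",
    "FITS (AT)": "25.7 s ± 308 ms",
}
seconds = {name: to_seconds(t) for name, t in load_no_cache.items()}

speedup_ot = seconds["FITS (OT)"] / seconds["Parquet (OT)"]
speedup_at_hive = seconds["FITS (AT)"] / seconds["Parquet (AT, Hive)"]
speedup_at_simple = seconds["FITS (AT)"] / seconds["Parquet (AT, Simple)"]

print("OT, FITS / Parquet:          {:.0f}x".format(speedup_ot))
print("AT, FITS / Parquet (Hive):   {:.0f}x".format(speedup_at_hive))
print("AT, FITS / Parquet (Simple): {:.0f}x".format(speedup_at_simple))
```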

## Details of the benchmarks

In [None]:
from pyspark.sql import DataFrame
from pyspark.sql.functions import col
import time

def readfits(path: str, hdu: int=1) -> DataFrame:
    """ Wrapper around Spark reader for FITS
    
    Parameters
    ----------
    path : str
        Path to the data. Can be a file, a folder, or a
        glob pattern.
    hdu : int, optional
        HDU number to read. Default is 1.
        
    Returns
    ----------
    df : DataFrame
        Spark DataFrame with the HDU data.

    """
    return spark.read.format("fits").option("hdu", hdu).load(path)

def readparq(path: str) -> DataFrame:
    """ Wrapper around Spark reader for Parquet
    
    Parameters
    ----------
    path : str
        Path to the data. Can be a file, a folder, or a
        glob pattern.
        
    Returns
    ----------
    df : DataFrame
        Spark DataFrame with the Parquet data.
        
    """
    return spark.read.format("parquet").load(path)

def simple_count(df: DataFrame, cache=False, txt: str="") -> int:
    """ Return the number of rows in the DataFrame
    
    Parameters
    ----------
    df : DataFrame
        Spark DataFrame
    cache : bool, optional
        If True, put the data in cache prior to the computation.
        The caller is responsible for unpersisting it afterwards.
        Default is False.
    txt: str, optional
        Additional text to be printed.
        
    Returns
    ----------
    out : int
        Number of rows
        
    """
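    if cache:
        # Note: cache() is lazy, so this only times marking the DataFrame
        # for caching; the data is materialised by the first count() below.
        start = time.time()
        df = df.cache()
        print("Cache took {:.1f} sec".format(time.time() - start))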
    
    res = df.count()
    print("{} has length:".format(txt), res)
    
    # Time it!
    %timeit df.count()
        
    return res

def stat_diff_col(
        df: DataFrame, colname_1: str, colname_2: str, 
        cache=False, txt: str="") -> DataFrame:
    """ Return some statistics about the difference of 
    two DataFrame Columns.
    Statistics include: count, mean, stddev, min, max.
    
    Parameters
    ----------
    df : DataFrame
        Spark DataFrame
    colname_1 : str
        Name of the first Column
    colname_2 : str
        Name of the second Column
    cache : bool, optional
        If True, put the data in cache prior to the computation.
        The caller is responsible for unpersisting it afterwards.
        Default is False.
    txt: str, optional
        Additional text to be printed.
        
    Returns
    ----------
    out : DataFrame
        DataFrame containing statistics about the Columns difference.
    """
    if cache:
        df = df.cache()
    print("{} has length:".format(txt), df.count())
    
    # Time it!
    %timeit res = df.select(col(colname_1) - col(colname_2)).describe().collect()
    
    return df.select(col(colname_1) - col(colname_2)).describe()


def stat_one_col(df: DataFrame, colname: str, cache=False, txt: str="") -> DataFrame:
    """ Return some statistics about one DataFrame Column.
    Statistics include: count, mean, stddev, min, max.
    
    Parameters
    ----------
    df : DataFrame
        Spark DataFrame
    colname : str
        Name of the Column for which we want the statistics
    cache : bool, optional
        If True, put the data in cache prior to the computation.
        The caller is responsible for unpersisting it afterwards.
        Default is False.
    txt: str, optional
        Additional text to be printed.
        
    Returns
    ----------
    out : DataFrame
        DataFrame containing statistics about the Column.
    """
    if cache:
        df = df.cache()
    print("{} has length:".format(txt), df.count())
    
    # Time it!
    %timeit res = df.select(colname).describe().collect()
    
    return df.select(colname).describe()
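
The functions above rely on the IPython `%timeit` magic, which is only available inside a notebook or IPython session. As a rough stand-in for plain Python scripts, the sketch below repeats a call and reports the mean and standard deviation of the wall-clock time. The name `time_call` and the default of 7 repeats are assumptions chosen here; unlike `%timeit`, it runs a single loop per repeat.

```python
import statistics
import time


def time_call(func, repeat=7):
    """Call func() `repeat` times; return (mean, stddev) in seconds."""
    if repeat < 2:
        raise ValueError("repeat must be at least 2 to compute a stddev")
    durations = []
    for _ in range(repeat):
        start = time.perf_counter()
        func()
        durations.append(time.perf_counter() - start)
    return statistics.mean(durations), statistics.stdev(durations)


# Demo on a cheap pure-Python workload.
mean_s, std_s = time_call(lambda: sum(range(100000)))
print("{:.3g} s ± {:.3g} s per call".format(mean_s, std_s))
```

With a Spark DataFrame `df` at hand, the equivalent of the timed count would be `time_call(lambda: df.count())`.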

In [None]:
# Accessing catalogs
cache = False
df = readparq(parq_4850_OT)
c = simple_count(df, cache, "OT (Parquet)")

df = readfits(fits_4850_OT)
c = simple_count(df, cache, "OT (FITS)")

df = readparq(parq_hive_AT)
c = simple_count(df, cache, "AT (P-Hive)")

df = readparq(parq_simp_AT)
c = simple_count(df, cache, "AT (P-simple)")

df = readfits(fits_simp_AT)
c = simple_count(df, cache, "AT (FITS)")

In [None]:
# Statistics: count, mean, stddev, min, max
c1 = "mag_g"
c2 = "mag_r"
cache = False
df = readparq(parq_4850_OT)
c = stat_diff_col(df, c1, c2, cache, "OT (Parquet)")

df = readfits(fits_4850_OT)
c = stat_diff_col(df, c1, c2, cache, "OT (FITS)")

df = readparq(parq_hive_AT)
c = stat_diff_col(df, c1, c2, cache, "AT (P-Hive)")

df = readparq(parq_simp_AT)
c = stat_diff_col(df, c1, c2, cache, "AT (P-simple)")

df = readfits(fits_simp_AT)
c = stat_diff_col(df, c1, c2, cache, "AT (FITS)")

In [None]:
# Accessing catalogs
cache = True
df = readparq(parq_4850_OT)
c = simple_count(df, cache, "OT (Parquet)")
df.unpersist()

df = readfits(fits_4850_OT)
c = simple_count(df, cache, "OT (FITS)")
df.unpersist()

df = readparq(parq_hive_AT)
c = simple_count(df, cache, "AT (P-Hive)")
df.unpersist()

df = readparq(parq_simp_AT)
c = simple_count(df, cache, "AT (P-simple)")
df.unpersist()

df = readfits(fits_simp_AT)
c = simple_count(df, cache, "AT (FITS)")
df.unpersist();

In [None]:
# Statistics: count, mean, stddev, min, max
c1 = "mag_g"
c2 = "mag_r"
cache = True
df = readparq(parq_4850_OT)
c = stat_diff_col(df, c1, c2, cache, "OT (Parquet)")
df.unpersist()

df = readfits(fits_4850_OT)
c = stat_diff_col(df, c1, c2, cache, "OT (FITS)")
df.unpersist()

df = readparq(parq_hive_AT)
c = stat_diff_col(df, c1, c2, cache, "AT (P-Hive)")
df.unpersist()

df = readparq(parq_simp_AT)
c = stat_diff_col(df, c1, c2, cache, "AT (P-simple)")
df.unpersist()

df = readfits(fits_simp_AT)
c = stat_diff_col(df, c1, c2, cache, "AT (FITS)")
df.unpersist();