# Comparison of Two ROS3 Backends when Accessing Single-Shot HDF5 files in AWS S3

The forthcoming HDF5 library version 2.0 (at the time of writing) replaces the code for working with S3-compatible web object stores in its Read Only S3 (ROS3) file driver with Amazon's C S3 [library](https://github.com/awslabs/aws-c-s3). This notebook compares benchmark results from the latest library release, 1.14.6, with those from a development version of the future 2.0 release. The goal is to assess whether there are significant performance differences between the old and new code in the ROS3 driver.

Both sets of benchmarks, with the new ROS3 code and with the "old" ROS3 from the 1.14.6 library release, used the same single-shot HDF5 files with C-Mod data in S3.

## TL;DR Conclusions

* The new ROS3 driver never reported errors when accessing the HDF5 files, while the old ROS3 occasionally did. This is likely a consequence of Amazon's S3 library and its intelligent handling of non-fatal failed S3 requests.
* The new ROS3 benchmark results are comparable to or better than the old ones. The biggest change is in reading all signals from all the files, where a ~30% improvement was observed for the new ROS3. In this case there was also no performance degradation when the number of Dask workers significantly exceeded the available CPUs (the resource contention case).

The overall conclusion is that the new ROS3 does not degrade performance compared to the old ROS3 for the HDF5 files in this project.

---

## Benchmark Data Analysis

In [2]:
import pandas as pd
import hvplot.pandas  # noqa: F401
import holoviews as hv

pd.set_option("display.max_rows", None)
pd.set_option("display.max_columns", None)

hv.extension("bokeh")

Read benchmark data for the new ROS3 backend:

In [3]:
s3_data_new = pd.read_csv("./ec2-s3-libaws.csv")
s3_data_new.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11657 entries, 0 to 11656
Data columns (total 12 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   worker#              11657 non-null  int64  
 1   obj-id               11657 non-null  object 
 2   open+read-data-time  11657 non-null  float64
 3   wrkr-num-objs        11657 non-null  int64  
 4   mean-obj-time        11657 non-null  float64
 5   num-dsets            11657 non-null  int64  
 6   mean-dset-time       11657 non-null  float64
 7   pb-size              11657 non-null  int64  
 8   num-workers          11657 non-null  int64  
 9   obj-type             11657 non-null  object 
 10  tot-num-obj          11657 non-null  int64  
 11  total-runtime        11657 non-null  float64
dtypes: float64(4), int64(6), object(2)
memory usage: 1.1+ MB


Read benchmark data for the old ROS3 backend:

In [4]:
s3_data_old = pd.read_csv("./ec2-s3.csv")
s3_data_old.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11655 entries, 0 to 11654
Data columns (total 12 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   worker#              11655 non-null  int64  
 1   obj-id               11655 non-null  object 
 2   open+read-data-time  11655 non-null  float64
 3   wrkr-num-objs        11655 non-null  int64  
 4   mean-obj-time        11655 non-null  float64
 5   num-dsets            11655 non-null  int64  
 6   mean-dset-time       11655 non-null  float64
 7   pb-size              11655 non-null  int64  
 8   num-workers          11655 non-null  int64  
 9   obj-type             11655 non-null  object 
 10  tot-num-obj          11655 non-null  int64  
 11  total-runtime        11655 non-null  float64
dtypes: float64(4), int64(6), object(2)
memory usage: 1.1+ MB
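
The two CSV files differ by two rows (11657 versus 11655). A minimal sketch of how one might locate the parameter combinations that account for such a difference is given here. The helper name `group_size_diff` and the small `old_toy` and `new_toy` frames are illustrative assumptions with made-up values; they stand in for `s3_data_old` and `s3_data_new`.

```python
import pandas as pd


def group_size_diff(old: pd.DataFrame, new: pd.DataFrame) -> pd.DataFrame:
    """Count rows per (obj-type, num-workers) in both frames and keep the groups that differ."""
    keys = ["obj-type", "num-workers"]
    counts = (
        pd.concat(
            {"old": old.groupby(keys).size(), "new": new.groupby(keys).size()},
            axis=1,
        )
        .fillna(0)  # a group missing from one frame counts as zero rows
        .astype(int)
    )
    counts["diff"] = counts["new"] - counts["old"]
    return counts[counts["diff"] != 0]


# Toy data (made-up values) standing in for the real benchmark frames.
old_toy = pd.DataFrame(
    {"obj-type": ["shots", "signals", "signals"], "num-workers": [1, 8, 8]}
)
new_toy = pd.DataFrame(
    {"obj-type": ["shots", "signals", "signals", "signals"], "num-workers": [1, 8, 8, 8]}
)
size_diff = group_size_diff(old_toy, new_toy)
print(size_diff)
```

Applied to the real frames, a non-empty result would show which benchmark runs recorded a different number of per-worker rows.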


Page cache sizes used in the benchmarks:

In [5]:
s3_data_old["pb-size"].unique()

array([268435456])

In [6]:
s3_data_new["pb-size"].unique()

array([268435456])
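
As a quick sanity check on the raw value, the sketch below converts the reported `pb-size` from bytes to binary and decimal units. The constant name `PB_SIZE_BYTES` is introduced here only for illustration.

```python
PB_SIZE_BYTES = 268435456  # value reported in the pb-size column

mib = PB_SIZE_BYTES / 2**20  # binary mebibytes (1 MiB = 1048576 bytes)
mb = PB_SIZE_BYTES / 10**6  # decimal megabytes (1 MB = 1000000 bytes)
print(f"{PB_SIZE_BYTES} bytes = {mib:.0f} MiB = {mb:.1f} MB")
# prints: 268435456 bytes = 256 MiB = 268.4 MB
```

The value is exactly 2^28 bytes, so the `264MB` label used in later cells should be read as a shorthand for this page cache size rather than an exact conversion.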

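Make two replacements:
* Replace page cache sizes with more memorable labels. (The label `264MB` stands for the 268435456-byte page cache, which equals 256 MiB.)
* Correct a data error where 500 was reported as the total number of objects when reading by shot.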

In [7]:
s3_data_old.replace(
    {
        "pb-size": {0: "off", 268435456: "264MB"},
        "tot-num-obj": {500: 1},
    },
    inplace=True,
)

s3_data_new.replace(
    {
        "pb-size": {0: "off", 268435456: "264MB"},
        "tot-num-obj": {500: 1},
    },
    inplace=True,
)

These were the benchmark parameter combinations:

In [8]:
s3_data_old[["obj-type", "pb-size", "num-workers"]].drop_duplicates()

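| | obj-type | pb-size | num-workers |
|---|---|---|---|
| 0 | shots | 264MB | 1 |
| 1 | shots | 264MB | 2 |
| 3 | shots | 264MB | 4 |
| 7 | signals | 264MB | 8 |
| 495 | shots | 264MB | 8 |
| 503 | signals | 264MB | 16 |
| 1477 | shots | 264MB | 16 |
| 1492 | signals | 264MB | 24 |
| 2953 | shots | 264MB | 24 |
| 2973 | signals | 264MB | 32 |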


In [9]:
s3_data_new[["obj-type", "pb-size", "num-workers"]].drop_duplicates()

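| | obj-type | pb-size | num-workers |
|---|---|---|---|
| 0 | shots | 264MB | 1 |
| 1 | shots | 264MB | 2 |
| 3 | shots | 264MB | 4 |
| 7 | signals | 264MB | 8 |
| 495 | shots | 264MB | 8 |
| 503 | signals | 264MB | 16 |
| 1477 | shots | 264MB | 16 |
| 1492 | signals | 264MB | 24 |
| 2953 | shots | 264MB | 24 |
| 2974 | signals | 264MB | 32 |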


Column `obj-type` describes data access type during one benchmark run:

* `obj-type = shots` means all signals from one shot file were read.
* `obj-type = signals` means all signals from all the files were read, one at a time.


The two ways of reading data, by `shots` and by `signals`, are so different that they cannot be compared to each other, so separate them into different DataFrames.

In [10]:
s3_old_shots = s3_data_old[s3_data_old["obj-type"] == "shots"]
s3_new_shots = s3_data_new[s3_data_new["obj-type"] == "shots"]
s3_old_signals = s3_data_old[s3_data_old["obj-type"] == "signals"]
s3_new_signals = s3_data_new[s3_data_new["obj-type"] == "signals"]

In [11]:
s3_new_shots.head()

| | worker# | obj-id | open+read-data-time | wrkr-num-objs | mean-obj-time | num-dsets | mean-dset-time | pb-size | num-workers | obj-type | tot-num-obj | total-runtime |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1160923014 | 0.620212 | 1 | 0.620212 | 124 | 0.005002 | 264MB | 1 | shots | 1 | 0.638936 |
| 1 | 1 | 1160929030 | 0.267919 | 1 | 0.267919 | 61 | 0.004392 | 264MB | 2 | shots | 1 | 0.386857 |
| 2 | 2 | 1160929030 | 0.366118 | 1 | 0.366118 | 60 | 0.006102 | 264MB | 2 | shots | 1 | 0.386857 |
| 3 | 1 | 1160928002 | 0.396423 | 1 | 0.396423 | 32 | 0.012388 | 264MB | 4 | shots | 1 | 0.479301 |
| 4 | 2 | 1160928002 | 0.425818 | 1 | 0.425818 | 32 | 0.013307 | 264MB | 4 | shots | 1 | 0.479301 |


In [12]:
s3_new_signals.head()

| | worker# | obj-id | open+read-data-time | wrkr-num-objs | mean-obj-time | num-dsets | mean-dset-time | pb-size | num-workers | obj-type | tot-num-obj | total-runtime |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 7 | 1 | aeqdsk_aminor | 31.374785 | 63 | 0.498012 | 126 | 0.249006 | 264MB | 8 | signals | 61 | 945.749953 |
| 8 | 2 | aeqdsk_aminor | 31.749961 | 63 | 0.503968 | 126 | 0.251984 | 264MB | 8 | signals | 61 | 945.749953 |
| 9 | 3 | aeqdsk_aminor | 30.819513 | 63 | 0.489199 | 126 | 0.244599 | 264MB | 8 | signals | 61 | 945.749953 |
| 10 | 4 | aeqdsk_aminor | 32.351999 | 63 | 0.513524 | 126 | 0.256762 | 264MB | 8 | signals | 61 | 945.749953 |
| 11 | 5 | aeqdsk_aminor | 32.317871 | 63 | 0.512982 | 126 | 0.256491 | 264MB | 8 | signals | 61 | 945.749953 |


### Total Runtime and Performance

Total benchmark runtime in the `total-runtime` column is the elapsed time of the entire benchmark as measured by the main process. The total runtime encompasses:
1. Dividing the data access plan across Dask workers and their initialization.
1. All Dask workers completing their jobs.
1. Collecting Dask worker benchmark data.
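
The three phases can be illustrated with a small, self-contained sketch. This is not the benchmark code: it uses a standard-library thread pool as a stand-in for Dask workers, and `worker_job`, `run_benchmark`, and the sleep-based "read" are hypothetical placeholders. It only shows why the total runtime measured by the main process is at least as long as any single worker's time.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def worker_job(worker_id: int, objects: list[str]) -> dict:
    """Stand-in for one worker: 'reads' its objects and reports its own elapsed time."""
    start = time.perf_counter()
    for _ in objects:
        time.sleep(0.001)  # placeholder for opening a file and reading its datasets
    return {"worker#": worker_id, "open+read-data-time": time.perf_counter() - start}


def run_benchmark(objects: list[str], num_workers: int) -> tuple[float, list[dict]]:
    """Time the whole run from the main process, covering all three phases."""
    start = time.perf_counter()
    # 1. Divide the access plan across the workers (round-robin here) and start them.
    plan = [objects[i::num_workers] for i in range(num_workers)]
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        # 2. Wait for every worker to finish and 3. collect their timing data.
        results = list(pool.map(worker_job, range(1, num_workers + 1), plan))
    total_runtime = time.perf_counter() - start
    return total_runtime, results


total, stats = run_benchmark([f"obj{i}" for i in range(8)], num_workers=4)
print(f"total runtime: {total:.4f} s")
for row in stats:
    print(row)
```

Because the main-process clock starts before the plan is divided and stops after the last result is collected, the total includes scheduling and collection overhead on top of the slowest worker.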

Below are four DataFrames with total runtimes separated for the signal and shot benchmarks. Their runtimes are so different that there is no point comparing them together. The new DataFrames include several original columns plus a new column `norm-tot-runtime`. This column holds computed performance ratios to the _baseline_ benchmark. The baseline benchmark is one of the available benchmarks selected because it represents the most common set of libhdf5 settings, compute resources, and data access. The baseline benchmarks are:

* S3 files:
    * Reading all signals for a shot: 1 Dask worker, 264 MB file page cache
    * Reading all signals for all the shots: 8 Dask workers, 264 MB file page cache

In [13]:
old_shots_runtime = s3_old_shots[
    ["pb-size", "num-workers", "total-runtime"]
].drop_duplicates(ignore_index=True)
old_shots_runtime["where"] = "S3"
old_shots_runtime["norm-tot-runtime"] = (
    old_shots_runtime.loc[0, "total-runtime"] / old_shots_runtime["total-runtime"]
)

old_signals_runtime = s3_old_signals[
    ["pb-size", "num-workers", "total-runtime"]
].drop_duplicates(ignore_index=True)
old_signals_runtime["where"] = "S3"
old_signals_runtime["norm-tot-runtime"] = (
    old_signals_runtime.loc[0, "total-runtime"] / old_signals_runtime["total-runtime"]
)

new_shots_runtime = s3_new_shots[
    ["pb-size", "num-workers", "total-runtime"]
].drop_duplicates(ignore_index=True)
new_shots_runtime["where"] = "S3"
new_shots_runtime["norm-tot-runtime"] = (
    new_shots_runtime.loc[0, "total-runtime"] / new_shots_runtime["total-runtime"]
)

new_signals_runtime = s3_new_signals[
    ["pb-size", "num-workers", "total-runtime"]
].drop_duplicates(ignore_index=True)
new_signals_runtime["where"] = "S3"
new_signals_runtime["norm-tot-runtime"] = (
    new_signals_runtime.loc[0, "total-runtime"] / new_signals_runtime["total-runtime"]
)
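
The four blocks above repeat the same steps, so they could be folded into one helper. The sketch below is one possible way to do that; the function name `runtime_table` and the toy frame are assumptions for illustration. Like the cell above, it treats the first row after `drop_duplicates` as the baseline run, which holds only if the baseline benchmark comes first in the data.

```python
import pandas as pd


def runtime_table(df: pd.DataFrame, where: str) -> pd.DataFrame:
    """Return one row per benchmark run plus its performance ratio to the baseline run."""
    out = df[["pb-size", "num-workers", "total-runtime"]].drop_duplicates(
        ignore_index=True
    )
    out["where"] = where
    baseline = out.loc[0, "total-runtime"]  # first run is assumed to be the baseline
    out["norm-tot-runtime"] = baseline / out["total-runtime"]  # >1 means faster
    return out


# Toy per-worker rows (made-up numbers): one run with 1 worker, one with 2 workers.
toy = pd.DataFrame(
    {
        "pb-size": ["264MB"] * 3,
        "num-workers": [1, 2, 2],
        "total-runtime": [0.6, 0.4, 0.4],
    }
)
table = runtime_table(toy, "S3")
print(table)
```

With such a helper, each of the four runtime DataFrames would be a single call, for example `runtime_table(s3_new_shots, "S3")`.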

### Reading All Data from a Single Shot File

Plots of performance ratio and runtime when reading all data from a single S3 shot file:

In [14]:
plot_kwargs = {
    "x": "num-workers",
    # "by": ["pb-size"],
}
(
    old_shots_runtime.hvplot.line(y="norm-tot-runtime", **plot_kwargs)
    * old_shots_runtime.hvplot.scatter(y="norm-tot-runtime", **plot_kwargs)
    * hv.HLine(1).opts(line_width=0.7, color="pink")
    * new_shots_runtime.hvplot.line(y="norm-tot-runtime", **plot_kwargs)
    * new_shots_runtime.hvplot.scatter(y="norm-tot-runtime", **plot_kwargs)
).options(
    legend_position="top_right",
    title="Shot File: Old (blue) vs New (red) ROS3",
    xlabel="Number of Dask workers",
    ylabel="Performance ratio (>1 better)",
    xlim=(0, old_shots_runtime["num-workers"].max() + 1),
    ylim=(0, None),
    height=400,
    width=500,
) + (
    old_shots_runtime.hvplot.line(y="total-runtime", **plot_kwargs)
    * old_shots_runtime.hvplot.scatter(y="total-runtime", **plot_kwargs)
    * new_shots_runtime.hvplot.line(y="total-runtime", **plot_kwargs)
    * new_shots_runtime.hvplot.scatter(y="total-runtime", **plot_kwargs)
).options(
    legend_position="bottom_right",
    title="Shot File: Old (blue) vs New (red) ROS3",
    xlabel="Number of Dask workers",
    ylabel="Total runtime / [s]",
    xlim=(0, old_shots_runtime["num-workers"].max() + 1),
    ylim=(0, None),
    show_grid=True,
    height=400,
    width=500,
)

### Reading All Signals from All Shot Files

Plots of performance ratio and runtime when reading all signals from all the shot files in S3:

In [15]:
plot_kwargs = {
    "x": "num-workers",
    # "by": ["pb-size"],
}
(
    old_signals_runtime.hvplot.line(y="norm-tot-runtime", **plot_kwargs)
    * old_signals_runtime.hvplot.scatter(y="norm-tot-runtime", **plot_kwargs)
    * hv.HLine(1).opts(line_width=0.7, color="pink")
    * new_signals_runtime.hvplot.line(y="norm-tot-runtime", **plot_kwargs)
    * new_signals_runtime.hvplot.scatter(y="norm-tot-runtime", **plot_kwargs)
).options(
    legend_position="top_right",
    title="All Signals All Shots: Old (blue) vs New (red) ROS3",
    xlabel="Number of Dask workers",
    ylabel="Performance ratio (>1 better)",
    xlim=(0, old_signals_runtime["num-workers"].max() + 1),
    ylim=(0, None),
    height=400,
    width=500,
) + (
    old_signals_runtime.hvplot.line(y="total-runtime", **plot_kwargs)
    * old_signals_runtime.hvplot.scatter(y="total-runtime", **plot_kwargs)
    * new_signals_runtime.hvplot.line(y="total-runtime", **plot_kwargs)
    * new_signals_runtime.hvplot.scatter(y="total-runtime", **plot_kwargs)
).options(
    legend_position="bottom_right",
    title="All Signals All Shots: Old (blue) vs New (red) ROS3",
    xlabel="Number of Dask workers",
    ylabel="Total runtime / [s]",
    xlim=(0, old_signals_runtime["num-workers"].max() + 1),
    ylim=(0, None),
    show_grid=True,
    height=400,
    width=500,
)
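
The plots can be complemented with a numeric comparison. The sketch below shows one way to compute the relative change in total runtime per worker count. The function name `relative_change`, the `change-%` column, and the toy numbers are all illustrative assumptions, not results from the benchmarks.

```python
import pandas as pd


def relative_change(old: pd.DataFrame, new: pd.DataFrame) -> pd.DataFrame:
    """Join old and new runtimes on num-workers and compute the percent change."""
    cols = ["num-workers", "total-runtime"]
    merged = old[cols].merge(new[cols], on="num-workers", suffixes=("_old", "_new"))
    # Negative values mean the new backend finished faster than the old one.
    merged["change-%"] = (
        100
        * (merged["total-runtime_new"] - merged["total-runtime_old"])
        / merged["total-runtime_old"]
    )
    return merged


# Toy runtimes in seconds (made-up values).
old_toy = pd.DataFrame({"num-workers": [8, 16], "total-runtime": [1000.0, 600.0]})
new_toy = pd.DataFrame({"num-workers": [8, 16], "total-runtime": [700.0, 600.0]})
change = relative_change(old_toy, new_toy)
print(change)
```

On the real data this would be called as `relative_change(old_signals_runtime, new_signals_runtime)`, assuming each worker count appears once per frame.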

### Display Some Worker Mean Runtimes

The `mean-obj-time` column holds mean read times of _objects_ in a shot file. Which _object_ is read depends on the `obj-type` column, with the values `shots` or `signals`. The box plot below compares this column between the `s3_old_signals` and `s3_new_signals` DataFrames.

In [16]:
(
    s3_old_signals.hvplot.box(
        y="mean-obj-time",
        by=["pb-size", "num-workers"],
    )
    * s3_new_signals.hvplot.box(
        y="mean-obj-time",
        by=["pb-size", "num-workers"],
    )
).options(
    # ylim=(8, 11),
    title="All Signals All Shots: Old (blue) vs New (red) ROS3",
    height=400,
    show_legend=False,
    xlabel="File page buffer size, Number of Dask workers",
    ylabel="Worker mean signal read time / [seconds]",
    show_grid=True,
)

In [17]:
s3_new_signals.groupby(["pb-size", "num-workers"])["mean-obj-time"].describe()

| pb-size | num-workers | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|---|
| 264MB | 8 | 488.0 | 0.251735 | 0.031577 | 0.237753 | 0.242595 | 0.245573 | 0.249351 | 0.513524 |
| 264MB | 16 | 974.0 | 0.258045 | 0.011587 | 0.238073 | 0.250987 | 0.254876 | 0.26053 | 0.320893 |
| 264MB | 24 | 1461.0 | 0.318738 | 0.028183 | 0.241442 | 0.303169 | 0.316548 | 0.331542 | 0.934952 |
| 264MB | 32 | 1947.0 | 0.420436 | 0.054816 | 0.238095 | 0.38426 | 0.415481 | 0.453103 | 0.903244 |
| 264MB | 48 | 2801.0 | 0.629601 | 0.126111 | 0.2401 | 0.546357 | 0.620591 | 0.703173 | 1.62152 |
| 264MB | 64 | 3815.0 | 0.841275 | 0.192385 | 0.246335 | 0.704887 | 0.825595 | 0.96042 | 1.944003 |
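
The table above covers only the new ROS3 data. A sketch of how the same statistics for both backends could be placed side by side follows; the function name `compare_mean_obj_time` and the toy frames with made-up values are assumptions for illustration.

```python
import pandas as pd


def compare_mean_obj_time(old: pd.DataFrame, new: pd.DataFrame) -> pd.DataFrame:
    """Summarize mean-obj-time per (pb-size, num-workers) for both backends in one table."""
    keys = ["pb-size", "num-workers"]
    aggs = ["mean", "median", "max"]
    return pd.concat(
        {
            "old": old.groupby(keys)["mean-obj-time"].agg(aggs),
            "new": new.groupby(keys)["mean-obj-time"].agg(aggs),
        },
        axis=1,  # yields two-level columns: (backend, statistic)
    )


# Toy per-worker rows (made-up values) standing in for the signals DataFrames.
old_toy = pd.DataFrame(
    {"pb-size": ["264MB"] * 2, "num-workers": [8, 8], "mean-obj-time": [0.2, 0.4]}
)
new_toy = pd.DataFrame(
    {"pb-size": ["264MB"] * 2, "num-workers": [8, 8], "mean-obj-time": [0.1, 0.3]}
)
side_by_side = compare_mean_obj_time(old_toy, new_toy)
print(side_by_side)
```

For the real data the call would be `compare_mean_obj_time(s3_old_signals, s3_new_signals)`.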


---