# Statistical and Visualization Functions: Thicket Tutorial

Thicket is a Python-based toolkit for Exploratory Data Analysis (EDA) of parallel performance data that enables performance optimization and understanding of applications’ performance on supercomputers. It bridges the performance tool gap between being able to consider only a single instance of a simulation run (e.g., single platform, single measurement tool, or single scale) and finding actionable insights in multi-dimensional, multi-scale, multi-architecture, and multi-tool performance datasets.

**NOTE: An interactive version of this notebook is available in the Binder environment.**

[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/llnl/thicket-tutorial/develop)

***

## 1. Import Necessary Packages

To explore the structure and various capabilities of thicket components, we begin by importing necessary packages. 

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
from IPython.display import display
from IPython.display import HTML
import hatchet as ht

import thicket as th

pd.set_option("display.max_rows", None)
pd.set_option("display.max_columns", None)

## 2. Read in Performance Profiles

For this notebook, we select profiles generated on the Lawrence Livermore National Laboratory (LLNL) machine quartz, as the data paths below indicate. We create two thicket objects from runs with the same problem size (8388608): one from profiles built with Clang 14.0.6 and one from profiles built with GCC 10.3.1.

In [None]:
clang = "../data/quartz/clang14.0.6_BaseSeq_8388608/"
gcc = "../data/quartz/GCC_10.3.1_BaseSeq_08388608/O3"

# create thickets for each dataset originating from clang and gcc compilers
clang_th = th.Thicket.from_caliperreader(clang)
gcc_th = th.Thicket.from_caliperreader(gcc)

## 3. More Information on a Function

You can use Python's built-in `help()` function to see the documentation for a given object by typing `help(object)`.
For a function, this shows its arguments and what it returns. An example is below.
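
As a small illustration of the information `help()` surfaces, the sketch below defines a hypothetical function (`node_median`, which is not part of Thicket) and reads its signature and docstring with the standard `inspect` module. These are the same pieces that `help()` prints.

```python
import inspect


def node_median(values):
    """Return the median of a list of numbers (hypothetical helper, not a Thicket API)."""
    data = sorted(values)
    mid = len(data) // 2
    # Odd length: the middle element; even length: the mean of the two middle elements.
    return data[mid] if len(data) % 2 else (data[mid - 1] + data[mid]) / 2


# The signature lists the arguments; the docstring describes the behavior.
signature = inspect.signature(node_median)
docstring = inspect.getdoc(node_median)
print(f"node_median{signature}")
print(docstring)
```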

In [None]:
help(th.median)

## 4. Creating a Combined Thicket

To demonstrate the functions on both a thicket and a combined thicket, we create a combined thicket.

In [None]:
combined_th = th.Thicket.concat_thickets(
    axis="columns",
    thickets=[clang_th, gcc_th],
    headers=["Clang", "GCC"]
)
combined_th.dataframe.head(5)

**NOTE**
- Single-indexed statistical functions append columns to the right.
- Columnar-joined thickets are ordered alphabetically by column index and by the columns under each column index.
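
To make the two-level column structure concrete, here is a minimal pandas sketch. It is an analogy for the columnar join, not Thicket's implementation; the node names and timing values are made up for illustration.

```python
import pandas as pd

# Hypothetical per-compiler tables indexed by (node, profile); values are made up.
index = pd.MultiIndex.from_product(
    [["Algorithm_MEMCPY", "Algorithm_MEMSET"], [0, 1]], names=["node", "profile"]
)
clang_df = pd.DataFrame({"time (exc)": [1.0, 1.2, 0.5, 0.7]}, index=index)
gcc_df = pd.DataFrame({"time (exc)": [0.9, 1.1, 0.6, 0.8]}, index=index)

# Joining along columns adds an outer column level, one header per input table.
combined_df = pd.concat([clang_df, gcc_df], axis="columns", keys=["Clang", "GCC"])

# Columns are now addressed by (column_index, column_name) tuples.
gcc_time = combined_df[("GCC", "time (exc)")]
print(combined_df.columns.tolist())
```

This is why the multi-indexed examples in this notebook pass tuples such as `("GCC", "Machine clears")` rather than plain column names.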

## 5. Aggregated Statistic Functions

### 5.1 Maximum

The `maximum` function will determine the maximum value for each node in the performance data table. In other words, the maximum is the highest observation for a node across its associated profiles. <br>

The maximum value will be appended to the aggregated statistics table and will be denoted with `_max` at the end of the column name, i.e., `column_max`.
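
The per-node aggregation can be pictured with plain pandas. The sketch below is an assumption-laden analogy (made-up node names and values, and a `groupby` that may differ from what Thicket does internally): it reduces several profiles per node to one row per node and names the result with the `_max` suffix.

```python
import pandas as pd

# Hypothetical performance data: two nodes, three profiles each (values are made up).
perf = pd.DataFrame(
    {
        "node": ["MEMCPY"] * 3 + ["MEMSET"] * 3,
        "profile": [0, 1, 2] * 2,
        "time (exc)": [1.0, 1.4, 1.2, 0.5, 0.3, 0.4],
    }
).set_index(["node", "profile"])

# One row per node: the maximum over all of that node's profiles.
stats = perf.groupby(level="node").max().add_suffix("_max")
print(stats)
```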

**Single Index Thicket Example**

In [None]:
# define metrics to calculate the maximum on
metrics = ["time (exc)", "Machine clears"]

In [None]:
th.maximum(clang_th, columns=metrics)
# view the first 5 entries of the aggregated statistics table
clang_th.statsframe.dataframe.head(5)

**Multi-Indexed Thicket Example**

This example shows how to pass a columnar-joined thicket. In that case, the `columns` argument takes a list of tuples. Each tuple has two elements: the first is the column index, and the second is a column under that column index.<br>

 - Example: (column_index, column_name) -> ("GCC", "Machine clears")

In [None]:
metrics = [("Clang", "time (exc)"), ("GCC", "Machine clears")]

In [None]:
th.maximum(combined_th, columns=metrics)
combined_th.statsframe.dataframe.head(5)

### 5.2 Minimum

The `minimum` function will determine the minimum value for each node in the performance data table. In other words, the minimum is the lowest observation for a node across its associated profiles. <br>

The minimum value will be appended to the aggregated statistics table and will be denoted with `_min` at the end of the column name, i.e., `column_min`.

**Single Index Thicket Example**

In [None]:
metrics = ["time (exc)", "Machine clears"]

In [None]:
th.minimum(clang_th, columns=metrics)
clang_th.statsframe.dataframe.head(5)

**Multi-Indexed Thicket Example**

This example shows how to pass a columnar-joined thicket. In that case, the `columns` argument takes a list of tuples. Each tuple has two elements: the first is the column index, and the second is a column under that column index.<br>

 - Example: (column_index, column_name) -> ("GCC", "Machine clears")

In [None]:
metrics = [("Clang", "time (exc)"), ("GCC", "Machine clears")]

In [None]:
th.minimum(combined_th, columns=metrics)
combined_th.statsframe.dataframe.head(5)

### 5.3 Median

The `median` function will determine the median for each node in the performance data table. <br>

The median value will be appended to the aggregated statistics table and will be denoted with `_median` at the end of the column name, i.e., `column_median`.

**Single Index Thicket Example**

In [None]:
metrics = ["time (exc)", "Machine clears"]

In [None]:
th.median(clang_th, columns=metrics)
clang_th.statsframe.dataframe.head(5)

**Multi-Indexed Thicket Example**

This example shows how to pass a columnar-joined thicket. In that case, the `columns` argument takes a list of tuples. Each tuple has two elements: the first is the column index, and the second is a column under that column index.<br>

 - Example: (column_index, column_name) -> ("GCC", "Machine clears") 

In [None]:
metrics = [("Clang", "time (exc)"), ("GCC", "Machine clears")]

In [None]:
th.median(combined_th, columns=metrics)
combined_th.statsframe.dataframe.head(5)

### 5.4 Mean

The `mean` function will determine the mean for each node in the performance data table. <br>

The mean value will be appended to the aggregated statistics table and will be denoted with `_mean` at the end of the column name, i.e., `column_mean`.
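
When choosing between the mean and the median, it helps to see how each reacts to a single slow run. The sketch below uses made-up timings for one node and plain pandas, not Thicket.

```python
import pandas as pd

# Hypothetical timings for one node across five profiles; the last run is an outlier.
times = pd.Series([1.0, 1.1, 0.9, 1.0, 6.0], name="time (exc)")

mean_time = times.mean()      # pulled upward by the outlier
median_time = times.median()  # middle value, unaffected by the outlier
print(mean_time, median_time)
```

A large gap between the two values for a node is a hint that a few profiles behave very differently from the rest.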

**Single Index Thicket Example**

In [None]:
metrics = ["time (exc)", "Machine clears"]

In [None]:
th.mean(clang_th, columns=metrics)
clang_th.statsframe.dataframe.head(5)

**Multi-Indexed Thicket Example**

This example shows how to pass a columnar-joined thicket. In that case, the `columns` argument takes a list of tuples. Each tuple has two elements: the first is the column index, and the second is a column under that column index.<br>

 - Example: (column_index, column_name) -> ("GCC", "Machine clears") 

In [None]:
metrics = [("Clang", "time (exc)"), ("GCC", "Machine clears")]

In [None]:
th.mean(combined_th, columns=metrics)
combined_th.statsframe.dataframe.head(5)

### 5.5 Variance

The `variance` function will determine the variance for each node in the performance data table. Variance lets a user see the spread of the data within a node across that node's associated profiles. <br>

The variance value will be appended to the aggregated statistics table and will be denoted with `_var` at the end of the column name, i.e., `column_var`.
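
Variance has two common conventions: the sample variance divides by `n - 1`, and the population variance divides by `n`. This tutorial does not state which one Thicket applies, so the pandas sketch below (made-up data) simply shows both, along with the standard deviation as the square root of the variance.

```python
import numpy as np
import pandas as pd

# Hypothetical performance data: two nodes, three profiles each (values are made up).
perf = pd.DataFrame(
    {
        "node": ["MEMCPY"] * 3 + ["MEMSET"] * 3,
        "profile": [0, 1, 2] * 2,
        "time (exc)": [1.0, 2.0, 3.0, 2.0, 2.0, 2.0],
    }
).set_index(["node", "profile"])

grouped = perf.groupby(level="node")["time (exc)"]
sample_var = grouped.var(ddof=1)      # divides by n - 1
population_var = grouped.var(ddof=0)  # divides by n
sample_std = np.sqrt(sample_var)      # standard deviation is the square root of variance
print(sample_var, population_var, sample_std, sep="\n")
```

With only a handful of profiles per node the two conventions differ noticeably, so it is worth checking `help(th.variance)` before comparing against values computed elsewhere.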

**Single Index Thicket Example**

In [None]:
metrics = ["time (exc)", "Machine clears"]

In [None]:
th.variance(clang_th, columns=metrics)
clang_th.statsframe.dataframe.head(5)

**Multi-Indexed Thicket Example**

This example shows how to pass a columnar-joined thicket. In that case, the `columns` argument takes a list of tuples. Each tuple has two elements: the first is the column index, and the second is a column under that column index.<br>

 - Example: (column_index, column_name) -> ("GCC", "Machine clears")  

In [None]:
metrics = [("Clang", "time (exc)"), ("GCC", "Machine clears")]

In [None]:
th.variance(combined_th, columns=metrics)
combined_th.statsframe.dataframe.head(5)

### 5.6 Standard Deviation

The `std` function will determine the standard deviation for each node in the performance data table. Standard deviation describes how dispersed the data is in relation to the mean. <br>

The standard deviation value will be appended to the aggregated statistics table and will be denoted with `_std` at the end of the column name, i.e., `column_std`.

**Single Index Thicket Example**

In [None]:
metrics = ["time (exc)", "Machine clears"]

In [None]:
th.std(clang_th, columns=metrics)
clang_th.statsframe.dataframe.head(5)

**Multi-Indexed Thicket Example**

This example shows how to pass a columnar-joined thicket. In that case, the `columns` argument takes a list of tuples. Each tuple has two elements: the first is the column index, and the second is a column under that column index.<br>

 - Example: (column_index, column_name) -> ("GCC", "Machine clears")   

In [None]:
metrics = [("Clang", "time (exc)"), ("GCC", "Machine clears")]

In [None]:
th.std(combined_th, columns=metrics)
combined_th.statsframe.dataframe.head(5)

### 5.7 Percentiles

The `percentiles` function will determine the q-th percentiles for each node in the performance data table. <br>

 - The 25th percentile is the lower quartile: 25% of the values lie below it.
 - The 50th percentile is the median: half of the values lie below it and half lie above it.
 - The 75th percentile is the upper quartile: 75% of the values lie below it and 25% lie above it.

The calculated percentiles will be appended to the aggregated statistics table and will be denoted with `_percentiles` at the end of the column name, i.e., `column_percentiles`.
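
The three quartiles described above can be computed for a single node with NumPy. This is a sketch on made-up values using NumPy's default linear interpolation; whether Thicket interpolates the same way is an assumption.

```python
import numpy as np

# Hypothetical measurements for one node across five profiles.
values = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# Lower quartile, median, and upper quartile in one call.
q25, q50, q75 = np.percentile(values, [25, 50, 75])
print(q25, q50, q75)
```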

**Single Index Thicket Example**

In [None]:
metrics = ["time (exc)", "Machine clears"]

In [None]:
th.percentiles(clang_th, columns=metrics)
clang_th.statsframe.dataframe.head(5)

**Multi-Indexed Thicket Example**

This example shows how to pass a columnar-joined thicket. In that case, the `columns` argument takes a list of tuples. Each tuple has two elements: the first is the column index, and the second is a column under that column index.<br>

 - Example: (column_index, column_name) -> ("GCC", "Machine clears")

In [None]:
metrics = [("Clang", "time (exc)"), ("GCC", "Machine clears")]

In [None]:
th.percentiles(combined_th, columns=metrics)
combined_th.statsframe.dataframe.head(5)

### 5.8 Check Normality

The `check_normality` function will determine whether the data is normal or non-normal for each node in the performance data table. For this test, the more data the better. Preferably, you would have 20 data points (20 files) in a dataset to get an accurate result. <br>

A `True` boolean will be appended to the aggregated statistics table if the data is normal, and a `False` boolean will be appended if the data is non-normal. The appended column will be denoted with `_normality` at the end of the column name, i.e., `column_normality`.

**Single Index Thicket Example**

In [None]:
metrics = ["time (exc)", "Machine clears"]

In [None]:
th.check_normality(clang_th, columns=metrics)
clang_th.statsframe.dataframe.head(5)

**Multi-Indexed Thicket Example**

This example shows how to pass a columnar-joined thicket. In that case, the `columns` argument takes a list of tuples. Each tuple has two elements: the first is the column index, and the second is a column under that column index.<br>

 - Example: (column_index, column_name) -> ("GCC", "Machine clears")

In [None]:
metrics = [("Clang", "time (exc)"), ("GCC", "Machine clears")]

In [None]:
th.check_normality(combined_th, columns=metrics)
combined_th.statsframe.dataframe.head(5)

### 5.9 Nodewise Correlation

The `correlation_nodewise` function will perform nodewise correlation for each node in the performance data table. <br>

The correlation values will be appended to the aggregated statistics table and will be denoted with the correlation type, which can be one of `{pearson, spearman, kendall}`. <br>

When working with a multi-indexed thicket (columnar join), a new column index titled `Union statistics` will be created. See the **Multi-Indexed Thicket Example** for how this works.
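
To show what "nodewise" means here, the sketch below computes one Spearman coefficient per node with plain pandas on made-up data. Spearman correlation is the Pearson correlation of the ranked values, which is how the hypothetical `spearman` helper computes it; this is not Thicket's code.

```python
import pandas as pd

# Hypothetical performance data: two nodes, four profiles each (values are made up).
perf = pd.DataFrame(
    {
        "node": ["MEMCPY"] * 4 + ["MEMSET"] * 4,
        "profile": [0, 1, 2, 3] * 2,
        "time (exc)": [1.0, 2.0, 3.0, 4.0, 1.0, 2.0, 3.0, 4.0],
        "Machine clears": [10.0, 20.0, 40.0, 80.0, 9.0, 7.0, 4.0, 1.0],
    }
).set_index(["node", "profile"])


def spearman(group):
    """Spearman correlation: Pearson correlation of the ranked values."""
    return group["time (exc)"].rank().corr(group["Machine clears"].rank())


# One coefficient per node, computed over that node's profiles only.
correlations = {node: spearman(group) for node, group in perf.groupby(level="node")}
print(correlations)
```

In this made-up data, one node's metrics rise together (coefficient 1) while the other's move in opposite directions (coefficient -1), a contrast that a single correlation over all rows would hide.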

**Single Index Thicket Example**

In [None]:
th.correlation_nodewise(clang_th, column1="time (exc)", column2="Machine clears", correlation="spearman")

In [None]:
clang_th.statsframe.dataframe.head(5)

**Multi-Indexed Thicket Example**

This example shows how to pass a columnar-joined thicket. In that case, the `column1` and `column2` arguments each take a tuple with two elements: the first is the column index, and the second is a column under that column index.<br>

 - Example: (column_index, column_name) -> ("GCC", "Machine clears") 

In [None]:
th.correlation_nodewise(combined_th, column1=("Clang", "time (exc)"), column2=("GCC", "Machine clears"), correlation="spearman")

In [None]:
combined_th.statsframe.dataframe.head(5)

### 5.10 Calculate Boxplot

The `calc_boxplot_statistics` function will calculate the boxplot statistics `{q1, q2, q3, iqr, lowerfence, upperfence}` for each node in the performance data table. <br>

Each of these statistics is calculated for every passed column. The columns appended to the aggregated statistics table will be denoted with the column name plus one of `{_q1, _q2, _q3, _iqr, _lowerfence, _upperfence}`. <br>

Lastly, `calc_boxplot_statistics` will calculate outliers as well. The outliers will be appended to the aggregated statistics table and denoted with `_outliers` at the end of the column name, i.e., `column_outliers`.
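
The statistics listed above can be sketched for one node with NumPy. The fence rule used here (quartile plus or minus 1.5 times the interquartile range) is the conventional boxplot definition and is an assumption about what Thicket computes; the values are made up.

```python
import numpy as np

# Hypothetical measurements for one node across six profiles; 20.0 is unusually large.
values = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 20.0])

q1, q2, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1                   # interquartile range
lowerfence = q1 - 1.5 * iqr     # assumed conventional fence rule
upperfence = q3 + 1.5 * iqr

# Outliers are the observations that fall outside the fences.
outliers = values[(values < lowerfence) | (values > upperfence)]
print(q1, q2, q3, iqr, lowerfence, upperfence, outliers)
```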

**Single Index Thicket Example**

In [None]:
metrics = ["time (exc)", "Machine clears"]

In [None]:
th.calc_boxplot_statistics(clang_th, columns=metrics)
clang_th.statsframe.dataframe.head(5)

**Multi-Indexed Thicket Example**

This example shows how to pass a columnar-joined thicket. In that case, the `columns` argument takes a list of tuples. Each tuple has two elements: the first is the column index, and the second is a column under that column index.<br>

 - Example: (column_index, column_name) -> ("GCC", "Machine clears")

In [None]:
metrics = [("Clang", "time (exc)"), ("GCC", "Machine clears")]

In [None]:
th.calc_boxplot_statistics(combined_th, columns=metrics)
combined_th.statsframe.dataframe.head(5)

## 6. Visualization Thicket Functions

### 6.1 Displaying Heatmap

The `display_heatmap` function will display a color-encoded map in which the color represents the magnitude of the value in each node and column cell. <br>

Columns must come from the aggregated statistics table.

**Single Index Thicket Example**

In [None]:
# filter nodes to first five entries in the aggregated statistics table
stats_nodes = [
    "RAJAPerf", 
    "Algorithm", 
    "Algorithm_MEMCPY", 
    "Algorithm_MEMSET", 
    "Algorithm_REDUCE_SUM"
]
th_stats_name = clang_th.filter_stats(lambda x: x["name"] in stats_nodes)

In [None]:
plt.figure(figsize=(30, 30))
metrics = ["time (exc)_std", "time (exc)_var"]
th.display_heatmap(th_stats_name, columns=metrics)

**Multi-Indexed Thicket Example**

This example shows how to pass a columnar-joined thicket. In that case, the `columns` argument takes a list of tuples. Each tuple has two elements: the first is the column index, and the second is a column under that column index.<br>

 - Example: (column_index, column_name) -> ("GCC", "Machine clears")

In [None]:
plt.figure(figsize=(30, 30))
metrics = [("Clang", "time (exc)_std"), ("Clang", "time (exc)_var")]
th.display_heatmap(combined_th, columns=metrics)

**Example of Not Passing the Same Column Index for Multi-Indexed Thicket**

In [None]:
#metrics = [("Clang", "time (exc)_std"), ("GCC", "Machine clears_var")]
#th.display_heatmap(combined_th, columns=metrics)

### 6.2 Displaying Histogram

The `display_histogram` function will display a histogram for a user-passed node and column. The node and column must come from the performance data table. <br>

A histogram lets a user see outliers and the overall distribution of the data.
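
The binning behind a histogram can be sketched with NumPy on made-up values for one node. The choice of three equal-width bins is arbitrary and is not a statement about the defaults `display_histogram` uses.

```python
import numpy as np

# Hypothetical measurements for one node across seven profiles; 3.0 is an outlier.
values = np.array([1.0, 1.1, 0.9, 1.2, 1.0, 1.1, 3.0])

# Split the value range into three equal-width bins and count observations per bin.
counts, bin_edges = np.histogram(values, bins=3)
print(counts, bin_edges)
```

An empty bin separating one observation from the rest, as the middle bin does here, is the kind of pattern that points to an outlier profile.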

In [None]:
# Getting nodes to pass
n = pd.unique(combined_th.dataframe.reset_index()["node"])[4]

**Single Index Thicket Example**

In [None]:
plt.figure(figsize=(30, 30))
th.display_histogram(clang_th, node=n, column="Machine clears")

**Multi-Indexed Thicket Example**

This example shows how to pass a columnar-joined thicket. In that case, the `column` argument takes a tuple with two elements: the first is the column index, and the second is a column under that column index.<br>

 - Example: (column_index, column_name) -> ("GCC", "Machine clears")

In [None]:
plt.figure(figsize=(30, 30))
th.display_histogram(combined_th, node=n, column=("Clang", "Machine clears"))

**Example of Error if Not Passing a Hatchet Node**

In [None]:
#n = "not_hatchet_node"
#th.display_histogram(combined_th, node=n, column=("Clang", "Machine clears"))

### 6.3 Displaying Boxplot

The `display_boxplot` function will display a boxplot for each passed-in node and column. The passed nodes and columns must come from the performance data table.

In [None]:
# Getting nodes to pass
n = pd.unique(combined_th.dataframe.reset_index()["node"])[0:2].tolist()

**Single Index Thicket Example**

In [None]:
plt.figure(figsize=(30, 30))
th.display_boxplot(clang_th, nodes=n, columns=["Machine clears", "Frontend latency"])

**Multi-Indexed Thicket Example**

This example shows how to pass a columnar-joined thicket. In that case, the `columns` argument takes a list of tuples. Each tuple has two elements: the first is the column index, and the second is a column under that column index.<br>

 - Example: (column_index, column_name) -> ("GCC", "Machine clears")

In [None]:
plt.figure(figsize=(30, 30))
th.display_boxplot(combined_th, nodes=n, columns=[("Clang", "time (exc)"), ("Clang", "Machine clears")])

**Example of Error if Not Passing Same Column Index**

In [None]:
#th.display_boxplot(combined_th, nodes=n, columns=[("Clang", "time (exc)"), ("GCC", "Machine clears")])