# Using Groupby-Aggregate to Compose Multi-Run Datasets: Thicket Tutorial

Thicket is a Python-based toolkit for Exploratory Data Analysis (EDA) of parallel performance data. It helps users understand and optimize application performance on supercomputers. It bridges the performance tool gap between being able to consider only a single instance of a simulation run (e.g., single platform, single measurement tool, or single scale) and finding actionable insights in multi-dimensional, multi-scale, multi-architecture, and multi-tool performance datasets.

**NOTE: An interactive version of this notebook is available in the Binder environment.**

[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/llnl/thicket-tutorial/develop)

***

## 1. Import Necessary Packages

In [None]:
from glob import glob
import numpy as np
from IPython.display import display
from IPython.display import HTML

import thicket as th

display(HTML("<style>.container { width:80% !important; }</style>"))

In [None]:
# Silence FutureWarning messages (e.g., those raised ahead of pandas 3) for now
import warnings
warnings.filterwarnings("ignore", category=FutureWarning)

## 2. Define Dataset Paths and Names

In this example, we load two repeat runs generated on Lassen. We can use `glob` to find all of the Caliper (`.cali`) files in a given directory and its subdirectories.

In [None]:
data = glob("../data/lassen/clang10.0.1_nvcc10.2.89_1048576/**/*.cali", recursive=True)
tk = th.Thicket.from_caliperreader(data)
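
The recursive pattern used above can be checked in isolation. The following is a minimal sketch that builds a temporary directory tree and applies the same `**/*.cali` pattern; the helper `find_cali_files` and the file names are hypothetical and only serve the illustration.

```python
import tempfile
from glob import glob
from pathlib import Path


def find_cali_files(root):
    """Return every .cali file under root, at any depth, in sorted order."""
    # With recursive=True, "**" matches zero or more directory levels.
    return sorted(glob(f"{root}/**/*.cali", recursive=True))


# Build a throwaway directory tree; these file names are hypothetical.
with tempfile.TemporaryDirectory() as root:
    for rel in ["run1/a.cali", "run2/nested/b.cali", "top.cali", "run1/notes.txt"]:
        path = Path(root, rel)
        path.parent.mkdir(parents=True, exist_ok=True)
        path.touch()
    # Store paths relative to the root so the result is easy to read.
    found = [Path(p).relative_to(root).as_posix() for p in find_cali_files(root)]

print(found)
```

Because `**` also matches zero directories, a file placed directly in the root is picked up along with the nested ones, while files with other extensions are ignored.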

## 3. Groupby

Group the profiles by the unique combinations of `variant` and `tuning` from the metadata table. In general, these will be the parameters you varied in your runs.

After performing the groupby, we can see that each resulting Thicket contains multiple profiles. To perform certain composition operations in Thicket, we need to aggregate the performance data (`Thicket.dataframe`).

In [None]:
gb = tk.groupby(["variant", "tuning"])

In [None]:
for key, ttk in gb.items():
    print(f"key {key} contains {len(ttk.profile)} profiles")
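
To make the grouping step concrete, here is a plain-Python sketch of the idea, independent of Thicket's own implementation: profiles are partitioned by the unique combinations of the chosen metadata columns. The metadata dictionary, the profile names, the `block_256` tuning, and the helper `group_profiles` are all hypothetical.

```python
from collections import defaultdict

# Hypothetical per-profile metadata; the values are illustrative only.
metadata = {
    "profile_0": {"variant": "Base_CUDA", "tuning": "block_128"},
    "profile_1": {"variant": "Base_CUDA", "tuning": "block_128"},
    "profile_2": {"variant": "Base_CUDA", "tuning": "block_256"},
    "profile_3": {"variant": "Base_CUDA", "tuning": "block_256"},
}


def group_profiles(metadata, by):
    """Map each unique combination of the `by` columns to its profile ids."""
    groups = defaultdict(list)
    for profile, row in metadata.items():
        # The group key is the tuple of values in the grouping columns.
        key = tuple(row[column] for column in by)
        groups[key].append(profile)
    return dict(groups)


groups = group_profiles(metadata, ["variant", "tuning"])
for key, profiles in groups.items():
    print(f"key {key} contains {len(profiles)} profiles")
```

Each key is a tuple with one entry per grouping column, which mirrors how the keys of the groupby object are indexed later in this notebook, e.g. `('Base_CUDA', 'block_128')`.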

## 4. Aggregation

Using the `aggregate_thicket` function, we can aggregate each Thicket in the groupby object individually.

In [None]:
gb_agg = {}
for key, ttk in gb.items():
    gb_agg[key] = gb.aggregate_thicket(ttk, np.mean)

display(gb_agg[('Base_CUDA', 'block_128')].dataframe)
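
Conceptually, aggregating one group collapses its repeat profiles so that each call-tree node is left with a single value per metric. The sketch below illustrates that idea in plain Python under this assumption; it is not Thicket's implementation, and the node names, timing values, and the helper `aggregate_by_node` are hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (node, profile, time) rows for one group; values are made up.
rows = [
    ("Stream_ADD", "profile_0", 2.0),
    ("Stream_ADD", "profile_1", 4.0),
    ("Stream_COPY", "profile_0", 1.0),
    ("Stream_COPY", "profile_1", 2.0),
]


def aggregate_by_node(rows, func):
    """Collapse the profile dimension by applying func to each node's values."""
    by_node = defaultdict(list)
    for node, _profile, value in rows:
        by_node[node].append(value)
    return {node: func(values) for node, values in by_node.items()}


aggregated = aggregate_by_node(rows, mean)
print(aggregated)
```

Any reduction that maps a list of values to one number can be passed as `func`, which corresponds to passing `np.mean` in the cell above.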

Alternatively, we can call `agg` to aggregate and create a composed dataframe in one step:

In [None]:
tk_agg = gb.agg(np.mean)

display(tk_agg.dataframe)