# Trend Analysis for the RAJA Performance Suite: Data Transformation to CSV

As high-performance computing (HPC) systems become more heterogeneous and complex, understanding how code performs on them becomes increasingly important. Researchers often do this with benchmark suites that represent key aspects of full applications. One such suite is the RAJA Performance Suite (RAJAPerf) [1], which consists of computational kernels of interest to Lawrence Livermore National Laboratory (LLNL).

This dataset captures the performance of the different kernels in RAJAPerf when running on single nodes of LLNL's [Lassen supercomputer](https://hpc.llnl.gov/hardware/compute-platforms/lassen). To generate this data, we ran RAJAPerf with varying problem sizes and numbers of MPI ranks. We ran the kernels on both Lassen's POWER9 CPUs and V100 GPUs. The performance data reported by RAJAPerf is collected with LLNL's [Caliper profiler](https://software.llnl.gov/Caliper/).

The performance data is in the `.cali` files in the `data_generation` directory. Users can analyze these files directly using LLNL's [Thicket library](https://thicket.readthedocs.io/en/latest/). However, some users may prefer the data in a more conventional format.

To that end, we provide this notebook to convert the RAJAPerf performance data into `.csv` files. Each generated `.csv` file holds the targeted performance metric for a single kernel across all problem sizes and numbers of MPI ranks. These `.csv` files also serve as input to `plot_analysis.ipynb`, a notebook that generates surface plots of kernel performance across all problem sizes and numbers of MPI ranks.
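
As a rough sketch of how one of these files can be used downstream, the snippet below builds a tiny table with the same three columns this notebook writes (`ranks`, `problem_sizes`, and the metric, `Avg time/rank` by default) and pivots it into the rank-by-problem-size grid that a surface plot needs. The numeric values are made up for illustration, and `to_grid` is a hypothetical helper, not part of this notebook or of `plot_analysis.ipynb`.

```python
import pandas as pd

METRIC = "Avg time/rank"


def to_grid(df, metric=METRIC):
    """Pivot long-format rows into a ranks x problem_sizes grid."""
    return df.pivot(index="ranks", columns="problem_sizes", values=metric)


# Made-up values; a real file holds one row per (ranks, problem size) pair
long_df = pd.DataFrame({
    "ranks": [2, 2, 4, 4],
    "problem_sizes": [1048576, 2097152, 1048576, 2097152],
    METRIC: [0.10, 0.21, 0.06, 0.11],
})

grid = to_grid(long_df)
print(grid)
```

Each row of `grid` corresponds to one rank count and each column to one problem size, which is the layout most surface-plotting routines expect.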

## 1. Import Necessary Packages

In [None]:
# Standard library imports
from warnings import simplefilter

# Third-party imports
import pandas as pd
from tqdm import tqdm

# Performance analysis
import thicket as th

# Ignore pandas performance warnings
simplefilter(action="ignore", category=pd.errors.PerformanceWarning)

## 2. Helper Functions

In [None]:
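def get_csv_data(th_dict, focus_kernel, problem_sizes, n_ranks, metric="Avg time/rank"):
    """Collect one metric for one kernel across all ranks and problem sizes.

    th_dict maps number of ranks -> problem size -> Thicket. The result is a
    long-format DataFrame with one row per (ranks, problem size) pair.
    """
    idx = pd.IndexSlice

    out_data = {
        "ranks": [],
        "problem_sizes": [],
        metric: [],
    }

    for rank in n_ranks:
        for problem_size in problem_sizes:
            th_x = th_dict[rank][problem_size]
            node = th_x.get_node(focus_kernel)

            out_data["ranks"].append(rank)
            out_data["problem_sizes"].append(problem_size)
            # Use the first row that matches this kernel's node
            out_data[metric].append(th_x.dataframe.loc[idx[node, :], metric].iloc[0])

    return pd.DataFrame(out_data)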

In [None]:
def kernel_query(kernel_list):
    return th.query.Query().match(
        ".",
        lambda row: row["name"].apply(
            lambda n: n in kernel_list
        ).all()
    ).rel("*")

In [None]:
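def prune_thickets(gb_th):
    """Keep, for each tuning, only the kernels listed for it in lassen_gpu_kernels."""
    new_gb = {}
    for k in gb_th.keys():
        # k[0] is the tuning name, which is looked up in lassen_gpu_kernels
        if k[0] in lassen_gpu_kernels:
            new_gb[k] = gb_th[k].query(kernel_query(lassen_gpu_kernels[k[0]]), multi_index_mode="all")

    return new_gb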

## 3. CSV Generation for CPU Runs

This section generates the CSVs for runs on Lassen's POWER9 CPUs. This data was generated by running RAJAPerf with the `RAJA_Seq` variant and `default` tuning. More information on these settings can be found in Section 5.

### 3.1 Define Focus Ranks and Problem Sizes


The `n_ranks_cpu` and `problem_sizes_cpu` variables can be used to subselect `.cali` files based on the number of MPI ranks and problem sizes. Their default values select all `.cali` files for CPU runs.
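
For example, a smaller sweep can be selected by narrowing both lists before running the cells that follow. The sketch below assumes the default values defined in this section; the `_subset` names and the chosen bounds are hypothetical.

```python
# Defaults from this section
n_ranks_cpu = list(range(2, 161, 2))
problem_sizes_cpu = [i * 1024 * 1024 for i in range(1, 41)]

# Hypothetical subset: at most 16 ranks, problem sizes up to 8M
n_ranks_subset = [r for r in n_ranks_cpu if r <= 16]
problem_sizes_subset = [p for p in problem_sizes_cpu if p <= 8 * 1024 * 1024]

# Each (ranks, problem size) pair corresponds to one CPU .cali file
n_files_full = len(n_ranks_cpu) * len(problem_sizes_cpu)
n_files_subset = len(n_ranks_subset) * len(problem_sizes_subset)
print(n_files_full, n_files_subset)
```

Assigning the subset lists to `n_ranks_cpu` and `problem_sizes_cpu` would then restrict the later cells to those runs.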

In [None]:
n_ranks_cpu = list(range(2, 161, 2))
problem_sizes_cpu = [i * 1024 * 1024 for i in range(1, 41)]

### 3.2 Read Performance Data into Thicket

These cells read the `.cali` files selected in Section 3.1 using Thicket. Users should not change these cells unless they move the `.cali` files.
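
Before launching a long read, it can help to confirm that the expected files are present. The following sketch rebuilds the path pattern used in this section and lists any missing files. `expected_cali_path` and `missing_files` are hypothetical helpers, and the directory layout is assumed to match the one used by the read loop in this section.

```python
from pathlib import Path


def expected_cali_path(root, rank, problem_size, cali_name="RAJA_Seq-default.cali"):
    """Build the path of one run's .cali file under the assumed layout."""
    run_dir = f"raja-perf_suite_caliper_custom_{rank}_{problem_size}"
    return Path(root) / run_dir / cali_name


def missing_files(root, n_ranks, problem_sizes):
    """Return the expected .cali paths that do not exist on disk."""
    missing = []
    for rank in n_ranks:
        for problem_size in problem_sizes:
            path = expected_cali_path(root, rank, problem_size)
            if not path.is_file():
                missing.append(path)
    return missing


missing = missing_files("data_generation/lassen_data_cpu", [2, 4], [1048576])
print(f"{len(missing)} file(s) missing")
```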

In [None]:
ROOT_PATH_CPU = "data_generation/lassen_data_cpu/"

In [None]:
th_dict_cpu = {}
th_lst_cpu = []
th_names_cpu = []
for rank in tqdm(n_ranks_cpu):
    thickets_per_rank = {}
    for problem_size in problem_sizes_cpu:
        # One .cali file per (ranks, problem size) run directory
        cali_file = f"{ROOT_PATH_CPU}/raja-perf_suite_caliper_custom_{rank}_{problem_size}/RAJA_Seq-default.cali"
        thickets_per_rank[problem_size] = th.Thicket.from_caliperreader(cali_file, disable_tqdm=True)
        th_lst_cpu.append(thickets_per_rank[problem_size])
        th_names_cpu.append(f"n_{rank}_p_{problem_size}")
    th_dict_cpu[rank] = thickets_per_rank

### 3.3 Converting Data Points to CSV

These cells convert data contained in the `Thicket` objects from Section 3.2 into `.csv` files. Users can modify the `cpu_csv_out_dir` variable to configure where the `.csv` files are written.
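
Note that `DataFrame.to_csv` does not create missing parent directories, so the output directory needs to exist before the `.csv` files are written. A minimal sketch, using a temporary directory as a stand-in for `cpu_csv_out_dir` and a made-up one-row table (the kernel name `Example_KERNEL` is hypothetical):

```python
import os
import tempfile

import pandas as pd

with tempfile.TemporaryDirectory() as tmp:
    # Stand-in for cpu_csv_out_dir
    out_dir = os.path.join(tmp, "csv_files", "raw", "lassen", "cpu")
    os.makedirs(out_dir, exist_ok=True)  # no error if it already exists

    # Made-up row with the same columns this notebook writes
    df = pd.DataFrame({"ranks": [2], "problem_sizes": [1048576], "Avg time/rank": [0.1]})
    out_file = os.path.join(out_dir, "lassen-Example_KERNEL.csv")
    df.to_csv(out_file, index=False)

    written = os.path.isfile(out_file)
    round_trip = pd.read_csv(out_file)

print(written)
```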

In [None]:
cpu_csv_out_dir = "csv_files/raw/lassen/cpu/"

In [None]:
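# Top-level and kernel-group nodes, which are not kernels themselves
excluded_nodes = ["RAJAPerf", "Algorithm", "Apps", "Basic", "Comm", "Lcals", "Polybench", "Stream"]

# Collect every node name in the call graph, skipping the excluded nodes
nodes = [
    node.frame["name"]
    for node in th_lst_cpu[0].graph.traverse()
    if node.frame["name"] not in excluded_nodes
]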

In [None]:
for focus_kernel in tqdm(nodes, desc="Writing csv files"):
    out_file_name = f"{cpu_csv_out_dir}/lassen-{focus_kernel}.csv"
    per_kernel_data = get_csv_data(th_dict_cpu, focus_kernel, problem_sizes_cpu, n_ranks_cpu)
    per_kernel_data.to_csv(out_file_name, index=False)

## 4. CSV Generation for GPU Runs

This section generates the CSVs for runs on Lassen's V100 GPUs. This data was generated by running RAJAPerf with the `RAJA_CUDA` variant and various tunings. More information on these settings can be found in Section 5.

### 4.1 Selecting Kernels for Extraction from Specific Tunings

Here, we define which kernels are to be extracted from specific tunings. The current configuration is per the recommendation of the RAJA Performance Suite team.
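
Assuming each kernel is meant to be taken from exactly one tuning, it can be useful to invert the mapping and check for duplicates. The sketch below uses a three-entry excerpt of the dictionary defined in this section; `kernel_to_tuning` is a hypothetical helper.

```python
def kernel_to_tuning(kernels_by_tuning):
    """Invert {tuning: [kernels]} into {kernel: tuning}, rejecting duplicates."""
    inverted = {}
    for tuning, kernels in kernels_by_tuning.items():
        for kernel in kernels:
            if kernel in inverted:
                raise ValueError(
                    f"{kernel} is listed under both {inverted[kernel]} and {tuning}"
                )
            inverted[kernel] = tuning
    return inverted


# Excerpt of lassen_gpu_kernels, shortened for illustration
excerpt = {
    "block_256": ["Stream_ADD", "Stream_COPY"],
    "default": ["Algorithm_SORT", "Algorithm_SORTPAIRS"],
    "cub": ["Algorithm_SCAN"],
}

lookup = kernel_to_tuning(excerpt)
print(lookup["Algorithm_SCAN"])
```

Passing the full `lassen_gpu_kernels` dictionary instead of `excerpt` would run the same check over every kernel.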

In [None]:
lassen_gpu_kernels = {
    "block_256": [
        "Algorithm_ATOMIC",
        "Algorithm_MEMCPY",
        "Algorithm_MEMSET",
        "Apps_DEL_DOT_VEC_2D",
        "Apps_EDGE3D",
        "Apps_ENERGY",
        "Apps_FIR",
        "Apps_LTIMES",
        "Apps_LTIMES_NOVIEW",
        "Apps_MATVEC_3D_STENCIL",
        "Apps_NODAL_ACCUMULATION_3D",
        "Apps_PRESSURE",
        "Apps_VOL3D",
        "Apps_ZONAL_ACCUMULATION_3D",
        "Basic_ARRAY_OF_PTRS",
        "Basic_COPY8",
        "Basic_DAXPY",
        "Basic_DAXPY_ATOMIC",
        "Basic_IF_QUAD",
        "Basic_INDEXLIST",
        "Basic_INDEXLIST_3LOOP",
        "Basic_INIT3",
        "Basic_INIT_VIEW1D",
        "Basic_INIT_VIEW1D_OFFSET",
        "Basic_MAT_MAT_SHARED",
        "Basic_MULADDSUB",
        "Basic_NESTED_INIT",
        "Basic_PI_ATOMIC",
        "Comm_HALO_EXCHANGE",
        "Comm_HALO_PACKING",
        "Comm_HALO_SENDRECV",
        "Lcals_DIFF_PREDICT",
        "Lcals_EOS",
        "Lcals_FIRST_DIFF",
        "Lcals_FIRST_SUM",
        "Lcals_GEN_LIN_RECUR",
        "Lcals_HYDRO_1D",
        "Lcals_HYDRO_2D",
        "Lcals_INT_PREDICT",
        "Lcals_PLANCKIAN",
        "Lcals_TRIDIAG_ELIM",
        "Polybench_2MM",
        "Polybench_3MM",
        "Polybench_ADI",
        "Polybench_ATAX",
        "Polybench_FDTD_2D",
        "Polybench_FLOYD_WARSHALL",
        "Polybench_GEMM",
        "Polybench_GEMVER",
        "Polybench_GESUMMV",
        "Polybench_HEAT_3D",
        "Polybench_JACOBI_1D",
        "Polybench_JACOBI_2D",
        "Polybench_MVT",
        "Stream_ADD",
        "Stream_COPY",
        "Stream_MUL",
        "Stream_TRIAD",
    ], # For block_256
    "default": [
        "Algorithm_SORT",
        "Algorithm_SORTPAIRS",
    ], # For default
    "blkatm_occgs_256": [
        "Algorithm_REDUCE_SUM",
        "Basic_PI_REDUCE",
        "Basic_REDUCE3_INT",
        "Basic_REDUCE_STRUCT",
        "Basic_TRAP_INT",
        "Stream_DOT",
    ], # For blkatm_occgs_256
    "block_64": [
        "Apps_CONVECTION3DPA",
        "Apps_DIFFUSION3DPA",
        "Apps_MASS3DEA",
    ], # For block_64
    "block_25": [
        "Apps_MASS3DPA",
    ], # For block_25
    "funcptr_256": [
        "Comm_HALO_EXCHANGE_FUSED",
        "Comm_HALO_PACKING_FUSED",
    ], # For funcptr_256
    "atomic_occgs_256": [
        "Algorithm_HISTOGRAM",
        "Basic_MULTI_REDUCE",
    ], # For atomic_occgs_256
    "blkdev_occgs_256": [
        "Lcals_FIRST_MIN",
    ], # For blkdev_occgs_256
    "cub": [
        "Algorithm_SCAN",
    ], # For cub
}

### 4.2 Define Focus Ranks and Problem Sizes

The `n_ranks_gpu` and `problem_sizes_gpu` variables can be used to subselect `.cali` files based on the number of MPI ranks and problem sizes. Their default values select all `.cali` files for GPU runs.

In [None]:
n_ranks_gpu = list(range(1, 5))
problem_sizes_gpu = list(range(1*1024*1024, 40*1024*1024, 131072))

### 4.3 Read Performance Data into Thicket

These cells read the `.cali` files selected in Section 4.2 using Thicket. Users should not change these cells unless they move the `.cali` files.

In [None]:
ROOT_PATH_GPU = "data_generation/lassen_data_gpu/"

In [None]:
tuning_names = list(lassen_gpu_kernels.keys())

cali_files = [f"x/RAJA_CUDA-{i}.cali" for i in tuning_names]

In [None]:
th_dict_gpu = {}
th_lst_gpu = []
th_names_gpu = []

gb_params = ["tuning"]

for rank in tqdm(n_ranks_gpu):
    thickets_per_rank = {}
    for problem_size in problem_sizes_gpu:
        problem_dir = f"{ROOT_PATH_GPU}/raja-perf_suite_caliper_custom_{rank}_{problem_size}/"

        cali_files = [f"{problem_dir}/RAJA_CUDA-{i}.cali" for i in tuning_names]
        problem_th = th.Thicket.from_caliperreader(cali_files, fill_perfdata=False, disable_tqdm=True)

        
        gb_th = problem_th.groupby(gb_params)

        new_gb = prune_thickets(gb_th)

        thickets_per_rank[problem_size] = th.Thicket.concat_thickets(list(new_gb.values()), fill_perfdata=False)
        th_lst_gpu.append(thickets_per_rank[problem_size])
        th_names_gpu.append(f"n_{rank}_p_{problem_size}")
    th_dict_gpu[rank] = thickets_per_rank 

### 4.4 Converting Data Points to CSV

These cells convert data contained in the `Thicket` objects from Section 4.3 into `.csv` files. Users can modify the `gpu_csv_out_dir` variable to configure where the `.csv` files are written.

In [None]:
gpu_csv_out_dir = "csv_files/raw/lassen/gpu/"

In [None]:
nodes = []
for node in th_lst_gpu[0].graph.traverse():
    nodes.append(node.frame["name"])

In [None]:
for focus_kernel in tqdm(nodes):
    out_file_name = f"{gpu_csv_out_dir}/lassen-{focus_kernel}.csv"
    per_kernel_data = get_csv_data(th_dict_gpu, focus_kernel, problem_sizes_gpu, n_ranks_gpu)
    per_kernel_data.to_csv(out_file_name, index=False)

## 5. RAJAPerf Configuration for the Dataset

The table below shows the configurations of RAJAPerf used to generate this dataset. More details about these settings can be found in the [RAJAPerf documentation](https://rajaperf.readthedocs.io/en/develop/).

<table>
    <tr>
        <th>Processor</th>
        <th>Variant</th>
        <th>Tunings</th>
        <th>Problem Sizes (as a "range" tuple)</th>
        <th># MPI Ranks (as a "range" tuple)</th>
    </tr>
    <tr>
        <td>POWER9</td>
        <td>RAJA_Seq</td>
        <td>default</td>
        <td>(1M, 41M, 1M)</td>
        <td>(2, 161, 2)</td>
    </tr>
    <tr>
        <td>V100</td>
        <td>RAJA_CUDA</td>
        <td>block_256, blkatm_occgs_256, block_64, block_25, funcptr_256, cub, default, atomic_occgs_256, blkdev_occgs_256</td>
        <td>(1M, 41M, 128K)</td>
        <td>(1, 5, 1)</td>
    </tr>
</table>
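
The "range" tuples follow Python's `range(start, stop, step)` convention, so the stop value is excluded. A short sketch expanding the POWER9 row, assuming M means 1024 * 1024 as in Section 3.1 (`M`, `cpu_problem_sizes`, and `cpu_ranks` are names introduced here for illustration):

```python
M = 1024 * 1024  # assumed meaning of "M" in the table

# POWER9 row: problem sizes (1M, 41M, 1M) and MPI ranks (2, 161, 2)
cpu_problem_sizes = list(range(1 * M, 41 * M, 1 * M))
cpu_ranks = list(range(2, 161, 2))

print(len(cpu_problem_sizes), cpu_problem_sizes[-1] // M)
print(len(cpu_ranks), cpu_ranks[-1])
```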

## 6. References

[1] O. Pearce, J. Burmark, R. Hornung, B. Bogale, I. Lumsden, M. McKinsey, D. Yokelson, D. Boehme, S. Brink, M. Taufer, and T. Scogland, “RAJA Performance Suite: Performance portability analysis with Caliper and Thicket,” in Proceedings of SC24-W: Workshops of the International Conference for High Performance Computing, Networking, Storage and Analysis, Atlanta, GA, USA: IEEE Computer Society, Nov. 2024.