# Trend Analysis for the RAJA Performance Suite: CSV Data Visualization

Understanding the performance characteristics of computational kernels is critical for optimizing scientific applications on high-performance computing (HPC) systems. The RAJA Performance Suite (RAJAPerf)[1] provides a collection of key computational kernels relevant to Lawrence Livermore National Laboratory (LLNL). This notebook performs a visual analysis of RAJAPerf kernel performance across different execution configurations on LLNL’s [Lassen supercomputer](https://hpc.llnl.gov/hardware/compute-platforms/lassen).  

The performance data used in this analysis is derived from the `.csv` files generated by `csv_generation.ipynb`, which processes `.cali` files collected from RAJAPerf runs through the [Caliper profiler](https://software.llnl.gov/Caliper/) on both Lassen’s POWER9 CPUs and V100 GPUs. These `.csv` files capture performance metrics for different problem sizes and numbers of MPI ranks.  

This notebook visualizes performance trends across kernels by generating surface plots and other comparative visualizations. By analyzing these plots, users can identify performance scaling trends, pinpoint bottlenecks, and gain insights into how problem sizes and parallel execution strategies affect kernel execution time.


## 1. Import Necessary Packages

In [None]:
# Standard imports
from glob import glob
from warnings import simplefilter
import os

# Third-party imports
import numpy as np
import pandas as pd
import plotly.graph_objects as go

# Ignore pandas performance warnings
simplefilter(action="ignore", category=pd.errors.PerformanceWarning)

## 2. Function Definitions

The `plot_kernel` function generates an interactive 3D surface plot of a RAJAPerf kernel's runtime as a function of problem size and number of MPI ranks. The colorscale, axis labels, and title are configurable, and any axis can be log-scaled by listing it in `logscale` (for example, `logscale=["z"]`). When `enable_contours` is set, contour lines are drawn on the surface at evenly spaced runtime levels; the `contour_size` keyword sets the number of levels and `contour_color` sets their color. Axis titles and the plot title are passed as keyword arguments (`xaxis_title`, `yaxis_title`, `zaxis_title`, `title`). The plot is rendered with Plotly, so it can be rotated and zoomed when comparing kernel performance across configurations.

In [None]:
# The main plotting function
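def plot_kernel(focus_kernel, colorscale='Viridis', enable_contours=False, logscale=("none",), **kwargs):
    # X and Y are the module-level meshgrids from Section 4; they are only read here,
    # so no global declaration is needed.
    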
    # Contour Configuration
    contours = None
    contour_size = kwargs.get('contour_size', 40)
    contour_color = kwargs.get('contour_color', "black")

    # Axis naming configuration
    xaxis_title = kwargs.get('xaxis_title', "Problem Size")
    yaxis_title = kwargs.get('yaxis_title', "Ranks")
    zaxis_title = kwargs.get('zaxis_title', "Runtime")

    title = kwargs.get('title', f'Kernel: {focus_kernel} Surface Map')
    
    z = get_z_matrix(focus_kernel)

    custom_colorscale = colorscale

    x = X
    y = Y

    # Apply log10 scaling to each axis named in logscale
    for i in logscale:
        if i.lower() == "x":
            x = X.copy()
            x = np.log10(x)
        elif i.lower() == "y":
            y = Y.copy()
            y = np.log10(y)
        elif i.lower() == "z":
            z = np.log10(z)
     
    if enable_contours:
        contours={
            "z": {
                "show": True,          
                "usecolormap": False, 
                "color": contour_color,      
                "start": z.min(),     
                "end": z.max(),       
                "size": (z.max() - z.min()) / contour_size, 
                "highlightwidth": 2  
            }
        }

    fig = go.Figure(
        data=[
            go.Surface(
                z=z, x=x, y=y,
                colorscale=custom_colorscale,
                contours=contours
            )
        ]
    )

    fig.update_layout(title=dict(text=title), autosize=False,
                      width=800, height=800,
                      margin=dict(l=65, r=50, b=65, t=90),
                      scene=dict(
                          xaxis_title=xaxis_title,
                          yaxis_title=yaxis_title,
                          zaxis_title=zaxis_title
                      )
                     )
    fig.show()
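
The log-scaling and contour-spacing steps inside `plot_kernel` can be checked without rendering a figure. The following is a minimal sketch on a small synthetic runtime grid; the helper name `contour_spec` and all data values are hypothetical and not part of the notebook. Note that `np.log10` requires strictly positive runtimes.

```python
import numpy as np

def contour_spec(z, n_levels=40, color="black"):
    """Build z-contour settings the way plot_kernel does (hypothetical helper)."""
    z = np.asarray(z, dtype=float)
    return {
        "z": {
            "show": True,
            "usecolormap": False,
            "color": color,
            "start": z.min(),
            "end": z.max(),
            # n_levels plays the role of contour_size: spacing = range / count
            "size": (z.max() - z.min()) / n_levels,
        }
    }

# Synthetic runtimes (made-up values): rows are rank counts, columns are problem sizes
z = np.array([[0.001, 0.01, 0.1],
              [0.002, 0.02, 0.2]])

linear = contour_spec(z, n_levels=10)
logged = contour_spec(np.log10(z), n_levels=10)

print(linear["z"]["size"])
print(logged["z"]["size"])
```

Because the spacing is derived from the data range after any log transform, the same `contour_size` yields evenly spaced levels in whichever scale is being plotted.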

In [None]:
# Constructs the z matrix using runtime data.
# Assumes rows are ordered to match the meshgrid: ranks vary slowest, problem sizes fastest.
def get_z_matrix(focus_kernel):
    return np.array(kernel_agg_df[(focus_kernel, "Avg time/rank")]).reshape(X.shape)
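
`get_z_matrix` relies on the CSV rows being ordered so that a plain `reshape` lines up with the meshgrid. As a hedged alternative, the sketch below builds the same kind of matrix with a pivot, which does not depend on row order. The column names come from the notebook; the data values and the helper name `z_matrix_by_pivot` are hypothetical.

```python
import numpy as np
import pandas as pd

def z_matrix_by_pivot(df, value_col="Avg time/rank"):
    """Return (X, Y, Z) with Z[i, j] = runtime at ranks[i], problem_sizes[j]."""
    table = df.pivot(index="ranks", columns="problem_sizes", values=value_col)
    X, Y = np.meshgrid(table.columns.to_numpy(), table.index.to_numpy())
    return X, Y, table.to_numpy()

# Synthetic single-kernel table (made-up values), deliberately shuffled
df = pd.DataFrame({
    "problem_sizes": [2000, 1000, 1000, 2000],
    "ranks":         [1,    2,    1,    2],
    "Avg time/rank": [0.20, 0.05, 0.10, 0.11],
})

X, Y, Z = z_matrix_by_pivot(df)
print(Z)
```

In the notebook, the per-kernel frame would presumably be `kernel_agg_df[focus_kernel]`. The pivot sorts both axes, so the meshgrids should be rebuilt from the pivoted table, as done here, rather than reused from elsewhere.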

In [None]:
# Allows the plotting of a list of kernels
def plot_kernels(focus_kernels, **kwargs):
    for kernel in focus_kernels:
        plot_kernel(kernel, **kwargs)
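
`plot_kernels` stops with a `KeyError` at the first kernel name that is absent from `kernel_agg_df`. A small guard can split a requested list into present and missing names before plotting. The sketch below assumes kernel names form the first level of the column MultiIndex, as produced by `pd.concat` in Section 3; the helper name `split_available` and the sample data are hypothetical.

```python
import pandas as pd

def split_available(agg_df, requested):
    """Split requested kernel names into (present, missing) lists."""
    available = set(agg_df.columns.get_level_values(0))
    present = [k for k in requested if k in available]
    missing = [k for k in requested if k not in available]
    return present, missing

# Synthetic aggregated frame with two kernels (made-up values)
agg_df = pd.concat(
    {
        "Stream_TRIAD": pd.DataFrame({"Avg time/rank": [0.1, 0.2]}),
        "Basic_INIT3": pd.DataFrame({"Avg time/rank": [0.3, 0.4]}),
    },
    axis=1,
    names=["kernel_name"],
)

present, missing = split_available(agg_df, ["Stream_TRIAD", "Stream_DOT"])
print(present, missing)
```

The `present` list could then be passed to `plot_kernels`, with `missing` reported to the user.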

## 3. Reading and Processing All CSV Files for Kernel Analysis

In this section, we load and process the relevant CSV files for kernel analysis. The execution type (CPU or GPU) is selected by the `TYPE_OF_RUN` variable. Each CSV file is read into a pandas DataFrame, and all DataFrames are then concatenated column-wise into a single aggregated DataFrame whose columns are keyed by kernel name.

In [None]:
TYPE_OF_RUN = "CPU"

In [None]:
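if TYPE_OF_RUN == "CPU":
    DATA_ROOT = "csv_files/raw/lassen/cpu/"
elif TYPE_OF_RUN == "GPU":
    DATA_ROOT = "csv_files/raw/lassen/gpu/"
else:
    # Fail early instead of leaving DATA_ROOT undefined
    raise ValueError(f"TYPE_OF_RUN must be 'CPU' or 'GPU', got {TYPE_OF_RUN!r}")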

In [None]:
# Read in the csv files
csv_files = glob(DATA_ROOT + "*")
# Kernel name = file name minus its 7-character prefix and 4-character extension
kernel_names = [os.path.basename(i)[7:-4] for i in csv_files]

# Create a dataframe for each kernel and concatenate into a single multiindex dataframe
kernel_dataframes = {}
for name, file in zip(kernel_names, csv_files):
    kernel_dataframes[name] = pd.read_csv(file)

kernel_agg_df = pd.concat(kernel_dataframes, axis=1, names=["kernel_name"])  
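
To illustrate the layout of the aggregated DataFrame, here is a minimal sketch that uses two tiny in-memory CSVs in place of the real per-kernel files. The column names match those used elsewhere in the notebook; the numeric values are made up, and the variable names `csv_text`, `frames`, and `agg` are hypothetical.

```python
import io
import pandas as pd

# Two tiny in-memory CSVs standing in for per-kernel files (made-up values)
csv_text = {
    "Basic_INIT3": "problem_sizes,ranks,Avg time/rank\n1000,1,0.10\n2000,1,0.20\n",
    "Stream_TRIAD": "problem_sizes,ranks,Avg time/rank\n1000,1,0.30\n2000,1,0.60\n",
}

frames = {name: pd.read_csv(io.StringIO(text)) for name, text in csv_text.items()}

# Dict keys become the outer column level, named "kernel_name"
agg = pd.concat(frames, axis=1, names=["kernel_name"])

# A (kernel, metric) tuple selects one column of the aggregated frame
triad_time = agg[("Stream_TRIAD", "Avg time/rank")]
print(agg.columns.nlevels, triad_time.tolist())
```

Concatenating along `axis=1` aligns the frames by row index, so this layout assumes every kernel's CSV lists its configurations in the same row order.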

## 4. Defining the X and Y Meshgrids

To construct the meshgrids needed for the surface plots, we extract the unique problem sizes and MPI rank counts from the `Basic_INIT3` kernel. Because these values are the same for every kernel, `Basic_INIT3` serves as a representative reference. The two arrays are then expanded into the X and Y meshgrids used for visualization.

In [None]:
problem_sizes = pd.unique(kernel_agg_df[("Basic_INIT3", "problem_sizes")])
n_ranks = pd.unique(kernel_agg_df[("Basic_INIT3", "ranks")])

In [None]:
X, Y = np.meshgrid(problem_sizes, n_ranks)
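
The orientation of the meshgrids determines the shape the runtime matrix must have. A minimal sketch with made-up problem sizes and rank counts shows that rows index ranks and columns index problem sizes:

```python
import numpy as np

problem_sizes = np.array([1000, 2000, 4000])  # made-up values
n_ranks = np.array([1, 2])                    # made-up values

# Default "xy" indexing: the first argument varies along columns, the second along rows
X, Y = np.meshgrid(problem_sizes, n_ranks)

print(X.shape)
print(X[1], Y[1])
```

Here `X` and `Y` both have shape `(len(n_ranks), len(problem_sizes))`, so a runtime matrix reshaped to `X.shape` must list all problem sizes for the first rank count, then all problem sizes for the next, and so on.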

## 5. Plotting the Kernels Based on Top-Down Bottlenecks

We first plot selected RAJAPerf kernels grouped by their top-down breakdown category [2]: Memory Bound, Core Bound, Bad Speculation, Retiring, and Mixture (both Memory and Core Bound). This classification helps in understanding the underlying performance characteristics of each kernel. We then examine a subset of kernels whose surfaces deviate from the typical trends observed across the rest of the suite.


### 5.1 Analysis of Memory Bound Kernels

Memory-bound kernels experience performance limitations primarily due to delays in accessing data from memory, which causes execution units to remain idle while waiting for data to arrive. These stalls typically result from cache misses at various levels, including L1, L2, L3, or external memory.

In [None]:
memory_bound_kernels = [
    "Stream_TRIAD",
    "Stream_DOT",
    "Basic_INIT3",
    "Algorithm_MEMCPY",
    "Lcals_FIRST_SUM",
]

In [None]:
plot_kernels(memory_bound_kernels)

### 5.2 Analysis of Core Bound Kernels

Core-bound kernels are constrained by execution resources within the CPU core, such as arithmetic logic units (ALUs) or vector processing units, leading to stalls when execution units are unable to keep up with the demand. These bottlenecks can be caused by inefficient instruction scheduling, dependencies between instructions, or underutilization of available execution ports.

In [None]:
core_bound_kernels = [
    "Basic_MAT_MAT_SHARED",
    "Algorithm_REDUCE_SUM",
    "Basic_TRAP_INT",
]

In [None]:
plot_kernels(core_bound_kernels)

### 5.3 Analysis of Bad Speculation Bound Kernels

Bad speculation-bound kernels suffer performance degradation due to incorrect speculative execution, where the processor executes instructions along a mispredicted path and later discards them. This category includes penalties from branch mispredictions and machine clears caused by incorrect memory ordering speculation.

In [None]:
bad_speculation_bound_kernels = [
    "Lcals_FIRST_MIN",
    "Algorithm_SORT", 
    "Algorithm_SORTPAIRS",
    "Basic_INDEXLIST_3LOOP",
]

In [None]:
plot_kernels(bad_speculation_bound_kernels)

### 5.4 Analysis of Retiring Bound Kernels

Retiring-bound kernels efficiently utilize available execution resources, with a high proportion of issued micro-operations successfully completing and retiring. Ideally, maximizing the retiring fraction leads to higher instruction-per-cycle (IPC) performance, but further optimizations such as vectorization or reducing microcode assists may still improve efficiency.

In [None]:
retiring_bound_kernels = [
    "Apps_LTIMES",
    "Apps_MASS3DEA",
    "Polybench_HEAT_3D",
]

In [None]:
# Log-scale the runtime (z) axis; plot_kernel has no runtime_logscale option
plot_kernels(retiring_bound_kernels, logscale=["z"])

### 5.5 Analysis of Memory and Core Bound (Mixture) Kernels

Kernels that are both memory and core bound exhibit performance limitations from both memory stalls and inefficient execution unit utilization, indicating that improvements are needed in both data access patterns and instruction scheduling. These workloads may require optimizations in cache usage, data locality, and better exploitation of parallelism in computation to alleviate bottlenecks in both areas.

In [None]:
mixture_kernels = [
    "Basic_MULTI_REDUCE",
    "Apps_FIR",
    "Basic_REDUCE_STRUCT",
    "Basic_TRAP_INT",
    "Apps_VOL3D",
    "Basic_PI_ATOMIC",
    "Polybench_ADI",
    "Stream_DOT",
    "Apps_DEL_DOT_VEC_2D",
]

In [None]:
plot_kernels(mixture_kernels)

### 5.6 Analysis of Deviant Kernels

The following kernels exhibited deviant surface behavior, distinguishing them from the majority of RAJAPerf kernels. They encompass a diverse range of top-down bottlenecks and may provide deeper insights into specific performance patterns, particularly in areas such as memory access and cache behavior.

In [None]:
deviant_kernels = [
    "Comm_HALO_PACKING",
    "Comm_HALO_EXCHANGE",
    "Basic_IF_QUAD",
    "Basic_NESTED_INIT",
    "Basic_INIT_VIEW1D",
    "Basic_INIT_VIEW1D_OFFSET",
    "Lcals_DIFF_PREDICT",
    "Polybench_ATAX",
    "Polybench_MVT",
]

In [None]:
plot_kernels(deviant_kernels)

## 6. References

[1] O. Pearce, J. Burmark, R. Hornung, B. Bogale, I. Lumsden, M. McKinsey, D. Yokelson, D. Boehme, S. Brink, M. Taufer, and T. Scogland, “RAJA Performance Suite: Performance portability analysis with Caliper and Thicket,” in Proceedings of SC24-W: Workshops of the International Conference for High Performance Computing, Networking, Storage and Analysis, Atlanta, GA, USA: IEEE Computer Society, Nov. 2024.

[2] A. Yasin, “A top-down method for performance analysis and counters architecture,” in Proceedings of IEEE International Symposium on Performance Analysis of Systems and Software, Monterey, CA, USA: IEEE Computer Society, Mar. 2014.