# I/O Patterns and Bottlenecks in Deep Learning Workloads

**Author:** Pablo Alessandro Santos Hugen  
**Institution:** Institute of Informatics -- UFRGS  
**Course:** Computer Systems Performance Analysis 2025/2

---

This notebook covers:

- Environment setup and configuration
- Loading the experimental design
- Running DLIO benchmarks
- Collecting and analyzing results

**Prerequisites:**
- Allocate an interactive node: `salloc --partition=<partition> --nodes=1 --ntasks=8 --time=4:00:00`
- Launch Jupyter from the allocated node

## 1. Introduction

### 1.1 Context

Recent years have seen growing interest in optimizations for Machine Learning and Deep Learning training and inference methods. These techniques are now used across various fields, including Large Language Models (LLMs), image recognition and classification, and many other applications.

Large models often require substantial HPC infrastructures to process the enormous amounts of training data involved. In this context, **the performance of the storage and I/O subsystem is critical**.

#### Traditional HPC vs. ML Workloads

| Aspect | Traditional HPC | ML Workloads |
|--------|-----------------|---------------|
| Access Pattern | Large, sequential reads/writes | Small, random reads across numerous files |
| Typical Use Case | Simulations with periodic checkpoints | Iterative training over dataset epochs |
| I/O Characteristics | Predictable, burst-oriented | Continuous, irregular access patterns |
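
The contrast in the table can be made concrete with a small sketch. It is illustrative only: the function names, file counts, and sizes are assumptions, not part of the benchmark. A checkpoint-style stream walks one file at increasing offsets, while a training loader visits (file, sample) pairs in a freshly shuffled order each epoch.

```python
import random


def sequential_offsets(file_size: int, block_size: int) -> list[int]:
    """Offsets touched when streaming one large file front to back."""
    return list(range(0, file_size, block_size))


def shuffled_sample_order(num_files: int, samples_per_file: int, seed: int) -> list[tuple[int, int]]:
    """(file, sample) pairs visited during one epoch of shuffled training."""
    order = [(f, s) for f in range(num_files) for s in range(samples_per_file)]
    random.Random(seed).shuffle(order)
    return order


print(sequential_offsets(file_size=16, block_size=4))
for epoch in range(2):
    # A different seed per epoch generally yields a different visiting order.
    print(epoch, shuffled_sample_order(num_files=3, samples_per_file=2, seed=epoch))
```

The sequential case is fully determined by the block size, whereas the shuffled order changes with the seed and jumps between files, which is the access pattern the ML column describes.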

### 1.2 The I/O Bottleneck Problem

In large-scale distributed DL workloads:
- **I/O can take roughly 85% of the training time** (Mohan et al., 2021)
- Training is often one of the most expensive parts of the ML pipeline (Chowdhury et al., 2023)
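
To see why an I/O share of this size matters, the following minimal sketch applies Amdahl's law: if only the non-I/O part of training gets faster, the overall speedup stays bounded by the I/O fraction. The 0.85 value comes from the figure cited above; the function name and the speedup factors are illustrative assumptions.

```python
def overall_speedup(io_fraction: float, compute_speedup: float) -> float:
    """Amdahl's law: overall speedup when only the non-I/O part gets faster."""
    if not 0.0 <= io_fraction <= 1.0:
        raise ValueError("io_fraction must be in [0, 1]")
    if compute_speedup <= 0:
        raise ValueError("compute_speedup must be positive")
    compute_fraction = 1.0 - io_fraction
    return 1.0 / (io_fraction + compute_fraction / compute_speedup)


# With 85% of the time spent in I/O, a faster accelerator helps very little.
for factor in (2, 10, 1000):
    print(f"compute {factor}x faster -> overall {overall_speedup(0.85, factor):.3f}x")
```

However large the compute speedup, the result cannot exceed 1 / 0.85, roughly 1.18x, so in such a scenario further gains would have to come from the I/O path.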

## 2. Objectives

### 2.1 General Objective

Understand **patterns in I/O operations and possible bottlenecks** in common Machine Learning workloads.

### 2.2 Specific Objectives

1. **Disk Throughput:** Understand how disk throughput varies during training: across epochs, around checkpoints, and as the number of training processes changes.

2. **GPU Usage:** Analyze how GPU usage (%) behaves in those scenarios.

## 3. Environment Setup

### 3.1 Configuration

Configure the environment variables for your cluster below.

In [None]:
import os
import subprocess
import json
import glob
import shutil
import tempfile
from pathlib import Path
from datetime import datetime
from IPython.display import display, HTML, clear_output

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

plt.style.use('seaborn-v0_8-whitegrid')
sns.set_palette("husl")
plt.rcParams['figure.figsize'] = (12, 6)
plt.rcParams['font.size'] = 12

In [None]:
MODULES = "arch_gpu_sc/current openmpi/4.1.6.15.1"
BASE_DIR = Path("..").resolve()
CONFIG_DIR = BASE_DIR / "config"
RESULTS_DIR = BASE_DIR / "results"
EXPERIMENT_FILE = Path("experimental_design.csv")
SCRATCH_DIR = BASE_DIR / f"dlio_data_{os.getpid()}"

print(f"Modules: {MODULES}")
print(f"Base directory: {BASE_DIR}")
print(f"Config directory: {CONFIG_DIR}")
print(f"Results directory: {RESULTS_DIR}")
print(f"Scratch directory: {SCRATCH_DIR}")

### 3.2 Helper Functions

Functions for running shell commands and managing the benchmark environment.

In [None]:
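def run_command(cmd: str, cwd: Path = None, verbose: bool = True, load_modules: bool = True) -> tuple[int, str, str]:
    """Run a shell command and return (return code, output, "").

    stderr is redirected into stdout, so the third element is always empty.
    """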
    if load_modules and MODULES:
        cmd = f"module load {MODULES} && {cmd}"
    
    if verbose:
        print(f"$ {cmd}")
        print("-" * 60)
    
    process = subprocess.Popen(
        cmd,
        shell=True,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        cwd=cwd,
        text=True
    )
    
    output_lines = []
    for line in process.stdout:
        output_lines.append(line)
        if verbose:
            print(line, end="")
    
    process.wait()
    stdout = "".join(output_lines)
    
    if verbose:
        print("-" * 60)
        print(f"Return code: {process.returncode}")
    
    return process.returncode, stdout, ""

In [None]:
def setup_environment():
    print("=" * 60)
    print("ENVIRONMENT SETUP")
    print("=" * 60)
    
    venv_dir = BASE_DIR / ".venv"
    venv_python = venv_dir / "bin" / "python"
    
    if not venv_dir.exists():
        print("\n[1/4] Creating virtual environment...")
        ret, _, _ = run_command(f"uv venv --python $(which python3)", cwd=BASE_DIR, load_modules=False)
        if ret != 0:
            print("  ERROR: Failed to create venv")
            return False
        
        print("\n[2/4] Installing dlio-benchmark from submodule...")
        ret, _, _ = run_command(f"uv pip install --python {venv_python} ./dlio_benchmark/", cwd=BASE_DIR, load_modules=False)
        if ret != 0:
            print("  ERROR: Failed to install dlio-benchmark")
            return False
        
        print("\n[3/4] Installing analysis dependencies...")
        ret, _, _ = run_command(f"uv pip install --python {venv_python} jupyter pandas matplotlib seaborn", cwd=BASE_DIR, load_modules=False)
        if ret != 0:
            print("  WARNING: Some dependencies may have failed")
    else:
        print("\n[1/4] Virtual environment already exists, skipping creation...")
        print("[2/4] Skipping dlio-benchmark install...")
        print("[3/4] Skipping dependencies install...")
    
    print("\n[4/4] Creating scratch and results directories...")
    SCRATCH_DIR.mkdir(parents=True, exist_ok=True)
    RESULTS_DIR.mkdir(parents=True, exist_ok=True)
    
    print("\n" + "=" * 60)
    print("Environment setup complete!")
    print("=" * 60)
    return True


def cleanup_scratch():
    if SCRATCH_DIR.exists():
        shutil.rmtree(SCRATCH_DIR)
        print(f"Cleaned up: {SCRATCH_DIR}")

### 3.3 Initialize Environment

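Run this cell to set up the environment. This will:
- Create a virtual environment with `uv` (skipped if `.venv` already exists)
- Install `dlio-benchmark` from the submodule, plus the analysis dependencies
- Create the scratch and results directories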

In [None]:
setup_environment()

### 3.4 System Information

Dynamically collect system specifications from the current node.

In [None]:
import socket
import platform
import re

def get_system_info() -> dict:
    info = {
        "hostname": socket.gethostname(),
        "platform": platform.platform(),
        "processor": platform.processor(),
    }
    
    try:
        with open("/proc/cpuinfo", "r") as f:
            cpuinfo = f.read()
        
        physical_ids = set(re.findall(r"physical id\s*:\s*(\d+)", cpuinfo))
        cores_per_socket = len(set(re.findall(r"core id\s*:\s*(\d+)", cpuinfo)))
        total_cores = len(re.findall(r"^processor\s*:", cpuinfo, re.MULTILINE))
        model_match = re.search(r"model name\s*:\s*(.+)", cpuinfo)
        if model_match:
            info["cpu_model"] = model_match.group(1).strip()
        else:
            cpu_part = re.search(r"CPU part\s*:\s*(.+)", cpuinfo)
            info["cpu_model"] = f"ARM {cpu_part.group(1).strip()}" if cpu_part else "Unknown"
        
        info["cpu_sockets"] = len(physical_ids) if physical_ids else 1
        info["cpu_cores_total"] = total_cores
    except Exception as e:
        info["cpu_error"] = str(e)
    
    try:
        with open("/proc/meminfo", "r") as f:
            meminfo = f.read()
        
        mem_match = re.search(r"MemTotal:\s*(\d+)\s*kB", meminfo)
        if mem_match:
            mem_kb = int(mem_match.group(1))
            info["memory_gb"] = round(mem_kb / 1024 / 1024, 1)
    except Exception as e:
        info["memory_error"] = str(e)
    
    try:
        result = subprocess.run(
            ["nvidia-smi", "--query-gpu=name,memory.total,count", "--format=csv,noheader,nounits"],
            capture_output=True, text=True, timeout=10
        )
        if result.returncode == 0:
            gpu_lines = result.stdout.strip().split("\n")
            gpu_count = len(gpu_lines)
            if gpu_lines and gpu_lines[0]:
                parts = gpu_lines[0].split(", ")
                info["gpu_model"] = parts[0].strip()
                info["gpu_memory_mb"] = int(parts[1].strip()) if len(parts) > 1 else 0
                info["gpu_count"] = gpu_count
    except Exception as e:
        info["gpu_info"] = "Not available"
    
    try:
        result = subprocess.run(
            ["df", "-h", "--output=size,avail,target", "/"],
            capture_output=True, text=True, timeout=10
        )
        if result.returncode == 0:
            lines = result.stdout.strip().split("\n")
            if len(lines) > 1:
                parts = lines[1].split()
                info["storage_total"] = parts[0]
                info["storage_available"] = parts[1]
    except Exception as e:
        info["storage_error"] = str(e)
    
    return info


def display_system_info(info: dict):
    print("=" * 60)
    print("SYSTEM INFORMATION")
    print("=" * 60)
    print(f"Hostname: {info.get('hostname', 'Unknown')}")
    print(f"Platform: {info.get('platform', 'Unknown')}")
    print()
    print("CPU:")
    print(f"  Model: {info.get('cpu_model', 'Unknown')}")
    print(f"  Sockets: {info.get('cpu_sockets', 'Unknown')}")
    print(f"  Total Cores: {info.get('cpu_cores_total', 'Unknown')}")
    print()
    print("Memory:")
    print(f"  Total: {info.get('memory_gb', 'Unknown')} GiB")
    print()
    if "gpu_model" in info:
        print("GPU:")
        print(f"  Model: {info.get('gpu_model', 'Unknown')}")
        print(f"  Count: {info.get('gpu_count', 'Unknown')}")
        print(f"  Memory per GPU: {info.get('gpu_memory_mb', 0) / 1024:.0f} GiB")
    else:
        print("GPU: Not available")
    print()
    print("Storage:")
    print(f"  Total: {info.get('storage_total', 'Unknown')}")
    print(f"  Available: {info.get('storage_available', 'Unknown')}")
    print("=" * 60)
    
    return info

system_info = get_system_info()
display_system_info(system_info)

In [None]:
def system_info_table(info: dict) -> pd.DataFrame:
    gpu_spec = "Not available"
    if "gpu_model" in info:
        gpu_mem_gb = info.get('gpu_memory_mb', 0) / 1024
        gpu_spec = f"{info.get('gpu_count', 1)}x {info.get('gpu_model', 'Unknown')} ({gpu_mem_gb:.0f} GiB each)"
    
    data = [
        ("CPU", f"{info.get('cpu_sockets', 1)}x {info.get('cpu_model', 'Unknown')} ({info.get('cpu_cores_total', 'Unknown')} cores total)"),
        ("Memory (RAM)", f"{info.get('memory_gb', 'Unknown')} GiB"),
        ("GPU", gpu_spec),
        ("Storage", f"{info.get('storage_total', 'Unknown')} (Available: {info.get('storage_available', 'Unknown')})"),
    ]
    
    return pd.DataFrame(data, columns=["Component", "Specification"])

system_table = system_info_table(system_info)
display(system_table.style.hide(axis='index').set_properties(**{'text-align': 'left'}))

## 4. Experimental Design

### 4.1 Load Experiment Configuration

The experimental design defines all benchmark runs to execute. The `run` column indicates completion status:
- `N` = Not yet run (pending)
- `Y` = Completed
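
As an illustration of the expected layout, the sketch below parses a small in-memory design with the columns the code in this section reads (`model`, `processes`, `overrides`, `run`). The rows themselves are hypothetical, including the override string, and the real `experimental_design.csv` may differ.

```python
import io

import pandas as pd

# Hypothetical rows for illustration; not the actual design file.
EXAMPLE_CSV = """\
# lines starting with '#' are treated as comments
model,processes,overrides,run
unet3d_h100_custom,1,,Y
unet3d_h100_custom,4,,N
unet3d_h100_custom,4,++workload.reader.read_threads=8,N
"""

example_df = pd.read_csv(io.StringIO(EXAMPLE_CSV), comment="#")
# Empty override cells are read as NaN; normalise them to empty strings.
example_df["overrides"] = example_df["overrides"].fillna("")

pending_example = example_df[example_df["run"] == "N"]
print(pending_example)
```

A row with an empty `overrides` cell is a baseline run; a non-empty cell carries one extra setting that is passed through to the benchmark command line.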

In [None]:
def load_experiment_design() -> pd.DataFrame:
    df = pd.read_csv(EXPERIMENT_FILE, comment='#')
    if 'overrides' in df.columns:
        df['overrides'] = df['overrides'].fillna('')
    return df


def save_experiment_design(df: pd.DataFrame):
    df.to_csv(EXPERIMENT_FILE, index=False)
    print(f"Saved experiment design to {EXPERIMENT_FILE}")


def get_pending_experiments(df: pd.DataFrame) -> pd.DataFrame:
    return df[df['run'] == 'N'].copy()


def get_experiment_id(row: pd.Series) -> str:
    overrides = row.get('overrides', '')
    if overrides:
        param = overrides.split('=')[0].split('.')[-1]
        value = overrides.split('=')[1] if '=' in overrides else ''
        return f"{row['processes']}_{param}_{value}"
    return str(row['processes'])


def mark_experiment_complete(df: pd.DataFrame, model: str, processes: int, overrides: str = '') -> pd.DataFrame:
    mask = (df['model'] == model) & (df['processes'] == processes)
    if 'overrides' in df.columns:
        mask = mask & (df['overrides'] == overrides)
    df.loc[mask, 'run'] = 'Y'
    return df

experiment_df = load_experiment_design()

print("=" * 60)
print("EXPERIMENTAL DESIGN")
print("=" * 60)
print(f"Total experiments: {len(experiment_df)}")
print(f"Completed: {(experiment_df['run'] == 'Y').sum()}")
print(f"Pending: {(experiment_df['run'] == 'N').sum()}")
print("\nExperiment matrix:")
experiment_df

In [None]:
def plot_experiment_status(df: pd.DataFrame):
    df = df.copy()
    df['category'] = df['overrides'].apply(
        lambda x: x.split('.')[-1].split('=')[0] if x else 'baseline'
    )
    
    summary = df.groupby(['category', 'run']).size().unstack(fill_value=0)
    fig, axes = plt.subplots(1, 2, figsize=(14, 5))
    status_counts = df['run'].value_counts()
    colors = ['#ccffcc' if s == 'Y' else '#ffcccc' for s in status_counts.index]
    axes[0].pie(status_counts, labels=['Completed' if s == 'Y' else 'Pending' for s in status_counts.index],
                colors=colors, autopct='%1.1f%%', startangle=90)
    axes[0].set_title('Overall Experiment Status', fontsize=14, fontweight='bold')
    if 'Y' not in summary.columns:
        summary['Y'] = 0
    if 'N' not in summary.columns:
        summary['N'] = 0
    
    x = range(len(summary))
    width = 0.35
    axes[1].bar([i - width/2 for i in x], summary['N'], width, label='Pending', color='#ffcccc')
    axes[1].bar([i + width/2 for i in x], summary['Y'], width, label='Completed', color='#ccffcc')
    axes[1].set_xticks(x)
    axes[1].set_xticklabels(summary.index, rotation=45, ha='right')
    axes[1].set_ylabel('Number of Experiments')
    axes[1].set_title('Experiments by Parameter Category', fontsize=14, fontweight='bold')
    axes[1].legend()
    
    plt.tight_layout()
    plt.show()


plot_experiment_status(experiment_df)

## 5. Benchmark Execution

### 5.1 Benchmark Runner

Functions to execute DLIO benchmarks based on the experimental design.

In [None]:
def generate_data(model: str, num_procs: int = 8, overrides: str = '') -> bool:
    print(f"\n{'='*60}")
    print(f"GENERATING DATA: {model}")
    if overrides:
        print(f"Overrides: {overrides}")
    print(f"{'='*60}")
    
    cmd = f"""
    mpirun -np {num_procs} \
        uv run --project {BASE_DIR} dlio_benchmark \
        --config-dir {CONFIG_DIR} \
        workload={model} \
        ++workload.workflow.generate_data=True \
        ++workload.workflow.train=False \
        ++workload.workflow.evaluation=False \
        {overrides}
    """.strip()
    
    ret, _, _ = run_command(cmd, cwd=SCRATCH_DIR)
    return ret == 0


def run_benchmark(model: str, num_procs: int, overrides: str = '', experiment_id: str = None) -> bool:
    print(f"\n{'='*60}")
    print(f"RUNNING BENCHMARK: {model} with {num_procs} processes")
    if overrides:
        print(f"Overrides: {overrides}")
    print(f"{'='*60}")
    
    cmd = f"""
    mpirun -np {num_procs} \
        uv run --project {BASE_DIR} dlio_benchmark \
        --config-dir {CONFIG_DIR} \
        workload={model} \
        ++workload.workflow.generate_data=False \
        ++workload.workflow.train=True \
        ++workload.workflow.evaluation=True \
        {overrides}
    """.strip()
    
    ret, _, _ = run_command(cmd, cwd=SCRATCH_DIR)
    
    if ret == 0:
        result_src = SCRATCH_DIR / "hydra_log" / model
        folder_name = experiment_id if experiment_id else str(num_procs)
        result_dst = RESULTS_DIR / model / folder_name
        
        if result_src.exists():
            if result_dst.exists():
                shutil.rmtree(result_dst)
            shutil.copytree(result_src, result_dst)
            print(f"\nResults copied to: {result_dst}")
            shutil.rmtree(result_src)
    
    return ret == 0
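
The two functions above differ only in their workflow flags. The sketch below shows one way to assemble the same command as an argument list, which makes that difference explicit. `build_dlio_command` and the example paths are hypothetical; the flags are the ones already used in this notebook.

```python
import shlex
from pathlib import Path


def build_dlio_command(model: str, num_procs: int, base_dir: Path, config_dir: Path,
                       *, generate: bool, overrides: str = "") -> list[str]:
    """Assemble the mpirun/dlio_benchmark invocation as an argument list."""
    # Data generation and training/evaluation are mutually exclusive phases here.
    workflow = {
        "generate_data": generate,
        "train": not generate,
        "evaluation": not generate,
    }
    args = [
        "mpirun", "-np", str(num_procs),
        "uv", "run", "--project", str(base_dir), "dlio_benchmark",
        "--config-dir", str(config_dir),
        f"workload={model}",
    ]
    args += [f"++workload.workflow.{name}={value}" for name, value in workflow.items()]
    args += shlex.split(overrides)  # extra settings, if any
    return args


cmd = build_dlio_command(
    "unet3d_h100_custom", 4,
    Path("/tmp/project"), Path("/tmp/project/config"),  # hypothetical paths
    generate=False,
)
print(shlex.join(cmd))
```

Building a list and joining it with `shlex.join` keeps quoting explicit, which can help when an override value contains spaces or shell metacharacters.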

### 5.2 Run Single Experiment

Use this cell to run a single experiment. Modify the parameters as needed.

In [None]:

RUN_MODEL = "unet3d_h100_custom"
RUN_PROCS = 4
RUN_OVERRIDES = ""
GENERATE_DATA_FIRST = True

exp_row = experiment_df[
    (experiment_df['model'] == RUN_MODEL) & 
    (experiment_df['processes'] == RUN_PROCS) &
    (experiment_df['overrides'] == RUN_OVERRIDES)
]

if not exp_row.empty and exp_row.iloc[0]['run'] == 'Y':
    print(f"Experiment {RUN_MODEL} with {RUN_PROCS} procs already completed.")
    print("Set run='N' in the CSV to re-run, or modify parameters above.")
else:
    print(f"Will run: {RUN_MODEL} with {RUN_PROCS} processes")
    if RUN_OVERRIDES:
        print(f"Overrides: {RUN_OVERRIDES}")
    print(f"Generate data first: {GENERATE_DATA_FIRST}")
    print("\nExecute the next cell to start the benchmark.")

In [None]:
experiment_id = get_experiment_id(pd.Series({
    'processes': RUN_PROCS,
    'overrides': RUN_OVERRIDES
}))

if GENERATE_DATA_FIRST:
    if not generate_data(RUN_MODEL, RUN_PROCS, RUN_OVERRIDES):
        raise RuntimeError(f"Data generation failed for {RUN_MODEL}")

if run_benchmark(RUN_MODEL, RUN_PROCS, RUN_OVERRIDES, experiment_id):
    print(f"\n{'='*60}")
    print("SUCCESS!")
    print(f"{'='*60}")
    
    experiment_df = mark_experiment_complete(experiment_df, RUN_MODEL, RUN_PROCS, RUN_OVERRIDES)
    save_experiment_design(experiment_df)
    
    print(f"\nUpdated experiment status for {RUN_MODEL} ({experiment_id})")
else:
    print(f"\n{'='*60}")
    print("FAILED!")
    print(f"{'='*60}")

### 5.3 Run All Pending Experiments

This will run all experiments marked as `N` (pending) in the experimental design.

In [None]:
def run_all_pending_experiments(df: pd.DataFrame) -> pd.DataFrame:
    pending = get_pending_experiments(df)
    
    if pending.empty:
        print("No pending experiments to run!")
        return df
    
    print(f"Found {len(pending)} pending experiments")
    print("\nPending experiments:")
    display(pending)
    
    models = pending['model'].unique()
    
    for model in models:
        print(f"\n{'#'*60}")
        print(f"# Processing model: {model}")
        print(f"{'#'*60}")
        
        model_pending = pending[pending['model'] == model]
        baseline_pending = model_pending[~model_pending['overrides'].str.contains('format|record_length', na=False)]
        if not baseline_pending.empty:
            max_procs = baseline_pending['processes'].max()
            if not generate_data(model, num_procs=max_procs):
                print(f"ERROR: Data generation failed for {model}")
                continue
        
        for idx, row in model_pending.iterrows():
            procs = row['processes']
            overrides = row.get('overrides', '')
            experiment_id = get_experiment_id(row)
            
            if 'format' in overrides or 'record_length' in overrides:
                if not generate_data(model, num_procs=procs, overrides=overrides):
                    print(f"ERROR: Data generation failed for {model} with {overrides}")
                    continue
            
            if run_benchmark(model, procs, overrides, experiment_id):
                df = mark_experiment_complete(df, model, procs, overrides)
                save_experiment_design(df)
                print(f"Marked {model} ({experiment_id}) as complete")
            else:
                print(f"ERROR: Benchmark failed for {model} ({experiment_id})")
    
    return df


pending = get_pending_experiments(experiment_df)
print(f"Pending experiments: {len(pending)}")
if not pending.empty:
    display(pending)
    print("\nExecute the next cell to run all pending experiments.")

In [None]:
experiment_df = run_all_pending_experiments(experiment_df)

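remaining = len(get_pending_experiments(experiment_df))
print("\n" + "="*60)
print("ALL EXPERIMENTS COMPLETE" if remaining == 0 else f"FINISHED WITH {remaining} EXPERIMENT(S) STILL PENDING")
print("="*60)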
plot_experiment_status(experiment_df)

## 6. Results Analysis

### 6.1 Load Benchmark Results

In [None]:
def load_benchmark_results(results_dir: Path) -> pd.DataFrame:
    data = []
    
    for file_path in glob.glob(str(results_dir / "**/summary.json"), recursive=True):
        with open(file_path, "r") as f:
            summary = json.load(f)
        
        path_parts = Path(file_path).parts
        try:
            results_idx = path_parts.index('results')
            model_name = path_parts[results_idx + 1]
            experiment_id = path_parts[results_idx + 2]
        except (ValueError, IndexError):
            model_name = Path(file_path).parent.parent.name
            experiment_id = Path(file_path).parent.name
        
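        # Folder names are "<processes>" or "<processes>_<param>_<value>".
        # Split the value off at the last underscore so that parameter names
        # containing underscores (e.g. "record_length") stay intact.
        procs_str, _, rest = experiment_id.partition('_')
        try:
            processes = int(procs_str)
        except ValueError:
            processes = 0
        
        if rest:
            param_name, _, param_value = rest.rpartition('_')
            if not param_name:
                param_name, param_value = param_value, ''
        else:
            param_name = 'baseline'
            param_value = ''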
        
        num_accelerators = summary.get("num_accelerators", 0)
        metrics = summary.get("metric", {})
        
        data.append({
            'model': model_name,
            'experiment_id': experiment_id,
            'processes': processes,
            'parameter': param_name,
            'value': param_value,
            'accelerator_usage': metrics.get("train_au_mean_percentage", 0),
            'accelerator_usage_std': metrics.get("train_au_stdev_percentage", 0),
            'io_throughput': metrics.get("train_io_mean_MB_per_second", 0),
            'io_throughput_std': metrics.get("train_io_stdev_MB_per_second", 0)
        })
    
    df = pd.DataFrame(data)
    if not df.empty:
        before_count = len(df)
        df = df[~((df['accelerator_usage'] == 0) & (df['io_throughput'] == 0))]
        filtered_count = before_count - len(df)
        if filtered_count > 0:
            print(f"Filtered out {filtered_count} failed runs (AU=0 and I/O=0)")
        
        df = df.sort_values(by=["model", "parameter", "processes"]).reset_index(drop=True)
    
    return df

results_df = load_benchmark_results(RESULTS_DIR)
print(f"Loaded {len(results_df)} valid benchmark results")
results_df
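
Result folders are named after the experiment ID, either `<processes>` or `<processes>_<param>_<value>`. The sketch below shows a round trip between an override string and that ID, assuming the value itself contains no underscore. Both helper names and the example override are hypothetical.

```python
def make_experiment_id(processes: int, overrides: str = "") -> str:
    """Build '<processes>' or '<processes>_<param>_<value>' from one override."""
    if not overrides:
        return str(processes)
    key, _, value = overrides.partition("=")
    param = key.split(".")[-1]  # last dotted component, e.g. 'read_threads'
    return f"{processes}_{param}_{value}"


def parse_experiment_id(experiment_id: str) -> tuple[int, str, str]:
    """Split an ID back into (processes, parameter, value)."""
    procs, _, rest = experiment_id.partition("_")
    if not rest:
        return int(procs), "baseline", ""
    # Split on the LAST underscore so names like 'read_threads' stay whole.
    param, _, value = rest.rpartition("_")
    if not param:  # no underscore left: treat the remainder as the parameter
        param, value = value, ""
    return int(procs), param, value


exp_id = make_experiment_id(4, "++workload.reader.read_threads=8")
print(exp_id, parse_experiment_id(exp_id))
```

Splitting on the first underscore instead would cut a multi-word parameter name in two, so the last-underscore rule is the safer choice under the stated assumption.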

### 6.2 Accelerator Usage vs. Number of Processes

In [None]:
def plot_accelerator_usage(df: pd.DataFrame):
    if df.empty:
        print("No data to plot")
        return
    
    parameters = df['parameter'].unique()
    n_params = len(parameters)
    
    fig, axes = plt.subplots(1, min(n_params, 3), figsize=(6 * min(n_params, 3), 6), squeeze=False)
    axes = axes.flatten()
    
    for idx, param in enumerate(parameters[:3]):
        ax = axes[idx]
        param_df = df[df['parameter'] == param]
        
        if param == 'baseline':
            param_df = param_df.sort_values('processes')
            ax.errorbar(
                param_df['processes'], param_df['accelerator_usage'],
                yerr=param_df['accelerator_usage_std'],
                marker='o', markersize=8, linewidth=2, capsize=5
            )
            ax.set_xlabel('Number of Processes')
        else:
            ax.bar(param_df['value'].astype(str), param_df['accelerator_usage'],
                   yerr=param_df['accelerator_usage_std'], capsize=5)
            ax.set_xlabel(param)
            plt.setp(ax.xaxis.get_majorticklabels(), rotation=45, ha='right')
        
        ax.set_ylabel('Accelerator Usage (%)')
        ax.set_title(f'AU: {param}', fontweight='bold')
        ax.grid(True, alpha=0.3)
        ax.set_ylim(0, 100)
    
    plt.tight_layout()
    plt.savefig("accelerator_usage.png", dpi=150, bbox_inches='tight')
    plt.show()
   
    if n_params > 3:
        remaining = parameters[3:]
        fig2, axes2 = plt.subplots(1, len(remaining), figsize=(6 * len(remaining), 6), squeeze=False)
        axes2 = axes2.flatten()
        for idx, param in enumerate(remaining):
            ax = axes2[idx]
            param_df = df[df['parameter'] == param]
            ax.bar(param_df['value'].astype(str), param_df['accelerator_usage'],
                   yerr=param_df['accelerator_usage_std'], capsize=5)
            ax.set_xlabel(param)
            ax.set_ylabel('Accelerator Usage (%)')
            ax.set_title(f'AU: {param}', fontweight='bold')
            ax.grid(True, alpha=0.3)
            ax.set_ylim(0, 100)
            plt.setp(ax.xaxis.get_majorticklabels(), rotation=45, ha='right')
        plt.tight_layout()
        plt.show()


plot_accelerator_usage(results_df)

### 6.3 I/O Throughput vs. Number of Processes

In [None]:
def plot_io_throughput(df: pd.DataFrame):
    if df.empty:
        print("No data to plot")
        return
    
    parameters = df['parameter'].unique()
    n_params = len(parameters)
    
    fig, axes = plt.subplots(1, min(n_params, 3), figsize=(6 * min(n_params, 3), 6), squeeze=False)
    axes = axes.flatten()
    
    for idx, param in enumerate(parameters[:3]):
        ax = axes[idx]
        param_df = df[df['parameter'] == param]
        
        if param == 'baseline':
            param_df = param_df.sort_values('processes')
            ax.errorbar(
                param_df['processes'], param_df['io_throughput'],
                yerr=param_df['io_throughput_std'],
                marker='s', markersize=8, linewidth=2, capsize=5, color='#2ecc71'
            )
            ax.set_xlabel('Number of Processes')
        else:
            ax.bar(param_df['value'].astype(str), param_df['io_throughput'],
                   yerr=param_df['io_throughput_std'], capsize=5, color='#2ecc71')
            ax.set_xlabel(param)
            plt.setp(ax.xaxis.get_majorticklabels(), rotation=45, ha='right')
        
        ax.set_ylabel('I/O Throughput (MB/s)')
        ax.set_title(f'I/O: {param}', fontweight='bold')
        ax.grid(True, alpha=0.3)
    
    plt.tight_layout()
    plt.savefig("io_throughput.png", dpi=150, bbox_inches='tight')
    plt.show()
    
    if n_params > 3:
        remaining = parameters[3:]
        fig2, axes2 = plt.subplots(1, len(remaining), figsize=(6 * len(remaining), 6), squeeze=False)
        axes2 = axes2.flatten()
        for idx, param in enumerate(remaining):
            ax = axes2[idx]
            param_df = df[df['parameter'] == param]
            ax.bar(param_df['value'].astype(str), param_df['io_throughput'],
                   yerr=param_df['io_throughput_std'], capsize=5, color='#2ecc71')
            ax.set_xlabel(param)
            ax.set_ylabel('I/O Throughput (MB/s)')
            ax.set_title(f'I/O: {param}', fontweight='bold')
            ax.grid(True, alpha=0.3)
            plt.setp(ax.xaxis.get_majorticklabels(), rotation=45, ha='right')
        plt.tight_layout()
        plt.show()


plot_io_throughput(results_df)

### 6.4 Summary Table

In [None]:
if not results_df.empty:
    summary = results_df.copy()
    summary['accelerator_usage'] = summary['accelerator_usage'].round(2)
    summary['io_throughput'] = summary['io_throughput'].round(2)
    
    display_cols = ['model', 'parameter', 'value', 'processes', 'accelerator_usage', 'io_throughput']
    summary = summary[display_cols]
    summary.columns = ['Model', 'Parameter', 'Value', 'Procs', 'AU (%)', 'I/O (MB/s)']
    
    display(summary.style.background_gradient(subset=['AU (%)'], cmap='RdYlGn', vmin=0, vmax=100)
                        .background_gradient(subset=['I/O (MB/s)'], cmap='Blues'))
    
    output_csv = RESULTS_DIR / "benchmark_results.csv"
    summary.to_csv(output_csv, index=False)
    print(f"\nResults saved to: {output_csv}")
else:
    print("No results to display")

## 7. Cleanup

Run this cell to clean up the scratch directory when done.

In [None]:
cleanup_scratch()

## 8. Conclusions

### Key Findings

Based on the analysis of the UNet3D benchmark results on the NVIDIA GH200 Grace Hopper Superchip:

#### 1. Process Scaling

| Processes | AU (%) | I/O (MB/s) |
|-----------|--------|------------|
| 1 | 33.67 | 974 |
| 2 | 36.37 | 2,001 |
| 4 | 58.85 | 5,699 |
| 6 | **86.52** | **10,475** |
| 8 | 77.94 | 9,437 |

- **Optimal scaling at 6 processes**, achieving peak AU (86.5%) and I/O throughput (10.5 GB/s)
- Performance degrades at 8 processes, likely due to resource contention or memory bandwidth saturation
- I/O throughput grows about 10.8x from 1 to 6 processes (974 to 10,475 MB/s), which is faster than linear in the process count
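
To quantify the scaling, the sketch below computes speedup and parallel efficiency from the I/O column of the table, taking the single-process run as the baseline. The efficiency definition (speedup divided by process count) is a common convention assumed here, not a metric reported by the benchmark.

```python
# (processes -> I/O throughput in MB/s), copied from the table above.
io_by_procs = {1: 974, 2: 2001, 4: 5699, 6: 10475, 8: 9437}

baseline = io_by_procs[1]
efficiency = {}
for procs, throughput in io_by_procs.items():
    speedup = throughput / baseline
    efficiency[procs] = speedup / procs  # 1.0 means perfectly linear scaling
    print(f"{procs} procs: speedup {speedup:5.2f}x, efficiency {efficiency[procs]:4.2f}")
```

Under this definition every multi-process run has an efficiency above 1.0, meaning throughput grows faster than the process count relative to the single-process run, and the value peaks at 6 processes.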

#### 2. Data Format Impact

| Format | AU (%) | I/O (MB/s) |
|--------|--------|------------|
| NPZ (baseline) | 58.85 | 5,699 |
| PNG | 21.08 | 2,042 |
| HDF5 | **97.53** | **9,447** |

- **HDF5 is the best of the three formats tested**, delivering 97.5% AU and 66% higher I/O throughput than NPZ
- PNG performs poorly (21% AU), likely due to decompression overhead

#### 3. Read Threads

| Threads | AU (%) | I/O (MB/s) |
|---------|--------|------------|
| 1 | 27.51 | 2,664 |
| 8 | **80.87** | **7,833** |
| 16 | 80.50 | 7,797 |

- Increasing read threads from 1→8 improves AU by about **2.9x** (28% → 81%)
- Diminishing returns beyond 8 threads: 16 threads show no further improvement

#### 4. Batch Size Trade-offs

| Batch Size | AU (%) | I/O (MB/s) |
|------------|--------|------------|
| 1 | **98.44** | 1,661 |
| 4 | 62.36 | 3,834 |
| 14 | 62.26 | **7,538** |

- **Small batches (1)**: Maximize AU (98%) but limit I/O throughput — compute-bound
- **Large batches (14)**: Higher I/O throughput but similar AU to medium batches
- Trade-off: Choose batch size based on whether workload is compute-bound or I/O-bound

#### 5. Shuffling Impact

| Shuffle Type | AU (%) | I/O (MB/s) |
|--------------|--------|------------|
| File shuffle off | 57.64 | 5,583 |
| File shuffle random | 57.19 | 5,539 |
| Sample shuffle off | 57.58 | 5,577 |
| Sample shuffle random | 58.00 | 5,618 |

- **Shuffling has negligible impact** on performance (~1% difference)
- Random access patterns do not significantly degrade I/O on this storage system
- Safe to enable shuffling for training quality without performance penalty

#### 6. Record Size

| Record Size | AU (%) | I/O (MB/s) |
|-------------|--------|------------|
| 10 MB | 72.87 | 505 |
| 512 MB | 19.83 | 7,034 |

- **Larger records favor I/O throughput** (14x higher with 512MB vs 10MB records)
- **Smaller records favor AU** (73% vs 20%)
- The storage system delivers higher throughput with large records, which suggests it favors large sequential reads
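
One way to relate the two columns is to convert throughput into records delivered per second. The sketch below does this under the assumption that one training sample corresponds to one record, so MB/s divided by MB per record approximates records per second. The variable names are illustrative.

```python
# label -> (record size in MB, AU %, I/O throughput in MB/s), from the table above.
runs = {"10 MB": (10, 72.87, 505), "512 MB": (512, 19.83, 7034)}

records_per_second = {}
for label, (record_mb, au, mb_per_s) in runs.items():
    # Assumption: one sample is one record.
    records_per_second[label] = mb_per_s / record_mb
    print(f"{label}: {records_per_second[label]:.1f} records/s at {au:.2f}% AU")
```

Under that assumption the 10 MB configuration delivers roughly 50 records/s against roughly 14 records/s for 512 MB, which is consistent with its higher AU despite the much lower byte throughput.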

### Summary

1. **Scale to 6 processes** for optimal resource utilization on GH200
2. **Use HDF5 format** — provides best AU and I/O performance
3. **Configure 8 read threads** per process for parallel I/O
4. **Batch size selection** depends on workload characteristics (compute vs I/O bound)
5. **Shuffling is safe** to use without performance degradation

---

**Repository:** https://github.com/HpcResearchLaboratory/perf_2025