# GPU Performance Engineering

This tutorial explores how to evaluate and optimize the performance of GPU-accelerated code.
The goal is to:
- Understand the expected optimal performance of your code.
- Identify how far the actual performance deviates from the optimal.
- Explore reasons for performance gaps and practical methods to bridge them.

### What This Tutorial Covers


- **GPU architecture fundamentals:** Learn how GPUs are designed and how they execute code.
- **Simple performance modeling:** Use basic models to predict performance characteristics.
- **Micro-benchmarks:** Isolate specific hardware effects and understand performance limits.
- **Performance engineering workflow:** Follow a structured process to optimize real-world applications.
- **Tools and techniques:** Use NVIDIA tools such as Nsight Systems and Nsight Compute, and NVTX markers for profiling.

### What This Tutorial Does NOT Cover

- **Algorithm engineering:** While algorithm choice and tuning are critical, they are outside the scope of this tutorial.

## What Is Performance?

Performance is generally defined as the amount of useful work done per unit of time.
It can be expressed either as the work done in a given time interval or as the time required to perform a fixed amount of work.

Different performance *metrics* help evaluate and categorize performance:

### 1. Time-Based Metrics

**Time to Solution (TTS)**: Measures total execution time.

* ✅ Very easy to set up
* ✅ Captures all effects at once
* ❌ Captures all effects at once
* ❌ Comparing different applications (or different parameterizations of the same application) is challenging
* ❌ Assessment of potential performance improvements is almost impossible

**Iterations per Second (It/s)**: Normalizes execution time by the number of iterations, so runs with different iteration counts can be compared.
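
As a minimal sketch of both time-based metrics, the helper below times a loop and derives TTS and It/s from it. The names `measure` and `step` are hypothetical and stand in for one iteration of your application; this is an illustration, not the measurement code used later in the tutorial.

```python
import time


def measure(step, n_iterations):
    """Run `step` n_iterations times.

    Returns (time to solution in seconds, iterations per second).
    """
    start = time.perf_counter()
    for _ in range(n_iterations):
        step()
    # Note: if `step` launches GPU work asynchronously, the device must be
    # synchronized here before reading the timer, otherwise only the launch
    # overhead is measured.
    tts = time.perf_counter() - start
    return tts, n_iterations / tts


if __name__ == "__main__":
    # Placeholder workload standing in for one application iteration
    tts, its = measure(lambda: sum(range(10_000)), n_iterations=100)
    print(f"TTS: {tts:.6f} s, {its:.1f} It/s")
```

In practice, a few untimed warm-up iterations before the measured loop help exclude one-time costs such as initialization.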

### 2. Application-Specific Metrics

**Mega Lattice Site Updates per Second (MLUPS)**: Counts how many grid cells (lattice sites) are updated per second, in millions.

* ✅ Allows comparing the performance of different problem or work sizes
* ✅ Can be related to other metrics (see below) more easily
* ❌ Limited insight into optimization potential
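
To make the definition concrete, here is a small sketch that computes MLUPS for a 2D grid. It assumes every cell is updated exactly once per iteration; the function and parameter names are illustrative.

```python
def mlups(nx, ny, n_iterations, time_s):
    """Mega lattice site updates per second for an nx-by-ny grid.

    Assumes each of the nx * ny cells is updated once per iteration.
    """
    total_updates = nx * ny * n_iterations
    return total_updates / (time_s * 1e6)


if __name__ == "__main__":
    # Example numbers chosen for illustration only
    print(f"{mlups(1024, 1024, 100, 1.0):.4f} MLUPS")
```

Because the grid size appears in the numerator, runs with different problem sizes yield directly comparable values.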

### 3. Hardware-Centric Metrics

**Floating Point Operations per Second (FLOPS)**,
**Integer Operations per Second (IOPS)**,
**Instructions per Second (IPS)**, and
**Memory Bandwidth (BW) - Bytes per Second**.

* ✅ Allow estimation of performance limits
* ✅ Enable performance *prediction* for different hardware platforms
* ❌ Require profiling tools or assumptions about the workload (bytes transferred, etc.)
* ❌ Relating profiling results back to the application can be challenging
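
The sketch below illustrates how a hardware-centric metric can give a performance limit for an application-specific one. It assumes a purely memory-bound update that moves a fixed number of bytes per lattice site; the 16 bytes per update (one double read, one double written), the peak bandwidth, and the measured MLUPS value are all placeholder assumptions, not properties of any particular GPU or code.

```python
def max_mlups(peak_bandwidth_bytes_per_s, bytes_per_update):
    """Upper bound on MLUPS if memory bandwidth is the only limit."""
    return peak_bandwidth_bytes_per_s / bytes_per_update / 1e6


def achieved_bandwidth(measured_mlups, bytes_per_update):
    """Memory bandwidth (bytes/s) implied by a measured MLUPS value."""
    return measured_mlups * 1e6 * bytes_per_update


if __name__ == "__main__":
    peak_bw = 1.0e12        # assumed peak bandwidth: 1 TB/s
    bytes_per_update = 16   # assumed: one 8-byte read + one 8-byte write
    measured = 31250.0      # assumed measurement in MLUPS

    limit = max_mlups(peak_bw, bytes_per_update)
    fraction = achieved_bandwidth(measured, bytes_per_update) / peak_bw
    print(f"Bandwidth-bound limit: {limit:.0f} MLUPS")
    print(f"Fraction of peak bandwidth used: {fraction:.0%}")
```

The same calculation with another platform's peak bandwidth gives a first-order prediction for that hardware, under the same assumptions.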

### Alternative Considerations

Performance can also be measured as *work per unit of energy consumed*, which is crucial for power-efficient computing.
Other metrics are conceivable as well, such as investment cost relative to energy to solution, in USD per joule.
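
As a rough sketch of an energy-based metric, the function below estimates lattice updates per joule. It assumes the average power draw over the run is known and approximately constant, so that energy is simply power times time; the names and numbers are illustrative.

```python
def updates_per_joule(n_updates, avg_power_watts, time_s):
    """Work per energy, assuming constant average power over the run."""
    energy_joules = avg_power_watts * time_s  # energy = power * time
    return n_updates / energy_joules


if __name__ == "__main__":
    # Assumed values: 1e9 updates in 2 s at an average of 250 W
    print(f"{updates_per_joule(1e9, 250.0, 2.0):.3e} updates per joule")
```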

## Next Step

We start by investigating some of these metrics using a straightforward test case.
Head over to the [Stencil Test Case](./stencil-test-case.ipynb) notebook to get started.