# 02 GPU Fundamentals

In this section, we'll learn more about modern NVIDIA GPUs, how they work, and what that means for real-time performance.

<img src="images/rtx-gpus.png" width="400"> <img src="images/image053.png" width="600">

## 02a - Graphics Pipeline
GPUs are massively parallel work distributors, and distributing all the work of drawing triangles and pixels on the screen can be very complex. In the old days, each stage of the graphics pipeline was a physical stage in hardware, and the stages ran serially.

This is not very efficient. Moving from a physical pipeline to a logical pipeline gave us a unified, fully parallel architecture in which multiple parts of the GPU (called engines) can be reused.

<img src="images/image033.png" width="800"> 

At a high level, the GPU consists of a set of Graphics Processing Clusters (GPCs) that are connected to a set of Frame Buffer Partitions (FBP) through an on-chip interconnect network called the Crossbar (XBAR). 

The GPC itself is a nearly complete graphics pipeline that integrates Texture Processing Clusters (TPCs) that contain fixed-function rasterization and primitive processing units.

The TPC integrates a number of Texture Units (TEX) and SIMT processors called Streaming Multiprocessors (SMs). SIMT stands for "Single Instruction, Multiple Threads", which should give you a hint as to how those SMs work.

The SMs themselves contain the ALUs (Arithmetic Logic Units) that execute graphics and compute shader programs. SMs process work using Threads, and we call a group of Threads a Warp.
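As a rough illustration of how Threads are grouped into Warps, the sketch below splits a range of thread indices into consecutive groups. The warp size of 32 matches NVIDIA GPUs; the helper names `warps_needed` and `split_into_warps` are made up for this example and are not part of any NVIDIA API.

```python
WARP_SIZE = 32  # threads per warp on NVIDIA GPUs


def warps_needed(thread_count: int) -> int:
    """Number of warps required to cover thread_count threads (ceiling division)."""
    return (thread_count + WARP_SIZE - 1) // WARP_SIZE


def split_into_warps(thread_count: int) -> list:
    """Group thread indices 0..thread_count-1 into consecutive warps."""
    return [
        range(start, min(start + WARP_SIZE, thread_count))
        for start in range(0, thread_count, WARP_SIZE)
    ]


warps = split_into_warps(100)
print(warps_needed(100))  # 4
print(len(warps[-1]))     # 4 -> the last warp is only partially filled
```

Note that 100 threads still occupy four full warp slots even though the last warp only carries four active threads.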

Finally, the FBP contains the memory controller that communicates with the physical memory outside of the GPU as well as the Raster Output (ROP) units and the L2 cache.
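To keep the hierarchy straight, it can help to write it down as a small data structure. This is only a sketch: the class name `GpuLayout` and every count below are hypothetical and do not describe a specific GPU.

```python
from dataclasses import dataclass


@dataclass
class GpuLayout:
    """Toy description of the unit hierarchy; all counts are made-up examples."""
    gpcs: int          # Graphics Processing Clusters
    tpcs_per_gpc: int  # Texture Processing Clusters in each GPC
    sms_per_tpc: int   # Streaming Multiprocessors in each TPC
    fbps: int          # Frame Buffer Partitions (memory controller, ROP, L2)

    def total_sms(self) -> int:
        """SMs are nested GPC -> TPC -> SM, so the counts multiply."""
        return self.gpcs * self.tpcs_per_gpc * self.sms_per_tpc


example = GpuLayout(gpcs=6, tpcs_per_gpc=6, sms_per_tpc=2, fbps=8)
print(example.total_sms())  # 72
```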

<img src="images/image035.png" width="800"> 

To get a triangle rendered on the screen, work moves between the different parts of the logical pipeline.

For instance, at the beginning, API calls provide data to the driver, which then makes it accessible to the GPU. The calls are translated into commands that are sent to the Front End (FE) unit.

<img src="images/image037.png" width="800"> 

The Primitive Distributor (PD) unit starts fetching the data and assembles that into units of work for the GPU to process in efficient batches.

The Primitive Engine (PE) unit issues instructions to start retrieving vertex attributes from memory using the Vertex Attribute Fetch (VAF) unit. This gets fed into the Vertex Shader which is scheduled to execute on an SM.

<img src="images/image039.png" width="800"> 

Let's take a moment to see how this data is represented in Nsight Graphics. Look at the events list and filter for vkCmdDrawIndexed. Clicking on one of these events should show you this:

<img src="images/api-inspector-pipeline.png" width="600"> 

As you can see, Nsight Graphics shows you the state of the pipeline so you can inspect all the values that are being fed to the GPU.

When it's time to rasterize the triangles, the Pre-Raster Operation (PROP) unit receives depth and coverage and determines which pixels need to be shaded based on their visibility. Pixels that pass are then sent to an SM to be shaded.
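The visibility decision described above can be pictured with a toy depth test. This is a software sketch only, not how the PROP unit is implemented; it assumes a "less than" comparison (smaller depth is closer), and `visible_fragments` is a hypothetical helper.

```python
def visible_fragments(fragments, depth_buffer):
    """Return the fragments that pass a 'less than' depth test.

    fragments: iterable of (x, y, depth) tuples, smaller depth = closer.
    depth_buffer: dict mapping (x, y) -> closest depth seen so far;
                  it is updated in place as fragments pass.
    """
    passed = []
    for x, y, depth in fragments:
        if depth < depth_buffer.get((x, y), float("inf")):
            depth_buffer[(x, y)] = depth
            passed.append((x, y, depth))
    return passed


depth_buffer = {}
frags = [(0, 0, 0.5), (0, 0, 0.8), (1, 0, 0.3), (0, 0, 0.2)]
print(visible_fragments(frags, depth_buffer))
# [(0, 0, 0.5), (1, 0, 0.3), (0, 0, 0.2)]
```

The fragment at depth 0.8 is hidden behind an earlier one, so it is rejected and never needs to be shaded.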

After the SM finishes processing this work, the Color Raster Operation (CROP) unit writes the final color to the framebuffer.
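The walk-through above can be summarized as an ordered list of stages. The sketch below simply restates the order given in this section as data; the one-line descriptions are paraphrases, and `trace_draw` is a hypothetical helper used for printing.

```python
from collections import Counter

# Stage order as described in this section; a teaching aid, not a hardware spec.
LOGICAL_PIPELINE = [
    ("FE", "Front End receives commands from the driver"),
    ("PD", "Primitive Distributor fetches data and builds batches of work"),
    ("PE/VAF", "Primitive Engine fetches vertex attributes"),
    ("SM", "Vertex Shader runs"),
    ("PROP", "Pre-Raster Operation resolves depth and coverage"),
    ("SM", "Pixels that pass are shaded"),
    ("CROP", "Color Raster Operation writes the final color"),
]


def trace_draw(pipeline=LOGICAL_PIPELINE):
    """Return one numbered, human-readable line per stage."""
    return [
        f"{step}. {unit}: {what}"
        for step, (unit, what) in enumerate(pipeline, start=1)
    ]


for line in trace_draw():
    print(line)

# The SM shows up twice: the same unit is reused for vertex and pixel work.
print(Counter(unit for unit, _ in LOGICAL_PIPELINE)["SM"])  # 2
```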

<img src="images/image041.png" width="800"> 

Throughout this pipeline there are multiple memory units and caches that help to keep data coherent and quickly accessible between local and shared memory.

These caches have different latencies and purposes:
- The Level 1 Data cache (L1TEX) is per-SM and is important for texture and surface memory reads and writes. 
- The Level 2 Data cache (LTS) is connected to the XBAR and writes CROP data out to the final framebuffer output targets.
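To see why hit rates at each cache level matter, here is a minimal sketch of the textbook average-access-time model for a two-level cache. The latency values are placeholders chosen to keep the arithmetic easy to follow; they are not measurements of L1TEX, LTS, or any real GPU.

```python
def average_access_cycles(l1_hit_rate, l2_hit_rate, l1_cycles, l2_cycles, dram_cycles):
    """Expected cycles per access for a simple two-level cache model.

    l2_hit_rate is measured over the accesses that miss in L1.
    """
    l1_miss = 1.0 - l1_hit_rate
    l2_miss = 1.0 - l2_hit_rate
    return l1_cycles + l1_miss * (l2_cycles + l2_miss * dram_cycles)


# Placeholder latencies, not real hardware numbers.
print(average_access_cycles(0.5, 0.5, l1_cycles=10, l2_cycles=100, dram_cycles=400))
# 160.0
```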

<img src="images/image043.png" width="800"> 

When it comes to Ray Tracing, there is an additional RT Core that accelerates ray tracing operations, but the rest of the pipeline is largely the same. Modern Real-Time Ray Tracing APIs were engineered to interoperate easily with existing graphics applications, engines and middleware. Fast triangle intersection testing is a key part of what makes the RT Core so integral to that goal.
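To give a feel for what a triangle intersection test computes, here is a minimal software sketch of the well-known Möller-Trumbore algorithm. It is purely illustrative: it says nothing about how the RT Core implements intersection in hardware, and the function name `ray_triangle_intersect` is an assumption made for this example.

```python
def ray_triangle_intersect(origin, direction, v0, v1, v2, eps=1e-8):
    """Möller-Trumbore test: return the distance t along the ray, or None on a miss."""
    def sub(a, b):
        return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

    def dot(a, b):
        return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

    def cross(a, b):
        return (a[1] * b[2] - a[2] * b[1],
                a[2] * b[0] - a[0] * b[2],
                a[0] * b[1] - a[1] * b[0])

    edge1, edge2 = sub(v1, v0), sub(v2, v0)
    pvec = cross(direction, edge2)
    det = dot(edge1, pvec)
    if abs(det) < eps:           # ray is parallel to the triangle's plane
        return None
    inv_det = 1.0 / det
    tvec = sub(origin, v0)
    u = dot(tvec, pvec) * inv_det
    if u < 0.0 or u > 1.0:       # outside the triangle along edge1
        return None
    qvec = cross(tvec, edge1)
    v = dot(direction, qvec) * inv_det
    if v < 0.0 or u + v > 1.0:   # outside the triangle along edge2
        return None
    t = dot(edge2, qvec) * inv_det
    return t if t > eps else None  # ignore hits behind the ray origin


tri = ((0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0))
print(ray_triangle_intersect((0.25, 0.25, 1.0), (0.0, 0.0, -1.0), *tri))  # 1.0
print(ray_triangle_intersect((2.0, 2.0, 1.0), (0.0, 0.0, -1.0), *tri))    # None
```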

For more information on how the logical pipeline works on a modern NVIDIA GPU, be sure to check out the detailed explanation here: https://developer.nvidia.com/content/life-triangle-nvidias-logical-pipeline

## 02b - Metrics
Metrics help to relate the state of these hardware units to actual performance. The GPU has different parts that compose the graphics pipeline, and these can be observed using metrics. There are thousands of metrics available, so we curate a specific set based around the critical hardware units that are most likely to influence performance.

<img src="images/metrics1.png" width="400"> 

These categories include:
- TEX
- PE
- L1$
- RTCORE
- FE
- CROP, ZROP

Metrics measure a number of performance-critical properties of that pipeline:
- Throughput
- Request/response counts
- Duration/Timing
- Stall reasons and counts
- Input/output counts
- Active and elapsed cycles
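As an example of how two of these raw counts combine into something readable, the sketch below turns active and elapsed cycle counts into a percentage. The counter values and the helper name `percent_of_elapsed` are hypothetical; they are not actual Nsight Graphics metric names or readings.

```python
def percent_of_elapsed(active_cycles: int, elapsed_cycles: int) -> float:
    """Share of elapsed cycles in which a unit was active, as a percentage."""
    if elapsed_cycles <= 0:
        raise ValueError("elapsed_cycles must be positive")
    return 100.0 * active_cycles / elapsed_cycles


# Hypothetical counter readings for three units over the same sample window.
samples = {"SM": 7_500, "L1TEX": 4_000, "CROP": 1_000}
elapsed = 10_000

for unit, active in samples.items():
    print(f"{unit}: {percent_of_elapsed(active, elapsed):.1f}%")

# The busiest unit is a natural first place to look for a bottleneck.
busiest = max(samples, key=samples.get)
print(busiest)  # SM
```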

Some important metric categories are:
- Front End Pipeline Stalling Commands and Stall Cycles
- Occupancy limiters
- Memory
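To make "occupancy limiter" a little more concrete, here is a simplified sketch that considers only one possible limiter, the register file. All numbers are example inputs rather than the limits of a particular GPU, real hardware allocates registers with a granularity this model ignores, and both function names are hypothetical.

```python
def register_limited_warps(regs_per_thread, regs_per_sm, max_warps_per_sm, warp_size=32):
    """Warps that fit on one SM when only the register file is considered."""
    regs_per_warp = regs_per_thread * warp_size
    return min(max_warps_per_sm, regs_per_sm // regs_per_warp)


def occupancy_percent(resident_warps, max_warps_per_sm):
    """Resident warps as a percentage of the maximum the SM could hold."""
    return 100.0 * resident_warps / max_warps_per_sm


# Example values only; real limits depend on the GPU architecture.
warps = register_limited_warps(regs_per_thread=64, regs_per_sm=65536, max_warps_per_sm=48)
print(f"{warps} warps, {occupancy_percent(warps, 48):.1f}% occupancy")
# 32 warps, 66.7% occupancy
```

In this example a shader that uses fewer registers per thread would let more warps stay resident, which is the kind of relationship the occupancy metrics are meant to expose.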

Let's open up Nsight Graphics again and have a look at the different metrics inside GPU Trace.

## 02c - Quiz

```python
from mcq import create_multipleChoice_widget
Q1 = create_multipleChoice_widget('Which part of the GPU is responsible for executing instructions?', ['Texture Processing Cluster (TPC)', 'Graphics Processing Cluster (GPC)', 'Streaming Multiprocessor (SM)'], 'Streaming Multiprocessor (SM)', "")
Q1
```


```python
from mcq import create_multipleChoice_widget
Q1 = create_multipleChoice_widget('The Front End (FE) is where instructions are scheduled on SMs', ['True', 'False'], 'False', "")
Q1
```


```python
from mcq import create_multipleChoice_widget
Q1 = create_multipleChoice_widget('Modern GPUs use a Physical Pipeline to render geometry', ['True', 'False'], 'False', "")
Q1
```


Next, let's dig into how Real-Time Ray Tracing works so we can better understand the kinds of problems we may face and how the Nsight tools can help.

[Continue to the **Ray Tracing Basics** section](03_ray_tracing_basics.ipynb)