# 04 Common Problems

In this section, we'll talk about some of the common problems you may experience when developing a realtime ray tracing application and how Nsight Graphics can help you solve these problems.

## 04a - Profiling Basics

Nsight Graphics contains a number of great profiling features, but there are a few basic aspects that you should consider no matter which feature you use. Let's go over some of these.

Before you begin:
- Don't run anything in the background: you need a 'clean room' environment
- Close Chrome/Spotify/YouTube/Netflix etc.
- Disable any background application that may be contending for GPU resources
- Figure out if you are GPU or CPU limited
    - Start with Nsight Systems, move to Nsight Graphics if you determine that you are GPU limited
- Use 'remote target' functionality when possible
- Use C++ Captures to maintain reproducible results
- Make sure your GPU clock frequency is locked
- Plug your laptop in to avoid battery throttling

Choosing between Nsight Systems and Nsight Graphics:
- Nsight Systems helps identify performance issues such as:
    - GPU starvation
    - Unnecessary CPU and GPU synchronizations
    - Insufficient CPU parallelization or pipelining as well as overly expensive CPU or GPU algorithms
- It collects process and thread activity and correlates profiling data across CPU cores and GPU queues

The basic rules:
- If you see lots of gaps between command execution & the GPU is sitting idle
    - This is a sign you're CPU limited, continue in Nsight Systems
- If GPU work gets behind the CPU or takes longer than expected to complete
    - You're likely GPU limited, continue in Nsight Graphics
- Stutters occur when the GPU is sporadically interrupted, and they result in a horrible user experience. Identify these using Nsight Systems early in the process
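
The rules above can be turned into a rough first-pass check. The sketch below is only an illustration: the `GpuInterval` type, the function names, and the 20% idle threshold are assumptions made for this example and are not part of Nsight Systems or Nsight Graphics. The busy intervals are assumed to be non-overlapping values that you read off a timeline yourself.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical record of one span of GPU work inside a frame, in milliseconds.
struct GpuInterval {
    double startMs;
    double endMs;
};

// Fraction of the frame during which the GPU had no work (0.0 = always busy).
// Assumes the intervals do not overlap and all lie inside the frame.
double gpuIdleFraction(const std::vector<GpuInterval>& busy, double frameMs) {
    assert(frameMs > 0.0);
    double busyMs = 0.0;
    for (const GpuInterval& interval : busy) {
        busyMs += interval.endMs - interval.startMs;
    }
    return 1.0 - busyMs / frameMs;
}

// Large gaps suggest the CPU is not feeding the GPU fast enough.
// The 0.2 default threshold is an arbitrary illustration value.
std::string likelyLimiter(double idleFraction, double idleThreshold = 0.2) {
    return idleFraction > idleThreshold ? "CPU" : "GPU";
}
```

For example, a 16 ms frame with only 6 ms of GPU work gives an idle fraction of 0.625, which this sketch labels as CPU limited.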

A basic workflow you can easily adopt:
- Look at Top Stalls
- Fix the top stall that results in biggest performance impact
- Repeat with each hot spot until target frame rate (or scene budget) is reached
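
One way to keep track of this loop outside the tool is to record per-marker GPU times and compare them to your budget. This is a minimal sketch under stated assumptions: `PassTiming`, `sortByCost`, and `overBudgetMs` are hypothetical names, and the timings are numbers you would copy from a trace by hand.

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <string>
#include <vector>

// Hypothetical record: one perf marker and the GPU time measured for it.
struct PassTiming {
    std::string marker;
    double gpuMs;
};

// Returns the passes ordered from most to least expensive,
// so the first entry is the next hot spot to look at.
std::vector<PassTiming> sortByCost(std::vector<PassTiming> passes) {
    std::sort(passes.begin(), passes.end(),
              [](const PassTiming& a, const PassTiming& b) { return a.gpuMs > b.gpuMs; });
    return passes;
}

// How many milliseconds still have to be saved to reach the frame budget
// (0.0 once the frame fits).
double overBudgetMs(const std::vector<PassTiming>& passes, double budgetMs) {
    assert(budgetMs > 0.0);
    const double totalMs = std::accumulate(
        passes.begin(), passes.end(), 0.0,
        [](double sum, const PassTiming& pass) { return sum + pass.gpuMs; });
    return std::max(0.0, totalMs - budgetMs);
}
```

After each fix, re-measure, update the timings, and stop once `overBudgetMs` returns zero.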

Be sure to check out the Peak-Performance-Percentage (P3) Analysis Method here: https://developer.nvidia.com/blog/the-peak-performance-analysis-method-for-optimizing-any-gpu-workload/

<img src="images/p3-method.png" width="600"> 

## 04b - Problem 1: Performance

Performance is probably the most significant issue you'll run into, as it depends on many variables that are constantly changing. Graphics programmers are constantly checking in new code that adds performance-heavy features, while artists are creating new assets that may or may not fit within the scene budget. Nsight Graphics provides a number of tools to help identify the reasons for poor performance and ways in which you can improve it.

### GPU Trace

GPU Trace works around the concept of Performance Metrics that are laid out in tables or on a timeline.

<img src="images/gpt4.png" width="800"> 

Unlike C++ captures, Traces can be saved out and referenced later without requiring the original system configuration or replay. You can even open a Windows Trace on a Linux machine.

One important thing to note is that GPU Trace has multiple modes that should be considered based on your needs.

The default mode is called the Throughput metric set. This mode works by capturing metrics in a single pass using special hardware on Turing architecture and Ampere architecture GPUs. It's very fast and efficient, but doesn't give you all the metrics you might need.

<img src="images/gputrace-mode-0.png" width="500"> 

The other mode you should consider is the Advanced metric set. This mode is interesting in that it gathers data over multiple frames and, using a statistical model, combines the data to get more metrics. This mode is incredibly useful and we recommend using it, with the caveat that you have to freeze your scene in order to prevent divergence due to scene variability. It also takes a little longer to finalize the trace.

<img src="images/gputrace-mode-1.png" width="500"> 

The last mode to consider is One-Shot traces. Use this for applications that run once and finish. For instance, perhaps you have an application that renders all the normals for a character and bakes them out to a normal map. You can still use GPU Trace to profile your code even though you may not define a clear begin and end to your frame.

<img src="images/gputrace-mode-2.png" width="500"> 

<font color="yellow">
Let's try something with our sample application. First, take a trace with the defaults in place. Then, increase the ambient occlusion and shadow samples to maximum and take a trace. What do you notice about the two traces?
</font>

### Trace Analysis

GPU Trace is an incredibly powerful tool but with this power comes complexity. In order to help developers better interpret this data, we're introducing a new feature called Trace Analysis.

Trace Analysis provides:
- Hot spots sorted by perf marker and severity
- The percentage of frame time taken 
- Severity heatmap on the frame timeline
- Detailed explanations on multiple categories of performance issues
- Suggestions on how to resolve the performance hot spot

These recommendations come directly from NVIDIA DevTechs and are based on years of experience going through a similar performance triage process on dozens of major game titles.

<img src="images/gpt5.png" width="800"> 

### Shader Profiler

GPUs are massively parallel, so stalls can have a massive impact on performance. For reference, the full GA102 GPU features 84 SMs, each divided into four processing blocks (or partitions). Each of these partitions has a 64 KB register file, an L0 instruction cache, one warp scheduler, one dispatch unit, and sets of math and other units. The four partitions share a combined 128 KB L1 data cache/shared memory subsystem.

Shader Instructions are scheduled on an SM (Streaming Multiprocessor) via a Warp Scheduler.
If an instruction is waiting for another instruction to finish, this dependency creates a stall, since future instructions can't be scheduled on eligible warps.

Stalls limit potential performance by decreasing the number of instructions that can be simultaneously scheduled and executed on SMs and SM sub-partitions. This additional warp latency can directly lead to poor frame rates.

<img src="images/instruction-stalls.png" width="800"> 

Nsight Graphics has a built-in Shader Profiler that lets you visualize where these stalls are occurring and what you should do to address them.

<font color="yellow">
Let's open up the Shader Profiler. To start, select the ray tracing API event, right-click, and select Profile Shaders.
</font>

<img src="images/image072.png" width="800">

You can see the different stall reasons by hovering over the values, or check out the documentation for more information. Stall reasons explain why a warp was unable to issue an instruction. Each stall reason is provoked by a distinct set of conditions or instructions; by eliminating those conditions or transforming code from one set of instructions to another, you can reduce stalls. The stall types are:

<img src="images/shaderprofiler_stallreasons_table.png" width="800">

<font color="yellow">
Take a few moments to read through the stall reasons below. Open up the Shader Profiler for a trace rays call in our application and see if you can identify some of these stalls happening in our frame.
</font>

- __Barrier: Compute warps are waiting for sibling warps at a GroupSync.__
    - If the thread group size is 512 threads or greater, consider splitting it into smaller groups. This can increase eligible warps without affecting occupancy, unless shared memory becomes a new occupancy limiter.
    - Review whether all GroupSyncs are really necessary.
- __Dispatch Stall: A pipeline interlock prevented instruction dispatch for a selected warp.__
- __Drain : Exited warp is waiting to drain memory writes and pixel export.__
- __LG Throttle : Input FIFO to the LSU pipe for local and global memory instructions is full.__
    - Avoid using thread-local memory.
    - Are dynamically indexed arrays declared in local scope?
    - Does the shader have excess register pressure causing spills?
    - Eliminate redundant global memory accesses (UAV accesses).
    - Data organization: pack UAV or SRV data to allow 64-bit or 128-bit accesses in place of multiple 32-bit accesses.
- __Long Scoreboard : Waiting on data dependency for local, global, texture, or surface load.__
    - Find the instruction or line of code that produces the data being waited upon; that instruction is the culprit.
    - Consider transforming a lookup table into a calculation.
    - Consider transforming global reads in which all threads read the same address into constant buffer reads.
    - If L1 hit rate is low, try to improve spatial locality (coalesced accesses).
    - If VRAM Throughput is high, try to improve spatial locality (coalesced accesses).
- __Math Pipe Throttle : A math pipe input FIFO is full (FMA, ALU, FP16+Tensor).__
    - This stall reason implies being computationally bound. Use the Range Profiler to best determine how to move computation to a different execution unit.
- __Membar : Waiting for a memory barrier to return.__
    - Memory barriers are issued by GroupMemoryBarrier, DeviceMemoryBarrier, AllMemoryBarrier, and their GroupSync variants.
    - Review whether the specified scope of each barrier in the shader is really needed. Group-level barriers resolve much faster than Device-level.
    - Review whether a memory barrier is needed at all. A compute shader where each thread writes to a unique UAV location does not require a memory barrier.
- __MIO Throttle : The input FIFO to MIO is full.__
    - May be triggered by local, global, shared, attribute, IPA, indexed constant loads (LDC), and decoupled math.
- __Misc : A stall reason not covered elsewhere.__
- __Not Selected : Warp was eligible but not selected, because another warp was.__
    - High “not selected” could indicate an opportunity to increase register or shared memory usage (lowering occupancy) without impacting performance, opening the door to greater shader complexity or improved quality.
- __Selected : Warp issued an instruction. Technically not a stall.__
- __Short Scoreboard : Waiting for short latency MIO or RTCORE data dependency.__
    - Includes 3D attribute load/store, pixel attribute interpolation, compute shared memory load/store, indexed constant loads, transcendentals (rcp, rsqrt, …) through the SFU pipe (aka XU pipe), VOTE, and a few other infrequent instructions.
- __TEX Throttle : The TEXIN input FIFO is full.__
    - Try issuing fewer texture fetches, surface loads, surface stores, or decoupled math operations.
    - Check whether the shader is using decoupled math (usually to be avoided).
    - Consider converting texture lookups or surface loads into global memory lookups (UAVs). Texture can accept 4 threads’ requests per cycle, whereas global accepts 32 threads.
- __Wait : Waiting for coupled math data dependency (FMA, ALU, FP16+Tensor).__
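
The data-organization advice in the list above (packing UAV or SRV data for 128-bit accesses) starts with how the buffer is laid out on the CPU side. The sketch below shows one possible layout; `PackedSample` and `packSamples` are hypothetical names for this example. Whether the shader compiler actually emits a single 128-bit load also depends on how the buffer is declared and read in the shader, so treat this as a starting point to verify in the Shader Profiler.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical per-element record, packed so one element is exactly 16 bytes
// (128 bits) and can be read as a single four-component value in the shader.
struct alignas(16) PackedSample {
    float x, y, z;
    float weight;
};
static_assert(sizeof(PackedSample) == 16, "one element must be 128 bits");

// Interleaves four separate float arrays (four 32-bit reads per element)
// into one tightly packed array (one 128-bit read per element).
std::vector<PackedSample> packSamples(const std::vector<float>& xs,
                                      const std::vector<float>& ys,
                                      const std::vector<float>& zs,
                                      const std::vector<float>& weights) {
    assert(xs.size() == ys.size() && ys.size() == zs.size() && zs.size() == weights.size());
    std::vector<PackedSample> packed;
    packed.reserve(xs.size());
    for (std::size_t i = 0; i < xs.size(); ++i) {
        packed.push_back(PackedSample{xs[i], ys[i], zs[i], weights[i]});
    }
    return packed;
}
```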

Another benefit of the Shader Profiler is being able to see the entire compilation chain:
- High Level -> Intermediate -> Assembly
    - DirectX:
        - HLSL -> DXIL -> SASS
    - Vulkan:
        - GLSL|HLSL -> SPIR-V -> SASS

This can be very useful if you're trying to understand how your code is being optimized under the hood and what is actually getting to the GPU. Note that SASS is available in the Pro version of Nsight Graphics (available to select developers that we have an NDA with).

## 04c - Problem 2: API misconfiguration

Modern APIs like DirectX and Vulkan are very complex, and with the addition of DXR and Vulkan Ray Tracing, it's even easier to misconfigure the API, resulting in hard-to-spot bugs. With Nsight Graphics, you can inspect the API configuration to ensure that things are as expected.

Let's look at the basic configuration for the Trace Rays call in our ray tracing application.

<img src="images/opaque1.png" width="800">

Another good thing to verify is that you are passing the proper data to your shaders.

<img src="images/opaque2.png" width="500">

To ensure that applications exhibit proper behavior by default, ray tracing APIs treat all objects as non-opaque unless told otherwise. The reason is that if you terminate a ray on an object that should be translucent, you will get incorrect rendering results. In general, this behavior is a good thing; however, it comes with its own problem. Because rays never terminate on intersection, you may incur an additional performance penalty when the SM has to read back the results of a ray intersection. To prevent this, you should ensure that all opaque objects in the scene are properly marked with this flag.

- In DXR, this flag is: `D3D12_RAYTRACING_GEOMETRY_FLAG_OPAQUE`
- In Vulkan Ray Tracing, this flag is: `VK_GEOMETRY_OPAQUE_BIT_KHR`
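
As a rough illustration of how an application might decide when to set this flag, here is a minimal sketch. It does not include the real SDK headers: `kGeometryFlagNone`, `kGeometryFlagOpaque`, `MaterialInfo`, and `geometryFlagsFor` are stand-ins invented for this example. In real code you would use the API constants named above and write the result into the `Flags` member of `D3D12_RAYTRACING_GEOMETRY_DESC` or the `flags` member of `VkAccelerationStructureGeometryKHR`.

```cpp
#include <cassert>
#include <cstdint>

// Stand-ins so the sketch compiles without the DirectX or Vulkan headers.
// Both D3D12_RAYTRACING_GEOMETRY_FLAG_OPAQUE and VK_GEOMETRY_OPAQUE_BIT_KHR
// have the value 0x1 in their respective headers.
constexpr std::uint32_t kGeometryFlagNone = 0x0;
constexpr std::uint32_t kGeometryFlagOpaque = 0x1;

// Hypothetical material description used only for this example.
struct MaterialInfo {
    bool alphaTested;
    bool translucent;
};

// Geometry that never needs per-hit evaluation (no alpha test, no
// translucency) can safely be marked opaque; everything else keeps the
// default non-opaque behavior.
std::uint32_t geometryFlagsFor(const MaterialInfo& material) {
    const bool needsPerHitEvaluation = material.alphaTested || material.translucent;
    return needsPerHitEvaluation ? kGeometryFlagNone : kGeometryFlagOpaque;
}
```

Deriving the flag from material properties in one place makes it harder to forget on newly added geometry, which is the kind of omission the opaque filter in the viewer helps you spot.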

Here is what the Acceleration Structures look like by default in the viewer:

<img src="images/opaque3.png" width="800">

And here is a special filter you can apply to visualize if you are using the opaque flag correctly.

<img src="images/opaque4.png" width="800">



## 04d - Problem 3: Graphics Bugs

Another issue you have likely experienced before is graphics bugs. Finding these can be very difficult without the proper tools, but Nsight Graphics makes this much easier.

<font color="yellow">
Let's have a look at EndeavRTX. If you bring up the output from the Trace Rays call, you'll see that it's actually pretty dark.
</font>

<img src="images/gamma1.png" width="800">

One thing we could do is just bring up the brightness of the lights in the world but that does not resolve the underlying issue, which is that the contrast also looks off. It's easy to come to the conclusion that we are not properly gamma correcting. There is an easy way to test this theory - let's add gamma correction to the raygen shader.

While we _could_ disconnect the tool, update the shader, recompile then run the tool again, this is obviously a bit time intensive. There is a much better way that will allow you to more rapidly iterate on issues like this in the future. Let's test out dynamic shader editing.

Open up the shader and click 'edit'. Scroll down to around line 187. You should see this:

<img src="images/gamma2.png" width="800">

Add this code:
`hitValues = pow(hitValues, 1.0f / 2.2f);`

This is a very basic gamma correction, and we recommend using something better suited to your application, but for the purpose of this demonstration it works for us. After clicking 'compile', let's see what it looks like now:
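
To get a feel for what this one-line change does to the numbers, the same power-law curve can be evaluated on the CPU. This is a small sketch, not code from the sample application: `gammaEncode` is a hypothetical helper that applies the same `pow(x, 1.0f / 2.2f)` operation to a single channel.

```cpp
#include <cassert>
#include <cmath>

// Encodes one linear-light channel value with a simple power-law gamma of 2.2,
// the same per-channel operation as the shader line above.
float gammaEncode(float linear) {
    // A negative base with a fractional exponent would yield NaN.
    assert(linear >= 0.0f);
    return std::pow(linear, 1.0f / 2.2f);
}
```

A linear value of 0.2 encodes to roughly 0.48, while 0 and 1 stay fixed. That is why the dark mid-tones in the image become visibly brighter without the highlights clipping.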

<img src="images/gamma3.png" width="800">

The contrast is dramatically improved and the distribution of bright and dark regions is much more visually pleasing. We can now port this change back to the main application since we've confirmed that the fix works.

## 04e - Problem 4: GPU Crashes

Some of the most common and frustrating ray tracing problems are crashes that occur on the GPU. These can be due to misconfiguration or even run-time changes that cause a shader to take too long to execute and hang the GPU. Normally, these issues are somewhat difficult to identify and resolve, especially for users out in the wild. To help you with this, we created Nsight Aftermath. It provides:
- GPU mini-dumps
- Integration into your application
- GPU state information
- Breadcrumb Markers
- Source code correlation 

<img src="images/aftm1.png" width="500">

Nsight Aftermath is packaged as a straightforward-to-integrate SDK. You can enable it on a user's machine, and once a crash occurs, a GPU Crash Dump file is generated that you can save and inspect later.

To get the most out of the SDK, insert custom markers in your code that are descriptive enough to serve as breadcrumbs you can follow.

Nsight Aftermath dump files can be accessed via the API we provide to you in the SDK, but we've simplified things for you and provide a full dump file viewer directly in Nsight Graphics. This makes it easy to see application information, including the information provided to the description callback, a basic overview of the GPU state and crash info.

Let's have a quick look at one we packaged in the reports folder.

In this example, we can see the active warps showing the status of all threads that were running on the GPU and the nearest marker to the crash. 

<img src="images/aftm2.png" width="600">

When shader debug info is present, we are also able to correlate the crash to a specific line of code. The crash here is known as a TDR (timeout detection and recovery), which was the result of an accidental infinite loop.

Check out the SDK on the NVIDIA Developer Website for more information on how to integrate this into your own application.


------

This concludes our tour of Nsight Graphics and Ray Tracing concepts. Let's take a few moments for coffee and questions then move on to the next part of the lesson.

<img src="images/cake.png" width="80">

Now let's take a deep dive on how to identify and resolve a performance issue with a game that uses Path Tracing. 

[Continue to the **Deep Dive** section](../systems/nsight-systems-deepdive.ipynb)