# Deep Dive on improving Ray Tracing performance by keeping the GPU Utilized

## What is Nsight Systems?

Nsight Systems is a performance analysis tool focusing on profiling how many aspects of your system are interacting, including your CPU, GPU, and OS. It features:
- System-wide application algorithm tuning
- Multi-process tree support
- Presenting activity on the CPU, GPU, and how they correlate
- Visualization of gaps of unused GPU time
- Long multi-frame stutter analysis
- Hot and cold spot analysis
- Scaling for multiple GPUs
- Even improving your CPU usage
- Support for many APIs such as: D3D11/12/DXR, Vulkan/Vulkan Ray Tracing, OpenGL, OptiX, CUDA
- Support for multiple platforms: Linux, Windows, x86, ARM
- Target can be local or remote

Command line interfaces are available (see documentation), though today we will focus on profiling from the GUI.

Here's what it looks like:

<img src="images/nsight_systems.png" width="700">


## About this lesson

- Understand what a report and its timeline look like, to learn the tool's features
- Identify a common problem in CPU-GPU interactions
- Learn how to set up a project and capture a report like this
- See how removing the issue changes the timeline
- See how a Linux report differs from a Windows report

To make it easier to follow we will:
- <mark>**Mark actions we want you to take in highlight**</mark>
- <span style="color:purple">Mark hints with this font color</span>

The link to the VNC for this lab can be generated by running the following section: 

    %%js
    // window.location.port is an empty string on the default port, so omit the ":" in that case.
    var rawPort = window.location.port;
    var port = ((rawPort === "" || rawPort == 80) ? "" : (":" + rawPort));
    var url = 'http://' + window.location.hostname + port + '/nsight/vnc.html?resize=scale';
    // Show the VNC URL followed by the session password in the cell output.
    element.append(url);
    let p = document.createElement("p");
    p.append("Password: nvidia");
    element.appendChild(p);

## About the target application

We will look at a modified version of Quake II RTX, with a performance problem introduced for the sake of this training. We wanted to use a game complex enough to be a proxy for your application, yet still simple enough to allow you to learn the basics.  Quake II RTX is NVIDIA's attempt at implementing a fully functional version of Id Software's 1997 hit game Quake II with RTX path-traced global illumination.  

A launcher script is already placed in the container's PATH. To run the game (with a demo recording and FPS counter) outside the tool, open a terminal window and run "q2rtx". The demo will exit automatically after a set amount of time.

## Launching the tool
Installing Nsight Systems normally adds an icon to your application launcher, whether on Linux (via the DEB and RPM packages) or on Windows. <mark>**We have placed an Nsight Systems icon on your desktop. Please double-click it.**</mark>



<img src="images/nsight_systems_desktop_icon.png" width="150">


## Opening an existing report

Nsight Systems will open. On the left side you can see the Project Explorer tree. It contains all the projects and report files captured or imported to the tool. Projects act as folders, remembering profiling settings, as well as holding the reports generated from the project.

<img src="images/nsight_systems_project_explorer.png" width="1000">

Report files can be shared with others! If you open them directly (File/Open), they appear flat here, like the Windows report.

The primary project is already open. In the Project Explorer, click on the triangle next to the "Quake II RTX" project to expand the list of reports captured by the project. <mark>**Now double-click on the "Before" report.**</mark> It should take a few seconds to load.
 

## A tour of a report
By now you should be looking at a report. Later in this lab, you will generate a report that looks like this one, captured "Before" addressing the common CPU-GPU interaction problem. But first, let's analyze this report to figure out the source of our issue. <mark>**In a later section we will investigate a performance problem, so pay attention to any rows or statistics that may indicate an issue.**</mark> Don’t worry, we’ll help point them out along the way!  Now, let's discuss the sections and rows of the report.

<img src="images/nsight_systems_before_first.png" width="1000">

We can see on the top is a timeline. On the bottom is an empty box with instructions on how to view events.

This event box can also be filled with sampling data. <mark>**Click the combo box on the left just below the splitter in the middle of the report,**</mark> currently set to "Event View". <mark>**Select "Bottom-Up View"**</mark> and you should see it fill in a function call tree that looks like this.

<img src="images/nsight_systems_bottom-up.png" width="1000">

    
These are thread call-stack samples. They are collapsed by function calls so that you can see how they break down statistically over the entire report. You can take a moment to explore this, expanding rows to see the function calls from the call stacks that led to each sample location. Since this view focuses on CPU call stacks, it is mostly relevant for CPU-bound applications. Since we do not expect this application to be CPU-bound, let's minimize the view. <mark>**Grab the splitter between the timeline and event viewer and now drag it down**</mark> so that we can concentrate on the timeline.

<img src="images/nsight_systems_timeline_initial.png" width="1000">
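To make the idea behind the Bottom-Up View more concrete, here is a minimal Python sketch of that kind of aggregation. It is not how Nsight Systems implements the view; the function `bottom_up` and the sample stacks are hypothetical and only illustrate collapsing call-stack samples by their innermost function.

```python
from collections import Counter


def bottom_up(samples):
    """Summarize call-stack samples by their innermost (leaf) function.

    Each sample is a list of function names ordered from the outermost
    caller to the innermost callee, so the leaf is the last element.
    Returns (function, percent of samples) pairs, hottest first.
    """
    leaf_counts = Counter(stack[-1] for stack in samples if stack)
    total = sum(leaf_counts.values())
    return [(name, 100.0 * count / total)
            for name, count in leaf_counts.most_common()]


# Hypothetical samples, for illustration only.
example_samples = [
    ["main", "render_frame", "poll"],
    ["main", "render_frame", "poll"],
    ["main", "update_world", "memcpy"],
    ["main", "render_frame", "poll"],
]
```

Calling `bottom_up(example_samples)` ranks `poll` first because it is the leaf of three of the four samples. Expanding a row in the real view corresponds to walking back up the callers of that leaf.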

### Zooming

The timeline contains a lot of data and may look noisy when zoomed out so far. Zoom in with any of the following:

- CTRL + mouse wheel scroll up/down (this is our favorite)
- Keyboard + or - keys after clicking on the timeline for focus
- Mouse left-click+drag followed by shift+z or right click for context menu.
    - Pressing Z without shift will zoom in as well as filter the call-stacks view below to only show statistics from the selected range. Handy!
- Backspace to undo any of these zoom actions

<mark>**Zoom in until we're looking at a few frames (a range of ~100ms or less) in the middle of the report,**</mark> as it will be easier to see the frame patterns.

<img src="images/nsight_systems_timeline_100ms.png" width="1000">

### CPU HW section
At the very top we have the CPU section. The overview is the aggregate utilization of all cores. If you <mark>**expand the row by clicking the triangle icon in the row header,**</mark> you can see how much activity is on each CPU core. If your CPU has hyperthreading enabled, the cores displayed will be the virtual cores, not the physical ones.

You can see the CPU isn't very active. Still, this does not mean that the problem is entirely on the GPU side! The issue could be system-wide, such as an incorrect CPU-GPU interaction.

### GPU HW section
This section contains information not tied to a specific process. This includes GPU performance metrics samples, which are a new feature in this version of Nsight Systems. It may also contain GPU context switch information, if collected, as well as vertical sync (VSYNC) timing, for machines that support it. This report contains only the GPU performance metrics samples, since the container configuration does not support VSYNC, and GPU context switches were not traced since they typically would not be relevant to the performance of a game exclusively controlling the GPU.

Here you can see metrics that we can collect in a single pass for long periods of time. <mark>**Scroll down and observe the various metrics captured in this report**</mark>:
- Clock frequencies
- Graphics processor overall active percentage
- Numbers of shader threads in flight, broken down by type (compute, graphics draw and dispatch)
- Graphics work starting points
- Percentage of the Streaming Multiprocessors (SMs) that are active
- Instruction throughputs
- Warp occupancies
- DRAM (graphics memory) Bandwidth
- PCIe (bus) Bandwidth

For a GPU-intensive application, ideally <span style="color:purple">"GR active" and "SM active" rows should be as full and tall as possible,</span> indicating that you have filled the GPU in both time and width. The latter is not always achievable for SM active, depending on the quantity and dimensions of the work. <span style="color:purple">Gaps in these rows indicate GPU idle times, while low fill indicates sub-optimal occupancy at this zoom level, which shows the average of the samples within the time that each pixel represents.</span> In this report, you can see some gaps on the GPU where there is no activity. That's going to be an interesting area to focus on. <span style="color:purple">Bubbles on a GPU could mean there is a problem in keeping the GPU busy: submitting too little work, not submitting frequently enough, or synchronizing excessively.</span>
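If you were to export a utilization row such as "GR active" as a list of periodic samples, spotting the gaps could be sketched as follows. This is an illustration only: the function name `find_idle_gaps`, the 5% idle threshold, and the sample values are assumptions, not anything Nsight Systems defines.

```python
def find_idle_gaps(samples, period_ms, idle_threshold=5.0, min_gap_ms=1.0):
    """Return (start_ms, end_ms) spans where utilization stays below a threshold.

    samples: utilization percentages (0-100) taken at a fixed period.
    idle_threshold and min_gap_ms are illustrative defaults.
    """
    gaps = []
    gap_start = None
    for index, value in enumerate(samples):
        time_ms = index * period_ms
        if value < idle_threshold:
            if gap_start is None:
                gap_start = time_ms  # a new idle span begins here
        elif gap_start is not None:
            if time_ms - gap_start >= min_gap_ms:
                gaps.append((gap_start, time_ms))
            gap_start = None
    # Close a gap that runs until the end of the capture.
    end_ms = len(samples) * period_ms
    if gap_start is not None and end_ms - gap_start >= min_gap_ms:
        gaps.append((gap_start, end_ms))
    return gaps


# Hypothetical "GR active" samples, one per millisecond.
gr_active = [90, 95, 0, 0, 0, 88, 2, 91, 0, 0]
```

Raising `min_gap_ms` filters out the short bubbles, so only the longer idle spans remain. That mirrors what you do visually when you look for the widest gaps in the row.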

### Processes and threads

Below the system-wide hardware we have process information and their threads. <mark>**Scroll down a bit so that the "Processes" header is at the top of the timeline,**</mark> like this picture:

<img src="images/nsight_systems_timeline_processes_threads.png" width="1000">

Now we will dive into these sub-rows.

#### Frame duration
This is based on how fast the game is presenting on the CPU and GPU. It's colored based on a target frame-rate that defaults to 60 FPS (frames per second). This is a great indicator to find problematic individual frames when you are trying to eliminate those annoying stutters before you ship a game.
Note that the frames on the CPU and GPU are numbered by their index from the report's start. <span style="color:purple">A healthy graphics application that correctly parallelizes work between the CPU and GPU will usually have the CPU frame precede its GPU counterpart by 1 or even 2 or more frames.</span> Strive to make sure that the GPU always has work to do and is never waiting for more work from the CPU. Part of accomplishing this is also avoiding having the CPU wait for the GPU to finish.  
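The frame-rate target translates into a per-frame time budget, which is about 16.7 ms at 60 FPS. The sketch below flags frames that exceed that budget. The function name `frames_over_budget` and the durations are hypothetical, and the exact coloring rules used by the tool are not reproduced here.

```python
def frames_over_budget(durations_ms, target_fps=60.0):
    """Return the indices of frames that took longer than the target budget."""
    budget_ms = 1000.0 / target_fps  # about 16.67 ms at 60 FPS
    return [index for index, duration in enumerate(durations_ms)
            if duration > budget_ms]


# Hypothetical frame durations in milliseconds.
frame_durations = [16.0, 17.0, 33.4, 15.9]
```

With the default 60 FPS target, frames 1 and 2 are flagged. With a 30 FPS target only frame 2 exceeds the roughly 33.3 ms budget.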

#### Vulkan HW
This is a per-process representation and trace of Vulkan workloads on the GPU hardware. It is broken down into rows representing the Vulkan API queues, which usually correspond to the hardware's scheduler queues. The ranges we see are the command buffers as they are being executed on the GPU. Click the triangle button next to one of the queue names to expand its sub-row, showing the CPU-side API calls related to this queue, aggregated from all the process's threads. <span style="color:purple">Comparing the row and its sub-row, especially for the busier queues, can indicate whether the Vulkan API is properly parallelizing the CPU and GPU work.</span>

Finally, you can see a command creation row which shows the time ranges when all the application threads were creating command buffers. For applications that use multiple worker threads to create command buffers (which is a best practice for applications that feed a lot of data to the GPU), this may help you spot command buffers that took longer to create than expected. <span style="color:purple">This application only uses one thread to create command buffers, so this row does not show much relevant information.</span>

#### Threads
<mark>**We recommend scrolling down to look at the application "Threads" header,**</mark> but it is technically in view at approximately the middle of the picture above.

Each thread has a lot of information. We want you to be able to track GPU issues back to their CPU origins and understand what may have led up to situations like work being submitted later than expected.

First, you can see the thread overview and child rows are API calls. The thread overview is broken down into several lanes:
- Utilization - the top green bars. The areas which are not filled to the full height indicate that on a sub-pixel level, that area of the graph is not fully covered by time the thread was running. You can zoom in to see the details.
- CPU core occupancy - under each utilization bar, a color-coded strip indicates which core it is occupying, matching the colors in the CPU Hardware sub-rows from the top of the report. If it jumps around, it may thrash the caches, resulting in more cache misses.
- State - The strip under the core indicators shows the thread state, usually running or blocked. You can sometimes get blocked-state call stacks if you look at tooltips.  On Linux this requires OS Runtime Trace to also be enabled.
- Call stack samples - The orange notches under the state strip. If one or more of the call stack sampling options are enabled, a notch will appear on the timeline whenever a call stack was captured. <mark>**Hover the mouse over some of the notches to see the detailed call stack for that thread at that time in a tooltip.**</mark> <span style="color:purple">This can be helpful in identifying and contextualizing the code locations of issues once you find them on the timeline.</span>
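The partially filled utilization bars described above can be thought of as per-pixel averages. The following sketch, which assumes non-overlapping running intervals and uses the hypothetical name `pixel_coverage`, shows how the same data looks fuller or emptier depending on how much time one pixel covers.

```python
def pixel_coverage(running_intervals, view_start, view_end, pixel_count):
    """Return, per pixel, the fraction of its time span the thread was running.

    running_intervals: non-overlapping (start, end) times when the thread ran.
    """
    width = (view_end - view_start) / pixel_count
    coverage = []
    for pixel in range(pixel_count):
        lo = view_start + pixel * width
        hi = lo + width
        # Sum the part of every running interval that falls inside this pixel.
        busy = sum(max(0.0, min(end, hi) - max(start, lo))
                   for start, end in running_intervals)
        coverage.append(busy / width)
    return coverage


# Hypothetical thread activity: running during 0-1 ms and 2-2.5 ms.
running = [(0.0, 1.0), (2.0, 2.5)]
```

Viewed at four pixels across 4 ms, the first pixel is completely full. Zoomed out to two pixels, the same activity shows as a half-height bar, which is why zooming in reveals the details.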

Within the threads, rows may vary based on the collection settings.  For this report we can see:
- HW markers - the graphics performance markers an app emits while building a graphics command buffer (such as DirectX PIX markers, VK_EXT_debug_utils markers, etc.). These are user-provided and can be helpful in understanding which command buffers (and what sections of each buffer) are being created during the "Command Creation" bars. Adding these markers to command buffers also shows them on the GPU side under the main GPU Workload bars.
- OS runtime libraries - trace of key Linux system calls, helping you understand how your thread may be performing synchronizing/blocking actions. Examples include sleeps, waiting on signals, or even file I/O.
- Vulkan API - <span style="color:purple">the calls made on this thread, performing actions like creating objects, submitting command buffers to queues for rendering, or waiting on various sync objects.</span>

For select APIs, clicking a timeline bar will highlight it and correlated events. For example, clicking a call to vkQueueSubmit will highlight the Vulkan GPU workloads that it launched, as well as showing its counterpart in the process-wide aggregate row for the queue where it was invoked. The opposite is also true - clicking a GPU workload will also highlight the call to vkQueueSubmit that launched it in both locations. <mark>**Click one of the GPU workloads now to see its corresponding vkQueueSubmit call in both the queue sub-row and the individual thread's Vulkan API row.**</mark>  If the queue's sub-row is collapsed, a small arrow indicator will appear over the row header to show that a highlighted item can be found under it.  Additionally, if highlights are outside the current view, either vertically or horizontally, then clickable arrows may appear in the corners of the timeline. 


## Identify a common problem in CPU-GPU interactions



Now that you have seen the report and understood the details of the various rows, <mark>**take a moment to review it and see if you can find any problems that stand out.**</mark>  A few general tips, when looking at an Nsight Systems report:
- Try to look at the occupancies of the various parts of the system. Are any of them under-utilized? When? Is there anything happening at the same time that could hint to why?
- Try lining up the various rows at the times of interesting or unusual things you see - this can provide an invaluable system-wide view, which is Nsight Systems' greatest strength. <mark>**Right-click a row in the timeline and select "Pin Row" to make a row always visible, even if you scroll the view up or down.**</mark> Can two events happening at the same time be interfering with one another? Should two events happening one after the other be parallelized instead of happening serially?
- Try looking for events, such as API calls or system waits, that take a long time. Using the call stacks (either during the events or right before or after them), could help in identifying what caused them.

If you are unsure as to what you should be looking for, a few hints were <span style="color:purple">marked as hints</span> in the previous part of this notebook. Have a look at what those hints point out and see if anything is not as it should be.
    
Below this paragraph are explanations of some of the issues in this report. <mark>**If you wish to explore on your own, do so now before you continue reading beyond this point.**</mark>    

### First indicator: CPU and GPU occupancies.
As a rule of thumb, a good point to start looking at the report is by comparing the CPU and GPU occupancy rows.

For the CPU side, this could be either the top CPU Utilization row or the individual threads' utilization row. For the GPU side, look at either the GR Active row, the SM Active / SM Warp Occupancy rows, or the individual queues' GPU Workload rows.

- If the <span style="color:darkorange">CPU rows are mostly occupied</span> and the <span style="color:red">GPU rows have many idle times,</span> the application is **CPU-bound**. This can be solved by either applying any classic CPU hot-spot optimization technique (such as improving the algorithms, adding more multi-threading, ...), or by offloading more work to the GPU.
- If the <span style="color:darkorange">CPU rows have many idle times</span> and the <span style="color:green">GPU rows are mostly occupied,</span> the application is **GPU-bound**. This isn't necessarily bad; instead, you should ask whether you are hitting your desired frame-rate. In this case, check your Vulkan or Direct3D workload ranges, or the GPU metrics sampling GR Active and SM Active / Warp Occupancy rows, to see if the application is fully loading the GPU. If it isn't, improve GPU parallelism. If it is, improve the efficiency of your shaders. Techniques for doing both of these, and more ways to improve your GPU work's efficiency, can be found in the Nsight Graphics part of this lab (and in other resources on the Nsight Graphics and Nsight Compute tools).
- If the <span style="color:darkorange">CPU and GPU rows are both mostly occupied,</span> the application is maximizing usage of the system. Normally, this indicates a balanced application. If you are still not achieving your frame-rate, the work shifts from cold-spot analysis to hot-spot analysis, and either of the previous two methods (or a combination of both) should help you optimize further.
- If the <span style="color:red">CPU and GPU rows both have many idle times,</span> there is a system-wide problem, not related to the computation on one processor or the other.
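The four cases above amount to a small decision table. The sketch below encodes it in Python, with the hypothetical function `classify` and an illustrative 80% "mostly occupied" threshold. In practice you judge occupancy by eye on the timeline rather than with a fixed number.

```python
def classify(cpu_busy, gpu_busy, busy_threshold=0.8):
    """Classify a capture from the busy fractions (0.0-1.0) of CPU and GPU rows.

    busy_threshold is an illustrative cut-off for "mostly occupied".
    """
    cpu_full = cpu_busy >= busy_threshold
    gpu_full = gpu_busy >= busy_threshold
    if cpu_full and gpu_full:
        return "balanced"
    if cpu_full:
        return "CPU-bound"
    if gpu_full:
        return "GPU-bound"
    return "system-wide problem"
```

For example, `classify(0.4, 0.5)` falls through to the last case, because neither processor is mostly occupied.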

In this report, we see that the CPU and GPU rows both have large gaps, so we are facing a system-wide problem. The two most common culprits for these issues are:
- Transfer delays (check the PCIe bandwidth row, the workloads on Transfer queues, and look for API calls like vkMapMemory)
- Excessive synchronization (this will usually manifest as the CPU and GPU work being serialized instead of parallelized, as well as threads spending long times blocked, especially inside API calls related to synchronization).

<mark>**Compare the Vulkan GPU Workload rows under the application's main thread with the same thread's CPU utilization.**</mark> It's very evident that the CPU and GPU are "taking turns" in processing and are not parallelizing correctly - the idle times in the CPU line up with the busy times in the GPU and vice-versa. This is a very clear indicator of excessive synchronization between the two processors.

### Second indicator: CPU-to-GPU frame offset

As mentioned before, the purpose of a GPU is to offload and parallelize the task of rendering graphics away from the CPU. If a GPU is being utilized correctly, it should be drawing the current frame at the same time the CPU is preparing the next frame, or even a frame beyond that.

<mark>**Look at the two "Frame Duration" rows and observe the indices of the CPU and GPU frames and how they overlap.**</mark> The GPU and CPU timings of each frame overlap completely, and there is no offset between them. This indicates either that the GPU is being starved for work and is finishing its tasks faster than the CPU can create them, or that the work on both processors is being serialized or synchronized in some way. Since we know this application should not be CPU-intensive at all, the latter seems more likely. This is, in fact, the correct answer in this case - but for your own real-world application you might want to double-check that assumption by seeing if the application is CPU-bound (see first indicator).
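The offset between CPU and GPU frames can also be expressed as a number. The sketch below assumes you have the start and end times of each CPU and GPU frame, with matching indices, and counts how many frames ahead the CPU is whenever a GPU frame starts. The function name `cpu_lead_in_frames` and the timings are hypothetical.

```python
def cpu_lead_in_frames(cpu_frames, gpu_frames):
    """For each GPU frame, return how many frames ahead the CPU is when it starts.

    cpu_frames, gpu_frames: lists of (start_ms, end_ms); equal index = same frame.
    A lead of 0 means the CPU is still on the frame the GPU is starting.
    """
    leads = []
    for index, (gpu_start, _gpu_end) in enumerate(gpu_frames):
        # Count the CPU frames that began at or before this GPU frame started.
        started = sum(1 for cpu_start, _cpu_end in cpu_frames
                      if cpu_start <= gpu_start)
        leads.append(started - 1 - index)
    return leads


# Hypothetical timings: frames that overlap completely, as in the "Before" report...
serialized_cpu = [(0, 10), (10, 20), (20, 30)]
serialized_gpu = [(0, 10), (10, 20), (20, 30)]
# ...and a pipelined case where the CPU works one frame ahead of the GPU.
pipelined_cpu = [(0, 10), (10, 20), (20, 30), (30, 40)]
pipelined_gpu = [(12, 22), (22, 32), (32, 42)]
```

The serialized timings give a lead of 0 for every frame, while the pipelined timings give a lead of 1, which is the healthy pattern described earlier.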

### Third indicator: Long sync calls

<mark>**Look at the OS Runtime and Vulkan API rows under the application's thread.**</mark> What API calls appear to be happening the most? What ones take up most of the CPU threads' timeline?

In this report, there are 2 correct answers: If you said "poll", you would still be right, but "poll" can indicate any sort of wait on a signal or communication pipe. If you look at the call stack notches right before the "poll" calls (or, simply, if you look at the row beneath it and see the concurrent event happening in the Vulkan API row), you will see that poll() is being called by the other correct answer: The Vulkan API vkWaitForFences. **Note how the two rows showing the concurrent events indicate that the call to vkWaitForFences is actively waiting on the signal using the system call poll.**
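Lining up two rows like this is essentially a check for time containment. As a rough sketch, given events from the OS runtime row and the Vulkan API row of the same thread, you could match each system call to the API call that encloses it. The function `enclosing_calls` and the timings are hypothetical. Containment in time is only a hint, and the call stack remains the real evidence.

```python
def enclosing_calls(inner_events, outer_events):
    """Pair each inner event with the outer event that contains it in time.

    Events are (name, start_ms, end_ms) tuples from the same thread.
    Returns (inner_name, outer_name or None) pairs in input order.
    """
    result = []
    for inner_name, inner_start, inner_end in inner_events:
        owner = None
        for outer_name, outer_start, outer_end in outer_events:
            if outer_start <= inner_start and inner_end <= outer_end:
                owner = outer_name
                break
        result.append((inner_name, owner))
    return result


# Hypothetical events, loosely shaped like the rows in the "Before" report.
os_runtime = [("poll", 1.0, 12.0), ("poll", 17.0, 28.0), ("nanosleep", 30.0, 31.0)]
vulkan_api = [("vkQueueSubmit", 0.0, 0.5),
              ("vkWaitForFences", 0.5, 12.5),
              ("vkWaitForFences", 16.5, 28.5)]
```

Here both `poll` events fall inside a `vkWaitForFences` call, while the `nanosleep` event has no enclosing Vulkan call.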

### Putting it all together
Going over the problems we found, from the top rows of the timeline to the bottom:
- The CPU was pretty empty
- The GPU metric samples showed idle spots
    - The idle times on the CPU and GPU line up so that each of them is working while the other is idle
- The CPU Frames do not start pushing commands until the previous GPU frame has finished rendering
- Vulkan HW trace showed 2 queues but no overlaps between them (did you catch that one? Transfer queues are normally used to parallelize the work INSIDE the GPU, not just between the GPU and CPU)
- The CPU thread has a LOT of blocking calls. It spends most of its time inside the system call poll, itself being called from the Vulkan API function vkWaitForFences.
    - These wait times also line up with the GPU workloads and the GPU times.

This is a fairly typical real-life scenario: in early milestones of an app or game, it is quite common for a developer to accidentally leave in bits of debugging code, or to avoid race conditions with synchronization that is not necessary, or that would not be necessary if the underlying issue were noticed and optimized away.

We can see long calls to vkWaitForFences() which in turn calls poll(), which in turn has the thread go into a blocked state. Those are not all bad though. They can sometimes be important to algorithms, but we encourage you to ask each time if there was another option to enable GPU-side synchronization instead and enable parallelism.

<mark>**Look at the call stacks that happen right as these poll calls begin by hovering the mouse over the orange notches aligned with the start of the GPU work / the end of the CPU work.**</mark> You should see, as already indicated by the timeline, the function poll, and below it in the stack the function vkWaitForFences. Under that you can see the specific function that called the API (since vkWaitForFences might be used in multiple places in your code) - a user function named vkpt_submit_command_buffer. This would be a good point to actually look at the source code. On the desktop there is a folder named q2rtx_src, where you can find the modified version of the Vulkan engine's main.c file used in building this application. <mark>**Open the source file and search for the definition of the vkpt_submit_command_buffer function.**</mark> It should be around lines 3730-3825.

At line 3802, just below the call to vkQueueSubmit, the developer (okay, it was us, the lab developers, planting a bug in the code for the sake of this lesson) waited on fences unnecessarily. As mentioned above, people often do this when they can't get things like memory barriers right, or when they want to force the system to see/verify intermediate work and forget to take the wait out later. Make sure you know that each fence is needed! You don't want the CPU waiting on the GPU if it could instead be working ahead to prepare the next frame. Take advantage of the parallelism. The application might look CPU-bound at times, but not in the traditional "I'm doing too much CPU work, so there should be a hot-spot" sense. Instead, we are stuck on CPU-GPU interactions that turn work that should run in parallel into a sequence.

If we zoom in we can see it here. Selecting the GPU work shows us the CPU launch point as well. In a well-parallelized application, the launch of a GPU workload should appear long before its execution (one to several frames back) instead of almost immediately like it is here.
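To get a feel for how much a wait right after every submit can cost, here is a small timing model in Python. It is a simplification, not the game's code: `total_time_ms` is a hypothetical function that assumes fixed per-frame CPU and GPU costs and a single GPU queue. `frames_in_flight=1` models the CPU waiting for the GPU to finish each frame before preparing the next, while larger values let the CPU run ahead.

```python
def total_time_ms(frame_count, cpu_ms, gpu_ms, frames_in_flight):
    """Return the time needed to render frame_count frames in a simple model.

    The CPU may start preparing frame i only after the GPU has finished
    frame i - frames_in_flight. The GPU runs frames one at a time, in order.
    """
    cpu_end = []
    gpu_end = []
    for i in range(frame_count):
        cpu_start = cpu_end[i - 1] if i > 0 else 0.0
        if i >= frames_in_flight:
            # Fence-style wait: block until the older frame has left the GPU.
            cpu_start = max(cpu_start, gpu_end[i - frames_in_flight])
        cpu_end.append(cpu_start + cpu_ms)
        gpu_free = gpu_end[i - 1] if i > 0 else 0.0
        gpu_end.append(max(cpu_end[i], gpu_free) + gpu_ms)
    return gpu_end[-1] if gpu_end else 0.0
```

With 10 frames costing 4 ms of CPU and 10 ms of GPU time each, the model gives 140 ms when the CPU waits after every submit and 104 ms with two frames in flight. In the second case the GPU is never idle after the first frame, which is the behavior you would hope to see in the timeline.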

Conveniently, we have also set up an environment variable that disables this excessive synchronization. If you set Q2RTX_WAIT_FOR_EXECUTION=0 (or any value other than 1, really), the calls to vkWaitForFences will stop happening. Next, let's capture a report with the environment variable set, and see if the problems we observed go away.


<img src="images/nsight_systems_wait_for_transfer_queue.png" width="1000">


## Setting up a project and capturing your own report
<mark>**Double click in the Project Explorer on the project "Quake II RTX".**</mark> This will switch to the project's tab, which should already be open (or re-open it if you closed it). You have to select a target device before you can configure the project settings. <mark>**Click the dropdown and select the entry under localhost.**</mark> It should look something like this, before the page is filled with relevant options from the device:

<img src="images/nsight_systems_connect.png" width="1000">

Alternatively, you can click the "Select" link under the drop-down menu to re-connect to the last used target (which is the local target, in this case).

The first section sets up the application launch and thread call stack sampling. This is already set up for you. You can see that the command line is simply q2rtx. The launcher is in /bin, but for your own applications you might need to specify a full path and working directory.

<img src="images/nsight_systems_settings1.png" width="1000">

<mark>**Expand the "Environment Variables" control by clicking the triangle button, and change the value of the variable Q2RTX_WAIT_FOR_EXECUTION from 1 to 0.**</mark>    
    
<img src="images/nsight_systems_remove_wait.png" width="600">
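If you want to compare the two behaviors outside the GUI, the same variable can be set when launching the game yourself. The sketch below only builds the environment. The helper name `launch_environment` is hypothetical, and the values "1" and "0" are the ones used in this project's settings.

```python
import os


def launch_environment(wait_for_execution, base_env=None):
    """Return a copy of the environment with Q2RTX_WAIT_FOR_EXECUTION set.

    "1" keeps the extra fence wait from the modified game; "0" disables it.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["Q2RTX_WAIT_FOR_EXECUTION"] = "1" if wait_for_execution else "0"
    return env
```

You could then pass the result to `subprocess.run(["q2rtx"], env=launch_environment(False))` from a terminal session inside the container, assuming the launcher is on your PATH as described earlier.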

Below the target process settings, going over the checkboxes already selected in this project file, Nsight Systems will collect CPU thread states, OS runtime library trace (like pthread_mutex_lock() calls), GPU metrics samples and Vulkan API.
    
<mark>**You can expand each of the selected collection types by clicking its triangle and take a moment to see which configuration options are selected for it.**</mark> Here you can see inside the "Collect GPU metrics" and "Collect Vulkan trace", showing for example that the GPU metrics will be collected for all supported GPUs using the default Nsight Systems metrics set, and that Vulkan will capture GPU workloads as well as the VK_EXT_debug_utils annotation markers.

<img src="images/nsight_systems_settings2.png" width="1000">

We are almost ready to start profiling!

On the right side we have the start button and a few last controls to limit when and how long we will record the report information. The default behavior, if all boxes are unchecked, is to capture the entire application run, from when it is launched until it terminates.
    
The project file you have opened should have "Start profiling after... 15.0 seconds" and "Limit profiling to... 5.0 seconds" already configured. The 15 seconds will allow enough time for the game's loading to finish and the demo to start playing. The 5-second capture should be enough, since we are not looking for a rare or random occurrence, but rather a consistent behavior.
    
Other options include manually starting the capture by pressing a button in the Nsight Systems GUI, or by pressing one of the function keys (F12 by default). <span style="color:darkgreen">The latter is especially useful if you are profiling a game that captures the mouse and pauses when losing focus; or if you are profiling a full-screen application on a single-monitor set-up.</span>

<mark>**Now press the big start button in the upper right!**</mark>
    
The application will pop up over Nsight Systems and load a demo recording. If you bring Nsight Systems back into focus, you will see it recording like this:

<img src="images/nsight_systems_recording.png" width="1000">

Depending on the options we selected, the capture will stop when the designated time has elapsed, when you click the stop button, or when you press the hotkey a second time. In this case, the 5-second timer will count down once the capture starts and the report will start processing once it reaches 0.

## See how removing the issue changes the timeline

We have found our problem and you may have even looked at the source in the last step around vkWaitForFences(). If you had any trouble following along you can simply open the report in the Project Explorer labeled "After".

How has the report changed vs the one we saw before? At the very least, the process's frame duration has improved (this was also reflected in the game's own FPS counter). You can also see that the work in the Vulkan HW transfer queue is now parallel with that of the primary graphics/compute queue.

<img src="images/nsight_systems_optimized.png" width="1000">


## See how a Windows report differs from a Linux report

<mark>**Open the report labeled "q2rtx-on-windows" to see!**</mark>

This report was captured on the un-modified q2rtx, so it should be comparable to the report you captured yourself, or the provided "After" report. 

Starting with the GPU HW section, you will see **WDDM queues.** This is the Windows Display Driver Model, which is responsible for context switching a GPU between apps.

Down in the processes we can also see WDDM queue rows. These are the CPU queues, as well as process-filtered views of the GPU queue. With Quake II RTX these queues are not very tall. On other games and graphic applications, they can grow quite large due to doing rather small bits of work per queue submit or transferring small bits of data, extending the concurrency of the graphic pipeline in the driver. By default the rows are compressed. You can tell by both the double-arrow compress/expand button in the row-header, as well as the grey bar near the top of the row that shows when there is overflow, and it has clipped or compressed it. <mark>**Press the expand button on one of the WDDM rows to show the full WDDM queues.**</mark>

<img src="images/nsight_systems_wddm.png" width="1000">

You will also notice within the threads, and at the bottom of the processes, there are **ETW events.** These are Windows events that can indicate important things like memory evictions and paging from oversubscribed memory. They can be great indicators for explaining stutter and offer clues on how to prevent it. <mark>**Have a look at some of these ETW rows and their events.**</mark> You can also right-click these rows' header (or any other row's) and choose "Show in Events View" to see a list of the events contained within.

Nsight Systems reports on Windows also include trace data from **important system processes** that also play a role in context switching and paging, which you can see under the "System (4)" process. <mark>**Check out the information shown under it and see how it correlates to the event timing for the target application.**</mark>

## Other resources

Basic information and downloads are available at:

https://developer.nvidia.com/nsight-systems

You can find videos, blog posts, and conference presentations (also linked from inside the product documentation) at:

https://docs.nvidia.com/nsight-systems/UserGuide/index.html#other_resources

--------

Click [here to conclude the lesson](../conclusion.ipynb)