<h1><div align="center">Managing Accelerated Application Memory with CUDA C/C++ Unified Memory</div></h1>

![CUDA](./images/CUDA_Logo.jpg)

The [*CUDA Best Practices Guide*](http://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#memory-optimizations), a highly recommended follow-up to this and other CUDA fundamentals labs, recommends a design cycle called **APOD**: **A**ssess, **P**arallelize, **O**ptimize, **D**eploy. In short, APOD prescribes an iterative design process in which developers apply incremental improvements to their accelerated application's performance and then ship their code. As developers become more competent CUDA programmers, they can apply more advanced optimization techniques to their accelerated code bases.

This lab supports this style of iterative development. You will use the Nsight Systems command line tool **nsys** to quantitatively measure your application's performance and to identify opportunities for optimization. You will then apply incremental improvements before learning new techniques and repeating the cycle. Many of the techniques you will learn and apply in this lab deal with the specifics of how CUDA's **Unified Memory** works. Understanding Unified Memory behavior is a fundamental skill for CUDA developers, and is a prerequisite for many more advanced memory management techniques.

---
## Prerequisites

To get the most out of this lab you should already be able to:

- Write, compile, and run C/C++ programs that both call CPU functions and launch GPU kernels.
- Control parallel thread hierarchy using execution configuration.
- Refactor serial loops to execute their iterations in parallel on a GPU.
- Allocate and free Unified Memory.

---
## Objectives

By the time you complete this lab, you will be able to:

- Use the Nsight Systems command line tool (**nsys**) to profile accelerated application performance.
- Leverage an understanding of **Streaming Multiprocessors** to optimize execution configurations.
- Understand the behavior of **Unified Memory** with regard to page faulting and data migrations.
- Use **asynchronous memory prefetching** to reduce page faults and data migrations for increased performance.
- Employ an iterative development cycle to rapidly accelerate and deploy applications.

---
## Iterative Optimizations with the NVIDIA Command Line Profiler

The only way to be assured that attempts at optimizing accelerated code bases are actually successful is to profile the application for quantitative information about the application's performance. `nsys` is the Nsight Systems command line tool. It ships with the CUDA toolkit, and is a powerful tool for profiling accelerated applications.

`nsys` is easy to use. Its most basic usage is to simply pass it the path to an executable compiled with `nvcc`. `nsys` will proceed to execute the application, after which it will print a summary output of the application's GPU activities, CUDA API calls, as well as information about **Unified Memory** activity, a topic which will be covered extensively later in this lab.

When accelerating applications, or optimizing already-accelerated applications, take a scientific and iterative approach. Profile your application after each change and record how the refactoring affected performance. Make these observations early and often: frequently, a small amount of effort yields enough of a performance boost that you can ship your accelerated application. Frequent profiling also teaches you how specific changes to a CUDA code base affect its actual performance, which is hard to learn when you only profile after making many different kinds of changes.

### Exercise: Profile an Application with nsys

[01-vector-add.cu](01-vector-add/01-vector-add.cu) (you can click on this and any of the other source file links in this lab to open them for editing) is a naively accelerated vector addition program. Run the two code execution cells below (`CTRL` + `ENTER`). The first cell compiles and runs the vector addition program. The second cell profiles the executable that was just compiled, using `nsys profile`.

`nsys profile` will generate a report file that can be used in a variety of ways, including visual profiling with Nsight Systems, which we will look at in more detail in the following section.

Here we use the `--stats=true` flag to indicate we would like summary statistics printed. In this section this summary will be the focus of our attention. There is quite a lot of information printed:

- Operating System Runtime Summary (`osrt_sum`)
- **CUDA API Summary (`cuda_api_sum`)**
- **CUDA Kernel Summary (`cuda_gpu_kern_sum`)**
- **CUDA Memory Time Operation Summary (`cuda_gpu_mem_time_sum`)**
- **CUDA Memory Size Operation Summary (`cuda_gpu_mem_size_sum`)**

In this section you will primarily be using the 4 summaries in **bold** above. In the next section, you will open the generated report files in the Nsight Systems GUI for visual profiling.

After profiling the application, answer the following questions using information displayed in the `cuda_gpu_kern_sum` section of the profiling output:

- What was the name of the only CUDA kernel called in this application?
- How many times did this kernel run?
- How long did it take this kernel to run? Record this time somewhere: you will be optimizing this application and will want to know how much faster you can make it.

In [4]:
!nvcc -o single-thread-vector-add 01-vector-add/01-vector-add.cu -run

Success! All values calculated correctly.


In [5]:
!nsys profile --stats=true ./single-thread-vector-add

Success! All values calculated correctly.
Generating '/tmp/nsys-report-a20e.qdstrm'
[3/8] Executing 'nvtx_sum' stats report
SKIPPED: /dli/task/report2.sqlite does not contain NV Tools Extension (NVTX) data.
[4/8] Executing 'osrt_sum' stats report

 Time (%)  Total Time (ns)  Num Calls   Avg (ns)    Med (ns)   Min (ns)  Max (ns)   StdDev (ns)           Name         
 --------  ---------------  ---------  ----------  ----------  --------  ---------  -----------  ----------------------
     90.3       6054131773        317  19098207.5  10070523.0      2190  100163795   27334856.3  poll                  
      8.9        599393037        283   2117996.6   2064625.0       200   20453018    1592301.4  sem_timedwait         
      0.5         31165059        499     62455.0     10850.0       380    8444285     397256.9  ioctl                 
      0.3         19559006         24    814958.6      4700.0       920    7488640    2202600.9  mmap                  
      0.0           912051      

By default, `nsys profile` will not overwrite an existing report file, which prevents accidental loss of work when profiling. If you would rather overwrite an existing report file, say during rapid iterations, pass the `-f true` (`--force-overwrite=true`) option to `nsys profile`.

### Exercise: Optimize and Profile

Take a minute or two to make a simple optimization to [01-vector-add.cu](01-vector-add/01-vector-add.cu) by updating its execution configuration so that it runs on many threads in a single thread block. Recompile and then profile with `nsys profile --stats=true` using the code execution cells below. Use the profiling output to check the runtime of the kernel. What was the speedup from this optimization? Be sure to record your results somewhere.
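One way to make the speedup concrete is to divide the kernel time you recorded before the change by the time you recorded after it. The helper below is a minimal sketch of that arithmetic. `speedup` is a hypothetical name, not part of the lab's source files, and the times in the usage note are made-up values rather than measurements.

```cpp
#include <cassert>

// Hypothetical helper: ratio of the kernel time before a change to the
// kernel time after it. Both times are in nanoseconds, as reported in the
// `cuda_gpu_kern_sum` table. A result above 1.0 means the change helped.
double speedup(double beforeNs, double afterNs)
{
  assert(beforeNs > 0.0 && afterNs > 0.0); // Recorded times must be positive.
  return beforeNs / afterNs;
}
```

For example, `speedup(2000000.0, 500000.0)` evaluates to `4.0`, which you would record as a 4x speedup.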

In [6]:
!nvcc -o multi-thread-vector-add 01-vector-add/01-vector-add.cu -run

Success! All values calculated correctly.


In [7]:
!nsys profile --stats=true ./multi-thread-vector-add

Success! All values calculated correctly.
Generating '/tmp/nsys-report-5817.qdstrm'
[3/8] Executing 'nvtx_sum' stats report
SKIPPED: /dli/task/report3.sqlite does not contain NV Tools Extension (NVTX) data.
[4/8] Executing 'osrt_sum' stats report

 Time (%)  Total Time (ns)  Num Calls   Avg (ns)    Med (ns)   Min (ns)  Max (ns)   StdDev (ns)           Name         
 --------  ---------------  ---------  ----------  ----------  --------  ---------  -----------  ----------------------
     90.4       6054257553        317  19098604.3  10068857.0      2730  100151224   27331801.8  poll                  
      8.8        591188074        283   2089003.8   2064748.0       190   20440756    1303766.2  sem_timedwait         
      0.5         31730396        499     63588.0     11350.0       370    8162366     387434.5  ioctl                 
      0.3         19059177         24    794132.4      6565.0       920    7143388    2138517.4  mmap                  
      0.0           974986      

### Exercise: Optimize Iteratively

In this exercise you will go through several cycles of editing the execution configuration of [01-vector-add.cu](01-vector-add/01-vector-add.cu), profiling it, and recording the results to see the impact. Use the following guidelines while working:

- Start by listing 3 to 5 different ways you will update the execution configuration, being sure to cover a range of different grid and block size combinations.
- Edit the [01-vector-add.cu](01-vector-add/01-vector-add.cu) program in one of the ways you listed.
- Compile and profile your updated code with the two code execution cells below.
- Record the runtime of the kernel execution, as given in the profiling output.
- Repeat the edit/profile/record cycle for each possible optimization you listed above.

Which of the execution configurations you attempted proved to be the fastest?

In [8]:
!nvcc -o iteratively-optimized-vector-add 01-vector-add/01-vector-add.cu -run

Success! All values calculated correctly.


In [9]:
!nsys profile --stats=true ./iteratively-optimized-vector-add

Success! All values calculated correctly.
Generating '/tmp/nsys-report-5f1f.qdstrm'
[3/8] Executing 'nvtx_sum' stats report
SKIPPED: /dli/task/report4.sqlite does not contain NV Tools Extension (NVTX) data.
[4/8] Executing 'osrt_sum' stats report

 Time (%)  Total Time (ns)  Num Calls   Avg (ns)    Med (ns)   Min (ns)  Max (ns)   StdDev (ns)           Name         
 --------  ---------------  ---------  ----------  ----------  --------  ---------  -----------  ----------------------
     90.4       6044050256        316  19126741.3  10072791.0      2180  100149655   27372945.8  poll                  
      8.8        587988047        282   2085064.0   2064385.0       180   20433163    1266714.3  sem_timedwait         
      0.5         31292437        499     62710.3     10351.0       380    8418636     397102.5  ioctl                 
      0.3         19620511         24    817521.3      4805.0       900    7262542    2201871.8  mmap                  
      0.0           988102      

---
## Streaming Multiprocessors and Querying the Device

This section explores how understanding a specific feature of the GPU hardware can promote optimization. After introducing **Streaming Multiprocessors**, you will attempt to further optimize the accelerated vector addition program you have been working on.

The following video presents the upcoming material visually, at a high level. Watch it before moving on to the more detailed coverage of these topics in the following sections.


In [10]:
from IPython.display import HTML

video_url = "https://d36m44n9vdbmda.cloudfront.net/assets/s-ac-04-v1/task2/NVPROF_UM_1.mp4"

video_html = f"""
<video controls width="640" height="360">
    <source src="{video_url}" type="video/mp4">
    Your browser does not support the video tag.
</video>
"""

display(HTML(video_html))

### Streaming Multiprocessors and Warps

The GPUs that CUDA applications run on have processing units called **streaming multiprocessors**, or **SMs**. During kernel execution, blocks of threads are given to SMs to execute. In order to support the GPU's ability to perform as many parallel operations as possible, performance gains can often be had by *choosing a grid size that has a number of blocks that is a multiple of the number of SMs on a given GPU.*

Additionally, SMs create, manage, schedule, and execute groupings of 32 threads from within a block called **warps**. More [in-depth coverage of SMs and warps](http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#hardware-implementation) is beyond the scope of this course; however, it is important to know that performance gains can also be had by *choosing a block size that has a number of threads that is a multiple of 32.*
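The two sizing guidelines above come down to rounding up to a multiple. The following is a minimal host-side sketch of that arithmetic. The helper names are hypothetical, and `numberOfSMs` is assumed to have been obtained from the device at runtime rather than hard-coded.

```cpp
#include <cassert>
#include <cstddef>

// Round `value` up to the nearest multiple of `multiple`.
std::size_t roundUpToMultiple(std::size_t value, std::size_t multiple)
{
  assert(multiple > 0);
  return ((value + multiple - 1) / multiple) * multiple;
}

// Hypothetical helper: threads per block, rounded up to a whole number of
// 32-thread warps. The caller must still keep the result at or below the
// device's maximum threads per block (1024 on current GPUs).
std::size_t chooseBlockSize(std::size_t desiredThreads)
{
  const std::size_t warpSize = 32;
  return roundUpToMultiple(desiredThreads, warpSize);
}

// Hypothetical helper: enough blocks to cover `n` elements, rounded up to a
// multiple of the number of SMs. The extra threads this creates must be
// guarded in the kernel, with a bounds check or a grid-stride loop.
std::size_t chooseGridSize(std::size_t n, std::size_t threadsPerBlock, std::size_t numberOfSMs)
{
  assert(threadsPerBlock > 0);
  const std::size_t blocksNeeded = (n + threadsPerBlock - 1) / threadsPerBlock;
  return roundUpToMultiple(blocksNeeded, numberOfSMs);
}
```

With 1000 elements, 256 threads per block, and a hypothetical 80 SMs, `chooseGridSize` returns 80: four blocks would cover the data, and that count is rounded up to one block per SM. Whether such a configuration is actually faster is something to confirm with the profiler.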

### Programmatically Querying GPU Device Properties

The number of SMs on a GPU can differ depending on the specific GPU being used, so to support portability it should not be hard-coded into a code base. Rather, this information should be acquired programmatically.

The following shows how, in CUDA C/C++, to obtain a C struct which contains many properties about the currently active GPU device, including its number of SMs:

```cpp
int deviceId;
cudaGetDevice(&deviceId);                  // `deviceId` now holds the ID of the currently active GPU.

cudaDeviceProp props;
cudaGetDeviceProperties(&props, deviceId); // `props` now has many useful properties about
                                           // the active GPU device.
```

### Exercise: Query the Device

Currently, [01-get-device-properties.cu](04-device-properties/01-get-device-properties.cu) contains many unassigned variables, and will print gibberish information intended to describe details about the currently active GPU.

Build out [01-get-device-properties.cu](04-device-properties/01-get-device-properties.cu) to print the actual values for the desired device properties indicated in the source code. In order to support your work, and as an introduction to them, use the [CUDA Runtime Docs](http://docs.nvidia.com/cuda/cuda-runtime-api/structcudaDeviceProp.html) to help identify the relevant properties in the device props struct. Refer to [the solution](04-device-properties/solutions/01-get-device-properties-solution.cu) if you get stuck.

In [11]:
!nvcc -o get-device-properties 04-device-properties/01-get-device-properties.cu -run











Device ID: 21940
Number of SMs: 685375104
Compute Capability Major: 32765
Compute Capability Minor: 0
Warp Size: 0


### Exercise: Optimize Vector Add with Grids Sized to Number of SMs

Utilize your ability to query the device for its number of SMs to refactor the `addVectorsInto` kernel you have been working on inside [01-vector-add.cu](01-vector-add/01-vector-add.cu) so that it launches with a grid containing a number of blocks that is a multiple of the number of SMs on the device.

Depending on other specific details in the code you have written, this refactor may or may not improve, or significantly change, the performance of your kernel. Therefore, as always, be sure to use `nsys profile` so that you can quantitatively evaluate performance changes. Record the results with the rest of your findings thus far, based on the profiling output.
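The following is a minimal sketch of one way such a launch could look, not the lab's reference solution. The `addVectorsInto` signature, the host-side initialization, the block size of 256, and the factor of 32 blocks per SM are all assumptions made for illustration. Your own kernel and the fastest configuration on your GPU may differ.

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Assumed signature; your own `addVectorsInto` may differ. The grid-stride
// loop keeps the kernel correct for any number of blocks and threads.
__global__ void addVectorsInto(float *result, const float *a, const float *b, int N)
{
  int index  = threadIdx.x + blockIdx.x * blockDim.x;
  int stride = blockDim.x * gridDim.x;

  for (int i = index; i < N; i += stride)
    result[i] = a[i] + b[i];
}

int main()
{
  const int N = 2 << 24;
  const size_t size = N * sizeof(float);

  float *a, *b, *c;
  cudaMallocManaged(&a, size);
  cudaMallocManaged(&b, size);
  cudaMallocManaged(&c, size);

  for (int i = 0; i < N; ++i) { a[i] = 3.0f; b[i] = 4.0f; c[i] = 0.0f; }

  // Query the active device rather than hard-coding its SM count.
  int deviceId;
  cudaGetDevice(&deviceId);
  cudaDeviceProp props;
  cudaGetDeviceProperties(&props, deviceId);

  const int threadsPerBlock = 256;                            // A multiple of 32.
  const int numberOfBlocks  = 32 * props.multiProcessorCount; // A multiple of the SM count (factor of 32 is arbitrary).

  addVectorsInto<<<numberOfBlocks, threadsPerBlock>>>(c, a, b, N);

  cudaError_t launchErr = cudaGetLastError();      // Errors in the launch configuration.
  cudaError_t syncErr   = cudaDeviceSynchronize(); // Errors raised while the kernel ran.
  if (launchErr != cudaSuccess) printf("Launch error: %s\n", cudaGetErrorString(launchErr));
  if (syncErr != cudaSuccess)   printf("Kernel error: %s\n", cudaGetErrorString(syncErr));

  printf("SMs: %d, blocks: %d, c[0] = %f\n", props.multiProcessorCount, numberOfBlocks, c[0]);

  cudaFree(a); cudaFree(b); cudaFree(c);
  return 0;
}
```

Because the loop strides by the total number of threads in the grid, you can vary `numberOfBlocks` freely between profiling runs without touching the kernel.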

In [12]:
!nvcc -o sm-optimized-vector-add 01-vector-add/01-vector-add.cu -run

Success! All values calculated correctly.


In [13]:
!nsys profile --stats=true ./sm-optimized-vector-add

Success! All values calculated correctly.
Generating '/tmp/nsys-report-1a9a.qdstrm'
[3/8] Executing 'nvtx_sum' stats report
SKIPPED: /dli/task/report5.sqlite does not contain NV Tools Extension (NVTX) data.
[4/8] Executing 'osrt_sum' stats report

 Time (%)  Total Time (ns)  Num Calls   Avg (ns)    Med (ns)   Min (ns)  Max (ns)   StdDev (ns)           Name         
 --------  ---------------  ---------  ----------  ----------  --------  ---------  -----------  ----------------------
     90.3       6054286581        317  19098695.8  10072814.0      2180  100147955   27332147.4  poll                  
      8.9        593321275        283   2096541.6   2065453.0       140   20425739    1364023.0  sem_timedwait         
      0.5         32033830        499     64196.1     10610.0       380    8296662     411322.6  ioctl                 
      0.3         19411649         24    808818.7      5530.0       880    7200449    2179167.3  mmap                  
      0.0           891775      

---
## Unified Memory Details

You have been allocating memory intended for use by either host or device code with `cudaMallocManaged`. Up until now you have enjoyed the benefits of this method (automatic memory migration and ease of programming) without diving into the details of how the **Unified Memory** (**UM**) allocated by `cudaMallocManaged` actually works.

`nsys profile` provides details about UM management in accelerated applications. Combined with a more detailed understanding of how UM works, this information opens up additional opportunities to optimize accelerated applications.

The following video presents the upcoming material visually, at a high level. Watch it before moving on to the more detailed coverage of these topics in the following sections.


In [14]:
from IPython.display import HTML

video_url = "https://d36m44n9vdbmda.cloudfront.net/assets/s-ac-04-v1/task2/NVPROF_UM_2.mp4"

video_html = f"""
<video controls width="640" height="360">
    <source src="{video_url}" type="video/mp4">
    Your browser does not support the video tag.
</video>
"""

display(HTML(video_html))

### Unified Memory Migration

When UM is allocated, the memory is not resident yet on either the host or the device. When either the host or device attempts to access the memory, a [page fault](https://en.wikipedia.org/wiki/Page_fault) will occur, at which point the host or device will migrate the needed data in batches. Similarly, at any point when the CPU, or any GPU in the accelerated system, attempts to access memory not yet resident on it, page faults will occur and trigger its migration.

The ability to page fault and migrate memory on demand greatly simplifies the development of accelerated applications. On-demand migration is also beneficial when data exhibits sparse access patterns (for example, when it is impossible to know which data will be needed until the application actually runs), and when data might be accessed by multiple GPU devices in a multi-GPU system.

There are times, for example when data needs are known prior to runtime and large contiguous blocks of memory are required, when the overhead of page faulting and migrating data on demand would be better avoided.

Much of the remainder of this lab is dedicated to understanding on-demand migration and how to identify it in the profiler's output. With this knowledge you will be able to reduce its overhead in scenarios where doing so is beneficial.
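To make the description above concrete, here is a minimal sketch of a program in which the same managed allocation is touched first by the CPU, then by the GPU, then by the CPU again. The kernel name `doubleElements` and the launch configuration are hypothetical. The comments describe what the explanation above would lead you to expect on a GPU that supports on-demand page migration, so treat them as hypotheses to check against the profiler output.

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: every element is touched on the GPU.
__global__ void doubleElements(int *data, int N)
{
  int index  = threadIdx.x + blockIdx.x * blockDim.x;
  int stride = blockDim.x * gridDim.x;

  for (int i = index; i < N; i += stride)
    data[i] *= 2;
}

int main()
{
  const int N = 2 << 24;
  const size_t size = N * sizeof(int);

  int *data;
  cudaMallocManaged(&data, size);        // Allocated, but not yet resident on host or device.

  for (int i = 0; i < N; ++i)            // First touch is on the CPU.
    data[i] = 1;

  doubleElements<<<256, 256>>>(data, N); // GPU touches pages that are resident on the host:
  cudaDeviceSynchronize();               // expect page faults and host-to-device migrations.

  long long sum = 0;
  for (int i = 0; i < N; ++i)            // CPU touches pages that are now on the GPU:
    sum += data[i];                      // expect device-to-host migrations.

  printf("sum = %lld (expected %lld)\n", sum, 2LL * N);

  cudaFree(data);
  return 0;
}
```

Commenting out either the host loops or the kernel launch gives you the CPU-only and GPU-only variants to compare in the profiler.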

### Exercise: Explore UM Migration and Page Faulting

`nsys profile` provides output describing UM behavior for the profiled application. In this exercise, you will make several modifications to a simple application, and make use of `nsys profile` after each change, to explore how UM data migration behaves.

[01-page-faults.cu](06-unified-memory-page-faults/01-page-faults.cu) contains a `hostFunction` and a `gpuKernel`, both of which could be used to initialize the elements of a `2<<24` (33,554,432) element vector with the number `1`. Currently neither the host function nor the GPU kernel is being used.

For each of the 4 questions below, given what you have just learned about UM behavior, first hypothesize about what kind of page faulting should happen. Then edit [01-page-faults.cu](06-unified-memory-page-faults/01-page-faults.cu) to create a scenario that tests your hypothesis, using one or both of the 2 functions provided in the code base.

In order to test your hypotheses, compile and profile your code using the code execution cells below. Be sure to record your hypotheses, as well as the results, obtained from `nsys profile --stats=true` output. In the output of `nsys profile --stats=true` you should be looking for the following:

- Is there a CUDA memory operation summary (`cuda_gpu_mem_time_sum` or `cuda_gpu_mem_size_sum`) in the output?
- If so, does it indicate host to device (HtoD) or device to host (DtoH) migrations?
- When there are migrations, what does the output say about how many _Operations_ there were? If you see many small memory migration operations, this is a sign that on-demand page faulting is occurring, with small memory migrations occurring each time there is a page fault in the requested location.

Here are the scenarios for you to explore, along with solutions for them if you get stuck:

- Is there evidence of memory migration and/or page faulting when unified memory is accessed only by the CPU? ([solution](06-unified-memory-page-faults/solutions/01-page-faults-solution-cpu-only.cu))
- Is there evidence of memory migration and/or page faulting when unified memory is accessed only by the GPU? ([solution](06-unified-memory-page-faults/solutions/02-page-faults-solution-gpu-only.cu))
- Is there evidence of memory migration and/or page faulting when unified memory is accessed first by the CPU then the GPU? ([solution](06-unified-memory-page-faults/solutions/03-page-faults-solution-cpu-then-gpu.cu))
- Is there evidence of memory migration and/or page faulting when unified memory is accessed first by the GPU then the CPU? ([solution](06-unified-memory-page-faults/solutions/04-page-faults-solution-gpu-then-cpu.cu))

In [15]:
!nvcc -o page-faults 06-unified-memory-page-faults/01-page-faults.cu -run

In [16]:
!nsys profile --stats=true ./page-faults

Generating '/tmp/nsys-report-d1fe.qdstrm'
[3/8] Executing 'nvtx_sum' stats report
SKIPPED: /dli/task/report6.sqlite does not contain NV Tools Extension (NVTX) data.
[4/8] Executing 'osrt_sum' stats report

 Time (%)  Total Time (ns)  Num Calls  Avg (ns)   Med (ns)   Min (ns)  Max (ns)  StdDev (ns)           Name         
 --------  ---------------  ---------  ---------  ---------  --------  --------  -----------  ----------------------
     66.8        130301202         15  8686746.8  1933418.0      2690  51086925   13199000.7  poll                  
     16.2         31581071         13  2429313.2    38071.0       200  20453891    6100803.0  sem_timedwait         
     15.7         30598037        482    63481.4    11495.5       380   8232425     395190.1  ioctl                 
      0.5           937490         27    34721.9     4680.0      3080    584734     110732.6  mmap64                
      0.3           508809         44    11563.8    10465.5      3600     35732       5726.8

### Exercise: Revisit UM Behavior for Vector Add Program

Returning to the [01-vector-add.cu](01-vector-add/01-vector-add.cu) program you have been working on throughout this lab, review the code base in its current state, and hypothesize about what kinds of memory migrations and/or page faults you expect to occur. Look at the profiling output for your last refactor (either by scrolling up to find the output or by executing the code execution cell just below), observing the CUDA memory operation summaries (`cuda_gpu_mem_time_sum` and `cuda_gpu_mem_size_sum`) in the profiler output. Can you explain the kinds of migrations and the number of their operations based on the contents of the code base?

In [17]:
!nsys profile --stats=true ./sm-optimized-vector-add

Success! All values calculated correctly.
Generating '/tmp/nsys-report-46cc.qdstrm'
[3/8] Executing 'nvtx_sum' stats report
SKIPPED: /dli/task/report7.sqlite does not contain NV Tools Extension (NVTX) data.
[4/8] Executing 'osrt_sum' stats report

 Time (%)  Total Time (ns)  Num Calls   Avg (ns)    Med (ns)   Min (ns)  Max (ns)   StdDev (ns)           Name         
 --------  ---------------  ---------  ----------  ----------  --------  ---------  -----------  ----------------------
     90.3       6154437319        318  19353576.5  10071614.0      2810  100165225   27629497.0  poll                  
      8.7        595579388        283   2104520.8   2065094.0       280   20501012    1439280.5  sem_timedwait         
      0.6         42661073        499     85493.1     12340.0       380    9780772     582618.2  ioctl                 
      0.3         19420907         24    809204.5      5795.0       930    7326331    2179422.7  mmap                  
      0.0           970031      

### Exercise: Initialize Vector in Kernel

When `nsys profile` gives the amount of time that a kernel takes to execute, the host-to-device page faults and data migrations that occur during this kernel's execution are included in the displayed execution time.

With this in mind, refactor the `initWith` host function in your [01-vector-add.cu](01-vector-add/01-vector-add.cu) program to instead be a CUDA kernel, initializing the allocated vector in parallel on the GPU. After successfully compiling and running the refactored application, but before profiling it, hypothesize about the following:

- How do you expect the refactor to affect UM memory migration behavior?
- How do you expect the refactor to affect the reported run time of `addVectorsInto`?

Once again, record the results. Refer to [the solution](07-init-in-kernel/solutions/01-vector-add-init-in-kernel-solution.cu) if you get stuck.

In [18]:
!nvcc -o initialize-in-kernel 01-vector-add/01-vector-add.cu -run

Success! All values calculated correctly.


In [19]:
!nsys profile --stats=true ./initialize-in-kernel

Success! All values calculated correctly.
Generating '/tmp/nsys-report-2ee5.qdstrm'
[3/8] Executing 'nvtx_sum' stats report
SKIPPED: /dli/task/report8.sqlite does not contain NV Tools Extension (NVTX) data.
[4/8] Executing 'osrt_sum' stats report

 Time (%)  Total Time (ns)  Num Calls   Avg (ns)    Med (ns)   Min (ns)  Max (ns)   StdDev (ns)           Name         
 --------  ---------------  ---------  ----------  ----------  --------  ---------  -----------  ----------------------
     90.4       6045123622        316  19130138.0  10072094.5      3210  100159759   27363668.5  poll                  
      8.8        590483315        282   2093912.5   2065062.0       230   20440611    1263117.9  sem_timedwait         
      0.5         33272161        499     66677.7     12900.0       420    9340274     435991.9  ioctl                 
      0.3         19237692         24    801570.5      5425.0      1140    7188772    2156071.0  mmap                  
      0.0          1124391      

---
## Asynchronous Memory Prefetching

A powerful technique to reduce the overhead of page faulting and on-demand memory migrations, both in host-to-device and device-to-host memory transfers, is called **asynchronous memory prefetching**. This technique allows programmers to asynchronously migrate unified memory (UM) to any CPU or GPU device in the system, in the background, prior to its use by application code. Doing so can increase the performance of GPU kernels and CPU functions by reducing page fault and on-demand data migration overhead.

Prefetching also tends to migrate data in larger chunks, and therefore fewer trips, than on-demand migration. This makes it an excellent fit when data access needs are known before runtime, and when data access patterns are not sparse.

CUDA makes asynchronously prefetching managed memory to either a GPU device or the CPU easy with its `cudaMemPrefetchAsync` function. Here is an example of using it to prefetch data first to the currently active GPU device, and then to the CPU:

```cpp
int deviceId;
cudaGetDevice(&deviceId);                                         // The ID of the currently active GPU device.

cudaMemPrefetchAsync(pointerToSomeUMData, size, deviceId);        // Prefetch to GPU device.
cudaMemPrefetchAsync(pointerToSomeUMData, size, cudaCpuDeviceId); // Prefetch to host. `cudaCpuDeviceId` is a
                                                                  // built-in CUDA constant.
```
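The sketch below shows where these calls might sit relative to a kernel launch and to host code that reads the results. It assumes a CUDA toolkit that accepts the three-argument form of `cudaMemPrefetchAsync` used above. The kernel signature for `initWith` and the launch configuration are assumptions for illustration.

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Assumed kernel signature: writes `value` into every element of `data`.
__global__ void initWith(float value, float *data, int N)
{
  int index  = threadIdx.x + blockIdx.x * blockDim.x;
  int stride = blockDim.x * gridDim.x;

  for (int i = index; i < N; i += stride)
    data[i] = value;
}

int main()
{
  const int N = 2 << 24;
  const size_t size = N * sizeof(float);

  float *data;
  cudaMallocManaged(&data, size);

  int deviceId;
  cudaGetDevice(&deviceId);

  // Queue the migration to the GPU. The call returns immediately. The
  // prefetch and the kernel are both issued to the default stream, so the
  // kernel is ordered after the prefetch.
  cudaMemPrefetchAsync(data, size, deviceId);
  initWith<<<256, 256>>>(3.0f, data, N);

  // Queue the migration back to the host, then wait for all queued work to
  // finish before the CPU reads the data.
  cudaMemPrefetchAsync(data, size, cudaCpuDeviceId);
  cudaDeviceSynchronize();

  bool ok = true;
  for (int i = 0; i < N; ++i)
    if (data[i] != 3.0f) { ok = false; break; }

  printf(ok ? "All values initialized correctly.\n" : "Unexpected value found.\n");

  cudaFree(data);
  return 0;
}
```

Profiling the program with and without the two prefetch calls is a direct way to compare the number and size of the memory operations.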

### Exercise: Prefetch Memory

At this point in the lab, your [01-vector-add.cu](01-vector-add/01-vector-add.cu) program should not only be launching a CUDA kernel to add 2 vectors into a third solution vector, all of which are allocated with `cudaMallocManaged`, but should also be initializing each of the 3 vectors in parallel in a CUDA kernel. If for some reason your application does not do all of the above, please refer to the following [reference application](07-init-in-kernel/solutions/01-vector-add-init-in-kernel-solution.cu), and update your own code base to reflect its current functionality.

Conduct 3 experiments using `cudaMemPrefetchAsync` inside of your [01-vector-add.cu](01-vector-add/01-vector-add.cu) application to understand its impact on page-faulting and memory migration.

- What happens when you prefetch one of the initialized vectors to the device?
- What happens when you prefetch two of the initialized vectors to the device?
- What happens when you prefetch all three of the initialized vectors to the device?

Hypothesize about UM behavior, page faulting specifically, as well as the impact on the reported run time of the initialization kernel, before each experiment, and then verify by running `nsys profile`. Refer to [the solution](08-prefetch/solutions/01-vector-add-prefetch-solution.cu) if you get stuck.

In [20]:
!nvcc -o prefetch-to-gpu 01-vector-add/01-vector-add.cu -run

Success! All values calculated correctly.


In [21]:
!nsys profile --stats=true ./prefetch-to-gpu

Success! All values calculated correctly.
Generating '/tmp/nsys-report-b1aa.qdstrm'
[3/8] Executing 'nvtx_sum' stats report
SKIPPED: /dli/task/report9.sqlite does not contain NV Tools Extension (NVTX) data.
[4/8] Executing 'osrt_sum' stats report

 Time (%)  Total Time (ns)  Num Calls   Avg (ns)    Med (ns)   Min (ns)  Max (ns)   StdDev (ns)           Name         
 --------  ---------------  ---------  ----------  ----------  --------  ---------  -----------  ----------------------
     90.4       6154884536        318  19354982.8  10070489.0      2051  100169847   27644901.3  poll                  
      8.7        595397313        283   2103877.4   2065409.0       180   20478483    1430849.9  sem_timedwait         
      0.5         37363590        499     74876.9     10621.0       380    8255876     518039.6  ioctl                 
      0.3         19932179         24    830507.5      5725.5      1040    7347977    2234347.4  mmap                  
      0.0           940973      

### Exercise: Prefetch Memory Back to the CPU

Add prefetching back to the CPU for the function that verifies the correctness of the `addVectorsInto` kernel. Again, hypothesize about the impact on UM before profiling with `nsys` to confirm. Refer to [the solution](08-prefetch/solutions/02-vector-add-prefetch-solution-cpu-also.cu) if you get stuck.

In [22]:
!nvcc -o prefetch-to-cpu 01-vector-add/01-vector-add.cu -run

Success! All values calculated correctly.


In [23]:
!nsys profile --stats=true ./prefetch-to-cpu

Success! All values calculated correctly.
Generating '/tmp/nsys-report-2975.qdstrm'
[3/8] Executing 'nvtx_sum' stats report
SKIPPED: /dli/task/report10.sqlite does not contain NV Tools Extension (NVTX) data.
[4/8] Executing 'osrt_sum' stats report

 Time (%)  Total Time (ns)  Num Calls   Avg (ns)    Med (ns)   Min (ns)  Max (ns)   StdDev (ns)           Name         
 --------  ---------------  ---------  ----------  ----------  --------  ---------  -----------  ----------------------
     90.3       6153995411        318  19352186.8  10069918.0      2760  100152978   27636083.2  poll                  
      8.8        597306693        283   2110624.4   2064941.0       120   20566435    1502932.6  sem_timedwait         
      0.6         39398849        499     78955.6     13560.0       370    9426475     521935.9  ioctl                 
      0.3         19160014         24    798333.9      5375.0      1190    7213967    2152662.6  mmap                  
      0.0          1097769     

After this series of refactors to use asynchronous prefetching, you should see that there are fewer, but larger, memory transfers, and that the kernel execution time is significantly decreased.

---
## Summary

At this point in the lab, you are able to:

- Use the Nsight Systems command line tool (**nsys**) to profile accelerated application performance.
- Leverage an understanding of **Streaming Multiprocessors** to optimize execution configurations.
- Understand the behavior of **Unified Memory** with regard to page faulting and data migrations.
- Use **asynchronous memory prefetching** to reduce page faults and data migrations for increased performance.
- Employ an iterative development cycle to rapidly accelerate and deploy applications.

In order to consolidate your learning, and reinforce your ability to iteratively accelerate, optimize, and deploy applications, please proceed to this lab's final exercise. After completing it, for those of you with time and interest, please proceed to the *Advanced Content* section.

---
## Final Exercise: Iteratively Optimize an Accelerated SAXPY Application

A basic accelerated SAXPY (Single-Precision A\*X Plus Y) application has been provided for you [here](09-saxpy/01-saxpy.cu). It currently works and you can compile, run, and then profile it with `nsys profile` below.

Record the runtime of the `saxpy` kernel without making any modifications and then work *iteratively* to optimize the application, using `nsys profile` after each iteration to notice the effects of the code changes on kernel performance and UM behavior.

Utilize the techniques from this lab. To support your learning, utilize [effortful retrieval](http://sites.gsu.edu/scholarlyteaching/effortful-retrieval/) whenever possible, rather than rushing to look up the specifics of techniques from earlier in the lesson.

Your end goal is a `saxpy` kernel that still produces correct results, without modifying `N`, and that the profiler reports as running in under *200,000 ns*. Check out [the solution](09-saxpy/solutions/02-saxpy-solution.cu) if you get stuck, and feel free to compile and profile it if you wish.
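While optimizing, it helps to confirm that the kernel's results stay correct after each change. The following is a minimal host-side sketch of a SAXPY reference and a comparison helper. The names `saxpyReference` and `allClose`, the use of `std::vector<float>`, and the tolerance are assumptions for illustration. The data layout and element type in the provided `01-saxpy.cu` may differ.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical CPU reference for SAXPY: out[i] = a * x[i] + y[i].
std::vector<float> saxpyReference(float a, const std::vector<float> &x, const std::vector<float> &y)
{
  assert(x.size() == y.size());
  std::vector<float> out(x.size());
  for (std::size_t i = 0; i < x.size(); ++i)
    out[i] = a * x[i] + y[i];
  return out;
}

// Returns true when both vectors have the same length and every pair of
// elements differs by at most `tolerance`.
bool allClose(const std::vector<float> &actual, const std::vector<float> &expected, float tolerance)
{
  if (actual.size() != expected.size()) return false;
  for (std::size_t i = 0; i < actual.size(); ++i)
    if (std::fabs(actual[i] - expected[i]) > tolerance) return false;
  return true;
}
```

You could copy a small slice of the kernel's output into a `std::vector<float>` and compare it against `saxpyReference` with `allClose` after each optimization step.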

In [24]:
!nvcc -o saxpy 09-saxpy/01-saxpy.cu -run

c[0] = 0, c[1] = 0, c[2] = 0, c[3] = 0, c[4] = 0, 
c[4194299] = 0, c[4194300] = 0, c[4194301] = 0, c[4194302] = 0, c[4194303] = 0, 


In [25]:
!nsys profile --stats=true ./saxpy

c[0] = 0, c[1] = 0, c[2] = 0, c[3] = 0, c[4] = 0, 
c[4194299] = 0, c[4194300] = 0, c[4194301] = 0, c[4194302] = 0, c[4194303] = 0, 
Generating '/tmp/nsys-report-6c59.qdstrm'
[3/8] Executing 'nvtx_sum' stats report
SKIPPED: /dli/task/report11.sqlite does not contain NV Tools Extension (NVTX) data.
[4/8] Executing 'osrt_sum' stats report

 Time (%)  Total Time (ns)  Num Calls  Avg (ns)    Med (ns)   Min (ns)  Max (ns)  StdDev (ns)           Name         
 --------  ---------------  ---------  ---------  ----------  --------  --------  -----------  ----------------------
     69.4        190753371         21  9083493.9  10066988.0      2560  47068833   10430287.6  poll                  
     16.2         44649764         19  2349987.6    229301.0       160  20459149    5008192.2  sem_timedwait         
     12.8         35066636        497    70556.6     10440.0       380   8209655     472789.2  ioctl                 
      0.8          2208789         23    96034.3      7050.0      1100 