<h1><div align="center">Managing Accelerated Application Memory with CUDA C/C++ Unified Memory and nvprof</div></h1>

![CUDA](./images/CUDA_Logo.jpg)

The [*CUDA Best Practices Guide*](http://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#memory-optimizations), a highly recommended follow-up to this and other CUDA fundamentals labs, recommends a design cycle called **APOD**: **A**ssess, **P**arallelize, **O**ptimize, **D**eploy. In short, APOD prescribes an iterative design process, where developers can apply incremental improvements to their accelerated application's performance, and ship their code. As developers become more competent CUDA programmers, more advanced optimization techniques can be applied to their accelerated codebases.

This lab will support such a style of iterative development. You will be using the **NVIDIA Command Line Profiler** to quantitatively measure your application's performance, and to identify opportunities for optimization, after which you will apply incremental improvements before learning new techniques and repeating the cycle. As a point of focus, many of the techniques you will be learning and applying in this lab will deal with the specifics of how CUDA's **Unified Memory** works. Understanding Unified Memory behavior is a fundamental skill for CUDA developers, and serves as a prerequisite to many more advanced memory management techniques.

---
## Prerequisites

To get the most out of this lab you should already be able to:

- Write, compile, and run C/C++ programs that both call CPU functions and launch GPU kernels.
- Control parallel thread hierarchy using execution configuration.
- Refactor serial loops to execute their iterations in parallel on a GPU.
- Allocate and free Unified Memory.

---
## Objectives

By the time you complete this lab, you will be able to:

- Use the **NVIDIA Command Line Profiler** (**nvprof**) to profile accelerated application performance.
- Leverage an understanding of **Streaming Multiprocessors** to optimize execution configurations.
- Understand the behavior of **Unified Memory** with regard to page faulting and data migrations.
- Use **asynchronous memory prefetching** to reduce page faults and data migrations for increased performance.
- Employ an iterative development cycle to rapidly accelerate and deploy applications.

---
## Iterative Optimizations with the NVIDIA Command Line Profiler

The only way to be assured that attempts at optimizing accelerated code bases are actually successful is to profile the application for quantitative information about the application's performance. `nvprof` is the NVIDIA command line profiler. It ships with the CUDA toolkit, and is a powerful tool for profiling accelerated applications.

`nvprof` is easy to use. Its most basic usage is to simply pass it the path to an executable compiled with `nvcc`. `nvprof` will proceed to execute the application, after which it will print a summary output of the application's GPU activities, CUDA API calls, as well as information about **Unified Memory** activity, a topic which will be covered extensively later in this lab.

When accelerating applications, or optimizing already-accelerated applications, take a scientific and iterative approach. Profile your application after making changes, take note, and record the implications of any refactoring on performance. Make these observations early and often: frequently, enough of a performance boost can be gained with little effort that you can ship your accelerated application. Additionally, frequent profiling will teach you how specific changes to your CUDA codebases impact their actual performance: knowledge that is hard to acquire when you profile only after making many kinds of changes at once.


### Exercise: Profile an Application with nvprof

[01-vector-add.cu](../../../../../edit/tasks/task1/task/02_AC_UM_NVPROF/01-vector-add/01-vector-add.cu) is a naively accelerated vector addition program (you can click on this and any of the source file links in this lab to open them for editing). Run the two code execution cells below by `CTRL` + clicking them. The first code execution cell will compile (and run) the vector addition program. The second code execution cell will profile the executable that was just compiled using `nvprof`.

After profiling the application, answer the following questions using information displayed in the profiling output:

- What was the name of the only CUDA kernel called in this application?
- How many times did this kernel run?
- How long did it take this kernel to run? Record this time somewhere: you will be optimizing this application and will want to know how much faster you can make it.

In [1]:
!nvcc -arch=sm_70 -o single-thread-vector-add 01-vector-add/01-vector-add.cu -run

Success! All values calculated correctly.


In [2]:
!nvprof ./single-thread-vector-add

==190== NVPROF is profiling process 190, command: ./single-thread-vector-add
Success! All values calculated correctly.
==190== Profiling application: ./single-thread-vector-add
==190== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:  100.00%  2.37084s         1  2.37084s  2.37084s  2.37084s  addVectorsInto(float*, float*, float*, int)
      API calls:   71.42%  2.37087s         1  2.37087s  2.37087s  2.37087s  cudaDeviceSynchronize
                   27.84%  924.26ms         3  308.09ms  19.116us  924.20ms  cudaMallocManaged
                    0.71%  23.596ms         3  7.8652ms  7.2328ms  9.0417ms  cudaFree
                    0.01%  303.67us        94  3.2300us     609ns  115.52us  cuDeviceGetAttribute
                    0.01%  249.53us         1  249.53us  249.53us  249.53us  cuDeviceTotalMem
                    0.00%  124.81us         1  124.81us  124.81us  124.81us  cudaLaunch
                    0.00%  18.655u

### Exercise: Optimize and Profile

Take a minute or two to make a simple optimization to [01-vector-add.cu](../../../../../edit/tasks/task1/task/02_AC_UM_NVPROF/01-vector-add/01-vector-add.cu) by updating its execution configuration so that it runs on many threads in a single thread block. Recompile and then profile with `nvprof` using the code execution cells below. Use the profiling output to check the runtime of the kernel. What was the speed up from this optimization? Be sure to record your results somewhere.

In [3]:
!nvcc -arch=sm_70 -o multi-thread-vector-add 01-vector-add/01-vector-add.cu -run

Success! All values calculated correctly.


In [4]:
!nvprof ./multi-thread-vector-add

==244== NVPROF is profiling process 244, command: ./multi-thread-vector-add
Success! All values calculated correctly.
==244== Profiling application: ./multi-thread-vector-add
==244== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:  100.00%  722.02ms         1  722.02ms  722.02ms  722.02ms  addVectorsInto(float*, float*, float*, int)
      API calls:   79.93%  722.05ms         1  722.05ms  722.05ms  722.05ms  cudaDeviceSynchronize
                   17.44%  157.58ms         3  52.528ms  19.074us  157.52ms  cudaMallocManaged
                    2.56%  23.087ms         3  7.6955ms  7.0528ms  8.9298ms  cudaFree
                    0.03%  256.11us        94  2.7240us     610ns  68.869us  cuDeviceGetAttribute
                    0.03%  248.72us         1  248.72us  248.72us  248.72us  cuDeviceTotalMem
                    0.01%  122.36us         1  122.36us  122.36us  122.36us  cudaLaunch
                    0.00%  18.094us 
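
The speedup asked for in this exercise is simply the baseline kernel time divided by the optimized kernel time. As a minimal sketch, the hypothetical helper `speedup` below (it is not part of the lab's source files) does that arithmetic. The values in the usage note are the two `addVectorsInto` times shown in the profiler outputs above; your own measurements will differ.

```cpp
#include <cassert>

// Hypothetical helper (not part of the lab code): how many times faster the
// optimized run is than the baseline. Both arguments must use the same unit.
double speedup(double baselineTime, double optimizedTime) {
    assert(baselineTime > 0.0 && optimizedTime > 0.0);
    return baselineTime / optimizedTime;
}
```

For example, `speedup(2370.84, 722.02)`, with both times in milliseconds (the `2.37084s` and `722.02ms` reported above), evaluates to roughly 3.28, so the multi-threaded kernel ran a little over three times faster than the single-threaded one.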

### Exercise: Optimize Iteratively

In this exercise you will go through several cycles of editing the execution configuration of [01-vector-add.cu](../../../../../edit/tasks/task1/task/02_AC_UM_NVPROF/01-vector-add/01-vector-add.cu), profiling it, and recording the results to see the impact. Use the following guidelines while working:

- Start by listing 3 to 5 different ways you will update the execution configuration, being sure to cover a range of different grid and block size combinations.
- Edit the [01-vector-add.cu](../../../../../edit/tasks/task1/task/02_AC_UM_NVPROF/01-vector-add/01-vector-add.cu) program in one of the ways you listed.
- Compile and profile your updated code with the two code execution cells below.
- Record the runtime of the kernel execution, as given in the profiling output.
- Repeat the edit/profile/record cycle for each possible optimization you listed above.

Which of the execution configurations you attempted proved to be the fastest?

In [11]:
!nvcc -arch=sm_70 -o iteratively-optimized-vector-add 01-vector-add/01-vector-add.cu -run

Success! All values calculated correctly.


In [12]:
!nvprof ./iteratively-optimized-vector-add

==460== NVPROF is profiling process 460, command: ./iteratively-optimized-vector-add
Success! All values calculated correctly.
==460== Profiling application: ./iteratively-optimized-vector-add
==460== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:  100.00%  413.97ms         1  413.97ms  413.97ms  413.97ms  addVectorsInto(float*, float*, float*, int)
      API calls:   69.45%  413.96ms         1  413.96ms  413.96ms  413.96ms  cudaDeviceSynchronize
                   26.55%  158.24ms         3  52.748ms  18.999us  158.18ms  cudaMallocManaged
                    3.90%  23.224ms         3  7.7413ms  7.0717ms  9.0250ms  cudaFree
                    0.04%  256.17us        94  2.7250us     614ns  68.464us  cuDeviceGetAttribute
                    0.04%  251.40us         1  251.40us  251.40us  251.40us  cuDeviceTotalMem
                    0.02%  124.00us         1  124.00us  124.00us  124.00us  cudaLaunch
                  

---
## Streaming Multiprocessors and Querying the Device

This section explores how understanding a specific feature of the GPU hardware can promote optimization. After introducing **Streaming Multiprocessors**, you will attempt to further optimize the accelerated vector addition program you have been working on.

The following slides present upcoming material visually, at a high level. Click through the slides before moving on to more detailed coverage of their topics in following sections.

In [13]:
%%HTML

<div align="center"><iframe src="https://docs.google.com/presentation/d/e/2PACX-1vQTzaK1iaFkxgYxaxR5QgHCVx1ZqhpX2F3q9UU6sGKCYaNIq6CGAo8W_qyzg2qwpeiZoHd7NCug7OTj/embed?start=false&loop=false&delayms=3000" frameborder="0" width="900" height="550" allowfullscreen="true" mozallowfullscreen="true" webkitallowfullscreen="true"></iframe></div>

### Streaming Multiprocessors and Warps

The GPUs that CUDA applications run on have processing units called **streaming multiprocessors**, or **SMs**. During kernel execution, blocks of threads are given to SMs to execute. In order to support the GPU's ability to perform as many parallel operations as possible, performance gains can often be had by *choosing a grid size that has a number of blocks that is a multiple of the number of SMs on a given GPU.*

Additionally, SMs create, manage, schedule, and execute groupings of 32 threads from within a block called **warps**. More [in-depth coverage of SMs and warps](http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#hardware-implementation) is beyond the scope of this course; however, it is important to know that performance gains can also be had by *choosing a block size that has a number of threads that is a multiple of 32.*
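
Both guidelines come down to rounding a count up to the next multiple of some number. The sketch below shows that arithmetic with a hypothetical helper, `roundUpToMultiple`, which is not part of the lab code. The SM count of 80 used in the usage note is only an illustrative value.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: the smallest multiple of `multiple` that is >= `value`.
std::size_t roundUpToMultiple(std::size_t value, std::size_t multiple) {
    assert(multiple > 0);
    return ((value + multiple - 1) / multiple) * multiple;
}
```

For example, `roundUpToMultiple(250, 32)` gives a block size of 256 threads (a whole number of warps), and `roundUpToMultiple(100, 80)` gives a grid of 160 blocks for a GPU assumed to have 80 SMs. Whether such a configuration is actually faster still has to be confirmed with the profiler.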

### Programmatically Querying GPU Device Properties

In order to support portability, since the number of SMs on a GPU can differ depending on the specific GPU being used, the number of SMs should not be hard-coded into a codebase. Rather, this information should be acquired programmatically.

The following shows how, in CUDA C/C++, to obtain a C struct which contains many properties about the currently active GPU device, including its number of SMs:

```cpp
int deviceId;
cudaGetDevice(&deviceId);                  // `deviceId` now holds the ID of the currently active GPU.

cudaDeviceProp props;
cudaGetDeviceProperties(&props, deviceId); // `props` now has many useful properties about
                                           // the active GPU device.
```

### Exercise: Query the Device

Currently, [`01-get-device-properties.cu`](../../../../../edit/tasks/task1/task/02_AC_UM_NVPROF/04-device-properties/01-get-device-properties.cu) contains many unassigned variables, and will print gibberish information intended to describe details about the currently active GPU.

Build out [`01-get-device-properties.cu`](../../../../../edit/tasks/task1/task/02_AC_UM_NVPROF/04-device-properties/01-get-device-properties.cu) to print the actual values for the desired device properties indicated in the source code. In order to support your work, and as an introduction to them, use the [CUDA Runtime Docs](http://docs.nvidia.com/cuda/cuda-runtime-api/structcudaDeviceProp.html) to help identify the relevant properties in the device props struct. Refer to [the solution](../../../../../edit/tasks/task1/task/02_AC_UM_NVPROF/04-device-properties/solutions/01-get-device-properties-solution.cu) if you get stuck.

In [15]:
!nvcc -arch=sm_70 -o get-device-properties 04-device-properties/01-get-device-properties.cu -run

Device ID: 0
Number of SMs: 80
Compute Capability Major: 7
Compute Capability Minor: 0
Warp Size: 32


### Exercise: Optimize Vector Add with Grids Sized to Number of SMs

Utilize your ability to query the device for its number of SMs to refactor the `addVectorsInto` kernel you have been working on inside [01-vector-add.cu](../../../../../edit/tasks/task1/task/02_AC_UM_NVPROF/01-vector-add/01-vector-add.cu) so that it launches with a grid containing a number of blocks that is a multiple of the number of SMs on the device.

Depending on other specific details in the code you have written, this refactor may or may not improve, or significantly change, the performance of your kernel. Therefore, as always, be sure to use `nvprof` so that you can quantitatively evaluate performance changes. Record the results with the rest of your findings thus far, based on the profiling output.
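
One detail worth keeping in mind: once the grid size is derived from the SM count rather than from the vector length, the kernel can no longer assume one thread per element, so it needs a grid-stride loop. The following is a minimal sketch of that combination, not the lab's solution. The parameter order of `addVectorsInto`, the vector size, the factor of 32 and the driver function `runSmSizedVectorAdd` are all assumptions made for illustration.

```cpp
#include <cstddef>
#include <cuda_runtime.h>

// Grid-stride loop: each thread handles elements index, index + stride, ...
// so the result is correct for any grid size, including one derived from the
// SM count. The parameter order here is an assumption.
__global__ void addVectorsInto(float *result, float *a, float *b, int N) {
    int index  = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;
    for (int i = index; i < N; i += stride) {
        result[i] = a[i] + b[i];
    }
}

// Hypothetical driver: returns true if every element of the sum is correct.
bool runSmSizedVectorAdd() {
    const int N = 1 << 20;                 // illustrative size, not the lab's N
    const size_t size = N * sizeof(float);

    int deviceId;
    cudaGetDevice(&deviceId);
    cudaDeviceProp props;
    cudaGetDeviceProperties(&props, deviceId);

    const int threadsPerBlock = 256;                            // a multiple of 32
    const int numberOfBlocks  = 32 * props.multiProcessorCount; // a multiple of the SM count;
                                                                // 32 is an arbitrary factor to tune
    float *a, *b, *c;
    cudaMallocManaged(&a, size);
    cudaMallocManaged(&b, size);
    cudaMallocManaged(&c, size);
    for (int i = 0; i < N; ++i) { a[i] = 3.0f; b[i] = 4.0f; c[i] = 0.0f; }

    addVectorsInto<<<numberOfBlocks, threadsPerBlock>>>(c, a, b, N);
    bool ok = (cudaDeviceSynchronize() == cudaSuccess); // wait for the kernel to finish

    for (int i = 0; ok && i < N; ++i) ok = (c[i] == 7.0f);

    cudaFree(a);
    cudaFree(b);
    cudaFree(c);
    return ok;
}
```

Calling `runSmSizedVectorAdd()` from `main` and compiling with `nvcc` as in the cells in this lab is one way to try it. The multiplier applied to the SM count is exactly the kind of value to vary between profiling runs.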

In [18]:
!nvcc -arch=sm_70 -o sm-optimized-vector-add 01-vector-add/01-vector-add.cu -run

Success! All values calculated correctly.


In [19]:
!nvprof ./sm-optimized-vector-add

==618== NVPROF is profiling process 618, command: ./sm-optimized-vector-add
Success! All values calculated correctly.
==618== Profiling application: ./sm-optimized-vector-add
==618== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:  100.00%  141.77ms         1  141.77ms  141.77ms  141.77ms  addVectorsInto(float*, float*, float*, int)
      API calls:   49.47%  163.36ms         3  54.455ms  19.702us  163.30ms  cudaMallocManaged
                   42.93%  141.76ms         1  141.76ms  141.76ms  141.76ms  cudaDeviceSynchronize
                    7.32%  24.191ms         3  8.0635ms  7.4344ms  9.2131ms  cudaFree
                    0.08%  257.65us        94  2.7400us     611ns  70.417us  cuDeviceGetAttribute
                    0.08%  252.35us         1  252.35us  252.35us  252.35us  cudaGetDeviceProperties
                    0.08%  250.09us         1  250.09us  250.09us  250.09us  cuDeviceTotalMem
                    0.0

---
## Unified Memory Details

You have been allocating memory intended for use by either host or device code with `cudaMallocManaged`, and up until now have enjoyed the benefits of this method - automatic memory migration, ease of programming - without diving into the details of how the **Unified Memory** (**UM**) allocated by `cudaMallocManaged` actually works. `nvprof` provides details about UM management in accelerated applications, and using this information, in conjunction with a more detailed understanding of how UM works, provides additional opportunities to optimize accelerated applications.

The following slides present upcoming material visually, at a high level. Click through the slides before moving on to more detailed coverage of their topics in following sections.

In [21]:
%%HTML

<div align="center"><iframe src="https://docs.google.com/presentation/d/e/2PACX-1vS0-BCGiWUb82r1RH-4cSRmZjN2vjebqoodlHIN1fvtt1iDh8X8W9WOSlLVxcsY747WVIebw13cDYBO/embed?start=false&loop=false&delayms=3000" frameborder="0" width="900" height="550" allowfullscreen="true" mozallowfullscreen="true" webkitallowfullscreen="true"></iframe></div>

### Unified Memory Migration

When UM is allocated, the memory is not resident yet on either the host or the device. When either the host or device attempts to access the memory, a [page fault](https://en.wikipedia.org/wiki/Page_fault) will occur, at which point the host or device will migrate the needed data in batches. Similarly, at any point when the CPU, or any GPU in the accelerated system, attempts to access memory not yet resident on it, page faults will occur and trigger its migration.

The ability to page fault and migrate memory on demand is tremendously helpful for ease of development in your accelerated applications. Additionally, when working with data that exhibits sparse access patterns, for example when it is impossible to know which data will be required to be worked on until the application actually runs, and for scenarios when data might be accessed by multiple GPU devices in an accelerated system with multiple GPUs, on-demand memory migration is remarkably beneficial.

There are times - for example when data needs are known prior to runtime, and large contiguous blocks of memory are required - when the overhead of page faulting and migrating data on demand is a cost that would be better avoided.

Much of the remainder of this lab will be dedicated to understanding on-demand migration, and how to identify it in the profiler's output. With this knowledge you will be able to reduce its overhead in scenarios where doing so would be beneficial.

### Exercise: Explore UM Page Faulting

`nvprof` provides output describing UM behavior for the profiled application. In this exercise, you will make several modifications to a simple application, and make use of `nvprof`'s Unified Memory output section after each change, to explore how UM data migration behaves.

[`01-page-faults.cu`](../../../../../edit/tasks/task1/task/02_AC_UM_NVPROF/06-unified-memory-page-faults/01-page-faults.cu) contains a `hostFunction` and a `gpuKernel`, both of which could be used to initialize the elements of a `2<<24` element vector with the number `1`. Currently neither the host function nor the GPU kernel is being used.
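
Note that `2<<24` shifts the value 2 left by 24 bits, which yields 2^25 = 33,554,432 elements rather than 2^24. The sketch below spells out that arithmetic. The names `kElementCount` and `vectorBytes` are hypothetical, and the 4-byte element size in the usage note is an assumption based on the `int*` argument that appears in the profiler output for this exercise.

```cpp
#include <cstddef>

// Hypothetical names, for illustration only.
constexpr std::size_t kElementCount = 2 << 24;  // 2 * 2^24 = 2^25 elements

// Total size in bytes of a vector with the given element count and element size.
constexpr std::size_t vectorBytes(std::size_t elementCount, std::size_t bytesPerElement) {
    return elementCount * bytesPerElement;
}

static_assert(kElementCount == 33554432, "2<<24 is 2^25, not 2^24");
```

With 4-byte elements, `vectorBytes(kElementCount, 4)` is 134,217,728 bytes, or 128 MiB. That is the amount of managed memory UM may have to migrate in the experiments that follow.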

For each of the 4 questions below, given what you have just learned about UM behavior, first hypothesize about what kind of page faulting should happen, then, edit [`01-page-faults.cu`](../../../../../edit/tasks/task1/task/02_AC_UM_NVPROF/06-unified-memory-page-faults/01-page-faults.cu) to create a scenario, by using one or both of the 2 provided functions in the codebase, that will allow you to test your hypothesis.

In order to test your hypotheses, compile and profile your code using the code execution cells below. Be sure to record your hypotheses, as well as the results, obtained from `nvprof` output, specifically CPU and GPU page faults, for each of the 4 experiments you are conducting. There are links to solutions for each of the 4 experiments which you can refer to if you get stuck.

- What happens when unified memory is accessed only by the CPU? ([solution](../../../../../edit/tasks/task1/task/02_AC_UM_NVPROF/06-unified-memory-page-faults/solutions/01-page-faults-solution-cpu-only.cu))
- What happens when unified memory is accessed only by the GPU? ([solution](../../../../../edit/tasks/task1/task/02_AC_UM_NVPROF/06-unified-memory-page-faults/solutions/02-page-faults-solution-gpu-only.cu))
- What happens when unified memory is accessed first by the CPU then the GPU? ([solution](../../../../../edit/tasks/task1/task/02_AC_UM_NVPROF/06-unified-memory-page-faults/solutions/03-page-faults-solution-cpu-then-gpu.cu))
- What happens when unified memory is accessed first by the GPU then the CPU? ([solution](../../../../../edit/tasks/task1/task/02_AC_UM_NVPROF/06-unified-memory-page-faults/solutions/04-page-faults-solution-gpu-then-cpu.cu))

In [28]:
!nvcc -arch=sm_70 -o page-faults 06-unified-memory-page-faults/01-page-faults.cu -run

In [29]:
!nvprof ./page-faults

==834== NVPROF is profiling process 834, command: ./page-faults
==834== Profiling application: ./page-faults
==834== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:  100.00%  51.636ms         1  51.636ms  51.636ms  51.636ms  deviceKernel(int*, int)
      API calls:   72.15%  158.19ms         1  158.19ms  158.19ms  158.19ms  cudaMallocManaged
                   23.56%  51.644ms         1  51.644ms  51.644ms  51.644ms  cudaDeviceSynchronize
                    4.00%  8.7639ms         1  8.7639ms  8.7639ms  8.7639ms  cudaFree
                    0.12%  260.79us        94  2.7740us     640ns  70.319us  cuDeviceGetAttribute
                    0.11%  251.23us         1  251.23us  251.23us  251.23us  cuDeviceTotalMem
                    0.05%  101.34us         1  101.34us  101.34us  101.34us  cudaLaunch
                    0.01%  19.950us         1  19.950us  19.950us  19.950us  cuDeviceGetName
                    0.00%  4.

### Exercise: Revisit UM Behavior for Vector Add Program

Returning to the [01-vector-add.cu](../../../../../edit/tasks/task1/task/02_AC_UM_NVPROF/01-vector-add/01-vector-add.cu) program you have been working on throughout this lab, review the codebase in its current state, and hypothesize about what kinds of page faults you expect to occur. Look at the profiling output for your last refactor (either by scrolling up to find the output or by executing the code execution cell just below), observing the Unified Memory section of the profiler output. Can you explain the page faulting descriptions based on the contents of the code base?

In [30]:
!nvprof ./sm-optimized-vector-add

==846== NVPROF is profiling process 846, command: ./sm-optimized-vector-add
Success! All values calculated correctly.
==846== Profiling application: ./sm-optimized-vector-add
==846== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:  100.00%  143.63ms         1  143.63ms  143.63ms  143.63ms  addVectorsInto(float*, float*, float*, int)
      API calls:   48.41%  158.23ms         3  52.743ms  18.333us  158.17ms  cudaMallocManaged
                   43.95%  143.64ms         1  143.64ms  143.64ms  143.64ms  cudaDeviceSynchronize
                    7.35%  24.024ms         3  8.0081ms  7.3212ms  9.1470ms  cudaFree
                    0.08%  256.39us        94  2.7270us     613ns  68.706us  cuDeviceGetAttribute
                    0.08%  250.67us         1  250.67us  250.67us  250.67us  cuDeviceTotalMem
                    0.08%  250.05us         1  250.05us  250.05us  250.05us  cudaGetDeviceProperties
                    0.0

### Exercise: Initialize Vector in Kernel

When `nvprof` gives the amount of time that a kernel takes to execute, the host-to-device page faults and data migrations that occur during this kernel's execution are included in the displayed execution time.

With this in mind, refactor the `initWith` host function in your [01-vector-add.cu](../../../../../edit/tasks/task1/task/02_AC_UM_NVPROF/01-vector-add/01-vector-add.cu) program to instead be a CUDA kernel, initializing the allocated vector in parallel on the GPU. After successfully compiling and running the refactored application, but before profiling it, hypothesize about the following:

- How do you expect the refactor to affect UM page-fault behavior?
- How do you expect the refactor to affect the reported run time of `addVectorsInto`?

Once again, record the results. Refer to [the solution](../../../../../edit/tasks/task1/task/02_AC_UM_NVPROF/07-init-in-kernel/solutions/01-vector-add-init-in-kernel-solution.cu) if you get stuck.
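
As a rough sketch of what such a refactor can look like (not the linked solution), the kernel below initializes a managed vector on the GPU with a grid-stride loop, so the memory is first touched by the device. The parameter order of `initWith`, the vector size, the `<<<256, 256>>>` configuration and the driver function `runInitInKernel` are assumptions made for illustration.

```cpp
#include <cstddef>
#include <cuda_runtime.h>

// Assumed signature: the value first, then the vector and its length.
__global__ void initWith(float num, float *a, int N) {
    int index  = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;
    for (int i = index; i < N; i += stride) {
        a[i] = num;  // the first touch of this memory happens on the GPU
    }
}

// Hypothetical driver: returns true if the whole vector holds the value.
bool runInitInKernel() {
    const int N = 1 << 20;  // illustrative size, not the lab's N
    float *a;
    if (cudaMallocManaged(&a, N * sizeof(float)) != cudaSuccess) return false;

    initWith<<<256, 256>>>(3.0f, a, N);  // arbitrary execution configuration
    bool ok = (cudaDeviceSynchronize() == cudaSuccess);

    // Reading on the host afterwards migrates the data back on demand.
    for (int i = 0; ok && i < N; ++i) ok = (a[i] == 3.0f);

    cudaFree(a);
    return ok;
}
```

In the vector addition program the same kernel would be launched once per vector, before the addition kernel, with the host-side check kept as the final verification step.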

In [35]:
!nvcc -arch=sm_70 -o initialize-in-kernel 01-vector-add/01-vector-add.cu -run

Device ID: 0	Number of SMs: 80
Success! All values calculated correctly.


In [36]:
!nvprof ./initialize-in-kernel

==1008== NVPROF is profiling process 1008, command: ./initialize-in-kernel
Device ID: 0	Number of SMs: 80
Success! All values calculated correctly.
==1008== Profiling application: ./initialize-in-kernel
==1008== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   99.00%  49.838ms         3  16.613ms  16.174ms  16.924ms  initWith(float, float*, int)
                    1.00%  504.18us         1  504.18us  504.18us  504.18us  addArraysInto(float*, float*, float*, int)
      API calls:   69.01%  157.96ms         3  52.654ms  19.111us  157.91ms  cudaMallocManaged
                   21.95%  50.254ms         2  25.127ms  505.80us  49.748ms  cudaDeviceSynchronize
                    8.60%  19.687ms         3  6.5622ms  5.5125ms  8.6397ms  cudaFree
                    0.16%  375.41us         4  93.853us  8.6520us  251.72us  cudaLaunch
                    0.12%  263.79us        94  2.8060us     610ns  70.476us  cuDeviceGetAttrib

---
## Asynchronous Memory Prefetching

A powerful technique to reduce the overhead of page faulting and on-demand memory migrations, both in host-to-device and device-to-host memory transfers, is called **asynchronous memory prefetching**. Using this technique allows programmers to asynchronously migrate unified memory (UM) to any CPU or GPU device in the system, in the background, prior to its use by application code. By doing this, the performance of GPU kernels and CPU functions can be increased on account of reduced page fault and on-demand data migration overhead.

Prefetching also tends to migrate data in larger chunks, and therefore fewer trips, than on-demand migration. This makes it an excellent fit when data access needs are known before runtime, and when data access patterns are not sparse.

CUDA makes asynchronously prefetching managed memory to either a GPU device or the CPU easy with its `cudaMemPrefetchAsync` function. Here is an example of using it to prefetch data first to the currently active GPU device and then to the CPU:

```cpp
int deviceId;
cudaGetDevice(&deviceId);                                         // The ID of the currently active GPU device.

cudaMemPrefetchAsync(pointerToSomeUMData, size, deviceId);        // Prefetch to GPU device.
cudaMemPrefetchAsync(pointerToSomeUMData, size, cudaCpuDeviceId); // Prefetch to host. `cudaCpuDeviceId` is a
                                                                  // built-in CUDA constant.
```
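
To show where these calls sit relative to a kernel launch, here is a minimal round-trip sketch: initialize on the host, prefetch to the GPU, run a kernel, prefetch back to the host, then verify. The kernel `doubleElements`, the driver `runPrefetchRoundTrip`, the vector size and the execution configuration are hypothetical. The calls use the three-argument form shown above, which assumes the toolkit version this lab targets; other toolkit releases may expose a different signature, so check the documentation for the version you compile against.

```cpp
#include <cstddef>
#include <cuda_runtime.h>

// Hypothetical kernel: doubles every element using a grid-stride loop.
__global__ void doubleElements(float *data, int N) {
    int index  = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;
    for (int i = index; i < N; i += stride) {
        data[i] *= 2.0f;
    }
}

// Hypothetical driver: returns true if every element was doubled.
bool runPrefetchRoundTrip() {
    const int N = 1 << 20;  // illustrative size
    const size_t size = N * sizeof(float);

    int deviceId;
    cudaGetDevice(&deviceId);

    float *data;
    if (cudaMallocManaged(&data, size) != cudaSuccess) return false;
    for (int i = 0; i < N; ++i) data[i] = 1.0f;  // first touched on the CPU

    cudaMemPrefetchAsync(data, size, deviceId);         // move to the GPU before the kernel needs it
    doubleElements<<<256, 256>>>(data, N);              // arbitrary execution configuration
    cudaMemPrefetchAsync(data, size, cudaCpuDeviceId);  // move back before the host reads it

    // All three operations were issued to the default stream, so they run in
    // order; synchronize before touching the data on the host.
    bool ok = (cudaDeviceSynchronize() == cudaSuccess);
    for (int i = 0; ok && i < N; ++i) ok = (data[i] == 2.0f);

    cudaFree(data);
    return ok;
}
```

Profiling a program built around a function like this, with and without the two prefetch calls, is one way to compare the Unified Memory sections of the `nvprof` output.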

### Exercise: Prefetch Memory

At this point in the lab, your [01-vector-add.cu](../../../../../edit/tasks/task1/task/02_AC_UM_NVPROF/01-vector-add/01-vector-add.cu) program should not only be launching a CUDA kernel to add 2 vectors into a third solution vector, all of which are allocated with `cudaMallocManaged`, but should also be initializing each of the 3 vectors in parallel in a CUDA kernel. If, for some reason, your application does not do all of the above, please refer to the following [reference application](../../../../../edit/tasks/task1/task/02_AC_UM_NVPROF/08-prefetch/01-vector-add-prefetch.cu), and update your own codebase to reflect its current functionality.

Conduct 3 experiments using `cudaMemPrefetchAsync` inside of your [01-vector-add.cu](../../../../../edit/tasks/task1/task/02_AC_UM_NVPROF/01-vector-add/01-vector-add.cu) application to understand its impact on page-faulting and memory migration.

- What happens when you prefetch one of the initialized vectors to the device?
- What happens when you prefetch two of the initialized vectors to the device?
- What happens when you prefetch all three of the initialized vectors to the device?

Hypothesize about UM behavior, page faulting specifically, as well as the impact on the reported run time of the initialization kernel, before each experiment, and then verify by running `nvprof`. Refer to [the solution](../../../../../edit/tasks/task1/task/02_AC_UM_NVPROF/08-prefetch/solutions/01-vector-add-prefetch-solution.cu) if you get stuck.

In [41]:
!nvcc -arch=sm_70 -o prefetch-to-gpu 01-vector-add/01-vector-add.cu -run

Device ID: 0	Number of SMs: 80
Success! All values calculated correctly.


In [42]:
!nvprof ./prefetch-to-gpu

==1173== NVPROF is profiling process 1173, command: ./prefetch-to-gpu
Device ID: 0	Number of SMs: 80
Success! All values calculated correctly.
==1173== Profiling application: ./prefetch-to-gpu
==1173== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   51.66%  503.94us         1  503.94us  503.94us  503.94us  addArraysInto(float*, float*, float*, int)
                   48.34%  471.50us         3  157.17us  154.69us  160.96us  initWith(float, float*, int)
      API calls:   82.02%  157.82ms         3  52.607ms  21.527us  157.76ms  cudaMallocManaged
                    9.86%  18.964ms         3  6.3213ms  5.1780ms  8.5974ms  cudaFree
                    4.93%  9.4830ms         2  4.7415ms  505.43us  8.9775ms  cudaDeviceSynchronize
                    2.79%  5.3624ms         3  1.7875ms  13.676us  4.7890ms  cudaMemPrefetchAsync
                    0.13%  256.14us        94  2.7240us     615ns  69.254us  cuDeviceGetAttrib

### Exercise: Prefetch Memory Back to the CPU

Add additional prefetching back to the CPU for the function that verifies the correctness of the `addVectorsInto` kernel. Again, hypothesize about the impact on UM before profiling with `nvprof` to confirm. Refer to [the solution](../../../../../edit/tasks/task1/task/02_AC_UM_NVPROF/08-prefetch/solutions/02-vector-add-prefetch-solution-cpu-also.cu) if you get stuck.

In [43]:
!nvcc -arch=sm_70 -o prefetch-to-cpu 01-vector-add/01-vector-add.cu -run

Device ID: 0	Number of SMs: 80
Success! All values calculated correctly.


In [44]:
!nvprof ./prefetch-to-cpu

==1229== NVPROF is profiling process 1229, command: ./prefetch-to-cpu
Device ID: 0	Number of SMs: 80
Success! All values calculated correctly.
==1229== Profiling application: ./prefetch-to-cpu
==1229== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   51.51%  502.69us         1  502.69us  502.69us  502.69us  addArraysInto(float*, float*, float*, int)
                   48.49%  473.28us         3  157.76us  155.58us  162.05us  initWith(float, float*, int)
      API calls:   75.11%  158.31ms         3  52.771ms  21.526us  158.25ms  cudaMallocManaged
                   10.45%  22.036ms         4  5.5090ms  13.354us  18.481ms  cudaMemPrefetchAsync
                    8.85%  18.657ms         3  6.2189ms  5.2277ms  8.1403ms  cudaFree
                    5.21%  10.982ms         2  5.4908ms  504.94us  10.477ms  cudaDeviceSynchronize
                    0.14%  285.71us         1  285.71us  285.71us  285.71us  cuDeviceTotalMem


---
## Summary

At this point in the lab, you are able to:

- Use the **NVIDIA Command Line Profiler** (**nvprof**) to profile accelerated application performance.
- Leverage an understanding of **Streaming Multiprocessors** to optimize execution configurations.
- Understand the behavior of **Unified Memory** with regard to page faulting and data migrations.
- Use **asynchronous memory prefetching** to reduce page faults and data migrations for increased performance.
- Employ an iterative development cycle to rapidly accelerate and deploy applications.

In order to consolidate your learning, and reinforce your ability to iteratively accelerate, optimize, and deploy applications, please proceed to this lab's final exercise. After completing it, for those of you with time and interest, please proceed to the *Advanced Content* section.

---
## Final Exercise: Iteratively Optimize an Accelerated SAXPY Application

A basic accelerated [SAXPY](https://en.wikipedia.org/wiki/Basic_Linear_Algebra_Subprograms#Level_1) application has been provided for you [here](../../../../../edit/tasks/task1/task/02_AC_UM_NVPROF/09-saxpy/01-saxpy.cu). It currently contains a couple of bugs that you will need to find and fix before you can successfully compile, run, and then profile it with `nvprof`.

After fixing the bugs and profiling the application, record the runtime of the `saxpy` kernel and then work *iteratively* to optimize the application, using `nvprof` after each iteration to notice the effects of the code changes on kernel performance and UM behavior.

Utilize the techniques from this lab. To support your learning, utilize [effortful retrieval](http://sites.gsu.edu/scholarlyteaching/effortful-retrieval/) whenever possible, rather than rushing to look up the specifics of techniques from earlier in the lesson.

Your end goal is a `saxpy` kernel that still produces correct results, without modifying `N`, and that the profiler reports as running in under *50us*. Check out [the solution](../../../../../edit/tasks/task1/task/02_AC_UM_NVPROF/09-saxpy/solutions/02-saxpy-solution.cu) if you get stuck, and feel free to compile and profile it if you wish.

In [47]:
!nvcc -arch=sm_70 -o saxpy 09-saxpy/01-saxpy.cu -run

c[0] = 5, c[1] = 5, c[2] = 5, c[3] = 5, c[4] = 5, 
c[4194299] = 5, c[4194300] = 5, c[4194301] = 5, c[4194302] = 5, c[4194303] = 5, 


In [48]:
!nvprof ./saxpy

==1297== NVPROF is profiling process 1297, command: ./saxpy
c[0] = 5, c[1] = 5, c[2] = 5, c[3] = 5, c[4] = 5, 
c[4194299] = 5, c[4194300] = 5, c[4194301] = 5, c[4194302] = 5, c[4194303] = 5, 
==1297== Profiling application: ./saxpy
==1297== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:  100.00%  71.914us         1  71.914us  71.914us  71.914us  saxpy(int*, int*, int*)
      API calls:   94.31%  162.42ms         3  54.141ms  26.128us  162.36ms  cudaMallocManaged
                    2.40%  4.1303ms         1  4.1303ms  4.1303ms  4.1303ms  cudaDeviceSynchronize
                    1.58%  2.7279ms         3  909.29us  894.37us  928.88us  cudaFree
                    1.31%  2.2584ms         3  752.79us  16.648us  2.0987ms  cudaMemPrefetchAsync
                    0.15%  266.63us        94  2.8360us     614ns  70.172us  cuDeviceGetAttribute
                    0.14%  249.00us         1  249.00us  249.00us  249.00us  cuDev