<img src="Images/nvidia_header.png" style="margin-left: -30px; width: 300px; float: left;">

## Exercise: Use NVTX

In this exercise, you will use NVTX (NVIDIA Tools Extension) to annotate your code so that the phases of your application are easier to identify in the profiler timeline.

In [None]:
import os

if os.getenv("COLAB_RELEASE_TAG"): # If running in Google Colab:
  !mkdir -p Sources
  !wget https://raw.githubusercontent.com/NVIDIA/accelerated-computing-hub/refs/heads/main/gpu-cpp-tutorial/notebooks/02.02-Asynchrony/Sources/ach.h -nv -O Sources/ach.h
  !wget https://raw.githubusercontent.com/NVIDIA/accelerated-computing-hub/refs/heads/main/gpu-cpp-tutorial/notebooks/02.02-Asynchrony/Sources/nvtx3.hpp -nv -O Sources/nvtx3.hpp
  !sudo apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/7fa2af80.pub > /dev/null 2>&1
  !sudo add-apt-repository -y "deb https://developer.download.nvidia.com/devtools/repos/ubuntu$(source /etc/lsb-release; echo "$DISTRIB_RELEASE" | tr -d .)/$(dpkg --print-architecture)/ /" > /dev/null 2>&1
  !sudo apt install -y nsight-systems > /dev/null 2>&1

In [None]:
%%writefile Sources/nvtx.cpp
#include "ach.h"

void simulate(int width, int height, const thrust::device_vector<float> &in,
              thrust::device_vector<float> &out)
{
  cuda::std::mdspan temp_in(thrust::raw_pointer_cast(in.data()), height, width);
  cub::DeviceTransform::Transform(
      thrust::make_counting_iterator(0), out.begin(), width * height,
      [=] __host__ __device__(int id) { return ach::compute(id, temp_in); });
}

int main()
{
  int height = 2048;
  int width = 8192;

  thrust::device_vector<float> d_prev = ach::init(height, width);
  thrust::device_vector<float> d_next(height * width);
  thrust::host_vector<float> h_prev(height * width);

  const int compute_steps = 750;
  const int write_steps = 3;
  for (int write_step = 0; write_step < write_steps; write_step++)
  {
    nvtx3::scoped_range r{std::string("write step ") + std::to_string(write_step)};

    {
      // TODO: Annotate the "copy" step using nvtx range
      thrust::copy(d_prev.begin(), d_prev.end(), h_prev.begin());
    }

    {
      // TODO: Annotate the "compute" step using nvtx range
      for (int compute_step = 0; compute_step < compute_steps; compute_step++)
      {
        simulate(width, height, d_prev, d_next);
        d_prev.swap(d_next);
      }
    }

    {
      // TODO: Annotate the "write" step using nvtx range
      ach::store(write_step, height, width, h_prev);
    }

    {
      // TODO: Annotate the "wait" step using nvtx range
      cudaDeviceSynchronize();
    }
  }
}


In [None]:
!nvcc --extended-lambda -o /tmp/a.out Sources/nvtx.cpp -x cu -arch=native # build executable
!nsys profile --force-overwrite true -o nvtx /tmp/a.out # run and profile executable

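The commands above build the executable and profile it with Nsight Systems. The profiling report is stored in the current directory under the name `nvtx` (recent Nsight Systems versions add the `.nsys-rep` extension).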

If you just completed the Nsight exercise, the Nsight Systems UI should still be open.  
If not, review the steps provided in the [Nsight exercise](02.02.03-Exercise-Nsight.ipynb).

Open the new `nvtx` report and navigate to the timeline of your application.
Identify:
- when GPU compute is launched
- when the CPU writes data to disk
- when the CPU waits for the GPU
- when data is transferred between the CPU and the GPU

If you’re unsure how to proceed, expand the hints below. Use them only after giving the problem a genuine attempt.

<details>
  <summary>Hints</summary>
  
  - `nvtx3::scoped_range r{"name"}` creates a range called `name` that starts where `r` is constructed and ends when `r` goes out of scope
  - you can find NVTX ranges in the "NVTX" timeline row of Nsight Systems
</details>

Open this section only after you’ve made a serious attempt at solving the problem. Once you’ve completed your solution, compare it with the reference provided here to evaluate your approach and identify any potential improvements.

<details>
  <summary>Solution</summary>

  You can annotate scopes as follows:
  ```c++
  {
    nvtx3::scoped_range r{"copy"};
    thrust::copy(d_prev.begin(), d_prev.end(), h_prev.begin());
  }

  {
    nvtx3::scoped_range r{"compute"};
    for (int compute_step = 0; compute_step < compute_steps; compute_step++) {
      simulate(width, height, d_prev, d_next);
      d_prev.swap(d_next);
    }
  }

  {
    nvtx3::scoped_range r{"write"};
    ach::store(write_step, height, width, h_prev);
  }

  {
    nvtx3::scoped_range r{"wait"};
    cudaDeviceSynchronize();
  }
  ```

  You can find the full solution [here](Solutions/nvtx.cu).<br>
  The resulting timeline should look like this:

  ![Compute](Images/nvtx.png "NVTX")

</details>
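
The braces around each step matter because a scoped range is tied to the lifetime of its object: it opens at construction and closes at the end of the enclosing scope. The sketch below illustrates that behavior without NVTX. `toy_scoped_range`, `event_log`, and `trace_one_step` are hypothetical names invented for this illustration, not part of the NVTX API; the toy records begin and end events in a vector instead of emitting profiler markers.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a profiler: stores events instead of emitting NVTX markers.
struct event_log {
  std::vector<std::string> events;
  int depth = 0; // number of ranges currently open
};

// Toy RAII range (NOT the NVTX API): opens on construction, closes on destruction.
class toy_scoped_range {
public:
  toy_scoped_range(event_log &log, std::string name)
      : log_(log), name_(std::move(name)) {
    log_.events.push_back("begin " + name_);
    ++log_.depth;
  }

  ~toy_scoped_range() {
    assert(log_.depth > 0); // a range can only close if it was opened
    --log_.depth;
    log_.events.push_back("end " + name_);
  }

  // Non-copyable, so each range is closed exactly once.
  toy_scoped_range(const toy_scoped_range &) = delete;
  toy_scoped_range &operator=(const toy_scoped_range &) = delete;

private:
  event_log &log_;
  std::string name_;
};

// Mirrors the shape of the exercise: one outer range with two nested ranges.
std::vector<std::string> trace_one_step() {
  event_log log;
  {
    toy_scoped_range outer{log, "write step 0"};
    {
      toy_scoped_range r{log, "copy"};
      // work attributed to "copy" would go here
    } // "copy" closes here
    {
      toy_scoped_range r{log, "compute"};
      // work attributed to "compute" would go here
    } // "compute" closes here
  } // "write step 0" closes last, after both nested ranges
  return log.events;
}
```

Calling `trace_one_step()` from a `main` function returns the events in nesting order: the outer range begins first and ends last, and each inner range begins and ends inside it. Without the inner braces, both inner objects would live until the end of the outer scope, so the "copy" range would also cover the compute work.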

---
Great job!  You've learned how to use NVTX to annotate your code.  Proceed to the [next section](../02.03-Streams/02.03.01-Streams.ipynb) on streams.
