## Exercise: Profile Your Code with Nsight Systems

In this exercise, you will learn how to profile your code with Nsight Systems.
Nsight Systems is a system-wide performance analysis tool that shows CPU and GPU activity on a single timeline.

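To run Nsight Systems from the command line, use `nsys`. If you are running in Google Colab, the first cell below downloads the exercise sources and installs Nsight Systems. The second cell builds the example and profiles it: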

In [None]:
import os

if os.getenv("COLAB_RELEASE_TAG"): # If running in Google Colab:
  !mkdir -p Solutions
  !wget https://raw.githubusercontent.com/NVIDIA/accelerated-computing-hub/refs/heads/main/gpu-cpp-tutorial/notebooks/02.02-Asynchrony/Solutions/ach.h -nv -O Solutions/ach.h
  !wget https://raw.githubusercontent.com/NVIDIA/accelerated-computing-hub/refs/heads/main/gpu-cpp-tutorial/notebooks/02.02-Asynchrony/Solutions/nvtx3.hpp -nv -O Solutions/nvtx3.hpp
  !wget https://raw.githubusercontent.com/NVIDIA/accelerated-computing-hub/refs/heads/main/gpu-cpp-tutorial/notebooks/02.02-Asynchrony/Sources/compute-io-overlap.cpp -nv -O Solutions/compute-io-overlap.cpp
  !sudo apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/7fa2af80.pub > /dev/null 2>&1
  !sudo add-apt-repository -y "deb https://developer.download.nvidia.com/devtools/repos/ubuntu$(source /etc/lsb-release; echo "$DISTRIB_RELEASE" | tr -d .)/$(dpkg --print-architecture)/ /" > /dev/null 2>&1
  !sudo apt install -y nsight-systems > /dev/null 2>&1

In [None]:
!nvcc --extended-lambda -o /tmp/a.out Solutions/compute-io-overlap.cpp -x cu -arch=native # build executable
!nsys profile --force-overwrite true -o compute-io-overlap /tmp/a.out # run and profile executable

The profiling cell above writes the report to `compute-io-overlap.nsys-rep` in the current directory (`nsys` appends the `.nsys-rep` extension to the name passed with `-o`).
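The trace sources passed to `nsys` determine which timeline rows the report contains. CUDA tracing produces the CUDA API and "CUDA HW" rows, and OS runtime tracing produces the "OS runtime libraries" row where file I/O appears. The sketch below is a Python equivalent of the profiling cell that spells the trace set out explicitly with `--trace=cuda,nvtx,osrt`. The helper name `nsys_profile_command` is hypothetical, and the command only runs if both `nsys` and the executable built above are present.

```python
import os
import shutil
import subprocess


def nsys_profile_command(executable, output="compute-io-overlap"):
    """Build an `nsys profile` command line with the trace sources spelled out."""
    return [
        "nsys", "profile",
        "--trace=cuda,nvtx,osrt",     # CUDA API + GPU activity, NVTX ranges, OS runtime calls
        "--force-overwrite", "true",  # replace an existing report of the same name
        "-o", output,                 # nsys appends the .nsys-rep extension
        executable,
    ]


cmd = nsys_profile_command("/tmp/a.out")
print(" ".join(cmd))

# Only run when both the profiler and the executable built above are available.
if shutil.which("nsys") and os.path.exists("/tmp/a.out"):
    subprocess.run(cmd, check=True)
```

Passing the command as a list avoids shell-quoting issues. Inside a notebook, the `!nsys profile ...` shell form used in the cell above works just as well.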
To download the report file to your local machine:

1. Click on the "Files" tab on the left
2. Click the three-dot menu to the right of the file `compute-io-overlap.nsys-rep`
3. Click on "Download"
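In Colab, the same download can also be triggered from a cell. The following is a minimal sketch that assumes Colab's `google.colab.files.download` helper. The function name `download_report` is hypothetical, and outside Colab it only prints a reminder to copy the file manually.

```python
import os

REPORT = "compute-io-overlap.nsys-rep"


def download_report(path=REPORT):
    """Send the report to the browser as a download when running inside Colab.

    Returns True if a download was triggered, False outside Colab.
    """
    if not os.path.exists(path):
        raise FileNotFoundError(path)
    try:
        from google.colab import files  # this module only exists inside Colab
    except ImportError:
        print(f"Not running in Colab; copy {path} to your machine manually.")
        return False
    files.download(path)
    return True


if os.path.exists(REPORT):
    download_report()
```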

<img src="Images/download.png" alt="Download Report" width=600>

Then install Nsight Systems on your local machine by following the [Nsight Systems getting-started instructions](https://developer.nvidia.com/nsight-systems/get-started).
Launch Nsight Systems and open the report file that you downloaded.
Your task is to navigate the report and identify:
- when GPU compute is launched
- when the CPU writes data to disk
- when the CPU waits for the GPU
- when data is transferred between the CPU and the GPU

If you’re unsure how to proceed, expand the section below for guidance. Use the hints only after giving the problem a genuine attempt.

<details>
  <summary>Hints</summary>
  
  - Try expanding the "CUDA HW" section to see more detail on what is happening on the GPU
  - Memory transfers between the CPU and the GPU appear under the "CUDA HW / Memory" section
  - I/O activity on the CPU should show up as `writev` and `fclose` calls
</details>

Open this section only after you’ve made a serious attempt at solving the problem. Once you’ve completed your solution, compare it with the reference provided here to evaluate your approach and identify any potential improvements.

<details>
  <summary>Solution</summary>

  The computation is launched from the CPU side:
  ![Compute](Images/compute.png "Compute")

  Data transfer between CPU and GPU can be located in the "CUDA HW / Memory" section:

  ![Copy](Images/copy.png "Copy")

  CPU writes to disk can be found in the "OS runtime libraries" section:
  ![Write](Images/write.png "Write")

</details>
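After comparing your findings with the solution, you can cross-check them from the command line, because `nsys stats` prints tables from the same report. This is a minimal sketch that assumes a recent Nsight Systems release in which the reports are named `cuda_gpu_trace` (kernels and memory copies with start times), `cuda_api_trace` (CPU-side CUDA calls such as launches and synchronization), and `osrt_sum` (OS runtime calls). Older releases used different names, and `nsys stats --help-reports` lists the ones your version supports. The `STATS_REPORTS` mapping and the `stats_command` helper are illustrative names.

```python
import shutil
import subprocess

REPORT = "compute-io-overlap.nsys-rep"

# Report names used by recent Nsight Systems releases (an assumption; older
# releases named them differently, see `nsys stats --help-reports`).
STATS_REPORTS = {
    "GPU kernels and CPU<->GPU copies, with start times": "cuda_gpu_trace",
    "CPU-side CUDA calls (kernel launches, synchronization)": "cuda_api_trace",
    "OS runtime calls (file I/O such as writev)": "osrt_sum",
}


def stats_command(report_name, rep_file=REPORT):
    """Build an `nsys stats` command that prints one report table."""
    return ["nsys", "stats", "--report", report_name, rep_file]


if shutil.which("nsys"):
    for label, name in STATS_REPORTS.items():
        print(f"== {label} ==")
        subprocess.run(stats_command(name), check=False)
else:
    print("nsys is not on PATH; skipping the command-line summary")
```

The tables complement the timeline view rather than replace it. Sorting `cuda_gpu_trace` rows by start time gives the same ordering of kernels and copies that you see in the "CUDA HW" rows.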

---
Congratulations! Proceed to the [next exercise](02.02.04-Exercise-NVTX.ipynb).