<h1 style="color:#65AE11;">Exercise: Apply Copy/Compute Overlap</h1>

In this section you will perform copy/compute overlap in the cipher application.

<h2 style="color:#65AE11;">Objectives</h2>

By the time you complete this section you will:

* Be able to perform copy/compute overlap using CUDA Streams in a CUDA C++ application
* Observe copy/compute overlap in the Nsight Systems timeline

<h2 style="color:#65AE11;">Exercise Instructions</h2>

Apply the techniques from the previous sections to perform copy/compute overlap in [streams.cu](streams_cipher/streams.cu).

Use the terminal to run `make streams` to compile the program, and then `./streams` to run it. The program prints timing information and reports whether the result is correct. See the [Makefile](streams_cipher/Makefile) for details.

After a successful refactor, adjust the number of streams (and therefore the size of memory chunks) and rerun to try to find the optimal number of streams.
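
As a worked illustration of how the stream count determines the chunk size, here is the round-up division with made-up numbers (not the exercise's actual sizes). The symbols $N$ (number of entries), $S$ (number of streams), and $C$ (chunk size) are introduced here for illustration only, and the identity shown is the standard one for round-up integer division, not necessarily how the `sdiv` helper is written:

$$C = \left\lceil \frac{N}{S} \right\rceil = \left\lfloor \frac{N + S - 1}{S} \right\rfloor$$

For $N = 1000$ and $S = 3$ this gives $C = \lfloor 1002 / 3 \rfloor = 334$. The first two streams each handle 334 entries and the last stream handles the remaining $1000 - 2 \cdot 334 = 332$, so the final chunk must be clamped to the end of the data. Truncating division would give $C = 333$ and leave one entry unprocessed.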

**As a goal, try to get the total time on the GPU (including memory transfers) below 100ms, or even below 75ms.**

Use the terminal to run `make profile` to generate a report file named `streams-report.qdrep`, which you can open in Nsight Systems (newer Nsight Systems releases use the `.nsys-rep` extension instead). See the [Makefile](streams_cipher/Makefile) for details.

The following screenshot shows a profiler view in which almost all host-to-device memory transfers (green) and device-to-host memory transfers (violet) overlap with GPU compute (blue):

![streams solution](images/streams_solution.png)

<h2 style="color:#65AE11;">Exercise Hints</h2>

If you would like, use the following hints to guide your work:

* All your work should be within the `main` function
* Define a number of streams
* Create the number of streams you defined and store them in an array
* As you work, edit the use of the timer instances, including their message strings, to reflect changes you make to the application
* Using the number of entries and the number of streams, define a chunk size for each stream's work. Remember to use the round-up division helper function `sdiv` for the reasons discussed in the previous section
* For each stream you have created:
  * Create indices for it to correctly access its chunk of data from within the global data
  * Asynchronously copy its chunk of data to the device
  * Perform the `decryptGPU` computations for its chunk of data
  * Asynchronously copy its chunk of data back to the host
  * Synchronize each stream before continuing on to check results on the CPU
* `make clean` will delete all binaries and report files
* You can edit the [*Makefile*](streams_cipher/Makefile) as you wish, for example, to change the name of generated binaries or report files. You can of course also enter the commands found in the *Makefile* directly into the terminal
* If you have time, play around with different numbers of streams, aiming to reduce the total time the application spends on the GPU
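
The hints above can be sketched in a small standalone program. This is a minimal sketch under stated assumptions, not the course solution: `toy_kernel` is a hypothetical stand-in for `decryptGPU`, `sdiv` is redefined locally so the example is self-contained, and the data size, stream count, and launch configuration are arbitrary illustrative values. Error checking is omitted for brevity.

```cuda
#include <cstdint>
#include <cstdio>
#include <cuda_runtime.h>

// Round-up integer division, redefined here so the sketch is self-contained
// (assumed to behave like the course's `sdiv` helper).
inline uint64_t sdiv(uint64_t a, uint64_t b) { return (a + b - 1) / b; }

// Hypothetical stand-in for `decryptGPU`: adds 1 to every entry of its chunk.
__global__ void toy_kernel(uint64_t *data, uint64_t n) {
    uint64_t i = blockIdx.x * (uint64_t)blockDim.x + threadIdx.x;
    const uint64_t stride = (uint64_t)blockDim.x * gridDim.x;
    for (; i < n; i += stride) data[i] += 1;  // grid-stride loop
}

int main() {
    const uint64_t num_entries = uint64_t(1) << 20;  // illustrative size
    const int num_streams = 8;                       // illustrative count
    const uint64_t chunk_size = sdiv(num_entries, num_streams);

    // Pinned host memory lets cudaMemcpyAsync overlap with kernel execution.
    uint64_t *data_cpu = nullptr, *data_gpu = nullptr;
    cudaMallocHost(&data_cpu, sizeof(uint64_t) * num_entries);
    cudaMalloc(&data_gpu, sizeof(uint64_t) * num_entries);
    for (uint64_t i = 0; i < num_entries; i++) data_cpu[i] = i;

    // Create the streams and store them in an array.
    cudaStream_t streams[num_streams];
    for (int s = 0; s < num_streams; s++) cudaStreamCreate(&streams[s]);

    for (int s = 0; s < num_streams; s++) {
        // Indices of this stream's chunk, clamped to the end of the data.
        const uint64_t lower = chunk_size * s;
        if (lower >= num_entries) continue;  // nothing left for this stream
        const uint64_t upper =
            (lower + chunk_size < num_entries) ? lower + chunk_size : num_entries;
        const uint64_t width = upper - lower;

        // Copy in, compute, and copy out, all queued in the same non-default stream.
        cudaMemcpyAsync(data_gpu + lower, data_cpu + lower,
                        sizeof(uint64_t) * width, cudaMemcpyHostToDevice, streams[s]);
        toy_kernel<<<80, 64, 0, streams[s]>>>(data_gpu + lower, width);
        cudaMemcpyAsync(data_cpu + lower, data_gpu + lower,
                        sizeof(uint64_t) * width, cudaMemcpyDeviceToHost, streams[s]);
    }

    // Wait for every stream to finish before touching the results on the CPU.
    for (int s = 0; s < num_streams; s++) {
        cudaStreamSynchronize(streams[s]);
        cudaStreamDestroy(streams[s]);
    }

    // Check results on the host: every entry should have been incremented once.
    bool ok = true;
    for (uint64_t i = 0; i < num_entries; i++) ok = ok && (data_cpu[i] == i + 1);
    printf("STATUS: test %s\n", ok ? "passed" : "failed");

    cudaFreeHost(data_cpu);
    cudaFree(data_gpu);
    return ok ? 0 : 1;
}
```

Because the operations in different streams have no ordering relative to one another, the device is free to run one stream's transfer while another stream's kernel executes. Whether it actually does depends on the hardware (for example, the number of copy engines) and on the chunk size, which is why experimenting with the stream count is worthwhile.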


# My version


%%bash
# Print the current directory to verify the starting point
pwd

cd kernel_streams_cipher


module load cuda/12.6

make clean
make profile

./streams

/gpfs/home/scortinhal/CHPS0904/task/09_Exercise_Apply_Streams
rm -f streams streams_solution *.qdrep *.sqlite
nvcc -arch=sm_70 -O3 -Xcompiler="-march=native -fopenmp" streams.cu -o streams
nsys profile --stats=true --force-overwrite=true -o streams-report ./streams
concurrentKernels: 1, asyncEngineCount: 3
TIMING: 33.4562 ms (total time on GPU)
STATUS: test passed
TIMING: 15613 ms (Temps total)
Collecting data...
Generating '/tmp/nsys-report-118e.qdstrm'


SKIPPED: /gpfs/home/scortinhal/CHPS0904/task/09_Exercise_Apply_Streams/kernel_streams_cipher/streams-report.sqlite does not contain NV Tools Extension (NVTX) data.


[3/8] Executing 'nvtx_sum' stats report
[4/8] Executing 'osrt_sum' stats report

 Time (%)  Total Time (ns)  Num Calls   Avg (ns)     Med (ns)    Min (ns)  Max (ns)   StdDev (ns)           Name         
 --------  ---------------  ---------  -----------  -----------  --------  ---------  -----------  ----------------------
     57.0      15892758464        168   94599752.8  100115424.0      1280  158675584   23370593.1  poll                  
     41.9      11680803008         63  185409571.6    4501504.0     29344  780538272  314659151.0  futex                 
      1.1        305753888        893     342389.6      11520.0      1024   81645408    3058807.4  ioctl                 
      0.0          6764032         17     397884.2     353344.0    333056     834176     124566.7  pthread_create        
      0.0          5888544         38     154961.7      10144.0      2304    5464736     884652.7  mmap                  
      0.0           760064          3     253354.7      14080.0  

<h2 style="color:#65AE11;">Exercise Solution</h2>

After you complete your work, or if you get stuck, refer to [the solution](streams_cipher/streams_solution.cu). If you wish, you can compile the solution with `make streams_solution`, and/or generate a report file for viewing in Nsight Systems with `make profile_solution`.

<h2 style="color:#65AE11;">Next</h2>

Now that you have demonstrated the ability to perform copy/compute overlap, the next few sections turn to using multiple GPUs on the same node. At the end of the course, you will combine the use of multiple GPUs with copy/compute overlap.

Please continue to the next section: [*Multiple GPUs*](../10_Multiple_GPUs/Multiple_GPUs.ipynb).