<h1 style="color:#65AE11;">Exercise: Copy/Compute Overlap with Multiple GPUs</h1>

In this section you will refactor the baseline cipher application to perform copy/compute overlap while utilizing multiple GPUs.

<h2 style="color:#65AE11;">Objectives</h2>

By the time you complete this section you will:

* Be able to perform copy/compute overlap on multiple GPUs
* Observe copy/compute overlap on multiple GPUs in the Nsight Systems timeline

<h2 style="color:#65AE11;">Exercise Instructions</h2>

Apply the techniques from the previous section to perform copy/compute overlap on multiple GPUs in [mgpu_stream.cu](mgpu_stream_cipher/mgpu_stream.cu).

In the terminal, run `make mgpu_stream` to compile the program, then `./mgpu_stream` to run it. The program prints its timings and the result of a correctness check. See the [Makefile](mgpu_stream_cipher/Makefile) for details.

**As a goal, try to get the total time spent on the GPUs (including memory transfers) below 30 ms.**

In the terminal, run `make profile` to generate a report file named `mgpu-stream-report.qdrep`, which you can open in Nsight Systems. See the [Makefile](mgpu_stream_cipher/Makefile) for details.

The following screenshot shows the application performing copy/compute overlap with multiple GPUs:

![multiple gpu copy/compute](images/mgpu_copy_compute.png)

<h2 style="color:#65AE11;">Exercise Hints</h2>

If you would like, use the following hints to guide your work:

* All your work should be within the `main` function
* As you work, edit the use of the timer instances, including their message strings, to reflect changes you make to the application
* Create variables that define each GPU's chunk of data, and each stream's chunk within each GPU's chunk
* Create and store all streams in a 2D array, with each row containing one GPU's streams
* Store pointers for each GPU's memory in an array
* Using robust indexing techniques, allocate a GPU's chunk of data for each GPU
* For each stream, on each GPU, perform async HtoD transfer, kernel launch, and async DtoH transfer, synchronizing streams as needed
* `make clean` will delete all binaries and report files
* You can edit the [*Makefile*](mgpu_stream_cipher/Makefile) as you wish, for example to change the names of the generated binaries or report files. You can also enter the commands found in the *Makefile* directly into the terminal
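
The hints about per-GPU and per-stream chunks and "robust indexing" can be sketched in host-side C++ as follows. This is only one possible approach under stated assumptions, not the workshop's solution: the names `ChunkRange`, `ceil_div`, and `chunk_range` are hypothetical, and the data is assumed to be one contiguous array of `num_entries` elements that is split first across GPUs and then across each GPU's streams.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical helper type: the half-open range [lower, lower + width) of
// entries that one stream on one GPU is responsible for.
struct ChunkRange {
    uint64_t lower;  // index of the first entry in the global array
    uint64_t width;  // number of entries (0 if the range falls past the end)
};

// Round-up integer division, so that no trailing entries are dropped when
// the sizes do not divide evenly.
inline uint64_t ceil_div(uint64_t x, uint64_t y) { return (x + y - 1) / y; }

// Compute the slice handled by the pair (gpu, stream).
// All parameter names are illustrative assumptions.
inline ChunkRange chunk_range(uint64_t num_entries, uint64_t num_gpus,
                              uint64_t num_streams, uint64_t gpu,
                              uint64_t stream) {
    assert(num_gpus > 0 && num_streams > 0);
    assert(gpu < num_gpus && stream < num_streams);

    // Size of one stream's chunk, rounded up at both levels of the split.
    const uint64_t stream_chunk_size =
        ceil_div(ceil_div(num_entries, num_gpus), num_streams);
    // One GPU's chunk is the sum of its streams' chunks.
    const uint64_t gpu_chunk_size = stream_chunk_size * num_streams;

    // Clamp both ends to num_entries: a range that starts past the end of
    // the data gets width 0 instead of underflowing the unsigned subtraction.
    const uint64_t lower = std::min(
        gpu_chunk_size * gpu + stream_chunk_size * stream, num_entries);
    const uint64_t upper = std::min(lower + stream_chunk_size, num_entries);
    return {lower, upper - lower};
}
```

With a helper like this, each (GPU, stream) pair would use `lower` as its offset into the host array and `width` as its element count for the asynchronous copies and the kernel launch, skipping any pair whose `width` is 0. For example, 10 entries split over 2 GPUs with 2 streams each yields widths 3, 3, 3, and 1.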

# My version

In [3]:
%%bash
# Print the current directory to check the starting point
pwd

cd kernel_streams_cipher

module load cuda/12.6

make profile

./mgpu_streams

/gpfs/home/scortinhal/CHPS0904/task/13_Exercise_MGPU_Streams
nvcc -arch=sm_70 -O3 -Xcompiler="-march=native -fopenmp" mgpu_streams.cu -o mgpu_streams
nsys profile --stats=true --force-overwrite=true -o mgpu_streams-report ./mgpu_streams
GPUs available: 2
TIMING: 19.5001 ms (H2D + compute + D2H time)
STATUS: PASSED
TIMING: 15769.3 ms (total time)
Collecting data...
Generating '/tmp/nsys-report-823b.qdstrm'


SKIPPED: /gpfs/home/scortinhal/CHPS0904/task/13_Exercise_MGPU_Streams/kernel_streams_cipher/mgpu_streams-report.sqlite does not contain NV Tools Extension (NVTX) data.


[3/8] Executing 'nvtx_sum' stats report
[4/8] Executing 'osrt_sum' stats report

 Time (%)  Total Time (ns)  Num Calls   Avg (ns)     Med (ns)    Min (ns)  Max (ns)   StdDev (ns)           Name         
 --------  ---------------  ---------  -----------  -----------  --------  ---------  -----------  ----------------------
     52.6      16181200032        182   88907692.5  100111904.0      1568  161862336   30942778.5  poll                  
     45.7      14038515040         63  222833572.1    4431680.0     34304  681210048  265426929.1  futex                 
      1.3        414149312       1466     282502.9      12720.0      1024   81389536    2446659.9  ioctl                 
      0.3         93550976        129     725201.4       5248.0      1024   92849472    8174457.9  open64                
      0.0          6701440         18     372302.2     336496.0    321728     692544      92742.0  pthread_create        
      0.0          5026688         67      75025.2       9952.0  

<h2 style="color:#65AE11;">Exercise Solution</h2>

After you complete your work, or if you get stuck, refer to [the solution](mgpu_stream_cipher/mgpu_stream_solution.cu). If you wish, you can compile the solution with `make mgpu_stream_solution`, and/or generate a report file for viewing in Nsight Systems with `make profile_solution`.

<h2 style="color:#65AE11;">Next</h2>

Congratulations on successfully refactoring and accelerating the cipher application. Next, you will briefly review everything you learned in this workshop, and you will be asked to take the course survey before attempting the workshop assessment.

Please continue to the next section: [*Workshop Overview*](../14_Overview/Overview.ipynb).