<h1 style="color:#65AE11;">Exercise: Use Multiple GPUs</h1>

In this section you will refactor the baseline cipher application to utilize multiple GPUs.

*Please note, you will be working with the baseline cipher application that **does not use multiple non-default streams**. For the sake of learning you will be focusing on multiple GPU usage in this section, before combining multiple GPUs with multiple non-default streams in the next section.*

<h2 style="color:#65AE11;">Objectives</h2>

By the time you complete this section you will:

* Be able to utilize multiple GPUs in a CUDA C++ application
* Observe multiple GPU usage in the Nsight Systems timeline

<h2 style="color:#65AE11;">Exercise Instructions</h2>

Apply the techniques from the previous section to utilize multiple GPUs in [mgpu.cu](mgpu_cipher/mgpu.cu).

Use the terminal to run `make mgpu` to compile the program, and then `./mgpu` to run it. The program prints timing output and the result of a correctness check. See the [Makefile](mgpu_cipher/Makefile) for details.

**As a goal, try to get the amount of time spent decrypting on the GPUs (not including memory transfers) below 20 ms.**

Use the terminal to run `make profile` to generate a report file that will be named `mgpu-report.qdrep`, and which you can open in Nsight Systems. See the [Makefile](mgpu_cipher/Makefile) for details.

The following screenshot shows the application utilizing multiple GPUs:

![multiple gpus](images/multiple_gpus.png)

<h2 style="color:#65AE11;">Exercise Hints</h2>

If you would like, expand the following hints to guide your work:

* All your work should be within the `main` function
* Store the number of GPUs available in a variable for later use
* Using the number of entries and the number of GPUs, define a chunk size for each GPU's work. Remember to use the round-up division helper function `sdiv` for the reasons discussed in a previous section
* Create an array that contains pointers for the memory that will be allocated on each GPU
* Allocate a chunk's worth of data for each GPU
* Copy the correct chunk of data to each GPU
* For each GPU, decrypt its chunk of data
* Copy each GPU's chunk of data back to the host
* You may wish to edit the use of the timer instances, including their message strings, to reflect changes you make to the application
* `make clean` will delete all binaries and report files
* You can edit the [*Makefile*](mgpu_cipher/Makefile) as you wish, for example, to change the name of generated binaries or report files. You can of course also enter the commands found in the *Makefile* directly into the terminal
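
The hints above can be sketched as a small standalone program. This is a minimal illustration under stated assumptions, not the course solution: `process_chunk` is a hypothetical stand-in for the decryption kernel (it only increments each element), the `sdiv` shown here is an assumed equivalent of the course helper, and the `CHECK` macro, the launch configuration, and the data size are all illustrative choices.

```cuda
#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call fails (illustrative helper).
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            std::fprintf(stderr, "CUDA error: %s (line %d)\n",        \
                         cudaGetErrorString(err_), __LINE__);         \
            std::exit(EXIT_FAILURE);                                  \
        }                                                             \
    } while (0)

// Round-up integer division; assumed equivalent of the course's `sdiv`.
inline uint64_t sdiv(uint64_t a, uint64_t b) { return (a + b - 1) / b; }

// Hypothetical stand-in for the decryption kernel: a grid-stride loop
// that increments every element of the chunk.
__global__ void process_chunk(uint64_t* data, uint64_t n) {
    const uint64_t tid = blockIdx.x * (uint64_t)blockDim.x + threadIdx.x;
    const uint64_t stride = (uint64_t)blockDim.x * gridDim.x;
    for (uint64_t i = tid; i < n; i += stride) data[i] += 1;
}

int main() {
    const uint64_t num_entries = 1ULL << 20;
    std::vector<uint64_t> data_cpu(num_entries, 41);

    // Store the number of available GPUs for later use.
    int num_gpus = 0;
    CHECK(cudaGetDeviceCount(&num_gpus));

    // Round up so that num_gpus chunks always cover every entry.
    const uint64_t chunk_size = sdiv(num_entries, num_gpus);

    // The last chunk may be shorter than chunk_size, so clamp its bounds.
    auto lower_of = [&](int gpu) {
        return std::min<uint64_t>(chunk_size * gpu, num_entries);
    };
    auto width_of = [&](int gpu) {
        const uint64_t lower = lower_of(gpu);
        return std::min<uint64_t>(lower + chunk_size, num_entries) - lower;
    };

    // One device pointer per GPU, each holding a chunk's worth of data.
    std::vector<uint64_t*> data_gpu(num_gpus, nullptr);
    for (int gpu = 0; gpu < num_gpus; gpu++) {
        CHECK(cudaSetDevice(gpu));
        CHECK(cudaMalloc(&data_gpu[gpu], sizeof(uint64_t) * chunk_size));
    }

    // Copy the correct chunk to each GPU (blocking copies).
    for (int gpu = 0; gpu < num_gpus; gpu++) {
        CHECK(cudaSetDevice(gpu));
        CHECK(cudaMemcpy(data_gpu[gpu], data_cpu.data() + lower_of(gpu),
                         sizeof(uint64_t) * width_of(gpu),
                         cudaMemcpyHostToDevice));
    }

    // Launch one kernel per GPU. Launches return immediately on the host,
    // so the GPUs can work on their chunks at the same time.
    for (int gpu = 0; gpu < num_gpus; gpu++) {
        CHECK(cudaSetDevice(gpu));
        process_chunk<<<80, 64>>>(data_gpu[gpu], width_of(gpu));
        CHECK(cudaGetLastError());
    }

    // Wait for every GPU; a kernel-only timer would stop after this loop.
    for (int gpu = 0; gpu < num_gpus; gpu++) {
        CHECK(cudaSetDevice(gpu));
        CHECK(cudaDeviceSynchronize());
    }

    // Copy each GPU's chunk back to the host and release device memory.
    for (int gpu = 0; gpu < num_gpus; gpu++) {
        CHECK(cudaSetDevice(gpu));
        CHECK(cudaMemcpy(data_cpu.data() + lower_of(gpu), data_gpu[gpu],
                         sizeof(uint64_t) * width_of(gpu),
                         cudaMemcpyDeviceToHost));
        CHECK(cudaFree(data_gpu[gpu]));
    }

    const bool ok = std::all_of(data_cpu.begin(), data_cpu.end(),
                                [](uint64_t v) { return v == 42; });
    std::printf("STATUS: %s\n", ok ? "test passed" : "test failed");
    return ok ? 0 : 1;
}
```

Each step gets its own loop over the GPUs so that all kernels are launched before the host waits on any of them. Clamping the chunk bounds keeps the final, possibly shorter, chunk from reading or writing past the end of the host buffer.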

# My version

%%bash
# Print the current directory to verify the starting point
pwd

cd kernel_streams_cipher

module load cuda/12.6

make profile

./mgpu

/gpfs/home/scortinhal/CHPS0904/task/11_Exercise_MGPU
nvcc -arch=sm_70 -O3 -Xcompiler="-march=native -fopenmp" mgpu.cu -o mgpu
nsys profile --stats=true --force-overwrite=true -o mgpu-report ./mgpu
Number of GPUs available: 2
TIMING: 22.822 ms (Kernel decryption time on GPUs)
STATUS: test passed
TIMING: 15909.4 ms (Total time)
Collecting data...
Generating '/tmp/nsys-report-dc92.qdstrm'


SKIPPED: /gpfs/home/scortinhal/CHPS0904/task/11_Exercise_MGPU/kernel_streams_cipher/mgpu-report.sqlite does not contain NV Tools Extension (NVTX) data.


[3/8] Executing 'nvtx_sum' stats report
[4/8] Executing 'osrt_sum' stats report

 Time (%)  Total Time (ns)  Num Calls   Avg (ns)     Med (ns)    Min (ns)  Max (ns)   StdDev (ns)           Name         
 --------  ---------------  ---------  -----------  -----------  --------  ---------  -----------  ----------------------
     50.0      16477826688        184   89553405.9  100114208.0      1120  157531232   29777802.7  poll                  
     48.3      15914324096         63  252608319.0    5261568.0     32320  774758496  302117346.2  futex                 
      1.2        407368800       1408     289324.4      13600.0      1024   82185696    2487670.6  ioctl                 
      0.5        151834400        129    1177010.9       5312.0      1024  151123296   13305175.3  open64                
      0.0          6954368         18     386353.8     347168.0    340256     775712     105646.2  pthread_create        
      0.0          6064736         67      90518.4      10784.0  

<h2 style="color:#65AE11;">Exercise Solution</h2>

After you complete your work, or if you get stuck, refer to [the solution](mgpu_cipher/mgpu_solution.cu). If you wish, you can compile the solution with `make mgpu_solution`, and/or generate a report file for viewing in Nsight Systems with `make profile_solution`.

<h2 style="color:#65AE11;">Check for Understanding</h2>

Please answer the following to confirm you've learned the main objectives of this section. You can display the answers for each question by clicking on the "..." cells below the questions.

---

**In the visual profiler, we can see overlapping kernel execution. Why is this so?**

**Answer:**

We are using multiple GPUs to execute chunks of the work required by our application, all of which can perform work at the same time.

---

**In the visual profiler image of the solution code, above, we can see that there is no overlap of memory transfers. Why is this so?**

**Answer:**

The solution code uses neither non-default streams nor `cudaMemcpyAsync` for memory copies. The copies are therefore blocking operations.

---

<h2 style="color:#65AE11;">Next</h2>

You now know how to perform copy/compute overlap and how to perform work on multiple GPUs. In the next section you will learn about streams on multiple GPUs, and how to perform copy/compute overlap on multiple GPUs.

Please continue to the next section: [*MGPU Streams*](../12_MGPU_Streams/MGPU_Streams.ipynb).