# Jacobi Solver: Multi-GPU Programming Models
This notebook presents 12 progressive versions of a 2D Jacobi solver. Each section explains the model or optimization, builds the version, runs it, and collects Nsight metrics.

## Spack modules to load

Before building or running the steps, load the required modules with Spack, preferably before launching VS Code or Jupyter. For example:

```bash
spack load nvhpc@24.11
spack load cuda@12.6
# Locations of the communication libraries shipped with the NVIDIA HPC SDK
export NVSHMEM_HOME=/apps/2025/manual_install/nvhpc/24.11/Linux_aarch64/24.11/comm_libs/12.6/nvshmem
export NCCL_HOME=/apps/2025/manual_install/nvhpc/24.11/Linux_aarch64/24.11/comm_libs/12.6/nccl
# Make the NCCL and NVSHMEM shared libraries visible at run time
export LD_LIBRARY_PATH=$NCCL_HOME/lib:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=$NVSHMEM_HOME/lib:$LD_LIBRARY_PATH
unset OPAL_PREFIX
unset PMIX_INSTALL_PREFIX
```

Adapt the version of each module to your cluster's configuration.

## etape1_cpu
**Description:** Baseline CPU Jacobi solver: a single-threaded implementation. Useful for validating correctness and for small problem sizes; it also exposes the CPU compute limit.

**Purpose:** Baseline that measures CPU performance to establish a reference.
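
To make the baseline concrete, here is a minimal sketch of a Jacobi sweep in Python. It is an illustration only, not the contents of `main.c`: the names `jacobi_sweep` and `solve`, the 6×6 grid, and the boundary values are assumptions chosen for the example.

```python
def jacobi_sweep(a):
    """Return (a_new, l2_norm) after one Jacobi sweep over the interior of grid `a`."""
    ny, nx = len(a), len(a[0])
    a_new = [row[:] for row in a]  # boundary values are copied unchanged
    sq_sum = 0.0
    for j in range(1, ny - 1):
        for i in range(1, nx - 1):
            # Each interior point becomes the average of its four neighbors.
            a_new[j][i] = 0.25 * (a[j][i - 1] + a[j][i + 1] + a[j - 1][i] + a[j + 1][i])
            diff = a_new[j][i] - a[j][i]
            sq_sum += diff * diff
    return a_new, sq_sum ** 0.5


def solve(a, n_iter):
    """Apply n_iter sweeps and return the final grid with the last update norm."""
    norm = 0.0
    for _ in range(n_iter):
        a, norm = jacobi_sweep(a)
    return a, norm


# Hypothetical test grid: left column held at 1.0, everything else 0.0.
grid = [[1.0 if i == 0 else 0.0 for i in range(6)] for _ in range(6)]
grid, norm = solve(grid, 50)
print(f"update norm after 50 sweeps: {norm:.6f}")
```

Every later step keeps this update rule and changes only where the loop runs and how boundary data moves between workers.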

In [1]:
%%bash
cd etape1_cpu
make clean all

rm -f main *.o main.nsys-rep main.sqlite main.AnalysisSummary.html main.DiagnosticsSummary.html
gcc -O2 -o main main.c


In [2]:
%%bash
cd etape1_cpu
./main

Terminé etape1_cpu
CPU time: 15.613598 seconds


## etape2_cpu_gpu
**Description:** 1 CPU + 1 GPU + 1 stream: the CPU drives the GPU through a single CUDA stream. The Jacobi computation is fully offloaded to the GPU; the CPU only orchestrates.

**Purpose:** Highlights the CPU vs GPU performance gap when the grid is large enough, in a realistic single-GPU, single-stream setting.

### Build and run (1 CPU + 1 GPU + 1 stream)

In [3]:
%%bash
cd etape2_cpu_gpu
make clean all

rm -f app *.o main.nsys-rep main.sqlite main.AnalysisSummary.html main.DiagnosticsSummary.html
nvcc -O2 -o app main.cu kernel.cu


In [4]:
%%bash
cd etape2_cpu_gpu
nsys profile -t cuda --stats=true --force-overwrite true -o main ./app

Terminé etape2_cpu_gpu (1CPU + 1GPU + 1stream)
GPU time: 1.560823 seconds
Collecting data...
Generating '/tmp/nsys-report-73ea.qdstrm'
[3/6] Executing 'cuda_api_sum' stats report

 Time (%)  Total Time (ns)  Num Calls   Avg (ns)     Med (ns)    Min (ns)   Max (ns)   StdDev (ns)           Name         
 --------  ---------------  ---------  -----------  -----------  ---------  ---------  -----------  ----------------------
     71.2        706806976          3  235602325.3  350057600.0    2193280  354556096  202150676.4  cudaMemcpyAsync       
     16.2        160747968          1  160747968.0  160747968.0  160747968  160747968          0.0  cudaStreamCreate      
      9.9         97992864       1001      97895.0      97984.0       1696     104160       3114.6  cudaStreamSynchronize 
      2.6         25991168       1000      25991.2       2784.0       2080   23075072     729605.7  cudaLaunchKernel      
      0.1           665184          2     332592.0     332592.0     315008     350

## etape3_mpi_gpus
**Description:** MPI + GPUs: the domain is distributed across several MPI ranks, each driving one GPU. Illustrates the challenges of multi-node scaling and inter-rank communication.

**Purpose:** Tests inter-node scaling and the cost of MPI on a multi-GPU cluster.

**Steps:**
- Initialize MPI and determine the rank and the number of processes.
- Bind each rank to a different GPU.
- Split the grid across ranks (1D vertical decomposition).
- Exchange halos between neighboring ranks with MPI_Sendrecv.
- Synchronize the exchanges at every iteration.
- Finalize MPI at the end.
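
The decomposition step can be sketched in plain Python, without MPI. The helper names `decompose_rows` and `neighbors` are hypothetical, and two points are assumptions: remainder rows go to the lowest ranks, and the domain is not periodic. With those choices the sketch reproduces the `ny_local`/`offset` layout printed by the etape4_mpi_overlap run later in this notebook; whether etape3 uses exactly the same scheme is not confirmed by the source.

```python
def decompose_rows(ny, size):
    """Split the ny - 2 interior rows across `size` ranks; low ranks take the remainder."""
    interior = ny - 2  # rows 0 and ny - 1 are fixed boundary rows
    base, rem = divmod(interior, size)
    layout = []
    offset = 1  # global index of the first interior row
    for rank in range(size):
        ny_local = base + (1 if rank < rem else 0)
        layout.append((ny_local, offset))
        offset += ny_local
    return layout


def neighbors(rank, size):
    """Ranks owning the rows just above and below; None at the domain edges (assumed non-periodic)."""
    top = rank - 1 if rank > 0 else None
    bottom = rank + 1 if rank < size - 1 else None
    return top, bottom


for rank, (ny_local, offset) in enumerate(decompose_rows(4096, 4)):
    print(f"rank {rank}: ny_local={ny_local} offset={offset} neighbors={neighbors(rank, 4)}")
```

Each rank then exchanges one halo row with every neighbor that is not `None` before it can update its own first and last rows.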

In [5]:
%%bash
cd etape3_mpi_gpus
make clean all

rm -f app *.o main.nsys-rep main.sqlite main.AnalysisSummary.html main.DiagnosticsSummary.html main.qdstrm
/apps/2025/manual_install/nvhpc/24.11/Linux_aarch64/24.11/cuda/12.6/bin/nvcc -O3 -std=c++14 -lcudart -Xcompiler "-fopenmp" -I/apps/2025/manual_install/nvhpc/24.11/Linux_aarch64/24.11/cuda/12.6/include -I/apps/2025/manual_install/nvhpc/24.11/Linux_aarch64/24.11/cuda/12.6/include -I/apps/2025/manual_install/nvhpc/24.11/Linux_aarch64/24.11/comm_libs/12.6/hpcx/hpcx-2.20/ompi/include -I/apps/2025/manual_install/nvhpc/24.11/Linux_aarch64/24.11/comm_libs/12.6/nccl/include main.cpp kernel.cu -o app -L/apps/2025/manual_install/nvhpc/24.11/Linux_aarch64/24.11/cuda/12.6/lib64 -L/apps/2025/manual_install/nvhpc/24.11/Linux_aarch64/24.11/comm_libs/12.6/hpcx/hpcx-2.20/ompi/lib -L/apps/2025/manual_install/nvhpc/24.11/Linux_aarch64/24.11/comm_libs/12.6/nccl/lib -lmpi -lnccl -lstdc++ \
	-Xlinker --no-as-needed


In [6]:
%%bash
cd etape3_mpi_gpus
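# nsys does not always work reliably under mpirun. All ranks write to the same report name here,
# which likely explains the corrupted-file errors below; a per-rank name such as -o main.%q{OMPI_COMM_WORLD_RANK} would avoid the clash.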
mpirun -np 4 nsys profile -t mpi,cuda --stats=true --force-overwrite true -o main ./app 1000 4096 4096 1
#mpirun -np 4 ./app 1000 4096 4096 1

Collecting data...
Collecting data...
Collecting data...
Collecting data...
Done: 1000 iters in 4.4008s, norm=1.68569
Generating '/tmp/nsys-report-6e4e.qdstrm'
Generating '/tmp/nsys-report-6595.qdstrm'
Generating '/tmp/nsys-report-a55a.qdstrm'
Generating '/tmp/nsys-report-1d6c.qdstrm'
[3/7] Executing 'nvtx_sum' stats report


SKIPPED: No data available.


[4/7] Executing 'cuda_api_sum' stats report

 Time (%)  Total Time (ns)  Num Calls  Avg (ns)  Med (ns)  Min (ns)  Max (ns)  StdDev (ns)           Name         
 --------  ---------------  ---------  --------  --------  --------  --------  -----------  ----------------------
     93.4        150371712       5002   30062.3   10240.0      3072    151936      40507.1  cudaMemcpy            
      6.1          9886048       1000    9886.0    9344.0      6688    430592      13401.8  cudaLaunchKernel      
      0.2           314240          2  157120.0  157120.0     79712    234528     109471.4  cudaFree              
      0.2           314112          2  157056.0  157056.0    119200    194912      53536.5  cudaMalloc            
      0.1            95072        413     230.2      96.0        32      4608        366.2  cuGetProcAddress_v2   
      0.0            46112          2   23056.0   23056.0      4736     41376      25908.4  cudaMemset            
      0.0             4128         

Importer error status: Importation failed.
File is corrupted.


Generated:
    /gpfs/home/colevalet/Cours/CHPS0904_RENDU/MultiGPU Programming Model/etape3_mpi_gpus/main.qdstrm


Importer error status: An unknown error occurred.
Dynamic exception type: boost::filesystem::filesystem_error
std::exception::what: boost::filesystem::file_size: No such file or directory [system:2]: "/gpfs/home/colevalet/Cours/CHPS0904_RENDU/MultiGPU Programming Model/etape3_mpi_gpus/main.nsys-rep"



Generated:
    /gpfs/home/colevalet/Cours/CHPS0904_RENDU/MultiGPU Programming Model/etape3_mpi_gpus/main.qdstrm




### From what matrix size does communication/computation overlap pay off?

Communication/computation overlap generally pays off when:
- The communication time (MPI plus host/device transfers) becomes significant compared with the local compute time.
- The part of the computation that can run during communication (everything except the boundary rows) is large enough to hide the network latency.

For a 2D Jacobi grid, the critical size depends on:
- network bandwidth and latency,
- the number of MPI ranks,
- the speed of CUDA host/device transfers,
- the compute power of the GPU.

**As a rough rule of thumb, overlap starts to pay off for matrices of about 16k×16k to 32k×32k (or larger),** especially when the number of ranks is high (≥4) and communication becomes a real bottleneck. The exact threshold depends on the hardware factors listed above.

**For an 8k×8k matrix,** local computation usually still dominates, so the overhead of overlap (copies, synchronizations) can cancel out the gain.  
**Try 16k×16k or 32k×32k** to see a benefit, especially if you increase the number of MPI ranks (and therefore the share of communication).

**Summary:**  
- < 8k×8k: overlap rarely useful  
- 16k×16k: starts to be worthwhile  
- 32k×32k and above: overlap often pays off, especially on multi-node clusters
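
The trade-off can be expressed as a small timing model. This is a simplified sketch under stated assumptions, not a measurement: boundary rows are computed first, the halo exchange then runs concurrently with the interior update, and overlap adds a fixed per-iteration overhead. The function names and the microsecond values are hypothetical.

```python
def time_without_overlap(t_interior, t_boundary, t_comm):
    """Everything runs back to back: full compute, then the halo exchange."""
    return t_interior + t_boundary + t_comm


def time_with_overlap(t_interior, t_boundary, t_comm, t_overhead):
    """Boundary first, then the exchange runs concurrently with the interior update."""
    return t_boundary + max(t_interior, t_comm) + t_overhead


def overlap_gain(t_interior, t_boundary, t_comm, t_overhead):
    """Positive when overlap pays off, negative when its overhead dominates."""
    return (time_without_overlap(t_interior, t_boundary, t_comm)
            - time_with_overlap(t_interior, t_boundary, t_comm, t_overhead))


# Hypothetical per-iteration timings in microseconds.
print(overlap_gain(t_interior=500, t_boundary=10, t_comm=5, t_overhead=20))
print(overlap_gain(t_interior=500, t_boundary=10, t_comm=200, t_overhead=20))
```

In this model the gain simplifies to `min(t_interior, t_comm) - t_overhead`: overlap only helps once both the communication time and the interior work exceed the overhead it introduces, which matches the two conditions listed above.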

## etape4_mpi_overlap
**Description:** MPI + overlap: non-blocking halo exchanges are overlapped with the local Jacobi computation, which reduces the impact of network latency.

**Purpose:** Hides network latency by overlapping communication with local computation.

In [7]:
%%bash
cd etape4_mpi_overlap
make clean all

rm -f app *.o main.nsys-rep main.sqlite main.AnalysisSummary.html main.DiagnosticsSummary.html main.qdstrm
/apps/2025/manual_install/nvhpc/24.11/Linux_aarch64/24.11/cuda/12.6/bin/nvcc -O3 -std=c++14 -lcudart -Xcompiler "-fopenmp" -I/apps/2025/manual_install/nvhpc/24.11/Linux_aarch64/24.11/cuda/12.6/include -I/apps/2025/manual_install/nvhpc/24.11/Linux_aarch64/24.11/cuda/12.6/include -I/apps/2025/manual_install/nvhpc/24.11/Linux_aarch64/24.11/comm_libs/12.6/hpcx/hpcx-2.20/ompi/include -I/apps/2025/manual_install/nvhpc/24.11/Linux_aarch64/24.11/comm_libs/12.6/nccl/include main.cpp kernel.cu -o app -L/apps/2025/manual_install/nvhpc/24.11/Linux_aarch64/24.11/cuda/12.6/lib64 -L/apps/2025/manual_install/nvhpc/24.11/Linux_aarch64/24.11/comm_libs/12.6/hpcx/hpcx-2.20/ompi/lib -L/apps/2025/manual_install/nvhpc/24.11/Linux_aarch64/24.11/comm_libs/12.6/nccl/lib -lmpi -lnccl -lstdc++ \
	-Xlinker --no-as-needed


In [8]:
%%bash 
cd etape4_mpi_overlap
# nsys does not always work reliably under mpirun
mpirun -np 4 nsys profile -t mpi,cuda --stats=true --force-overwrite true -o main ./app 1000 4096 4096 1
#mpirun -np 4 ./app 1000 4096 4096 1

Collecting data...
Collecting data...
Collecting data...
Collecting data...
nx=4096 ny=4096 size=4
Rang 1: ny_local=1024 offset=1025
Rang 2: ny_local=1023 offset=2049
Rang 3: ny_local=1023 offset=3072
Rang 0: ny_local=1024 offset=1
Overlap: 1000 iters en 3.91523 s, norm=1.96709
Generating '/tmp/nsys-report-41e7.qdstrm'
Generating '/tmp/nsys-report-6f61.qdstrm'
Generating '/tmp/nsys-report-ff06.qdstrm'
Generating '/tmp/nsys-report-05b4.qdstrm'
[3/7] Executing 'nvtx_sum' stats report


SKIPPED: No data available.


 Time (%)  Total Time (ns)  Num Calls  Avg (ns)   Med (ns)   Min (ns)  Max (ns)  StdDev (ns)              Name             
 --------  ---------------  ---------  ---------  ---------  --------  --------  -----------  -----------------------------
     83.5        206500512       1002   206088.3   205312.0    202272    398016       6861.7  cudaMemcpy                   
      5.6         13787808       1000    13787.8    13184.0      8864    433024      13564.0  cudaLaunchKernel             
      4.0          9918560          2  4959280.0  4959280.0     68704   9849856    6916318.9  cuMemGetHandleForAddressRange
      3.3          8204480       1000     8204.5     8000.0      6656     55360       2456.5  cuMemcpyAsync                
      1.6          3936416       2753     1429.9     1376.0       224     28448       1105.8  cuEventQuery                 
      1.1          2686976          1  2686976.0  2686976.0   2686976   2686976          0.0  cuMemHostRegister_v2         
      0.

Importer error status: Importation failed.
File is corrupted.


Generated:
    /gpfs/home/colevalet/Cours/CHPS0904_RENDU/MultiGPU Programming Model/etape4_mpi_overlap/main.qdstrm

 Total (MB)  Count  Avg (MB)  Med (MB)  Min (MB)  Max (MB)  StdDev (MB)           Operation          
 ----------  -----  --------  --------  --------  --------  -----------  ----------------------------
  16777.216   1000    16.777    16.777    16.777    16.777        0.000  [CUDA memcpy Device-to-Host]
     50.004   1002     0.050     0.016     0.016    16.810        0.750  [CUDA memcpy Host-to-Device]
     33.620      2    16.810    16.810    16.810    16.810        0.000  [CUDA memset]               

Generated:
    /gpfs/home/colevalet/Cours/CHPS0904_RENDU/MultiGPU Programming Model/etape4_mpi_overlap/main.nsys-rep
    /gpfs/home/colevalet/Cours/CHPS0904_RENDU/MultiGPU Programming Model/etape4_mpi_overlap/main.sqlite



Importer error status: An unknown error occurred.
Dynamic exception type: boost::filesystem::filesystem_error
std::exception::what: boost::filesystem::file_size: No such file or directory [system:2]: "/gpfs/home/colevalet/Cours/CHPS0904_RENDU/MultiGPU Programming Model/etape4_mpi_overlap/main.nsys-rep"


Generated:
    /gpfs/home/colevalet/Cours/CHPS0904_RENDU/MultiGPU Programming Model/etape4_mpi_overlap/main.qdstrm




## etape5_nccl
**Description:** NCCL: uses the NVIDIA Collective Communications Library for GPU-to-GPU exchanges. Shows the gains obtained over NVLink or high-bandwidth PCIe.

**Purpose:** Automatically exploits the NVLink/PCIe topology for efficient GPU exchanges.

### Introduction to NCCL

NCCL (NVIDIA Collective Communications Library) provides efficient collective communication between multiple GPUs by exploiting the hardware topology (NVLink, PCIe).  
In a multi-GPU Jacobi context, NCCL can be used to exchange halos between GPUs without going back through the CPU.

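A point-to-point halo exchange in NCCL is typically written as paired `ncclSend`/`ncclRecv` calls grouped between `ncclGroupStart` and `ncclGroupEnd`. The cells below build and run the NCCL version of the solver: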

In [9]:
%%bash
cd etape5_nccl
make clean all

rm -f app *.o main.nsys-rep main.sqlite main.AnalysisSummary.html main.DiagnosticsSummary.html main.qdstrm
/apps/2025/manual_install/nvhpc/24.11/Linux_aarch64/24.11/cuda/12.6/bin/nvcc -lineinfo -gencode arch=compute_90,code=sm_90 -gencode arch=compute_90,code=compute_90 -std=c++14 -I/apps/2025/manual_install/nvhpc/24.11/Linux_aarch64/24.11/cuda/12.6/include -I/apps/2025/manual_install/nvhpc/24.11/Linux_aarch64/24.11/comm_libs/12.6/hpcx/hpcx-2.20/ompi/include -I/apps/2025/manual_install/nvhpc/24.11/Linux_aarch64/24.11/comm_libs/12.6/nccl/include kernel.cu -c


/apps/2025/manual_install/nvhpc/24.11/Linux_aarch64/24.11/comm_libs/12.6/hpcx/hpcx-2.20/ompi/bin/mpicxx -DUSE_NVTX -O3 -I/apps/2025/manual_install/nvhpc/24.11/Linux_aarch64/24.11/cuda/12.6/include -I/apps/2025/manual_install/nvhpc/24.11/Linux_aarch64/24.11/comm_libs/12.6/hpcx/hpcx-2.20/ompi/include -I/apps/2025/manual_install/nvhpc/24.11/Linux_aarch64/24.11/comm_libs/12.6/nccl/include -std=c++14 main.cpp kernel.o -L/apps/2025/manual_install/nvhpc/24.11/Linux_aarch64/24.11/cuda/12.6/lib64 -L/apps/2025/manual_install/nvhpc/24.11/Linux_aarch64/24.11/comm_libs/12.6/hpcx/hpcx-2.20/ompi/lib -L/apps/2025/manual_install/nvhpc/24.11/Linux_aarch64/24.11/comm_libs/12.6/nccl/lib -lcudart -ldl -lnccl -o app


main.cpp:


In [10]:
%%bash 
cd etape5_nccl
# nsys does not always work reliably under mpirun
NCCL_DEBUG=WARN mpirun -np 4 nsys profile -t mpi,cuda,nvtx --stats=true --force-overwrite true -o main ./app -niter 1000 -nx 4096 -ny 4096 -nccheck 1
#NCCL_DEBUG=WARN mpirun -np 4 ./app -niter 1000 -nx 4096 -ny 4096 -nccheck 1

Collecting data...
Collecting data...
Collecting data...
Collecting data...
NCCL version 2.19.3+cuda12.3
Single GPU jacobi relaxation: 1000 iterations on 4096 x 4096 mesh with norm check every 1 iterations
    0, 15.998031
  100, 0.448892
  200, 0.267747
  300, 0.197740
  400, 0.159443
  500, 0.134900
  600, 0.117686
  700, 0.104841
  800, 0.094847
  900, 0.086838
Jacobi relaxation: 1000 iterations on 4096 x 4096 mesh with norm check every 1 iterations
    0, 15.998045
  100, 0.448918
  200, 0.267779
  300, 0.197766
  400, 0.159464
  500, 0.134924
  600, 0.117701
  700, 0.104860
  800, 0.094871
  900, 0.086854
Num GPUs: 4.
4096x4096: 1 GPU:  29.9511 s, 4 GPUs:  15.3749 s, speedup:     1.95, efficiency:    48.70 
Generating '/tmp/nsys-report-7ccc.qdstrm'
Generating '/tmp/nsys-report-0d5a.qdstrm'
Generating '/tmp/nsys-report-06e5.qdstrm'
Generating '/tmp/nsys-report-b7b0.qdstrm'


Importer error status: Importation failed.
File is corrupted.


Generated:
    /gpfs/home/colevalet/Cours/CHPS0904_RENDU/MultiGPU Programming Model/etape5_nccl/main.qdstrm


Importer error status: Importation failed.
Import Failed with unexpected exception: /dvs/p4/build/sw/devtools/Agora/Rel/QuadD_Main/QuadD/Common/StreamSections/FileStream.cpp(368): Throw in function void QuadDCommon::FileStream::truncate(std::streamsize)
Dynamic exception type: boost::wrapexcept<QuadDCommon::InvalidArgumentException>
std::exception::what: InvalidArgumentException
[QuadDCommon::tag_message*] = Invalid truncate size.
[QuadDCommon::tag_report_file_name*] = "/gpfs/home/colevalet/Cours/CHPS0904_RENDU/MultiGPU Programming Model/etape5_nccl/main.nsys-rep"


Generated:
    /gpfs/home/colevalet/Cours/CHPS0904_RENDU/MultiGPU Programming Model/etape5_nccl/main.qdstrm



Importer error status: An unknown error occurred.
Dynamic exception type: boost::filesystem::filesystem_error
std::exception::what: boost::filesystem::file_size: No such file or directory [system:2]: "/gpfs/home/colevalet/Cours/CHPS0904_RENDU/MultiGPU Programming Model/etape5_nccl/main.nsys-rep"


Generated:
    /gpfs/home/colevalet/Cours/CHPS0904_RENDU/MultiGPU Programming Model/etape5_nccl/main.qdstrm




## etape6_nccl_overlap
**Description:** NCCL + overlap: NCCL communication is overlapped with GPU computation, hiding the communication cost.

**Purpose:** Essential at high GPU density to keep the cores busy.
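
The NCCL run logs in this notebook report a speedup and a parallel efficiency. The sketch below shows how those two figures follow from the logged timings; the function names are hypothetical, while the timings are copied from the etape5, etape6, and etape7 outputs.

```python
def speedup(t_single, t_multi):
    """Ratio of single-GPU time to multi-GPU time."""
    return t_single / t_multi


def efficiency(t_single, t_multi, n_gpus):
    """Parallel efficiency in percent: speedup divided by the number of GPUs."""
    return 100.0 * speedup(t_single, t_multi) / n_gpus


# (1-GPU time, 4-GPU time) in seconds, as printed by the runs in this notebook.
runs = {
    "etape5_nccl": (29.9511, 15.3749),
    "etape6_nccl_overlap": (29.8715, 8.9578),
    "etape7_nccl_graphs": (29.9109, 20.4265),
}
for name, (t1, t4) in runs.items():
    print(f"{name}: speedup {speedup(t1, t4):.2f}, efficiency {efficiency(t1, t4, 4):.2f} %")
```

An efficiency close to 100 % means the GPUs spend almost all their time computing; lower values indicate time lost to communication or synchronization.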

In [11]:
%%bash
cd etape6_nccl_overlap
make clean all

rm -f app *.o main.nsys-rep main.sqlite main.AnalysisSummary.html main.DiagnosticsSummary.html main.qdstrm
/apps/2025/manual_install/nvhpc/24.11/Linux_aarch64/24.11/cuda/12.6/bin/nvcc -lineinfo -gencode arch=compute_90,code=sm_90 -gencode arch=compute_90,code=compute_90 -std=c++14 -I/apps/2025/manual_install/nvhpc/24.11/Linux_aarch64/24.11/cuda/12.6/include -I/apps/2025/manual_install/nvhpc/24.11/Linux_aarch64/24.11/comm_libs/12.6/hpcx/hpcx-2.20/ompi/include -I/apps/2025/manual_install/nvhpc/24.11/Linux_aarch64/24.11/comm_libs/12.6/nccl/include kernel.cu -c
/apps/2025/manual_install/nvhpc/24.11/Linux_aarch64/24.11/comm_libs/12.6/hpcx/hpcx-2.20/ompi/bin/mpicxx -DUSE_NVTX -O3 -I/apps/2025/manual_install/nvhpc/24.11/Linux_aarch64/24.11/cuda/12.6/include -I/apps/2025/manual_install/nvhpc/24.11/Linux_aarch64/24.11/comm_libs/12.6/hpcx/hpcx-2.20/ompi/include -I/apps/2025/manual_install/nvhpc/24.11/Linux_aarch64/24.11/comm_libs/12.6/nccl/include -std=c++14 main.cpp kernel.o -L/apps/2025/manual

main.cpp:


In [41]:
%%bash 
cd etape6_nccl_overlap
# nsys does not always work reliably under mpirun
NCCL_DEBUG=WARN mpirun -np 4 nsys profile -t mpi,cuda,nvtx --stats=true --force-overwrite true -o main ./app -niter 1000 -nx 4096 -ny 4096 -nccheck 1
#NCCL_DEBUG=WARN mpirun -np 4 ./app -niter 1000 -nx 4096 -ny 4096 -nccheck 1

Collecting data...
Collecting data...
Collecting data...
Collecting data...
NCCL version 2.19.3+cuda12.3
Single GPU jacobi relaxation: 1000 iterations on 4096 x 4096 mesh with norm check every 1 iterations
    0, 15.998031
  100, 0.448893
  200, 0.267746
  300, 0.197740
  400, 0.159443
  500, 0.134900
  600, 0.117686
  700, 0.104841
  800, 0.094847
  900, 0.086838
Jacobi relaxation: 1000 iterations on 4096 x 4096 mesh with norm check every 1 iterations
    0, 15.998046
  100, 0.448918
  200, 0.267779
  300, 0.197766
  400, 0.159464
  500, 0.134923
  600, 0.117701
  700, 0.104860
  800, 0.094871
  900, 0.086854
Num GPUs: 4.
4096x4096: 1 GPU:  29.8715 s, 4 GPUs:   8.9578 s, speedup:     3.33, efficiency:    83.37 
Generating '/tmp/nsys-report-1e2b.qdstrm'
Generating '/tmp/nsys-report-bf7c.qdstrm'
Generating '/tmp/nsys-report-4357.qdstrm'
Generating '/tmp/nsys-report-c351.qdstrm'


Export error: /dvs/p4/build/sw/devtools/Agora/Rel/QuadD_Main/QuadD/Common/StreamSections/StreamWithSections.cpp(741): Throw in function void QuadDCommon::parseProtobufFromStream(std::istream&, google::protobuf::Message&)
Dynamic exception type: boost::wrapexcept<QuadDCommon::ProtobufParseException>
std::exception::what: ProtobufParseException
[boost::errinfo_api_function_*] = parseProtobufFromStream



[3/7] Executing 'nvtx_sum' stats report


FATAL ERROR: /dvs/p4/build/sw/devtools/Agora/Rel/QuadD_Main/QuadD/Target/Daemon/Agent/OutputFile.cpp(726): Throw in function const boost::filesystem::path& QuadDDaemon::OutputFile::GetPath(QuadDDaemon::Extension) const
Dynamic exception type: boost::wrapexcept<QuadDCommon::FileException>
std::exception::what: FileException
[QuadDCommon::tag_message*] = Output file was never created



[2/7] [===22%                      ] main.sqlite


Export error: LZ4 decompression failed.
FATAL ERROR: /dvs/p4/build/sw/devtools/Agora/Rel/QuadD_Main/QuadD/Target/Daemon/Agent/OutputFile.cpp(726): Throw in function const boost::filesystem::path& QuadDDaemon::OutputFile::GetPath(QuadDDaemon::Extension) const
Dynamic exception type: boost::wrapexcept<QuadDCommon::FileException>
std::exception::what: FileException
[QuadDCommon::tag_message*] = Output file was never created



[3/7] Executing 'nvtx_sum' stats report
[2/7] [9%                          ] main.sqlite


Export error: LZ4 decompression failed.


[2/7] [10%                         ] main.sqlite[3/7] Executing 'nvtx_sum' stats report




## etape7_nccl_graphs
**Description:** NCCL + CUDA Graphs: the Jacobi/exchange sequence is captured once and replayed, reducing launch overhead.

**Purpose:** Cuts launch overhead by using CUDA Graphs.
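
A back-of-the-envelope model helps to see when graph replay can help. This is a sketch under strong assumptions, not a description of the code in this step: host-side launch cost is taken to sit on the critical path (it is not hidden behind GPU work), and all names and microsecond values are hypothetical.

```python
def time_individual_launches(n_iter, launches_per_iter, t_launch, t_work):
    """Each iteration pays one host-side launch cost per operation."""
    return n_iter * (launches_per_iter * t_launch + t_work)


def time_graph_replay(n_iter, t_capture, t_graph_launch, t_work):
    """The sequence is captured once, then each iteration pays a single graph launch."""
    return t_capture + n_iter * (t_graph_launch + t_work)


# Hypothetical costs in microseconds.
n_iter, t_launch, t_graph_launch, t_capture, t_work = 1000, 10, 12, 500, 50

for launches_per_iter in (1, 4):
    plain = time_individual_launches(n_iter, launches_per_iter, t_launch, t_work)
    graph = time_graph_replay(n_iter, t_capture, t_graph_launch, t_work)
    print(f"{launches_per_iter} launches/iter: individual={plain} us, graph={graph} us")
```

Under these assumptions the graph wins only when several launches per iteration are folded into one replay; with a single launch per iteration the capture cost and the slightly heavier graph launch are not recovered.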

In [12]:
%%bash 
cd etape7_nccl_graphs
make all

/apps/2025/manual_install/nvhpc/24.11/Linux_aarch64/24.11/cuda/12.6/bin/nvcc -lineinfo -gencode arch=compute_90,code=sm_90 -gencode arch=compute_90,code=compute_90 -std=c++14 -I/apps/2025/manual_install/nvhpc/24.11/Linux_aarch64/24.11/cuda/12.6/include -I/apps/2025/manual_install/nvhpc/24.11/Linux_aarch64/24.11/comm_libs/12.6/hpcx/hpcx-2.20/ompi/include -I/apps/2025/manual_install/nvhpc/24.11/Linux_aarch64/24.11/comm_libs/12.6/nccl/include kernel.cu -c


/apps/2025/manual_install/nvhpc/24.11/Linux_aarch64/24.11/comm_libs/12.6/hpcx/hpcx-2.20/ompi/bin/mpicxx -DUSE_NVTX -O3 -I/apps/2025/manual_install/nvhpc/24.11/Linux_aarch64/24.11/cuda/12.6/include -I/apps/2025/manual_install/nvhpc/24.11/Linux_aarch64/24.11/comm_libs/12.6/hpcx/hpcx-2.20/ompi/include -I/apps/2025/manual_install/nvhpc/24.11/Linux_aarch64/24.11/comm_libs/12.6/nccl/include -std=c++14 main.cpp kernel.o -L/apps/2025/manual_install/nvhpc/24.11/Linux_aarch64/24.11/cuda/12.6/lib64 -L/apps/2025/manual_install/nvhpc/24.11/Linux_aarch64/24.11/comm_libs/12.6/hpcx/hpcx-2.20/ompi/lib -L/apps/2025/manual_install/nvhpc/24.11/Linux_aarch64/24.11/comm_libs/12.6/nccl/lib -lcudart -ldl -lnccl -o app


main.cpp:
          NCCL_CALL(ncclCommDeregister(nccl_comm, a_new_reg_handle));
          ^


          NCCL_CALL(ncclCommDeregister(nccl_comm, a_reg_handle));
          ^



In [13]:
%%bash 
cd etape7_nccl_graphs
# nsys does not always work reliably under mpirun
NCCL_DEBUG=WARN mpirun -np 4 nsys profile -t mpi,cuda,nvtx --stats=true --force-overwrite true -o main ./app -niter 1000 -nx 4096 -ny 4096 -nccheck 1
#NCCL_DEBUG=WARN mpirun -np 4 ./app -niter 1000 -nx 4096 -ny 4096 -nccheck 1

Collecting data...
Collecting data...
Collecting data...
Collecting data...
NCCL version 2.19.3+cuda12.3
Single GPU jacobi relaxation: 1000 iterations on 4096 x 4096 mesh with norm check every 1 iterations
    0, 15.998032
  100, 0.448893
  200, 0.267747
  300, 0.197740
  400, 0.159443
  500, 0.134900
  600, 0.117686
  700, 0.104841
  800, 0.094847
  900, 0.086838
Jacobi relaxation: 1000 iterations on 4096 x 4096 mesh with norm check every 1 iterations
    0, 15.998045
  100, 0.448918
  200, 0.267779
  300, 0.197766
  400, 0.159464
  500, 0.134923
  600, 0.117701
  700, 0.104860
  800, 0.094871
  900, 0.086854
Num GPUs: 4.
4096x4096: 1 GPU:  29.9109 s, 4 GPUs:  20.4265 s, speedup:     1.46, efficiency:    36.61 
Generating '/tmp/nsys-report-3428.qdstrm'
Generating '/tmp/nsys-report-0d77.qdstrm'
Generating '/tmp/nsys-report-c564.qdstrm'
Generating '/tmp/nsys-report-da7f.qdstrm'


Importer error status: Importation failed.
File is corrupted.


    /gpfs/home/colevalet/Cours/CHPS0904_RENDU/MultiGPU Programming Model/etape7_nccl_graphs/main.qdstrm


Importer error status: Importation failed.
Import Failed with unexpected exception: /dvs/p4/build/sw/devtools/Agora/Rel/QuadD_Main/QuadD/Common/StreamSections/FileStream.cpp(368): Throw in function void QuadDCommon::FileStream::truncate(std::streamsize)
Dynamic exception type: boost::wrapexcept<QuadDCommon::InvalidArgumentException>
std::exception::what: InvalidArgumentException
[QuadDCommon::tag_message*] = Invalid truncate size.
[QuadDCommon::tag_report_file_name*] = "/gpfs/home/colevalet/Cours/CHPS0904_RENDU/MultiGPU Programming Model/etape7_nccl_graphs/main.nsys-rep"


    /gpfs/home/colevalet/Cours/CHPS0904_RENDU/MultiGPU Programming Model/etape7_nccl_graphs/main.qdstrm


Importer error status: An unknown error occurred.
Dynamic exception type: boost::filesystem::filesystem_error
std::exception::what: boost::filesystem::file_size: No such file or directory [system:2]: "/gpfs/home/colevalet/Cours/CHPS0904_RENDU/MultiGPU Programming Model/etape7_nccl_graphs/main.nsys-rep"


    /gpfs/home/colevalet/Cours/CHPS0904_RENDU/MultiGPU Programming Model/etape7_nccl_graphs/main.qdstrm
[3/7] Executing 'nvtx_sum' stats report

 Time (%)  Total Time (ns)  Instances    Avg (ns)       Med (ns)      Min (ns)     Max (ns)    StdDev (ns)    Style              Range           
 --------  ---------------  ---------  -------------  -------------  -----------  -----------  ------------  -------  ---------------------------
     96.2      50338335008          2  25169167504.0  25169167504.0  20426919904  29911415104  6706550872.1  PushPop  :Jacobi solve              
      3.1       1637318688          1   1637318688.0   1637318688.0   1637318688   1637318688           0.0  PushPop  NCCL:ncclCommInitRank      
      0.3        143329856          1    143329856.0    143329856.0    143329856    143329856           0.0  PushPop  :NCCL_Warmup               
      0.2        113124320         70      1616061.7          224.0           32    105392256    12601138.0  PushPop  NCCL:nc

## etape8_nvshmem
**Description:** NVSHMEM: a PGAS model with one-sided GPU memory access, which simplifies halo updates.

**Purpose:** Simplifies exchanges through the one-sided PGAS model.

In [14]:
%%bash 
cd etape8_nvshmem
make all

/apps/2025/manual_install/nvhpc/24.11/Linux_aarch64/24.11/cuda/12.6/bin/nvcc -Xptxas --optimize-float-atomics -ccbin=mpic++ -dc -Xcompiler -fopenmp -lineinfo -DUSE_NVTX -ldl -gencode arch=compute_90,code=sm_90 -std=c++14 -I/apps/2025/manual_install/nvhpc/24.11/Linux_aarch64/24.11/comm_libs/12.6/nvshmem/include main.cu -c -o main.o


/apps/2025/manual_install/nvhpc/24.11/Linux_aarch64/24.11/cuda/12.6/bin/nvcc -gencode arch=compute_90,code=sm_90 main.o -o main -ccbin=mpic++ -L/apps/2025/manual_install/nvhpc/24.11/Linux_aarch64/24.11/cuda/12.6/lib64 -L/apps/2025/manual_install/nvhpc/24.11/Linux_aarch64/24.11/comm_libs/12.6/nvshmem/lib -lnvshmem -lcuda -lcudart -ldl -lnvidia-ml 


In [15]:
%%bash 
cd etape8_nvshmem
# nsys does not always work reliably under mpirun
NCCL_DEBUG=WARN mpirun -np 4 nsys profile -t mpi,cuda,nvtx --stats=true --force-overwrite true -o main ./main -niter 1000 -nx 4096 -ny 4096 -nccheck 1
#NCCL_DEBUG=WARN mpirun -np 4 ./main -niter 1000 -nx 4096 -ny 4096 -nccheck 1

Collecting data...
Collecting data...
Collecting data...
Collecting data...
Setting environment variable NVSHMEM_SYMMETRIC_SIZE = 36981964
NCCL version 2.19.3+cuda12.3
NCCL version 2.19.3+cuda12.3
NCCL version 2.19.3+cuda12.3
NCCL version 2.19.3+cuda12.3
Single GPU jacobi relaxation: 1000 iterations on 4096 x 4096 mesh with norm check every 1 iterations
    0, 15.994146
  100, 0.448816
  200, 0.267719
  300, 0.197726
  400, 0.159431
  500, 0.134898
  600, 0.117676
  700, 0.104837
  800, 0.094852
  900, 0.086836
Jacobi relaxation: 1000 iterations on 4096 x 4096 mesh
    0, 15.994146
  100, 0.448816
  200, 0.267722
  300, 0.197726
  400, 0.159431
  500, 0.134898
  600, 0.117676
  700, 0.104838
  800, 0.094853
  900, 0.086836
Num GPUs: 4.
4096x4096: 1 GPU:   1.9420 s, 4 GPUs:  38.5545 s, speedup:     0.05, efficiency:     1.26 
Generating '/tmp/nsys-report-f924.qdstrm'
Generating '/tmp/nsys-report-8d3f.qdstrm'
Generating '/tmp/nsys-report-6f96.qdstrm'
Generating '/tmp/nsys-report-aad2.qds

Export error: /dvs/p4/build/sw/devtools/Agora/Rel/QuadD_Main/QuadD/Common/StreamSections/StreamWithSections.h(179): Throw in function void QuadDCommon::readFromStream(std::istream&, T&) [with T = long int; std::istream = std::basic_istream<char>]
Dynamic exception type: boost::wrapexcept<QuadDCommon::ReadStreamException>
std::exception::what: ReadStreamException
[QuadDAnalysis::tag_report_file_name*] = "/gpfs/home/colevalet/Cours/CHPS0904_RENDU/MultiGPU Programming Model/etape8_nvshmem/main.nsys-rep"
[boost::errinfo_api_function_*] = parseSectionTable()

FATAL ERROR: /dvs/p4/build/sw/devtools/Agora/Rel/QuadD_Main/QuadD/Target/Daemon/Agent/OutputFile.cpp(726): Throw in function const boost::filesystem::path& QuadDDaemon::OutputFile::GetPath(QuadDDaemon::Extension) const
Dynamic exception type: boost::wrapexcept<QuadDCommon::FileException>
std::exception::what: FileException
[QuadDCommon::tag_message*] = Output file was never created



[3/7] Executing 'nvtx_sum' stats report


Importer error status: Importation failed.
File is corrupted.


Generated:
    /gpfs/home/colevalet/Cours/CHPS0904_RENDU/MultiGPU Programming Model/etape8_nvshmem/main.qdstrm



Importer error status: An unknown error occurred.
Dynamic exception type: boost::filesystem::filesystem_error
std::exception::what: boost::filesystem::file_size: No such file or directory [system:2]: "/gpfs/home/colevalet/Cours/CHPS0904_RENDU/MultiGPU Programming Model/etape8_nvshmem/main.nsys-rep"


Generated:
    /gpfs/home/colevalet/Cours/CHPS0904_RENDU/MultiGPU Programming Model/etape8_nvshmem/main.qdstrm





**Note:**

The versions that use LTO (Link-Time Optimization) are not available here. For LTO to work with NVSHMEM, NVSHMEM itself must be built with LTO enabled. In our environment it was not, so the LTO executables for these steps cannot be generated.
As a result, the steps that require LTO cannot be tested in this notebook.


## etape9_nvshmem_lto
**Description:** NVSHMEM + LTO: adds link-time optimization to inline critical functions and reduce call overhead.

**Purpose:** Link-time optimization to inline the critical sections.

In [None]:
%%bash
cd etape9_nvshmem_lt
make all

In [None]:
%%bash 
cd etape9_nvshmem_lt
nv-nsight-cu-cli --csv --report-file rapport_etape9_nvshmem_lt.csv ./main
cat rapport_etape9_nvshmem_lt.csv

## etape10_vshmem_neighborhood_lto
**Description:** NVSHMEM neighborhood_sync + LTO: fine-grained neighborhood synchronization combined with link-time optimization (O2).

**Purpose:** Fine-grained synchronization and LTO for tight loops.

In [None]:
%%bash 
cd etape10_vshmem_neighborhood_lto
make all

In [None]:
%%bash 
cd etape10_vshmem_neighborhood_lto
nv-nsight-cu-cli --csv --report-file rapport_etape10_vshmem_neighborhood_lto.csv ./main 
cat rapport_etape10_vshmem_neighborhood_lto.csv

## etape11_nvshmem_norm_overlap_neighborhood_sync_lto
**Description:** Combination: NVSHMEM with overlap, neighborhood synchronization, and LTO to maximize concurrency.

**Purpose:** Combines the best practices above into a heavily optimized binary.

In [None]:
%%bash
cd etape11_nvshmem_norm_overlap_neighborhood_sync_lto
make all

In [None]:
%%bash
cd etape11_nvshmem_norm_overlap_neighborhood_sync_lto
nv-nsight-cu-cli --csv --report-file rapport_etape11_nvshmem_norm_overlap_neighborhood_sync_lto.csv ./main
cat rapport_etape11_nvshmem_norm_overlap_neighborhood_sync_lto.csv

## etape12_nvshmem_norm_overlap_neighborhood_sync_lto_ext1
**Description:** Extended tuning: adjustable parameters (tile size, loop order) and benchmark hooks.

**Purpose:** Adds tuning parameters and benchmarking hooks.

In [None]:
%%bash 
cd etape12_nvshmem_norm_overlap_neighborhood_sync_lto_ext1 
make all

In [None]:
%%bash 
cd etape12_nvshmem_norm_overlap_neighborhood_sync_lto_ext1 
nv-nsight-cu-cli --csv --report-file rapport_etape12_nvshmem_norm_overlap_neighborhood_sync_lto_ext1.csv ./main
cat rapport_etape12_nvshmem_norm_overlap_neighborhood_sync_lto_ext1.csv