# Jacobi Solver: Multi-GPU Programming Models
This notebook walks through 12 progressive versions of a 2D Jacobi solver. Each section explains the programming model or optimization, builds that version, runs it, and collects Nsight metrics.

## Spack modules to load

Before building or running any of the steps, load the required modules with Spack, preferably before launching VS Code or Jupyter so that the notebook kernel inherits the environment.
For example:

```bash
spack load nvhpc@24.11
export LD_LIBRARY_PATH=/apps/2025/manual_install/nvhpc/24.11/Linux_aarch64/24.11/comm_libs/12.6/hpcx/hpcx-2.20/ompi/lib:/apps/2025/manual_install/nvhpc/24.11/Linux_aarch64/24.11/comm_libs/nccl/lib:$LD_LIBRARY_PATH
unset OPAL_PREFIX
unset PMIX_INSTALL_PREFIX
```

Adjust the version of each module to match your cluster's configuration.

## etape1_cpu
**Description:** Baseline CPU Jacobi solver: a single-threaded implementation. Useful for validating correctness on small problem sizes; it also shows where CPU compute becomes the limit.

**Purpose:** Baseline: measures CPU performance to establish a reference.
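
As a point of reference, one Jacobi sweep can be sketched in a few lines. This is a Python illustration, not the contents of `main.c`: the 4-neighbor averaging stencil and the names `jacobi_step` and `solve` are assumptions, since the notebook does not show the solver source.

```python
def jacobi_step(a):
    """Return a new grid after one Jacobi sweep (assumed 4-neighbor average).

    Boundary cells are copied unchanged; only interior cells are updated,
    and every update reads exclusively from the old grid `a`.
    """
    ny, nx = len(a), len(a[0])
    a_new = [row[:] for row in a]
    for y in range(1, ny - 1):
        for x in range(1, nx - 1):
            a_new[y][x] = 0.25 * (a[y][x - 1] + a[y][x + 1]
                                  + a[y - 1][x] + a[y + 1][x])
    return a_new


def solve(a, n_iter):
    """Apply `n_iter` Jacobi sweeps and return the final grid."""
    for _ in range(n_iter):
        a = jacobi_step(a)
    return a


# 4x4 grid: left boundary column held at 1.0, everything else 0.0.
grid = [[1.0 if x == 0 else 0.0 for x in range(4)] for y in range(4)]
after_one = jacobi_step(grid)
```

Because each sweep reads only the previous grid, all interior cells are independent within an iteration, which is what makes the later GPU and multi-GPU versions possible.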

In [5]:
%%bash
cd etape1_cpu
make clean all

rm -f main *.o main.nsys-rep main.sqlite main.AnalysisSummary.html main.DiagnosticsSummary.html
gcc -O2 -o main main.c


In [6]:
%%bash
cd etape1_cpu
./main

Terminé etape1_cpu
CPU time: 10.315427 seconds


## etape2_cpu_gpu
**Description:** 1 CPU + 1 GPU + 1 stream: the CPU drives the GPU through a single CUDA stream. The Jacobi computation is fully offloaded to the GPU; the CPU only orchestrates.

**Purpose:** Shows the CPU vs. GPU performance gap once the grid is large enough, in the realistic setting of a single GPU and a single stream.

### Build and run (1 CPU + 1 GPU + 1 stream)

In [7]:
%%bash
cd etape2_cpu_gpu
make clean all

rm -f app *.o main.nsys-rep main.sqlite main.AnalysisSummary.html main.DiagnosticsSummary.html
nvcc -O2 -o app main.cu kernel.cu


In [8]:
%%bash
cd etape2_cpu_gpu
nsys profile -t cuda --stats=true --force-overwrite true -o main ./app

Terminé etape2_cpu_gpu (1CPU + 1GPU + 1stream)
GPU time: 1.540748 seconds
Collecting data...
Generating '/tmp/nsys-report-352f.qdstrm'
[3/6] Executing 'cuda_api_sum' stats report

 Time (%)  Total Time (ns)  Num Calls   Avg (ns)     Med (ns)    Min (ns)   Max (ns)   StdDev (ns)           Name         
 --------  ---------------  ---------  -----------  -----------  ---------  ---------  -----------  ----------------------
     72.1        704315616          3  234771872.0  350497920.0    1334496  352483200  202165134.8  cudaMemcpyAsync       
     16.2        158182816          1  158182816.0  158182816.0  158182816  158182816          0.0  cudaStreamCreate      
     10.0         98101984       1001      98004.0      98112.0       1632     102400       3126.8  cudaStreamSynchronize 
      1.5         15092064       1000      15092.1       2720.0       1952   12158912     384406.2  cudaLaunchKernel      
      0.1           606272          2     303136.0     303136.0     277408     328

## etape3_mpi_gpus
**Description:** MPI + GPUs: the domain is distributed across several MPI ranks, each driving one GPU. Illustrates the challenges of multi-node scaling and inter-rank communication.

**Purpose:** Tests inter-node scaling and the cost of MPI on a multi-GPU cluster.

**Steps:**
- Initialize MPI and determine the rank and the number of processes.
- Bind each rank to a different GPU.
- Split the grid across ranks (1D vertical decomposition).
- Exchange halos between neighboring ranks with MPI_Sendrecv.
- Synchronize the exchanges at every iteration.
- Finalize MPI at the end.
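
The decomposition step can be sketched as follows. This Python helper is an illustration, not the code in `main.cpp`: the names `decompose_rows` and `neighbours` are hypothetical, and the splitting rule (interior rows divided evenly, remainder given to the lowest ranks) is an assumption. It was chosen because it reproduces the `ny_local` and `offset` values printed by the etape4 run in this notebook for `ny=4096` on 4 ranks.

```python
def decompose_rows(ny, size):
    """Split the ny - 2 interior rows over `size` ranks.

    Returns one (ny_local, offset) pair per rank, where `offset` is the
    global index of the first row owned by that rank. Row 0 and row ny - 1
    are fixed boundary rows and belong to no rank.
    """
    interior = ny - 2
    base, extra = divmod(interior, size)
    parts = []
    offset = 1  # first interior row
    for rank in range(size):
        ny_local = base + (1 if rank < extra else 0)  # low ranks take the remainder
        parts.append((ny_local, offset))
        offset += ny_local
    return parts


def neighbours(rank, size):
    """Ranks to exchange halos with; None marks a physical boundary.

    In an MPI code, None would typically be replaced by MPI_PROC_NULL so
    that the same MPI_Sendrecv call can be used on every rank.
    """
    top = rank - 1 if rank > 0 else None
    bottom = rank + 1 if rank < size - 1 else None
    return top, bottom


parts = decompose_rows(4096, 4)
```

Each rank then stores its `ny_local` rows plus one halo row above and one below, and refreshes the two halo rows from `neighbours(rank, size)` at every iteration.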

In [16]:
%%bash
cd etape3_mpi_gpus
make clean all

rm -f app *.o main.nsys-rep main.sqlite main.AnalysisSummary.html main.DiagnosticsSummary.html main.qdstrm
/apps/2025/manual_install/nvhpc/24.11/Linux_aarch64/24.11/cuda/12.6/bin/nvcc -O3 -std=c++14 -lcudart -Xcompiler "-fopenmp" -I/apps/2025/manual_install/nvhpc/24.11/Linux_aarch64/24.11/cuda/12.6/include -I/apps/2025/manual_install/nvhpc/24.11/Linux_aarch64/24.11/cuda/12.6/include -I/apps/2025/manual_install/nvhpc/24.11/Linux_aarch64/24.11/comm_libs/12.6/hpcx/hpcx-2.20/ompi/include -I/apps/2025/manual_install/nvhpc/24.11/Linux_aarch64/24.11/comm_libs/nccl/include main.cpp kernel.cu -o app -L/apps/2025/manual_install/nvhpc/24.11/Linux_aarch64/24.11/cuda/12.6/lib64 -L/apps/2025/manual_install/nvhpc/24.11/Linux_aarch64/24.11/comm_libs/12.6/hpcx/hpcx-2.20/ompi/lib -L/apps/2025/manual_install/nvhpc/24.11/Linux_aarch64/24.11/comm_libs/nccl/lib -lmpi -lnccl -lstdc++ \
	-Xlinker --no-as-needed


In [None]:
%%bash
cd etape3_mpi_gpus
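# nsys does not always work under mpirun: here all 4 ranks write to the same
# report (-o main), which likely explains the export errors in the output.
# A per-rank name such as -o main.%q{OMPI_COMM_WORLD_RANK} avoids the clash.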
mpirun -np 4 nsys profile -t mpi,cuda --stats=true --force-overwrite true -o main ./app 1000 4096 4096 1
#mpirun -np 4 ./app 1000 4096 4096 1

Collecting data...
Collecting data...
Collecting data...
Collecting data...
Done: 1000 iters in 3.99646s, norm=1.72125
Generating '/tmp/nsys-report-06a9.qdstrm'
Generating '/tmp/nsys-report-37a5.qdstrm'
Generating '/tmp/nsys-report-c030.qdstrm'
Generating '/tmp/nsys-report-d175.qdstrm'


Export error: /dvs/p4/build/sw/devtools/Agora/Rel/QuadD_Main/QuadD/Common/ProtobufComm/Common/ProtobufUtils.cpp(73): Throw in function void QuadDProtobufUtils::ReadMessage(QuadDProtobufUtils::PbCodedIStream&, QuadDProtobufUtils::PbMessageLite&)
Dynamic exception type: boost::wrapexcept<QuadDCommon::ProtobufParseException>
std::exception::what: ProtobufParseException



[3/7] Executing 'nvtx_sum' stats report


FATAL ERROR: /dvs/p4/build/sw/devtools/Agora/Rel/QuadD_Main/QuadD/Target/Daemon/Agent/OutputFile.cpp(726): Throw in function const boost::filesystem::path& QuadDDaemon::OutputFile::GetPath(QuadDDaemon::Extension) const
Dynamic exception type: boost::wrapexcept<QuadDCommon::FileException>
std::exception::what: FileException
[QuadDCommon::tag_message*] = Output file was never created



[2/7] [0%                          ] main.sqlite


Export error: Section Table Reference magic number mismatch.
FATAL ERROR: /dvs/p4/build/sw/devtools/Agora/Rel/QuadD_Main/QuadD/Target/Daemon/Agent/OutputFile.cpp(726): Throw in function const boost::filesystem::path& QuadDDaemon::OutputFile::GetPath(QuadDDaemon::Extension) const
Dynamic exception type: boost::wrapexcept<QuadDCommon::FileException>
std::exception::what: FileException
[QuadDCommon::tag_message*] = Output file was never created



[3/7] Executing 'nvtx_sum' stats report


Export error: LZ4 decompression failed.
FATAL ERROR: /dvs/p4/build/sw/devtools/Agora/Rel/QuadD_Main/QuadD/Target/Daemon/Agent/OutputFile.cpp(726): Throw in function const boost::filesystem::path& QuadDDaemon::OutputFile::GetPath(QuadDDaemon::Extension) const
Dynamic exception type: boost::wrapexcept<QuadDCommon::FileException>
std::exception::what: FileException
[QuadDCommon::tag_message*] = Output file was never created



[3/7] Executing 'nvtx_sum' stats report
[3/7] Executing 'nvtx_sum' stats report


SKIPPED: No data available.


[4/7] Executing 'cuda_api_sum' stats report

 Time (%)  Total Time (ns)  Num Calls  Avg (ns)  Med (ns)  Min (ns)  Max (ns)  StdDev (ns)           Name         
 --------  ---------------  ---------  --------  --------  --------  --------  -----------  ----------------------
     95.8        249794528       5002   49938.9   12704.0      3808    365184      77647.1  cudaMemcpy            
      3.9         10051040       1000   10051.0    9312.0      6656    538688      16959.9  cudaLaunchKernel      
      0.1           336032          2  168016.0  168016.0     84800    251232     117685.2  cudaFree              
      0.1           308928          2  154464.0  154464.0    118272    190656      51183.2  cudaMalloc            
      0.0           109216        413     264.4      96.0        32      3712        338.0  cuGetProcAddress_v2   
      0.0            51520          2   25760.0   25760.0      4640     46880      29868.2  cudaMemset            
      0.0             4128         

### At what matrix size does communication/computation overlap start to pay off?

Overlapping communication with computation generally pays off when:
- Communication time (MPI plus host/device transfers) becomes significant relative to local compute time.
- The part of the computation that can run during communication (everything except the boundary rows) is large enough to hide network latency.

For a 2D Jacobi grid, the break-even size depends on:
- network bandwidth and latency,
- the number of MPI ranks,
- the speed of CUDA host/device transfers,
- GPU compute throughput.

**As a rule of thumb on recent clusters, overlap starts to pay off for matrices around 16k×16k to 32k×32k (or larger),** especially when the rank count is high (≥4) and communication becomes a real bottleneck.

**For an 8k×8k matrix,** local computation usually still dominates, so the overhead of overlapping (extra copies, synchronizations) can cancel out the gain. The 4096×4096 runs in this notebook are consistent with this: on 4 ranks and 1000 iterations, etape3_mpi_gpus takes 3.996 s and etape4_mpi_overlap 3.968 s (both measured under nsys).  
**Try 16k×16k or 32k×32k** to see a benefit, especially if you also increase the number of MPI ranks (and with it the share of communication).

**Summary:**  
- < 8k×8k: overlap rarely helps  
- 16k×16k: starts to be worthwhile  
- 32k×32k and above: overlap often pays off, especially on multi-node clusters
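
The two conditions above can be made concrete with a very simple per-iteration cost model. This is a sketch under stated assumptions, not a measurement: the function names and the idea of a fixed `t_overhead` for the extra copies and synchronizations are hypothetical, and the timings passed in are made-up inputs.

```python
def iteration_time(t_interior, t_boundary, t_comm, overlap, t_overhead=0.0):
    """Modelled time of one Jacobi iteration.

    Without overlap, boundary compute, interior compute and the halo
    exchange run one after another. With overlap, the halo exchange runs
    concurrently with the interior compute, at the price of `t_overhead`.
    """
    if not overlap:
        return t_boundary + t_interior + t_comm
    return t_boundary + max(t_interior, t_comm) + t_overhead


def overlap_gain(t_interior, t_boundary, t_comm, t_overhead):
    """Time saved per iteration by overlapping; negative means a slowdown."""
    plain = iteration_time(t_interior, t_boundary, t_comm, overlap=False)
    hidden = iteration_time(t_interior, t_boundary, t_comm, overlap=True,
                            t_overhead=t_overhead)
    return plain - hidden


# Communication is a large share and the interior is long enough to hide it.
gain_large_comm = overlap_gain(t_interior=1.0, t_boundary=0.1,
                               t_comm=0.5, t_overhead=0.05)
# Communication is tiny: the overlap overhead exceeds what can be hidden.
gain_small_comm = overlap_gain(t_interior=1.0, t_boundary=0.1,
                               t_comm=0.02, t_overhead=0.05)
```

In this model the gain is `min(t_interior, t_comm) - t_overhead`: overlap only helps when both the communication and the interior work are large compared with the overhead it introduces.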

## etape4_mpi_overlap
**Description:** MPI + overlap: overlaps non-blocking halo exchanges with the local Jacobi computation, reducing the impact of network latency.

**Purpose:** Hides network latency by overlapping communication with local computation.
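
Overlap relies on the fact that a Jacobi sweep reads only the previous grid, so rows can be updated in any order. The Python sketch below illustrates this property; it is not the code in `main.cpp`. The function names and the 4-neighbor stencil are assumptions, and the comments mark where a real implementation would start and complete the non-blocking halo exchange.

```python
def update_rows(a, a_new, rows):
    """Update the interior cells of the given rows, reading only from `a`."""
    nx = len(a[0])
    for y in rows:
        for x in range(1, nx - 1):
            a_new[y][x] = 0.25 * (a[y][x - 1] + a[y][x + 1]
                                  + a[y - 1][x] + a[y + 1][x])


def sweep_plain(a):
    """Reference sweep: all interior rows in order."""
    a_new = [row[:] for row in a]
    update_rows(a, a_new, range(1, len(a) - 1))
    return a_new


def sweep_split(a):
    """Same sweep, reordered the way an overlapped version would run it."""
    a_new = [row[:] for row in a]
    first, last = 1, len(a) - 2
    # 1. Boundary rows first: their new values are what neighbors need.
    update_rows(a, a_new, [first, last])
    # (a real code would start the non-blocking halo exchange here)
    # 2. Remaining interior rows, computed while the halos are in flight.
    update_rows(a, a_new, range(first + 1, last))
    # (and wait for the exchange to complete here, before the next sweep)
    return a_new


# Small test grid with arbitrary values.
grid = [[float((7 * x + 3 * y) % 5) for x in range(5)] for y in range(6)]
same = sweep_split(grid) == sweep_plain(grid)
```

Since both orderings give identical results, the exchange of the boundary rows can proceed in the background for as long as the interior update lasts.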

In [19]:
%%bash
cd etape4_mpi_overlap
make clean all

rm -f app *.o main.nsys-rep main.sqlite main.AnalysisSummary.html main.DiagnosticsSummary.html main.qdstrm
/apps/2025/manual_install/nvhpc/24.11/Linux_aarch64/24.11/cuda/12.6/bin/nvcc -O3 -std=c++14 -lcudart -Xcompiler "-fopenmp" -I/apps/2025/manual_install/nvhpc/24.11/Linux_aarch64/24.11/cuda/12.6/include -I/apps/2025/manual_install/nvhpc/24.11/Linux_aarch64/24.11/cuda/12.6/include -I/apps/2025/manual_install/nvhpc/24.11/Linux_aarch64/24.11/comm_libs/12.6/hpcx/hpcx-2.20/ompi/include -I/apps/2025/manual_install/nvhpc/24.11/Linux_aarch64/24.11/comm_libs/nccl/include main.cpp kernel.cu -o app -L/apps/2025/manual_install/nvhpc/24.11/Linux_aarch64/24.11/cuda/12.6/lib64 -L/apps/2025/manual_install/nvhpc/24.11/Linux_aarch64/24.11/comm_libs/12.6/hpcx/hpcx-2.20/ompi/lib -L/apps/2025/manual_install/nvhpc/24.11/Linux_aarch64/24.11/comm_libs/nccl/lib -lmpi -lnccl -lstdc++ \
	-Xlinker --no-as-needed


In [None]:
%%bash 
cd etape4_mpi_overlap
# nsys does not always work under mpirun
mpirun -np 4 nsys profile -t mpi,cuda --stats=true --force-overwrite true -o main ./app 1000 4096 4096 1
#mpirun -np 4 ./app 1000 4096 4096 1

Collecting data...
Collecting data...
Collecting data...
Collecting data...
nx=4096 ny=4096 size=4
Rang 0: ny_local=1024 offset=1
Rang 3: ny_local=1023 offset=3072
Rang 1: ny_local=1024 offset=1025
Rang 2: ny_local=1023 offset=2049
Overlap: 1000 iters en 3.96817 s, norm=1.96709
Generating '/tmp/nsys-report-c523.qdstrm'
Generating '/tmp/nsys-report-d101.qdstrm'
Generating '/tmp/nsys-report-a97c.qdstrm'
Generating '/tmp/nsys-report-a50b.qdstrm'


Export error: Reading raw data failed, size: 60668


[3/7] Executing 'nvtx_sum' stats report


FATAL ERROR: /dvs/p4/build/sw/devtools/Agora/Rel/QuadD_Main/QuadD/Target/Daemon/Agent/OutputFile.cpp(726): Throw in function const boost::filesystem::path& QuadDDaemon::OutputFile::GetPath(QuadDDaemon::Extension) const
Dynamic exception type: boost::wrapexcept<QuadDCommon::FileException>
std::exception::what: FileException
[QuadDCommon::tag_message*] = Output file was never created



[2/7] [0%                          ] main.sqlite


Export error: /dvs/p4/build/sw/devtools/Agora/Rel/QuadD_Main/QuadD/Common/StreamSections/StreamWithSections.cpp(741): Throw in function void QuadDCommon::parseProtobufFromStream(std::istream&, google::protobuf::Message&)
Dynamic exception type: boost::wrapexcept<QuadDCommon::ProtobufParseException>
std::exception::what: ProtobufParseException
[boost::errinfo_api_function_*] = parseProtobufFromStream



[3/7] Executing 'nvtx_sum' stats report


FATAL ERROR: /dvs/p4/build/sw/devtools/Agora/Rel/QuadD_Main/QuadD/Target/Daemon/Agent/OutputFile.cpp(726): Throw in function const boost::filesystem::path& QuadDDaemon::OutputFile::GetPath(QuadDDaemon::Extension) const
Dynamic exception type: boost::wrapexcept<QuadDCommon::FileException>
std::exception::what: FileException
[QuadDCommon::tag_message*] = Output file was never created



[2/7] [0%                          ] main.sqlite


Export error: /dvs/p4/build/sw/devtools/Agora/Rel/QuadD_Main/QuadD/Common/StreamSections/StreamWithSections.cpp(741): Throw in function void QuadDCommon::parseProtobufFromStream(std::istream&, google::protobuf::Message&)
Dynamic exception type: boost::wrapexcept<QuadDCommon::ProtobufParseException>
std::exception::what: ProtobufParseException
[boost::errinfo_api_function_*] = parseProtobufFromStream



[3/7] Executing 'nvtx_sum' stats report


FATAL ERROR: /dvs/p4/build/sw/devtools/Agora/Rel/QuadD_Main/QuadD/Target/Daemon/Agent/OutputFile.cpp(726): Throw in function const boost::filesystem::path& QuadDDaemon::OutputFile::GetPath(QuadDDaemon::Extension) const
Dynamic exception type: boost::wrapexcept<QuadDCommon::FileException>
std::exception::what: FileException
[QuadDCommon::tag_message*] = Output file was never created



[3/7] Executing 'nvtx_sum' stats report


SKIPPED: No data available.


[4/7] Executing 'cuda_api_sum' stats report

 Time (%)  Total Time (ns)  Num Calls  Avg (ns)   Med (ns)   Min (ns)  Max (ns)  StdDev (ns)              Name             
 --------  ---------------  ---------  ---------  ---------  --------  --------  -----------  -----------------------------
     84.4        209431584       1002   209013.6   203840.0    201312   4571328     138135.4  cudaMemcpy                   
      5.6         14017024       1000    14017.0    13632.0      9600    442912      13719.5  cudaLaunchKernel             
      5.0         12387168       2000     6193.6     6768.0      2720     45536       2863.6  cuMemcpyAsync                
      3.0          7325312      10926      670.4      288.0       224     89856       1115.7  cuEventQuery                 
      1.1          2608736          1  2608736.0  2608736.0   2608736   2608736          0.0  cuMemHostRegister_v2         
      0.6          1369696       2000      684.8      672.0       192      5728        

## etape5_nccl
**Description:** NCCL: uses the NVIDIA Collective Communications Library for GPU-to-GPU exchanges. Shows the gains from NVLink or high-bandwidth PCIe.

**Purpose:** Automatically exploits the NVLink/PCIe topology for efficient GPU exchanges.

### Introduction to NCCL

NCCL (NVIDIA Collective Communications Library) provides efficient collective communication across multiple GPUs by exploiting the hardware topology (NVLink, PCIe).  
In a multi-GPU Jacobi solver, NCCL can exchange halos between GPUs without going back through the CPU.

**The cells below build the NCCL version and run it on 4 GPUs:**

In [36]:
%%bash
cd etape5_nccl
make clean all

rm -f app *.o main.nsys-rep main.sqlite main.AnalysisSummary.html main.DiagnosticsSummary.html main.qdstrm
/apps/2025/manual_install/nvhpc/24.11/Linux_aarch64/24.11/cuda/12.6/bin/nvcc -lineinfo -gencode arch=compute_90,code=sm_90 -gencode arch=compute_90,code=compute_90 -std=c++14 -I/apps/2025/manual_install/nvhpc/24.11/Linux_aarch64/24.11/cuda/12.6/include -I/apps/2025/manual_install/nvhpc/24.11/Linux_aarch64/24.11/comm_libs/12.6/hpcx/hpcx-2.20/ompi/include -I/apps/2025/manual_install/nvhpc/24.11/Linux_aarch64/24.11/comm_libs/nccl/include kernel.cu -c
/apps/2025/manual_install/nvhpc/24.11/Linux_aarch64/24.11/comm_libs/12.6/hpcx/hpcx-2.20/ompi/bin/mpicxx -DUSE_NVTX -O3 -I/apps/2025/manual_install/nvhpc/24.11/Linux_aarch64/24.11/cuda/12.6/include -I/apps/2025/manual_install/nvhpc/24.11/Linux_aarch64/24.11/comm_libs/12.6/hpcx/hpcx-2.20/ompi/include -I/apps/2025/manual_install/nvhpc/24.11/Linux_aarch64/24.11/comm_libs/nccl/include -std=c++14 main.cpp kernel.o -L/apps/2025/manual_install/n

main.cpp:


In [None]:
%%bash 
cd etape5_nccl
# nsys does not always work under mpirun
NCCL_DEBUG=WARN mpirun -np 4 nsys profile -t mpi,cuda,nvtx --stats=true --force-overwrite true -o main ./app -niter 1000 -nx 4096 -ny 4096 -nccheck 1
#NCCL_DEBUG=WARN mpirun -np 4 ./app -niter 1000 -nx 4096 -ny 4096 -nccheck 1

Collecting data...
Collecting data...
Collecting data...
Collecting data...
NCCL version 2.19.3+cuda12.3
Single GPU jacobi relaxation: 1000 iterations on 4096 x 4096 mesh with norm check every 1 iterations
    0, 15.998032
  100, 0.448893
  200, 0.267747
  300, 0.197740
  400, 0.159443
  500, 0.134900
  600, 0.117686
  700, 0.104841
  800, 0.094847
  900, 0.086838
Jacobi relaxation: 1000 iterations on 4096 x 4096 mesh with norm check every 1 iterations
    0, 15.998046
  100, 0.448918
  200, 0.267779
  300, 0.197766
  400, 0.159464
  500, 0.134923
  600, 0.117701
  700, 0.104860
  800, 0.094871
  900, 0.086854
Num GPUs: 4.
4096x4096: 1 GPU:  29.8297 s, 4 GPUs:  14.6582 s, speedup:     2.04, efficiency:    50.88 
Generating '/tmp/nsys-report-e8c0.qdstrm'
Generating '/tmp/nsys-report-1b68.qdstrm'
Generating '/tmp/nsys-report-70d0.qdstrm'
Generating '/tmp/nsys-report-4806.qdstrm'


Export error: Reading raw data failed, size: 123451


[3/7] Executing 'nvtx_sum' stats report


FATAL ERROR: /dvs/p4/build/sw/devtools/Agora/Rel/QuadD_Main/QuadD/Target/Daemon/Agent/OutputFile.cpp(726): Throw in function const boost::filesystem::path& QuadDDaemon::OutputFile::GetPath(QuadDDaemon::Extension) const
Dynamic exception type: boost::wrapexcept<QuadDCommon::FileException>
std::exception::what: FileException
[QuadDCommon::tag_message*] = Output file was never created



[2/7] [11%                         ] main.sqlite


Importer error status: Importation failed.
Import Failed with unexpected exception: /dvs/p4/build/sw/devtools/Agora/Rel/QuadD_Main/QuadD/Common/StreamSections/FileStream.cpp(368): Throw in function void QuadDCommon::FileStream::truncate(std::streamsize)
Dynamic exception type: boost::wrapexcept<QuadDCommon::InvalidArgumentException>
std::exception::what: InvalidArgumentException
[QuadDCommon::tag_message*] = Invalid truncate size.
[QuadDCommon::tag_report_file_name*] = "/gpfs/home/colevalet/Cours/CHPS0904_RENDU/MultiGPU Programming Model/etape5_nccl/main.nsys-rep"


[2/7] [13%                         ] main.sqliteGenerated:
    /gpfs/home/colevalet/Cours/CHPS0904_RENDU/MultiGPU Programming Model/etape5_nccl/main.qdstrm
[3/7] Executing 'nvtx_sum' stats report


ERROR: Database file /gpfs/home/colevalet/Cours/CHPS0904_RENDU/MultiGPU Programming Model/etape5_nccl/main.sqlite could not be opened and appears to be invalid.




ERROR: Database file /gpfs/home/colevalet/Cours/CHPS0904_RENDU/MultiGPU Programming Model/etape5_nccl/main.sqlite could not be opened and appears to be invalid.


[3/7] Executing 'nvtx_sum' stats report

 Time (%)  Total Time (ns)  Instances   Avg (ns)    Med (ns)   Min (ns)  Max (ns)  StdDev (ns)                                             Name                                           
 --------  ---------------  ---------  ----------  ----------  --------  --------  -----------  ------------------------------------------------------------------------------------------
     89.3      36800081702       2000  18400040.9  18399664.0   7360064  29442240   11042168.5  void jacobi_kernel<(int)32, (int)32>(float *, const float *, float *, int, int, int, bool)
     10.7       4428819801       1010   4384970.1   3831904.0      8416  27120544    4643524.2  ncclDevKernel_SendRecv(ncclDevComm *, unsigned long, ncclWork *)                          
      0.0             5696          2      2848.0      2848.0      2656      3040        271.5  initialize_boundaries(float *, float *, float, int, int, int, int)                        

[6/7] Executing 'cuda_g

## etape6_nccl_overlap
**Description:** NCCL + overlap: overlaps NCCL communication with GPU computation, hiding the communication cost.

**Purpose:** Essential at high GPU density to keep the cores busy. In the runs recorded in this notebook, 4-GPU efficiency rises from 50.88% (etape5_nccl) to 83.37%.

In [40]:
%%bash
cd etape6_nccl_overlap
make clean all

rm -f app *.o main.nsys-rep main.sqlite main.AnalysisSummary.html main.DiagnosticsSummary.html main.qdstrm
/apps/2025/manual_install/nvhpc/24.11/Linux_aarch64/24.11/cuda/12.6/bin/nvcc -lineinfo -gencode arch=compute_90,code=sm_90 -gencode arch=compute_90,code=compute_90 -std=c++14 -I/apps/2025/manual_install/nvhpc/24.11/Linux_aarch64/24.11/cuda/12.6/include -I/apps/2025/manual_install/nvhpc/24.11/Linux_aarch64/24.11/comm_libs/12.6/hpcx/hpcx-2.20/ompi/include -I/apps/2025/manual_install/nvhpc/24.11/Linux_aarch64/24.11/comm_libs/nccl/include kernel.cu -c
/apps/2025/manual_install/nvhpc/24.11/Linux_aarch64/24.11/comm_libs/12.6/hpcx/hpcx-2.20/ompi/bin/mpicxx -DUSE_NVTX -O3 -I/apps/2025/manual_install/nvhpc/24.11/Linux_aarch64/24.11/cuda/12.6/include -I/apps/2025/manual_install/nvhpc/24.11/Linux_aarch64/24.11/comm_libs/12.6/hpcx/hpcx-2.20/ompi/include -I/apps/2025/manual_install/nvhpc/24.11/Linux_aarch64/24.11/comm_libs/nccl/include -std=c++14 main.cpp kernel.o -L/apps/2025/manual_install/n

main.cpp:


In [41]:
%%bash 
cd etape6_nccl_overlap
# nsys does not always work under mpirun
NCCL_DEBUG=WARN mpirun -np 4 nsys profile -t mpi,cuda,nvtx --stats=true --force-overwrite true -o main ./app -niter 1000 -nx 4096 -ny 4096 -nccheck 1
#NCCL_DEBUG=WARN mpirun -np 4 ./app -niter 1000 -nx 4096 -ny 4096 -nccheck 1

Collecting data...
Collecting data...
Collecting data...
Collecting data...
NCCL version 2.19.3+cuda12.3
Single GPU jacobi relaxation: 1000 iterations on 4096 x 4096 mesh with norm check every 1 iterations
    0, 15.998031
  100, 0.448893
  200, 0.267746
  300, 0.197740
  400, 0.159443
  500, 0.134900
  600, 0.117686
  700, 0.104841
  800, 0.094847
  900, 0.086838
Jacobi relaxation: 1000 iterations on 4096 x 4096 mesh with norm check every 1 iterations
    0, 15.998046
  100, 0.448918
  200, 0.267779
  300, 0.197766
  400, 0.159464
  500, 0.134923
  600, 0.117701
  700, 0.104860
  800, 0.094871
  900, 0.086854
Num GPUs: 4.
4096x4096: 1 GPU:  29.8715 s, 4 GPUs:   8.9578 s, speedup:     3.33, efficiency:    83.37 
Generating '/tmp/nsys-report-1e2b.qdstrm'
Generating '/tmp/nsys-report-bf7c.qdstrm'
Generating '/tmp/nsys-report-4357.qdstrm'
Generating '/tmp/nsys-report-c351.qdstrm'


Export error: /dvs/p4/build/sw/devtools/Agora/Rel/QuadD_Main/QuadD/Common/StreamSections/StreamWithSections.cpp(741): Throw in function void QuadDCommon::parseProtobufFromStream(std::istream&, google::protobuf::Message&)
Dynamic exception type: boost::wrapexcept<QuadDCommon::ProtobufParseException>
std::exception::what: ProtobufParseException
[boost::errinfo_api_function_*] = parseProtobufFromStream



[3/7] Executing 'nvtx_sum' stats report


FATAL ERROR: /dvs/p4/build/sw/devtools/Agora/Rel/QuadD_Main/QuadD/Target/Daemon/Agent/OutputFile.cpp(726): Throw in function const boost::filesystem::path& QuadDDaemon::OutputFile::GetPath(QuadDDaemon::Extension) const
Dynamic exception type: boost::wrapexcept<QuadDCommon::FileException>
std::exception::what: FileException
[QuadDCommon::tag_message*] = Output file was never created



[2/7] [===22%                      ] main.sqlite


Export error: LZ4 decompression failed.
FATAL ERROR: /dvs/p4/build/sw/devtools/Agora/Rel/QuadD_Main/QuadD/Target/Daemon/Agent/OutputFile.cpp(726): Throw in function const boost::filesystem::path& QuadDDaemon::OutputFile::GetPath(QuadDDaemon::Extension) const
Dynamic exception type: boost::wrapexcept<QuadDCommon::FileException>
std::exception::what: FileException
[QuadDCommon::tag_message*] = Output file was never created



[3/7] Executing 'nvtx_sum' stats report
[2/7] [9%                          ] main.sqlite


Export error: LZ4 decompression failed.


[2/7] [10%                         ] main.sqlite[3/7] Executing 'nvtx_sum' stats report


FATAL ERROR: /dvs/p4/build/sw/devtools/Agora/Rel/QuadD_Main/QuadD/Target/Daemon/Agent/OutputFile.cpp(726): Throw in function const boost::filesystem::path& QuadDDaemon::OutputFile::GetPath(QuadDDaemon::Extension) const
Dynamic exception type: boost::wrapexcept<QuadDCommon::FileException>
std::exception::what: FileException
[QuadDCommon::tag_message*] = Output file was never created



[2/7] [===22%                      ] main.sqlite


Export error: LZ4 decompression failed.
FATAL ERROR: /dvs/p4/build/sw/devtools/Agora/Rel/QuadD_Main/QuadD/Target/Daemon/Agent/OutputFile.cpp(726): Throw in function const boost::filesystem::path& QuadDDaemon::OutputFile::GetPath(QuadDDaemon::Extension) const
Dynamic exception type: boost::wrapexcept<QuadDCommon::FileException>
std::exception::what: FileException
[QuadDCommon::tag_message*] = Output file was never created



[3/7] Executing 'nvtx_sum' stats report


## etape7_nccl_graphs
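**Description:** NCCL + CUDA Graphs: captures and replays the Jacobi/exchange sequence to reduce launch overhead.

**Purpose:** Reduces launch overhead through CUDA Graphs. In the run recorded below, this version is nonetheless slower on 4 GPUs (20.2274 s, 36.80% efficiency) than etape5_nccl and etape6_nccl_overlap.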

In [43]:
%%bash 
cd etape7_nccl_graphs
make all

/apps/2025/manual_install/nvhpc/24.11/Linux_aarch64/24.11/cuda/12.6/bin/nvcc -lineinfo -gencode arch=compute_90,code=sm_90 -gencode arch=compute_90,code=compute_90 -std=c++14 -I/apps/2025/manual_install/nvhpc/24.11/Linux_aarch64/24.11/cuda/12.6/include -I/apps/2025/manual_install/nvhpc/24.11/Linux_aarch64/24.11/comm_libs/12.6/hpcx/hpcx-2.20/ompi/include -I/apps/2025/manual_install/nvhpc/24.11/Linux_aarch64/24.11/comm_libs/nccl/include kernel.cu -c
/apps/2025/manual_install/nvhpc/24.11/Linux_aarch64/24.11/comm_libs/12.6/hpcx/hpcx-2.20/ompi/bin/mpicxx -DUSE_NVTX -O3 -I/apps/2025/manual_install/nvhpc/24.11/Linux_aarch64/24.11/cuda/12.6/include -I/apps/2025/manual_install/nvhpc/24.11/Linux_aarch64/24.11/comm_libs/12.6/hpcx/hpcx-2.20/ompi/include -I/apps/2025/manual_install/nvhpc/24.11/Linux_aarch64/24.11/comm_libs/nccl/include -std=c++14 main.cpp kernel.o -L/apps/2025/manual_install/nvhpc/24.11/Linux_aarch64/24.11/cuda/12.6/lib64 -L/apps/2025/manual_install/nvhpc/24.11/Linux_aarch64/24.11/

main.cpp:
          NCCL_CALL(ncclCommDeregister(nccl_comm, a_new_reg_handle));
          ^


          NCCL_CALL(ncclCommDeregister(nccl_comm, a_reg_handle));
          ^



In [45]:
%%bash 
cd etape7_nccl_graphs
# nsys does not always work under mpirun
NCCL_DEBUG=WARN mpirun -np 4 nsys profile -t mpi,cuda,nvtx --stats=true --force-overwrite true -o main ./app -niter 1000 -nx 4096 -ny 4096 -nccheck 1
#NCCL_DEBUG=WARN mpirun -np 4 ./app -niter 1000 -nx 4096 -ny 4096 -nccheck 1

Collecting data...
Collecting data...
Collecting data...
Collecting data...
NCCL version 2.19.3+cuda12.3
Single GPU jacobi relaxation: 1000 iterations on 4096 x 4096 mesh with norm check every 1 iterations
    0, 15.998031
  100, 0.448893
  200, 0.267747
  300, 0.197740
  400, 0.159443
  500, 0.134900
  600, 0.117686
  700, 0.104841
  800, 0.094847
  900, 0.086838
Jacobi relaxation: 1000 iterations on 4096 x 4096 mesh with norm check every 1 iterations
    0, 15.998045
  100, 0.448918
  200, 0.267779
  300, 0.197766
  400, 0.159464
  500, 0.134923
  600, 0.117701
  700, 0.104860
  800, 0.094871
  900, 0.086854
Num GPUs: 4.
4096x4096: 1 GPU:  29.7773 s, 4 GPUs:  20.2274 s, speedup:     1.47, efficiency:    36.80 
Generating '/tmp/nsys-report-7f01.qdstrm'
Generating '/tmp/nsys-report-f769.qdstrm'
Generating '/tmp/nsys-report-87e8.qdstrm'
Generating '/tmp/nsys-report-533d.qdstrm'


Importer error status: Importation failed.
File is corrupted.


Generated:
    /gpfs/home/colevalet/Cours/CHPS0904_RENDU/MultiGPU Programming Model/etape7_nccl_graphs/main.qdstrm


Importer error status: An unknown error occurred.
Dynamic exception type: boost::filesystem::filesystem_error
std::exception::what: boost::filesystem::file_size: No such file or directory [system:2]: "/gpfs/home/colevalet/Cours/CHPS0904_RENDU/MultiGPU Programming Model/etape7_nccl_graphs/main.nsys-rep"


    /gpfs/home/colevalet/Cours/CHPS0904_RENDU/MultiGPU Programming Model/etape7_nccl_graphs/main.qdstrm


Importer error status: Importation failed.
Import Failed with unexpected exception: /dvs/p4/build/sw/devtools/Agora/Rel/QuadD_Main/QuadD/Common/StreamSections/FileStream.cpp(368): Throw in function void QuadDCommon::FileStream::truncate(std::streamsize)
Dynamic exception type: boost::wrapexcept<QuadDCommon::InvalidArgumentException>
std::exception::what: InvalidArgumentException
[QuadDCommon::tag_message*] = Invalid truncate size.
[QuadDCommon::tag_report_file_name*] = "/gpfs/home/colevalet/Cours/CHPS0904_RENDU/MultiGPU Programming Model/etape7_nccl_graphs/main.nsys-rep"


[2/7] [0%                          ] main.sqliteGenerated:
    /gpfs/home/colevalet/Cours/CHPS0904_RENDU/MultiGPU Programming Model/etape7_nccl_graphs/main.qdstrm
[3/7] Executing 'nvtx_sum' stats report

 Time (%)  Total Time (ns)  Instances    Avg (ns)       Med (ns)      Min (ns)     Max (ns)    StdDev (ns)    Style              Range           
 --------  ---------------  ---------  -------------  -------------  -----------  -----------  ------------  -------  ---------------------------
     96.3      50019851712          2  25009925856.0  25009925856.0  20227132192  29792719520  6763891665.7  PushPop  :Jacobi solve              
      3.0       1574830784          1   1574830784.0   1574830784.0   1574830784   1574830784           0.0  PushPop  NCCL:ncclCommInitRank      
      0.3        136148448          1    136148448.0    136148448.0    136148448    136148448           0.0  PushPop  :NCCL_Warmup               
      0.2        114199968         70      1631428.1           96.
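
The `speedup` and `efficiency` figures printed by the etape5, etape6 and etape7 runs can be recomputed from the two timings on each summary line. The sketch below assumes the usual definitions (speedup = single-GPU time / multi-GPU time, efficiency = speedup / number of GPUs, in percent); the helper name `scaling` is hypothetical.

```python
def scaling(t_single, t_multi, n_gpus):
    """Return (speedup, efficiency in %) rounded to two decimals."""
    speedup = t_single / t_multi
    efficiency = 100.0 * speedup / n_gpus
    return round(speedup, 2), round(efficiency, 2)


# (1-GPU time, 4-GPU time) in seconds, copied from the run outputs above.
runs = {
    "etape5_nccl": (29.8297, 14.6582),
    "etape6_nccl_overlap": (29.8715, 8.9578),
    "etape7_nccl_graphs": (29.7773, 20.2274),
}

results = {name: scaling(t1, t4, 4) for name, (t1, t4) in runs.items()}
for name, (speedup, efficiency) in results.items():
    print(f"{name}: speedup {speedup:.2f}, efficiency {efficiency:.2f}%")
```

With these definitions the three summary lines are reproduced to two decimals, which makes the versions directly comparable: the single-GPU time is nearly constant (about 29.8 s), so the differences come entirely from the 4-GPU time.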

## etape8_nvshmem
**Description:** NVSHMEM: a PGAS model with one-sided GPU memory access, which simplifies halo updates.

**Purpose:** Simplifies halo exchanges through the one-sided PGAS model.
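
The idea of a one-sided halo update can be illustrated with a toy model in plain Python. This is not the NVSHMEM API: `SymmetricHeap`, `put_row` and `push_halos` are hypothetical names, and the lists merely stand in for GPU buffers that have the same shape on every processing element (PE).

```python
class SymmetricHeap:
    """Toy stand-in for a PGAS symmetric heap: one same-shaped buffer per PE.

    Each buffer has `rows` rows: row 0 and row rows - 1 are halo rows,
    the rows in between are owned by that PE.
    """

    def __init__(self, n_pes, rows, nx):
        self.buffers = [[[0.0] * nx for _ in range(rows)]
                        for _ in range(n_pes)]

    def put_row(self, target_pe, target_row, values):
        # One-sided: only the initiator calls this; the target does nothing.
        self.buffers[target_pe][target_row][:] = values


def push_halos(heap, pe, n_pes):
    """Write this PE's first and last owned rows into its neighbors' halos."""
    local = heap.buffers[pe]
    last_owned = len(local) - 2
    if pe > 0:
        # First owned row goes into the upper neighbor's bottom halo row.
        heap.put_row(pe - 1, len(local) - 1, local[1])
    if pe < n_pes - 1:
        # Last owned row goes into the lower neighbor's top halo row.
        heap.put_row(pe + 1, 0, local[last_owned])


heap = SymmetricHeap(n_pes=2, rows=4, nx=3)
heap.buffers[0][2] = [1.0, 2.0, 3.0]  # last owned row of PE 0
heap.buffers[1][1] = [4.0, 5.0, 6.0]  # first owned row of PE 1
for pe in range(2):
    push_halos(heap, pe, 2)
```

Compared with a matched send/receive pair, there is no receive call to post on the target side. A real implementation still needs some synchronization so that a PE does not start its next sweep before its halos have arrived.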

In [49]:
%%bash 
cd etape8_nvshmem
make all

/apps/2025/manual_install/nvhpc/24.11/Linux_aarch64/24.11/cuda/12.6/bin/nvcc -Xptxas --optimize-float-atomics -ccbin=mpic++ -dc -Xcompiler -fopenmp -lineinfo -DUSE_NVTX -ldl -gencode arch=compute_90,code=sm_90 -std=c++14 -I/apps/2025/manual_install/nvhpc/24.11/Linux_aarch64/24.11/cuda/12.6/include -I/apps/2025/manual_install/nvhpc/24.11/Linux_aarch64/24.11/comm_libs/12.6/hpcx/hpcx-2.20/ompi/include -I/apps/2025/manual_install/nvhpc/24.11/Linux_aarch64/24.11/comm_libs/nccl/include -I/apps/2025/manual_install/nvhpc/24.11/Linux_aarch64/24.11/comm_libs/nvshmem/include -I/apps/2025/manual_install/nvhpc/24.11/Linux_aarch64/24.11/comm_libs/nvshmem/include/nvshmem main.cu -c -o main.o
/apps/2025/manual_install/nvhpc/24.11/Linux_aarch64/24.11/cuda/12.6/bin/nvcc -gencode arch=compute_90,code=sm_90 main.o -o main -ccbin=mpic++ -L/apps/2025/manual_install/nvhpc/24.11/Linux_aarch64/24.11/cuda/12.6/lib64 -L/apps/2025/manual_install/nvhpc/24.11/Linux_aarch64/24.11/comm_libs/12.6/hpcx/hpcx-2.20/ompi/l

In [None]:
%%bash 
cd etape8_nvshmem
# nsys does not always work under mpirun
NCCL_DEBUG=WARN mpirun -np 4 nsys profile -t mpi,cuda,nvtx --stats=true --force-overwrite true -o main ./main -niter 1000 -nx 4096 -ny 4096 -nccheck 1
#NCCL_DEBUG=WARN mpirun -np 4 ./main -niter 1000 -nx 4096 -ny 4096 -nccheck 1

**Note:**

The versions that use LTO (link-time optimization) are not available here. For LTO to work with NVSHMEM, NVSHMEM itself must be built with LTO enabled. In our environment it was not, so the LTO executables for these steps cannot be built.
As a result, the steps that require LTO (etape9 to etape12) cannot be tested in this notebook.


## etape9_nvshmem_lto
**Description:** NVSHMEM + LTO: adds link-time optimization to inline critical functions and reduce call overhead.

**Purpose:** Link-time optimization to inline critical sections.

In [None]:
%%bash
cd etape9_nvshmem_lt
make all

In [None]:
%%bash 
cd etape9_nvshmem_lt
nv-nsight-cu-cli --csv --report-file rapport_etape9_nvshmem_lt.csv ./main
cat rapport_etape9_nvshmem_lt.csv

## etape10_vshmem_neighborhood_lto
**Description:** NVSHMEM neighborhood_sync + LTO: fine-grained neighborhood synchronization and O2 link-time optimizations.

**Purpose:** Fine-grained synchronization and LTO for tight loops.

In [None]:
%%bash 
cd etape10_vshmem_neighborhood_lto
make all

In [None]:
%%bash 
cd etape10_vshmem_neighborhood_lto
nv-nsight-cu-cli --csv --report-file rapport_etape10_vshmem_neighborhood_lto.csv ./main 
cat rapport_etape10_vshmem_neighborhood_lto.csv

## etape11_nvshmem_norm_overlap_neighborhood_sync_lto
**Description:** Combination: NVSHMEM with overlap, neighborhood synchronization, and LTO to maximize concurrency.

**Purpose:** Combines the best practices above into a heavily optimized binary.

In [None]:
%%bash
cd etape11_nvshmem_norm_overlap_neighborhood_sync_lto
make all

In [None]:
%%bash
cd etape11_nvshmem_norm_overlap_neighborhood_sync_lto
nv-nsight-cu-cli --csv --report-file rapport_etape11_nvshmem_norm_overlap_neighborhood_sync_lto.csv ./main
cat rapport_etape11_nvshmem_norm_overlap_neighborhood_sync_lto.csv

## etape12_nvshmem_norm_overlap_neighborhood_sync_lto_ext1
**Description:** Extended tuning: adjustable parameters (tile size, loop order) and benchmark hooks.

**Purpose:** Adds tuning parameters and benchmarking hooks.

In [None]:
%%bash 
cd etape12_nvshmem_norm_overlap_neighborhood_sync_lto_ext1 
make all

In [None]:
%%bash 
cd etape12_nvshmem_norm_overlap_neighborhood_sync_lto_ext1 
nv-nsight-cu-cli --csv --report-file rapport_etape12_nvshmem_norm_overlap_neighborhood_sync_lto_ext1.csv ./main
cat rapport_etape12_nvshmem_norm_overlap_neighborhood_sync_lto_ext1.csv