

# **MPI+CUDA**



# **MPI+CUDA**



#### **MPI+CUDA**



```
//MPI rank 0
MPI_Send(s_buf_d,size,MPI_CHAR,n-1,tag,MPI_COMM_WORLD);

//MPI rank n-1
MPI_Recv(r_buf_d,size,MPI_CHAR,0,tag,MPI_COMM_WORLD,&stat);
```



#### YOU WILL LEARN

What MPI is

How to use MPI for inter GPU communication with CUDA and OpenACC

What CUDA-aware MPI is

What Multi Process Service is and how to use it

How to use NVIDIA tools in an MPI environment

How to hide MPI communication times



### **MESSAGE PASSING INTERFACE - MPI**

Standard to exchange data between processes via messages

Defines API to exchanges messages

Point to Point: e.g. MPI Send, MPI Recv

Collectives: e.g. MPI\_Reduce

Multiple implementations (open source and commercial)

Bindings for C/C++, Fortran, Python, ...

E.g. MPICH, OpenMPI, MVAPICH, IBM Platform MPI, Cray MPT, ...



### **MPI - SKELETON**

```
#include <mpi.h>
int main(int argc, char *argv[]) {
    int rank, size;
    /* Initialize the MPI library */
   MPI Init (&argc, &argv);
    /* Determine the calling process rank and total number of ranks */
    MPI Comm rank (MPI COMM WORLD, & rank);
    MPI Comm size (MPI COMM WORLD, & size);
    /* Call MPI routines like MPI Send, MPI Recv, ... */
    /* Shutdown MPI library */
    MPI Finalize();
    return 0;
```

### **MPI**

#### Compiling and Launching



# A SIMPLE EXAMPLE

### **EXAMPLE: JACOBI SOLVER**

Solves the 2D-Laplace Equation on a rectangle

$$\Delta u(x,y) = \mathbf{0} \ \forall \ (x,y) \in \Omega \backslash \delta \Omega$$

Dirichlet boundary conditions (constant values on boundaries)

$$u(x,y) = f(x,y) \in \delta\Omega$$

2D domain decomposition with n x k domains





#### **EXAMPLE: JACOBI SOLVER**

#### Single GPU

While not converged

Do Jacobi step:

```
for (int iy=1; iy < ny-1; ++iy)

for (int ix=1; ix < nx-1; ++ix)

u_new[ix][iy] = 0.0f - 0.25f*( u[ix-1][iy] + u[ix+1][iy] 
+ u[ix][iy-1] + u[ix][iy+1]);</pre>
```

 $\textbf{Swap} \ u\_\texttt{new} \ \textbf{and} \ u$ 

Next iteration



#### **EXAMPLE: JACOBI SOLVER**

#### Multi GPU

While not converged

Do Jacobi step:

Exchange halo with 2 4 neighbors

Swap u new and u

Next iteration



#### **EXAMPLE JACOBI**

#### Top/Bottom Halo

```
MPI Sendrecv u new+offset first row m-2, MPI DOUBLE, t nb, 0,
            u_new+offset_bottom_boundary, m-2, MPI_DOUBLE, b_nb, 0,
             MPI COMM WORLD, MPI STATUS IGNORE);
MPI Sendrecv u new+offset last row, m-2, MPI DOUBLE, b nb, 1,
            u new+offset top boundary, m-2, MPI DOUBLE, t nb, 1,
             MPI COMM WORLD, MPI STATUS IGNORE);
```



#### **EXAMPLE JACOBI**

#### Top/Bottom Halo

```
#pragma acc host data use device ( u new ) {
MPI Sendrecv u new+offset first row, m-2, MPI DOUBLE, t nb, 0,
             u new+offset bottom boundary, m-2, MPI DOUBLE, b nb, 0,
             MPI COMM WORLD, MPI STATUS IGNORE);
MPI Sendrecv u new+offset last row, m-2, MPI_DOUBLE, b_nb, 1,
             u new+offset top boundary, m-2, MPI_DOUBLE, t_nb, 1,
             MPI COMM WORLD, MPI STATUS IGNORE);
MPI Sendrecv (u new d+offset first row, m-2, MPI DOUBLE, t nb, 0,
             u new d+offset bottom boundar_{y}, m-2, MPI DOUBLE, b nb, 0,
             MPI COMM WORLD, MPI STATUS IGNORE);
MPI Sendrecv u new d+offset last row m-2, MPI DOUBLE, b nb, 1,
             u_new_d+offset top boundary, m-2, MPI DOUBLE, t nb, 1,
             MPI COMM WORLD, MPI STATUS IGNORE);
```



### **EXAMPLE: JACOBI**

#### Left/Right Halo

```
//right neighbor omitted
#pragma acc parallel loop present ( u new, to left )
for ( int i=0; i<n-2; ++i )</pre>
        to left[i] = u new[(i+1)*m+1];
#pragma acc host_data use_device ( from_right, to_left ) {
 MPI Sendrecv ( to left, n-2, MPI DOUBLE, 1 nb, 0,
                from right, n-2, MPI DOUBLE, r nb, 0,
                MPI COMM WORLD, MPI STATUS IGNORE );
#pragma acc parallel loop present ( u new, from right )
for ( int i=0; i<n-2; ++i )</pre>
   u new[(m-1)+(i+1)*m] = from right[i];
```





### **EXAMPLE: JACOBI**

#### Left/Right Halo

```
//right neighbor omitted
pack << gs, bs, 0, s >>> (to left d, u new d, n, m);
cudaStreamSynchronize(s);
  MPI Sendrecv ( to left_d, n-2, MPI_DOUBLE, l_nb, 0,
                from right d, n-2, MPI DOUBLE, r nb, 0,
                MPI COMM WORLD, MPI STATUS IGNORE );
unpack<<<gs,bs,0,s>>>(u new d, from right d, n, m);
```







#### LAUNCH MPI+CUDA/OPENACC PROGRAMS

#### Launch one process per GPU

```
MVAPICH: MV2 USE CUDA
```

```
$ MV2_USE_CUDA=1 mpirun -np ${np} ./myapp <args>
```

Open MPI: CUDA-aware features are enabled per default

Cray: MPICH RDMA ENABLED CUDA

IBM Spectrum MPI:

```
$ mpirun -gpu -np ${np} ./myapp <args>
```





How to determine the local rank? - MPI-3

```
MPI Comm loc comm;
MPI Comm split type (MPI COMM WORLD, MPI COMM TYPE SHARED, rank, MPI INFO NULL, &loc comm);
int local rank = -1;
MPI Comm rank(loc comm, & local rank);
MPI Comm free (&loc comm);
```



#### **GPU-affinity**

#### Use local rank:

```
int local_rank = -1;
MPI_Comm_rank(local_comm, &local_rank);
int num_devices = 0;
cudaGetDeviceCount(&num_devices);
cudaSetDevice(local_rank % num_devices);
```

# EXAMPLE JACOBI

#### Top/Bottom Halo

```
without
CUDA-aware
   MPI
```

```
#pragma acc update host(u new[offset first row:m-2],u new[offset last row:m-2])
MPI Sendrecv (u new+offset first row, m-2, MPI DOUBLE, t nb, 0,
             u new+offset bottom boundary, m-2, MPI DOUBLE, b nb, 0,
             MPI COMM WORLD, MPI STATUS IGNORE);
MPI Sendrecv (u new+offset last row, m-2, MPI DOUBLE, b nb, 1,
             u new+offset top boundary, m-2, MPI DOUBLE, t nb, 1,
             MPI COMM WORLD, MPI STATUS IGNORE);
#pragma acc update device(u new[offset top boundary:m-2],u new[offset bottom boundary:m-2])
//send to bottom and receive from top top bottom omitted
cudaMemcpy ( u new+offset first row,
             u new d+offset first row, (m-2)*sizeof(double), cudaMemcpyDeviceToHost);
MPI Sendrecv (u new+offset first row, m-2, MPI DOUBLE, t nb, 0,
             u new+offset bottom boundary, m-2, MPI DOUBLE, b nb, 0,
             MPI COMM WORLD, MPI STATUS IGNORE);
cudaMemcpy ( u new d+offset bottom boundary,
             u new+offset bottom boundary, (m-2) *sizeof(double), cudaMemcpyDeviceToHost);
```

# THE DETAILS

### UNIFIED VIRTUAL ADDRESSING

No UVA: Separate Address Spaces UVA: Single Address Space





### UNIFIED VIRTUAL ADDRESSING



One address space for all CPU and GPU memory

Determine physical memory location from a pointer value

Enable libraries to simplify their interfaces (e.g. MPI and cudaMemcpy)

Supported on devices with compute capability 2.0+ for

64-bit applications on Linux and Windows (+TCC)



# **NVIDIA GPUDIRECT™**

#### Peer to Peer Transfers



### **NVIDIA GPUDIRECT™**

#### Support for RDMA



### **CUDA-AWARE MPI**

#### Example:

MPI Rank 0 MPI Send from GPU Buffer

MPI Rank 1 MPI Recv to GPU Buffer

Show how CUDA+MPI works in principle

Depending on the MPI implementation, message size, system setup, ... situation might be different

Two GPUs in two nodes

### MPI GPU TO REMOTE GPU



```
MPI_Send(s_buf_d,size,MPI_CHAR,1,tag,MPI_COMM_WORLD);
MPI_Recv(r_buf_d,size,MPI_CHAR,0,tag,MPI_COMM_WORLD,&stat);
```

### MPI GPU TO REMOTE GPU

Support for RDMA



#### REGULAR MPI GPU TO REMOTE GPU



```
cudaMemcpy(s_buf_h,s_buf_d,size,cudaMemcpyDeviceToHost);
MPI_Send(s_buf_h,size,MPI_CHAR,1,tag,MPI_COMM_WORLD);

MPI_Recv(r_buf_h,size,MPI_CHAR,0,tag,MPI_COMM_WORLD,&stat);
cudaMemcpy(r_buf_d,r_buf_h,size,cudaMemcpyHostToDevice);
```



### REGULAR MPI GPU TO REMOTE GPU



#### MPI GPU TO REMOTE GPU

#### without GPUDirect



```
MPI_Send(s_buf_h,size,MPI_CHAR,1,tag,MPI_COMM_WORLD);
MPI_Recv(r_buf_h,size,MPI_CHAR,0,tag,MPI_COMM_WORLD,&stat);
```

# MPI GPU TO REMOTE GPU

without GPUDirect



#### PERFORMANCE RESULTS GPUDIRECT RDMA

MVAPICH2-GDR 2.3a DGX-1V Tesla V100



Latency (1 Byte)

16.75 us

18.68 us

3.25 us

### PERFORMANCE RESULTS GPUDIRECT P2P

MVAPICH2-GDR 2.3a DGX-1V Tesla V100



# MULTI PROCESS SERVICE (MPS) FOR MPI APPLICATIONS

#### GPU ACCELERATION OF LEGACY MPI APPS

Typical legacy application

MPI parallel

Single or few threads per MPI rank (e.g. OpenMP)

Running with multiple MPI ranks per node

GPU acceleration in phases

Proof of concept prototype, ...

Great speedup at kernel level

Application performance misses expectations



# **MULTI PROCESS SERVICE (MPS)**

For Legacy MPI Applications



Multicore CPU only

**GPU-accelerated** 



## PROCESSES SHARING GPU WITHOUT MPS

#### No Overlap





## PROCESSES SHARING GPU WITHOUT MPS

#### **Context Switch Overhead**



## PROCESSES SHARING GPU WITH MPS

#### Maximum Overlap





## PROCESSES SHARING GPU WITH MPS

#### No Context Switch Overhead



# HYPER-Q/MPS CASE STUDY: UMT



# HYPER-Q/MPS CASE STUDIES

**CPU Scaling Speedup** 



## HYPER-Q/MPS CASE STUDIES

Additional Speedup with MPS



## **USING MPS**

No application modifications necessary

Not limited to MPI applications

MPS control daemon

Spawn MPS server upon CUDA application startup

```
#Typical Setup

nvidia-smi -c EXCLUSIVE_PROCESS

nvidia-cuda-mps-control -d

#On Cray XK/XC systems
export CRAY_CUDA_MPS=1
```



## MPS: IMPROVEMENTS WITH VOLTA

More MPS clients per GPU: 48 instead of 16

**Less overhead:** Volta MPS clients submit work directly to the GPU without passing through the MPS server.

More security: Each Volta MPS client owns its own GPU address space instead of sharing GPU address space with all other MPS clients.

**More control:** Volta MPS supports limited execution resource provisioning for Quality of Service (QoS). ->

CUDA MPS ACTIVE THREAD PERCENTAGE



## MPS SUMMARY

Easy path to get GPU acceleration for legacy applications

Enables overlapping of memory copies and compute between different MPI ranks

Remark: MPS adds some overhead!





# **DEBUGGING AND PROFILING**

#### **TOOLS FOR MPI+CUDA APPLICATIONS**

Memory checking: cuda-memcheck

Debugging: cuda-gdb

Profiling: nvprof and the NVIDIA Visual Profiler (nvvp)

#### MEMORY CHECKING WITH CUDA-MEMCHECK

cuda-memcheck is a tool similar to Valgrind's memcheck

#### Can be used in a MPI environment

```
mpiexec -np 2 cuda-memcheck ./myapp <args>
```

Problem: Output of different processes is interleaved

Solution: Use save or log-file command line options

```
OpenMPI: OMPI_COMM_WORLD_RANK
```

MVAPICH2: MV2\_COMM\_WORLD\_RANK

#### MEMORY CHECKING WITH CUDA-MEMCHECK

```
- -
jkraus@ivb114:~/workspace/qwiklabs/Multi-GPU-MPI/task3
[jkraus@ivb114 task3]$ mpirun -np 2 cuda-memcheck --log-file laplace2d.%q{OMPI COMM WORLD RANK}
log --save laplace2d.%q{OMPI COMM WORLD RANK}.memcheck ./laplace2d
Jacobi relaxation Calculation: 2048 x 2048 mesh
Calculate reference solution and time serial execution.
call to cuMemcpyDtoHAsync returned error 719: Launch failed (often invalid pointer dereference)
call to cuMemcpyDtoHAsync returned error 719: Launch failed (often invalid pointer dereference)
Primary job terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
  Process name: [[42894,1],0]
  Exit code:
[jkraus@ivb114 task3]$ ls laplace2d.*.log laplace2d.*.memcheck
aplace2d.0.log laplace2d.0.memcheck laplace2d.1.log laplace2d.1.memcheck
[jkraus@ivb114 task3]$
```

## MEMORY CHECKING WITH CUDA-MEMCHECK

Read Output Files with cuda-memcheck --read

```
jkraus@ivb114:~/workspace/qwiklabs/Multi-GPU-MPI/task3
              Saved host backtrace up to driver entry point at kernel launch time
             Host Frame: /usr/lib64/libcuda.so.1 (cuLaunchKernel + 0x2cd) [0x150bbd]
             Host Frame:/shared/apps/pgi/centos-6.2/linux86-64/15.1/lib/libaccn.so ( pgi uacc
cuda launch + 0x1796) [0x10896]
             Host Frame:/shared/apps/pgi/centos-6.2/linux86-64/15.1/lib/libaccg.so ( pgi uacc
launch + 0x1a5) [0x10ed5]
             Host Frame:./laplace2d [0x26fd]
         Invalid global write of size 4
             at 0x00000778 in /home-2/jkraus/workspace/qwiklabs/Multi-GPU-MPI/task3/./laplace2d
serial.h:35:laplace2d serial 32 gpu
             by thread (33,0,0) in block (8,6,0)
             Address 0x24edd2f088 is out of bounds
              Saved host backtrace up to driver entry point at kernel launch time
             Host Frame:/usr/lib64/libcuda.so.1 (cuLaunchKernel + 0x2cd) [0x150bbd]
             Host Frame:/shared/apps/pgi/centos-6.2/linux86-64/15.1/lib/libaccn.so ( pgi uacc
cuda launch + 0x1796) [0x10896]
             Host Frame:/shared/apps/pgi/centos-6.2/linux86-64/15.1/lib/libaccg.so ( pgi uacc
launch + 0x1a5) [0x10ed5]
             Host Frame:./laplace2d [0x26fd]
   ===== Invalid global write of size 4
              at 0x00000778 in /home-2/jkraus/workspace/qwiklabs/Multi-GPU-MPI/task3/./laplace2d
  erial.h:35:laplace2d serial 32 gpu
```

#### Using cuda-gdb with MPI Applications

Use cuda-gdb just like gdb

For smaller applications, just launch xterms and cuda-gdb

mpiexec -x -np 2 xterm -e cuda-gdb ./myapp <args>

```
- - X
cuda-gdb
   NVIDIA (R) CUDA Debugger
   5.0 release
   Portions Copyright (C) 2007-2012 NVIDIA Corporation
   GNU adb (GDB) 7.2
   Copyright (C) 2010 Free Software Foundation, Inc.
   License GPLv3+: GNU GPL version 3 or later <a href="http://gnu.org/licenses/gpl.html">http://gnu.org/licenses/gpl.html</a>
   This is free software: you are free to change and redistribute it.
   There is NO WARRANTY, to the extent permitted by law. Type "show copying"
   and "show warranty" for details.
   This GDB was configured as "x86_64-unknown-linux-gnu".
   For bug reporting instructions, please see:
   <a href="http://www.gnu.org/software/gdb/bugs/>...">http://www.gnu.org/software/gdb/bugs/>...</a>
   Reading symbols from /homeb/zam/jkraus/workspace/JSC-GPU-Course/CUDA-aware MPI/e
   xercises/tasks/jacobi_mpi+cuda...done.
   (cuda-gdb) run
   Starting program: /homeb/zam/.jkraus/workspace/JSC-GPU-Course/CUDA-aware_MPI/exer
   cises/tasks/jacobi mpi+cuda
   [Thread debugging using libthread_db enabled]
```

```
- - X
cuda-gdb
  NVIDIA (R) CUDA Debugger
  5.0 release
  Portions Copyright (C) 2007-2012 NVIDIA Corporation
   GNU gdb (GDB) 7.2
   Copyright (C) 2010 Free Software Foundation, Inc.
   License GPLv3+: GNU GPL version 3 or later <a href="http://gnu.org/licenses/gpl.html">http://gnu.org/licenses/gpl.html</a>
   This is free software: you are free to change and redistribute it.
   There is NO WARRANTY, to the extent permitted by law. Type "show copying"
   and "show warranty" for details.
   This GDB was configured as "x86_64-unknown-linux-gnu".
   For bug reporting instructions, please see:
   <http://www.gnu.org/software/gdb/bugs/>...
  Reading symbols from /homeb/zam/.jkraus/workspace/JSC-GPU-Course/CUDA-aware_MPI/e
   xercises/tasks/jacobi mpi+cuda...done.
   (cuda-gdb) run
  Starting program: /homeb/zam/.jkraus/workspace/JSC-GPU-Course/CUDA-aware_MPI/exer
   cises/tasks/jacobi mpi+cuda
   [Thread debugging using libthread_db enabled]
```

cuda-gdb Attach

```
if ( rank == 0 ) {
        int i=0;
        printf("rank %d: pid %d on %s ready for attach\n.", rank, getpid(), name);
        while (0 == i) \{ sleep(5); \}
> mpiexec -np 2 ./jacobi mpi+cuda
Jacobi relaxation Calculation: 4096 x 4096 mesh with 2 processes and one Tesla M2070 for
each process (2049 rows per process).
rank 0: pid 30034 on judge107 ready for attach
> ssh judge107
jkraus@judge107:~> cuda-gdb --pid 30034
```

CUDA DEVICE WAITS ON EXCEPTION



With CUDA\_ENABLE\_COREDUMP\_ON\_EXCEPTION=1 core dumps are generated in case of an exception:

Can be used for offline debugging

Helpful if live debugging is not possible

CUDA\_ENABLE\_CPU\_COREDUMP\_ON\_EXCEPTION: Enable/Disable CPU part of core dump (enabled by default)

CUDA\_COREDUMP\_FILE: Specify name of core dump file

Open GPU: (cuda-gdb) target cudacore core.cuda

Open CPU+GPU: (cuda-gdb) target core core.cpu core.cuda



CUDA ENABLE COREDUMP ON EXCEPTION



CUDA ENABLE COREDUMP ON EXCEPTION

```
ikraus@ivb114:~/workspace/CUDA-aware_MPI/cuda/exercises/solutions
NVIDIA (R) CUDA Debugger
7.0 release
Portions Copyright (C) 2007-2014 NVIDIA Corporation
GNU qdb (GDB) 7.6.2
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <a href="http://gnu.org/licenses/gpl.html">http://gnu.org/licenses/gpl.html</a>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86 64-unknown-linux-gnu".
For bug reporting instructions, please see:
<a href="http://www.gnu.org/software/gdb/bugs/">http://www.gnu.org/software/gdb/bugs/>.</a>
(cuda-qdb) target cudacore core.cuda.ivb114.28722
Opening GPU coredump: core.cuda.ivb114.28722
[New Thread 28742]
CUDA Exception: Device Illegal Address
The exception was triggered in device 0.
[Current focus set to CUDA kernel 0, grid 1, block (107,0,0), thread (0,12,0), device 0, sm 12,
warp 6, lane 01
#0 0x000000001c02ac0 in jacobi kernel<<<(257,129,1),(16,16,1)>>> (u d=0x23048a0000,
    unew d=0x23068c0000, n=2049, m=4096, residue d=0x23088e0000) at jacobi cuda kernel.cu:43
                          residue = fabsf(unew d[j *m+ i]-u d[j *m+ i]);
 cuda-gdb)
```

#### **Third Party Tools**

Allinea DDT debugger

Rogue Wave TotalView







#### Using nvprof+NVVP

New since CUDA 9

Embed MPI rank in output filename, process name, and context name (OpenMPI)

#### Alternatives:

Only save the textual output (--log-file)

MVAPICH2: MV2\_COMM\_WORLD\_RANK
--annotate-mpi mpich

Collect data from all processes that run on a node (--profile-all-processes)



#### Using nvprof+NVVP



#### Using nvprof+NVVP



**Third Party Tools** 

Multiple parallel profiling tools are CUDA-aware

Score-P

Vampir

Tau

These tools are good for discovering MPI issues as well as basic CUDA performance inhibitors.





# **ADVANCED MPI ON GPUS**

## **BEST PRACTICE: USE NON-BLOCKING MPI**

```
#pragma acc host data use device ( u new ) {
MPI Sendrecv (u new+offset first row, m-2, MPI DOUBLE, t nb, 0,
             u new+offset bottom boundary, m-2, MPI DOUBLE, b nb, 0,
            MPI COMM WORLD, MPI STATUS IGNORE);
MPI Sendrecv (u new+offset last row, m-2, MPI DOUBLE, b nb, 1,
            u new+offset top boundary, m-2, MPI DOUBLE, t nb, 1,
             MPI COMM WORLD, MPI STATUS IGNORE);
                                                      Gives MPI more
MPI Request t b req[4];
                                                   opportunities to build
#pragma acc host data use device ( u new ) {
 MPI Irecv (u new+offset top boundary, m-2, MPI DOUBL
                                                       efficient piplines
 MPI Irecv(u new+offset bottom boundary,m-2,MP)
                                                                       _p_req+2);
 MPI_Isend(u_new+offset last row,m-2,MPI_DOUBLE
 MPI Isend(u new+offset first row,m-2,MPI DOUBLE
                                                              INV WORLD, t b req+3);
MPI Waitall (4, t b req, MPI STATUSES IGNORE);
```

OpenMPI 3.0.1RC1 - DGX-1V - 2048x2048





#### **CUDA** with Streams

```
process boundary and pack<<gs b,bs b,0,s1>>>(u new d,u d,to left d,to right d,n,m);
process inner domain <<<gs id, bs id, 0, s2>>> (u new d, u d, to left d, to right d, n, m);
MPI Request req[8];
//Exchange halo with left, right, top and bottom neighbor
MPI Waitall (8, req, MPI STATUSES IGNORE);
unpack<<<gs s,bs s,0,s2>>>(u new d, from left d, from right d, n, m);
                             //wait for iteration to finish
cudaDeviceSynchronize();
```

#### OpenACC with Async Queues

```
#pragma acc parallel loop present ( u new, u, to left, to right ) async(1)
for ( ... )
    //Process boundary and pack to left and to right
#pragma acc parallel loop present ( u new, u ) async(2)
for ( ... )
   //Process inner domain
#pragma acc wait(1) //wait for boundary
MPI Request req[8];
#pragma acc host data use device ( from left, to left, form right, to right, u new ) {
    //Exchange halo with left, right, top and bottom neighbor
MPI Waitall (8, req, MPI STATUSES IGNORE);
#pragma acc parallel loop present ( u new, from left, from right ) async(2)
for ( ... )
   //unpack from left and from right
#pragma acc wait
                                 //wait for iteration to finish
```

OpenMPI 3.0.1RC1 - DGX-1V - 2048x2048



## HIGH PRIORITY STREAMS

#### Improve scalability with high priority streams

cudaStreamCreateWithPriority

Use-case: MD Simulations



#### **CAVEAT**

Using Unified Memory with a non Unified Memory-aware MPI might fail with errors or even worse silently produce wrong results, e.g. when registering Unified Memory for RDMA.



Use a Unified Memory-aware MPI, e.g. OpenMPI since 1.8.5 or MVAPICH2-GDR since 2.2b

Unified Memory-aware: CUDA-aware MPI with support for Unified Memory



#### **Performance Implications**

Unified Memory can be used by any processor in the system

Memory pages of a Unified Memory allocation may migrate between processors memories to ensure coherence and maximize performance

Different data paths are optimal for performance depending on where the data is: e.g. NVLink between peer GPUs



The MPI implementation needs to know where the data is,

but it can't!

#### Performance Implications - Simple Example

```
cudaMallocManaged( &array, n*sizeof(double), cudaMemAttachGlobal );
while( ... ) {
    foo(array,n);
    MPI_Send(array,...);
    foo(array,n);
```

#### Performance Implications - Simple Example

- If foo is a CPU function pages of array might migrate to System Memory
- If foo is a GPU function pages of array might migrate to GPU Memory
- The MPI implementation is not aware of the application and thus doesn't know where array is and what's optimal

```
while( ... ) {
    foo(array,n);
    MPI_Send(array,...);
    foo(array,n);
}
```

#### The Future with Data Usage Hints

Tell where the application intends to use the data

Remark: Data Usage Hints are available since CUDA 8, but currently not evaluated by any Unified Memory-aware MPI implementation.

#### The Future with Data Usage Hints

Tell where the application intends to use the data

```
cudaMallocManaged( &array, n*sizeof(double), cudaMemAttachGlobal );

cudaMemAdvise(array,n*sizeof(double), cudaMemAdviseSetPreferredLocation, cudaCpuDeviceId);

while( ... ) {
    foo(array,n);
    MPI_Send(array,...);
    foo(array,n);

Array is intended to be used on the CPU
```

Remark: Data Usage Hints are available since CUDA 8, but currently not evaluated by any Unified Memory-aware MPI implementation.

#### The Future with Data Usage Hints - Summary

Data usage hints can be queried by the MPI Implementation and allow it to take the optimal data path

If the application lies about the data usage hints it will run correctly but performance will be affected

Performance tools help to identify missing or wrong data usage hints

Data usage hints are general useful for the Unified Memory system and can improve application performance.

Remark: Data Usage Hints are only hints to guide the data usage policies of the Unified Memory system. The Unified Memory system might ignore them, e.g. to ensure coherence or in oversubscription scenarios.



#### **Current Status**

Available Unified Memory-aware MPI implementations

- OpenMPI (since 1.8.5)
- MVAPICH2-GDR (since 2.2b)
  - Performance improvements with 2.2RC1 for Intranode GPU to GPU communication

Currently both don't evaluate Data Usage Hints, i.e. all Unified Memory is treated as Device Memory



Good performance if all buffers used in MPI are touched mainly on the GPU.

Without Unified Memory-aware MPI

Only use non Unified Memory Buffers for MPI: cudaMalloc, cudaMallocHost or malloc

Application managed non Unified Memory Buffers also allow to work around current missing cases in Unified Memory-aware MPI Implementations.

## **DETECTING CUDA-AWARENESS**

OpenMPI (since 2.0.0):

Macro:

MPIX CUDA AWARE SUPPORT

Function for runtime decisions

MPIX\_Query\_cuda\_support()

Include mpi-ext.h for both.

See <a href="http://www.open-mpi.org/faq/?category=runcuda#mpi-cuda-aware-support">http://www.open-mpi.org/faq/?category=runcuda#mpi-cuda-aware-support</a>



### THANK YOU FOR YOUR ATTENTION

#### **Questions?**

Tuesday, March 27, 2:00 PM - 2:50 PM: S8316 - Multi GPU Programming Models - Room 211A

Wednesday, March 28, 9:00 AM - 10:00 AM: CE8110 - Multi-GPU Programming - LL Pod C

Wednesday, March 28, 10:00 AM - 10:50 AM: S8595 - NVSHMEM: A Partitioned Global Address Space Library for NVIDIA GPU Clusters - Room 211B

Wednesday, March 28, 11:00 AM - 11:50 AM: CE8124 NVSHMEM - LL Pod C

Wednesday, March 28, 2:00 PM - 2:50 PM: S8373 - MVAPICH2-GDR: Pushing the Frontier of Designing MPI Libraries Enabling GPUDirect Technologies - Room 211B

