- The three key steps in porting to high performance accelerated code:
- Analyze/Identify parallelism
- Express data movement and parallelism
- Optimize data movement and loop performance
- Go back to 1!
- Analyze your code to determine most likely places needing parallelization or optimization.
- Parallelize your code by starting with the most time consuming parts and check for correctness.
- Optimize your code to improve observed speed-up from parallelization.
- One should generally start the process at the top with the analyze step. For complex applications, it's useful to have a profiling tool available to learn where your application is spending its execution time and to focus your efforts there. Since our example code is quite a bit simpler than a full application, we'll skip profiling the code and simply analyze the code by reading it
- Compiler diagnostics is usually the first thing to check when starting the OpenACC work
- It can tell you what operations were actually performed
- Data copies that were made
- If and how the loops were parallelized
- The diagnostics is very compiler dependent
- Compiler flags
- Level and formatting of information
- Diagnostics is controlled by compiler flag
-Minfo=option
- Useful options:
accel
– operations related to the acceleratorall
– print all compiler outputintensity
– print loop computational intensity info
- Define a loop to be parallelized
- C/C++:
#pragma acc loop [clauses]
- Fortran:
!$acc loop [clauses]
- Must be followed by a C/C++ or Fortran loop construct.
- Combined constructs with
parallel
andkernels
#pragma acc kernels loop / !$acc kernels loop
#pragma acc parallel loop / !$acc parallel loop
- Combined constructs with
- C/C++:
- Similar in functionality to OpenMP
for/do
construct - Loop index variables are
private
variables by default
Adding two vectors
CPU
./examples/sum.c
OpenACC parallel
./examples/sum_parallel.c
OpenACC kernels
./examples/sum_kernels.c
$ pgcc -g -O3 -acc -Minfo=acc sum_parallel.c -o sum
main:
21, Generating copy(vecA[:],vecB[:],vecC[:]) [if not already present]
23, Generating Tesla code
25, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
$ pgcc -g -O3 -acc -Minfo=accel sum_kernels.c -o sum
main:
21, Generating copy(vecA[:],vecB[:],vecC[:]) [if not already present]
23, Loop is parallelizable
Generating Tesla code
23, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
$ cat slurm-13186502.out #output of sum_parallel.c
Accelerator Kernel Timing data
main NVIDIA devicenum=0
time(us): 451
21: data region reached 2 times
21: data copyin transfers: 3
device time(us): total=245 max=100 min=71 avg=81
29: data copyout transfers: 3
device time(us): total=206 max=72 min=67 avg=68
23: compute region reached 1 time
23: kernel launched 1 time
grid: [800] block: [128]
elapsed time(us): total=41 max=41 min=41 avg=41
Reduction sum: 1.2020569031119108
$ cat slurm-13186514.out
- Accelerator Kernel Timing data
- main NVIDIA devicenum=0
time(us): 453 21: data region reached 2 times 21: data copyin transfers: 3 device time(us): total=247 max=100 min=72 avg=82 26: data copyout transfers: 3 device time(us): total=206 max=73 min=66 avg=68 23: compute region reached 1 time 23: kernel launched 1 time grid: [800] block: [128] elapsed time(us): total=40 max=40 min=40 avg=40
Reduction sum: 1.2020569031119108
- NVIDIA nvprof provides a simple interface to collect on a target without using the GUI. (new NSight system)
- GPU profiling capabilities: High-level usage statistics, Timeline collection, Analysis metrics
- basic CPU sampling
Solving Poisson’s equation for incompressible fluid by the Jacobi iteration method.
(https://blogs.fau.de/hager/archives/7850)
himeno code
c
./examples/himeno_C.c
fortran
../examples/himeno_f77.f
$ srun -n 1 nvprof --cpu-profiling on --cpu-profiling-mode top-down ./himeno.x
======== CPU profiling result (top down):
Time(%) Time Name
85.74% 18.16s jacobi
7.46% 1.58s initmt
0.94% 200ms | ???
6.75% 1.43s __c_mcopy4_sky
0.05% 10ms __c_mcopy4
- Profiling is essential for optimization
- NVPROF and NVVP for NVIDIA platform