## NPB-openACC-C-MG Implementation
In this self-paced, hands-on lab, we will briefly explore some methods for accelerating the NPB MG benchmark (C version) with OpenACC.

Qichao Hong

---
Before we begin, let's verify [WebSockets](http://en.wikipedia.org/wiki/WebSocket) are working on your system.  To do this, execute the cell block below by giving it focus (clicking on it with your mouse), and hitting Ctrl-Enter, or pressing the play button in the toolbar above.  If all goes well, you should see some output returned below the grey cell.  If not, please consult the [Self-paced Lab Troubleshooting FAQ](https://developer.nvidia.com/self-paced-labs-faq#Troubleshooting) to debug the issue.

In [1]:
print ("The answer should be three: " + str(1+2))

The answer should be three: 3


First, run the cell below to get some info about the GPUs on the server.

In [2]:
!nvidia-smi

Tue May 23 06:15:37 2017       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.51                 Driver Version: 375.51                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|   0  GeForce GTX 780 Ti  Off  | 0000:01:00.0     N/A |                  N/A |
| 26%   36C    P8    N/A /  N/A |    345MiB /  3017MiB |     N/A      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 780 Ti  Off  | 0000:02:00.0     N/A |                  N/A |
| 26%   36C    P8    N/A /  N/A |      1MiB /  3020MiB |     N/A      Default |
+-------------------------------+----------------------+----------------------+
                                                                            

CPU: Intel i7-4960x

---
<p class="hint_trigger">If you have never before taken an IPython Notebook based self-paced lab from NVIDIA, click this green box.
      <div class="toggle_container"><div class="input_area box-flex1"><div class=\"highlight\">The following video will explain the infrastructure we are using for this self-paced lab, as well as give some tips on its usage.  If you've never taken a lab on this system before, we highly encourage you to watch this short video first.<br><br>
<div align="center"><iframe width="640" height="390" src="http://www.youtube.com/embed/ZMrDaLSFqpY" frameborder="0" allowfullscreen></iframe></div>
<br>
<h2 style="text-align:center;color:red;">Attention Firefox Users</h2><div style="text-align:center; margin: 0px 25px 0px 25px;">There is a bug with Firefox related to setting focus in any text editors embedded in this lab. Even though the cursor may be blinking in the text editor, focus for the keyboard may not be there, and any keys you press may be applying to the previously selected cell.  To work around this issue, you'll need to first click in the margin of the browser window (where there are no cells) and then in the text editor.  Sorry for this inconvenience, we're working on getting this fixed.</div></div></div></div></p>

## Introduction to OpenACC

Open-specification OpenACC directives are a straightforward way to accelerate existing Fortran and C applications. With OpenACC directives, you provide hints via compiler directives (or 'pragmas') to tell the compiler where -- and how -- it should parallelize compute-intensive code for execution on an accelerator. 

If you've done parallel programming using OpenMP, OpenACC is very similar: using directives, applications can be parallelized *incrementally*, with little or no change to the Fortran, C or C++ source. Debugging and code maintenance are easier. OpenACC directives are designed for *portability* across operating systems, host CPUs, and accelerators. You can use OpenACC directives with GPU accelerated libraries, explicit parallel programming languages (e.g., CUDA), MPI, and OpenMP, *all in the same program.*
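
To make the idea of a directive as a compiler hint concrete, here is a minimal sketch that is not taken from the MG benchmark. The function name `saxpy` and its data clauses are illustrative assumptions. A compiler without OpenACC support ignores the pragma and runs the loop serially with the same result, which is what makes the incremental approach possible.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative example (not from MG): y[i] = a * x[i] + y[i] for i in [0, n).
 * The directive asks an OpenACC compiler to offload the loop; any other
 * compiler ignores it and the loop runs on the host unchanged. */
void saxpy(int n, float a, const float *restrict x, float *restrict y)
{
    assert(x != NULL && y != NULL);
    /* copyin: x is only read on the device; copy: y is read and written back. */
    #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
    for (int i = 0; i < n; i++) {
        y[i] = a * x[i] + y[i];
    }
}
```

The only change to the serial code is the single `#pragma` line, so the same source can be built with or without acceleration.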

Watch the following short video introduction to OpenACC:

<div align="center"><iframe width="640" height="390" style="margin: 0 auto;" src="http://www.youtube.com/embed/c9WYCFEt_Uo" frameborder="0" allowfullscreen></iframe></div>

This hands-on lab walks you through a short sample of a scientific code, and demonstrates how you can employ OpenACC directives using a four-step process. You will make modifications to a simple C program, then compile and execute the newly enhanced code in each step. Along the way, hints and solutions are provided, so you can check your work, or take a peek if you get lost.

If you are confused now, or at any point in this lab, you can consult the <a href="#FAQ">FAQ</a> located at the bottom of this page.

# Step 1 - Characterize Your Application



The most difficult part of accelerator programming begins before the first line of code is written. If your program is not highly parallel, an accelerator or coprocessor won't be much use. Understanding the code structure is crucial if you are going to *identify opportunities* and *successfully* parallelize a piece of code. The first step in OpenACC programming then is to *characterize the application*. This includes:

+ benchmarking the single-thread, CPU-only version of the application
+ understanding the program structure and how data is passed through the call tree
+ profiling the application and identifying computationally-intense "hot spots"
    + which loop nests dominate the runtime?
    + what are the minimum/average/maximum tripcounts through these loop nests?
    + are the loop nests suitable for an accelerator?
+ ensuring that the algorithms you are considering for acceleration are *safely* parallel
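
As a rough illustration of what "safely parallel" means in the last bullet, the sketch below contrasts two loops. Both functions are hypothetical examples and are not part of the MG source. In the first, every iteration touches only its own element. In the second, each iteration depends on the result of the previous one.

```c
#include <assert.h>

/* Safe to parallelize: iteration i reads only in[i] and writes only out[i]. */
void scale(int n, const double *restrict in, double *restrict out, double s)
{
    assert(n >= 0);
    #pragma acc parallel loop copyin(in[0:n]) copyout(out[0:n])
    for (int i = 0; i < n; i++) {
        out[i] = s * in[i];
    }
}

/* NOT safe to parallelize as written: iteration i reads a[i-1], which the
 * previous iteration has just written (a loop-carried dependence).
 * It is deliberately left as a serial loop with no directive. */
void prefix_sum(int n, double *a)
{
    assert(n >= 0);
    for (int i = 1; i < n; i++) {
        a[i] += a[i - 1];
    }
}
```

Loops of the second kind need a different algorithm before they can be offloaded; adding a directive alone would produce wrong results.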

Note: what we've just said may sound a little scary, so please note that as parallel programming methods go OpenACC is really pretty friendly: think of it as a sandbox you can play in. Because OpenACC directives are incremental, you can add one or two directives at a time and see how things work: the compiler provides a *lot* of feedback. The right software plus good tools plus educational experiences like this one should put you on the path to successfully accelerating your programs.

## Step 1 Profiling and Benchmarking

Before you start modifying code and adding OpenACC directives, you should benchmark the serial version of the program. To facilitate benchmarking after this and every other step in our parallel porting effort, we have built a timing routine around the main structure of our program -- a process we recommend you follow in your own efforts. Let's run the `MG` file without making any changes -- and see how fast the serial program executes. This will establish a baseline for future comparisons.  Execute the following two cells to compile and run the program.
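
A timing routine of the kind described above can be sketched as follows. This is not the benchmark's own timer (NPB ships its own in `c_timers.c` and `wtime.c`); the names `wall_seconds`, `time_region`, and `sample_work` are assumptions made for illustration, and the sketch relies on C11 `timespec_get`.

```c
#include <assert.h>
#include <time.h>

/* Wall-clock time in seconds, using C11 timespec_get. */
double wall_seconds(void)
{
    struct timespec ts;
    int base = timespec_get(&ts, TIME_UTC);
    assert(base == TIME_UTC);
    (void)base; /* silence unused warning when NDEBUG is defined */
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* Run `work` once and return the elapsed wall-clock seconds. */
double time_region(void (*work)(void))
{
    double start = wall_seconds();
    work();
    return wall_seconds() - start;
}

/* Hypothetical stand-in for the main computation being timed. */
static volatile double sink;
void sample_work(void)
{
    double s = 0.0;
    for (int i = 1; i <= 1000000; i++) {
        s += 1.0 / i;
    }
    sink = s; /* keep the compiler from removing the loop */
}
```

Wrapping the same region with the same timer at every step keeps the numbers comparable from one version of the code to the next.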

In [3]:
!cd ./NPB-acc/MG-seq/ && make clean && make MG CLASS=B

rm -f *.o *~ ../common/*.o *.w2c.c *.w2c.h *.i *.B *.t *.w2c.cu *.w2c.ptx *.spin *.s *.x
rm -f npbparams.h core
make[1]: Entering directory '/home/qichao/Desktop/notebooks-acc/NPB-acc/sys'
rm -f setparams setparams.h npbparams.h
rm -f *~ *.o
cc  -o setparams setparams.c
make[1]: Leaving directory '/home/qichao/Desktop/notebooks-acc/NPB-acc/sys'
../sys/setparams mg B
cc  -c -I../common  -DCRPL_COMP=0 mg.c
cd ../common; cc  -c -I../common  print_results.c
cd ../common; cc  -c -I../common  c_timers.c
cd ../common; cc  -c -I../common   -o wtime.o ../common/wtime.c
cd ../common; cc  -c -I../common  randdp.c
cc  -o ./mg.B.x mg.o ../common/print_results.o ../common/c_timers.o ../common/wtime.o ../common/randdp.o -lm


In [4]:
!pgprof --cpu-profiling on --cpu-profiling-mode top-down ./NPB-acc/MG-seq/mg.B.x



 NAS Parallel Benchmarks (NPB3.3-SER-C) - MG Benchmark

 No input file. Using compiled defaults 
 Size:  256x 256x 256  (class B)
 Iterations:  20

 Initialization time:           3.743 seconds

  iter   1
  iter   5
  iter  10
  iter  15
  iter  20

 Benchmark completed
 VERIFICATION SUCCESSFUL
 L2 Norm is  1.8005644013551E-06
 Error is    6.6330115975290E-14


 MG Benchmark Completed.
 Class           =                        B
 Size            =            256x 256x 256
 Iterations      =                       20
 Time in seconds =                    28.04
 Mop/s total     =                   694.11
 Operation type  =           floating point
 Verification    =               SUCCESSFUL
 Version         =                    3.3.1
 Compile date    =              23 May 2017

 Compile options:
    CC           = (none)
    CLINK        = (none)
    C_LIB        = -lm
    C_INC        = -I../common
    CFLAGS       = (none)
    CLINKFLAGS   = (none)
    RAND         = randdp

--------

### Quality Checking/Keeping a Record

*After each step*, we will record the results from our benchmarking and correctness tests in a table like this one: 

|Step| Execution       | Execution Time (s)    | Speedup vs. 1 CPU Thread       | Correct? | Programming Time |
|:--:| --------------- | ---------------------:| ------------------------------:|:--------:| -----------------|
|1   | CPU 1 thread    | 28.04                 |                                | Yes      |                  |



The profile shows that mg3P(), resid(), psinv(), interp(), and rprj3() take the most time to compute, so we will work mainly on these functions.

## Step 2 - Add Compute Directives 

Things to do before you add any `#pragma acc` compute directives:

1. Initialize the GPU:

        acc_init(acc_device_default);

2. Create the arrays the GPU needs in order to run:

        #pragma acc data create(u[0:gnr],v[0:gnr],r[0:gnr])
        {
            ...
        }

3. Tell the compiler which data is already on the GPU, using `deviceptr()` for device pointers and `present()` for arrays created earlier:

        #pragma acc data deviceptr(r1,r2) present(ou[0:n3*n2*n1]) present(or[0:n3*n2*n1])
        {
            ...
        }
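
The first two setup steps, plus the `present()` clause, can be put together in a small sketch. It is not the MG code: the functions `fill` and `fill_and_sum` are hypothetical, and the `_OPENACC` guard is an assumption added so the file also builds with a compiler that has no OpenACC support.

```c
#include <assert.h>
#include <stdlib.h>
#ifdef _OPENACC
#include <openacc.h>
#endif

/* Set v[0:n] to `value`. The array must already exist on the device,
 * which is what present() asserts. */
void fill(double *v, int n, double value)
{
    #pragma acc parallel loop present(v[0:n])
    for (int i = 0; i < n; i++) {
        v[i] = value;
    }
}

/* Hypothetical driver: returns the sum of n copies of `value`. */
double fill_and_sum(int n, double value)
{
    assert(n > 0);
    double *v = malloc((size_t)n * sizeof *v);
    if (v == NULL) {
        return 0.0;
    }
    double sum = 0.0;
#ifdef _OPENACC
    acc_init(acc_device_default);        /* step 1: initialize the device */
#endif
    #pragma acc data create(v[0:n])      /* step 2: allocate v on the device, no copy */
    {
        fill(v, n, value);               /* callee finds v via present() */
        #pragma acc parallel loop present(v[0:n]) reduction(+:sum)
        for (int i = 0; i < n; i++) {
            sum += v[i];
        }
    }
    free(v);
    return sum;
}
```

Because `create` allocates without copying, the array stays on the device for the whole region and only the scalar `sum` travels back to the host.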


```
In psinv(), allocate r1 and r2 with acc_malloc
    Use #pragma acc data deviceptr(r1,r2) to tell the compiler they are device addresses
    
In resid(), allocate u1 and u2 with acc_malloc
    Use #pragma acc data deviceptr() to pass them to the GPU
    
In rprj3(), allocate x1 and y1 with acc_malloc
    Use #pragma acc data deviceptr() to pass them to the GPU
    
In interp(), allocate z1, z2 and z3 with acc_malloc
    Use #pragma acc data deviceptr() to pass them to the GPU

In zero3(), update oz[] on the device once the function finishes

    
Add '#pragma acc parallel loop' to all nested for loops
```
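
The scratch-buffer pattern in the notes above might look roughly like the sketch below. It is a simplified 1D stand-in and not the real `psinv()`: the function `smooth`, the `SCRATCH_ALLOC`/`SCRATCH_FREE` macros, and the use of `copyin`/`copy` in place of `present` are assumptions chosen to keep the example self-contained. Without OpenACC the macros fall back to `malloc` and `free`.

```c
#include <assert.h>
#include <stdlib.h>
#ifdef _OPENACC
#include <openacc.h>
#define SCRATCH_ALLOC(bytes) acc_malloc(bytes)  /* device-only memory */
#define SCRATCH_FREE(p)      acc_free(p)
#else
#define SCRATCH_ALLOC(bytes) malloc(bytes)      /* host fallback */
#define SCRATCH_FREE(p)      free(p)
#endif

/* out[i] = in[i-1] + in[i] + in[i+1] for interior points, computed in two
 * passes through the scratch buffer r1. Boundary elements are left alone. */
void smooth(const double *in, double *out, int n)
{
    assert(n >= 3);
    double *r1 = SCRATCH_ALLOC((size_t)n * sizeof *r1);
    if (r1 == NULL) {
        return;
    }
    /* deviceptr(r1): r1 already holds a device address, so no mapping is needed. */
    #pragma acc data deviceptr(r1) copyin(in[0:n]) copy(out[0:n])
    {
        #pragma acc parallel loop
        for (int i = 1; i < n - 1; i++) {
            r1[i] = in[i - 1] + in[i + 1];
        }
        #pragma acc parallel loop
        for (int i = 1; i < n - 1; i++) {
            out[i] = r1[i] + in[i];
        }
    }
    SCRATCH_FREE(r1);
}
```

Allocating the scratch buffer directly on the device avoids any host-to-device transfer for data that the host never needs to see.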

In [5]:
!cd ./NPB-acc/MG-step1/ && make clean && make CC=pgcc CLASS=B

rm -f *.o *~ ../common/*.o *.w2c.c *.w2c.h *.i *.B *.t *.w2c.cu *.w2c.ptx *.spin *.s *.x
rm -f npbparams.h core
make[1]: Entering directory '/home/qichao/Desktop/notebooks-acc/NPB-acc/sys'
rm -f setparams setparams.h npbparams.h
rm -f *~ *.o
cc  -o setparams setparams.c
make[1]: Leaving directory '/home/qichao/Desktop/notebooks-acc/NPB-acc/sys'
../sys/setparams mg B
pgcc  -c -I../common -O3 -acc -ta=nvidia,cc35,cuda8.0  -Minfo=accel -mcmodel=medium -DCRPL_COMP=0 mg.c
main:
    248, Generating create(r[:gnr],u[:gnr],v[:gnr])
psinv:
    530, Generating present(or[:n1*(n2*n3)],ou[:n1*(n2*n3)])
    534, Accelerator kernel generated
         Generating Tesla code
        535, #pragma acc loop gang /* blockIdx.x */
        536, #pragma acc loop seq
        537, #pragma acc loop vector(128) /* threadIdx.x */
    536, Loop carried dependence of r1->,r2-> prevents parallelization
         Loop carried backward dependence of r1->,r2-> prevents vectorization
    537, Loop is parallelizable
    55

If you get any errors, please check your work and try re-compiling.

### The compiler output above shows the details of how each loop is handled.
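
The "Loop carried dependence ... prevents parallelization" message in the output means the compiler could not prove that the loop iterations are independent, so it ran that loop as `seq`. With pointers in C this is often an aliasing question. The sketch below shows two general ways to give the compiler that guarantee, `restrict` and an explicit `loop` directive with `independent`. It is a hypothetical example (`add_grid` is not an MG function) and not necessarily the change made in this lab.

```c
#include <assert.h>

/* Adds b to a element-wise over an n2 x n1 grid stored row-major.
 * `restrict` promises that a and b do not overlap, and the explicit
 * directive on the inner loop states that its iterations are independent. */
void add_grid(int n2, int n1, double *restrict a, const double *restrict b)
{
    assert(n1 > 0 && n2 > 0);
    #pragma acc parallel loop gang copy(a[0:n2*n1]) copyin(b[0:n2*n1])
    for (int i2 = 0; i2 < n2; i2++) {
        #pragma acc loop vector independent
        for (int i1 = 0; i1 < n1; i1++) {
            a[i2 * n1 + i1] += b[i2 * n1 + i1];
        }
    }
}
```

Both are promises made by the programmer: if the arrays do overlap, or the iterations really do depend on each other, the results are undefined.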

In [6]:
!./NPB-acc/MG-step1/mg.B.x



 NAS Parallel Benchmarks (NPB3.3-ACC-C) - MG Benchmark

 No input file. Using compiled defaults 
 Size:  256x 256x 256  (class B)
 Iterations:  20

 Initialization time:           1.879 seconds

  iter   1
  iter  20

 Benchmark completed
 VERIFICATION SUCCESSFUL
 L2 Norm is  1.8005644013552E-06
 Error is    9.4202877475545E-14


 MG Benchmark Completed.
 Class           =                        B
 Size            =            256x 256x 256
 Iterations      =                       20
 Time in seconds =                    16.18
 Mop/s total     =                  1202.48
 Operation type  =           floating point
 Verification    =               SUCCESSFUL
 Version         =                    3.3.1
 Compile date    =              23 May 2017

 Compile options:
    CC           = (none)
    CLINK        = (none)
    C_LIB        = -lm
    C_INC        = -I../common
    CFLAGS       = (none)
    CLINKFLAGS   = (none)
    RAND         = randdp

--------------------------------------
 P

Let's record our results in the table:

|Step| Execution    | Time(s)     | Speedup vs. 1 CPU Thread  | Correct? | Programming Time |
| -- | ------------ | ----------- | ------------------------- | -------- | ---------------- |
|1| CPU 1 thread |28.04      |                           | Yes      | |
|2| Add parallel loop  |16.18      | 1.73X           | Yes      | |


## Optimization
By default, the compiler chooses the gang, worker, and vector settings used to run the benchmark.
We can still adjust these values manually so that the program fits the device you have.

For example:
```
#pragma acc parallel loop gang num_gangs(n3-2) num_workers(16) vector_length(64)
    for (i3 = 1; i3 < n3-1; i3++) {
#pragma acc loop worker
      for (i2 = 1; i2 < n2-1; i2++) {
#pragma acc loop vector
        for (i1 = 0; i1 < n1; i1++) {
          /* ... loop body unchanged ... */
        }
      }
    }
```
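
For a complete, compilable version of this three-level mapping, consider the sketch below. The function `scale_interior`, its loop body, and the data clauses are illustrative assumptions and do not come from `mg.c`; only the gang/worker/vector structure and the clause values follow the fragment above. The grid is flattened with `i1` as the fastest-varying index.

```c
#include <assert.h>

/* dst = c * src on the interior planes and rows of an n3 x n2 x n1 grid.
 * Outer loop -> gangs, middle loop -> workers, inner loop -> vector lanes. */
void scale_interior(int n3, int n2, int n1, double c,
                    const double *restrict src, double *restrict dst)
{
    assert(n3 >= 3 && n2 >= 3 && n1 >= 1);
    int total = n3 * n2 * n1;
    #pragma acc parallel loop gang num_gangs(n3-2) num_workers(16) vector_length(64) \
        copyin(src[0:total]) copy(dst[0:total])
    for (int i3 = 1; i3 < n3 - 1; i3++) {
        #pragma acc loop worker
        for (int i2 = 1; i2 < n2 - 1; i2++) {
            #pragma acc loop vector
            for (int i1 = 0; i1 < n1; i1++) {
                int idx = (i3 * n2 + i2) * n1 + i1;  /* flattened 3D index */
                dst[idx] = c * src[idx];
            }
        }
    }
}
```

The values 16 and 64 are starting points taken from the example, not tuned constants; suitable settings depend on the device and on the loop trip counts.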

In [7]:
!cd ./NPB-acc/MG-final/ && make clean && make CC=pgcc CLASS=B

rm -f *.o *~ ../common/*.o *.w2c.c *.w2c.h *.i *.B *.t *.w2c.cu *.w2c.ptx *.spin *.s *.x
rm -f npbparams.h core
make[1]: Entering directory '/home/qichao/Desktop/notebooks-acc/NPB-acc/sys'
rm -f setparams setparams.h npbparams.h
rm -f *~ *.o
cc  -o setparams setparams.c
make[1]: Leaving directory '/home/qichao/Desktop/notebooks-acc/NPB-acc/sys'
../sys/setparams mg B
pgcc  -c -I../common -O3 -acc -ta=nvidia,cc35,cuda8.0  -Minfo=accel -mcmodel=medium -DCRPL_COMP=0 mg.c
main:
    248, Generating create(r[:gnr],u[:gnr],v[:gnr])
psinv:
    530, Generating present(or[:n1*(n2*n3)],ou[:n1*(n2*n3)])
    539, Loop is parallelizable
    541, Loop is parallelizable
    543, Loop is parallelizable
         Accelerator kernel generated
         Generating Tesla code
        539, #pragma acc loop gang /* blockIdx.y */
        541, #pragma acc loop gang, worker(4) /* blockIdx.x threadIdx.y */
        543, #pragma acc loop gang, vector(32) /* blockIdx.z threadIdx.x */
    564, Loop is parallelizable
  

In [8]:
!./NPB-acc/MG-final/mg.B.x



 NAS Parallel Benchmarks (NPB3.3-ACC-C) - MG Benchmark

 No input file. Using compiled defaults 
 Size:  256x 256x 256  (class B)
 Iterations:  20

 Initialization time:           0.948 seconds

  iter   1
  iter  20

 Benchmark completed
 VERIFICATION SUCCESSFUL
 L2 Norm is  1.8005644013552E-06
 Error is    9.4202877475545E-14


 MG Benchmark Completed.
 Class           =                        B
 Size            =            256x 256x 256
 Iterations      =                       20
 Time in seconds =                     0.58
 Mop/s total     =                 33454.47
 Operation type  =           floating point
 Verification    =               SUCCESSFUL
 Version         =                    3.3.1
 Compile date    =              23 May 2017

 Compile options:
    CC           = (none)
    CLINK        = (none)
    C_LIB        = -lm
    C_INC        = -I../common
    CFLAGS       = (none)
    CLINKFLAGS   = (none)
    RAND         = randdp

--------------------------------------
 P

Let's record our results in the table:

|Step| Execution    | Time(s)     | Speedup vs. 1 CPU Thread  | Correct? | Programming Time |
| -- | ------------ | ----------- | ------------------------- | -------- | ---------------- |
|1| CPU 1 thread |28.04      |                           | Yes      | |
|2| Add parallel loop  |16.18      | 1.73X           | Yes      | |
|3| Optimization  |0.58      | 48.34X           | Yes      | |
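
The speedup column is the single-thread CPU time divided by the time of the step in question. A small helper makes the arithmetic explicit; the function name `speedup` is an assumption for illustration.

```c
#include <assert.h>

/* Speedup of a run relative to the baseline: baseline_seconds / run_seconds. */
double speedup(double baseline_seconds, double run_seconds)
{
    assert(baseline_seconds > 0.0 && run_seconds > 0.0);
    return baseline_seconds / run_seconds;
}
```

For example, `speedup(28.04, 16.18)` gives the step 2 entry and `speedup(28.04, 0.58)` gives the step 3 entry, each rounded to two decimal places in the table.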


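## Different inputs will give different results
### It is not yet clear why the default block dimensions give a speedup of only 1.73x

The step 2 compiler output does show the middle loop of psinv() scheduled as `seq` because of a reported loop-carried dependence on r1 and r2, which may be a contributing factor.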