## NPB-openACC-C-CG Implementation
In this self-paced, hands-on lab, we will briefly explore some methods for accelerating the NPB CG benchmark in C with OpenACC.

Qichao Hong

---
Before we begin, let's verify [WebSockets](http://en.wikipedia.org/wiki/WebSocket) are working on your system.  To do this, execute the cell block below by giving it focus (clicking on it with your mouse), and hitting Ctrl-Enter, or pressing the play button in the toolbar above.  If all goes well, you should see some output returned below the grey cell.  If not, please consult the [Self-paced Lab Troubleshooting FAQ](https://developer.nvidia.com/self-paced-labs-faq#Troubleshooting) to debug the issue.

In [1]:
print ("The answer should be three: " + str(1+2))

The answer should be three: 3


First, run the cell below to get some info about the GPUs on the server.

In [2]:
!nvidia-smi

Tue May 23 07:49:32 2017       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.51                 Driver Version: 375.51                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|   0  GeForce GTX 780 Ti  Off  | 0000:01:00.0     N/A |                  N/A |
| 31%   42C    P8    N/A /  N/A |    345MiB /  3017MiB |     N/A      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 780 Ti  Off  | 0000:02:00.0     N/A |                  N/A |
| 26%   40C    P8    N/A /  N/A |      1MiB /  3020MiB |     N/A      Default |
+-------------------------------+----------------------+----------------------+
                                                                            

CPU: Intel i7-4960x

---
<p class="hint_trigger">If you have never before taken an IPython Notebook based self-paced lab from NVIDIA, click this green box.
      <div class="toggle_container"><div class="input_area box-flex1"><div class=\"highlight\">The following video will explain the infrastructure we are using for this self-paced lab, as well as give some tips on its usage.  If you've never taken a lab on this system before, you are highly encouraged to watch this short video first.<br><br>
<div align="center"><iframe width="640" height="390" src="http://www.youtube.com/embed/ZMrDaLSFqpY" frameborder="0" allowfullscreen></iframe></div>
<br>
<h2 style="text-align:center;color:red;">Attention Firefox Users</h2><div style="text-align:center; margin: 0px 25px 0px 25px;">There is a bug with Firefox related to setting focus in any text editors embedded in this lab. Even though the cursor may be blinking in the text editor, focus for the keyboard may not be there, and any keys you press may be applying to the previously selected cell.  To work around this issue, you'll need to first click in the margin of the browser window (where there are no cells) and then in the text editor.  Sorry for this inconvenience, we're working on getting this fixed.</div></div></div></div></p>

## Introduction to OpenACC

Open-specification OpenACC directives are a straightforward way to accelerate existing Fortran and C applications. With OpenACC directives, you provide hints via compiler directives (or 'pragmas') to tell the compiler where -- and how -- it should parallelize compute-intensive code for execution on an accelerator. 

If you've done parallel programming using OpenMP, OpenACC is very similar: using directives, applications can be parallelized *incrementally*, with little or no change to the Fortran, C or C++ source. Debugging and code maintenance are easier. OpenACC directives are designed for *portability* across operating systems, host CPUs, and accelerators. You can use OpenACC directives with GPU accelerated libraries, explicit parallel programming languages (e.g., CUDA), MPI, and OpenMP, *all in the same program.*
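
As a minimal sketch of what "incremental" means in practice, the function below is an ordinary C loop with a single OpenACC directive added. The name `saxpy` and its parameters are illustrative and do not come from the lab code. A compiler without OpenACC support simply ignores the pragma and runs the loop serially, so the same source still builds everywhere.

```c
#include <stddef.h>

/* y[i] = alpha * x[i] + y[i] for i in [0, n).
 * Illustrative only: this function is not part of the CG benchmark. */
void saxpy(size_t n, float alpha, const float *restrict x, float *restrict y)
{
    /* Each iteration touches a different element, so the loop can be
     * offloaded as-is. The data clauses say what to move: x is only read,
     * y is read and written back. */
    #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
    for (size_t i = 0; i < n; i++) {
        y[i] = alpha * x[i] + y[i];
    }
}
```

Removing the pragma line gives back the original serial function, which is what makes it practical to add directives one loop at a time.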

Watch the following short video introduction to OpenACC:

<div align="center"><iframe width="640" height="390" style="margin: 0 auto;" src="http://www.youtube.com/embed/c9WYCFEt_Uo" frameborder="0" allowfullscreen></iframe></div>

This hands-on lab walks you through a short sample of a scientific code, and demonstrates how you can employ OpenACC directives using a four-step process. You will make modifications to a simple C program, then compile and execute the newly enhanced code in each step. Along the way, hints and solutions are provided, so you can check your work, or take a peek if you get lost.

If you are confused now, or at any point in this lab, you can consult the <a href="#FAQ">FAQ</a> located at the bottom of this page.

# Step 1 - Characterize Your Application



The most difficult part of accelerator programming begins before the first line of code is written. If your program is not highly parallel, an accelerator or coprocessor won't be much use. Understanding the code structure is crucial if you are going to *identify opportunities* and *successfully* parallelize a piece of code. The first step in OpenACC programming then is to *characterize the application*. This includes:

+ benchmarking the single-thread, CPU-only version of the application
+ understanding the program structure and how data is passed through the call tree
+ profiling the application and identifying computationally-intense "hot spots"
    + which loop nests dominate the runtime?
    + what are the minimum/average/maximum tripcounts through these loop nests?
    + are the loop nests suitable for an accelerator?
+ ensuring that the algorithms you are considering for acceleration are *safely* parallel

Note: what we've just said may sound a little scary, so please note that as parallel programming methods go OpenACC is really pretty friendly: think of it as a sandbox you can play in. Because OpenACC directives are incremental, you can add one or two directives at a time and see how things work: the compiler provides a *lot* of feedback. The right software plus good tools plus educational experiences like this one should put you on the path to successfully accelerating your programs.
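
To make the "safely parallel" item in the checklist above concrete, here is a small sketch contrasting a loop whose iterations are independent with one that has a loop-carried dependence. Both function names are hypothetical and are not taken from the benchmark.

```c
#include <stddef.h>

/* Safely parallel: iteration i reads and writes only element i. */
void scale(size_t n, double alpha, double *v)
{
    #pragma acc kernels loop independent copy(v[0:n])
    for (size_t i = 0; i < n; i++) {
        v[i] = alpha * v[i];
    }
}

/* NOT safely parallel as written: iteration i reads v[i-1], which the
 * previous iteration has just written (a loop-carried dependence).
 * No directive is added here, because running the iterations
 * concurrently would give wrong results. */
void running_sum(size_t n, double *v)
{
    for (size_t i = 1; i < n; i++) {
        v[i] = v[i] + v[i - 1];
    }
}
```

A loop like `running_sum` needs a different algorithm (for example a parallel scan) before it can be offloaded; adding a directive alone is not enough.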

## Step 1 Profiling and Benchmarking

Before you start modifying code and adding OpenACC directives, you should benchmark the serial version of the program. To facilitate benchmarking after this and every other step in our parallel porting effort, we have built a timing routine around the main structure of our program, a process we recommend you follow in your own efforts. Let's compile and run the `CG` benchmark without making any changes and see how fast the serial program executes. This will establish a baseline for future comparisons.  Execute the following two cells to compile and run the program.

In [3]:
!cd ./NPB-acc/CG-seq/ && make clean && make CG CLASS=B

rm -f *.x *.w2c.ptx *.o *.w2c.cu *.w2c.c *.w2c.h *.i *.spin *.B *.s *.t *~ ../common/*.o
rm -f npbparams.h core
make[1]: Entering directory '/home/qichao/Desktop/notebooks-acc/NPB-acc/sys'
rm -f setparams setparams.h npbparams.h
rm -f *~ *.o
cc  -o setparams setparams.c
make[1]: Leaving directory '/home/qichao/Desktop/notebooks-acc/NPB-acc/sys'
../sys/setparams cg B
cc  -c -I../common  -DCRPL_COMP=0 cg.c
cd ../common; cc  -c -I../common  print_results.c
cd ../common; cc  -c -I../common  randdp.c
cd ../common; cc  -c -I../common  c_timers.c
cd ../common; cc  -c -I../common   -o wtime.o ../common/wtime.c
cc  -o ./cg.B.x cg.o ../common/print_results.o ../common/randdp.o ../common/c_timers.o ../common/wtime.o -lm


In [4]:
!pgprof --cpu-profiling on --cpu-profiling-mode top-down ./NPB-acc/CG-seq/cg.B.x



 NAS Parallel Benchmarks (NPB3.3-SER-C) - CG Benchmark

 Size:       75000
 Iterations:    75

 Initialization time =           9.300 seconds

   iteration           ||r||                 zeta
        1       2.27623230253657E-13    59.9994751578754
        2       8.44559392893216E-16    21.7627846142536
        3       8.64511022632338E-16    22.2876617043224
        4       8.61697722482409E-16    22.5230738188346
        5       8.71222095709208E-16    22.6275390653892
        6       8.71977185255567E-16    22.6740259189533
        7       8.73832261918524E-16    22.6949056826251
        8       8.87870239801501E-16    22.7044023166872
        9       8.83216527363961E-16    22.7087834345620
       10       8.80909464617773E-16    22.7108351397177
       11       8.80803816383157E-16    22.7118107121341
       12       8.84631507363274E-16    22.7122816240971
       13       8.84386307708107E-16    22.7125122663242
       14       8.85738245256241E-16    22.7126268007594
       

### Quality Checking/Keeping a Record

*After each step*, we will record the results from our benchmarking and correctness tests in a table like this one: 

|Step| Execution       | Execution Time (s)    | Speedup vs. 1 CPU Thread       | Correct? | Programming Time |
|:--:| --------------- | ---------------------:| ------------------------------:|:--------:| -----------------|
|1   | CPU 1 thread    | 130.08                |                                | Yes      |                  |
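
One way to fill in the "Correct?" column is to compare the final `zeta` of each run against the value from the serial run. The sketch below assumes a relative-error test; the function name `zeta_matches` and the tolerance are illustrative choices, not the benchmark's own verification code.

```c
#include <math.h>
#include <stdbool.h>

/* Returns true when `zeta` agrees with `zeta_ref` to within the relative
 * tolerance `eps`. Both the name and the tolerance are hypothetical. */
bool zeta_matches(double zeta, double zeta_ref, double eps)
{
    double err = fabs(zeta - zeta_ref) / fabs(zeta_ref);
    return err <= eps;
}
```

A relative tolerance is used rather than exact equality because a parallel run may add floating-point terms in a different order, so the last few digits can legitimately differ.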



When reading the profiler output, look mainly at the `conj_grad` function.

## Step 2 - Add Compute Directives 

Things you need to do before you add `#pragma ...`:

1. Initialize the GPU:

        acc_init(acc_device_default);

2. Copy in the variables the GPU needs in order to run:

        #pragma acc data copyin(colidx[0:nz],a[0:nz], rowstr[0:na+1]) create(x[0:na+2],z[0:na+2], p[0:na+2],q[0:na+2], r[0:na+2])
        {
            ...
        }
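
The two preparation steps above can be sketched on a toy problem as follows. The function `square_plus_one` and its arrays are hypothetical; only `acc_init`, `acc_device_default`, and the `copyin`/`create`/`copyout` clauses are standard OpenACC. The `_OPENACC` guard is an assumption that lets the same file build with a compiler that has no OpenACC support.

```c
#include <stddef.h>
#ifdef _OPENACC
#include <openacc.h>
#endif

/* out[i] = a[i]*a[i] + 1, computed in two loops that share a scratch
 * array. Hypothetical helper, used only to show where acc_init and the
 * data region sit relative to the compute loops. */
void square_plus_one(size_t n, const double *a, double *tmp, double *out)
{
#ifdef _OPENACC
    /* Pay the device start-up cost once, before any compute region. */
    acc_init(acc_device_default);
#endif

    /* a is only read (copyin), tmp lives only on the device (create),
     * and out is only produced (copyout). The data stays on the device
     * for both loops inside the region. */
    #pragma acc data copyin(a[0:n]) create(tmp[0:n]) copyout(out[0:n])
    {
        #pragma acc kernels loop
        for (size_t i = 0; i < n; i++) {
            tmp[i] = a[i] * a[i];
        }

        #pragma acc kernels loop
        for (size_t i = 0; i < n; i++) {
            out[i] = tmp[i] + 1.0;
        }
    }
}
```

Note that an array listed under `create` is not copied back, so on an accelerator run the host copy of `tmp` is left untouched; only `out` should be read afterwards.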

#### Then find the `for` loops and add `#pragma acc kernels loop` to each loop
#### Add a `reduction()` clause where it is applicable
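
As a small sketch of a loop that needs a reduction clause, consider a dot product, where every iteration adds into the same scalar. The function name `dot` is illustrative and not taken from `cg.c`.

```c
#include <stddef.h>

/* Dot product of x and y. Every iteration updates `sum`, so the loop is
 * only parallel if the compiler is told to combine per-thread partial
 * sums, which is what reduction(+:sum) requests. */
double dot(size_t n, const double *x, const double *y)
{
    double sum = 0.0;
    #pragma acc kernels loop reduction(+:sum) copyin(x[0:n], y[0:n])
    for (size_t i = 0; i < n; i++) {
        sum += x[i] * y[i];
    }
    return sum;
}
```

Because a parallel reduction may add the terms in a different order than the serial loop, the last digits of quantities such as `||r||` and `zeta` can differ slightly between the CPU and GPU runs.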
        

In [5]:
!cd ./NPB-acc/CG-step1/ && make clean && make CC=pgcc CLASS=B

rm -f *.x *.w2c.ptx *.o *.w2c.cu *.w2c.c *.w2c.h *.i *.spin *.B *.s *.t *~ ../common/*.o
rm -f npbparams.h core
make[1]: Entering directory '/home/qichao/Desktop/notebooks-acc/NPB-acc/sys'
rm -f setparams setparams.h npbparams.h
rm -f *~ *.o
cc  -o setparams setparams.c
make[1]: Leaving directory '/home/qichao/Desktop/notebooks-acc/NPB-acc/sys'
../sys/setparams cg B
pgcc  -c -I../common -O3 -acc -ta=nvidia,cc35,cuda8.0  -Minfo=accel -mcmodel=medium -DCRPL_COMP=0 cg.c
main:
    258, Generating copyin(a[:nz],colidx[:nz])
         Generating create(p[:na+2],q[:na+2],r[:na+2])
         Generating copyin(rowstr[:na+1])
         Generating create(x[:na+2],z[:na+2])
    269, Loop is parallelizable
         Accelerator kernel generated
         Generating Tesla code
        269, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
    275, Loop is parallelizable
         Accelerator kernel generated
         Generating Tesla code
        275, #pragma acc loop gang, vector(128) /* bl

If you get any errors, please check your work and try recompiling.

In [6]:
!ulimit -s unlimited && ./NPB-acc/CG-step1/cg.B.x



 NAS Parallel Benchmarks (NPB3.3-ACC-C) - CG Benchmark

 Size:       75000
 Iterations:    75

 Initialization time =           4.634 seconds

   iteration           ||r||                 zeta
        1       8.88765256257298E-14    59.9994751578754
        2       3.64772534035289E-16    21.7627846142538
        3       3.80342263564589E-16    22.2876617043225
        4       3.82893975525703E-16    22.5230738188352
        5       3.84615658753603E-16    22.6275390653890
        6       3.87812069871467E-16    22.6740259189537
        7       3.83529848508281E-16    22.6949056826253
        8       3.84959489734748E-16    22.7044023166871
        9       3.87169229586280E-16    22.7087834345616
       10       3.83418016642621E-16    22.7108351397172
       11       3.79213562407113E-16    22.7118107121337
       12       3.80483938355708E-16    22.7122816240973
       13       3.79296490032389E-16    22.7125122663245
       14       3.79382311109759E-16    22.7126268007598
       

Let's record our results in the table:

|Step| Execution              | Execution Time (s) | Speedup vs. 1 CPU Thread  | Correct? | Programming Time |
|:--:| ---------------------- | ------------------:| -------------------------:|:--------:| ---------------- |
|1   | CPU 1 thread           | 130.08             |                           | Yes      |                  |
|2   | Add kernels directive  | 108.42             | 1.20X                     | Yes      |                  |
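
The speedup column is the baseline time divided by the time of the current step. A one-line helper (the name `speedup` is hypothetical) makes the convention explicit:

```c
/* Speedup of a run relative to the baseline.
 * Values above 1 mean the run is faster than the baseline,
 * values below 1 mean it is slower. */
double speedup(double baseline_seconds, double run_seconds)
{
    return baseline_seconds / run_seconds;
}
```

For the two timings recorded so far, `speedup(130.08, 108.42)` is about 1.2.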


## Optimization
By default, the compiler chooses the gang, worker, and vector settings used to run the benchmark.
We can still adjust these values manually to fit the program to the device you have.

For example:
```
#pragma acc parallel num_gangs(end) num_workers(4) vector_length(32)
{
#pragma acc loop gang
        for (j = 0; j < end; j++) {
          tmp1 = rowstr[j];
          tmp2 = rowstr[j+1];
          sum = 0.0;
#pragma acc loop worker vector reduction(+:sum)
          for (k = tmp1; k < tmp2; k++) {
            /* ... inner loop body not shown ... */
          }
          /* ... */
        }
}
```
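
For reference, here is a self-contained sketch of the same gang / worker / vector mapping applied to a complete sparse matrix-vector product in CSR form. The loop body is an assumed, standard CSR product and is not claimed to match the body elided in the excerpt above; the function name `spmv_csr` and the parameters `nrows` and `ncols` are hypothetical.

```c
/* q = A * p for a sparse matrix A stored in CSR form:
 *   rowstr[j] .. rowstr[j+1]-1 index the nonzeros of row j,
 *   a[k] is the value and colidx[k] the column of nonzero k.
 * Assumed loop body: a standard CSR product, not the lab's own code. */
void spmv_csr(int nrows, int ncols, const int *rowstr, const int *colidx,
              const double *a, const double *p, double *q)
{
    int nz = rowstr[nrows];   /* total number of stored nonzeros */

    #pragma acc parallel num_gangs(nrows) num_workers(4) vector_length(32) \
        copyin(rowstr[0:nrows+1], colidx[0:nz], a[0:nz], p[0:ncols]) \
        copyout(q[0:nrows])
    {
        /* Rows are distributed across gangs. */
        #pragma acc loop gang
        for (int j = 0; j < nrows; j++) {
            double sum = 0.0;   /* declared inside the loop, so per row */

            /* Workers and vector lanes share the nonzeros of row j and
             * combine their partial sums through the reduction. */
            #pragma acc loop worker vector reduction(+:sum)
            for (int k = rowstr[j]; k < rowstr[j + 1]; k++) {
                sum += a[k] * p[colidx[k]];
            }
            q[j] = sum;
        }
    }
}
```

Whether 4 workers of 32 lanes is a good fit depends on how many nonzeros each row has and on the device; a manual setting is not guaranteed to beat the compiler's default, so it is worth timing both.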

In [7]:
!cd ./NPB-acc/CG-final/ && make clean && make CC=pgcc CLASS=B

rm -f *.x *.w2c.ptx *.o *.w2c.cu *.w2c.c *.w2c.h *.i *.spin *.B *.s *.t *~ ../common/*.o
rm -f npbparams.h core
make[1]: Entering directory '/home/qichao/Desktop/notebooks-acc/NPB-acc/sys'
rm -f setparams setparams.h npbparams.h
rm -f *~ *.o
cc  -o setparams setparams.c
make[1]: Leaving directory '/home/qichao/Desktop/notebooks-acc/NPB-acc/sys'
../sys/setparams cg B
pgcc  -c -I../common -O3 -acc -ta=nvidia,cc35,cuda8.0  -Minfo=accel -mcmodel=medium -DCRPL_COMP=0 cg.c
main:
    258, Generating copyin(a[:nz],colidx[:nz])
         Generating create(p[:na+2],q[:na+2],r[:na+2])
         Generating copyin(rowstr[:na+1])
         Generating create(x[:na+2],z[:na+2])
    273, Loop is parallelizable
         Accelerator kernel generated
         Generating Tesla code
        273, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
    283, Loop is parallelizable
         Accelerator kernel generated
         Generating Tesla code
        283, #pragma acc loop gang, vector(128) /* bl

In [8]:
!ulimit -s unlimited && ./NPB-acc/CG-final/cg.B.x



 NAS Parallel Benchmarks (NPB3.3-ACC-C) - CG Benchmark

 Size:       75000
 Iterations:    75

 Initialization time =          10.713 seconds

   iteration           ||r||                 zeta
        1       8.88102088587189E-14    59.9994751578754
        2       3.66238294280087E-16    21.7627846142538
        3       3.79702708746571E-16    22.2876617043225
        4       3.81606760478825E-16    22.5230738188352
        5       3.84849469752605E-16    22.6275390653890
        6       3.86542494879370E-16    22.6740259189537
        7       3.83405242458680E-16    22.6949056826253
        8       3.86641304912142E-16    22.7044023166871
        9       3.82345464089509E-16    22.7087834345616
       10       3.81857672275906E-16    22.7108351397172
       11       3.80161276146272E-16    22.7118107121337
       12       3.79588251304061E-16    22.7122816240973
       13       3.78754892135336E-16    22.7125122663245
       14       3.77214586387744E-16    22.7126268007598
       

Let's record our results in the table:

|Step| Execution                                   | Execution Time (s) | Speedup vs. 1 CPU Thread  | Correct? | Programming Time |
|:--:| ------------------------------------------- | ------------------:| -------------------------:|:--------:| ---------------- |
|1   | CPU 1 thread                                | 130.08             |                           | Yes      |                  |
|2   | Add kernels directive                       | 108.42             | 1.20X                     | Yes      |                  |
|3   | Manual gang/worker/vector settings (slower) | 569.84             | 0.23X                     | Yes      |                  |


### Sergio's version is slower than the sequential version
