Exploiting parallelism
======================
We have now covered the basics of GPU programming with OpenACC, and it is time
to put them into practice. With large loop nests it can be difficult to tell
whether the loops are fully parallel, or only partly parallel so that they have
to be split up.

When are loops parallel?
------------------------
When we look at nested loops it can be difficult to spot whether they are
parallelisable, but there are techniques that can help. One of them is called
direction vectors. By determining the direction of each dependence in a loop
nest we can see whether a loop is parallel.

There exist three types of dependencies:
- True dependency, also called read after write (RAW).
- Anti dependency, also called write after read (WAR).
- Output dependency, also called write after write (WAW).

In a loop, the type is determined by how an iteration uses data that another
iteration has touched. For example
```
FOR i = 1 TO N
    A[i] = A[i-1]
```
This is a RAW dependency, as every iteration reads what was written in the
previous one.
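
To see why this matters, the sketch below runs the loop above once in its original order and once in reverse order on a small `std::vector`. The function names `shift_forward` and `shift_reversed` are made up for this illustration. Because every iteration needs the value written by the previous one, changing the execution order changes the result, and a parallel execution gives no guarantee about the order.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// A[i] = A[i-1] in the original order: each iteration reads the value the
// previous iteration just wrote (RAW), so A[0] ends up in every element.
std::vector<int> shift_forward(std::vector<int> a) {
    for (std::size_t i = 1; i < a.size(); i++) {
        a[i] = a[i - 1];
    }
    return a;
}

// The same loop body with the iterations executed from last to first.
// Now each iteration reads a value that has not been overwritten yet,
// so the array is only shifted by one position.
std::vector<int> shift_reversed(std::vector<int> a) {
    for (std::size_t i = a.size(); i-- > 1; ) {
        a[i] = a[i - 1];
    }
    return a;
}
```

With the input `{1, 2, 3, 4}` the forward loop returns `{1, 1, 1, 1}` while the reversed loop returns `{1, 1, 2, 3}`.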

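To create a direction vector we use the characters `=`, `<`, and `>`, with one
entry per loop in the nest, outermost loop first. `=` means the accesses happen
in the same iteration of that loop, `<` means the element is written in an
earlier iteration than the one reading it (a RAW across iterations), and `>`
means it is written in a later iteration than the one reading it (a WAR across
iterations).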

For some more advanced examples we have
```
FOR i = 0 TO N
    FOR j = 0 TO N
        A[i,j] = A[i,j] ...
```
Here the direction vector is `[=, =]`: every iteration reads and writes only its
own element, so the dependence stays within the same iteration of both loops.

```
FOR i = 0 TO N
    FOR j = 1 TO N
        A[j,i] = A[j-1,i] ...
```
Here the direction vector is `[=, <]`: the outer loop has no dependence between
its iterations, while the inner loop has a RAW dependency on its previous
iteration.

```
FOR i = 0 TO N
    FOR j = 0 TO N
        A[i,j] = A[i-1,j+1] ...
```
For the last example we have the direction vector `[<, >]`: the element that is
read was written in an earlier iteration of the outer loop (RAW) and belongs to
a later iteration of the inner loop (WAR).
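
One way to read the symbols is to compare, loop by loop, the iteration that writes an element with the iteration that reads it. The sketch below does exactly that. The function name `direction_vector` and the representation of an iteration as a `std::vector<int>` of loop indices (outermost first) are assumptions made for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Compare the iteration that writes an element (source) with the iteration
// that reads it (sink), one loop at a time, outermost loop first.
std::string direction_vector(const std::vector<int>& source,
                             const std::vector<int>& sink) {
    std::string dirs;
    for (std::size_t d = 0; d < source.size() && d < sink.size(); d++) {
        if (source[d] < sink[d]) {
            dirs += '<';   // written in an earlier iteration of loop d
        } else if (source[d] > sink[d]) {
            dirs += '>';   // written in a later iteration of loop d
        } else {
            dirs += '=';   // same iteration of loop d
        }
    }
    return dirs;
}
```

Taking iteration `(i, j) = (5, 5)` as the reader, the three examples above become `direction_vector({5, 5}, {5, 5})`, `direction_vector({5, 4}, {5, 5})`, and `direction_vector({4, 6}, {5, 5})`, which return `"=="`, `"=<"`, and `"<>"`.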

The important property of direction vectors is that a loop in a loop nest is
parallel if, in every direction vector, its direction is `=` or there exists an
outer loop whose direction is `<`. This means we can determine parallelism
directly from our direction vectors.
Another thing we know is that a direction vector cannot have `>` as its first
non-`=` symbol. That would mean we depend on something that we have yet to
calculate.
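
The rule above can be sketched as a small check. This is only an illustration: the function name `loop_is_parallel` and the encoding of each direction vector as a `std::string` of `'='`, `'<'`, and `'>'` characters (outermost loop first) are assumptions, not part of OpenACC.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Returns true if loop number `loop` (0 = outermost) is parallel according
// to the rule: in every direction vector its direction is '=' or some outer
// loop has direction '<'. Every vector must have an entry for `loop`.
bool loop_is_parallel(const std::vector<std::string>& vectors,
                      std::size_t loop) {
    for (const std::string& v : vectors) {
        if (v.at(loop) == '=') {
            continue;  // no dependence between iterations of this loop
        }
        bool outer_has_less = false;
        for (std::size_t q = 0; q < loop; q++) {
            if (v.at(q) == '<') {
                outer_has_less = true;  // an outer loop already orders it
            }
        }
        if (!outer_has_less) {
            return false;
        }
    }
    return true;
}
```

For the three examples above this reports both loops as parallel for `[=, =]`, only the outer loop for `[=, <]`, and only the inner loop for `[<, >]`.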

Loop interchange
----------------
To exploit even more parallelism in our code, we can use loop interchange, which
swaps the order of two loops in a nest. Loop interchange can also help us get
coalesced memory access, which as we have seen earlier can bring us performance
gains.

Loop interchange is allowed if and only if it does not result in a `>` direction
as the leftmost non-`=` direction of any direction vector.
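
As a minimal sketch of this legality test, the function below swaps two entries in every direction vector and then looks at the leftmost non-`=` symbol. The name `interchange_is_legal` and the string encoding of the vectors (one character per loop, outermost first) are assumptions made for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Returns true if loops `a` and `b` may be interchanged, given all
// direction vectors of the loop nest. The vectors are taken by value so
// the caller's copies are left untouched.
bool interchange_is_legal(std::vector<std::string> vectors,
                          std::size_t a, std::size_t b) {
    for (std::string& v : vectors) {
        std::swap(v.at(a), v.at(b));  // direction vector after interchange
        for (char c : v) {
            if (c == '=') {
                continue;             // skip leading '=' entries
            }
            if (c == '>') {
                return false;         // leftmost non-'=' is '>': illegal
            }
            break;                    // leftmost non-'=' is '<': this one is fine
        }
    }
    return true;
}
```

Interchanging the two loops of `[=, <]` gives `[<, =]`, which is legal, while interchanging the loops of `[<, >]` gives `[>, <]`, which is not.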

Loop interchange example
------------------------



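In the first version the `i` and `j` loops run sequentially on the host and only
the innermost `k` loop is offloaded. With the loop order `(i, j, k)` the
direction vector is `[=, <, =]`, since each element reads the element at `j-1`
with the same `i` and `k`.

```cpp
#include <stdlib.h>
#include <iostream>
#include <timer.h>

using namespace std;

int main() {
    int num = 500;
    int memsize = num*num*num;
    long long* elements = new long long[memsize];

    // Keep the array on the device for the whole computation and copy it
    // back to the host when the region ends.
    #pragma acc data copyout(elements[:memsize])
    {
        // Initialise every element to its own index on the device.
        #pragma acc parallel loop
        for (long long i = 0; i < memsize; i++) {
            elements[i] = i;
        }

        timer time;
        // i and j run sequentially on the host, so a kernel is launched
        // for every (i, j) pair.
        for (int i = 0; i < num; i++) {
            for (int j = 1; j < num; j++) {
                // Only the k loop runs in parallel.
                #pragma acc parallel loop
                for (int k = 0; k < num; k++) {
                    elements[i*num*num+j*num+k] += elements[i*num*num+(j-1)*num+k];
                }
            }
        }

        cout << "Elapsed time: " << time.getTime() << endl;
    }

    cout << elements[memsize-1] << endl;
    delete[] elements;
}
```

```
Elapsed time: 45.4758
62437624500
```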


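Interchanging the `i` and `j` loops turns the direction vector `[=, <, =]` into
`[<, =, =]`, which is legal because the leftmost non-`=` direction is `<`. The
sequential `j` loop is now outermost, and the `i` and `k` loops inside it are
both parallel, so they can be collapsed into one kernel per `j` iteration.

```cpp
#include <stdlib.h>
#include <iostream>
#include <timer.h>

using namespace std;

int main() {
    int num = 500;
    int memsize = num*num*num;
    long long* elements = new long long[memsize];

    // Keep the array on the device for the whole computation and copy it
    // back to the host when the region ends.
    #pragma acc data copyout(elements[:memsize])
    {
        // Initialise every element to its own index on the device.
        #pragma acc parallel loop
        for (long long i = 0; i < memsize; i++) {
            elements[i] = i;
        }

        timer time;

        // The loop carrying the dependency (j) is now outermost and runs
        // sequentially on the host.
        for (int j = 1; j < num; j++) {
            // i and k are both parallel, so they are collapsed into a single
            // parallel loop over num*num iterations.
            #pragma acc parallel loop collapse(2) present(elements[:memsize])
            for (int i = 0; i < num; i++) {
                for (int k = 0; k < num; k++) {
                    elements[i*num*num+j*num+k] += elements[i*num*num+(j-1)*num+k];
                }
            }
        }

        cout << "Elapsed time: " << time.getTime() << endl;
    }

    cout << elements[memsize-1] << endl;
    delete[] elements;
}
```

```
Elapsed time: 0.099774
62437624500
```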
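
Both programs print the same last element, which suggests the interchange did not change the result. A quick way to check the whole array is to run both loop orders sequentially on the host for a small `num` and compare. The sketch below does that without any OpenACC directives. The function names `prefix_ijk` and `prefix_jik` are made up for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Original loop order: i outermost, then j, then k.
std::vector<long long> prefix_ijk(int num) {
    std::vector<long long> e(static_cast<std::size_t>(num) * num * num);
    for (std::size_t x = 0; x < e.size(); x++) {
        e[x] = static_cast<long long>(x);
    }
    for (int i = 0; i < num; i++) {
        for (int j = 1; j < num; j++) {
            for (int k = 0; k < num; k++) {
                e[i*num*num + j*num + k] += e[i*num*num + (j-1)*num + k];
            }
        }
    }
    return e;
}

// Interchanged loop order: j outermost, then i, then k.
std::vector<long long> prefix_jik(int num) {
    std::vector<long long> e(static_cast<std::size_t>(num) * num * num);
    for (std::size_t x = 0; x < e.size(); x++) {
        e[x] = static_cast<long long>(x);
    }
    for (int j = 1; j < num; j++) {
        for (int i = 0; i < num; i++) {
            for (int k = 0; k < num; k++) {
                e[i*num*num + j*num + k] += e[i*num*num + (j-1)*num + k];
            }
        }
    }
    return e;
}
```

Comparing `prefix_ijk(8)` with `prefix_jik(8)` element by element (for example with `==` on the two vectors) should show no difference, since the interchange keeps every `j-1` element finished before it is read.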
