Memory Coalescing
=================
This part covers memory coalescing: what it is, why it matters for performance,
and how to achieve it. Two versions of the same program show one access pattern
that coalesces and one that does not.

0 What is it?
-------------
On a GPU we have three layers of memory:
- Global
- Shared
- Local (registers)

When a thread accesses global memory on a GPU, the hardware fetches a whole
block of adjacent elements in one transaction, not only the element that was
requested. This matters because access to global memory is slow, so every
transaction should deliver as much useful data as possible. To gain maximum
performance, adjacent threads on the GPU should therefore access adjacent
memory, so that a single transaction serves many threads at once. This is what
memory coalescing means.

1 Matrix addition
-----------------
We will look at matrix addition, but for teaching purposes we parallelise only
one dimension at a time. We show the difference between parallelising each of
the two dimensions and explain where that difference comes from.

2 Parallelising the outer loop
------------------------------
When programming a CPU, the natural choice is to parallelise the outer loop,
because each thread then walks through its own row sequentially and gets good
cache locality. This is what the first version does. It is not optimal on a
GPU, because each access to global memory fetches multiple adjacent elements,
as described earlier. With the outer loop parallelised, every thread in the
thread block works on its own row, so at any given step adjacent threads touch
elements that are a full row apart. Each thread then needs its own read of
global memory, and most of the data fetched by a read is of no use to the
neighbouring threads.

![Every thread will read from a different block of memory](notcoalesced.png)

In [22]:
#include<stdlib.h>
#include<stdio.h>   // printf
#include<iostream>
#include "timer.h"

using namespace std;

int main() {
    int height = 20000;
    int width = 20000;
    int memsize = width*height;

    int* a = new int[memsize];
    int* b = new int[memsize];
    int* res = new int[memsize];

    for (int i = 0; i < height; i++) {
        for (int j = 0; j < width; j++) {
            a[i*width+j] = 1;
            b[i*width+j] = 1;
        }
    }

    timer time;

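    // Parallelise the outer loop: each thread handles one whole row i,
    // so adjacent threads touch elements that are width apart in memory.
    #pragma omp target teams distribute parallel for map(to:a[:width*height]) map(to:b[:width*height]) map(from:res[:width*height])
    for (int i = 0; i < height; i++) {
        for (int j = 0; j < width; j++) {
            res[i*width+j] = a[i*width+j] + b[i*width+j];
        }
    }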

    printf("Elapsed time: %f\n", time.getTime());

    bool allElementsAre2 = true;
    for (int i = 0; i < height; i++) {
        for (int j = 0; j < width; j++) {
            if (res[i*width+j] != 2) {
                allElementsAre2 = false;
            }
        }
    }

    if (allElementsAre2) {
        cout << "All numbers in matrix are 2" << endl;
    } else {
        cout << "Not all numbers in matrix are 2" << endl;
    }

    // Release the host arrays.
    delete[] a;
    delete[] b;
    delete[] res;

    return 0;
}

Elapsed time: 1.196168
All numbers in matrix are 2



3 Parallelising the inner loop
------------------------------
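To fix the problem in the previous version, we want adjacent threads to work on
adjacent elements of a row, that is, to parallelise over the inner index j. The
code below does this by interchanging the two loops, so that the parallelised
loop runs over the columns and each thread walks down one column. When a block
of data is read from global memory, every data point in it is given to a
thread, and no data is fetched without being assigned to a thread.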

![All threads read within the same block of memory](coalesced.png)

In [23]:
#include<stdlib.h>
#include<stdio.h>   // printf
#include<iostream>
#include "timer.h"

using namespace std;

int main() {
    int height = 20000;
    int width = 20000;
    int memsize = width*height;

    int* a = new int[memsize];
    int* b = new int[memsize];
    int* res = new int[memsize];

    for (int i = 0; i < height; i++) {
        for (int j = 0; j < width; j++) {
            a[i*width+j] = 1;
            b[i*width+j] = 1;
        }
    }

    timer time;

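    // Loops interchanged: the parallelised loop now runs over the columns j,
    // so adjacent threads touch adjacent elements of the same row.
    #pragma omp target teams distribute parallel for map(to:a[:width*height]) map(to:b[:width*height]) map(from:res[:width*height])
    for (int j = 0; j < width; j++) {
        for (int i = 0; i < height; i++) {
            res[i*width+j] = a[i*width+j] + b[i*width+j];
        }
    }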

    printf("Elapsed time: %f\n", time.getTime());

    bool allElementsAre2 = true;
    for (int i = 0; i < height; i++) {
        for (int j = 0; j < width; j++) {
            if (res[i*width+j] != 2) {
                allElementsAre2 = false;
            }
        }
    }

    if (allElementsAre2) {
        cout << "All numbers in matrix are 2" << endl;
    } else {
        cout << "Not all numbers in matrix are 2" << endl;
    }

    // Release the host arrays.
    delete[] a;
    delete[] b;
    delete[] res;

    return 0;
}

Elapsed time: 1.971065
All numbers in matrix are 2
