<div align="center"><h1>Vector Addition with HIP C/C++</h1></div>

---
## Prerequisites

To get the most out of this lab you should already be able to:

- Declare variables, write loops, and use if / else statements in C.
- Define and invoke functions in C.
- Allocate arrays in C.

No previous HIP knowledge is required.

---
## Objectives

By the time you complete this lab, you will be able to:

- Write, compile, and run C/C++ programs that both call CPU functions and **launch** GPU **kernels**.
- Control parallel **thread hierarchy** using **execution configuration**.
- Refactor serial loops to execute their iterations in parallel on a GPU.
- Allocate and free memory available to both CPUs and GPUs.
- Handle errors generated by HIP code.
- Accelerate CPU-only applications.

---
## Accelerated Systems

*Accelerated systems*, also referred to as *heterogeneous systems*, are those composed of both CPUs and GPUs. Accelerated systems run CPU programs which, in turn, launch functions that benefit from the massive parallelism provided by GPUs. This lab environment is an accelerated system that includes an AMD GPU. Information about this GPU can be queried with the `rocm-smi` (*ROCm System Management Interface*) command. Issue the `rocm-smi` command now by `CTRL` + clicking on the code execution cell below. You will find these cells throughout this lab any time you need to execute code. The output of the command is printed just below the cell after it runs. After running the cell immediately below, find and note the name of the GPU in the output.

In [1]:
!rocm-smi



Device  [Model : Revision]    Temp    Power     Partitions      SCLK  MCLK     Fan  Perf  PwrCap       VRAM%  GPU%  
        Name (20 chars)       (Edge)  (Socket)  (Mem, Compute)
0       [0xb002 : 0xc1]       33.0°C  11.189W   N/A, N/A        None  2800Mhz  0%   auto  Unsupported    3%   1%    
        0x15bf                                                                                                      


---
## Writing Application Code for the GPU

HIP extends C/C++, the language used in this lab, with a runtime API and a small set of language extensions. These extensions let developers run functions in their source code on a GPU.

Below is a `.cpp` file. It contains two functions: the first runs on the CPU, the second on the GPU. Spend a little time identifying the differences between the functions, both in how they are defined and in how they are invoked.

```cpp
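#include <stdio.h>
#include "hip/hip_runtime.h"

// Ordinary function: compiled for and executed on the CPU (host).
void CPUFunction()
{
  printf("This function is defined to run on the CPU.\n");
}

// Kernel: the __global__ qualifier marks it to run on the GPU (device).
__global__ void GPUFunction()
{
  printf("This function is defined to run on the GPU.\n");
}

int main()
{
  CPUFunction();

  // Launch the kernel with 1 block containing 1 thread.
  GPUFunction<<<1, 1>>>();
  // Wait for the GPU to finish before the program exits.
  hipDeviceSynchronize();
}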
```

Here are some important lines of code to highlight, as well as some other common terms used in accelerated computing:

`__global__ void GPUFunction()`
  - The `__global__` keyword indicates that the following function will run on the GPU and can be invoked **globally**, which in this context means either by the CPU or by the GPU.
  - Often, code executed on the CPU is referred to as **host** code, and code running on the GPU is referred to as **device** code.
  - Notice the return type `void`. It is required that functions defined with the `__global__` keyword return type `void`.

`GPUFunction<<<1, 1>>>();`
  - Typically, when calling a function to run on the GPU, we call this function a **kernel**, which is **launched**.
  - When launching a kernel, we must provide an **execution configuration**, which is done by using the `<<< ... >>>` syntax just prior to passing the kernel any expected arguments.
  - At a high level, execution configuration allows programmers to specify the **thread hierarchy** for a kernel launch, which defines the number of thread groupings (called **blocks**), as well as how many **threads** to execute in each block. Execution configuration will be explored at great length later in the lab, but for the time being, notice the kernel is launching with `1` block of threads (the first execution configuration argument) which contains `1` thread (the second configuration argument).

`hipDeviceSynchronize();`
  - Unlike much C/C++ code, launching kernels is **asynchronous**: the CPU code will continue to execute *without waiting for the kernel launch to complete*.
  - A call to `hipDeviceSynchronize`, a function provided by the HIP runtime, will cause the host (CPU) code to wait until the device (GPU) code completes, and only then resume execution on the CPU.
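
To make the execution configuration concrete, the following host-only C++ sketch emulates what `<<<blocks, threads>>>` describes: one kernel invocation per (block, thread) pair. It does not use HIP and runs serially on the CPU. `ThreadCoord` and `emulate_launch` are hypothetical names introduced for illustration, and on a real GPU the invocations run in parallel with no guaranteed order.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for the (blockIdx.x, threadIdx.x) pair each GPU thread sees.
struct ThreadCoord {
    int block_idx;
    int thread_idx;
};

// Serially visit every (block, thread) pair that a launch configured as
// <<<num_blocks, threads_per_block>>> would create.
std::vector<ThreadCoord> emulate_launch(int num_blocks, int threads_per_block)
{
    assert(num_blocks > 0 && threads_per_block > 0);
    std::vector<ThreadCoord> coords;
    for (int b = 0; b < num_blocks; ++b) {
        for (int t = 0; t < threads_per_block; ++t) {
            coords.push_back({b, t});
        }
    }
    return coords;
}
```

Under this emulation, `emulate_launch(1, 1)` yields a single coordinate, matching the single thread launched by `GPUFunction<<<1, 1>>>()`, while `emulate_launch(2, 4)` yields eight.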

---
### Compiling and Running Accelerated HIP Code

This section contains details about the `hipcc` command used in this lab to compile and run `.cpp` programs.

The HIP platform ships with the [**ROCm HIP Compiler**](https://rocm.docs.amd.com/projects/HIPCC/en/latest/index.html) `hipcc`, which can compile HIP-accelerated applications, including both the host and the device code they contain. For the purposes of this lab, discussion of `hipcc` will be pragmatically scoped to suit our immediate needs. After completing the lab, anyone interested in a deeper dive into `hipcc` can start with [the documentation](https://rocm.docs.amd.com/projects/HIPCC/en/latest/index.html).

`hipcc` will be very familiar to experienced `gcc` users. Compiling, for example, a `some-HIP.cpp` file, is simply:

`hipcc -o out some-HIP.cpp`
  - `hipcc` is the command that invokes the compiler.
  - `some-HIP.cpp` is passed as the file to compile.
  - The `-o` flag specifies the output file for the compiled program, here `out`.

---
### Exercise: Accelerate Vector Addition Application

The following challenge involves accelerating a CPU-only vector addition program, which, while not the most sophisticated program, will give you an opportunity to focus on what you have learned about GPU-accelerating an application with HIP.

[`01_vector_addition.cpp`](./examples/01_vector_addition/vector_addition.cpp) contains a functioning CPU-only vector addition application. You need to write a HIP kernel that runs on the GPU and performs the addition in parallel.

![vec_add_01](./images/vec_add_01.png)

![vec_add_02](./images/vec_add_02.png)


```cpp
/* --------------------------------------------------
Include lib
-------------------------------------------------- */
#include <stdio.h>
#include <stdlib.h>   /* malloc, free, exit */
#include <math.h>
#include "hip/hip_runtime.h"
```

```cpp
/* --------------------------------------------------
vector addition kernel
-------------------------------------------------- */
__global__ void vector_addition(double *A, double *B, double *C, int n)
{
    int id = blockDim.x * blockIdx.x + threadIdx.x;
    if (id < n) C[id] = A[id] + B[id];
}
```
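
The index arithmetic in the kernel can be checked on the CPU. The sketch below is a host-only emulation, not HIP code: it loops over the block and thread indices serially and applies the same `blockDim.x * blockIdx.x + threadIdx.x` formula and `id < n` guard as the kernel. The function name `vector_addition_emulated` and its `grid_dim` and `block_dim` parameters are assumptions made for illustration.

```cpp
#include <cassert>
#include <vector>

// Host-only emulation of the vector_addition kernel.
// Returns the number of emulated threads that passed the `id < n` guard.
int vector_addition_emulated(const std::vector<double>& A,
                             const std::vector<double>& B,
                             std::vector<double>& C,
                             int n, int grid_dim, int block_dim)
{
    assert(n >= 0 && grid_dim > 0 && block_dim > 0);
    assert(static_cast<int>(A.size()) >= n);
    assert(static_cast<int>(B.size()) >= n);
    assert(static_cast<int>(C.size()) >= n);

    int active = 0;
    for (int block_idx = 0; block_idx < grid_dim; ++block_idx) {
        for (int thread_idx = 0; thread_idx < block_dim; ++thread_idx) {
            // Same global index formula as the kernel.
            int id = block_dim * block_idx + thread_idx;
            // Surplus threads in the last block fail the guard and do nothing.
            if (id < n) {
                C[id] = A[id] + B[id];
                ++active;
            }
        }
    }
    return active;
}
```

With `n = 10` and `block_dim = 4`, three blocks give 12 emulated threads. The ids 10 and 11 fail the guard, so exactly 10 elements are written and no out-of-bounds access occurs.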

```cpp
/* --------------------------------------------------
Main program
-------------------------------------------------- */
int main(int argc, char *argv[]){

    /* Size of array */
    int N = 1024 * 1024;

    /* Bytes in array in double precision */
    size_t bytes = N * sizeof(double);

    /* Allocate memory for host arrays */
    double *h_A = (double*)malloc(bytes);
    double *h_B = (double*)malloc(bytes);
    double *h_C = (double*)malloc(bytes);

    /* Initialize host arrays */
    for(int i=0; i<N; i++){
        h_A[i] = sin(i) * sin(i); 
        h_B[i] = cos(i) * cos(i);
        h_C[i] = 0.0;
    }    

    /* Allocate memory for device arrays */
    double *d_A, *d_B, *d_C;
    hipMalloc(&d_A, bytes);
    hipMalloc(&d_B, bytes);
    hipMalloc(&d_C, bytes);

    /* Copy data from host arrays to device arrays */
    hipMemcpy(d_A, h_A, bytes, hipMemcpyHostToDevice);
    hipMemcpy(d_B, h_B, bytes, hipMemcpyHostToDevice);
    hipMemcpy(d_C, h_C, bytes, hipMemcpyHostToDevice);

    /* Set kernel configuration parameters
           thr_per_blk: number of threads per thread block
           blk_in_grid: number of thread blocks in grid */
    int thr_per_blk = 256;
    int blk_in_grid = ceil( float(N) / thr_per_blk );

    /* Launch vector addition kernel */
    vector_addition<<<blk_in_grid, thr_per_blk>>>(d_A, d_B, d_C, N);

    /* Copy data from device array to host array (only need result, d_C) */
    hipMemcpy(h_C, d_C, bytes, hipMemcpyDeviceToHost);

    /* Check for correct results */
    double sum       = 0.0;
    double tolerance = 1.0e-14;

    for(int i=0; i<N; i++){
        sum = sum + h_C[i];
    } 

    if( fabs( (sum / N) - 1.0 ) > tolerance ){
        printf("Error: Sum/N = %0.2f instead of ~1.0\n", sum / N);
        exit(1);
    }

    /* Free CPU memory */
    free(h_A);
    free(h_B);
    free(h_C);

    /* Free Device memory */
    hipFree(d_A);
    hipFree(d_B);
    hipFree(d_C);

    printf("\n==============================\n");
    printf("__SUCCESS__\n");
    printf("------------------------------\n");
    printf("N                : %d\n", N);
    printf("Blocks in Grid   : %d\n", blk_in_grid);
    printf("Threads per Block: %d\n", thr_per_blk);
    printf("==============================\n\n"); 

    return 0;
}
```
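
The program above computes the block count with `ceil( float(N) / thr_per_blk )`. A common alternative, sketched here as plain host C++, rounds up using integer arithmetic only. Both function names are hypothetical. For this lab's `N = 1024 * 1024` and 256 threads per block the two agree on 4096 blocks, but a single-precision `float` cannot represent every `int` above 2^24, so the results can diverge for larger arrays.

```cpp
#include <cassert>
#include <cmath>

// Block count as computed in the lab: round up via single-precision division.
int blocks_via_float(int n, int threads_per_block)
{
    assert(n > 0 && threads_per_block > 0);
    return static_cast<int>(std::ceil(float(n) / threads_per_block));
}

// Integer alternative: round up without leaving integer arithmetic.
// Assumes n + threads_per_block - 1 does not overflow int.
int blocks_via_int(int n, int threads_per_block)
{
    assert(n > 0 && threads_per_block > 0);
    return (n + threads_per_block - 1) / threads_per_block;
}
```

As an example of the divergence, on platforms with IEEE 754 single precision, `n = 16777217` (2^24 + 1) is rounded to 16777216 when converted to `float`, so the float version requests 65536 blocks of 256 threads, one block short of covering the final element, while the integer version requests 65537.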

In [2]:
!mkdir build
!hipcc --offload-arch=gfx908 -Wno-unused-result -o build/vector_add examples/01_vector_addition/vector_addition.cpp && build/vector_add

Error: Sum/N = 0.00 instead of ~1.0
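
The correctness check relies on the identity sin²(i) + cos²(i) = 1, so every element of the result should be 1 and `Sum/N` should be close to 1.0. The output above reports `Sum/N = 0.00`, which is what the check would print if `h_C` still held the zeros written during initialization. As a point of comparison, the following host-only sketch computes the same quantity entirely on the CPU. It is not part of the lab's source, and `reference_mean` is a hypothetical name.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// CPU-only reference: fill the inputs as the lab does, add them on the host,
// and return Sum/N, the quantity inspected by the lab's correctness check.
double reference_mean(int n)
{
    assert(n > 0);
    std::vector<double> c(n);
    for (int i = 0; i < n; ++i) {
        double x = static_cast<double>(i);
        double a = std::sin(x) * std::sin(x);
        double b = std::cos(x) * std::cos(x);
        c[i] = a + b;  // sin^2 + cos^2 = 1, up to floating-point rounding
    }
    double sum = 0.0;
    for (double v : c) {
        sum += v;
    }
    return sum / n;
}
```

Comparing a GPU result against a CPU reference like this one helps separate a wrong kernel from a kernel whose results never reached the host.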
