<div align="center"><h1>ROCm GPU Camp --- HIP C/C++</h1></div>

---
## Prerequisites

This lab is meant to serve as a set of hands-on exercises following the associated Introduction to HIP lecture, so there is an expectation that you have a basic understanding of the HIP programming model. If you already have experience with other GPU programming models, you should be able to easily follow along.

[comment]: < To get the most out of this lab, you should have already watched the associated lecture and be able to: >

[comment]: < - Declare variables, write loops, and use if / else statements in C. >
[comment]: < - Define and invoke functions in C. >
[comment]: < - Allocate arrays in C. >

[comment]: < No previous HIP knowledge is required. >

---
## Objectives

By the time you complete this lab, you will be able to:

- Write, compile, and run C/C++ programs that both call CPU functions and **launch** GPU **kernels**.
- Control parallel **thread hierarchy** using **execution configuration**.
- Refactor serial loops to execute their iterations in parallel on a GPU.
- Allocate and free memory available to both CPUs and GPUs.
- Handle errors generated by HIP code.
- Accelerate CPU-only applications.

---
## Accelerated Systems

*Accelerated systems*, also referred to as *heterogeneous systems*, are those composed of both CPUs and GPUs. Accelerated systems run CPU programs which, in turn, launch functions that benefit from the massive parallelism provided by GPUs. This lab environment is an accelerated system that includes an AMD GPU. Information about this GPU can be queried with the `rocm-smi` (*ROCm System Management Interface*) command line tool. Issue the `rocm-smi` command now by clicking on the code execution cell below and pressing `SHIFT + ENTER`. You will find these cells throughout this lab any time you need to execute commands. The output from running the command will be printed just below the code execution cell after the code runs. In the output, you will see, for example, the GPU's VRAM usage, compute utilization (`GPU%`), and temperature.

In [None]:
!rocm-smi

---
## Review: Writing Application Code for the GPU

HIP is a C++ Runtime API and Kernel Language that allows developers to create portable applications for AMD and NVIDIA GPUs from single source code.

Below is a `.cpp` file. Besides `main`, it contains two functions: the first runs on the CPU, and the second runs on the GPU. Spend a little time identifying the differences between the functions, both in how they are defined and in how they are invoked.

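```c++
#include <stdio.h>
#include "hip/hip_runtime.h"

// Host function: runs on the CPU.
void CPUFunction()
{
  printf("This function is defined to run on the CPU.\n");
}

// Kernel: runs on the GPU.
__global__ void GPUFunction()
{
  printf("This function is defined to run on the GPU.\n");
}

int main()
{
  CPUFunction();

  // Launch the kernel with 1 block containing 1 thread.
  GPUFunction<<<1, 1>>>();

  // Block the host until the kernel has finished.
  hipDeviceSynchronize();
}
```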

Here are some important lines of code to highlight, as well as some other common terms used in accelerated computing:

`__global__ void GPUFunction()`
  - The `__global__` keyword indicates that the following function will run on the GPU and can be invoked **globally**. In CUDA, this means either by the CPU or by the GPU; HIP on AMD GPUs currently supports launching kernels from the CPU only.
  - Often, code executed on the CPU is referred to as **host** code, and code running on the GPU is referred to as **device** code.
  - Notice the return type `void`. It is required that functions defined with the `__global__` keyword return type `void`.

`GPUFunction<<<1, 1>>>();`
  - A function that is launched to run on the GPU is typically called a **kernel**.
  - When launching a kernel, we must provide an **execution configuration**, which is done by using the `<<< ... >>>` syntax just prior to passing the kernel any expected arguments.
  - At a high level, execution configuration allows programmers to specify the **thread hierarchy** for a kernel launch, which defines the number of thread groupings (called **blocks**), as well as how many **threads** to execute in each block. Execution configuration will be explored at great length later in the lab, but for the time being, notice the kernel is launching with `1` block of threads (the first execution configuration argument) which contains `1` thread (the second configuration argument).
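
To make the thread hierarchy concrete, the sketch below emulates an execution configuration on the CPU with two nested loops. It is an illustration only, not HIP code: `emulate_launch` is a hypothetical helper, and on a real GPU the threads run in parallel rather than in loop order. The index formula is the same one the vector addition kernel uses later in this lab.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical CPU-side emulation of kernel<<<num_blocks, threads_per_block>>>().
// Returns the global index that each emulated thread would compute.
std::vector<int> emulate_launch(int num_blocks, int threads_per_block)
{
    std::vector<int> global_ids;
    global_ids.reserve(static_cast<std::size_t>(num_blocks) * threads_per_block);

    // Outer loop plays the role of blockIdx.x, inner loop of threadIdx.x.
    for (int block_idx = 0; block_idx < num_blocks; ++block_idx) {
        for (int thread_idx = 0; thread_idx < threads_per_block; ++thread_idx) {
            // Same formula as blockDim.x * blockIdx.x + threadIdx.x
            global_ids.push_back(threads_per_block * block_idx + thread_idx);
        }
    }
    return global_ids;
}
```

For example, `emulate_launch(1, 1)` yields the single index `0`, matching the `<<<1, 1>>>` launch above, while `emulate_launch(2, 4)` yields the indices `0` through `7`.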

`hipDeviceSynchronize();`
  - Unlike most C/C++ function calls, a kernel launch is **asynchronous**: the CPU code continues to execute *without waiting for the kernel to complete*.
  - A call to `hipDeviceSynchronize`, a function provided by the HIP runtime, will cause the host (CPU) code to wait until the device (GPU) code completes, and only then resume execution on the CPU.

---
### Compiling and Running Accelerated HIP Code

The HIP platform ships with the [**HIP Compiler Driver**](https://rocm.docs.amd.com/projects/HIPCC/en/latest/index.html) `hipcc`, which compiles HIP-accelerated applications, including both the host and the device code they contain. For the purposes of this lab, discussion of `hipcc` will be pragmatically scoped to suit our immediate needs. After completing the lab, anyone interested in a deeper dive into `hipcc` can start with [the documentation](https://rocm.docs.amd.com/projects/HIPCC/en/latest/index.html).

`hipcc` will be very familiar to experienced `gcc` users. Compiling, for example, a `some-HIP.cpp` file, is simply:

`hipcc -o out some-HIP.cpp`
  - `hipcc` is the command that invokes the HIP compiler driver.
  - `some-HIP.cpp` is passed as the file to compile.
  - The `-o` flag is used to specify the output file for the compiled program.

---
### Example: HIP Vector Addition

The following example serves two purposes: (i) to show how to compile and run HIP code from within the notebook, and (ii) to serve as a helpful reference when completing the exercises further below.

> NOTE: The code shown below is simply for reference. It is a copy-paste from the actual code that you will compile and run.

You can find the source code by following this link: [`./examples/02_vector_addition_error_check/vector_addition.cpp`](./examples/02_vector_addition_error_check/vector_addition.cpp)

[comment]: < [`01_vector_addition.cpp`](./examples/01_vector_addition/vector_addition.cpp) contains a functioning CPU-only vector addition application. You need to write a HIP kernel on the GPU and to do its work in parallel. >

[comment]: < ![vec_add_01](./images/vec_add_01.png) >

[comment]: < ![vec_add_02](./images/vec_add_02.png) >


<br/>

```c
#include <stdio.h>
#include <math.h>
#include "hip/hip_runtime.h"

/* Macro for checking GPU API return values */
#define gpuCheck(call)                                                                          \
do{                                                                                             \
    hipError_t gpuErr = call;                                                                   \
    if(hipSuccess != gpuErr){                                                                   \
        printf("GPU API Error - %s:%d: '%s'\n", __FILE__, __LINE__, hipGetErrorString(gpuErr)); \
        exit(1);                                                                                \
    }                                                                                           \
}while(0)

/* --------------------------------------------------
Vector addition kernel
-------------------------------------------------- */
__global__ void vector_addition(double *A, double *B, double *C, int n)
{
    int id = blockDim.x * blockIdx.x + threadIdx.x;
    if (id < n) C[id] = A[id] + B[id];
}

/* --------------------------------------------------
Main program
-------------------------------------------------- */
int main(int argc, char *argv[]){

    /* Size of array */
    int N = 1024 * 1024;

    /* Bytes in array in double precision */
    size_t bytes = N * sizeof(double);

    /* Allocate memory for host arrays */
    double *h_A = (double*)malloc(bytes);
    double *h_B = (double*)malloc(bytes);
    double *h_C = (double*)malloc(bytes);

    /* Initialize host arrays */
    for(int i=0; i<N; i++){
        h_A[i] = sin(i) * sin(i);
        h_B[i] = cos(i) * cos(i);
        h_C[i] = 0.0;
    }

    /* Allocate memory for device arrays */
    double *d_A, *d_B, *d_C;
    gpuCheck( hipMalloc(&d_A, bytes) );
    gpuCheck( hipMalloc(&d_B, bytes) );
    gpuCheck( hipMalloc(&d_C, bytes) );

    /* Copy data from host arrays to device arrays */
    gpuCheck( hipMemcpy(d_A, h_A, bytes, hipMemcpyHostToDevice) );
    gpuCheck( hipMemcpy(d_B, h_B, bytes, hipMemcpyHostToDevice) );
    gpuCheck( hipMemcpy(d_C, h_C, bytes, hipMemcpyHostToDevice) );

    /* Set kernel configuration parameters
           thr_per_blk: number of threads per thread block
           blk_in_grid: number of thread blocks in grid */
    int thr_per_blk = 256;
    int blk_in_grid = ceil( float(N) / thr_per_blk );

    /* Launch vector addition kernel */
    vector_addition<<<blk_in_grid, thr_per_blk>>>(d_A, d_B, d_C, N);

    /* Check for kernel launch errors */
    gpuCheck( hipGetLastError() );

    /* Check for kernel execution errors */
    gpuCheck ( hipDeviceSynchronize() );

    /* Copy data from device array to host array (only need result, d_C) */
    gpuCheck( hipMemcpy(h_C, d_C, bytes, hipMemcpyDeviceToHost) );

    /* Check for correct results */
    double sum       = 0.0;
    double tolerance = 1.0e-14;

    for(int i=0; i<N; i++){
        sum = sum + h_C[i];
    }

    if( fabs( (sum / N) - 1.0 ) > tolerance ){
        printf("Error: Sum/N = %0.2f instead of ~1.0\n", sum / N);
        exit(1);
    }

    /* Free CPU memory */
    free(h_A);
    free(h_B);
    free(h_C);

    /* Free Device memory */
    gpuCheck( hipFree(d_A) );
    gpuCheck( hipFree(d_B) );
    gpuCheck( hipFree(d_C) );

    printf("\n==============================\n");
    printf("__SUCCESS__\n");
    printf("------------------------------\n");
    printf("N                : %d\n", N);
    printf("Blocks in Grid   : %d\n",  blk_in_grid);
    printf("Threads per Block: %d\n",  thr_per_blk);
    printf("==============================\n\n");

    return 0;
}
```
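
The launch configuration in the example rounds `N / thr_per_blk` up so that every element gets a thread, and the `if (id < n)` guard in the kernel keeps any surplus threads in the last block from touching memory outside the arrays. The host-side sketch below shows the same rounding with integer arithmetic. The helper names `blocks_in_grid` and `idle_threads` are hypothetical and are not part of the example source. An integer form is a common alternative to `ceil( float(N) / thr_per_blk )` because it does not depend on `N` being exactly representable as a `float`.

```cpp
// Number of blocks needed to cover n elements (integer ceiling division).
// Assumes n >= 0, threads_per_block > 0, and that the sum does not overflow int.
int blocks_in_grid(int n, int threads_per_block)
{
    return (n + threads_per_block - 1) / threads_per_block;
}

// Threads in the last block that fail the `id < n` check and do no work.
int idle_threads(int n, int threads_per_block)
{
    return blocks_in_grid(n, threads_per_block) * threads_per_block - n;
}
```

With the values from the example (`N = 1024 * 1024` and 256 threads per block), `blocks_in_grid` gives 4096 and `idle_threads` gives 0. For `n = 1000`, it gives 4 blocks with 24 idle threads.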

<br/>
This is where you actually compile and run the vector addition code:

In [None]:
!mkdir -p build
!hipcc --offload-arch=gfx908 -o build/vector_add_basic examples/02_vector_addition_error_check/vector_addition.cpp && build/vector_add_basic

---

## Exercises

The following exercises are meant to help solidify the information you learned in the lecture. Please proceed through each exercise and ask for help where needed.

### Exercise01: Error check

In this exercise, you will be working with a HIP program that adds 2 vectors together element-wise. This is the same code as the example above.

You can find the source code by following this link: [`./exercises/01_error_check/vector_addition.cpp`](./exercises/01_error_check/vector_addition.cpp)

#### Background:

The program performs the following steps:

1. Defines a kernel function vector_addition that adds two arrays together element-wise.
2. Allocates memory on the host (CPU) for 3 arrays (h_A, h_B, h_C) of size N.
3. Initializes the host array elements to `sin*sin`, `cos*cos`, and `0.0`, respectively.
4. Allocates memory on the device (GPU) for 3 arrays (d_A, d_B, d_C) of the same size.
5. Copies the data from the host arrays to the device arrays.
6. Configures the kernel launch parameters (number of threads per block and number of blocks in the grid).
7. Launches the kernel to execute on the device (adds two arrays; or vectors).
8. Copies the results back from the device array to the host array.
9. Verifies the results.
10. Frees the allocated memory on both the host and the device.
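
Step 9 relies on the identity sin²(i) + cos²(i) = 1: every element of the result should be 1, so the mean of the result should be 1 up to floating-point rounding. A CPU-only sketch of that check, using the hypothetical helper name `mean_of_expected_sum`, might look like this:

```cpp
#include <cmath>

// CPU reference for the verification step: the average of
// sin^2(i) + cos^2(i) over n elements. Assumes n > 0.
double mean_of_expected_sum(int n)
{
    double sum = 0.0;
    for (int i = 0; i < n; ++i) {
        double x = static_cast<double>(i);
        double a = std::sin(x) * std::sin(x);   // value stored in h_A[i]
        double b = std::cos(x) * std::cos(x);   // value stored in h_B[i]
        sum += a + b;                           // value expected in h_C[i]
    }
    return sum / n;
}
```

The program compares the equivalent quantity computed from the GPU result against `1.0` with a small tolerance, which is why a failed or missing step shows up as a verification error.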

#### Task:

Your task is to find the error in the code.

#### Instructions:

1. Compile and run the code (using the code cell below) and make note of the error message (generated from our error-checking macro).
2. Open the source file and use the error message to help locate the problem.
3. Fix the problem by editing the file.
4. Re-compile and run the program to ensure it executes correctly and prints `__SUCCESS__`, which indicates the problem has been resolved.
5. If the program does not compile, or it gives incorrect results (in which case it prints an error message), check the code and try again.

By completing this task, you will demonstrate your understanding of using a GPU error checking macro to find and resolve a simple code error.
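
The macro pattern itself does not depend on HIP. The sketch below reproduces it with a mock status type so that it can be compiled with any C++ compiler. Everything prefixed with `mock` or `MOCK` is hypothetical and stands in for the real `hipError_t`, `hipSuccess`, and `hipGetErrorString`. Unlike `gpuCheck`, it counts failures instead of calling `exit(1)`, purely so the behavior is easy to observe.

```cpp
#include <cstdio>

// Mock stand-ins for hipError_t / hipSuccess / hipGetErrorString (hypothetical).
enum mockError_t { mockSuccess = 0, mockErrorInvalidValue = 1 };

inline const char* mockGetErrorString(mockError_t err)
{
    return err == mockSuccess ? "no error" : "invalid argument";
}

// Number of failures seen so far (the real macro exits instead).
static int mock_failure_count = 0;

// Wrapping the body in do { ... } while (0) makes the macro behave like a
// single statement, so it is safe inside an unbraced if / else.
#define MOCK_CHECK(call)                                              \
do {                                                                  \
    mockError_t err_ = (call);                                        \
    if (mockSuccess != err_) {                                        \
        std::printf("Mock API Error - %s:%d: '%s'\n",                 \
                    __FILE__, __LINE__, mockGetErrorString(err_));    \
        ++mock_failure_count;                                         \
    }                                                                 \
} while (0)

// A mock API call that fails when given a negative size.
inline mockError_t mockAllocate(long bytes)
{
    return bytes < 0 ? mockErrorInvalidValue : mockSuccess;
}
```

Calling `MOCK_CHECK(mockAllocate(-1));` prints the file, line, and error string of the failing call, which is the same kind of information `gpuCheck` reports in step 1 of the instructions.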


In [None]:
# Run and test your program 
!hipcc --offload-arch=gfx908 -o ./build/error_check ./exercises/01_error_check/vector_addition.cpp && ./build/error_check

---
### Exercise02: Add the device-to-host data transfer

In this exercise, you will be working with a HIP program that adds `1` to all elements of a 1D array of `int`s. As you can see below, the program has a structure very similar to that of the vector addition program from the exercise above.

You can find the source code by following this link: [`./exercises/02_add_d2h_data_transfer/add_one.cpp`](./exercises/02_add_d2h_data_transfer/add_one.cpp)

#### Background:

The program performs the following steps:

1. Defines a kernel function add_one that adds one to each element of an integer array.
2. Allocates memory on the host (CPU) for an array (h_A) of size N.
3. Initializes the host array elements to zero.
4. Allocates memory on the device (GPU) for an array (d_A) of the same size.
5. Copies the data from the host array to the device array.
6. Configures the kernel launch parameters (number of threads per block and number of blocks in the grid).
7. Launches the kernel to execute on the device (adds 1 to each array element).
8. Copies the results back from the device array to the host array.
9. Verifies the results.
10. Frees the allocated memory on both the host and the device.

#### Task:

Your task is to implement the device-to-host data transfer by adding the HIP API call.

#### Instructions:

1. Locate the section in the main program marked TODO.
2. Add the HIP device-to-host data transfer API call (see call below).
3. Compile and run the program (using the code cell below) to ensure it executes correctly and prints SUCCESS if all elements in the array have been increased by 1.
4. If any element is not correctly incremented, the program should print an error message specifying the index and incorrect value. Find the problem, then compile and run the program again.

By completing this task, you will demonstrate your understanding of performing basic data transfers using a HIP API call.

Below is the [hipMemcpy](https://rocm.docs.amd.com/projects/HIP/en/latest/doxygen/html/group___memory.html#gac1a055d288302edd641c6d7416858e1e) API call you will use for the data transfer. You can also click the link to the official documentation to see the full API. Use the existing `hipMemcpy` call from step 5. as a reference.

```c
hipError_t hipMemcpy( void*         destination_buffer,
                      const void*   source_buffer,
                      size_t        num_bytes_to_copy,
                      hipMemcpyKind kind
                    )
```



In [None]:
# Run and test your program 
!hipcc --offload-arch=gfx908 -o build/add_one exercises/02_add_d2h_data_transfer/add_one.cpp && build/add_one

---
### Exercise03: Square each element of an array

In this exercise, you will be working with a HIP program that squares each element of an array using the GPU. The program is mostly complete, but there is one part that you need to implement. 

You can find the source code by following this link: [`./exercises/03_complete_square_elements/square_elements.cpp`](./exercises/03_complete_square_elements/square_elements.cpp).

#### Background:

The program performs the following steps:

1. Defines a kernel function square_elements that squares each element of an integer array.
2. Allocates memory on the host (CPU) for an array of size N.
3. Initializes the host array elements to their index values.
4. Allocates memory on the device (GPU) for an array of the same size.
5. Copies the data from the host array to the device array.
6. Configures the kernel launch parameters (number of threads per block and number of blocks in the grid).
7. Launches the kernel to execute on the device (squares all elements of the array).
8. Copies the results back from the device array to the host array.
9. Verifies the results.
10. Frees the allocated memory on both the host and the device.

#### Task:

Your task is to complete the kernel function that squares each element of the array. Remember that the kernel should only square the element if its index is within bounds.

#### Instructions:

1. Locate the section in the kernel function square_elements marked TODO.
2. Implement the logic to square each element of the array.
3. Compile and run the program (using the code cell below) to ensure it executes correctly and prints `__SUCCESS__` if all elements in the array have been squared properly.
4. If any element is not correctly squared, the program should print an error message specifying the index and incorrect value.

By completing this task, you will demonstrate your understanding of writing and executing basic HIP kernels for element-wise operations on arrays.

In [None]:
# Run and test your program 
!hipcc --offload-arch=gfx908 -o build/squares exercises/03_complete_square_elements/square_elements.cpp && build/squares

---
### Exercise04: Multiply two square matrices

In this exercise, you will be working with a HIP program that multiplies two square matrices using GPU acceleration. The program is mostly complete, but there are a few parts that you need to implement. 

You can find the source code in [`./exercises/04_complete_matrix_multiply/matrix_multiply.cpp`](./exercises/04_complete_matrix_multiply/matrix_multiply.cpp).

#### Background:

The program performs the following steps:

1. Defines a kernel function matrix_multiply that multiplies two NxN matrices.
2. Allocates memory on the host (CPU) for three NxN matrices: A, B, and C.
3. Initializes the host matrices A and B with specific values and sets C to zero.
4. Allocates memory on the device (GPU) for the matrices.
5. Copies the data from the host matrices to the device matrices.
6. Configures the kernel launch parameters (number of threads per block and number of blocks in the grid).
7. Launches the kernel to execute on the device (multiply 2 matrices).
8. Copies the result back from the device matrix C to the host matrix.
9. Verifies the results.
10. Frees the allocated memory on both the host and the device.

#### Task:

Your task is to complete the kernel function that performs the matrix multiplication. Specifically, you need to:

1. Identify the correct elements of matrices A and B to multiply.

2. Store the result back into matrix C.

#### Instructions:

1. Locate the TODO comments in the kernel function matrix_multiply.
2. Determine the correct indices for accessing the elements of matrices A and B.
3. Write the code to store the computed value into matrix C.
4. Compile and run the program to ensure it executes correctly and prints `__SUCCESS__` if all elements in the result matrix C are computed correctly.
5. If any element is not computed correctly, the program should print an error message specifying the index and incorrect value.

By completing this task, you will demonstrate your understanding of writing and executing basic HIP kernels for matrix operations.


In [None]:
# Run and test your program 
!hipcc --offload-arch=gfx908 -o build/matrix_multiply exercises/04_complete_matrix_multiply/matrix_multiply.cpp && build/matrix_multiply

---
### Exercise05: Compare performance against the hipBLAS version of DGEMM

In this exercise, you will be working with a HIP program that performs two matrix multiplications: one uses the same `matrix_multiply` kernel as the previous exercise, and the other uses the hipBLAS GPU-accelerated linear algebra library. The performance of the two implementations is then compared.

> NOTE: You will not need to make any code changes. Instead, you will simply compile the program, run it under the `rocprof` profiling tool, and parse the results. 

#### Background:

The program performs the following steps:

1. Defines a kernel function matrix_multiply that multiplies two NxN matrices.
2. Allocates memory on the host (CPU) for four NxN matrices: h_A, h_B, h_C, and hlib_C (to capture the hipBLAS result passed back from the GPU).
3. Initializes the host matrices h_A and h_B so their matrix product will be 1.0, and the others to 0.0.
4. Allocates memory on the device (GPU) for six NxN matrices: d_A, d_B, d_C (for the matrix multiply kernel), and dlib_A, dlib_B, dlib_C (for the hipBLAS call).
5. Copies the data from the host matrices to d_A, d_B, and d_C, and to dlib_A, dlib_B, and dlib_C.
6. Configures the kernel launch parameters (number of threads per block and number of blocks in the grid).
7. Launches the matrix_multiply kernel on the device (using d_A, d_B, d_C).
8. Launches the `hipblasDgemm` call on the device (using dlib_A, dlib_B, dlib_C).
9. Copies the results back from the device matrices d_C and dlib_C to the corresponding host matrices.
10. Verifies the results.
11. Frees the allocated memory on both the host and the device.

#### Task:

Your task is to compile and run the program and compare the results.

#### Instructions:

1. Compile and run the program using the first code cell below.
2. Copy the `parse_output.py` script from its source directory to the current directory by running the second code cell below.
3. Run the `parse_output.py` script in the third code cell below to compare the results.

It should be clear from the performance difference that using the optimized libraries is typically preferred over writing your own kernel. By completing this task, you will have gained knowledge on compiling and running HIP programs that use GPU-accelerated libraries.
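
To put the two timings on a common scale, it can help to convert them to a floating-point rate. Multiplying two NxN matrices with the standard algorithm takes about 2N³ floating-point operations (roughly N multiplies and N adds per output element). The sketch below applies that count. `dgemm_gflops` and `speedup` are hypothetical helper names, and the kernel times must be read from the profiler output yourself.

```cpp
// Approximate GFLOP/s for one N x N matrix multiply that took `seconds`.
// Uses the standard 2 * N^3 operation count. Assumes seconds > 0.
double dgemm_gflops(int n, double seconds)
{
    double flops = 2.0 * n * n * n;   // evaluated in double, so no integer overflow
    return flops / seconds / 1.0e9;
}

// How many times faster the library call is than the hand-written kernel.
// Assumes library_seconds > 0.
double speedup(double kernel_seconds, double library_seconds)
{
    return kernel_seconds / library_seconds;
}
```

For instance, a multiply with `n = 1000` that takes one second corresponds to about 2 GFLOP/s, and a kernel time of 4 ms against a library time of 0.5 ms is a speedup of 8.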

> NOTES: See first code cell below.
> * To use the hipBLAS API call, the program must be linked against the hipBLAS library when compiling, using `-L${ROCM_PATH}/hipblas/lib -lhipblas` (the cell below points `-L` at `/mnt` instead).
> * To profile the code, `rocprof --stats --hip-trace` is added just before the executable when running.

In [None]:
# Run and test your program 
!export HIPBLAS=../ && export LD_LIBRARY_PATH=/mnt:$LD_LIBRARY_PATH && hipcc --offload-arch=gfx908 -I${HIPBLAS} -L/mnt -lhipblas -o build/matrix_multiply_compare exercises/05_compare_with_library/matrix_multiply.cpp && rocprof --stats --hip-trace build/matrix_multiply_compare

In [None]:
!cp ./exercises/05_compare_with_library/parse_output.py .

In [None]:
!python parse_output.py > output.txt && cat output.txt

---
### Exercise06: Hipify the CUDA pingpong code

In this exercise, you will be working with a CUDA (and then HIP) program that copies data back and forth between the host (CPU) and device (GPU) 50 times to measure the achieved bandwidth between them (this is often referred to as a "ping pong"). The source code is a CUDA file that you will `hipify` into a HIP code.

You can find the source code in [./exercises/06_hipify_pingpong/pingpong.cu](./exercises/06_hipify_pingpong/pingpong.cu)

> NOTE: You will not need to make any changes to the CUDA code. Instead, you will generate HIP code from the CUDA code and add an include of the HIP runtime header file.

#### Background:

The program performs the following steps within a loop over buffer sizes ranging from 8 KiB to 1 GiB:

> NOTE: The steps below are performed for both host-to-device (H2D) and device-to-host (D2H) transfers.

1. Allocates memory on the host (CPU) in pinned (page-locked) memory for a buffer (array; `h_A`) of size `bytes`.
2. Allocates memory on the device (GPU) for a buffer (array; `d_A`) of size `bytes`.
3. Initializes the host array with random numbers.
4. Runs a warm-up loop that ping pongs the data between the host and device 5 times.
5. Runs a timed loop that ping pongs the data between the host and device 50 times.
6. Deallocates memory on the host and device.
7. Prints achieved bandwidth for ping pong on buffer of size `bytes`.

#### Task:

Your task is to `hipify` the CUDA source file into a HIP source file, add the include line for the HIP runtime header file, and compile and run the resulting HIP code.

#### Instructions:

1. Run the `ls` command in the first code cell below to see the contents of the `06_hipify_pingpong` directory.
   * The contents should originally be `Makefile`, `pingpong.cu`, `README.md`, and `submit.sh`.
2. Run `hipify-perl` with the `-examine` flag on the `.cu` file in the second code cell below to show "what would happen" if you actually `hipify` it.
3. Now actually `hipify` the `.cu` file by running the third code cell below.
4. Run the `ls` command again by running the fourth code cell below to see the newly-created HIP (`.cpp`) file.
5. Open the newly-created [./exercises/06_hipify_pingpong/pingpong.cpp](./exercises/06_hipify_pingpong/pingpong.cpp) file and add `#include <hip/hip_runtime.h>` to include the HIP runtime header file.
   * NOTE: This file will not exist until you `hipify` the CUDA source code.
6. Compile and run the HIP program using the fifth code cell below.

Recall that the CPU and GPU are connected with PCIe4 (x16), which has a peak bandwidth of 32 GB/s. What percentage of the peak performance do we achieve? By completing this task, you will demonstrate your understanding of `hipify`-ing basic CUDA codes into HIP.
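
One way to answer the question is to turn the measured time into a bandwidth and divide by the peak. The sketch below assumes each ping-pong iteration moves the buffer once in each direction (so `2 * bytes` per iteration) and uses decimal gigabytes (1 GB = 10⁹ bytes) to match the 32 GB/s figure. The helper names are hypothetical, and the bandwidth printed by `pingpong.cpp` may be computed differently.

```cpp
#include <cstddef>

// Achieved bandwidth in GB/s for `loop_count` ping-pong iterations.
// Assumption: each iteration copies `bytes` host-to-device and back again.
// Assumes seconds > 0.
double pingpong_bandwidth_gbs(std::size_t bytes, int loop_count, double seconds)
{
    double bytes_moved = 2.0 * static_cast<double>(bytes) * loop_count;
    return bytes_moved / seconds / 1.0e9;
}

// Achieved bandwidth as a percentage of the peak. Assumes peak_gbs > 0.
double percent_of_peak(double achieved_gbs, double peak_gbs)
{
    return 100.0 * achieved_gbs / peak_gbs;
}
```

For example, an achieved bandwidth of 24 GB/s against the 32 GB/s peak is `percent_of_peak(24.0, 32.0)`, or 75%.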

In [None]:
!ls ./exercises/06_hipify_pingpong/

In [None]:
!hipify-perl -examine ./exercises/06_hipify_pingpong/pingpong.cu

In [None]:
!hipify-perl ./exercises/06_hipify_pingpong/pingpong.cu > ./exercises/06_hipify_pingpong/pingpong.cpp

In [None]:
!ls ./exercises/06_hipify_pingpong/

> NOTE: Remember to add the HIP runtime header file to the [./exercises/06_hipify_pingpong/pingpong.cpp](./exercises/06_hipify_pingpong/pingpong.cpp) file before attempting to compile and run.

In [None]:
!hipcc --offload-arch=gfx908 -o build/pingpong exercises/06_hipify_pingpong/pingpong.cpp && build/pingpong

---
### Exercise07: Multiply two square matrices using shared memory

In this exercise, you will be working with a HIP program that multiplies two square matrices using shared memory on the GPU for improved performance. The program is mostly complete, but there are a few parts that you need to implement. You can find the source code in [`./exercises/07_matrix_multiply_shared/matrix_multiply.cpp`](./exercises/07_matrix_multiply_shared/matrix_multiply.cpp).

#### Background:

The program performs the following steps:

1. Defines a kernel function matrix_multiply that multiplies two NxN matrices using shared memory.
2. Allocates memory on the host (CPU) for three NxN matrices: A, B, and C.
3. Initializes the host matrices A and B with specific values and sets C to zero.
4. Allocates memory on the device (GPU) for the matrices.
5. Copies the data from the host matrices to the device matrices.
6. Configures the kernel launch parameters (number of threads per block and number of blocks in the grid).
7. Launches the kernel to execute on the device.
8. Copies the result back from the device matrix C to the host matrix.
9. Verifies the results.
10. Frees the allocated memory on both the host and the device.

#### Task:

Your task is to implement the part of the kernel where data is read in from global memory to shared memory.

#### Instructions:

1. Locate the TODO comment in the kernel function matrix_multiply.
2. Read data from global memory into the shared memory arrays s_A and s_B.
3. Compile and run the program to ensure it executes correctly and prints `__SUCCESS__` if all elements in the result matrix C are computed correctly.
4. If any element is not computed correctly, the program should print an error message specifying the index and incorrect value.

By completing this task, you will demonstrate your understanding of writing and executing basic HIP kernels for matrix operations using shared memory to optimize performance.

In [None]:
# Run and test your program 
!hipcc --offload-arch=gfx908 -Wno-unused-result -o build/matrix_multiply_shared exercises/07_matrix_multiply_shared/matrix_multiply.cpp && build/matrix_multiply_shared