**GPU COMPUTING EXERCISE 1**

Author: Olivier Kanamugire

Date: January 08, 2024.

**Task 1.**
Create a custom kernel which takes an empty vector as an input and each thread writes its thread index to the empty vector that is provided to the kernel.

So we would input a vector of zeroes:

[0, 0, 0, 0, 0, 0, ...]

And the output would be a vector with:

[0, 1, 2, 3, 4, 5, ...]

**Explanation of implementation**

We begin by importing the CuPy library to manipulate arrays on the GPU. Second, we define the kernel function. CuPy offers three types of user-defined kernels (elementwise, reduction, and raw), of which a raw kernel was used here [see documentation](https://docs.cupy.dev/en/stable/user_guide/kernel.html). We then initialize the arguments and launch our kernel.


In [2]:
import cupy as cp

MyRawKernel = """
extern "C" __global__
void write_thread_index(int* input_vector, int n) {
    int thread_index = blockIdx.x * blockDim.x + threadIdx.x;
    // More threads are launched than there are elements, so guard the write.
    if (thread_index < n) {
        input_vector[thread_index] = thread_index;
    }
}
"""

write_thread_index_kernel = cp.RawKernel(MyRawKernel, 'write_thread_index')


In [4]:
# Initialize the arguments

input_vector = cp.zeros(10, dtype=cp.int32)
threads_per_block = 256
# Ceiling division: enough blocks to cover every element
blocks_per_grid = (input_vector.size + threads_per_block - 1) // threads_per_block

# Launch the kernel; the length is passed as a 32-bit integer to match `int n`
write_thread_index_kernel(
    (blocks_per_grid,), (threads_per_block,),
    (input_vector, cp.int32(input_vector.size)),
)


print(input_vector)

[0 1 2 3 4 5 6 7 8 9]
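
The launch configuration above can be checked on the CPU. The following is a minimal pure-Python sketch, not part of the original solution: the helper names `launch_config` and `simulate_guarded_kernel` are hypothetical, and the loops only emulate what each GPU thread would compute. It shows that 10 elements with 256 threads per block lead to a single block, and that most of its threads have a global index past the end of the vector, which is why a kernel should check the index before writing.

```python
def launch_config(n, threads_per_block=256):
    """Return (blocks_per_grid, total_threads) for a vector of n elements."""
    # Ceiling division, same formula as in the launch code
    blocks_per_grid = (n + threads_per_block - 1) // threads_per_block
    return blocks_per_grid, blocks_per_grid * threads_per_block


def simulate_guarded_kernel(n, threads_per_block=256):
    """Emulate on the CPU what every launched thread would write."""
    blocks_per_grid, _ = launch_config(n, threads_per_block)
    out = [0] * n
    skipped = 0  # threads whose global index falls outside the vector
    for block_idx in range(blocks_per_grid):
        for thread_idx in range(threads_per_block):
            i = block_idx * threads_per_block + thread_idx  # global thread index
            if i < n:
                out[i] = i
            else:
                skipped += 1
    return out, skipped


out, skipped = simulate_guarded_kernel(10)
print(launch_config(10))  # (1, 256)
print(out)                # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
print(skipped)            # 246
```

Of the 256 threads launched, only 10 have a valid index; the other 246 must do nothing.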


**Task 2.**
Implement a kernel which takes two vectors A and B and adds them together to form a vector C.





**Explanation of implementation**

Here we use CuPy's ElementwiseKernel, which applies the same operation to every element of its inputs. It is simple and well suited to this task.

In [10]:
addition_kernel = cp.ElementwiseKernel(
    'float32 a, float32 b',
    'float32 c',
    'c = a + b',
    'addition_kernel'
)


Let's define two vectors A and B, the inputs to the kernel.

In [11]:
A = cp.arange(10, dtype=cp.float32)
B = cp.arange(10, dtype=cp.float32)

C = addition_kernel(A, B)  # the kernel returns the output array
print(C)

[ 0.  2.  4.  6.  8. 10. 12. 14. 16. 18.]


**Task 3.** Implement a kernel which takes in vectors A and B and C and adds A and B together and multiplies the resulting vector values with the values from C to form a vector D.

**Explanation of implementation**

We can use an elementwise kernel here too. The setup is the same as above; the only difference is the kernel operation. We specify the data types of the input arguments (a, b, c) and the output argument (d) as float32. The expression 'd = (a + b) * c' defines the operation the kernel performs on each element of the input arrays: it adds a and b, then multiplies the result by c, storing the final value in d.

In [12]:
multiply_add_kernel = cp.ElementwiseKernel(
    'float32 a, float32 b, float32 c',
    'float32 d',
    'd = (a + b) * c',
    'multiply_add_kernel'
)


In [14]:
# Example usage

A = cp.arange(10, dtype=cp.float32)
B = cp.arange(10, dtype=cp.float32)
C = cp.arange(10, dtype=cp.float32)

D = multiply_add_kernel(A, B, C) # kernel execution

print(D)

[  0.   2.   8.  18.  32.  50.  72.  98. 128. 162.]
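
As a sanity check, the same formula can be evaluated on the CPU and compared with the printed result. This is a minimal pure-Python sketch under that assumption; `reference_multiply_add` is a hypothetical helper and not part of CuPy.

```python
def reference_multiply_add(a, b, c):
    """CPU reference for d = (a + b) * c, applied element by element."""
    if not (len(a) == len(b) == len(c)):
        raise ValueError("all inputs must have the same length")
    return [(x + y) * z for x, y, z in zip(a, b, c)]


# Same inputs as the example: A = B = C = [0, 1, ..., 9]
values = [float(i) for i in range(10)]
expected = reference_multiply_add(values, values, values)
print(expected)
# [0.0, 2.0, 8.0, 18.0, 32.0, 50.0, 72.0, 98.0, 128.0, 162.0]
```

With A = B = C, every element is 2i², which matches the kernel output above. On a machine with a GPU, the list could be compared against `D.get().tolist()`.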
