<a href="https://colab.research.google.com/github/Mamoro98/Cuda-Programming/blob/main/omer_GPU_Practical4.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<h1>Omer Kamal Ali Ebead</h1>

# **CUDA Programming on NVIDIA GPUs**

# **Practical 4**

Again make sure the correct Runtime is being used, by clicking on the Runtime option at the top, then "Change runtime type", and selecting an appropriate GPU such as the T4.

Then verify the details of the GPU which is available to you, and upload the usual two header files.

In [None]:
!nvidia-smi

Thu Feb 13 09:36:14 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  Tesla T4                       Off |   00000000:00:04.0 Off |                    0 |
| N/A   39C    P8              9W /   70W |       0MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                

In [None]:
!wget https://people.maths.ox.ac.uk/gilesm/cuda/headers/helper_cuda.h
!wget https://people.maths.ox.ac.uk/gilesm/cuda/headers/helper_string.h


--2025-02-13 09:36:16--  https://people.maths.ox.ac.uk/gilesm/cuda/headers/helper_cuda.h
Resolving people.maths.ox.ac.uk (people.maths.ox.ac.uk)... 129.67.184.129, 2001:630:441:202::8143:b881
Connecting to people.maths.ox.ac.uk (people.maths.ox.ac.uk)|129.67.184.129|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 27832 (27K) [text/x-chdr]
Saving to: ‘helper_cuda.h’


2025-02-13 09:36:17 (200 KB/s) - ‘helper_cuda.h’ saved [27832/27832]

--2025-02-13 09:36:17--  https://people.maths.ox.ac.uk/gilesm/cuda/headers/helper_string.h
Resolving people.maths.ox.ac.uk (people.maths.ox.ac.uk)... 129.67.184.129, 2001:630:441:202::8143:b881
Connecting to people.maths.ox.ac.uk (people.maths.ox.ac.uk)|129.67.184.129|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 14875 (15K) [text/x-chdr]
Saving to: ‘helper_string.h’


2025-02-13 09:36:18 (362 KB/s) - ‘helper_string.h’ saved [14875/14875]





---

The next step is to create the file reduction.cu which includes within it a reference C++ routine against which the CUDA results are compared.

# Question 4

In [None]:
%%writefile reduction.cu

////////////////////////////////////////////////////////////////////////
//
// Practical 4 -- initial code for shared memory reduction for
//                a single block which is a power of two in size
//
////////////////////////////////////////////////////////////////////////

#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <math.h>
#include <float.h>

#include <helper_cuda.h>

////////////////////////////////////////////////////////////////////////
// CPU routine
////////////////////////////////////////////////////////////////////////

float reduction_gold(float* idata, int len)
{
  float sum = 0.0f;
  for(int i=0; i<len; i++) sum += idata[i];

  return sum;
}

////////////////////////////////////////////////////////////////////////
// GPU routine
////////////////////////////////////////////////////////////////////////

__global__ void reduction(float *g_odata, float *g_idata)
{
    // dynamically allocated shared memory

    extern  __shared__  float temp[];

    int tid = threadIdx.x;

    // first, each thread loads data into shared memory

    temp[tid] = g_idata[tid];

    // next, we perform binary tree reduction

    for (int d=blockDim.x/2; d>0; d=d/2) {
      __syncthreads();  // ensure previous step completed
      if (tid<d)  temp[tid] += temp[tid+d];
    }

    // finally, first thread puts result into global memory
    if (tid==0) g_odata[0] = temp[0];
}


////////////////////////////////////////////////////////////////////////
// Program main
////////////////////////////////////////////////////////////////////////

int main( int argc, const char** argv)
{
  int num_blocks, num_threads, num_elements, mem_size, shared_mem_size;

  float *h_data, *d_idata, *d_odata;

  // initialise card

  findCudaDevice(argc, argv);

  num_blocks   = 1;  // start with only 1 thread block
  num_threads  = 512;
  num_elements = num_blocks*num_threads;
  mem_size     = sizeof(float) * num_elements;

  // allocate host memory to store the input data
  // and initialize to integer values between 0 and 10

  h_data = (float*) malloc(mem_size);

  for(int i = 0; i < num_elements; i++)
    h_data[i] = floorf(10.0f*(rand()/(float)RAND_MAX));

  // compute reference solution

  float sum = reduction_gold(h_data, num_elements);

  // allocate device memory input and output arrays

  checkCudaErrors( cudaMalloc((void**)&d_idata, mem_size) );
  checkCudaErrors( cudaMalloc((void**)&d_odata, sizeof(float)) );

  // copy host memory to device input array

  checkCudaErrors( cudaMemcpy(d_idata, h_data, mem_size,
                              cudaMemcpyHostToDevice) );

  // execute the kernel

  shared_mem_size = sizeof(float) * num_threads;
  reduction<<<num_blocks,num_threads,shared_mem_size>>>(d_odata,d_idata);
  getLastCudaError("reduction kernel execution failed");

  // copy result from device to host

  checkCudaErrors( cudaMemcpy(h_data, d_odata, sizeof(float),
                              cudaMemcpyDeviceToHost) );

  // check results

  printf("reduction error = %f\n",h_data[0]-sum);
  printf("reduction result = %f\n",h_data[0]);
  printf("reference result = %f\n",sum);

  // cleanup memory

  free(h_data);
  checkCudaErrors( cudaFree(d_idata) );
  checkCudaErrors( cudaFree(d_odata) );

  // CUDA exit -- needed to flush printf write buffer

  cudaDeviceReset();
}


Overwriting reduction.cu



---

We can now compile and run the executable.  Note that the compilation links in the CUDA runtime library (`-lcudart`); this code does not use the cuRAND random number generation library.


In [None]:
!nvcc reduction.cu -o reduction -I. -lineinfo -arch=sm_70 --ptxas-options=-v --use_fast_math -lcudart

ptxas info    : 0 bytes gmem
ptxas info    : Compiling entry function '_Z9reductionPfS_' for 'sm_70'
ptxas info    : Function properties for _Z9reductionPfS_
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 10 registers, 368 bytes cmem[0]


In [None]:
!./reduction

GPU Device 0: "Turing" with compute capability 7.5

reduction error = 0.000000
reduction result = 2351.000000
reference result = 2351.000000


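**Explain why this shows the result is correct, and how the code has performed the required check.**

The kernel uses a binary-tree reduction in shared memory to compute the sum of the 512 elements handled by the single thread block. The host computes the same sum sequentially in `reduction_gold`, and the program prints the difference between the GPU result and this reference. Both are 2351 and the reported error is 0, so the GPU sum agrees with the CPU sum. An exact match is expected here: every input is a small integer stored as a float, so all partial sums are integers far below 2^24 and are represented exactly, whichever order the additions are performed in.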
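To make the comparison concrete, here is a minimal host-only C++ sketch that emulates the kernel's binary-tree reduction and checks it against a sequential sum. It is an illustration rather than the notebook's code: the names `sequential_sum`, `tree_sum` and `make_input` are hypothetical, and the input length is assumed to be a power of two, as in Question 4.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sequential reference sum, mirroring reduction_gold in the notebook.
float sequential_sum(const std::vector<float>& data) {
    float sum = 0.0f;
    for (float x : data) sum += x;
    return sum;
}

// Host emulation of the kernel's binary-tree reduction.
// Assumption: temp.size() is a power of two (and at least 1).
// temp is taken by value and plays the role of the shared-memory array.
float tree_sum(std::vector<float> temp) {
    for (std::size_t d = temp.size() / 2; d > 0; d /= 2) {
        // On the GPU the "threads" with tid < d perform this step in
        // parallel; each reads an index >= d and writes an index < d,
        // so running them one after another gives the same result.
        for (std::size_t tid = 0; tid < d; ++tid) temp[tid] += temp[tid + d];
    }
    return temp[0];
}

// Hypothetical test data: small integer values stored as floats.
std::vector<float> make_input(std::size_t n) {
    std::vector<float> v(n);
    for (std::size_t i = 0; i < n; ++i)
        v[i] = static_cast<float>((i * 7 + 3) % 10);
    return v;
}
```

For integer-valued inputs of this size, `tree_sum(make_input(512))` and `sequential_sum(make_input(512))` are equal, which is the same kind of agreement the notebook's error line reports.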

# Question 5

### Round up code

In [None]:
%%writefile reduction.cu

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

int main()
{
  int m1, m2, m3;

  for (int n=2; n<1029; n++){

    for (m1=1; m1<n; m1=2*m1) {}

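    m2 = n-1;
    m2 = m2 | (m2>>1);
    m2 = m2 | (m2>>2);
    // with the three shifts below commented out, m2 is only guaranteed
    // to be correct for n <= 16 (see n = 17 in the output)
    // m2 = m2 | (m2>>4);
    // m2 = m2 | (m2>>8);  // this handles up to 16-bit integers
    // m2 = m2 | (m2>>16); // needed to go up to 32-bit integers
    m2 = m2 + 1;

    // in line below need to rename to  __clz() in CUDA; see
    // 1.10 in https://docs.nvidia.com/cuda/cuda-math-api/
    // an int has 32 bits, so the shift is 32 minus the leading-zero count
    m3 = 1 << (32 - __builtin_clz(n-1));  //  needs n>1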

    printf("n, m1, m2, m3 = %2d, %2d, %2d, %2d \n",n,m1,m2,m3);
  }

}


Overwriting reduction.cu


In [None]:
!nvcc reduction.cu -o reduction -I. -lineinfo -arch=sm_70 --ptxas-options=-v --use_fast_math -lcudart

ptxas info    : 0 bytes gmem


In [None]:
!./reduction

n, m1, m2, m3 =  2,  2,  2,  2 
n, m1, m2, m3 =  3,  4,  4,  4 
n, m1, m2, m3 =  4,  4,  4,  4 
n, m1, m2, m3 =  5,  8,  8,  8 
n, m1, m2, m3 =  6,  8,  8,  8 
n, m1, m2, m3 =  7,  8,  8,  8 
n, m1, m2, m3 =  8,  8,  8,  8 
n, m1, m2, m3 =  9, 16, 16, 16 
n, m1, m2, m3 = 10, 16, 16, 16 
n, m1, m2, m3 = 11, 16, 16, 16 
n, m1, m2, m3 = 12, 16, 16, 16 
n, m1, m2, m3 = 13, 16, 16, 16 
n, m1, m2, m3 = 14, 16, 16, 16 
n, m1, m2, m3 = 15, 16, 16, 16 
n, m1, m2, m3 = 16, 16, 16, 16 
n, m1, m2, m3 = 17, 32, 31, 32 
n, m1, m2, m3 = 18, 32, 32, 32 
n, m1, m2, m3 = 19, 32, 32, 32 
n, m1, m2, m3 = 20, 32, 32, 32 
n, m1, m2, m3 = 21, 32, 32, 32 
n, m1, m2, m3 = 22, 32, 32, 32 
n, m1, m2, m3 = 23, 32, 32, 32 
n, m1, m2, m3 = 24, 32, 32, 32 
n, m1, m2, m3 = 25, 32, 32, 32 
n, m1, m2, m3 = 26, 32, 32, 32 
n, m1, m2, m3 = 27, 32, 32, 32 
n, m1, m2, m3 = 28, 32, 32, 32 
n, m1, m2, m3 = 29, 32, 32, 32 
n, m1, m2, m3 = 30, 32, 32, 32 
n, m1, m2, m3 = 31, 32, 32, 32 
n, m1, m2, m3 = 32, 32, 32, 32 
n, m1, m

### Extending to handle non-power of 2

In [None]:
%%writefile reduction.cu

////////////////////////////////////////////////////////////////////////
//
// Practical 4 -- shared memory reduction for a single block
//                whose size need not be a power of two
//
////////////////////////////////////////////////////////////////////////

#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <math.h>
#include <float.h>

#include <helper_cuda.h>

////////////////////////////////////////////////////////////////////////
// CPU routine
////////////////////////////////////////////////////////////////////////

float reduction_gold(float* idata, int len)
{
  float sum = 0.0f;
  for(int i=0; i<len; i++) sum += idata[i];

  return sum;
}

////////////////////////////////////////////////////////////////////////
// GPU routine
////////////////////////////////////////////////////////////////////////

__global__ void reduction(float *g_odata, float *g_idata)
{
    // dynamically allocated shared memory

    extern  __shared__  float temp[];

    int tid = threadIdx.x;

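    // m3 = largest power of two strictly less than blockDim.x
    // (requires blockDim.x > 1); __clz() is the CUDA device intrinsic
    // for counting leading zeros

    int m3 = 1 << (32 - __clz(blockDim.x-1));
    m3 = m3/2;
    int diff = blockDim.x - m3;  // number of elements beyond the first m3

    // each thread loads its element into shared memory; the first diff
    // threads also fold in the element m3 places further on, so that
    // temp[0..m3-1] accounts for every input exactly once

    temp[tid] = g_idata[tid];
    if (tid < diff) temp[tid] += g_idata[tid + m3];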



    // next, we perform binary tree reduction

    for (int d=m3/2; d>0; d=d/2) {
      __syncthreads();  // ensure previous step completed
      if (tid<d)  temp[tid] += temp[tid+d];
    }

    // finally, first thread puts result into global memory
    if (tid==0) g_odata[0] = temp[0];
}


////////////////////////////////////////////////////////////////////////
// Program main
////////////////////////////////////////////////////////////////////////

int main( int argc, const char** argv)
{
  int num_blocks, num_threads, num_elements, mem_size, shared_mem_size;

  float *h_data, *d_idata, *d_odata;

  // initialise card

  findCudaDevice(argc, argv);

  num_blocks   = 1;  // start with only 1 thread block
  num_threads  = 192;
  num_elements = num_blocks*num_threads;
  mem_size     = sizeof(float) * num_elements;

  // allocate host memory to store the input data
  // and initialize to integer values between 0 and 10

  h_data = (float*) malloc(mem_size);

  for(int i = 0; i < num_elements; i++)
    h_data[i] = floorf(10.0f*(rand()/(float)RAND_MAX));

  // compute reference solution

  float sum = reduction_gold(h_data, num_elements);

  // allocate device memory input and output arrays

  checkCudaErrors( cudaMalloc((void**)&d_idata, mem_size) );
  checkCudaErrors( cudaMalloc((void**)&d_odata, sizeof(float)) );

  // copy host memory to device input array

  checkCudaErrors( cudaMemcpy(d_idata, h_data, mem_size,
                              cudaMemcpyHostToDevice) );

  // execute the kernel

  shared_mem_size = sizeof(float) * num_threads;
  reduction<<<num_blocks,num_threads,shared_mem_size>>>(d_odata,d_idata);
  getLastCudaError("reduction kernel execution failed");

  // copy result from device to host

  checkCudaErrors( cudaMemcpy(h_data, d_odata, sizeof(float),
                              cudaMemcpyDeviceToHost) );

  // check results

  printf("reduction error = %f\n",h_data[0]-sum);
  printf("reduction result = %f\n",h_data[0]);
  printf("reference result = %f\n",sum);

  // cleanup memory

  free(h_data);
  checkCudaErrors( cudaFree(d_idata) );
  checkCudaErrors( cudaFree(d_odata) );

  // CUDA exit -- needed to flush printf write buffer

  cudaDeviceReset();
}


Overwriting reduction.cu


In [None]:
!nvcc reduction.cu -o reduction -I. -lineinfo -arch=sm_70 --ptxas-options=-v --use_fast_math -lcudart

ptxas info    : 0 bytes gmem
ptxas info    : Compiling entry function '_Z9reductionPfS_' for 'sm_70'
ptxas info    : Function properties for _Z9reductionPfS_
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 10 registers, 368 bytes cmem[0]


In [None]:
!./reduction

GPU Device 0: "Turing" with compute capability 7.5

reduction error = 0.000000
reduction result = 896.000000
reference result = 896.000000
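
The fold-then-reduce idea used above can be checked on the host. The following C++ sketch is an emulation under stated assumptions, not the kernel itself: `pow2_below` and `folded_tree_sum` are hypothetical names, and the input length (standing in for `blockDim.x`) is assumed to be greater than 1.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Largest power of two strictly less than n (requires n > 1),
// using the rounding loop quoted in the notebook.
std::size_t pow2_below(std::size_t n) {
    std::size_t m = 1;
    while (m < n) m *= 2;   // m is now n rounded up to a power of two
    return m / 2;
}

// Host emulation of the reduction for a block of any size n > 1.
float folded_tree_sum(const std::vector<float>& g_idata) {
    const std::size_t n = g_idata.size();
    const std::size_t m3 = pow2_below(n);
    const std::size_t diff = n - m3;   // never exceeds m3

    // Load the first m3 elements, then fold the remaining diff elements
    // onto the front so that temp covers every input exactly once.
    std::vector<float> temp(g_idata.begin(),
                            g_idata.begin() + static_cast<std::ptrdiff_t>(m3));
    for (std::size_t tid = 0; tid < diff; ++tid) temp[tid] += g_idata[tid + m3];

    // Ordinary binary-tree reduction over the m3 remaining values.
    for (std::size_t d = m3 / 2; d > 0; d /= 2)
        for (std::size_t tid = 0; tid < d; ++tid) temp[tid] += temp[tid + d];

    return temp[0];
}
```

With 192 elements, `pow2_below` gives 128, so 64 elements are folded before the tree reduction starts at `d = 64`, matching the kernel's loop bounds.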


# Question 6

In [None]:
%%writefile reduction.cu

////////////////////////////////////////////////////////////////////////
//
// Practical 4 -- shared memory reduction for multiple blocks,
//                each a power of two in size
//
////////////////////////////////////////////////////////////////////////

#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <math.h>
#include <float.h>

#include <helper_cuda.h>

////////////////////////////////////////////////////////////////////////
// CPU routine
////////////////////////////////////////////////////////////////////////

float reduction_gold(float* idata, int len)
{
  float sum = 0.0f;
  for(int i=0; i<len; i++) sum += idata[i];

  return sum;
}

////////////////////////////////////////////////////////////////////////
// GPU routine
////////////////////////////////////////////////////////////////////////
__global__ void reduction(float *g_odata, float *g_idata, int num_elements)
{
    // Dynamically allocated shared memory
    extern __shared__ float temp[];

    int tid = threadIdx.x; // Local thread index
    int gid = threadIdx.x + blockDim.x * blockIdx.x; // Global index

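    // Load data into shared memory; threads past the end of the input
    // contribute zero, which also avoids an out-of-bounds global read
    if (gid < num_elements) {
        temp[tid] = g_idata[gid];
    } else {
        temp[tid] = 0.0f;
    }

    // Binary-tree reduction in shared memory (blockDim.x must be a power
    // of two).  Each temp[tid] is updated by one thread only, so a plain
    // += is sufficient and no atomic operation is needed
    for (int d = blockDim.x / 2; d > 0; d >>= 1) {
        __syncthreads();  // ensure the load or previous step completed
        if (tid < d) temp[tid] += temp[tid + d];
    }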

    // Write result for this block to global memory
    if (tid == 0) g_odata[blockIdx.x] = temp[0];
}

////////////////////////////////////////////////////////////////////////
// Program main
////////////////////////////////////////////////////////////////////////

int main( int argc, const char** argv)
{
  int num_blocks, num_threads, num_elements, mem_size, shared_mem_size;

  float *h_data, *d_idata, *d_odata;

  // initialise card

  findCudaDevice(argc, argv);

  num_blocks   = 1000;  // now use many thread blocks
  num_threads  = 512;
  num_elements = num_blocks*num_threads;
  mem_size     = sizeof(float) * num_elements;

  // allocate host memory to store the input data
  // and initialize to integer values between 0 and 10

  h_data = (float*) malloc(mem_size);

  for(int i = 0; i < num_elements; i++)
    h_data[i] = floorf(10.0f*(rand()/(float)RAND_MAX));

  // compute reference solution

  float sum = reduction_gold(h_data, num_elements);

  // allocate device memory input and output arrays

  checkCudaErrors( cudaMalloc((void**)&d_idata, mem_size) );
  // one partial sum per block, since the kernel writes g_odata[blockIdx.x]
  checkCudaErrors( cudaMalloc((void**)&d_odata, sizeof(float) * num_blocks) );

  // copy host memory to device input array

  checkCudaErrors( cudaMemcpy(d_idata, h_data, mem_size,
                              cudaMemcpyHostToDevice) );

  // execute the kernel

  shared_mem_size = sizeof(float) * num_threads;
  reduction<<<num_blocks,num_threads,shared_mem_size>>>(d_odata,d_idata,num_elements);
  getLastCudaError("reduction kernel execution failed");

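  // copy result from device to host
  // (note: this copies only g_odata[0], the partial sum of block 0; the
  //  partial sums of the other blocks are not yet included in the check)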

  checkCudaErrors( cudaMemcpy(h_data, d_odata, sizeof(float),
                              cudaMemcpyDeviceToHost) );

  // check results

  printf("reduction error = %f\n",h_data[0]-sum);
  printf("reduction result = %f\n",h_data[0]);
  printf("reference result = %f\n",sum);

  // cleanup memory

  free(h_data);
  checkCudaErrors( cudaFree(d_idata) );
  checkCudaErrors( cudaFree(d_odata) );

  // CUDA exit -- needed to flush printf write buffer

  cudaDeviceReset();
}


Overwriting reduction.cu


In [None]:
!nvcc reduction.cu -o reduction -I. -lineinfo -arch=sm_70 --ptxas-options=-v --use_fast_math -lcudart

ptxas info    : 0 bytes gmem
ptxas info    : Compiling entry function '_Z9reductionPfS_i' for 'sm_70'
ptxas info    : Function properties for _Z9reductionPfS_i
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 10 registers, 372 bytes cmem[0]


In [None]:
!./reduction

GPU Device 0: "Turing" with compute capability 7.5

reduction error = -2302950.000000
reduction result = 2351.000000
reference result = 2305301.000000
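
In the run above, only the partial sum of block 0 is copied back (2351, which matches the single-block result in Question 4), while the reference covers all 1000 × 512 elements, hence the large error. One way to complete the exercise is to keep one partial sum per block and add them in a second stage, whether on the host, in a second kernel, or with `atomicAdd` into a single output value. The host-only C++ sketch below emulates that two-stage structure; `digits`, `block_partial_sum` and `multi_block_sum` are hypothetical names, and the block size is assumed to be a power of two.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical test data: the digits 0..9 repeated, stored as floats.
std::vector<float> digits(std::size_t n) {
    std::vector<float> v(n);
    for (std::size_t i = 0; i < n; ++i) v[i] = static_cast<float>(i % 10);
    return v;
}

// Stage 1: partial sum for one emulated block.
// Assumption: block_size is a power of two. Indices past the end of the
// input contribute zero, as in the kernel's bounds check.
float block_partial_sum(const std::vector<float>& g_idata,
                        std::size_t block, std::size_t block_size) {
    std::vector<float> temp(block_size, 0.0f);
    for (std::size_t tid = 0; tid < block_size; ++tid) {
        const std::size_t gid = tid + block_size * block;
        if (gid < g_idata.size()) temp[tid] = g_idata[gid];
    }
    for (std::size_t d = block_size / 2; d > 0; d /= 2)
        for (std::size_t tid = 0; tid < d; ++tid) temp[tid] += temp[tid + d];
    return temp[0];
}

// Stage 2: combine the per-block partial sums into the final result.
float multi_block_sum(const std::vector<float>& g_idata,
                      std::size_t num_blocks, std::size_t block_size) {
    std::vector<float> g_odata(num_blocks);
    for (std::size_t b = 0; b < num_blocks; ++b)
        g_odata[b] = block_partial_sum(g_idata, b, block_size);

    float total = 0.0f;
    for (float partial : g_odata) total += partial;
    return total;
}
```

With this structure the value compared against the reference is the sum over all blocks rather than `g_odata[0]` alone.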


# Un-Assign



---
By going back to the previous code block you can modify the code to complete the initial Practical 4 exercises. Remember to first make your own copy of the notebook so that you are able to edit it.

For the first exercise, it may be useful to know that the following line of code leaves `m` equal to `n` rounded up to the nearest power of 2, so dividing `m` by 2 then gives the largest power of 2 strictly less than `n` (for `n` > 1).

`for (m=1; m<n; m=2*m) {} `
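
As a sanity check of this rounding idea, the host-side C++ sketch below compares the loop with the bit-smearing form using all five shifts enabled. The function names are hypothetical, and both forms are assumed to be called with 1 ≤ n ≤ 2^31 so that the result fits in 32 bits.

```cpp
#include <cassert>
#include <cstdint>

// Loop form: smallest power of two >= n.
std::uint32_t round_up_loop(std::uint32_t n) {
    std::uint32_t m = 1;
    while (m < n) m *= 2;
    return m;
}

// Bit-smearing form: copy the top set bit of n-1 into every lower bit,
// then add one. All five shifts are needed to cover 32-bit values.
std::uint32_t round_up_bits(std::uint32_t n) {
    std::uint32_t m = n - 1;
    m |= m >> 1;
    m |= m >> 2;
    m |= m >> 4;
    m |= m >> 8;
    m |= m >> 16;
    return m + 1;
}

// True if the two forms agree for every n in [lo, hi].
bool forms_agree(std::uint32_t lo, std::uint32_t hi) {
    for (std::uint32_t n = lo; n <= hi; ++n)
        if (round_up_loop(n) != round_up_bits(n)) return false;
    return true;
}
```

For a block of 192 threads, `round_up_loop(192) / 2` is 128, the starting size for the tree reduction after the extra elements have been folded in.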

For students doing this as an assignment to be assessed, you should again add your name to the title of the notebook (as in "Practical 4 -- Mike Giles.ipynb"), make it shared (see the Share option in the top-right corner) and provide the shared link as the submission mechanism.



In [None]:
from google.colab import runtime
runtime.unassign()