<a href="https://colab.research.google.com/github/Mamoro98/Cuda-Programming/blob/main/Omer_Practical_1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **CUDA Programming on NVIDIA GPUs, July 22-26, 2024**

# **Practical 1**

First of all, make sure the correct Runtime is being used, by clicking on the Runtime option at the top, then "Change runtime type", selecting an appropriate GPU such as the T4, then clicking Save.

A Colab Pro or Pro+ account will allow you to use a more powerful GPU, but the freely available T4 is perfectly adequate for the practicals in this course. It has good single precision capabilities and corresponds to Compute Capability 7.5.

To check that this has been done successfully, run the two instructions below: the first returns the version of the available NVIDIA compiler, and the second returns information on the GPU which is available to you.

To "execute" each cell, click on the little triangle to the left of the instructions.  The ! tells Colab that these are system commands to be executed.

In [1]:
!nvcc --version


nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Thu_Jun__6_02:18:23_PDT_2024
Cuda compilation tools, release 12.5, V12.5.82
Build cuda_12.5.r12.5/compiler.34385749_0


In [2]:
!nvidia-smi

Tue Feb  4 12:13:04 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  Tesla T4                       Off |   00000000:00:04.0 Off |                    0 |
| N/A   59C    P8             10W /   70W |       0MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
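
The same information can also be queried from inside a CUDA program. The following is a minimal sketch, not part of the course files: it uses the runtime calls `cudaGetDeviceCount` and `cudaGetDeviceProperties`, while the file name `devquery.cu` and the helper `sm_number` are hypothetical names chosen for illustration. On the T4 it should report Compute Capability 7.5.

```cuda
// devquery.cu (hypothetical file name): list the GPUs visible to the runtime

#include <stdio.h>
#include <cuda_runtime.h>

// Illustrative helper: convert a Compute Capability (major, minor) into the
// number used in nvcc's -arch=sm_XX flag, e.g. (7,5) gives 75.
int sm_number(int major, int minor)
{
  return 10*major + minor;
}

int main(void)
{
  int count = 0;
  cudaError_t err = cudaGetDeviceCount(&count);

  if (err != cudaSuccess) {
    printf("cudaGetDeviceCount failed: %s\n", cudaGetErrorString(err));
    return 1;
  }

  if (count == 0) printf("No CUDA devices found\n");

  for (int dev = 0; dev < count; dev++) {
    cudaDeviceProp prop;
    err = cudaGetDeviceProperties(&prop, dev);

    if (err != cudaSuccess) {
      printf("cudaGetDeviceProperties failed: %s\n", cudaGetErrorString(err));
      return 1;
    }

    // name, Compute Capability and global memory size of this device
    printf("Device %d: %s, Compute Capability %d.%d (sm_%d), %.1f GiB\n",
           dev, prop.name, prop.major, prop.minor,
           sm_number(prop.major, prop.minor),
           prop.totalGlobalMem / (1024.0*1024.0*1024.0));
  }

  return 0;
}
```

In Colab this could be placed in a cell starting with `%%writefile devquery.cu`, then compiled and run in the same way as the practical codes later in this notebook.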
                                                

---

The first step is to download two header files from the course webpage.

In [3]:
!wget https://people.maths.ox.ac.uk/gilesm/cuda/headers/helper_cuda.h
!wget https://people.maths.ox.ac.uk/gilesm/cuda/headers/helper_string.h

--2025-02-04 12:13:25--  https://people.maths.ox.ac.uk/gilesm/cuda/headers/helper_cuda.h
Resolving people.maths.ox.ac.uk (people.maths.ox.ac.uk)... 129.67.184.129, 2001:630:441:202::8143:b881
Connecting to people.maths.ox.ac.uk (people.maths.ox.ac.uk)|129.67.184.129|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 27832 (27K) [text/x-chdr]
Saving to: ‘helper_cuda.h’


2025-02-04 12:13:27 (158 KB/s) - ‘helper_cuda.h’ saved [27832/27832]

--2025-02-04 12:13:27--  https://people.maths.ox.ac.uk/gilesm/cuda/headers/helper_string.h
Resolving people.maths.ox.ac.uk (people.maths.ox.ac.uk)... 129.67.184.129, 2001:630:441:202::8143:b881
Connecting to people.maths.ox.ac.uk (people.maths.ox.ac.uk)|129.67.184.129|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 14875 (15K) [text/x-chdr]
Saving to: ‘helper_string.h’


2025-02-04 12:13:28 (374 KB/s) - ‘helper_string.h’ saved [14875/14875]




---
Next we create the file prac1a.cu by using the %%writefile instruction at the top of the code block.

In doing this, we are following the helpful information provided here:
https://colab.research.google.com/drive/1GJOfTp56OeQRdE4u2_S7pUNRcJb4ik9X?usp=sharing

In [64]:
%%writefile prac1a.cu

//
// include files
//

#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <math.h>

//
// kernel routine
//

__global__ void my_first_kernel(float *x)
{
  int tid = threadIdx.x + blockDim.x*blockIdx.x;

  printf("tid = %i , thred index = %i , block idx = %i , block dim = %i \n",tid,threadIdx.x,blockIdx.x,blockDim.x);

  x[tid] = (float) threadIdx.x;
}


//
// main code
//

int main(int argc, char **argv)
{
  float *h_x, *d_x;
  int   nblocks, nthreads, nsize, n;

  // set number of blocks, and threads per block

  nblocks  = 2;
  nthreads = 8;
  nsize    = nblocks*nthreads ;

  // allocate memory for array

  h_x = (float *)malloc(nsize*sizeof(float));
  cudaMalloc((void **)&d_x, nsize*sizeof(float));

  // execute kernel

  my_first_kernel<<<nblocks,nthreads>>>(d_x);

  // copy back results and print them out

  cudaMemcpy(h_x,d_x,nsize*sizeof(float),cudaMemcpyDeviceToHost);

  for (n=0; n<nsize; n++) printf(" n,  x  =  %d  %f \n",n,h_x[n]);

  // free memory

  cudaFree(d_x);
  free(h_x);

  // CUDA exit -- needed to flush printf write buffer

  cudaDeviceReset();

  return 0;
}


Overwriting prac1a.cu




---

The following instruction compiles prac1a.cu to create the executable prac1a (named by the -o flag). The other flags are as follows:

-I. says to look in the current directory for header files

-lineinfo helps with debugging if there's a run-time problem

-arch=sm_70 says it is for GPUs of Compute Capability 7.0 or later

--ptxas-options=-v gives us additional information such as how many registers are used

--use_fast_math generates faster code which might sometimes be a little less accurate

-lcudart links in the run-time CUDA library



In [65]:
!nvcc prac1a.cu -o prac1a -I. -lineinfo -arch=sm_70 --ptxas-options=-v --use_fast_math -lcudart

ptxas info    : 64 bytes gmem, 16 bytes cmem[4]
ptxas info    : Compiling entry function '_Z15my_first_kernelPf' for 'sm_70'
ptxas info    : Function properties for _Z15my_first_kernelPf
    16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 24 registers, 16 bytes cumulative stack size, 360 bytes cmem[0]
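
To illustrate what the `--use_fast_math` flag trades away, here is a small sketch that is not part of the course files (the kernel name `compare_sin` and the helper `max_of` are hypothetical). It evaluates the standard `sinf` and the fast intrinsic `__sinf` on the same inputs and reports the largest difference. Compiled without `--use_fast_math` the two will generally differ slightly; with the flag, `sinf` is itself mapped to the intrinsic, so the difference should drop to zero.

```cuda
#include <stdio.h>
#include <math.h>
#include <cuda_runtime.h>

// each thread compares the standard and the fast sine at its own point x
__global__ void compare_sin(float *diff)
{
  int tid = threadIdx.x + blockDim.x*blockIdx.x;

  float x = 0.1f * (float) tid;
  diff[tid] = fabsf(sinf(x) - __sinf(x));
}

// host helper: largest element of an array of length n >= 1
float max_of(const float *v, int n)
{
  float m = v[0];
  for (int i=1; i<n; i++) if (v[i] > m) m = v[i];
  return m;
}

int main(void)
{
  const int nblocks  = 2;
  const int nthreads = 8;
  const int nsize    = nblocks*nthreads;

  float  h_diff[nsize];
  float *d_diff;

  if (cudaMalloc((void **)&d_diff, nsize*sizeof(float)) != cudaSuccess) {
    printf("cudaMalloc failed\n");
    return 1;
  }

  compare_sin<<<nblocks,nthreads>>>(d_diff);

  // the copy waits for the kernel to finish before reading the results
  cudaError_t err = cudaMemcpy(h_diff, d_diff, nsize*sizeof(float),
                               cudaMemcpyDeviceToHost);
  if (err != cudaSuccess) {
    printf("kernel or copy failed: %s\n", cudaGetErrorString(err));
    return 1;
  }

  printf("max |sinf(x) - __sinf(x)| = %e\n", max_of(h_diff, nsize));

  cudaFree(d_diff);
  return 0;
}
```

Compiling this once with and once without `--use_fast_math`, keeping the other flags unchanged, gives a direct comparison.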


---

Now we can execute the code.


In [66]:
!./prac1a

tid = 8 , thred index = 0 , block idx = 1 , block dim = 8 
tid = 9 , thred index = 1 , block idx = 1 , block dim = 8 
tid = 10 , thred index = 2 , block idx = 1 , block dim = 8 
tid = 11 , thred index = 3 , block idx = 1 , block dim = 8 
tid = 12 , thred index = 4 , block idx = 1 , block dim = 8 
tid = 13 , thred index = 5 , block idx = 1 , block dim = 8 
tid = 14 , thred index = 6 , block idx = 1 , block dim = 8 
tid = 15 , thred index = 7 , block idx = 1 , block dim = 8 
tid = 0 , thred index = 0 , block idx = 0 , block dim = 8 
tid = 1 , thred index = 1 , block idx = 0 , block dim = 8 
tid = 2 , thred index = 2 , block idx = 0 , block dim = 8 
tid = 3 , thred index = 3 , block idx = 0 , block dim = 8 
tid = 4 , thred index = 4 , block idx = 0 , block dim = 8 
tid = 5 , thred index = 5 , block idx = 0 , block dim = 8 
tid = 6 , thred index = 6 , block idx = 0 , block dim = 8 
tid = 7 , thred index = 7 , block idx = 0 , block dim = 8 
 n,  x  =  0  0.000000 
 n,  x  =  1  1.000000 
 n



---

We can now perform the same steps for the second code, prac1b.cu, which does lots of error-checking.

In [67]:
%%writefile prac1b.cu

//
// include files
//

#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <math.h>

#include <helper_cuda.h>


//
// kernel routine
//

__global__ void my_first_kernel(float *x)
{
  int tid = threadIdx.x + blockDim.x*blockIdx.x;

  x[tid] = (float) threadIdx.x;
}


//
// main code
//

int main(int argc, const char **argv)
{
  float *h_x, *d_x;
  int   nblocks, nthreads, nsize, n;

  // initialise card

  findCudaDevice(argc, argv);

  // set number of blocks, and threads per block

  nblocks  = 2;
  nthreads = 8;
  nsize    = nblocks*nthreads ;

  // allocate memory for array

  h_x = (float *)malloc(nsize*sizeof(float));
  checkCudaErrors(cudaMalloc((void **)&d_x, nsize*sizeof(float)));

  // execute kernel

  my_first_kernel<<<nblocks,nthreads>>>(d_x);
  getLastCudaError("my_first_kernel execution failed\n");

  // copy back results and print them out

  checkCudaErrors( cudaMemcpy(h_x,d_x,nsize*sizeof(float),
                 cudaMemcpyDeviceToHost) );

  for (n=0; n<nsize; n++) printf(" n,  x  =  %d  %f \n",n,h_x[n]);

  // free memory

  checkCudaErrors(cudaFree(d_x));
  free(h_x);

  // CUDA exit -- needed to flush printf write buffer

  cudaDeviceReset();

  return 0;
}


Overwriting prac1b.cu


In [68]:
!nvcc prac1b.cu -o prac1b -I. -lineinfo -arch=sm_70 --ptxas-options=-v --use_fast_math -lcudart

ptxas info    : 0 bytes gmem
ptxas info    : Compiling entry function '_Z15my_first_kernelPf' for 'sm_70'
ptxas info    : Function properties for _Z15my_first_kernelPf
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 10 registers, 360 bytes cmem[0]


In [69]:
!./prac1b

GPU Device 0: "Turing" with compute capability 7.5

 n,  x  =  0  0.000000 
 n,  x  =  1  1.000000 
 n,  x  =  2  2.000000 
 n,  x  =  3  3.000000 
 n,  x  =  4  4.000000 
 n,  x  =  5  5.000000 
 n,  x  =  6  6.000000 
 n,  x  =  7  7.000000 
 n,  x  =  8  0.000000 
 n,  x  =  9  1.000000 
 n,  x  =  10  2.000000 
 n,  x  =  11  3.000000 
 n,  x  =  12  4.000000 
 n,  x  =  13  5.000000 
 n,  x  =  14  6.000000 
 n,  x  =  15  7.000000 
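
The helpers `checkCudaErrors` and `getLastCudaError` used in prac1b.cu come from `helper_cuda.h`. As a rough sketch of the idea behind them (this is not the code in `helper_cuda.h`; the macro `CHECK_CUDA` and the function `check_cuda` are hypothetical names), a check can wrap each runtime call, test the returned `cudaError_t`, and report the file and line on failure. A kernel launch returns no status directly, so the launch is followed by a call to `cudaGetLastError`.

```cuda
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

// Hypothetical helper: returns 1 on success; on failure prints a message
// saying which call failed and where, and returns 0.
int check_cuda(cudaError_t err, const char *what, const char *file, int line)
{
  if (err == cudaSuccess) return 1;

  fprintf(stderr, "CUDA error at %s:%d: %s failed: %s\n",
          file, line, what, cudaGetErrorString(err));
  return 0;
}

// Hypothetical macro: stop the program if the wrapped call reports an error
#define CHECK_CUDA(call)                                        \
  do {                                                          \
    if (!check_cuda((call), #call, __FILE__, __LINE__))         \
      exit(EXIT_FAILURE);                                       \
  } while (0)

__global__ void my_first_kernel(float *x)
{
  int tid = threadIdx.x + blockDim.x*blockIdx.x;

  x[tid] = (float) threadIdx.x;
}

int main(void)
{
  const int nblocks  = 2;
  const int nthreads = 8;
  const int nsize    = nblocks*nthreads;

  float  h_x[nsize];
  float *d_x;

  CHECK_CUDA(cudaMalloc((void **)&d_x, nsize*sizeof(float)));

  my_first_kernel<<<nblocks,nthreads>>>(d_x);
  CHECK_CUDA(cudaGetLastError());   // catches launch errors

  // the copy waits for the kernel, so errors during execution show up here
  CHECK_CUDA(cudaMemcpy(h_x, d_x, nsize*sizeof(float),
                        cudaMemcpyDeviceToHost));

  for (int n=0; n<nsize; n++) printf(" n,  x  =  %d  %f \n",n,h_x[n]);

  CHECK_CUDA(cudaFree(d_x));
  return 0;
}
```

Keeping the test in a small function and the exit in the macro means the failing call's text, file and line all appear in the message.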


In [122]:
%%writefile prac1b.cu

//
// include files
//

#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <math.h>

#include <helper_cuda.h>


//
// kernel routine
//

__global__ void my_first_kernel(float *a,float *b,float *c)
{
  int tid = threadIdx.x + blockDim.x*blockIdx.x;

  // each thread adds one pair of elements (the inputs are already floats)
  c[tid] = a[tid] + b[tid];
}


//
// main code
//

int main(int argc, const char **argv)
{
  float h_a[16] ={10,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16};
  float h_b[16] ={1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16};
  float *d_c,*d_a,*d_b;
  float *h_c;
  int   nblocks, nthreads, nsize, n;

  // initialise card

  findCudaDevice(argc, argv);

  // set number of blocks, and threads per block

  nblocks  = 2;
  nthreads = 8;
  nsize    = nblocks*nthreads ;

  // allocate memory for array

  h_c = (float *)malloc(nsize*sizeof(float));
  checkCudaErrors(cudaMalloc((void **)&d_a, nsize*sizeof(float)));
  checkCudaErrors(cudaMalloc((void **)&d_b, nsize*sizeof(float)));
  checkCudaErrors(cudaMalloc((void **)&d_c, nsize*sizeof(float)));


  // copy input arrays to the device

  checkCudaErrors( cudaMemcpy(d_a,h_a,nsize*sizeof(float),
                 cudaMemcpyHostToDevice) );
  checkCudaErrors( cudaMemcpy(d_b,h_b,nsize*sizeof(float),
                 cudaMemcpyHostToDevice) );

  // execute kernel

  my_first_kernel<<<nblocks,nthreads>>>(d_a,d_b,d_c);
  getLastCudaError("my_first_kernel execution failed\n");

  // copy back results and print them out

  checkCudaErrors( cudaMemcpy(h_c,d_c,nsize*sizeof(float),
                 cudaMemcpyDeviceToHost) );

  for (n=0; n<nsize; n++) printf(" n,  c  =  %d  %f \n",n,h_c[n]);

  // free memory

  checkCudaErrors(cudaFree(d_a));

  checkCudaErrors(cudaFree(d_b));

  checkCudaErrors(cudaFree(d_c));
  free(h_c);

  // CUDA exit -- needed to flush printf write buffer

  cudaDeviceReset();

  return 0;
}


Overwriting prac1b.cu


In [123]:
!nvcc prac1b.cu -o prac1b -I. -lineinfo -arch=sm_70 --ptxas-options=-v --use_fast_math -lcudart

ptxas info    : 0 bytes gmem
ptxas info    : Compiling entry function '_Z15my_first_kernelPfS_S_' for 'sm_70'
ptxas info    : Function properties for _Z15my_first_kernelPfS_S_
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 12 registers, 376 bytes cmem[0]


In [124]:
!./prac1b

GPU Device 0: "Turing" with compute capability 7.5

 n,  c  =  0  11.000000 
 n,  c  =  1  4.000000 
 n,  c  =  2  6.000000 
 n,  c  =  3  8.000000 
 n,  c  =  4  10.000000 
 n,  c  =  5  12.000000 
 n,  c  =  6  14.000000 
 n,  c  =  7  16.000000 
 n,  c  =  8  18.000000 
 n,  c  =  9  20.000000 
 n,  c  =  10  22.000000 
 n,  c  =  11  24.000000 
 n,  c  =  12  26.000000 
 n,  c  =  13  28.000000 
 n,  c  =  14  30.000000 
 n,  c  =  15  32.000000 
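
The vector addition above relies on the array length being exactly `nblocks*nthreads`. A common generalisation, sketched here for an arbitrary length `n`, rounds the block count up and guards the array access so that surplus threads do nothing. The names `vec_add` and `blocks_for` and the value `n = 1000` are illustrative and not from the course material. The host then compares the GPU result against a CPU sum. Error checks are left out to keep the sketch short; each runtime call could be wrapped in `checkCudaErrors` as in prac1b.cu.

```cuda
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

// one thread per element; threads with tid >= n do nothing
__global__ void vec_add(const float *a, const float *b, float *c, int n)
{
  int tid = threadIdx.x + blockDim.x*blockIdx.x;

  if (tid < n) c[tid] = a[tid] + b[tid];
}

// number of blocks needed to cover n elements, rounded up
int blocks_for(int n, int nthreads)
{
  return (n + nthreads - 1) / nthreads;
}

int main(void)
{
  const int n        = 1000;   // deliberately not a multiple of nthreads
  const int nthreads = 256;
  const int nblocks  = blocks_for(n, nthreads);
  size_t    bytes    = n*sizeof(float);

  float *h_a = (float *)malloc(bytes);
  float *h_b = (float *)malloc(bytes);
  float *h_c = (float *)malloc(bytes);

  for (int i=0; i<n; i++) {
    h_a[i] = (float) i;
    h_b[i] = 2.0f * (float) i;
  }

  float *d_a, *d_b, *d_c;
  cudaMalloc((void **)&d_a, bytes);
  cudaMalloc((void **)&d_b, bytes);
  cudaMalloc((void **)&d_c, bytes);

  cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
  cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

  vec_add<<<nblocks,nthreads>>>(d_a, d_b, d_c, n);

  cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);

  // compare against the same sum computed on the CPU
  int errors = 0;
  for (int i=0; i<n; i++) if (h_c[i] != h_a[i] + h_b[i]) errors++;

  printf("%d blocks of %d threads, %d mismatches out of %d elements\n",
         nblocks, nthreads, errors, n);

  cudaFree(d_a);  cudaFree(d_b);  cudaFree(d_c);
  free(h_a);  free(h_b);  free(h_c);

  return errors != 0;
}
```

With `n = 1000` and 256 threads per block, `blocks_for` gives 4 blocks, so the last 24 threads fall outside the array and are skipped by the guard.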




---
By going back to the previous code blocks you can modify the codes to complete the Practical 1 exercises.  Alternatively you can copy a Code cell (on my system I do this by using the mouse right-click) and paste it (control-V on my system) to form a new Code cell -- this is best for the final exercise in which you are to write a new code to add two vectors.

However, this copy of the notebook is read-only for everyone except the owner (me!) so you will need to make your own copy of the notebook by going to the File option at the top and then clicking on "Save a copy in Drive" which will make a copy of it in your Google Drive.  You are then the owner of the copy and can edit it freely.

For students doing this as an assignment to be assessed, you should add your name to the title of the notebook (as in "Practical 1 -- Mike Giles.ipynb"), make it shared (see the Share option in the top-right corner) and provide the shared link as the submission mechanism.

This final piece of code terminates the runtime for this notebook so that you can switch to a new notebook without any problems -- you will get an error message if you try to keep two runtimes going at the same time with the free Colab account.

This is particularly convenient when you execute the whole notebook with the "Run all" option in the Runtime tab to check that everything works correctly.

In [None]:
from google.colab import runtime
runtime.unassign()