<a href="https://colab.research.google.com/github/GerardAlba/CUDA-Programming-on-NVIDIA-GPUs/blob/main/Practical_1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **CUDA Programming on NVIDIA GPUs, July 22-26, 2024**

# **Practical 1**

First of all, make sure the correct Runtime is being used, by clicking on the Runtime option at the top, then "Change runtime type", selecting an appropriate GPU such as the T4, then clicking Save.

A Colab Pro or Pro+ account will allow you to use a more powerful GPU, but the freely available T4 is perfectly adequate for the practicals in this course. It has good single precision capabilities and corresponds to Compute Capability 7.5.

To check that this has been done successfully, the first instruction below returns information on the version of the available NVIDIA compiler, and the second instruction returns information on the GPU which is available to you.  

To "execute" the cell, click on the little triangle to the left of the instructions.  The ! tells Colab that these are system commands to be executed.

In [1]:
!nvcc --version
!nvidia-smi


nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Aug_15_22:02:13_PDT_2023
Cuda compilation tools, release 12.2, V12.2.140
Build cuda_12.2.r12.2/compiler.33191640_0
Thu Jun  6 17:24:02 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  Tesla T4                       Off | 00000000:00:04.0 Off |                    0 |
| N/A   66C    P8              12W /  70W |      0MiB / 15360MiB |      0%      Default |
|                                      
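
The Compute Capability mentioned above can also be checked from inside a CUDA program. The following is a minimal sketch, not part of the course files, which uses the runtime calls `cudaGetDeviceCount` and `cudaGetDeviceProperties`; the file name `devquery.cu` used below is only an assumption for illustration.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
  int ndev = 0;

  // ask the runtime how many CUDA devices are visible
  cudaError_t err = cudaGetDeviceCount(&ndev);
  if (err != cudaSuccess) {
    printf("cudaGetDeviceCount failed: %s\n", cudaGetErrorString(err));
    return 1;
  }

  for (int dev = 0; dev < ndev; dev++) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, dev);

    // prop.major and prop.minor together give the Compute Capability
    printf("device %d: %s, Compute Capability %d.%d, %.1f GiB global memory\n",
           dev, prop.name, prop.major, prop.minor,
           (double) prop.totalGlobalMem / (1024.0*1024.0*1024.0));
  }

  return 0;
}
```

If saved as `devquery.cu`, it could be compiled with `!nvcc devquery.cu -o devquery` and run with `!./devquery`; on a T4 the reported Compute Capability should be 7.5.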

---

The next step is to get the codes for Practical 1 by downloading them from the course webpage with wget, and then running make, which uses the Makefile to build the executable files.

In [2]:
!wget https://people.maths.ox.ac.uk/gilesm/cuda/prac1/prac1a.cu
!wget https://people.maths.ox.ac.uk/gilesm/cuda/prac1/prac1b.cu
!wget https://people.maths.ox.ac.uk/gilesm/cuda/prac1/prac1c.cu
!wget https://people.maths.ox.ac.uk/gilesm/cuda/prac1/helper_cuda.h
!wget https://people.maths.ox.ac.uk/gilesm/cuda/prac1/helper_string.h
!wget https://people.maths.ox.ac.uk/gilesm/cuda/prac1/Makefile
!make
!ls

--2024-06-06 17:24:24--  https://people.maths.ox.ac.uk/gilesm/cuda/prac1/prac1a.cu
Resolving people.maths.ox.ac.uk (people.maths.ox.ac.uk)... 129.67.184.129, 2001:630:441:202::8143:b881
Connecting to people.maths.ox.ac.uk (people.maths.ox.ac.uk)|129.67.184.129|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1002 [application/cu-seeme]
Saving to: ‘prac1a.cu’


2024-06-06 17:24:25 (436 MB/s) - ‘prac1a.cu’ saved [1002/1002]

--2024-06-06 17:24:25--  https://people.maths.ox.ac.uk/gilesm/cuda/prac1/prac1b.cu
Resolving people.maths.ox.ac.uk (people.maths.ox.ac.uk)... 129.67.184.129, 2001:630:441:202::8143:b881
Connecting to people.maths.ox.ac.uk (people.maths.ox.ac.uk)|129.67.184.129|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1219 (1.2K) [application/cu-seeme]
Saving to: ‘prac1b.cu’


2024-06-06 17:24:25 (625 MB/s) - ‘prac1b.cu’ saved [1219/1219]

--2024-06-06 17:24:25--  https://people.maths.ox.ac.uk/gilesm/cuda/prac1/prac1c.cu
Resolv



---
To see the codes and the Makefile we can use the Unix "cat" command. The strange name is short for "concatenate": given several files, it concatenates them and displays the result on the screen.


In [3]:
!cat prac1a.cu

//
// include files
//

#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <math.h>

//
// kernel routine
// 

__global__ void my_first_kernel(float *x)
{
  int tid = threadIdx.x + blockDim.x*blockIdx.x;

  x[tid] = (float) threadIdx.x;
}


//
// main code
//

int main(int argc, char **argv)
{
  float *h_x, *d_x;
  int   nblocks, nthreads, nsize, n; 

  // set number of blocks, and threads per block

  nblocks  = 2;
  nthreads = 8;
  nsize    = nblocks*nthreads ;

  // allocate memory for array

  h_x = (float *)malloc(nsize*sizeof(float));
  cudaMalloc((void **)&d_x, nsize*sizeof(float));

  // execute kernel

  my_first_kernel<<<nblocks,nthreads>>>(d_x);

  // copy back results and print them out

  cudaMemcpy(h_x,d_x,nsize*sizeof(float),cudaMemcpyDeviceToHost);

  for (n=0; n<nsize; n++) printf(" n,  x  =  %d  %f \n",n,h_x[n]);

  // free memory 

  cudaFree(d_x);
  free(h_x);

  // CUDA exit -- needed to flush printf write buffer

  cudaDeviceReset();

  return 0;
}



---

Instead of using the Makefile (which is a powerful utility with lots of capabilities but can also be rather confusing), we can also use an explicit compilation command.  

In the example below we are compiling prac1a.cu and naming the executable output prac1a.  The other flags are as follows:

-lineinfo helps with debugging if there's a run-time problem

-arch=sm_70 says it is for GPUs of Compute Capability 7.0 or later

--ptxas-options=-v gives us additional information such as how many registers are used

--use_fast_math generates faster code which might sometimes be a little less accurate

-lcudart links in the run-time CUDA library



In [4]:
!nvcc prac1a.cu -o prac1a -lineinfo -arch=sm_70 --ptxas-options=-v --use_fast_math -lcudart

ptxas info    : 0 bytes gmem
ptxas info    : Compiling entry function '_Z15my_first_kernelPf' for 'sm_70'
ptxas info    : Function properties for _Z15my_first_kernelPf
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 10 registers, 360 bytes cmem[0]
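
To see what the --use_fast_math flag does in practice, the sketch below (not part of the course files) compares the standard `sinf` function with the fast intrinsic `__sinf` on the device. The kernel name `compare_sin`, the file name `fastmath.cu` and the choice of test values are assumptions made for illustration.

```cuda
#include <stdio.h>
#include <math.h>

// each thread compares the standard sinf with the fast intrinsic __sinf
__global__ void compare_sin(float *err)
{
  int   tid = threadIdx.x + blockDim.x*blockIdx.x;
  float x   = 0.1f * (float) tid;

  err[tid] = fabsf(sinf(x) - __sinf(x));
}

int main(void)
{
  const int nblocks = 2, nthreads = 8, nsize = nblocks*nthreads;
  float h_err[nsize], *d_err;

  cudaMalloc((void **)&d_err, nsize*sizeof(float));

  compare_sin<<<nblocks,nthreads>>>(d_err);

  // the copy waits for the kernel to finish before reading the results
  cudaMemcpy(h_err, d_err, nsize*sizeof(float), cudaMemcpyDeviceToHost);

  for (int n=0; n<nsize; n++)
    printf(" x = %4.1f   |sinf - __sinf| = %e \n", 0.1f*n, h_err[n]);

  cudaFree(d_err);

  return 0;
}
```

Compiled without --use_fast_math, the printed differences show how far the intrinsic is from the standard function for these inputs. Compiled with --use_fast_math, `sinf` is itself replaced by the intrinsic, so the differences are expected to be zero.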


---

Now we can execute the codes.


In [None]:
!./prac1a

 n,  x  =  0  0.000000 
 n,  x  =  1  1.000000 
 n,  x  =  2  2.000000 
 n,  x  =  3  3.000000 
 n,  x  =  4  4.000000 
 n,  x  =  5  5.000000 
 n,  x  =  6  6.000000 
 n,  x  =  7  7.000000 
 n,  x  =  8  0.000000 
 n,  x  =  9  1.000000 
 n,  x  =  10  2.000000 
 n,  x  =  11  3.000000 
 n,  x  =  12  4.000000 
 n,  x  =  13  5.000000 
 n,  x  =  14  6.000000 
 n,  x  =  15  7.000000 
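
The prac1a output above repeats the values 0 to 7 twice because the kernel stores `threadIdx.x`, which restarts at 0 in each block, at the global index `tid = threadIdx.x + blockDim.x*blockIdx.x`. The following sketch, with a hypothetical kernel name `record_ids`, makes that mapping visible by recording both the block index and the thread index for every element.

```cuda
#include <stdio.h>
#include <stdlib.h>

// each thread records which block it belongs to and its index within that block
__global__ void record_ids(int *block_id, int *thread_id)
{
  int tid = threadIdx.x + blockDim.x*blockIdx.x;   // global index, as in prac1a.cu

  block_id[tid]  = blockIdx.x;
  thread_id[tid] = threadIdx.x;
}

int main(void)
{
  int nblocks = 2, nthreads = 8, nsize = nblocks*nthreads;
  int *h_b, *h_t, *d_b, *d_t;

  // allocate host and device arrays
  h_b = (int *)malloc(nsize*sizeof(int));
  h_t = (int *)malloc(nsize*sizeof(int));
  cudaMalloc((void **)&d_b, nsize*sizeof(int));
  cudaMalloc((void **)&d_t, nsize*sizeof(int));

  // same launch configuration as prac1a.cu
  record_ids<<<nblocks,nthreads>>>(d_b, d_t);

  cudaMemcpy(h_b, d_b, nsize*sizeof(int), cudaMemcpyDeviceToHost);
  cudaMemcpy(h_t, d_t, nsize*sizeof(int), cudaMemcpyDeviceToHost);

  for (int n=0; n<nsize; n++)
    printf(" tid = %2d   blockIdx.x = %d   threadIdx.x = %d \n", n, h_b[n], h_t[n]);

  cudaFree(d_b);
  cudaFree(d_t);
  free(h_b);
  free(h_t);

  return 0;
}
```

With 2 blocks of 8 threads, elements 0 to 7 belong to block 0 and elements 8 to 15 to block 1, which matches the pattern in the output above.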


In [None]:
!./prac1b

 n,  x  =  0  0.000000 
 n,  x  =  1  1.000000 
 n,  x  =  2  2.000000 
 n,  x  =  3  3.000000 
 n,  x  =  4  4.000000 
 n,  x  =  5  5.000000 
 n,  x  =  6  6.000000 
 n,  x  =  7  7.000000 
 n,  x  =  8  0.000000 
 n,  x  =  9  1.000000 
 n,  x  =  10  2.000000 
 n,  x  =  11  3.000000 
 n,  x  =  12  4.000000 
 n,  x  =  13  5.000000 
 n,  x  =  14  6.000000 
 n,  x  =  15  7.000000 




---

The next step is to modify prac1a.cu.  The simplest way to do that seems to be to include the whole source in one of these Code cells, and put the **%%writefile** cell magic on the first line so that running the cell overwrites the prac1a.cu file.

In doing this, we are following the helpful information provided here:
https://colab.research.google.com/drive/1GJOfTp56OeQRdE4u2_S7pUNRcJb4ik9X?usp=sharing

In [5]:
%%writefile prac1a.cu

//
// include files
//

#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <math.h>

//
// kernel routine
//

__global__ void my_first_kernel(float *x)
{
  int tid = threadIdx.x + blockDim.x*blockIdx.x;

  x[tid] = (float) threadIdx.x;
}


//
// main code
//

int main(int argc, char **argv)
{
  float *h_x, *d_x;
  int   nblocks, nthreads, nsize, n;

  // set number of blocks, and threads per block

  nblocks  = 2;
  nthreads = 8;
  nsize    = nblocks*nthreads ;

  // allocate memory for array

  h_x = (float *)malloc(nsize*sizeof(float));
  cudaMalloc((void **)&d_x, nsize*sizeof(float));

  // execute kernel

  my_first_kernel<<<nblocks,nthreads>>>(d_x);

  // copy back results and print them out

  cudaMemcpy(h_x,d_x,nsize*sizeof(float),cudaMemcpyDeviceToHost);

  for (n=0; n<nsize; n++) printf(" n,  x  =  %d  %f \n",n,h_x[n]);

  printf(" \n This is my new, updated code \n");

  // free memory

  cudaFree(d_x);
  free(h_x);

  // CUDA exit -- needed to flush printf write buffer

  cudaDeviceReset();

  return 0;
}


Overwriting prac1a.cu
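
The prac1a.cu listing does not check the status codes returned by the CUDA runtime calls, so a failed allocation or kernel launch would go unnoticed. The sketch below shows one common way to add such checks. The `CUDA_CHECK` macro is a hypothetical helper written for this example; it is not taken from the course's helper_cuda.h, whose contents are not shown here.

```cuda
#include <stdio.h>
#include <stdlib.h>

// hypothetical helper macro (not part of the course files):
// print the file, line and error string, then exit, if a CUDA call fails
#define CUDA_CHECK(call)                                              \
  do {                                                                \
    cudaError_t err_ = (call);                                        \
    if (err_ != cudaSuccess) {                                        \
      fprintf(stderr, "CUDA error at %s:%d: %s\n",                    \
              __FILE__, __LINE__, cudaGetErrorString(err_));          \
      exit(EXIT_FAILURE);                                             \
    }                                                                 \
  } while (0)

__global__ void my_first_kernel(float *x)
{
  int tid = threadIdx.x + blockDim.x*blockIdx.x;

  x[tid] = (float) threadIdx.x;
}

int main(void)
{
  int   nblocks = 2, nthreads = 8, nsize = nblocks*nthreads;
  float *h_x, *d_x;

  h_x = (float *)malloc(nsize*sizeof(float));
  CUDA_CHECK(cudaMalloc((void **)&d_x, nsize*sizeof(float)));

  my_first_kernel<<<nblocks,nthreads>>>(d_x);
  CUDA_CHECK(cudaGetLastError());        // reports errors in the launch itself
  CUDA_CHECK(cudaDeviceSynchronize());   // reports errors raised while the kernel ran

  CUDA_CHECK(cudaMemcpy(h_x, d_x, nsize*sizeof(float), cudaMemcpyDeviceToHost));

  for (int n=0; n<nsize; n++) printf(" n,  x  =  %d  %f \n", n, h_x[n]);

  CUDA_CHECK(cudaFree(d_x));
  free(h_x);

  return 0;
}
```

A kernel launch returns no status code of its own, which is why the sketch calls `cudaGetLastError` immediately after the launch and then `cudaDeviceSynchronize` to pick up errors that occur during execution.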




---

We can now re-make the executable and run it again.


In [6]:
!make
!./prac1a

nvcc prac1a.cu -o prac1a -I/include -I. -lineinfo -arch=sm_70 --ptxas-options=-v --use_fast_math -L/lib64 -lcudart
ptxas info    : 0 bytes gmem
ptxas info    : Compiling entry function '_Z15my_first_kernelPf' for 'sm_70'
ptxas info    : Function properties for _Z15my_first_kernelPf
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 10 registers, 360 bytes cmem[0]
 n,  x  =  0  0.000000 
 n,  x  =  1  1.000000 
 n,  x  =  2  2.000000 
 n,  x  =  3  3.000000 
 n,  x  =  4  4.000000 
 n,  x  =  5  5.000000 
 n,  x  =  6  6.000000 
 n,  x  =  7  7.000000 
 n,  x  =  8  0.000000 
 n,  x  =  9  1.000000 
 n,  x  =  10  2.000000 
 n,  x  =  11  3.000000 
 n,  x  =  12  4.000000 
 n,  x  =  13  5.000000 
 n,  x  =  14  6.000000 
 n,  x  =  15  7.000000 
 
 This is my new, updated code 




---
By going back to the previous code block you can modify the code to complete the Practical 1 exercises.  Alternatively you can copy the Code cell (on my system this is done by right-clicking with the mouse) and paste it to form a new Code cell.

However, this copy of the notebook is read-only for everyone except the owner (me!) so you will need to make your own copy of the notebook by going to the File option at the top and then clicking on "Save a copy in Drive" which will make a copy of it in your Google Drive.  You are then the owner of the copy and can edit it freely.

If you are doing this as an assessed assignment, add your name to the title of the notebook (as in "Practical 1 -- Mike Giles.ipynb"), share it (see the Share option in the top-right corner) and submit the shared link.