<a href="https://colab.research.google.com/github/NDU-CSC413/cuda1/blob/master/cuda1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction

Google Colab creates a Linux instance that can be accessed from a Jupyter notebook. A Jupyter notebook consists of a sequence of cells. For our purpose there are two types of cells: **code** and **text**. In both cases you can "execute" the cell by pressing SHIFT-ENTER. Executing a text cell formats it. You can always edit a text cell by double-clicking it.
A code cell executes Python by default. In our case we would like to edit C++ (with CUDA extensions) files, then compile and run them.

First make sure that the runtime is GPU. From the menu choose Runtime -> Change runtime type, then under Hardware accelerator choose GPU.

We can run arbitrary shell commands either by preceding each one of them with a "!" or by putting the magic characters "%%bash" at the beginning of the cell.

As a first example, let us determine the type of GPU that the instance has using the nvidia-smi command.

In [None]:
%%bash
nvidia-smi

Mon Apr  4 08:43:28 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla P100-PCIE...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   35C    P0    26W / 250W |      0MiB / 16280MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

# Example 0: Querying the device

In [1]:
%%writefile device.cu
#include <cstdio>
#include <iostream>

int main()
{
	int device = 0;
	cudaDeviceProp properties;

	// query the current device, then its properties; stop at the first failure
	cudaError_t err = cudaGetDevice(&device);
	if (err == cudaSuccess)
		err = cudaGetDeviceProperties(&properties, device);
	if (err != cudaSuccess) {
		printf("error getting device: %s\n", cudaGetErrorString(err));
		return 1;
	}

	std::cout << "processor count=" << properties.multiProcessorCount << std::endl;
	std::cout << "warp size= " << properties.warpSize << std::endl;
	std::cout << "name=" << properties.name << std::endl;
	std::cout << "Compute capability=" << properties.major << "." << properties.minor << "\n";
	std::cout << "shared Memory/SM=" << properties.sharedMemPerMultiprocessor
		<< std::endl;
	printf("device =%d\n", device);
	return 0;
}

Writing device.cu


In [3]:
!nvcc -o device device.cu -run

processor count=40
warp size= 32
name=Tesla T4
Compute capability=7.5
shared Memory/SM=65536
device =0
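
The query above inspects the returned `cudaError_t` by hand. As a minimal sketch of how that check could be reused, the block below wraps it in a macro and lists every visible device. The names `CUDA_CHECK` and `device_count` are hypothetical helpers introduced for illustration and are not part of the notebook; the runtime calls themselves (`cudaGetDeviceCount`, `cudaGetDeviceProperties`, `cudaGetErrorString`) are standard CUDA runtime API.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper (not part of the notebook): abort with file/line
// information when a CUDA runtime call does not return cudaSuccess.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            std::fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,   \
                         cudaGetErrorString(err_));                   \
            std::exit(EXIT_FAILURE);                                  \
        }                                                             \
    } while (0)

// Returns the number of CUDA devices visible to the runtime.
int device_count() {
    int count = 0;
    CUDA_CHECK(cudaGetDeviceCount(&count));
    return count;
}

int main() {
    int n = device_count();
    for (int d = 0; d < n; ++d) {
        cudaDeviceProp p;
        CUDA_CHECK(cudaGetDeviceProperties(&p, d));
        std::printf("device %d: %s, %d SMs, compute capability %d.%d\n",
                    d, p.name, p.multiProcessorCount, p.major, p.minor);
    }
    return 0;
}
```

Wrapping every runtime call this way keeps the first failing call visible instead of letting a later call overwrite the error code.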


## Example 1

To write code, create a code cell and write %%writefile filename.cu on its first line. The file is only created or modified when you "run" the cell.

In [4]:
%%writefile example1.cu
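#include <cstdio>
__global__ void kernel(){
    printf("Hello from GPU\n");
}
int main(){
    // launch 1 block of 2 threads: each thread prints the message once
    kernel<<<1,2>>>();
    // wait for the kernel to finish so that its output is flushed
    cudaDeviceSynchronize();
}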

Writing example1.cu


At this point we can open the file from the left pane. Click on the folder icon in the left pane to see all the files.
Alternatively, we can list them using the Unix "ls" command.

**IMPORTANT**: any file created from a Jupyter notebook will be lost once the instance is restarted.

A Colab instance is actually a VM running Linux, so each time it starts "from scratch". If you want to save files, mount your Google Drive from the left pane.

In [None]:
!ls -la

total 20
drwxr-xr-x 1 root root 4096 Apr  4 08:44 .
drwxr-xr-x 1 root root 4096 Apr  4 08:37 ..
drwxr-xr-x 4 root root 4096 Mar 23 14:21 .config
-rw-r--r-- 1 root root  111 Apr  4 08:44 example1.cu
drwxr-xr-x 1 root root 4096 Mar 23 14:22 sample_data


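We compile the file with the NVIDIA nvcc compiler. The file extension MUST be .cu; otherwise nvcc treats the file as ordinary C++ and hands it to the "regular" C++ compiler, which does not understand the CUDA extensions.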

In [5]:
!nvcc example1.cu -o example1 
!./example1

Hello from GPU
Hello from GPU
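
The message appears twice because the launch configuration `<<<1,2>>>` starts one block containing two threads, and each thread runs the kernel body. A minimal sketch that makes this visible by printing the built-in `threadIdx.x` and `blockDim.x` variables follows; the names `hello` and `launch_hello` are chosen here for illustration only.

```cuda
#include <cstdio>

// Each thread prints its own index within the block.
__global__ void hello() {
    printf("Hello from GPU thread %u of %u\n", threadIdx.x, blockDim.x);
}

// Launches one block of `threads` threads; returns 0 on success.
int launch_hello(unsigned threads) {
    hello<<<1, threads>>>();
    // wait for the kernel so that the device-side output is flushed
    return cudaDeviceSynchronize() == cudaSuccess ? 0 : 1;
}

int main() {
    return launch_hello(2);
}
```

Changing the argument of `launch_hello` changes the number of lines printed, one per thread.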


## Example 2

Most of the examples in the workshop use **managed memory**. In the following examples we will manage the memory manually. We will revisit managed memory later.

Before computing on the GPU we need to transfer the data from host memory to device memory. Once the computation is done we transfer it back to the host.
Below is a simple example of that process.

In [6]:
%%writefile example2.cu
#include <iostream>
__global__ void kernel(int *x,int *y,int *z){
    *z=*x+*y;
}
int main(){
    int a=1,b=2,c=0; //host variables
    int *d_a,*d_b,*d_c;//will hold device addresses
    // allocate memory for one integer on the device and store
    // its address in d_a (likewise for d_b and d_c)
    cudaMalloc(&d_a,sizeof(int));
    cudaMalloc(&d_b,sizeof(int));
    cudaMalloc(&d_c,sizeof(int));
    // copy the values of a and b
    // TO device FROM host
    cudaMemcpy(d_a,&a,sizeof(int),cudaMemcpyHostToDevice);
    cudaMemcpy(d_b,&b,sizeof(int),cudaMemcpyHostToDevice);
    kernel<<<1,1>>>(d_a,d_b,d_c);
    // copy the result TO host FROM device; cudaMemcpy waits for the
    // kernel to finish, so no explicit synchronization is needed
    cudaMemcpy(&c,d_c,sizeof(int),cudaMemcpyDeviceToHost);
    // release the device memory
    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);
    std::cout<<"value of c is "<<c<<"\n";
}

Writing example2.cu


In [7]:
!nvcc example2.cu -o example2
!./example2

value of c is 3
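
For comparison with the manual transfers above, here is a minimal sketch of the same addition written with managed memory, where a single allocation from `cudaMallocManaged` is readable and writable from both host and device. The function name `managed_add` and the three-element layout of the buffer are assumptions made for this illustration.

```cuda
#include <iostream>

__global__ void add(const int *x, const int *y, int *z) {
    *z = *x + *y;
}

// Adds two integers on the GPU using managed (unified) memory.
int managed_add(int a, int b) {
    int *v = nullptr;  // v[0] and v[1] are the inputs, v[2] is the result
    cudaMallocManaged(&v, 3 * sizeof(int));
    v[0] = a;          // the host writes the buffer directly,
    v[1] = b;          // no cudaMemcpy is involved
    v[2] = 0;
    add<<<1, 1>>>(v, v + 1, v + 2);
    cudaDeviceSynchronize();  // required before the host reads v[2]
    int c = v[2];
    cudaFree(v);
    return c;
}

int main() {
    std::cout << "value of c is " << managed_add(1, 2) << "\n";
}
```

Note that the explicit `cudaDeviceSynchronize()` is needed here: without a blocking `cudaMemcpy`, nothing else makes the host wait for the kernel.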


# More details
Below is similar code with more details. In particular, it describes what can and cannot be accessed from the device and explains the \_\_device\_\_ and \_\_global\_\_ modifiers.

In [8]:
%%writefile example3.cu
#include <iostream>

/**
 * @brief const global variables are accessible from both the CPU and the GPU.
 * If the const modifier is removed from the declaration of g, the compiler
 * will give an error.
 * NOTE: global variables in the C++ sense are DIFFERENT from the __global__
 * modifier.
 * NOTE: on Windows, GPU access to const global variables of a type other
 * than int gives an error; on Linux it works.
 */
const int g=10;
#ifdef WINDOWS
__device__ 
#endif 
const float a=2.3;

/* __device__ means a function called from the device
* AND runs on the device
*/

__device__ float useless(){
    return 2*a;
}

/* __global__ means a function called from the host 
 * and runs on device
 */
__global__ void example1(int *address){
    *address=useless()*g*17;

}


int main(){
    int x;
    /* variable that holds device address */
    int * device_address;
    /* allocate space for one int on the device 
     * and store the allocated address in device_address
     * NOTE the passing of the address of device_address variable
     */
    cudaMalloc(&device_address,sizeof(int));
    /* print the host address of x and the allocated device address */
    std::cout<<"address of x="<<std::hex<<&x<<"\n";
    std::cout<<"device address="<<device_address<<"\n";
    /* launch function example1 with one block containing 1 thread 
     * @NOTE: all kernel launches are asynchronous
     * they return immediately to host code
     **/
    example1 <<<1,1>>>(device_address);
    /* copy the result back from device to host */
    cudaMemcpy(&x,device_address,sizeof(int),cudaMemcpyDeviceToHost);
    cudaFree(device_address);
    std::cout<<"The value of x="<<std::dec<<x<<"\n";

}

Writing example3.cu
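
Besides `__device__` and `__global__`, CUDA also allows the two qualifiers `__host__ __device__` to be combined so that one function is compiled for both sides. The sketch below is an illustration under that assumption; the names `twice` and `twice_on_device` are hypothetical and do not appear in the notebook.

```cuda
#include <iostream>

// __host__ __device__ asks nvcc to compile the function for both the
// CPU and the GPU, so it can be called from either side.
__host__ __device__ int twice(int v) {
    return 2 * v;
}

// __global__: called from the host, runs on the device.
__global__ void kernel(int *out, int v) {
    *out = twice(v);  // device-side call
}

// Runs twice() on the GPU and returns the result to the host.
int twice_on_device(int v) {
    int *d_out = nullptr;
    int result = 0;
    cudaMalloc(&d_out, sizeof(int));
    kernel<<<1, 1>>>(d_out, v);
    cudaMemcpy(&result, d_out, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    return result;
}

int main() {
    std::cout << "host:   " << twice(21) << "\n";            // host-side call
    std::cout << "device: " << twice_on_device(21) << "\n";
}
```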


## Example 4
This is the first example where we use parallelism: a SAXPY computation, z = a * x + y, over two arrays.
Each thread computes one element of the result. To accomplish that we map the thread id to the array index. With a __single__, __linear__ block the thread id is equal to the builtin variable threadIdx.x. The code below launches a one-dimensional grid of such blocks, so the index becomes blockDim.x * blockIdx.x + threadIdx.x.

In [10]:
%%writefile example4.cu
/**
 * @file example4.cu
 * @author Hikmat Farhat (hfarhat@ndu.edu.lb)
 * @brief SAXPY example in CUDA
 * given vectors X and Y and a constant a compute a*X+Y
 * @version 0.1
 * @date 2021-12-20
 * 
 * @copyright Copyright (c) 2021
 * 
 */
#include <cstdlib>
#include <iostream>
#include <cuda_runtime.h>


__global__ void saxpy(float *z,float *x,float *y,float a,int n){
   
     int i=blockDim.x*blockIdx.x+threadIdx.x;
     if( i<n)
        z[i]=a*x[i]+y[i];

}
int main(){
    /* number of elements */
    const int n=1<<20;
    const float a=3.0;
/* create 3 arrays on host */
    float *x, *y, *z;
    x=(float *)malloc(n*sizeof(float));
    y=(float *)malloc(n*sizeof(float));
    z=(float *)malloc(n*sizeof(float));
/* populate x and y */
    for(int i=0;i<n;++i){
        x[i]=2;
        y[i]=4;
    }
/* create 3 arrays on device */
    float *dx,*dy,*dz;
    cudaMalloc(&dx,n*sizeof(float));
    cudaMalloc(&dy,n*sizeof(float));
    cudaMalloc(&dz,n*sizeof(float));
/* transfer the values of x,y to device */
    cudaMemcpy(dx,x,n*sizeof(float),cudaMemcpyHostToDevice);
    cudaMemcpy(dy,y,n*sizeof(float),cudaMemcpyHostToDevice);
    /** 1-d grid 
     * in general type dim3 has 3 values (x,y,z)
     * when omitted each d defaults to 1
     * e.g. dim3 a(256)=a(256,1,1)
     * dim3 b(256,128)=b(256,128,1)
     ***/
    dim3 block (256);
    dim3 grid ((n+block.x-1)/block.x,1);
    /**
     * NOTE:  kernel launch is asynchronous with respect
     * to host code whereas cudaMemcpy is blocking.
     */
    saxpy<<<grid,block>>>(dz,dx,dy,a,n); 

    cudaError_t e=cudaGetLastError();
    if(e!=cudaSuccess){
        std::cout<<cudaGetErrorString(e)<<"\n";
    }
/* transfer the result back to host */
    cudaMemcpy(z,dz,n*sizeof(float),cudaMemcpyDeviceToHost);
/* check if the result is correct. We expect all values 
 * of z=10
 */
    int sum=0;
    for(int i=0;i<n;++i)
        sum+=z[i];
    if (sum!=n*10)std::cout<<"sum error"<<sum<<"\n";
    else
        std::cout<<"check passed. Sum= "<<sum<<"\n";
    free(x);
    free(y);
    free(z);
    cudaFree(dx);
    cudaFree(dy);
    cudaFree(dz);
}

Overwriting example4.cu


In [11]:
!nvcc example4.cu -o example4 -run

check passed. Sum= 10485760
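
To see that the mapping `blockDim.x*blockIdx.x+threadIdx.x` together with the `i<n` guard covers every array element exactly once, even when n is not a multiple of the block size, a small sketch can have each thread store its own global index and let the host count the hits. The names `write_index` and `count_covered` are made up for this illustration.

```cuda
#include <iostream>
#include <vector>

// Each thread stores its global index at the position it is responsible for.
__global__ void write_index(int *out, int n) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n)  // threads past the end of the array do nothing
        out[i] = i;
}

// Launches enough blocks of `threads` threads to cover n elements and
// returns how many positions ended up holding their own index.
int count_covered(int n, int threads) {
    int blocks = (n + threads - 1) / threads;  // round up
    int *d_out = nullptr;
    cudaMalloc(&d_out, n * sizeof(int));
    // set every byte to 0xFF, i.e. every int to -1, so untouched slots show up
    cudaMemset(d_out, 0xFF, n * sizeof(int));
    write_index<<<blocks, threads>>>(d_out, n);
    std::vector<int> h(n);
    cudaMemcpy(h.data(), d_out, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    int ok = 0;
    for (int i = 0; i < n; ++i)
        if (h[i] == i) ++ok;
    return ok;
}

int main() {
    const int n = 1000;  // deliberately not a multiple of the block size
    std::cout << count_covered(n, 256) << " of " << n << " elements covered\n";
}
```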


## Thread blocks

In CUDA the __maximum__ number of threads in a block is 1024. What if in the previous example we would like to compute the sum of two vectors with size bigger than 1024? We use multiple blocks.
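
The limit can be read from the device itself, and the number of blocks needed for a given array size follows from a rounded-up division. The sketch below assumes device 0 and uses a hypothetical helper `blocks_for`; `maxThreadsPerBlock` is a field of `cudaDeviceProp`.

```cuda
#include <iostream>

// Number of blocks needed to cover n elements with `threads` threads per block.
int blocks_for(int n, int threads) {
    return (n + threads - 1) / threads;  // integer ceiling of n / threads
}

int main() {
    cudaDeviceProp p;
    if (cudaGetDeviceProperties(&p, 0) != cudaSuccess) {
        std::cout << "no CUDA device found\n";
        return 1;
    }
    std::cout << "max threads per block: " << p.maxThreadsPerBlock << "\n";
    std::cout << "blocks for 1<<20 elements at 256 threads: "
              << blocks_for(1 << 20, 256) << "\n";  // 4096
    std::cout << "blocks for 1000 elements at 256 threads: "
              << blocks_for(1000, 256) << "\n";     // 4
}
```

When n is not a multiple of the block size the last block is only partly used, which is why kernels guard their work with a bounds check.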

### Matrix addition
We repeat the previous example for matrices, using a two-dimensional grid of two-dimensional blocks so that each thread handles one matrix element.

In [13]:
%%writefile example5.cu
/**
 * @file example5.cu
 * @author Hikmat Farhat (hfarhat@ndu.edu.lb)
 * @brief Simple matrix addition
 * @version 0.1
 * @date 2021-12-30
 * 
 * @copyright Copyright (c) 2021
 * 
 */
#include <iostream>
const size_t width=1024;
const size_t height=1024;

//typedef float arr_t[width];
using arr_t = float[width];
__global__ void matrix_add(arr_t *c,arr_t *a,arr_t *b,size_t width,size_t height){

    size_t idx=blockIdx.x*blockDim.x+threadIdx.x;
    size_t idy=blockIdx.y*blockDim.y+threadIdx.y;
    if( idx<width && idy <height)
        //c[idx][idy]=a[idx][idy]+b[idx][idy];
        c[idy][idx]=a[idy][idx]+b[idy][idx];
}
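int main(){
    arr_t *a,*b,*c;
    /* managed memory is accessible from both host and device */
    cudaMallocManaged(&a,height*sizeof(arr_t));
    cudaMallocManaged(&b,height*sizeof(arr_t));
    cudaMallocManaged(&c,height*sizeof(arr_t));
    /* the first index is the row (height), the second the column (width) */
    for(size_t i=0;i<height;++i){
        for(size_t j=0;j<width;++j){
            a[i][j]=1;
            b[i][j]=2;
        }
    }
    dim3 block(32,32);
    /* round up so that the grid covers the whole matrix */
    dim3 grid((width+block.x-1)/block.x,(height+block.y-1)/block.y);
    matrix_add<<<grid,block>>>(c,a,b,width,height);
    cudaDeviceSynchronize();
    bool e=false;
    /* stop scanning at the first wrong element */
    for(size_t i=0;i<height && !e;++i)
        for(size_t j=0;j<width;++j)
            if(c[i][j]!=3){
                e=true;
                break;
            }
    if (e)std::cout<<"error\n";
    cudaFree(a);
    cudaFree(b);
    cudaFree(c);
}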

Writing example5.cu
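
The kernel above relies on the `arr_t` row type to index the matrix as `c[idy][idx]`, which fixes the width at compile time. A common alternative, sketched here as an assumption rather than as the notebook's method, stores the matrix as a flat array in row-major order and computes the offset `row * width + col` explicitly. The names `matrix_add_flat` and `check_flat_add` are hypothetical.

```cuda
#include <iostream>

// Same 2-D thread mapping, but the matrices are flat arrays stored row
// by row, so element (row, col) lives at offset row * width + col.
__global__ void matrix_add_flat(float *c, const float *a, const float *b,
                                size_t width, size_t height) {
    size_t col = blockIdx.x * blockDim.x + threadIdx.x;
    size_t row = blockIdx.y * blockDim.y + threadIdx.y;
    if (col < width && row < height)
        c[row * width + col] = a[row * width + col] + b[row * width + col];
}

// Adds two width x height matrices filled with 1 and 2; returns true if
// every element of the result equals 3.
bool check_flat_add(size_t width, size_t height) {
    size_t n = width * height;
    float *a, *b, *c;
    cudaMallocManaged(&a, n * sizeof(float));
    cudaMallocManaged(&b, n * sizeof(float));
    cudaMallocManaged(&c, n * sizeof(float));
    for (size_t i = 0; i < n; ++i) {
        a[i] = 1;
        b[i] = 2;
    }
    dim3 block(32, 32);
    dim3 grid((width + block.x - 1) / block.x,
              (height + block.y - 1) / block.y);
    matrix_add_flat<<<grid, block>>>(c, a, b, width, height);
    cudaDeviceSynchronize();  // wait before the host reads c
    bool ok = true;
    for (size_t i = 0; i < n; ++i)
        if (c[i] != 3) { ok = false; break; }
    cudaFree(a);
    cudaFree(b);
    cudaFree(c);
    return ok;
}

int main() {
    // sizes that are not multiples of 32 exercise the bounds check
    std::cout << (check_flat_add(1000, 600) ? "check passed\n" : "error\n");
}
```

With a flat layout the width and height can be chosen at run time, at the cost of writing the index arithmetic by hand.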
