# Introduction to CUDA

## Setup

- Check CUDA version

In [1]:
!nvcc --version

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Sep_21_10:33:58_PDT_2022
Cuda compilation tools, release 11.8, V11.8.89
Build cuda_11.8.r11.8/compiler.31833905_0


- Install the nvcc4jupyter plugin, which lets notebook cells compile and run CUDA C/C++ code

In [2]:
!pip install git+https://github.com/andreinechaev/nvcc4jupyter.git

Collecting git+https://github.com/andreinechaev/nvcc4jupyter.git
  Cloning https://github.com/andreinechaev/nvcc4jupyter.git to /tmp/pip-req-build-dq9d1x67
  Running command git clone --filter=blob:none --quiet https://github.com/andreinechaev/nvcc4jupyter.git /tmp/pip-req-build-dq9d1x67
  Resolved https://github.com/andreinechaev/nvcc4jupyter.git to commit 0a71d56e5dce3ff1f0dd2c47c29367629262f527
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: NVCCPlugin
  Building wheel for NVCCPlugin (setup.py) ... done
  Created wheel for NVCCPlugin: filename=NVCCPlugin-0.0.2-py3-none-any.whl size=4295 sha256=775da20a813f016bebc08bac5889c93333afbf532d71b81a5b95ff97c7447b56
  Stored in directory: /tmp/pip-ephem-wheel-cache-d41ixfmj/wheels/a8/b9/18/23f8ef71ceb0f63297dd1903aedd067e6243a68ea756d6feea
Successfully built NVCCPlugin
Installing collected packages: NVCCPlugin
Successfully installed NVCCPlugin-0.0.2


- Load plugin

In [3]:
%load_ext nvcc_plugin

created output directory at /content/src
Out bin /content/result.out


## Hello World program

- Print a message from the CPU, then launch a one-thread kernel that prints from the GPU

In [10]:
%%cu
// The %%cu cell magic on the first line compiles and runs this cell with nvcc in Colab
#include <stdio.h>

void helloCPU()
{
  printf("Hello from the CPU.\n");
}

/*
 * The `__global__` qualifier marks this function as a kernel:
 * it runs on the GPU and is launched from host (CPU) code.
 */

__global__ void helloGPU()
{
  printf("Hello from the GPU.\n");
}

int main()
{
  helloCPU();

  /*
   * Adding an execution configuration with the <<<...>>> syntax
   * launches this function as a kernel on the GPU.
   * <<<1, 1>>> means one block containing one thread.
   */

  helloGPU<<<1, 1>>>();

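  /*
   * `cudaDeviceSynchronize` blocks the host (CPU) thread until
   * all previously launched GPU work has completed.
   */

  // Kernel launches are asynchronous. Without this call, main() can return
  // before the kernel's printf output is flushed, so the GPU line may never appear.
  cudaDeviceSynchronize();

  return 0;
}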

Hello from the CPU.
Hello from the GPU.



## SAXPY Program

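- SAXPY stands for "Single-Precision A·X Plus Y": scale one vector by a scalar and add a second vector

  $z = ax + y$  
  $x, y, z$: vectors of single-precision floats  
  $a$: a scalar

  The kernel below stores the result back into $y$ (that is, $y \leftarrow ax + y$) rather than into a separate $z$.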
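
As a point of comparison for the GPU version, the same update can be written as a plain CPU loop. This is a minimal sketch: the name `saxpy_cpu` is hypothetical (it is not part of CUDA or of this notebook), and it overwrites `y` in place to mirror what the notebook's kernel does.

```c
#include <assert.h>
#include <stddef.h>

/* CPU reference for SAXPY: overwrites y[i] with a * x[i] + y[i].
 * saxpy_cpu is an illustrative name, not a CUDA or library function. */
void saxpy_cpu(size_t n, float a, const float *x, float *y)
{
    /* Empty vectors are allowed; otherwise both pointers must be valid. */
    assert(n == 0 || (x != NULL && y != NULL));

    /* One loop iteration here corresponds to one GPU thread in the kernel. */
    for (size_t i = 0; i < n; i++) {
        y[i] = a * x[i] + y[i];
    }
}
```

Calling `saxpy_cpu(N, 2.0f, x, y)` on the host arrays used in this notebook (every `x[i] = 1.0f`, every `y[i] = 2.0f`) would leave every `y[i]` equal to `4.0f`, which is the value the error check compares against.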

In [11]:
%%cu
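// Compiled and run with nvcc through the %%cu cell magic in Colab
#include <math.h>   // fabsf, fmaxf
#include <stdio.h>  // printf
#include <stdlib.h> // malloc, free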

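__global__ // kernel: runs on the GPU, launched from the host
void saxpy(int n, float a, float *x, float *y)
{
  // Every launched thread prints this line, including threads with i >= n
  printf("GPU active.....\n");

  int i = blockIdx.x*blockDim.x + threadIdx.x; // global index of this thread across all blocks
  if (i < n) y[i] = a*x[i] + y[i]; // SAXPY update; the guard keeps extra threads from writing out of bounds
}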

int main(void)
{
  int N = 100; // number of elements in each vector
  float *x, *y, *d_x, *d_y; // x, y live on the host; by convention the d_ prefix marks device (GPU) pointers
  x = (float*)malloc(N*sizeof(float)); // creating vector x holding N elements
  y = (float*)malloc(N*sizeof(float));

  cudaMalloc(&d_x, N*sizeof(float)); // create vector x holding N elements on the device
  cudaMalloc(&d_y, N*sizeof(float));

  for (int i = 0; i < N; i++) { // fill up N values on the CPU
    x[i] = 1.0f;
    y[i] = 2.0f;
  }

  cudaMemcpy(d_x, x, N*sizeof(float), cudaMemcpyHostToDevice); // copy from host to device (GPU) - x into d_x
  cudaMemcpy(d_y, y, N*sizeof(float), cudaMemcpyHostToDevice);

  // Perform SAXPY on N elements with 256 threads per block.
  // (N+255)/256 rounds up, so enough blocks are launched to cover all N elements (1 block for N = 100).
  saxpy<<<(N+255)/256, 256>>>(N, 2.0f, d_x, d_y);

  cudaMemcpy(y, d_y, N*sizeof(float), cudaMemcpyDeviceToHost); // copy from device (GPU) to host (CPU) - d_y to y

  // Each result should be 2.0f*1.0f + 2.0f = 4.0f
  float maxError = 0.0f;
  for (int i = 0; i < N; i++)
    maxError = fmaxf(maxError, fabsf(y[i]-4.0f));
  printf("Max error: %f\n", maxError);

  // free memory
  cudaFree(d_x);
  cudaFree(d_y);
  free(x);
  free(y);

  return 0;
}

GPU active.....
GPU active.....
GPU active.....
[... the same line repeats once per launched thread; the captured output is cut off here]
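
The launch configuration `(N+255)/256, 256` and the index expression `blockIdx.x*blockDim.x + threadIdx.x` are plain integer arithmetic, so they can be checked on the CPU. The sketch below only models that arithmetic. The helper names `blocks_needed`, `global_index`, and `idle_threads` are hypothetical and are not CUDA APIs.

```c
#include <assert.h>

/* Number of blocks so that blocks * threads_per_block >= n.
 * This is integer ceiling division, the general form of (N+255)/256. */
int blocks_needed(int n, int threads_per_block)
{
    assert(n >= 0 && threads_per_block > 0);
    return (n + threads_per_block - 1) / threads_per_block;
}

/* Models blockIdx.x * blockDim.x + threadIdx.x for a 1-D launch. */
int global_index(int block_idx, int block_dim, int thread_idx)
{
    return block_idx * block_dim + thread_idx;
}

/* Threads that are launched but fail the `i < n` guard. */
int idle_threads(int n, int threads_per_block)
{
    return blocks_needed(n, threads_per_block) * threads_per_block - n;
}
```

For `N = 100` and 256 threads per block, `blocks_needed` gives 1 and `idle_threads` gives 156. All 256 threads print `GPU active.....`, but only the first 100 pass the `i < n` guard and update `y`.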

### References

- [NVIDIA Developer: Six Ways to SAXPY](https://developer.nvidia.com/blog/six-ways-saxpy/)