### **CUDA** 
Below is an example that compiles and runs native CUDA code. 

1.   We check the CUDA version, the driver, and the available GPU with "nvcc --version" and "nvidia-smi"
2.   We use the IPython cell magic "%%writefile filename" to save a *.cu program
3.   We then compile and run the *.cu program with nvcc
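
Step 1 can also be done from Python, which helps when the notebook might be running on a runtime without a GPU. The sketch below is only an illustration: the helper name `run_if_available` and the `COMMANDS` table are hypothetical, and it assumes the two tools are found through `PATH`, as they are in the first cell.

```python
import shutil
import subprocess

# The two commands used in the first cell of this notebook.
COMMANDS = {
    "nvcc": ["nvcc", "--version"],
    "nvidia-smi": ["nvidia-smi"],
}

def run_if_available(argv):
    """Return the command's stdout, or None if the tool is missing or cannot be run."""
    if shutil.which(argv[0]) is None:
        return None  # tool is not on PATH
    try:
        done = subprocess.run(argv, capture_output=True, text=True, timeout=60)
    except (OSError, subprocess.TimeoutExpired):
        return None  # tool exists but could not be started or hung
    return done.stdout

# Hypothetical summary: one entry per tool, None when it is unavailable.
report = {name: run_if_available(argv) for name, argv in COMMANDS.items()}
for name, out in report.items():
    status = "unavailable" if out is None else "available"
    print(f"{name}: {status}")
```

If either tool is reported as unavailable, the compile and profiling cells further down will fail, so it is worth switching to a GPU runtime first.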







In [1]:
!nvcc --version
!nvidia-smi

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Sun_Feb_14_21:12:58_PST_2021
Cuda compilation tools, release 11.2, V11.2.152
Build cuda_11.2.r11.2/compiler.29618528_0
Sun Jan 15 23:41:04 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   46C    P8    10W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+------


## Next, we write native CUDA code and save it as 'vectorAdd.cu'


In [2]:
%%writefile vectorAdd.cu
#include <stdio.h>
#include <stdlib.h>

// Kernel: adds the two integers pointed to by a and b and stores the sum in *c.
// Despite the file name, this adds a single pair of scalars on one thread.
__global__ void add(int *a, int *b, int *c) {
    *c = *a + *b;
}

int main() {
    int a, b, c;            // host copies of variables a, b & c
    int *d_a, *d_b, *d_c;   // device copies of variables a, b & c
    int size = sizeof(int);

    // Allocate space for device copies of a, b, c
    cudaMalloc((void **)&d_a, size);
    cudaMalloc((void **)&d_b, size);
    cudaMalloc((void **)&d_c, size);

    // Set up input values
    c = 0;
    a = 3;
    b = 5;

    // Copy inputs to device
    cudaMemcpy(d_a, &a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, &b, size, cudaMemcpyHostToDevice);

    // Launch add() kernel on GPU with one block of one thread
    add<<<1, 1>>>(d_a, d_b, d_c);

    // Copy result back to host; cudaMemcpy waits for the kernel to finish first
    cudaError_t err = cudaMemcpy(&c, d_c, size, cudaMemcpyDeviceToHost);
    if (err != cudaSuccess) {
        printf("CUDA error copying to Host: %s\n", cudaGetErrorString(err));
    }
    printf("result is %d\n", c);

    // Cleanup
    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);
    return 0;
}

Writing vectorAdd.cu


## We compile the saved CUDA code with the nvcc compiler

In [3]:
!nvcc vectorAdd.cu -o vectorAdd
!ls


sample_data  vectorAdd	vectorAdd.cu


## Finally, we run the compiled binary

In [4]:
!./vectorAdd

result is 8


In [5]:
!nvcc -arch=sm_75 -I/usr/local/cuda/samples/common/inc lab3_ex1_template.cu -o vecadd

In [7]:
!/usr/local/cuda-11/bin/nv-nsight-cu-cli ./vecadd 1024

The input length is 1024
==PROF== Connected to process 1405 (/content/vecadd)
Transfer time host to device 0.000056 seconds
==PROF== Profiling "vecAdd" - 1: 0%....50%....100% - 8 passes
Kernel time 0.782115 seconds
Transfer time device to host 0.000052 seconds
==PROF== Disconnected from process 1405
[1405] vecadd@127.0.0.1
  vecAdd(double*, double*, double*, int), 2023-Jan-15 23:44:29, Context 1, Stream 7
    Section: GPU Speed Of Light
    ---------------------------------------------------------------------- --------------- ------------------------------
    DRAM Frequency                                                           cycle/nsecond                           4.93
    SM Frequency                                                             cycle/usecond                         576.11
    Elapsed Cycles                                                                   cycle                          3,005
    Memory [%]                                                         

In [8]:
!/usr/local/cuda-11/bin/nv-nsight-cu-cli ./vecadd 131070

The input length is 131070
==PROF== Connected to process 2281 (/content/vecadd)
Transfer time host to device 0.001014 seconds
==PROF== Profiling "vecAdd" - 1: 0%....50%....100% - 8 passes
Kernel time 0.659832 seconds
Transfer time device to host 0.001259 seconds
==PROF== Disconnected from process 2281
[2281] vecadd@127.0.0.1
  vecAdd(double*, double*, double*, int), 2023-Jan-15 23:48:00, Context 1, Stream 7
    Section: GPU Speed Of Light
    ---------------------------------------------------------------------- --------------- ------------------------------
    DRAM Frequency                                                           cycle/nsecond                           4.91
    SM Frequency                                                             cycle/usecond                         573.62
    Elapsed Cycles                                                                   cycle                          8,316
    Memory [%]                                                       

In [9]:
!/usr/local/cuda-11/bin/nv-nsight-cu-cli ./vecadd 204800

The input length is 204800
==PROF== Connected to process 5301 (/content/vecadd)
Transfer time host to device 0.001468 seconds
==PROF== Profiling "vecAdd" - 1: 0%....50%....100% - 8 passes
Kernel time 0.661228 seconds
Transfer time device to host 0.001700 seconds
==PROF== Disconnected from process 5301
[5301] vecadd@127.0.0.1
  vecAdd(double*, double*, double*, int), 2023-Jan-16 00:00:31, Context 1, Stream 7
    Section: GPU Speed Of Light
    ---------------------------------------------------------------------- --------------- ------------------------------
    DRAM Frequency                                                           cycle/nsecond                           5.02
    SM Frequency                                                             cycle/usecond                         587.43
    Elapsed Cycles                                                                   cycle                         11,263
    Memory [%]                                                       

In [10]:
!/usr/local/cuda-11/bin/nv-nsight-cu-cli ./vecadd 409600

The input length is 409600
==PROF== Connected to process 5479 (/content/vecadd)
Transfer time host to device 0.002619 seconds
==PROF== Profiling "vecAdd" - 1: 0%....50%....100% - 8 passes
Kernel time 0.636049 seconds
Transfer time device to host 0.003388 seconds
==PROF== Disconnected from process 5479
[5479] vecadd@127.0.0.1
  vecAdd(double*, double*, double*, int), 2023-Jan-16 00:01:09, Context 1, Stream 7
    Section: GPU Speed Of Light
    ---------------------------------------------------------------------- --------------- ------------------------------
    DRAM Frequency                                                           cycle/nsecond                           4.97
    SM Frequency                                                             cycle/usecond                         582.89
    Elapsed Cycles                                                                   cycle                         21,547
    Memory [%]                                                       

In [11]:
!/usr/local/cuda-11/bin/nv-nsight-cu-cli ./vecadd 512000

The input length is 512000
==PROF== Connected to process 5825 (/content/vecadd)
Transfer time host to device 0.003181 seconds
==PROF== Profiling "vecAdd" - 1: 0%....50%....100% - 8 passes
Kernel time 0.631751 seconds
Transfer time device to host 0.003872 seconds
==PROF== Disconnected from process 5825
[5825] vecadd@127.0.0.1
  vecAdd(double*, double*, double*, int), 2023-Jan-16 00:02:29, Context 1, Stream 7
    Section: GPU Speed Of Light
    ---------------------------------------------------------------------- --------------- ------------------------------
    DRAM Frequency                                                           cycle/nsecond                           4.98
    SM Frequency                                                             cycle/usecond                         584.05
    Elapsed Cycles                                                                   cycle                         26,729
    Memory [%]                                                       

In [12]:
!/usr/local/cuda-11/bin/nv-nsight-cu-cli ./vecadd 819200

The input length is 819200
==PROF== Connected to process 5975 (/content/vecadd)
Transfer time host to device 0.005082 seconds
==PROF== Profiling "vecAdd" - 1: 0%....50%....100% - 8 passes
Kernel time 0.673848 seconds
Transfer time device to host 0.006751 seconds
==PROF== Disconnected from process 5975
[5975] vecadd@127.0.0.1
  vecAdd(double*, double*, double*, int), 2023-Jan-16 00:03:00, Context 1, Stream 7
    Section: GPU Speed Of Light
    ---------------------------------------------------------------------- --------------- ------------------------------
    DRAM Frequency                                                           cycle/nsecond                           4.99
    SM Frequency                                                             cycle/usecond                         584.71
    Elapsed Cycles                                                                   cycle                         42,571
    Memory [%]                                                       

In [13]:
!/usr/local/cuda-11/bin/nv-nsight-cu-cli ./vecadd 1000000

The input length is 1000000
==PROF== Connected to process 6221 (/content/vecadd)
Transfer time host to device 0.006085 seconds
==PROF== Profiling "vecAdd" - 1: 0%....50%....100% - 8 passes
Kernel time 0.667252 seconds
Transfer time device to host 0.007503 seconds
==PROF== Disconnected from process 6221
[6221] vecadd@127.0.0.1
  vecAdd(double*, double*, double*, int), 2023-Jan-16 00:03:54, Context 1, Stream 7
    Section: GPU Speed Of Light
    ---------------------------------------------------------------------- --------------- ------------------------------
    DRAM Frequency                                                           cycle/nsecond                           4.99
    SM Frequency                                                             cycle/usecond                         584.21
    Elapsed Cycles                                                                   cycle                         51,733
    Memory [%]                                                      

In [14]:
!/usr/local/cuda-11/bin/nv-nsight-cu-cli ./vecadd 2000000

The input length is 2000000
==PROF== Connected to process 6295 (/content/vecadd)
Transfer time host to device 0.012094 seconds
==PROF== Profiling "vecAdd" - 1: 0%....50%....100% - 8 passes
Kernel time 0.645282 seconds
Transfer time device to host 0.015182 seconds
==PROF== Disconnected from process 6295
[6295] vecadd@127.0.0.1
  vecAdd(double*, double*, double*, int), 2023-Jan-16 00:04:07, Context 1, Stream 7
    Section: GPU Speed Of Light
    ---------------------------------------------------------------------- --------------- ------------------------------
    DRAM Frequency                                                           cycle/nsecond                           5.00
    SM Frequency                                                             cycle/usecond                         585.08
    Elapsed Cycles                                                                   cycle                        105,094
    Memory [%]                                                      
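
The profiler outputs above are cut off before any duration row, but the kernel's run time can be estimated from the two rows that survive: dividing "Elapsed Cycles" by "SM Frequency" (reported in cycles per microsecond) gives an approximate duration in microseconds. The sketch below applies that to the eight runs. The values are copied from the outputs above; the names `RUNS` and `kernel_time_us` are hypothetical, and the result is an estimate, not the profiler's own duration metric.

```python
# (input length, Elapsed Cycles, SM Frequency in cycle/usecond),
# copied from the eight profiler outputs above.
RUNS = [
    (1024, 3005, 576.11),
    (131070, 8316, 573.62),
    (204800, 11263, 587.43),
    (409600, 21547, 582.89),
    (512000, 26729, 584.05),
    (819200, 42571, 584.71),
    (1000000, 51733, 584.21),
    (2000000, 105094, 585.08),
]

def kernel_time_us(elapsed_cycles, sm_freq_cycles_per_us):
    """Approximate kernel duration in microseconds: cycles / (cycles per microsecond)."""
    return elapsed_cycles / sm_freq_cycles_per_us

# Estimated duration for each run, in the same order as RUNS.
estimates = [(n, kernel_time_us(cycles, freq)) for n, cycles, freq in RUNS]
for n, t_us in estimates:
    # Per-element cost in nanoseconds makes runs of different length comparable.
    print(f"N = {n:>9,d}   kernel ~ {t_us:8.2f} us   ({1000 * t_us / n:.3f} ns per element)")
```

These estimates are in the microsecond range and grow with the input length, whereas the "Kernel time" printed by the program stays between roughly 0.63 and 0.78 seconds for every size. That suggests the program's own timer is mostly measuring profiler overhead (each kernel is profiled in 8 passes), so it should not be read as the kernel's real run time.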