In [9]:
!apt update -qq;
!wget https://developer.nvidia.com/compute/cuda/8.0/Prod2/local_installers/cuda-repo-ubuntu1604-8-0-local-ga2_8.0.61-1_amd64-deb;
!dpkg -i cuda-repo-ubuntu1604-8-0-local-ga2_8.0.61-1_amd64-deb;
!apt-key add /var/cuda-repo-8-0-local-ga2/7fa2af80.pub;
!apt-get update -qq;
!apt-get install cuda gcc-5 g++-5 -y -qq;
!ln -s /usr/bin/gcc-5 /usr/local/cuda/bin/gcc;
!ln -s /usr/bin/g++-5 /usr/local/cuda/bin/g++;
!apt install cuda-8.0;

74 packages can be upgraded. Run 'apt list --upgradable' to see them.
--2021-08-05 19:38:44--  https://developer.nvidia.com/compute/cuda/8.0/Prod2/local_installers/cuda-repo-ubuntu1604-8-0-local-ga2_8.0.61-1_amd64-deb
Resolving developer.nvidia.com (developer.nvidia.com)... 152.199.0.24
Connecting to developer.nvidia.com (developer.nvidia.com)|152.199.0.24|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://developer.nvidia.com/compute/cuda/8.0/prod2/local_installers/cuda-repo-ubuntu1604-8-0-local-ga2_8.0.61-1_amd64-deb [following]
--2021-08-05 19:38:44--  https://developer.nvidia.com/compute/cuda/8.0/prod2/local_installers/cuda-repo-ubuntu1604-8-0-local-ga2_8.0.61-1_amd64-deb
Reusing existing connection to developer.nvidia.com:443.
HTTP request sent, awaiting response... 302 Found
Location: https://developer.download.nvidia.com/compute/cuda/8.0/secure/Prod2/local_installers/cuda-repo-ubuntu1604-8-0-local-ga2_8.0.61-1_amd64.deb?qTO7rH6F5oj

In [14]:
!pip install git+git://github.com/andreinechaev/nvcc4jupyter.git

Collecting git+git://github.com/andreinechaev/nvcc4jupyter.git
  Cloning git://github.com/andreinechaev/nvcc4jupyter.git to /tmp/pip-req-build-nrlprahd
  Running command git clone -q git://github.com/andreinechaev/nvcc4jupyter.git /tmp/pip-req-build-nrlprahd


In [16]:
%load_ext nvcc_plugin

The nvcc_plugin extension is already loaded. To reload it, use:
  %reload_ext nvcc_plugin


# Test CUDA

In [24]:
%%cu
#include <iostream>
int main() {
    std::cout << "Hello world\n";
    return 0;
}

Hello world



# Array Sum Problem

In [33]:
%%cu
#include <iostream>
#include <math.h>

// function to add the elements of two arrays
__global__ void add(int n, float *x, float *y)
{
  for (int i = 0; i < n; i++)
      y[i] = x[i] + y[i];
}

int main(void)
{
 
  int N = 1<<20;

 float *x, *y;


 cudaMallocManaged(&x, N*sizeof(float));
 cudaMallocManaged(&y, N*sizeof(float));


  // initialize x and y arrays on the host
  for (int i = 0; i < N; i++) {
    x[i] = 1.0f;
    y[i] = 2.0f;
  }
 
  add<<<1,1>>>(N, x, y); 

 cudaDeviceSynchronize();

  // Check for errors (all values should be 3.0f)
  float maxError = 0.0f;
  for (int i = 0; i < N; i++)
    maxError = fmax(maxError, fabs(y[i]-3.0f));
  std::cout << "Max error: " << maxError << std::endl;
 

  // Free memory
  delete [] x;
  delete [] y;
 

  return 0;
}

Max error: 0
src/tcmalloc.cc:283] Attempt to free invalid pointer 0x7f7aa2000000 



In [39]:
!/usr/local/cuda/bin/nvcc -arch=sm_35 -rdc=true 1.cu -o 1 -lcudadevrt

In [40]:
!nvprof ./1

Max error: 0
src/tcmalloc.cc:283] Attempt to free invalid pointer 0x7ff63a000000 


To make out program realy fast, like not just using the GPU, we want it to realy use all the architecture of a GPU and in order to do that we need to use some of those GPU constructs that `Nvidia` has defined like threads, blocks, and grids.

Let's check that:

Neural Netoworks use matrix operations all the time and we nknow that properly.

Now to increase our speed we can use GPU. Although CPU supports multithreading, but it is not supporting a million thread. That's what GPU are made for.

Let's look at GPU. On the GPU, each single thread lives inside of what's called a block. Each block has multiple of 32 threads in itself. Also a set of blocks makes grid. That means you have `a grid of blocks of threads`.


The reason for limit to the number of threads per grid is because of memories. Global memory and shared memories in each block.

What we want to do is we want to say we need 1M threads for our array sum. Because there is a memory limit for each of these blocks then we know that we want the maximum amount of threads per block and we want the max amount of blocks that come out to 1M threads. How we do that? We can define this as a simple function:

```
numBlock = (N + blockSize - 1)/blockSize
```

Then we can run our function in the way that we want.

In [43]:
%%cu
#include <iostream>
#include <math.h>

// function to add the elements of two arrays
__global__ void add(int n, float *x, float *y)
{
  int index = threadIdx.x;
  int stride = blockDim.x;

  for (int i = index; i < n; i += stride)
      y[i] = x[i] + y[i];
}

int main(void)
{
 
  int N = 1<<20;

 float *x, *y;


 cudaMallocManaged(&x, N*sizeof(float));
 cudaMallocManaged(&y, N*sizeof(float));


  // initialize x and y arrays on the host
  for (int i = 0; i < N; i++) {
    x[i] = 1.0f;
    y[i] = 2.0f;
  }
 
  add<<<1,256>>>(N, x, y); 

 cudaDeviceSynchronize();

  // Check for errors (all values should be 3.0f)
  float maxError = 0.0f;
  for (int i = 0; i < N; i++)
    maxError = fmax(maxError, fabs(y[i]-3.0f));
  std::cout << "Max error: " << maxError << std::endl;
 

  // Free memory
  delete [] x;
  delete [] y;
 

  return 0;
}

Max error: 0
src/tcmalloc.cc:283] Attempt to free invalid pointer 0x7efc5e000000 



In [44]:
!/usr/local/cuda/bin/nvcc -arch=sm_35 -rdc=true 2.cu -o 2 -lcudadevrt

In [45]:
!nvprof ./2

Max error: 0
src/tcmalloc.cc:283] Attempt to free invalid pointer 0x7fdbb6000000 
