<a href="https://colab.research.google.com/github/Dyfox100/CUDA-Tutorials/blob/main/Basic_Operations_CUDA.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Check that the CUDA compiler is installed and see which version it is. In Jupyter notebooks, the ! (bang) prefix runs a shell command.

In [9]:
!nvcc --version

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Sun_Jul_28_19:07:16_PDT_2019
Cuda compilation tools, release 10.1, V10.1.243


Install the nvcc4jupyter package, which provides the nvcc_plugin extension needed to run CUDA C/C++ from notebook cells.

In [10]:
!pip install git+git://github.com/andreinechaev/nvcc4jupyter.git

Collecting git+git://github.com/andreinechaev/nvcc4jupyter.git
  Cloning git://github.com/andreinechaev/nvcc4jupyter.git to /tmp/pip-req-build-9wagm3n8
  Running command git clone -q git://github.com/andreinechaev/nvcc4jupyter.git /tmp/pip-req-build-9wagm3n8
Building wheels for collected packages: NVCCPlugin
  Building wheel for NVCCPlugin (setup.py) ... done
  Created wheel for NVCCPlugin: filename=NVCCPlugin-0.0.2-cp36-none-any.whl size=4308 sha256=4130c8f522fb13eefc727d13f218fc889ec443571d17b8c38f8440628d5bb2f4
  Stored in directory: /tmp/pip-ephem-wheel-cache-n38gk02k/wheels/10/c2/05/ca241da37bff77d60d31a9174f988109c61ba989e4d4650516
Successfully built NVCCPlugin
Installing collected packages: NVCCPlugin
Successfully installed NVCCPlugin-0.0.2


Load the extension into the running Jupyter session.

In [11]:
%load_ext nvcc_plugin

created output directory at /content/src
Out bin /content/result.out


A simple program to confirm that the CUDA C/C++ extension works. Nothing here runs on the GPU, but if the extension isn't loaded, the cell will not be compiled as C/C++ and will fail with an error instead of printing the message.

In [12]:
%%cu
#include <stdio.h>

/* Just to check that the extension is working. None of this runs on the GPU. */
int main() {
    printf("If this prints, the CUDA extension works!\n");
    return 0;
}


If this prints, the CUDA extension works!
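
The check above only confirms that the extension compiles and runs host code. A minimal sketch of a follow-up check, assuming the notebook is attached to a GPU runtime, asks the CUDA runtime how many devices it can see and prints their names. In this notebook it would go in a cell that starts with %%cu.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

int main() {
    int device_count = 0;

    // Ask the CUDA runtime how many CUDA-capable devices are visible.
    cudaError_t err = cudaGetDeviceCount(&device_count);
    if (err != cudaSuccess) {
        printf("cudaGetDeviceCount failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("Found %d CUDA device(s).\n", device_count);

    // Print the name of each device.
    for (int i = 0; i < device_count; ++i) {
        cudaDeviceProp prop;
        err = cudaGetDeviceProperties(&prop, i);
        if (err != cudaSuccess) {
            printf("cudaGetDeviceProperties failed: %s\n", cudaGetErrorString(err));
            return 1;
        }
        printf("Device %d: %s\n", i, prop.name);
    }
    return 0;
}
```

If no GPU is attached, the first call is expected to return an error rather than a count of zero, so the error branch is the one to watch for.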



A basic example: copy two integers to the GPU, add them in a kernel, and copy the result back to the host.

In [17]:
%%cu
#include <stdio.h>
#include <stdlib.h>


// Simple gpu function to add two variables.
__global__ void add(int *a, int *b, int *r){
    *r = *a + *b;
}

// Main function to run the gpu code.
int main() { 
    int a, b, r;

    // gpu copies
    int *g_a, *g_b, *g_r;

    // Allocates space on gpu for the three ints.
    // Puts pointers to this space in the variables g_a, g_b, g_r.
    cudaMalloc((void **)&g_a, sizeof(int));
    cudaMalloc((void **)&g_b, sizeof(int));
    cudaMalloc((void **)&g_r, sizeof(int));

    a = 2;
    b = 5;

    // Copy variables to gpu.
    cudaMemcpy(g_a, &a, sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(g_b, &b, sizeof(int), cudaMemcpyHostToDevice);

    // Launch kernel on gpu.
    add<<<1,1>>>(g_a, g_b, g_r);

    // Copy the result back to the host and check for errors in copy.
    cudaError err = cudaMemcpy(&r, g_r, sizeof(int), cudaMemcpyDeviceToHost);
    if(err!=cudaSuccess) {
        printf("Error copying to Host: %s\n", cudaGetErrorString(err));
    }

    printf("Adding %d with %d on the gpu yields %d\n",a, b, r);

    // Need to free memory on gpu.
    cudaFree(g_a);
    cudaFree(g_b);
    cudaFree(g_r);

    return 0;

}

Adding 2 with 5 on the gpu yields 7
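
The cell above checks only the final copy for errors. One possible way to check every runtime call, including the kernel launch, is sketched below. The CUDA_CHECK macro is a hypothetical helper written for this example, not part of CUDA; cudaGetLastError, cudaDeviceSynchronize, and cudaGetErrorString are standard runtime calls.

```cuda
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

// Hypothetical helper (not part of CUDA): print the error and exit if a
// runtime call does not return cudaSuccess.
#define CUDA_CHECK(call)                                             \
    do {                                                             \
        cudaError_t err_ = (call);                                   \
        if (err_ != cudaSuccess) {                                   \
            fprintf(stderr, "CUDA error at %s:%d: %s\n",             \
                    __FILE__, __LINE__, cudaGetErrorString(err_));   \
            exit(EXIT_FAILURE);                                      \
        }                                                            \
    } while (0)

// Same kernel as above: adds two ints that live in GPU memory.
__global__ void add(const int *a, const int *b, int *r) {
    *r = *a + *b;
}

int main() {
    int a = 2, b = 5, r = 0;
    int *g_a, *g_b, *g_r;

    CUDA_CHECK(cudaMalloc((void **)&g_a, sizeof(int)));
    CUDA_CHECK(cudaMalloc((void **)&g_b, sizeof(int)));
    CUDA_CHECK(cudaMalloc((void **)&g_r, sizeof(int)));

    CUDA_CHECK(cudaMemcpy(g_a, &a, sizeof(int), cudaMemcpyHostToDevice));
    CUDA_CHECK(cudaMemcpy(g_b, &b, sizeof(int), cudaMemcpyHostToDevice));

    add<<<1, 1>>>(g_a, g_b, g_r);
    // A kernel launch returns no status itself. cudaGetLastError reports
    // launch failures such as an invalid configuration.
    CUDA_CHECK(cudaGetLastError());
    // Errors that occur while the kernel runs surface at the next
    // synchronizing call.
    CUDA_CHECK(cudaDeviceSynchronize());

    CUDA_CHECK(cudaMemcpy(&r, g_r, sizeof(int), cudaMemcpyDeviceToHost));
    printf("Adding %d with %d on the gpu yields %d\n", a, b, r);

    CUDA_CHECK(cudaFree(g_a));
    CUDA_CHECK(cudaFree(g_b));
    CUDA_CHECK(cudaFree(g_r));
    return 0;
}
```

Exiting on the first error keeps the sketch short. A larger program might prefer to return the error code to the caller and free any memory already allocated.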



In [24]:
%%cu
#include <stdio.h>
#include <stdlib.h>

__global__ void hello_cuda() {
    printf("Hello from CUDA!\n");
}

int main() {
    // Kernel launch parameters: the first is the number of blocks, the second
    // is the number of threads per block.
    // Should print 6 times: 3 threads per block on 2 blocks.
    // The dim3 type allows up to 3D layouts of threads and blocks.
    // On current GPUs a block can have at most 1024 threads in x and y and
    // 64 in z, and x * y * z must be at most 1024.
    // A grid can have at most 65535 blocks in y and z and 2^31 - 1 in x.
    dim3 block(3, 1, 1);
    dim3 grid(2, 1, 1);
    hello_cuda <<<grid, block>>>();
    // Waits until kernel completes. Necessary because main function will finish
    // before the kernel prints otherwise.
    cudaDeviceSynchronize();
    return 0;
}

Hello from CUDA!
Hello from CUDA!
Hello from CUDA!
Hello from CUDA!
Hello from CUDA!
Hello from CUDA!
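
The block and grid limits mentioned in the comments depend on the GPU. Rather than relying on fixed numbers, a minimal sketch like the following can query them at run time. It assumes device 0 is the GPU of interest and reads the maxThreadsPerBlock, maxThreadsDim, and maxGridSize fields of cudaDeviceProp.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;

    // Query the properties of device 0 (assumed to be the GPU in use).
    cudaError_t err = cudaGetDeviceProperties(&prop, 0);
    if (err != cudaSuccess) {
        printf("cudaGetDeviceProperties failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // Upper bound on x * y * z threads in a single block.
    printf("Max threads per block: %d\n", prop.maxThreadsPerBlock);
    // Upper bound on each block dimension.
    printf("Max block dimensions: (%d, %d, %d)\n",
           prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);
    // Upper bound on each grid dimension, in blocks.
    printf("Max grid dimensions: (%d, %d, %d)\n",
           prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);
    return 0;
}
```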



Grid and block

Grid -- The collection of all threads launched by a kernel.

Block -- A group of threads within the grid. The grid is organized into blocks, and every block has the same number of threads.

The built-in variables threadIdx, blockIdx, blockDim, and gridDim provide information about the grid, blocks, and threads.

Note that each thread runs independently, so the output is intermingled.
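
These built-in variables are commonly combined into a single global index that is unique for each thread in the grid. The sketch below shows the usual one-dimensional form, blockIdx.x * blockDim.x + threadIdx.x. The kernel name print_global_index and the launch sizes are made up for illustration.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

// Each thread computes its position in the whole grid from its block index,
// the block size, and its index within the block (1D case only).
__global__ void print_global_index() {
    int global_idx = (int)(blockIdx.x * blockDim.x + threadIdx.x);
    printf("block %d, thread %d -> global index %d\n",
           (int)blockIdx.x, (int)threadIdx.x, global_idx);
}

int main() {
    dim3 block(4, 1, 1);  // 4 threads per block
    dim3 grid(2, 1, 1);   // 2 blocks, so 8 threads in total
    print_global_index<<<grid, block>>>();

    // Wait for the kernel so its output is flushed before main returns.
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        printf("Kernel failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    return 0;
}
```

With this launch the global indices cover 0 through 7, one per thread, though the order of the printed lines may vary between runs.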

In [31]:
%%cu
#include <stdio.h>
#include <stdlib.h>

__global__ void print_thread_id() {
    // Kernels have access to threadIdx, a struct that identifies the thread within its block.
    printf("Thread ID is: (%d, %d, %d)\n", threadIdx.x, threadIdx.y, threadIdx.z);
    // They also have access to blockIdx, which identifies the block within the grid.
    printf("Block ID is: (%d, %d, %d)\n", blockIdx.x, blockIdx.y, blockIdx.z);
    // blockDim holds the number of threads in each block. It is the same for all
    // blocks and threads.
    printf("Each block has %d by %d by %d threads.\n",
           blockDim.x, blockDim.y, blockDim.z);
    // gridDim gives the dimensions of the grid (in number of blocks).
    printf("The grid has %d by %d by %d blocks.\n",
           gridDim.x, gridDim.y, gridDim.z);
    // We can use this info to get the total number of threads.
    printf("The total number of threads is: %d.\n",
           (blockDim.x * blockDim.y * blockDim.z) * (gridDim.x * gridDim.y * gridDim.z));
}

int main() {
    dim3 block(2, 1, 1);
    dim3 grid(2, 2, 1);
    print_thread_id <<<grid, block>>>();
    // Waits until kernel completes. Necessary because main function will finish
    // before the kernel prints otherwise.
    cudaDeviceSynchronize();
    return 0;
}

Thread ID is: (0, 0, 0)
Thread ID is: (1, 0, 0)
Thread ID is: (0, 0, 0)
Thread ID is: (1, 0, 0)
Thread ID is: (0, 0, 0)
Thread ID is: (1, 0, 0)
Thread ID is: (0, 0, 0)
Thread ID is: (1, 0, 0)
Block ID is: (0, 1, 0)
Block ID is: (0, 1, 0)
Block ID is: (1, 1, 0)
Block ID is: (1, 1, 0)
Block ID is: (1, 0, 0)
Block ID is: (1, 0, 0)
Block ID is: (0, 0, 0)
Block ID is: (0, 0, 0)
Each block has 2 by 1 by 1 threads.
Each block has 2 by 1 by 1 threads.
Each block has 2 by 1 by 1 threads.
Each block has 2 by 1 by 1 threads.
Each block has 2 by 1 by 1 threads.
Each block has 2 by 1 by 1 threads.
Each block has 2 by 1 by 1 threads.
Each block has 2 by 1 by 1 threads.
The grid has 2 by 2 by 1 blocks.
The grid has 2 by 2 by 1 blocks.
The grid has 2 by 2 by 1 blocks.
The grid has 2 by 2 by 1 blocks.
The grid has 2 by 2 by 1 blocks.
The grid has 2 by 2 by 1 blocks.
The grid has 2 by 2 by 1 blocks.
The grid has 2 by 2 by 1 blocks.
The total number of threads is: 8.
The total number of threads is: 8.
The total 