<a href="https://colab.research.google.com/github/2003Yash/cuda_basics/blob/main/cuda_kernel_basics.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Install the CUDA C++ plugin for Colab:


In [None]:
!pip install nvcc4jupyter
%load_ext nvcc4jupyter

Collecting nvcc4jupyter
  Downloading nvcc4jupyter-1.2.1-py3-none-any.whl.metadata (5.1 kB)
Downloading nvcc4jupyter-1.2.1-py3-none-any.whl (10 kB)
Installing collected packages: nvcc4jupyter
Successfully installed nvcc4jupyter-1.2.1
Detected platform "Colab". Running its setup...
Source files will be saved in "/tmp/tmp6xeof_iy".


# Detect the selected GPU and its NVIDIA architecture:

In [None]:

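import subprocess

# Ask nvidia-smi for each GPU's name and compute capability (one CSV line per GPU).
status, gpu_info = subprocess.getstatusoutput("nvidia-smi --query-gpu=name,compute_cap --format=csv,noheader,nounits")
# A non-zero exit status means nvidia-smi is missing or cannot reach a GPU.
if status != 0: raise RuntimeError("Error: No GPU found. Please select a GPU runtime environment.")
# Use the first GPU listed, so a multi-GPU runtime does not break the unpacking.
gpu_name, compute_cap = map(str.strip, gpu_info.splitlines()[0].split(','))
# Map a compute capability such as "7.5" to the nvcc architecture name "sm_75".
gpu_arch = f"sm_{compute_cap.replace('.', '')}"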

print(f"{'GPU Name':<15}: {gpu_name}")
print(f"{'Architecture':<15}: {gpu_arch}")

GPU Name       : Tesla T4
Architecture   : sm_75
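
The detection cell above works from one `name, compute_cap` line of `nvidia-smi` output. As a minimal sketch of how the same parsing could be applied to every line when more than one GPU is reported, the helper below converts each line into a name and an `sm_XX` string. The function name `parse_gpu_line` and the two-GPU sample string are assumptions made for illustration, not real output from this runtime.

```python
def parse_gpu_line(line):
    """Split one 'name, compute_cap' CSV line into (name, sm_XX architecture)."""
    # Split on the last comma only, in case a GPU name itself contains a comma.
    name, compute_cap = (field.strip() for field in line.rsplit(",", 1))
    return name, "sm_" + compute_cap.replace(".", "")


# Hypothetical two-GPU output, written by hand to exercise the helper.
sample_output = "Tesla T4, 7.5\nTesla T4, 7.5"

for line in sample_output.splitlines():
    gpu_name, gpu_arch = parse_gpu_line(line)
    print(f"{'GPU Name':<15}: {gpu_name}")
    print(f"{'Architecture':<15}: {gpu_arch}")
```

Each tuple's second element can be passed to `--gpu-architecture` in the same way as `gpu_arch`.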


# CUDA kernel: print each thread's block, thread, and global index

In [None]:
%%cuda -c "--gpu-architecture $gpu_arch"
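#include <stdio.h>

// Each thread prints its block index, its index within the block, and its global index.
__global__ void hello_kernel() {
    int blockId = blockIdx.x;
    int threadId = threadIdx.x;
    // Threads are numbered consecutively across blocks of blockDim.x threads each.
    int globalId = threadId + blockId * blockDim.x;

    printf("Hello from block %d, thread %d (global thread %d)\n", blockId, threadId, globalId);
}

int main() {
    int numBlocks = 2;
    int threadsPerBlock = 4;

    // Launch 2 blocks of 4 threads each (8 threads in total).
    hello_kernel<<<numBlocks, threadsPerBlock>>>();

    // Check the launch itself, then wait for the kernel so its printf output is flushed.
    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess) err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    return 0;
}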

Hello from block 0, thread 0 (global thread 0)
Hello from block 0, thread 1 (global thread 1)
Hello from block 0, thread 2 (global thread 2)
Hello from block 0, thread 3 (global thread 3)
Hello from block 1, thread 0 (global thread 4)
Hello from block 1, thread 1 (global thread 5)
Hello from block 1, thread 2 (global thread 6)
Hello from block 1, thread 3 (global thread 7)
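
The global index in the output above follows `threadIdx.x + blockIdx.x * blockDim.x`. As a minimal CPU-side sketch (plain Python, not GPU code; the function name `global_thread_ids` is hypothetical), the same arithmetic can be reproduced for the launch of 2 blocks with 4 threads each:

```python
def global_thread_ids(num_blocks, threads_per_block):
    """Return (block, thread, global) triples using the kernel's index formula."""
    triples = []
    for block_id in range(num_blocks):
        for thread_id in range(threads_per_block):
            # Same formula as the kernel: threadIdx.x + blockIdx.x * blockDim.x
            global_id = thread_id + block_id * threads_per_block
            triples.append((block_id, thread_id, global_id))
    return triples


for block_id, thread_id, global_id in global_thread_ids(2, 4):
    print(f"block {block_id}, thread {thread_id} -> global thread {global_id}")
```

The loop visits blocks in order, whereas on the GPU blocks are scheduled independently, so lines printed from different blocks may appear in a different order from run to run.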

