# CUDA Vector Addition with PyTorch C++ Extensions

This notebook demonstrates how to write and compile a custom CUDA kernel for vector addition using PyTorch's C++ extension API. It runs in Google Colab on a GPU runtime and needs no external files.

## 1. Check CUDA Availability and Install Dependencies

In [None]:
import torch
import sys

print(f"Python version: {sys.version}")
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"CUDA version: {torch.version.cuda}")
    print(f"GPU device: {torch.cuda.get_device_name(0)}")
else:
    print("WARNING: CUDA is not available. This notebook requires a GPU runtime.")

In [None]:
# Ninja is the build tool PyTorch uses to compile extensions just in time
%pip install ninja

## 2. Write CUDA Kernel Code

We'll create the CUDA kernel file directly in the notebook using the `%%writefile` magic command.

In [None]:
%%writefile addition.cu
#include <torch/extension.h>

#define CHECK_CUDA(x) TORCH_CHECK(x.device().is_cuda(), #x " must be a CUDA tensor")
#define CHECK_CONTIGUOUS(x) TORCH_CHECK(x.is_contiguous(), #x " must be contiguous")
#define CHECK_INPUT(x) \
  CHECK_CUDA(x); \
  CHECK_CONTIGUOUS(x)

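// Each thread adds one element pair; the bounds check covers the partially filled last block.
__global__ void add_kernel(const float *input1, const float *input2, float *output, int size) {
  const int idx = blockIdx.x * blockDim.x + threadIdx.x;
  if (idx < size)
    output[idx] = input1[idx] + input2[idx];
}

torch::Tensor add(torch::Tensor input1, torch::Tensor input2) {
  CHECK_INPUT(input1);
  CHECK_INPUT(input2);
  // The kernel reads raw float pointers, so both inputs must be float32.
  TORCH_CHECK(input1.scalar_type() == torch::kFloat32, "input1 must be a float32 tensor");
  TORCH_CHECK(input2.scalar_type() == torch::kFloat32, "input2 must be a float32 tensor");
  TORCH_CHECK(input1.numel() == input2.numel(), "input1 and input2 must have the same size");
  // The kernel indexes with a 32-bit int, which is ample for this demo.
  const int size = static_cast<int>(input1.numel());
  torch::Tensor output = torch::empty(size, input1.options());
  if (size == 0)
    return output;  // a launch with zero blocks is an invalid configuration

  const int n_threads = 256;
  const int n_blocks = (size + n_threads - 1) / n_threads;  // ceiling division
  add_kernel<<<n_blocks, n_threads>>>(input1.data_ptr<float>(), input2.data_ptr<float>(), output.data_ptr<float>(), size);

  // Launches are asynchronous and return no status, so query the launch error explicitly.
  const cudaError_t err = cudaGetLastError();
  TORCH_CHECK(err == cudaSuccess, "add_kernel launch failed: ", cudaGetErrorString(err));

  return output;
}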

PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {
  m.def("add", &add, "Add two vectors");
}
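
The launch configuration in the host function can be checked without a GPU. The following is a minimal pure-Python sketch, not part of the extension, that mirrors the ceiling division used for `n_blocks` and the kernel's global index computation. `launch_config` and `simulate_add` are hypothetical helper names introduced here for illustration only.

```python
def launch_config(size: int, n_threads: int = 256) -> tuple:
    """Return (n_blocks, n_threads) using the same ceiling division as the host code."""
    n_blocks = (size + n_threads - 1) // n_threads
    return n_blocks, n_threads


def simulate_add(input1: list, input2: list, n_threads: int = 256) -> list:
    """Emulate add_kernel on the CPU by visiting every (block, thread) pair in turn."""
    size = len(input1)
    output = [0.0] * size
    n_blocks, n_threads = launch_config(size, n_threads)
    for block_idx in range(n_blocks):
        for thread_idx in range(n_threads):
            idx = block_idx * n_threads + thread_idx  # global thread index
            if idx < size:  # threads past the end of the vector do nothing
                output[idx] = input1[idx] + input2[idx]
    return output


print(launch_config(1_000_000))  # (3907, 256)
print(simulate_add([1.0, 2.0, 3.0], [10.0, 20.0, 30.0], n_threads=2))  # [11.0, 22.0, 33.0]
```

For one million elements this gives 3907 blocks, or 1,000,192 threads in total, so the last 192 threads fail the `idx < size` test and exit without touching memory.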

## 3. Compile CUDA Extension

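Now we'll compile the CUDA extension just in time with `torch.utils.cpp_extension.load`, which drives `nvcc` through Ninja. The first build may take a few minutes; later runs reuse the cached build unless the source changes.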

In [None]:
from torch.utils.cpp_extension import load

print("Compiling CUDA extension... This may take a few minutes on first run.")
module = load(
    name="vector_add_cuda",
    sources=["addition.cu"],
    extra_cuda_cflags=["-O3"],  # optimization flag passed to nvcc
    # No extra_cflags: they apply only to C++ sources, and PyTorch already
    # sets the C++ standard and ABI flags to match its own build.
    verbose=True,  # print the build commands and compiler output
)
print("\n✓ Compilation successful!")

## 4. Test Vector Addition

Let's create some test vectors and run our custom CUDA kernel.

In [None]:
# Create random input tensors on GPU
size = 1_000_000
input1 = torch.randn(size, device="cuda")
input2 = torch.randn(size, device="cuda")

print(f"Input tensor 1 shape: {input1.shape}")
print(f"Input tensor 2 shape: {input2.shape}")
print(f"Device: {input1.device}")

# Run our custom CUDA kernel
output_custom = module.add(input1, input2)

print(f"\nOutput shape: {output_custom.shape}")
print(f"First 10 elements of result: {output_custom[:10]}")

## 5. Verify Results

Compare our custom CUDA kernel output with PyTorch's built-in addition to verify correctness.

In [None]:
# Compute expected result using PyTorch
output_expected = input1 + input2

# Verify correctness
try:
    torch.testing.assert_close(output_custom, output_expected)
    print("✓ SUCCESS: Custom CUDA kernel produces correct results!")
    print(f"  Maximum difference: {(output_custom - output_expected).abs().max().item():.2e}")
except AssertionError as e:
    print("✗ FAILED: Results don't match!")
    print(e)
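
To make "close" concrete, here is a minimal sketch of the elementwise criterion that `torch.testing.assert_close` is documented to apply, `|actual - expected| <= atol + rtol * |expected|`. The default tolerances below are the values documented for `float32` as far as I know (`rtol=1.3e-6`, `atol=1e-5`); treat them as an assumption and check the documentation for your PyTorch version. `is_close` is a hypothetical helper, not a PyTorch function.

```python
def is_close(actual: float, expected: float, rtol: float = 1.3e-6, atol: float = 1e-5) -> bool:
    """Scalar version of the mixed absolute/relative tolerance check."""
    return abs(actual - expected) <= atol + rtol * abs(expected)


# The same absolute error of about 1e-3 passes at magnitude 1000 but fails at magnitude 1,
# because the relative term scales the allowed error with the expected value.
print(is_close(1000.001, 1000.0))
print(is_close(1.001, 1.0))
```

Both the custom kernel and the built-in operator perform a single `float32` addition per element, so the reported maximum difference is normally exactly zero and the tolerances only act as a safety margin.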

## 6. Benchmark Performance (Optional)

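Let's compare the performance of our custom kernel with PyTorch's built-in addition. Kernel launches are asynchronous, so the timings use CUDA events recorded on the GPU rather than wall-clock time on the host.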

In [None]:
# Warmup
for _ in range(10):
    _ = module.add(input1, input2)
    _ = input1 + input2
torch.cuda.synchronize()

# Benchmark custom kernel using CUDA events
n_runs = 1000

start_event = torch.cuda.Event(enable_timing=True)
end_event = torch.cuda.Event(enable_timing=True)

start_event.record()
for _ in range(n_runs):
    _ = module.add(input1, input2)
end_event.record()
torch.cuda.synchronize()
custom_time = start_event.elapsed_time(end_event) / n_runs  # milliseconds

# Benchmark PyTorch
start_event.record()
for _ in range(n_runs):
    _ = input1 + input2
end_event.record()
torch.cuda.synchronize()
pytorch_time = start_event.elapsed_time(end_event) / n_runs  # milliseconds

print(f"Vector size: {size:,} elements")
print(f"Custom CUDA kernel: {custom_time:.4f} ms")
print(f"PyTorch built-in:   {pytorch_time:.4f} ms")
print(f"Speedup: {pytorch_time/custom_time:.2f}x {'(custom is faster)' if custom_time < pytorch_time else '(PyTorch is faster)'}")