# Assignment 2 - CUDA Programming

**Name:** [Your Name Here]  
**Roll No:** [Your Roll No]  

---

## Q1. Identify !, %, and %% in Colab

These prefixes come from IPython, which Colab notebooks run on:
- `!` = run a shell (terminal) command
- `%` = line magic: applies to the rest of that single line
- `%%` = cell magic: applies to the entire cell and must be the cell's first line
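
As a rough illustration of what `!` and `%time` do, the sketch below reproduces them in plain Python. It is an approximation under assumptions, not how IPython implements them; `run_shell` and `timed` are hypothetical helper names introduced only for this example.

```python
import subprocess
import time

def run_shell(command):
    """Run a command in a subshell and return its stdout, roughly like `!command`."""
    completed = subprocess.run(command, shell=True, capture_output=True, text=True)
    return completed.stdout

def timed(func):
    """Call func() and return (result, elapsed_seconds), roughly like `%time`."""
    start = time.perf_counter()
    result = func()
    return result, time.perf_counter() - start

print(run_shell("echo hello"))
total, seconds = timed(lambda: sum(range(100000)))
print(f"sum = {total}, took {seconds:.4f} s")
```

The magics remain more convenient inside a notebook; the plain-Python form is mainly useful when the same logic has to run as an ordinary script.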

In [None]:
# Example 1: Shell commands with !
!pwd
!ls
!nvcc --version

In [None]:
# Example 2: Line magic with %
%time sum([i for i in range(100000)])

In [None]:
%%time
# Example 3: Cell magic with %%
# this times the entire cell execution
result = 0
for i in range(1000000):
    result += i
print(f"Sum = {result}")

## Q2. nvidia-smi Commands

`nvidia-smi` reports GPU status (model, driver and CUDA versions, memory, utilization, running processes). Here are some useful commands:

In [None]:
# basic command
!nvidia-smi

In [None]:
# list GPUs
!nvidia-smi -L

In [None]:
# specific info in CSV format
!nvidia-smi --query-gpu=name,memory.total,memory.free,memory.used --format=csv

In [None]:
# temperature and power usage
!nvidia-smi --query-gpu=temperature.gpu,power.draw --format=csv

Other useful options:
- `nvidia-smi -l 2` (refreshes the output every 2 seconds until interrupted)
- `nvidia-smi pmon` (per-process monitoring)
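
The `--format=csv` output is convenient to process in Python. The sketch below parses such text into dictionaries. The `sample` string is a made-up illustration of the CSV layout, not output captured from a real GPU, and `parse_gpu_csv` is a hypothetical helper name.

```python
import csv
import io

def parse_gpu_csv(text):
    """Parse `nvidia-smi --query-gpu=... --format=csv` style text into a list of dicts."""
    # skipinitialspace drops the blank that follows each comma
    reader = csv.DictReader(io.StringIO(text), skipinitialspace=True)
    return [dict(row) for row in reader]

# Illustrative sample only (assumed layout, invented values)
sample = """name, memory.total [MiB], memory.used [MiB]
Tesla T4, 15360 MiB, 0 MiB
"""

rows = parse_gpu_csv(sample)
for row in rows:
    print(row["name"], "|", row["memory.total [MiB]"])
```

In a notebook, the real text could be obtained by running the command through `subprocess.run(..., capture_output=True, text=True)` and passing its `stdout` to the parser.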

## Q3. Debug Common CUDA Errors

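Three common problems:
1. Missing output (the program exits before the kernel's `printf` buffer is flushed, because `cudaDeviceSynchronize()` was never called)
2. Incorrect indexing (threads read or write past the end of the array)
3. PTX errors (for example, the binary was compiled for an architecture or toolkit version that the GPU or driver does not support)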

In [None]:
%%writefile debug.cu
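#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

// Abort with file/line and a readable message if a CUDA runtime call fails
#define CHECK(call) \
do { \
    cudaError_t err = (call); \
    if (err != cudaSuccess) { \
        fprintf(stderr, "CUDA error at %s:%d: %s\n", __FILE__, __LINE__, cudaGetErrorString(err)); \
        exit(1); \
    } \
} while (0)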

// Error 1: no sync - output might not show
__global__ void test1() {
    printf("Hello from GPU\n");
}

// Error 2: bad indexing (shown for comparison only; main() does not launch it)
__global__ void bad_kernel(int *arr, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    arr[i] = i;  // out of bounds whenever more than n threads are launched
}

// Fixed version
__global__ void good_kernel(int *arr, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if(i < n) {  // check bounds
        arr[i] = i;
    }
}

int main() {
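    // Test 1 - no sync: the launch is asynchronous, so the GPU printf
    // output stays buffered until a later synchronization point
    printf("Test 1 - no sync:\n");
    test1<<<1,1>>>();
    // missing: cudaDeviceSynchronize();

    // Test 2 - with sync: this flushes the output of both launches
    printf("\nTest 2 - with sync:\n");
    test1<<<1,1>>>();
    cudaDeviceSynchronize();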

    // Test 3 - array bounds
    int *d_arr;
    int n = 10;
    CHECK(cudaMalloc(&d_arr, n*sizeof(int)));

    good_kernel<<<1,20>>>(d_arr, n);  // 20 threads but only 10 elements
    CHECK(cudaGetLastError());        // catches launch errors (e.g. a bad configuration)
    CHECK(cudaDeviceSynchronize());   // catches errors raised while the kernel ran

    printf("\nAll tests passed!\n");

    cudaFree(d_arr);
    return 0;
}

In [None]:
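# If the kernel fails with an architecture/PTX error, name the GPU's compute
# capability explicitly, e.g. -arch=sm_75 for a Tesla T4
!nvcc debug.cu -o debug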
!./debug

**Common fixes:**
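- Call `cudaDeviceSynchronize()` (or another blocking call such as `cudaMemcpy()`) before the host relies on kernel results or exits
- Guard array accesses with `if(idx < n)`
- Call `cudaGetLastError()` right after a launch to catch launch errors, and check the return value of `cudaDeviceSynchronize()` for errors raised during execution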
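
To see why the bounds guard matters, the launch can be modelled sequentially in Python. This is only a sketch under assumptions: real GPU threads run in parallel, and `blocks_needed` and `simulate_guarded_launch` are hypothetical names, not CUDA APIs. It mirrors the `good_kernel<<<1,20>>>(d_arr, 10)` launch from `debug.cu`.

```python
def blocks_needed(n, threads_per_block):
    """Ceiling division: the smallest block count that covers n elements."""
    return (n + threads_per_block - 1) // threads_per_block

def simulate_guarded_launch(n, threads_per_block):
    """Model a guarded kernel: each simulated thread writes arr[i] = i only if i < n."""
    arr = [None] * n
    skipped = 0
    for block in range(blocks_needed(n, threads_per_block)):
        for thread in range(threads_per_block):
            i = block * threads_per_block + thread
            if i < n:          # the same guard as in good_kernel
                arr[i] = i
            else:
                skipped += 1   # without the guard, this thread would write out of bounds
    return arr, skipped

arr, skipped = simulate_guarded_launch(n=10, threads_per_block=20)
print(arr)
print("threads that did nothing:", skipped)
```

Because the thread count is rounded up to a whole number of blocks, some threads usually fall past the end of the array; the guard makes them exit without touching memory.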

## Q4. GPU Kernel with Thread Indexing

Task: Launch a kernel with 1 block of 8 threads. Each thread prints its global ID.

Formula: `global_id = blockIdx.x * blockDim.x + threadIdx.x`
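
The formula can be checked by enumerating every (block, thread) pair in plain Python. This is a sequential sketch of the arithmetic only; `global_ids` is a hypothetical helper, and on a real GPU the threads run in parallel.

```python
def global_ids(num_blocks, threads_per_block):
    """Apply global_id = blockIdx.x * blockDim.x + threadIdx.x to every thread."""
    return [
        block * threads_per_block + thread
        for block in range(num_blocks)
        for thread in range(threads_per_block)
    ]

print(global_ids(1, 8))  # the configuration used in this question
print(global_ids(2, 4))  # same 8 threads, split across 2 blocks
```

Both configurations produce the IDs 0 through 7, which shows that the global ID depends on the total number of threads rather than on how they are split into blocks.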

In [None]:
%%writefile hello_gpu.cu
#include <stdio.h>

// Device code (runs on GPU)
__global__ void hello() {
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    printf("Hello from GPU thread %d\n", id);
}

// Host code (runs on CPU)
int main() {
    printf("Launching kernel with 1 block, 8 threads\n\n");

    hello<<<1, 8>>>();  // 1 block, 8 threads

    cudaDeviceSynchronize();  // wait for the kernel and flush its printf output

    printf("\nDone!\n");
    return 0;
}

In [None]:
!nvcc hello_gpu.cu -o hello_gpu
!./hello_gpu

**Explanation:**
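- `__global__` = marks a kernel: it runs on the GPU and is launched from host code
- `<<<blocks, threads>>>` = kernel launch syntax (number of blocks, threads per block)
- `blockIdx.x` = index of the block within the grid (always 0 here)
- `blockDim.x` = number of threads per block (8 here)
- `threadIdx.x` = index of the thread within its block (0-7)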

## Q5. Host and Device Memory

Demonstrate memory management between the CPU (host) and the GPU (device).

In [None]:
%%writefile memory_demo.cu
#include <stdio.h>
#include <cuda_runtime.h>

__global__ void printArray(int *arr, int n) {
    int i = threadIdx.x;  // single-block launch, so threadIdx.x is already the global index
    if(i < n) {
        printf("GPU: arr[%d] = %d\n", i, arr[i]);
        arr[i] = arr[i] * 10;  // modify on GPU
    }
}

int main() {
    int n = 5;
    size_t size = n * sizeof(int);  // byte count; size_t matches the cudaMalloc/cudaMemcpy parameters

    // Step 1: Create array on host (CPU)
    int h_arr[5] = {10, 20, 30, 40, 50};

    printf("Original array on CPU:\n");
    for(int i=0; i<n; i++) {
        printf("%d ", h_arr[i]);
    }
    printf("\n\n");

    // Step 2: Allocate memory on device (GPU)
    int *d_arr;
    cudaMalloc(&d_arr, size);

    // Step 3: Copy from host to device
    cudaMemcpy(d_arr, h_arr, size, cudaMemcpyHostToDevice);

    // Step 4: Run kernel on GPU
    printf("Running GPU kernel...\n");
    printArray<<<1, n>>>(d_arr, n);
    cudaDeviceSynchronize();

    // Step 5: Copy back from device to host
    cudaMemcpy(h_arr, d_arr, size, cudaMemcpyDeviceToHost);

    printf("\nModified array on CPU:\n");
    for(int i=0; i<n; i++) {
        printf("%d ", h_arr[i]);
    }
    printf("\n");

    // Step 6: Free GPU memory
    cudaFree(d_arr);

    return 0;
}

In [None]:
!nvcc memory_demo.cu -o memory_demo
!./memory_demo

**Key functions:**
- `cudaMalloc()` - allocate memory on GPU
- `cudaMemcpy()` - copy between CPU and GPU
- `cudaFree()` - free GPU memory

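**Important:** With plain `cudaMalloc()` allocations like these, host code can't dereference the device pointer, and a kernel can't dereference an ordinary host pointer. Data has to be moved explicitly with `cudaMemcpy()`.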

## Q6. Compare List/Tuple vs NumPy Performance

Testing which container is faster for numerical operations.

In [None]:
import numpy as np
import time

sizes = [10000, 100000, 1000000]

print("Size\t\tList(ms)\tTuple(ms)\tNumPy(ms)\tSpeedup")
print("="*70)

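for n in sizes:
    # Build the inputs first so that only the computation is timed
    list_data = list(range(n))
    tuple_data = tuple(range(n))
    numpy_data = np.arange(n)

    # Test with list (perf_counter is a monotonic, high-resolution timer)
    start = time.perf_counter()
    result = sum([x*2 for x in list_data])
    list_time = (time.perf_counter() - start) * 1000

    # Test with tuple
    start = time.perf_counter()
    result = sum([x*2 for x in tuple_data])
    tuple_time = (time.perf_counter() - start) * 1000

    # Test with numpy
    start = time.perf_counter()
    result = np.sum(numpy_data * 2)
    numpy_time = (time.perf_counter() - start) * 1000

    # How many times faster NumPy is than the list version
    speedup = list_time / numpy_time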

    print(f"{n}\t\t{list_time:.2f}\t\t{tuple_time:.2f}\t\t{numpy_time:.2f}\t\t{speedup:.1f}x")

**Observation:**
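- NumPy is much faster (roughly 10-30x in this test; the exact factor depends on array size and hardware)
- It runs the loop in optimized, compiled C code instead of the Python interpreter
- Better suited to numerical operations on large arrays
- Lists/tuples are more flexible (they can hold mixed types) but slower for bulk arithmetic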
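
A single timed run can be noisy, so one option is to repeat each measurement and keep the fastest run. The sketch below uses the standard-library `timeit` module for that; `best_ms` and the two workload functions are hypothetical names, and only the list and tuple cases are shown to keep the example dependency-free.

```python
import timeit

def best_ms(func, repeats=5):
    """Return the fastest of `repeats` single runs of func, in milliseconds."""
    return min(timeit.repeat(func, number=1, repeat=repeats)) * 1000

n = 100_000
as_list = list(range(n))
as_tuple = tuple(range(n))

def double_sum_list():
    return sum(x * 2 for x in as_list)

def double_sum_tuple():
    return sum(x * 2 for x in as_tuple)

# Sanity check: the sum of 2*x for x in 0..n-1 equals n*(n-1)
expected = n * (n - 1)
print("results match:", double_sum_list() == expected == double_sum_tuple())

print(f"list:  {best_ms(double_sum_list):.2f} ms")
print(f"tuple: {best_ms(double_sum_tuple):.2f} ms")
```

A NumPy workload such as `lambda: np.sum(arr * 2)` could be passed to `best_ms` in the same way, assuming `arr` was created beforehand with `np.arange(n)`.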

In [None]:
# quick visualization
import matplotlib.pyplot as plt

methods = ['List', 'Tuple', 'NumPy']
# Measured in the benchmark cell above; after the loop these hold the N = 1,000,000 timings
times = [list_time, tuple_time, numpy_time]

plt.bar(methods, times)
plt.ylabel('Time (ms)')
plt.title('Performance Comparison (N=1M)')
plt.show()