# 1. Identify !, %, and %% as used in cells in Google Colab.

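! → Shell command

Runs a terminal command from inside a notebook cell. Each ! line runs in its own subshell, so a change such as !cd does not carry over to later lines; use %cd for that.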


In [1]:
!ls
!pwd

hello  helloworld.cu  myfile  myfile.cu  sample_data
/content


% → Line magic: runs a special notebook (IPython) command that applies to a single line.


In [2]:
%timeit sum(range(1000))
%pwd


19.5 µs ± 4.4 µs per loop (mean ± std. dev. of 7 runs, 100000 loops each)


'/content'

%% → Cell magic: applies to the entire cell and must be the first line of the cell.

In [3]:
%%time
for i in range(1000000):
    pass

CPU times: user 39.1 ms, sys: 19 µs, total: 39.1 ms
Wall time: 39 ms
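
Outside a notebook, the two timing magics can be approximated with the standard library. The sketch below is only a rough analogue, not how IPython implements them: `timeit.repeat` stands in for `%timeit`, and `time.perf_counter` with `time.process_time` stands in for `%%time`. The names `loops`, `runs`, and `per_loop` are illustrative.

```python
import statistics
import time
import timeit

# Rough analogue of `%timeit sum(range(1000))`: 7 runs of `loops` iterations each.
loops = 1_000
runs = timeit.repeat("sum(range(1000))", number=loops, repeat=7)
per_loop = [total / loops for total in runs]  # seconds per loop, one value per run
mean_us = statistics.mean(per_loop) * 1e6
std_us = statistics.stdev(per_loop) * 1e6
print(f"{mean_us:.2f} us +/- {std_us:.2f} us per loop (7 runs, {loops} loops each)")

# Rough analogue of `%%time`: wall-clock and CPU time around one block of code.
wall_start = time.perf_counter()
cpu_start = time.process_time()
for i in range(1_000_000):
    pass
wall = time.perf_counter() - wall_start
cpu = time.process_time() - cpu_start
print(f"CPU time: {cpu * 1e3:.1f} ms, wall time: {wall * 1e3:.1f} ms")
```

The numbers will differ from the notebook output above, since they depend on the machine and its current load.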


# 2. Identify all key nvidia-smi commands with multiple options

In [4]:
!nvidia-smi

Sat Feb 14 15:47:37 2026       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.82.07              Driver Version: 580.82.07      CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  Tesla T4                       Off |   00000000:00:04.0 Off |                    0 |
| N/A   31C    P8              9W /   70W |       0MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+


| Command | Purpose |
|---|---|
| `nvidia-smi -L` | List all GPUs |
| `nvidia-smi -q` | Detailed GPU info |
| `nvidia-smi -q -d MEMORY` | Memory details |
| `nvidia-smi --help` | Show help |
| `nvidia-smi -l 1` | Refresh every 1 second |
| `nvidia-smi --query-gpu=name,memory.total,memory.used --format=csv` | Custom query |
| `nvidia-smi topo -m` | GPU topology |
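
The custom query is the easiest of these to consume from code, because `--format=csv` produces a header row followed by one row per GPU. The following is a minimal sketch of reading that output in Python. The helper names `parse_gpu_csv` and `query_gpus` are hypothetical, and the `sample` text is an illustrative stand-in built from the Tesla T4 values shown above, not captured output.

```python
import csv
import io
import shutil
import subprocess

QUERY = "name,memory.total,memory.used"

def parse_gpu_csv(text):
    """Turn nvidia-smi CSV text into a list of dicts keyed by the header row."""
    reader = csv.DictReader(io.StringIO(text), skipinitialspace=True)
    return [dict(row) for row in reader]

def query_gpus():
    """Run the custom query if nvidia-smi is installed; return None otherwise."""
    if shutil.which("nvidia-smi") is None:
        return None
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv"],
        capture_output=True, text=True, check=True,
    )
    return parse_gpu_csv(result.stdout)

# Illustrative sample (assumed layout of the CSV output, not a real capture).
sample = (
    "name, memory.total [MiB], memory.used [MiB]\n"
    "Tesla T4, 15360 MiB, 0 MiB\n"
)
gpus = parse_gpu_csv(sample)
print(gpus[0]["name"], "-", gpus[0]["memory.total [MiB]"])
```

On a machine with an NVIDIA driver, calling `query_gpus()` would return the live values in the same shape; on a machine without one it returns `None`.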

# 3. Debug common CUDA errors (zero output, incorrect indexing, PTX errors)

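1. Zero Output

Causes:

Forgot to copy the results back to the host with cudaMemcpy

Wrong kernel launch configuration (for example, too few blocks or threads to cover the data)

Kernel function not declared with __global__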

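2. Incorrect Indexing

Wrong (unique only within one block, so the same values repeat in every block):

int id = threadIdx.x;

Correct (unique across the whole grid):

int id = blockIdx.x * blockDim.x + threadIdx.x;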

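3. PTX Errors

Causes:

CUDA version mismatch: the toolkit used by nvcc is newer than the installed driver supports

Wrong architecture: the -arch flag passed to nvcc does not match the GPU (the Tesla T4 used here is compute capability 7.5, i.e. sm_75)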
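
The indexing pitfall from item 2 can be seen without a GPU. The sketch below simulates a 1D launch in plain Python and compares the two formulas; `thread_ids`, `grid_dim`, and `block_dim` are illustrative names, and the loop order is only a model, since real GPU threads run in parallel.

```python
def thread_ids(grid_dim, block_dim):
    """Enumerate (local_id, global_id) for every thread of a simulated 1D launch."""
    pairs = []
    for block_idx in range(grid_dim):
        for thread_idx in range(block_dim):
            local_id = thread_idx                            # int id = threadIdx.x;
            global_id = block_idx * block_dim + thread_idx   # full global index
            pairs.append((local_id, global_id))
    return pairs

# Simulate a launch with 2 blocks of 4 threads, i.e. kernel<<<2, 4>>>.
pairs = thread_ids(grid_dim=2, block_dim=4)
local_ids = [local for local, _ in pairs]
global_ids = [glob for _, glob in pairs]

print("threadIdx.x only:", local_ids)   # [0, 1, 2, 3, 0, 1, 2, 3]
print("global index:    ", global_ids)  # [0, 1, 2, 3, 4, 5, 6, 7]
```

With `threadIdx.x` alone, both blocks touch elements 0 to 3 and elements 4 to 7 are never processed, which is one way an array ends up with partly zero or stale output.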

# 4. Write a CUDA C/C++ program to demonstrate GPU kernel execution and thread indexing.
a. Launch a CUDA kernel using: 1 block and 8 threads

b. Each thread must print: Hello from GPU thread <global_thread_id>

c. Compute the global thread ID using: global_thread_id = blockIdx.x * blockDim.x + threadIdx.x

d. Clearly separate: Host code (CPU) & Device code (GPU kernel)

In [5]:
%%writefile helloworld.cu

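#include <stdio.h>

// ---------- Device code (GPU kernel) ----------
__global__ void hello() {
  // Global thread ID: block offset plus the thread's position within its block.
  int global_thread_id = blockIdx.x * blockDim.x + threadIdx.x;
  printf("Hello from GPU thread %d\n", global_thread_id);
}

// ---------- Host code (CPU) ----------
int main() {
  printf("This is CPU launching kernel\n");
  hello<<<1, 8>>>();        // launch 1 block of 8 threads
  cudaDeviceSynchronize();  // wait for the kernel so its printf output is flushed
  printf("This is CPU again, kernel execution finished.\n");
  return 0;
}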

Overwriting helloworld.cu


In [6]:
!nvcc -arch=sm_75 helloworld.cu -o hello

In [7]:
! ./hello

This is CPU launching kernel
Hello from GPU thread 0
Hello from GPU thread 1
Hello from GPU thread 2
Hello from GPU thread 3
Hello from GPU thread 4
Hello from GPU thread 5
Hello from GPU thread 6
Hello from GPU thread 7
This is CPU again, kernel execution finished.


# 5. Write a CUDA program to demonstrate host and device memory separation.
a. Create an integer array of size 5 on the host (CPU).

b. Allocate corresponding memory on the device (GPU) using cudaMalloc().

c. Copy data from host to device using cudaMemcpy().

d. Launch a kernel where GPU threads print values from device memory.

e. Copy the data back from device to host and print it on CPU.

In [8]:
%%writefile myfile.cu
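#include <stdio.h>

// ---------- Device code (GPU kernel) ----------
__global__ void printarray(int *d_arr) {
  // Global index; equals threadIdx.x here because only one block is launched.
  int id = blockIdx.x * blockDim.x + threadIdx.x;
  printf("GPU thread %d, Value= %d\n", id, d_arr[id]);
}

// ---------- Host code (CPU) ----------
int main() {
  int h_arr[5] = {10, 20, 30, 40, 50};  // host (CPU) array
  int *d_arr;                           // pointer to device (GPU) memory
  size_t size = 5 * sizeof(int);

  cudaMalloc((void **)&d_arr, size);                       // allocate device memory
  cudaMemcpy(d_arr, h_arr, size, cudaMemcpyHostToDevice);  // host -> device
  printarray<<<1, 5>>>(d_arr);                             // 1 block, 5 threads
  cudaDeviceSynchronize();                                 // wait for the kernel

  cudaMemcpy(h_arr, d_arr, size, cudaMemcpyDeviceToHost);  // device -> host
  printf("\nBack on CPU:\n");
  for (int i = 0; i < 5; i++) {
    printf("h_arr[%d] = %d\n", i, h_arr[i]);
  }
  cudaFree(d_arr);  // release device memory

  return 0;
}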

Overwriting myfile.cu


In [9]:
!nvcc -arch=sm_75 myfile.cu -o myfile

In [10]:
! ./myfile

GPU thread 0, Value= 10
GPU thread 1, Value= 20
GPU thread 2, Value= 30
GPU thread 3, Value= 40
GPU thread 4, Value= 50

Back on CPU:
h_arr[0] = 10
h_arr[1] = 20
h_arr[2] = 30
h_arr[3] = 40
h_arr[4] = 50


# 6. Compare CPU times of list/tuple with NumPy arrays.

In [11]:
import time
import numpy as np

size = 10_000_000

# Python List
start = time.time()
lst = list(range(size))
lst = [x + 1 for x in lst]
end = time.time()
print("List Time:", end - start)

# Python Tuple
start = time.time()
tpl = tuple(range(size))
tpl = tuple(x + 1 for x in tpl)
end = time.time()
print("Tuple Time:", end - start)

# NumPy Array
start = time.time()
arr = np.arange(size)
arr = arr + 1
end = time.time()
print("NumPy Time:", end - start)

List Time: 0.7632122039794922
Tuple Time: 0.9661605358123779
NumPy Time: 0.043309926986694336


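In this run NumPy (about 0.043 s) was roughly 18× faster than the list (about 0.76 s) and roughly 22× faster than the tuple (about 0.97 s).

NumPy → Very fast: the addition is a vectorized operation that runs in compiled C over the whole array, instead of one interpreted Python step per element.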
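
The timings above come from a single run and include the cost of building each container. A steadier comparison, sketched below under the assumption that only the `+ 1` step is of interest, builds the data first and then times the operation several times with `timeit`. The helper `best_time` and the smaller `size` are illustrative choices, not part of the original measurement.

```python
import timeit

import numpy as np

size = 100_000  # smaller than the 10_000_000 used above so the repeats stay quick

# Build the containers once, outside the timed region.
lst = list(range(size))
tpl = tuple(range(size))
arr = np.arange(size)

def best_time(func, repeat=5, number=10):
    """Best per-call time in seconds over `repeat` rounds of `number` calls."""
    return min(timeit.repeat(func, repeat=repeat, number=number)) / number

results = {
    "list": best_time(lambda: [x + 1 for x in lst]),
    "tuple": best_time(lambda: tuple(x + 1 for x in tpl)),
    "numpy": best_time(lambda: arr + 1),
}

for name, seconds in results.items():
    print(f"{name:>5}: {seconds * 1e3:.3f} ms per call")
```

Taking the minimum over several rounds reduces the influence of background load, which matters on a shared Colab machine.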