In [1]:
!apt-get update
!pip install nvcc4jupyter

Hit:1 https://cloud.r-project.org/bin/linux/ubuntu jammy-cran40/ InRelease
Hit:2 https://cli.github.com/packages stable InRelease
Hit:3 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  InRelease
Hit:4 http://security.ubuntu.com/ubuntu jammy-security InRelease
Hit:5 http://archive.ubuntu.com/ubuntu jammy InRelease
Hit:6 http://archive.ubuntu.com/ubuntu jammy-updates InRelease
Get:7 https://r2u.stat.illinois.edu/ubuntu jammy InRelease [6,555 B]
Hit:8 http://archive.ubuntu.com/ubuntu jammy-backports InRelease
Hit:9 https://ppa.launchpadcontent.net/deadsnakes/ppa/ubuntu jammy InRelease
Hit:10 https://ppa.launchpadcontent.net/graphics-drivers/ppa/ubuntu jammy InRelease
Get:11 https://r2u.stat.illinois.edu/ubuntu jammy/main all Packages [9,426 kB]
Hit:12 https://ppa.launchpadcontent.net/ubuntugis/ppa/ubuntu jammy InRelease
Get:13 https://r2u.stat.illinois.edu/ubuntu jammy/main amd64 Packages [2,821 kB]
Fetched 12.3 MB in 3s (4,251 kB/s)

In [2]:
!nvcc --version

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Thu_Jun__6_02:18:23_PDT_2024
Cuda compilation tools, release 12.5, V12.5.82
Build cuda_12.5.r12.5/compiler.34385749_0


In [3]:
%load_ext nvcc4jupyter

Detected platform "Colab". Running its setup...
Source files will be saved in "/tmp/tmpi5a8kk1k".


### **Report: CUDA Hello World (Problem 1)**

**Name:** Uttkarsh Malviya

**Roll Number:** IIT2022061

---

**(a) Experimental Setup:**

* Platform: Google Colab (GPU runtime enabled)
* GPU: NVIDIA Tesla T4 (Compute Capability 7.5)
* Compiler: `nvcc` (CUDA Toolkit 12.5, preinstalled on Colab)


**(b) Description:**
A simple CUDA “Hello World” program that prints each thread’s block ID, thread ID,
and its computed global thread ID.

---

**(c) Brief Summary of Results:**
Each GPU thread successfully printed its unique identifiers.
In this run the blocks printed in the order 2, 0, 1, while the threads within each block printed in ascending order; the order in which blocks execute is not guaranteed.

---

**(d) Remarks / Observations:**

* Kernel configuration: **3 blocks × 4 threads per block**
* Total threads launched: **12**
* Execution time: not measured; expected to be negligible for 12 threads
* Blocks appear out of order (2, 0, 1 in this run) because the GPU schedules blocks independently
* Demonstrates correct mapping between **blockIdx**, **threadIdx**, and global index
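
Because the `printf` order depends on block scheduling, the index mapping can also be checked without relying on printed output. The following is a minimal sketch, not part of the submitted program: the helper `globalIndex` and the kernel `recordIds` are hypothetical names introduced here. Each thread writes its global ID into the array slot it owns, and the host then confirms that every slot was written.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: the same 1D mapping used in the report,
// callable from both host and device code.
__host__ __device__ inline int globalIndex(int blockId, int blockSize, int threadId)
{
    return blockId * blockSize + threadId;
}

// Hypothetical kernel: each thread stores its global ID in the slot it owns.
__global__ void recordIds(int *out)
{
    int gtid = globalIndex(blockIdx.x, blockDim.x, threadIdx.x);
    out[gtid] = gtid;
}

int main()
{
    const int numBlocks = 3;
    const int threadsPerBlock = 4;
    const int total = numBlocks * threadsPerBlock;   // 12 threads

    int *dOut = nullptr;
    if (cudaMalloc((void **)&dOut, total * sizeof(int)) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }
    cudaMemset(dOut, 0xFF, total * sizeof(int));     // every slot starts as -1

    recordIds<<<numBlocks, threadsPerBlock>>>(dOut);

    // cudaMemcpy waits for the kernel to finish before copying.
    int hOut[total];
    cudaError_t err = cudaMemcpy(hOut, dOut, total * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dOut);
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    int mismatches = 0;
    for (int i = 0; i < total; i++)
        if (hOut[i] != i)
            mismatches++;

    printf("%d of %d slots hold their own global ID\n", total - mismatches, total);
    return mismatches == 0 ? 0 : 1;
}
```

With this approach the result does not depend on which block runs first, since each thread writes to a distinct location.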


In [4]:
%%writefile hello_world_cuda.cu
#include <cstdio>
#include <cuda_runtime.h>

__global__ void helloFromGPU()
{
    int tid  = threadIdx.x;                 // Thread ID within block
    int bid  = blockIdx.x;                  // Block ID
    int gtid = bid * blockDim.x + tid;      // Global thread ID (1D)
    printf("Hello from Block %d, Thread %d, Global %d\n", bid, tid, gtid);
}

int main()
{
    int threadsPerBlock = 4;
    int numBlocks = 3;

    printf("Launching kernel with %d blocks × %d threads per block\n\n",
           numBlocks, threadsPerBlock);

    helloFromGPU<<<numBlocks, threadsPerBlock>>>();
    cudaDeviceSynchronize();

    printf("\nKernel execution complete.\n");
    return 0;
}


Overwriting hello_world_cuda.cu


In [5]:
%%bash
nvcc -arch=sm_75 hello_world_cuda.cu -o hello_world_cuda
./hello_world_cuda


Launching kernel with 3 blocks × 4 threads per block

Hello from Block 2, Thread 0, Global 8
Hello from Block 2, Thread 1, Global 9
Hello from Block 2, Thread 2, Global 10
Hello from Block 2, Thread 3, Global 11
Hello from Block 0, Thread 0, Global 0
Hello from Block 0, Thread 1, Global 1
Hello from Block 0, Thread 2, Global 2
Hello from Block 0, Thread 3, Global 3
Hello from Block 1, Thread 0, Global 4
Hello from Block 1, Thread 1, Global 5
Hello from Block 1, Thread 2, Global 6
Hello from Block 1, Thread 3, Global 7

Kernel execution complete.


### **Report: Vector Addition (CUDA) (Problem 2)**

**Name:** Uttkarsh Malviya

**Roll Number:** IIT2022061

---

**(a) Experimental Setup:**

* **Platform:** Google Colab (GPU runtime enabled)
* **GPU:** NVIDIA Tesla T4 (Compute Capability 7.5)
* **Compiler:** `nvcc` (CUDA Toolkit 12.x preinstalled on Colab)
* **Compilation Command:**

  ```
  nvcc -arch=sm_75 vec_add.cu -o vec_add
  ```
* **Execution Command:**

  ```
  ./vec_add
  ```

---

**(b) Description:**
This program performs element-wise addition of two vectors using CUDA.
Each GPU thread adds one element from vectors **A** and **B** and stores the result in **C**.
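
With one thread per element, the grid must contain at least as many threads as there are elements. The sketch below shows the ceiling-division arithmetic for a few sizes; `blocksFor` is a hypothetical helper name, and the sizes other than `1 << 20` are illustrative only.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: smallest number of blocks whose threads cover n elements
// (integer ceiling of n / threadsPerBlock).
__host__ __device__ inline int blocksFor(int n, int threadsPerBlock)
{
    return (n + threadsPerBlock - 1) / threadsPerBlock;
}

int main()
{
    const int threadsPerBlock = 256;
    const int sizes[] = {1 << 20, 1000000, 1000};   // illustrative sizes

    for (int n : sizes) {
        int blocks = blocksFor(n, threadsPerBlock);
        int idle = blocks * threadsPerBlock - n;    // threads with no element to process
        printf("N = %7d -> %4d blocks, %3d idle threads in the last block\n",
               n, blocks, idle);
    }
    return 0;
}
```

For N = 1,048,576 this gives 4096 blocks with no idle threads, while N = 1,000,000 gives 3907 blocks with 192 idle threads. The idle threads are the reason a kernel of this kind needs an `if (i < N)` guard before touching memory.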

---

**(c) Summary of Results:**
The kernel executed successfully, producing correct output for all 1,048,576 elements.
Verification confirmed that every element matched the expected CPU result.

---

**(d) Remarks / Observations:**

* **Vector size:** 1,048,576 elements (`1 << 20`, ≈ 1 million)
* **Kernel configuration:** 256 threads per block, 4096 blocks (1,048,576 / 256)
* **Total threads launched:** 1,048,576 (4096 × 256), exactly one per element
* **Execution time:** Not measured; expected to be well under 1 ms for this size
* **Result:** `Vector addition successful!`
* Demonstrates correct memory transfer between host ↔ device and parallel computation on the GPU.
* Output order is deterministic since no device printing occurs.
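
The execution time can be measured rather than estimated. One common approach uses CUDA events, sketched below under some assumptions: `timeVectorAddMs` is a hypothetical helper, the input buffers are only zero-filled because their contents do not affect timing, and a single warm-up launch is used so that one-time startup cost is excluded.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void vectorAdd(const float *A, const float *B, float *C, int N)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N)
        C[i] = A[i] + B[i];
}

// Hypothetical helper: time one vectorAdd launch in milliseconds.
// Returns -1.0f if device memory could not be allocated.
float timeVectorAddMs(int N, int threadsPerBlock)
{
    size_t size = N * sizeof(float);
    float *dA = nullptr, *dB = nullptr, *dC = nullptr;
    if (cudaMalloc((void **)&dA, size) != cudaSuccess ||
        cudaMalloc((void **)&dB, size) != cudaSuccess ||
        cudaMalloc((void **)&dC, size) != cudaSuccess) {
        cudaFree(dA); cudaFree(dB); cudaFree(dC);   // freeing a null pointer is a no-op
        return -1.0f;
    }
    cudaMemset(dA, 0, size);   // contents are irrelevant for timing
    cudaMemset(dB, 0, size);

    int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Warm-up launch, not timed.
    vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(dA, dB, dC, N);
    cudaDeviceSynchronize();

    // Timed launch: events are recorded in the same stream as the kernel.
    cudaEventRecord(start);
    vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(dA, dB, dC, N);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);   // wait until the stop event has completed

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    return ms;
}

int main()
{
    float ms = timeVectorAddMs(1 << 20, 256);
    if (ms < 0.0f) {
        fprintf(stderr, "Device allocation failed\n");
        return 1;
    }
    printf("vectorAdd kernel time: %.3f ms\n", ms);
    return 0;
}
```

This times the kernel only. Host-to-device and device-to-host copies are excluded and would need their own event pairs if they are to be reported.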


In [6]:
%%writefile vec_add.cu
#include <stdio.h>
#include <stdlib.h>   // malloc, free
#include <math.h>     // floating-point absolute value
#include <cuda_runtime.h>

// CUDA Kernel: performs element-wise vector addition
__global__ void vectorAdd(float *A, float *B, float *C, int N)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N)
        C[i] = A[i] + B[i];
}

int main()
{
    int N = 1 << 20; // 2^20 = 1,048,576 elements
    size_t size = N * sizeof(float);

    // Allocate host memory
    float *hA = (float *)malloc(size);
    float *hB = (float *)malloc(size);
    float *hC = (float *)malloc(size);

    // Initialize host arrays
    for (int i = 0; i < N; i++)
    {
        hA[i] = 1.0f;
        hB[i] = 2.0f;
    }

    // Allocate device memory
    float *dA, *dB, *dC;
    cudaMalloc((void **)&dA, size);
    cudaMalloc((void **)&dB, size);
    cudaMalloc((void **)&dC, size);

    // Copy input data to device
    cudaMemcpy(dA, hA, size, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB, size, cudaMemcpyHostToDevice);

    // Launch the kernel
    int threadsPerBlock = 256;
    int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;
    vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(dA, dB, dC, N);

    // Copy the result back to host
    cudaMemcpy(hC, dC, size, cudaMemcpyDeviceToHost);

    // Verify the result
    int errors = 0;
    for (int i = 0; i < N; i++)
    {
        if (fabsf(hC[i] - (hA[i] + hB[i])) > 1e-5f)
        {
            // Report only the first 10 mismatches to keep the output short
            if (errors < 10)
                printf("Error at index %d: GPU = %f, Expected = %f\n", i, hC[i], hA[i] + hB[i]);
            errors++;
        }
    }

    if (errors == 0)
        printf("Vector addition successful!\n");
    else
        printf("Vector addition failed with %d errors.\n", errors);

    // Free memory
    free(hA);
    free(hB);
    free(hC);
    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);

    return 0;
}


Overwriting vec_add.cu


In [7]:
%%bash
nvcc -arch=sm_75 vec_add.cu -o vec_add
./vec_add


Vector addition successful!

### **Report: Scalar Multiplication (CUDA) (Problem 3)**

**Name:** Uttkarsh Malviya

**Roll Number:** IIT2022061

---

**(a) Experimental Setup:**

* **Platform:** Google Colab (GPU runtime enabled)
* **GPU:** NVIDIA Tesla T4 (Compute Capability 7.5)
* **Compiler:** `nvcc` (CUDA Toolkit 12.x preinstalled)
* **Compilation Command:**

  ```
  nvcc -arch=sm_75 scalar_mul.cu -o scalar_mul
  ```
* **Execution Command:**

  ```
  ./scalar_mul
  ```

---

**(b) Description:**
This program multiplies every element of an input vector by a scalar value in parallel using CUDA.
Each GPU thread handles one element:
$$
A[i] = A[i] \times \text{scalar}
$$

---

**(c) Summary of Results:**
The GPU kernel executed successfully.
Verification confirmed that all elements were multiplied correctly by 5.0.
Output message: `Scalar multiplication successful!`

---

**(d) Remarks / Observations:**

* **Vector size:** 1,048,576 elements
* **Scalar value:** 5.0
* **Kernel configuration:** 256 threads per block, 4096 blocks (1,048,576 / 256)
* **Execution time:** not measured; expected to be under 1 ms for this size
* Demonstrates one-to-one mapping of GPU threads to data elements.
* Validates correct use of global memory access and kernel launch configuration.
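
A launch configuration can be validated explicitly by checking the error state after the kernel launch. The sketch below is one way to do this and is not part of the submitted program: the `CUDA_CHECK` macro is a hypothetical helper, and the input buffer is zero-filled because only the error handling is being shown.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper macro: print the CUDA error and exit if a runtime call fails.
#define CUDA_CHECK(call)                                               \
    do {                                                               \
        cudaError_t err_ = (call);                                     \
        if (err_ != cudaSuccess) {                                     \
            fprintf(stderr, "CUDA error '%s' at %s:%d\n",              \
                    cudaGetErrorString(err_), __FILE__, __LINE__);     \
            exit(EXIT_FAILURE);                                        \
        }                                                              \
    } while (0)

__global__ void scalarMultiply(float *A, float scalar, int N)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N)
        A[i] *= scalar;
}

int main()
{
    const int N = 1 << 20;
    const size_t size = N * sizeof(float);

    float *dA = nullptr;
    CUDA_CHECK(cudaMalloc((void **)&dA, size));
    CUDA_CHECK(cudaMemset(dA, 0, size));

    const int threadsPerBlock = 256;
    const int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;

    scalarMultiply<<<blocksPerGrid, threadsPerBlock>>>(dA, 5.0f, N);
    CUDA_CHECK(cudaGetLastError());        // reports an invalid launch configuration
    CUDA_CHECK(cudaDeviceSynchronize());   // reports errors raised while the kernel ran

    CUDA_CHECK(cudaFree(dA));
    printf("Kernel launched and completed without CUDA errors.\n");
    return 0;
}
```

Kernel launches do not return an error code themselves, which is why the check is split into `cudaGetLastError()` right after the launch and `cudaDeviceSynchronize()` for failures that surface during execution.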


In [8]:
%%writefile scalar_mul.cu
#include <stdio.h>
#include <stdlib.h>   // malloc, free
#include <math.h>     // fabs
#include <cuda_runtime.h>

// CUDA kernel: multiply each element of array A by a scalar
__global__ void scalarMultiply(float *A, float scalar, int N)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N)
        A[i] *= scalar;
}

int main()
{
    int N = 1 << 20;  // 2^20 = 1,048,576 elements
    size_t size = N * sizeof(float);
    float scalar = 5.0f;

    // Allocate host memory
    float *hA = (float *)malloc(size);
    for (int i = 0; i < N; i++)
        hA[i] = 1.0f;  // initialize all to 1 for easy verification

    // Allocate device memory
    float *dA;
    cudaMalloc((void **)&dA, size);

    // Copy input data to device
    cudaMemcpy(dA, hA, size, cudaMemcpyHostToDevice);

    // Launch kernel
    int threadsPerBlock = 256;
    int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;
    scalarMultiply<<<blocksPerGrid, threadsPerBlock>>>(dA, scalar, N);

    // Copy result back to host
    cudaMemcpy(hA, dA, size, cudaMemcpyDeviceToHost);

    // Verify results (report at most the first 10 mismatches)
    int errors = 0;
    for (int i = 0; i < N; i++)
    {
        if (fabs(hA[i] - 5.0f) > 1e-5f)
        {
            if (errors < 10)
                printf("Error at index %d: GPU=%f, Expected=%f\n", i, hA[i], 5.0f);
            errors++;
        }
    }

    if (errors == 0)
        printf("Scalar multiplication successful!\n");
    else
        printf("Scalar multiplication failed with %d errors.\n", errors);

    // Free memory
    free(hA);
    cudaFree(dA);

    return 0;
}


Overwriting scalar_mul.cu


In [9]:
%%bash
nvcc -arch=sm_75 scalar_mul.cu -o scalar_mul
./scalar_mul


Scalar multiplication successful!
