<a href="https://colab.research.google.com/github/A-Midhat/CUDA/blob/main/alimidhat_GPU_Practical4.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

> **Ali Midhat Abdelgadir Abdalla**

# **CUDA Programming on NVIDIA GPUs**

# **Practical 4**

Again make sure the correct Runtime is being used, by clicking on the Runtime option at the top, then "Change runtime type", and selecting an appropriate GPU such as the T4.

Then verify the details of the GPU which is available to you, and upload the usual two header files.

In [None]:
!nvidia-smi


Sat Jun  8 06:57:19 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  Tesla T4                       Off | 00000000:00:04.0 Off |                    0 |
| N/A   48C    P8              10W /  70W |      0MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                    

In [None]:
!wget https://people.maths.ox.ac.uk/gilesm/cuda/headers/helper_cuda.h
!wget https://people.maths.ox.ac.uk/gilesm/cuda/headers/helper_string.h


--2025-02-17 16:52:29--  https://people.maths.ox.ac.uk/gilesm/cuda/headers/helper_cuda.h
Resolving people.maths.ox.ac.uk (people.maths.ox.ac.uk)... 129.67.184.129, 2001:630:441:202::8143:b881
Connecting to people.maths.ox.ac.uk (people.maths.ox.ac.uk)|129.67.184.129|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 27832 (27K) [text/x-chdr]
Saving to: ‘helper_cuda.h’


2025-02-17 16:52:30 (200 KB/s) - ‘helper_cuda.h’ saved [27832/27832]

--2025-02-17 16:52:30--  https://people.maths.ox.ac.uk/gilesm/cuda/headers/helper_string.h
Resolving people.maths.ox.ac.uk (people.maths.ox.ac.uk)... 129.67.184.129, 2001:630:441:202::8143:b881
Connecting to people.maths.ox.ac.uk (people.maths.ox.ac.uk)|129.67.184.129|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 14875 (15K) [text/x-chdr]
Saving to: ‘helper_string.h’


2025-02-17 16:52:31 (366 KB/s) - ‘helper_string.h’ saved [14875/14875]





---

The next step is to create the file reduction.cu which includes within it a reference C++ routine against which the CUDA results are compared.

# (4)

In [None]:
%%writefile reduction.cu

////////////////////////////////////////////////////////////////////////
//
// Practical 4 -- initial code for shared memory reduction for
//                a single block which is a power of two in size
//
////////////////////////////////////////////////////////////////////////

#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <math.h>
#include <float.h>

#include <helper_cuda.h>

////////////////////////////////////////////////////////////////////////
// CPU routine
////////////////////////////////////////////////////////////////////////

float reduction_gold(float* idata, int len)
{
  float sum = 0.0f;
  for(int i=0; i<len; i++) sum += idata[i];
  printf("\nCPU Result %f ", sum);
  return sum;
}

////////////////////////////////////////////////////////////////////////
// GPU routine
////////////////////////////////////////////////////////////////////////

// g_odata is output (final reduction result)
// g_idata is input

__global__ void reduction(float *g_odata, float *g_idata)
{
    // dynamically allocated shared memory

    extern  __shared__  float temp[];

    int tid = threadIdx.x;

    // first, each thread loads data into shared memory

    temp[tid] = g_idata[tid];

    // next, we perform binary tree reduction

    for (int d=blockDim.x/2; d>0; d=d/2) {
      __syncthreads();  // ensure previous step completed
      if (tid<d)  temp[tid] += temp[tid+d];
    }

    // finally, first thread puts result into global memory

    if (tid==0) {
      g_odata[0] = temp[0];
      printf("\nGPU Reduction Result %f ", temp[0]);
      }
}


////////////////////////////////////////////////////////////////////////
// Program main
////////////////////////////////////////////////////////////////////////

int main( int argc, const char** argv)
{
  int num_blocks, num_threads, num_elements, mem_size, shared_mem_size;

  float *h_data, *d_idata, *d_odata;

  // initialise card

  findCudaDevice(argc, argv);

  num_blocks   = 1;  // start with only 1 thread block
  num_threads  = 512;
  num_elements = num_blocks*num_threads;
  mem_size     = sizeof(float) * num_elements;

  // allocate host memory to store the input data
  // and initialize to integer values between 0 and 10

  h_data = (float*) malloc(mem_size);

  for(int i = 0; i < num_elements; i++)
    h_data[i] = floorf(10.0f*(rand()/(float)RAND_MAX));

  // compute reference solution

  float sum = reduction_gold(h_data, num_elements);

  // allocate device memory input and output arrays

  checkCudaErrors( cudaMalloc((void**)&d_idata, mem_size) );
  checkCudaErrors( cudaMalloc((void**)&d_odata, sizeof(float)) );

  // copy host memory to device input array

  checkCudaErrors( cudaMemcpy(d_idata, h_data, mem_size,
                              cudaMemcpyHostToDevice) );

  // execute the kernel

  shared_mem_size = sizeof(float) * num_threads;
  reduction<<<num_blocks,num_threads,shared_mem_size>>>(d_odata,d_idata);
  getLastCudaError("reduction kernel execution failed");

  // copy result from device to host

  checkCudaErrors( cudaMemcpy(h_data, d_odata, sizeof(float),
                              cudaMemcpyDeviceToHost) );

  // check results

  printf("\nReduction error = %f\n",h_data[0]-sum);

  // cleanup memory

  free(h_data);
  checkCudaErrors( cudaFree(d_idata) );
  checkCudaErrors( cudaFree(d_odata) );

  // CUDA exit -- needed to flush printf write buffer

  cudaDeviceReset();
}


Writing reduction.cu



---

We can now compile and run the executable.  Note that the compilation links in the CUDA runtime library (`-lcudart`); cuRAND is not needed here because the input data is generated on the host with `rand()`.


In [None]:
!nvcc reduction.cu -o reduction -I. -lineinfo -arch=sm_70 --ptxas-options=-v --use_fast_math -lcudart

ptxas info    : 26 bytes gmem, 16 bytes cmem[4]
ptxas info    : Compiling entry function '_Z9reductionPfS_' for 'sm_70'
ptxas info    : Function properties for _Z9reductionPfS_
    8 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 24 registers, 8 bytes cumulative stack size, 368 bytes cmem[0]


In [None]:
!./reduction

GPU Device 0: "Turing" with compute capability 7.5


Gold Reduction Result 2351.000000 
GPU Reduction Result 2351.000000 
reduction error = 0.000000


---
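### 1. CPU Reference Sum
`reduction_gold` computes the expected sum on the CPU with an ordinary sequential loop.

### 2. GPU Parallel Reduction
The CUDA kernel:
- Loads the input into shared memory, one element per thread.
- Performs a parallel reduction using a binary tree: at each step the first `d` threads add in the entry `d` places above their own, and `d` is halved until it reaches 1.
- Writes a single value, the total sum, to `g_odata[0]`.

### 3. Verification
- `main` compares the kernel's result with the value returned by `reduction_gold`. The difference is 0, so the GPU reduction matches the reference. The match is exact because every input is a small integer, so all partial sums are exactly representable in single precision regardless of the order of addition.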
---
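
The binary-tree loop can be traced on the CPU. The following is a minimal host-side sketch, not part of `reduction.cu`: `tree_reduce` is a hypothetical helper, the vector `temp` stands in for the shared-memory array, and the inner loop over `tid` stands in for the threads of one block.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// CPU emulation of the kernel's binary-tree loop (hypothetical helper).
// temp plays the role of the shared-memory array of one block.
float tree_reduce(std::vector<float> temp)
{
    const std::size_t n = temp.size();
    assert(n > 0 && (n & (n - 1)) == 0);  // size must be a power of two

    for (std::size_t d = n / 2; d > 0; d /= 2) {
        // On the GPU, __syncthreads() separates consecutive passes.
        // Within a pass, thread tid only reads entry tid + d (>= d) and
        // only writes entry tid (< d), so threads never conflict.
        for (std::size_t tid = 0; tid < d; ++tid) {
            temp[tid] += temp[tid + d];
        }
    }
    return temp[0];  // the first entry now holds the total
}
```

For eight elements the loop runs with `d = 4, 2, 1`, so the sum is formed in three passes rather than seven sequential additions.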

# (5)

In [None]:
%%writefile reduction.cu

////////////////////////////////////////////////////////////////////////
//
// Practical 4 -- initial code for shared memory reduction for
//                a single block which is a power of two in size
//
////////////////////////////////////////////////////////////////////////

#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <math.h>
#include <float.h>

#include <helper_cuda.h>

////////////////////////////////////////////////////////////////////////
// CPU routine
////////////////////////////////////////////////////////////////////////

float reduction_gold(float* idata, int len)
{
  float sum = 0.0f;
  for(int i=0; i<len; i++) sum += idata[i];
  printf("\nCPU Result %f ", sum);
  return sum;
}

////////////////////////////////////////////////////////////////////////
// GPU routine
////////////////////////////////////////////////////////////////////////

// g_odata is output (final reduction result)
// g_idata is input

__global__ void reduction(float *g_odata, float *g_idata)
{
    // dynamically allocated shared memory

    extern  __shared__  float temp[];

    int tid = threadIdx.x;
    // number of threads per block
    int num_threads = blockDim.x;

    int m1; // largest power of 2 strictly below num_threads
    for (m1=1; m1<num_threads; m1=2*m1) {};  // first power of 2 >= num_threads
    // halve it to get the largest power of 2 below num_threads
    m1 = m1/2;

    // number of elements beyond the first m1
    int diff = num_threads - m1;

    // first, each thread loads data into shared memory

    temp[tid] = g_idata[tid];

    // fold the diff elements beyond the first m1 onto the first diff entries,
    // leaving a power-of-2 number (m1) of partial sums in temp
    if (tid < diff){
      temp[tid] += g_idata[tid + m1];
    }


    // next, we perform binary tree reduction

    for (int d=m1/2; d>0; d=d/2) {
      __syncthreads();  // ensure previous step completed
      if (tid<d)  temp[tid] += temp[tid+d];
    }

    // finally, first thread puts result into global memory

    if (tid==0) {
      g_odata[0] = temp[0];
      printf("\nGPU Result %f ", temp[0]);
      }
}


////////////////////////////////////////////////////////////////////////
// Program main
////////////////////////////////////////////////////////////////////////

int main( int argc, const char** argv)
{
  int num_blocks, num_threads, num_elements, mem_size, shared_mem_size;

  float *h_data, *d_idata, *d_odata;

  // initialise card

  findCudaDevice(argc, argv);

  num_blocks   = 1;  // start with only 1 thread block
  num_threads  = 192; // not a power of 2
  num_elements = num_blocks*num_threads;
  mem_size     = sizeof(float) * num_elements;

  // allocate host memory to store the input data
  // and initialize to integer values between 0 and 10

  h_data = (float*) malloc(mem_size);

  for(int i = 0; i < num_elements; i++)
    h_data[i] = floorf(10.0f*(rand()/(float)RAND_MAX));

  // compute reference solution

  float sum = reduction_gold(h_data, num_elements);

  // allocate device memory input and output arrays

  checkCudaErrors( cudaMalloc((void**)&d_idata, mem_size) );
  checkCudaErrors( cudaMalloc((void**)&d_odata, sizeof(float)) );

  // copy host memory to device input array

  checkCudaErrors( cudaMemcpy(d_idata, h_data, mem_size,
                              cudaMemcpyHostToDevice) );

  // execute the kernel

  shared_mem_size = sizeof(float) * num_threads;
  reduction<<<num_blocks,num_threads,shared_mem_size>>>(d_odata,d_idata);
  getLastCudaError("reduction kernel execution failed");

  // copy result from device to host

  checkCudaErrors( cudaMemcpy(h_data, d_odata, sizeof(float),
                              cudaMemcpyDeviceToHost) );

  // check results

  printf("\nCPU/GPU error = %f\n",h_data[0]-sum);

  // cleanup memory

  free(h_data);
  checkCudaErrors( cudaFree(d_idata) );
  checkCudaErrors( cudaFree(d_odata) );

  // CUDA exit -- needed to flush printf write buffer

  cudaDeviceReset();
}


Overwriting reduction.cu


In [None]:
!nvcc reduction.cu -o reduction -I. -lineinfo -arch=sm_70 --ptxas-options=-v --use_fast_math -lcudart

ptxas info    : 16 bytes gmem, 16 bytes cmem[4]
ptxas info    : Compiling entry function '_Z9reductionPfS_' for 'sm_70'
ptxas info    : Function properties for _Z9reductionPfS_
    8 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 24 registers, 8 bytes cumulative stack size, 368 bytes cmem[0]


In [None]:
!./reduction

GPU Device 0: "Turing" with compute capability 7.5


CPU Result 896.000000 
GPU Result 896.000000 
CPU/GPU error = 0.000000


---

- The tree reduction needs a power-of-2 number of entries, so we first compute `m1`, the largest power of 2 strictly below `num_threads`.
- We double `m1` until it is at least `num_threads`, then halve it:
  ```cpp
  int m1;
  for (m1 = 1; m1 < num_threads; m1 *= 2) {}
  m1 /= 2;
  ```

- The difference `num_threads - m1` is the number of extra elements. The first `diff` threads add these onto their own entries before the standard reduction starts. With `num_threads = 192` this gives `m1 = 128` and `diff = 64`.

- Finally, the normal binary-tree reduction runs over the remaining `m1` entries.
---
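
The fold-then-reduce steps can be checked on the CPU. This is a minimal host-side sketch under the same assumptions as the kernel (one element per thread, at least two elements); `pow2_below` and `reduce_any_size` are hypothetical names, not part of `reduction.cu`.

```cpp
#include <cassert>
#include <vector>

// Largest power of two strictly below n, computed the same way as in the
// kernel: grow m1 to the first power of two >= n, then halve it.
int pow2_below(int n)
{
    assert(n >= 2);
    int m1 = 1;
    while (m1 < n) m1 *= 2;
    return m1 / 2;
}

// CPU emulation of the kernel for a block whose size need not be a power of
// two (hypothetical helper). The loops over tid stand in for the threads.
float reduce_any_size(const std::vector<float>& g_idata)
{
    const int n    = static_cast<int>(g_idata.size());
    const int m1   = pow2_below(n);
    const int diff = n - m1;  // elements beyond the first m1

    std::vector<float> temp(g_idata);  // "shared memory" copy of the input

    // Fold the extra elements onto the first diff entries.
    for (int tid = 0; tid < diff; ++tid) temp[tid] += g_idata[tid + m1];

    // Standard binary-tree reduction over the first m1 entries.
    for (int d = m1 / 2; d > 0; d /= 2)
        for (int tid = 0; tid < d; ++tid) temp[tid] += temp[tid + d];

    return temp[0];
}
```

Note that for a size that is already a power of two, such as 512, `pow2_below` returns half the size, so the fold step simply performs the first level of the tree.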

# (6)

In [None]:
%%writefile reduction.cu

////////////////////////////////////////////////////////////////////////
//
// Practical 4 -- initial code for shared memory reduction for
//                a single block which is a power of two in size
//
////////////////////////////////////////////////////////////////////////

#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <math.h>
#include <float.h>

#include <helper_cuda.h>

////////////////////////////////////////////////////////////////////////
// CPU routine
////////////////////////////////////////////////////////////////////////

float reduction_gold(float* idata, int len)
{
  float sum = 0.0f;
  for(int i=0; i<len; i++) sum += idata[i];
  printf("\nCPU Result %f ", sum);
  return sum;
}

////////////////////////////////////////////////////////////////////////
// GPU routine
////////////////////////////////////////////////////////////////////////

__global__ void reduction(float *g_odata, float *g_idata)
{
    // dynamically allocated shared memory

    extern  __shared__  float temp[];

    int tid = threadIdx.x;
    // flattened global thread index across all blocks
    int global_id = blockIdx.x * blockDim.x + threadIdx.x;

    // first, each thread loads data into shared memory

    temp[tid] = g_idata[global_id];

    // next, we perform binary tree reduction

    for (int d=blockDim.x/2; d>0; d=d/2) {
      __syncthreads();  // ensure previous step completed
      if (tid<d) {
        temp[tid] += temp[tid+d];
      }
    }

    // finally, first thread puts result into global memory
    if (tid==0) g_odata[blockIdx.x] = temp[0];

}


////////////////////////////////////////////////////////////////////////
// Program main
////////////////////////////////////////////////////////////////////////

int main( int argc, const char** argv)
{
  int num_blocks, num_threads, num_elements, mem_size, shared_mem_size;

  float *h_data, *d_idata, *d_odata;

  // initialise card

  findCudaDevice(argc, argv);

  num_blocks   = 1000; // multiple blocks
  num_threads  = 512;
  num_elements = num_blocks*num_threads;
  mem_size     = sizeof(float) * num_elements;

  // allocate host memory to store the input data
  // and initialize to integer values between 0 and 10

  h_data = (float*) malloc(mem_size);

  for(int i = 0; i < num_elements; i++)
    h_data[i] = floorf(10.0f*(rand()/(float)RAND_MAX));

  // compute reference solution

  float sum = reduction_gold(h_data, num_elements);

  // allocate device memory input and output arrays

  checkCudaErrors( cudaMalloc((void**)&d_idata, mem_size) );
  checkCudaErrors( cudaMalloc((void**)&d_odata, sizeof(float)*num_blocks) ); // changing the size of the output array

  // copy host memory to device input array

  checkCudaErrors( cudaMemcpy(d_idata, h_data, mem_size,
                              cudaMemcpyHostToDevice) );

  // execute the kernel

  shared_mem_size = sizeof(float) * num_threads;
  reduction<<<num_blocks,num_threads,shared_mem_size>>>(d_odata,d_idata);
  getLastCudaError("reduction kernel execution failed");

  // copy result from device to host
  cudaDeviceSynchronize();
  checkCudaErrors( cudaMemcpy(h_data, d_odata, sizeof(float)*num_blocks,
                              cudaMemcpyDeviceToHost) );

  // check results
  float g_sum = 0.0f;
  for (int i=0; i<num_blocks; i++){
    g_sum += h_data[i];
  }
  printf("\nGPU Result %f ", g_sum);
  printf("\nError result = %f\n",g_sum - sum);

  // cleanup memory

  free(h_data);
  checkCudaErrors( cudaFree(d_idata) );
  checkCudaErrors( cudaFree(d_odata) );

  // CUDA exit -- needed to flush printf write buffer

  cudaDeviceReset();
}


Overwriting reduction.cu


In [None]:
!nvcc reduction.cu -o reduction -I. -lineinfo -arch=sm_70 --ptxas-options=-v --use_fast_math -lcudart

ptxas info    : 0 bytes gmem
ptxas info    : Compiling entry function '_Z9reductionPfS_' for 'sm_70'
ptxas info    : Function properties for _Z9reductionPfS_
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 10 registers, 368 bytes cmem[0]


In [None]:
!./reduction

GPU Device 0: "Turing" with compute capability 7.5


CPU Result 2305301.000000 
GPU Result 2305301.000000 
Error result = 0.000000



---
- Unlike the previous versions, which used a **single block**, here we scale up by launching multiple blocks (`num_blocks = 1000`). Each block independently reduces its own chunk of `blockDim.x` elements.
- Thread 0 of each block writes the block's partial sum to `g_odata[blockIdx.x]`, so the output array holds one value per block.
- Finally, the host copies these partial sums back and adds them up to get the total.
---
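
The two-stage structure (one partial sum per block, then a final sum on the host) can be sketched on the CPU as follows. The helper names `block_partial_sums` and `sum_partials` are hypothetical, and for brevity each block's chunk is summed sequentially here rather than with the tree loop used in the kernel.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Stage 1: one partial sum per block (hypothetical helper). The index
// block * block_size + tid mirrors blockIdx.x * blockDim.x + threadIdx.x.
std::vector<float> block_partial_sums(const std::vector<float>& g_idata,
                                      std::size_t block_size)
{
    assert(block_size > 0 && g_idata.size() % block_size == 0);
    const std::size_t num_blocks = g_idata.size() / block_size;

    std::vector<float> g_odata(num_blocks, 0.0f);
    for (std::size_t block = 0; block < num_blocks; ++block) {
        for (std::size_t tid = 0; tid < block_size; ++tid) {
            const std::size_t global_id = block * block_size + tid;
            g_odata[block] += g_idata[global_id];
        }
    }
    return g_odata;
}

// Stage 2: the host adds up the per-block results.
float sum_partials(const std::vector<float>& g_odata)
{
    return std::accumulate(g_odata.begin(), g_odata.end(), 0.0f);
}
```

With eight inputs and a block size of 4, stage 1 yields two partial sums and stage 2 combines them into the total.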

# (7)

In [13]:
%%writefile reduction.cu

////////////////////////////////////////////////////////////////////////
//
// Practical 4 -- initial code for shared memory reduction for
//                a single block which is a power of two in size
//
////////////////////////////////////////////////////////////////////////

#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <math.h>
#include <float.h>

#include <helper_cuda.h>

////////////////////////////////////////////////////////////////////////
// CPU routine
////////////////////////////////////////////////////////////////////////

float reduction_gold(float* idata, int len)
{
  float sum = 0.0f;
  for(int i=0; i<len; i++) sum += idata[i];
  printf("\nCPU Result %f ", sum);
  return sum;
}

////////////////////////////////////////////////////////////////////////
// GPU routine
////////////////////////////////////////////////////////////////////////

__global__ void reduction(float *g_odata, float *g_idata)
{
    // no shared memory needed: values are exchanged through warp shuffles

    int global_id = blockIdx.x * blockDim.x + threadIdx.x;
    int warpIdx = threadIdx.x / 32;
    int thread_warp_idx = threadIdx.x % 32;


    float sum = g_idata[global_id];

    for (int offset = warpSize/2; offset > 0; offset /= 2) {
      sum += __shfl_down_sync(0xFFFFFFFF, sum, offset); // add the value held by the lane offset positions higher
    }


    if (thread_warp_idx == 0) {
        g_odata[warpIdx]= sum;
    }


}


////////////////////////////////////////////////////////////////////////
// Program main
////////////////////////////////////////////////////////////////////////

int main( int argc, const char** argv)
{
  int num_blocks, num_threads, num_elements, mem_size, shared_mem_size;

  float *h_data, *d_idata, *d_odata;

  // initialise card

  findCudaDevice(argc, argv);

  num_blocks   = 1; // start with only 1 thread block
  num_threads  = 512;
  num_elements = num_blocks*num_threads;
  mem_size     = sizeof(float) * num_elements;

  int num_of_warps = ceil((float) num_threads/32) ;

  // allocate host memory to store the input data
  // and initialize to integer values between 0 and 10

  h_data = (float*) malloc(mem_size);

  for(int i = 0; i < num_elements; i++)
    h_data[i] = floorf(10.0f*(rand()/(float)RAND_MAX));

  // compute reference solution

  float sum = reduction_gold(h_data, num_elements);

  // allocate device memory input and output arrays

  checkCudaErrors( cudaMalloc((void**)&d_idata, mem_size) );
  checkCudaErrors( cudaMalloc((void**)&d_odata, sizeof(float)*num_of_warps) ); //

  // copy host memory to device input array

  checkCudaErrors( cudaMemcpy(d_idata, h_data, mem_size,
                              cudaMemcpyHostToDevice) );

  // execute the kernel

  shared_mem_size = sizeof(float) * num_threads;
  reduction<<<num_blocks,num_threads,shared_mem_size>>>(d_odata,d_idata);
  getLastCudaError("reduction kernel execution failed");

  // copy result from device to host
  cudaDeviceSynchronize();
  checkCudaErrors( cudaMemcpy(h_data, d_odata, sizeof(float)*num_of_warps,
                              cudaMemcpyDeviceToHost) );

  // check results
  float g_sum = 0.0f;
  for (int i=0; i<num_of_warps; i++){
    g_sum += h_data[i];
  }
  printf("\nGPU result = %f\n",g_sum);
  printf("\nCPU/GPU error = %f\n",g_sum-sum);

  // cleanup memory

  free(h_data);
  checkCudaErrors( cudaFree(d_idata) );
  checkCudaErrors( cudaFree(d_odata) );

  // CUDA exit -- needed to flush printf write buffer

  cudaDeviceReset();
}


Overwriting reduction.cu


In [14]:
!nvcc reduction.cu -o reduction -I. -lineinfo -arch=sm_70 --ptxas-options=-v --use_fast_math -lcudart

ptxas info    : 0 bytes gmem
ptxas info    : Compiling entry function '_Z9reductionPfS_' for 'sm_70'
ptxas info    : Function properties for _Z9reductionPfS_
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 10 registers, 368 bytes cmem[0]


In [15]:
!./reduction

GPU Device 0: "Turing" with compute capability 7.5


CPU Result 2351.000000 
GPU result = 2351.000000

CPU/GPU error = 0.000000


---

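- Instead of reducing across all threads in a block through shared memory, this approach performs a **warp-level reduction** using the **CUDA shuffle intrinsic `__shfl_down_sync`**, which passes register values directly between the lanes of a warp, so no shared memory or `__syncthreads()` is needed.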

- **Thread and Warp Indexing**:
  - Each thread calculates a **global index**:
    ```cpp
    int global_id = blockIdx.x * blockDim.x + threadIdx.x;
    ```
  - Determines **warp index** (`warpIdx`) and **position within the warp** (`thread_warp_idx`):
    ```cpp
    int warpIdx = threadIdx.x / 32;
    int thread_warp_idx = threadIdx.x % 32;
    ```

- **Warp Shuffle Reduction**:
  - Threads **shuffle values** within a warp:
    ```cpp
    sum += __shfl_down_sync(0xFFFFFFFF, sum, offset);
    ```
  - `offset` halves at each step (16, 8, 4, 2, 1), so after five steps lane 0 holds the sum of all 32 values **within its warp**.

  - **Lane 0 of each warp** stores the **partial sum** in global memory:
    ```cpp
    if (thread_warp_idx == 0) {
        g_odata[warpIdx] = sum;
    }
    ```

- The **partial sums** from each warp (16 of them for 512 threads) are copied to the CPU, which adds them up as we did before.
---
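
To see why lane 0 ends up with the full warp sum, the shuffle loop can be emulated on the CPU. This is a rough sketch for a full 32-lane warp: `shfl_down` and `warp_reduce` are hypothetical helpers that mimic the documented behaviour of `__shfl_down_sync`, where a lane whose source lane would fall outside the warp gets its own value back.

```cpp
#include <array>
#include <cassert>

constexpr int kWarpSize = 32;
using Warp = std::array<float, kWarpSize>;

// Emulates __shfl_down_sync for a full warp: lane i receives the value held
// by lane i + offset; lanes with no such source keep their own value.
Warp shfl_down(const Warp& v, int offset)
{
    Warp out{};
    for (int lane = 0; lane < kWarpSize; ++lane) {
        const int src = lane + offset;
        out[lane] = (src < kWarpSize) ? v[src] : v[lane];
    }
    return out;
}

// Emulates the kernel's loop: every lane adds the shuffled value, and the
// offset halves each step (16, 8, 4, 2, 1).
float warp_reduce(Warp sum)
{
    for (int offset = kWarpSize / 2; offset > 0; offset /= 2) {
        const Warp shifted = shfl_down(sum, offset);
        for (int lane = 0; lane < kWarpSize; ++lane) sum[lane] += shifted[lane];
    }
    return sum[0];  // only lane 0 is guaranteed to hold the full sum
}
```

The upper lanes accumulate values that are not meaningful partial sums, which is why the kernel writes out the result from lane 0 only.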

# (7) Extra

In [19]:
%%writefile reduction.cu

////////////////////////////////////////////////////////////////////////
//
// Practical 4 -- initial code for shared memory reduction for
//                a single block which is a power of two in size
//
////////////////////////////////////////////////////////////////////////

#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <math.h>
#include <float.h>

#include <helper_cuda.h>

////////////////////////////////////////////////////////////////////////
// CPU routine
////////////////////////////////////////////////////////////////////////

float reduction_gold(float* idata, int len)
{
  float sum = 0.0f;
  for(int i=0; i<len; i++) sum += idata[i];
  printf("\nCPU Result %f ", sum);
  return sum;
}

////////////////////////////////////////////////////////////////////////
// GPU routine
////////////////////////////////////////////////////////////////////////

__global__ void reduction(float *g_odata, float *g_idata)
{
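    // no shared memory needed: values are exchanged through warp shuffles

    int global_id = blockIdx.x * blockDim.x + threadIdx.x;
    int warpIdx = global_id / 32;  // warp index across all blocks (assumes blockDim.x is a multiple of 32)
    int thread_warp_idx = threadIdx.x % 32;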

    float sum = g_idata[global_id];

    for (int offset = warpSize/2; offset > 0; offset /= 2) {
      sum += __shfl_down_sync(0xFFFFFFFF, sum, offset);
    }

    // lane 0 of each warp writes that warp's partial sum
    if (thread_warp_idx == 0) {
        g_odata[warpIdx]= sum;
    }


}


////////////////////////////////////////////////////////////////////////
// Program main
////////////////////////////////////////////////////////////////////////

int main( int argc, const char** argv)
{
  int num_blocks, num_threads, num_elements, mem_size, shared_mem_size;

  float *h_data, *d_idata, *d_odata;

  // initialise card

  findCudaDevice(argc, argv);

  num_blocks   = 512;  // multiple Blocks
  num_threads  = 512;
  num_elements = num_blocks*num_threads;
  mem_size     = sizeof(float) * num_elements;

  int num_of_warps = ceil((float) num_threads/32) * num_blocks ;

  // allocate host memory to store the input data
  // and initialize to integer values between 0 and 10

  h_data = (float*) malloc(mem_size);

  for(int i = 0; i < num_elements; i++)
    h_data[i] = floorf(10.0f*(rand()/(float)RAND_MAX));

  // compute reference solution

  float sum = reduction_gold(h_data, num_elements);

  // allocate device memory input and output arrays

  checkCudaErrors( cudaMalloc((void**)&d_idata, mem_size) );
  checkCudaErrors( cudaMalloc((void**)&d_odata, sizeof(float)*num_of_warps) );

  // copy host memory to device input array

  checkCudaErrors( cudaMemcpy(d_idata, h_data, mem_size,
                              cudaMemcpyHostToDevice) );

  // execute the kernel

  shared_mem_size = sizeof(float) * num_threads;
  reduction<<<num_blocks,num_threads,shared_mem_size>>>(d_odata,d_idata);
  getLastCudaError("reduction kernel execution failed");

  // copy result from device to host
  cudaDeviceSynchronize();
  checkCudaErrors( cudaMemcpy(h_data, d_odata, sizeof(float)*num_of_warps,
                              cudaMemcpyDeviceToHost) );

  // check results
  float g_sum = 0.0f;
  for (int i=0; i<num_of_warps; i++){
    g_sum += h_data[i];
  }

  printf("\nGPU result = %f\n",g_sum);
  printf("\nCPU/GPU error = %f\n",g_sum-sum);


  // cleanup memory

  free(h_data);
  checkCudaErrors( cudaFree(d_idata) );
  checkCudaErrors( cudaFree(d_odata) );

  // CUDA exit -- needed to flush printf write buffer

  cudaDeviceReset();
}


Overwriting reduction.cu


In [20]:
!nvcc reduction.cu -o reduction -I. -lineinfo -arch=sm_70 --ptxas-options=-v --use_fast_math -lcudart

ptxas info    : 0 bytes gmem
ptxas info    : Compiling entry function '_Z9reductionPfS_' for 'sm_70'
ptxas info    : Function properties for _Z9reductionPfS_
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 10 registers, 368 bytes cmem[0]


In [21]:
!./reduction

GPU Device 0: "Turing" with compute capability 7.5


CPU Result 1179646.000000 
GPU result = 1179646.000000

CPU/GPU error = 0.000000


---


- This implementation extends the **warp-level reduction** from (7) to **multiple blocks**, so it can handle inputs larger than a single block.

- Instead of a **single block**, we now use **multiple blocks** (`num_blocks = 512`).
- Each block performs **warp-level reduction**. The warp index is now computed from the global thread index (`global_id / 32`), so every warp in every block writes to its own slot of `g_odata`.
- The final summation is performed on the **CPU** after collecting the partial results from all `16 * 512 = 8192` warps.
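
The output indexing can be sketched on the host. The helper names below are hypothetical; they compute the warp count with integer arithmetic instead of `ceil`, and a per-block slot index that agrees with `global_id / 32` whenever the block size is a multiple of 32.

```cpp
#include <cassert>

constexpr int kWarpSize = 32;

// Number of warps needed for one block (integer ceiling division).
int warps_per_block(int block_dim)
{
    assert(block_dim > 0);
    return (block_dim + kWarpSize - 1) / kWarpSize;
}

// Size of the output array: one slot per warp in the whole grid.
int total_warps(int num_blocks, int block_dim)
{
    return num_blocks * warps_per_block(block_dim);
}

// Output slot for the warp that contains the given thread.
int warp_slot(int block_idx, int thread_idx, int block_dim)
{
    return block_idx * warps_per_block(block_dim) + thread_idx / kWarpSize;
}
```

For a block size such as 48, `global_id / 32` would map the first warp of block 1 to slot 1, which block 0's second warp already uses, whereas `warp_slot` gives slot 2. A partial last warp would probably also need a narrower shuffle mask than `0xFFFFFFFF`, which this sketch does not cover.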



# Extra

In [None]:
from google.colab import runtime
runtime.unassign()