<a href="https://colab.research.google.com/github/Mamoro98/Cuda-Programming/blob/main/omer_GPU_Practical3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<h1>Omer Kamal Ali Ebead</h1>

# **CUDA Programming on NVIDIA GPUs, July 22-26, 2024**

# **Practical 3**

Again, make sure the correct runtime is being used: click the Runtime option at the top, then "Change runtime type", and select an appropriate GPU such as the T4.

Then use the command below to check the details of the GPU that is available to you.

In [1]:
!nvidia-smi


Wed Feb 12 07:05:23 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  Tesla T4                       Off |   00000000:00:04.0 Off |                    0 |
| N/A   49C    P8             10W /   70W |       0MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                

---

First we download two header files from the course webpage.

In [2]:
!wget https://people.maths.ox.ac.uk/gilesm/cuda/headers/helper_cuda.h
!wget https://people.maths.ox.ac.uk/gilesm/cuda/headers/helper_string.h


--2025-02-12 07:05:23--  https://people.maths.ox.ac.uk/gilesm/cuda/headers/helper_cuda.h
Resolving people.maths.ox.ac.uk (people.maths.ox.ac.uk)... 129.67.184.129, 2001:630:441:202::8143:b881
Connecting to people.maths.ox.ac.uk (people.maths.ox.ac.uk)|129.67.184.129|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 27832 (27K) [text/x-chdr]
Saving to: ‘helper_cuda.h’


2025-02-12 07:05:24 (200 KB/s) - ‘helper_cuda.h’ saved [27832/27832]

--2025-02-12 07:05:24--  https://people.maths.ox.ac.uk/gilesm/cuda/headers/helper_string.h
Resolving people.maths.ox.ac.uk (people.maths.ox.ac.uk)... 129.67.184.129, 2001:630:441:202::8143:b881
Connecting to people.maths.ox.ac.uk (people.maths.ox.ac.uk)|129.67.184.129|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 14875 (15K) [text/x-chdr]
Saving to: ‘helper_string.h’


2025-02-12 07:05:25 (359 KB/s) - ‘helper_string.h’ saved [14875/14875]





---

The next step is to create the file laplace3d.cu which includes within it a reference C++ routine against which the CUDA results are compared.

In [3]:
%%writefile laplace3d.cu

////////////////////////////////////////////////////////////////////////
//
// Program to solve Laplace equation on a regular 3D grid
//
////////////////////////////////////////////////////////////////////////

#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <math.h>

#include <helper_cuda.h>

////////////////////////////////////////////////////////////////////////
// define kernel block size
////////////////////////////////////////////////////////////////////////

#define BLOCK_X 16
#define BLOCK_Y 16

////////////////////////////////////////////////////////////////////////
// kernel function
////////////////////////////////////////////////////////////////////////

__global__ void GPU_laplace3d(long long NX, long long NY, long long NZ,
                              const float* __restrict__ d_u1,
                                    float* __restrict__ d_u2) // __restrict__: d_u1 and d_u2 must not overlap
{
  long long i, j, k, indg, IOFF, JOFF, KOFF;
  float     u2, sixth=1.0f/6.0f;

  //
  // define global indices and array offsets
  //

  i    = threadIdx.x + blockIdx.x*BLOCK_X;
  j    = threadIdx.y + blockIdx.y*BLOCK_Y;
  indg = i + j*NX;

  IOFF = 1;
  JOFF = NX;
  KOFF = NX*NY;

  if ( i>=0 && i<=NX-1 && j>=0 && j<=NY-1 ) {

    for (k=0; k<NZ; k++) {

      if (i==0 || i==NX-1 || j==0 || j==NY-1 || k==0 || k==NZ-1) {
        u2 = d_u1[indg];  // Dirichlet b.c.'s
      }
      else {
        u2 = ( d_u1[indg-IOFF] + d_u1[indg+IOFF]
             + d_u1[indg-JOFF] + d_u1[indg+JOFF]
             + d_u1[indg-KOFF] + d_u1[indg+KOFF] ) * sixth;
      }
      d_u2[indg] = u2;

      indg += KOFF;
    }
  }
}


////////////////////////////////////////////////////////////////////////
// Gold routine -- reference C++ code
////////////////////////////////////////////////////////////////////////
// same calculation on the CPU
void Gold_laplace3d(long long NX, long long NY, long long NZ, float* u1, float* u2)
{
  long long i, j, k, ind;
  float     sixth=1.0f/6.0f;  // predefining this improves performance more than 10%

  for (k=0; k<NZ; k++) {
    for (j=0; j<NY; j++) {
      for (i=0; i<NX; i++) {   // i loop innermost for sequential memory access
        ind = i + j*NX + k*NX*NY;

        if (i==0 || i==NX-1 || j==0 || j==NY-1|| k==0 || k==NZ-1) {
          u2[ind] = u1[ind];          // Dirichlet b.c.'s
        }
        else {
          u2[ind] = ( u1[ind-1    ] + u1[ind+1    ]
                    + u1[ind-NX   ] + u1[ind+NX   ]
                    + u1[ind-NX*NY] + u1[ind+NX*NY] ) * sixth;
        }
      }
    }
  }
}

////////////////////////////////////////////////////////////////////////
// Main program
////////////////////////////////////////////////////////////////////////

int main(int argc, const char **argv){

  int       NX=512, NY=512, NZ=512,
            REPEAT=20, bx, by, i, j, k;
  float    *h_u1, *h_u2, *h_foo,
           *d_u1, *d_u2, *d_foo;

  size_t    ind, bytes = sizeof(float) * NX*NY*NZ;

  printf("Grid dimensions: %d x %d x %d \n\n", NX, NY, NZ);

  // initialise card

  findCudaDevice(argc, argv);

  // initialise CUDA timing

  float milli;
  cudaEvent_t start, stop;
  cudaEventCreate(&start);
  cudaEventCreate(&stop);

  // allocate memory for arrays

  h_u1 = (float *)malloc(bytes);
  h_u2 = (float *)malloc(bytes);
  checkCudaErrors( cudaMalloc((void **)&d_u1, bytes) );
  checkCudaErrors( cudaMalloc((void **)&d_u2, bytes) );

  // initialise u1

  for (k=0; k<NZ; k++) {
    for (j=0; j<NY; j++) {
      for (i=0; i<NX; i++) {
        ind = i + j*NX + k*NX*NY;

        if (i==0 || i==NX-1 || j==0 || j==NY-1|| k==0 || k==NZ-1)
          h_u1[ind] = 1.0f;           // Dirichlet b.c.'s
        else
          h_u1[ind] = 0.0f;
      }
    }
  }

  // copy u1 to device

  cudaEventRecord(start);
  checkCudaErrors( cudaMemcpy(d_u1, h_u1, bytes,
                              cudaMemcpyHostToDevice) );
  cudaEventRecord(stop);
  cudaEventSynchronize(stop);
  cudaEventElapsedTime(&milli, start, stop);
  printf("Copy u1 to device: %.1f (ms) \n\n", milli);

  // Gold treatment

  cudaEventRecord(start);
  for (i=0; i<REPEAT; i++) {
    Gold_laplace3d(NX, NY, NZ, h_u1, h_u2);
    h_foo = h_u1; h_u1 = h_u2; h_u2 = h_foo;   // swap h_u1 and h_u2
  }

  cudaEventRecord(stop);
  cudaEventSynchronize(stop);
  cudaEventElapsedTime(&milli, start, stop);
  printf("%dx Gold_laplace3d: %.1f (ms) \n\n", REPEAT, milli);

  // Set up the execution configuration


  // number of blocks in each direction, rounded up (integer ceiling of NX/BLOCK_X and NY/BLOCK_Y)
  bx = 1 + (NX-1)/BLOCK_X;
  by = 1 + (NY-1)/BLOCK_Y;

  dim3 dimGrid(bx,by);
  dim3 dimBlock(BLOCK_X,BLOCK_Y);

  // Execute GPU kernel

  cudaEventRecord(start);

  for (i=0; i<REPEAT; i++) {
    GPU_laplace3d<<<dimGrid, dimBlock>>>(NX, NY, NZ, d_u1, d_u2);
    getLastCudaError("GPU_laplace3d execution failed\n");

    d_foo = d_u1; d_u1 = d_u2; d_u2 = d_foo;   // swap d_u1 and d_u2
  }

  cudaEventRecord(stop);
  cudaEventSynchronize(stop);
  cudaEventElapsedTime(&milli, start, stop);
  printf("%dx GPU_laplace3d: %.1f (ms) \n\n", REPEAT, milli);

  // Read back GPU results

  cudaEventRecord(start);
  checkCudaErrors( cudaMemcpy(h_u2, d_u1, bytes, cudaMemcpyDeviceToHost) );
  // if you want to check the error between the original array and the Jacobi array
  // checkCudaErrors( cudaMemcpy(h_u1, d_u2, bytes, cudaMemcpyDeviceToHost) );

  cudaEventRecord(stop);
  cudaEventSynchronize(stop);
  cudaEventElapsedTime(&milli, start, stop);
  printf("Copy u2 to host: %.1f (ms) \n\n", milli);

  // error check

  float err = 0.0;

  for (k=0; k<NZ; k++) {
    for (j=0; j<NY; j++) {
      for (i=0; i<NX; i++) {
        ind = i + j*NX + k*NX*NY;
        err += (h_u1[ind]-h_u2[ind])*(h_u1[ind]-h_u2[ind]);
      }
    }
  }

  printf("rms error = %13f \n",sqrt(err/ (float)(NX*NY*NZ)));

 // Release GPU and CPU memory

  checkCudaErrors( cudaFree(d_u1) );
  checkCudaErrors( cudaFree(d_u2) );
  free(h_u1);
  free(h_u2);

  cudaDeviceReset();

}


Writing laplace3d.cu



---

We can now compile and run the executable.


In [4]:
!nvcc laplace3d.cu -o laplace3d -I. -lineinfo -arch=sm_70 --ptxas-options=-v --use_fast_math -lcudart

ptxas info    : 0 bytes gmem
ptxas info    : Compiling entry function '_Z13GPU_laplace3dxxxPKfPf' for 'sm_70'
ptxas info    : Function properties for _Z13GPU_laplace3dxxxPKfPf
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 64 registers, 392 bytes cmem[0]


In [5]:
!./laplace3d

Grid dimensions: 512 x 512 x 512 

GPU Device 0: "Turing" with compute capability 7.5

Copy u1 to device: 119.4 (ms) 

20x Gold_laplace3d: 27291.0 (ms) 

20x GPU_laplace3d: 181.7 (ms) 

Copy u2 to host: 113.1 (ms) 

rms error =      0.000000 
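The timings above can be turned into two summary figures: the CPU-to-GPU speedup and an effective memory bandwidth for the kernel. The sketch below is an illustration only. The function names are hypothetical, and the bandwidth figure assumes an idealised traffic model in which each sweep reads every float once and writes every float once, which ignores any repeated neighbour reads that miss in cache.

```cpp
#include <cstddef>

// Ratio of CPU time to GPU time for the same number of sweeps.
double speedup(double cpu_ms, double gpu_ms)
{
  return cpu_ms / gpu_ms;
}

// Effective bandwidth in GB/s (1 GB = 1e9 bytes) for `repeat` sweeps over
// an n x n x n float grid, assuming one read and one write per grid point
// per sweep (an idealised lower bound on the real memory traffic).
double effective_bandwidth_gbs(std::size_t n, int repeat, double ms)
{
  const double points = static_cast<double>(n) * n * n;
  const double bytes  = 2.0 * sizeof(float) * points * repeat;
  return bytes / (ms * 1.0e-3) / 1.0e9;
}
```

With the numbers printed above, `speedup(27291.0, 181.7)` is roughly 150 and `effective_bandwidth_gbs(512, 20, 181.7)` is roughly 118 GB/s.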


# Question 4
Increase the grid size to 1024 in each dimension. The CPU reference run and the error check are commented out for this size.
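It is worth checking the memory footprint before running at this size. The sketch below, with hypothetical function names, computes the size of one array in 64-bit arithmetic and shows what a 32-bit byte counter would hold instead. The program stores its byte count in a `size_t`, which avoids this wrap-around on a 64-bit system.

```cpp
#include <cstddef>
#include <cstdint>

// Bytes needed for one nx x ny x nz array of floats, computed entirely in
// 64-bit arithmetic so the product cannot wrap for the sizes used here.
std::uint64_t array_bytes(std::uint64_t nx, std::uint64_t ny, std::uint64_t nz)
{
  return nx * ny * nz * sizeof(float);
}

// The same byte count truncated to 32 bits, to illustrate what would be
// stored if the size were kept in a 32-bit unsigned variable.
std::uint32_t array_bytes_32bit(std::uint64_t nx, std::uint64_t ny, std::uint64_t nz)
{
  return static_cast<std::uint32_t>(array_bytes(nx, ny, nz));
}
```

For 1024 × 1024 × 1024, `array_bytes` returns 4294967296 (4 GiB), while the 32-bit version wraps to 0. Two such arrays on the device therefore need 8 GiB of the 15360 MiB reported by `nvidia-smi`, and the two host arrays need another 8 GiB of CPU memory.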

In [6]:
%%writefile laplace3d.cu

////////////////////////////////////////////////////////////////////////
//
// Program to solve Laplace equation on a regular 3D grid
//
////////////////////////////////////////////////////////////////////////

#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <math.h>

#include <helper_cuda.h>

////////////////////////////////////////////////////////////////////////
// define kernel block size
////////////////////////////////////////////////////////////////////////

#define BLOCK_X 16
#define BLOCK_Y 16

////////////////////////////////////////////////////////////////////////
// kernel function
////////////////////////////////////////////////////////////////////////

__global__ void GPU_laplace3d(long long NX, long long NY, long long NZ,
                              const float* __restrict__ d_u1,
                                    float* __restrict__ d_u2) // __restrict__: d_u1 and d_u2 must not overlap
{
  long long i, j, k, indg, IOFF, JOFF, KOFF;
  float     u2, sixth=1.0f/6.0f;

  //
  // define global indices and array offsets
  //

  i    = threadIdx.x + blockIdx.x*BLOCK_X;
  j    = threadIdx.y + blockIdx.y*BLOCK_Y;
  indg = i + j*NX;

  IOFF = 1;
  JOFF = NX;
  KOFF = NX*NY;

  if ( i>=0 && i<=NX-1 && j>=0 && j<=NY-1 ) {

    for (k=0; k<NZ; k++) {

      if (i==0 || i==NX-1 || j==0 || j==NY-1 || k==0 || k==NZ-1) {
        u2 = d_u1[indg];  // Dirichlet b.c.'s
      }
      else {
        u2 = ( d_u1[indg-IOFF] + d_u1[indg+IOFF]
             + d_u1[indg-JOFF] + d_u1[indg+JOFF]
             + d_u1[indg-KOFF] + d_u1[indg+KOFF] ) * sixth;
      }
      d_u2[indg] = u2;

      indg += KOFF;
    }
  }
}


////////////////////////////////////////////////////////////////////////
// Gold routine -- reference C++ code
////////////////////////////////////////////////////////////////////////
// same calculation on the CPU
void Gold_laplace3d(long long NX, long long NY, long long NZ, float* u1, float* u2)
{
  long long i, j, k, ind;
  float     sixth=1.0f/6.0f;  // predefining this improves performance more than 10%

  for (k=0; k<NZ; k++) {
    for (j=0; j<NY; j++) {
      for (i=0; i<NX; i++) {   // i loop innermost for sequential memory access
        ind = i + j*NX + k*NX*NY;

        if (i==0 || i==NX-1 || j==0 || j==NY-1|| k==0 || k==NZ-1) {
          u2[ind] = u1[ind];          // Dirichlet b.c.'s
        }
        else {
          u2[ind] = ( u1[ind-1    ] + u1[ind+1    ]
                    + u1[ind-NX   ] + u1[ind+NX   ]
                    + u1[ind-NX*NY] + u1[ind+NX*NY] ) * sixth;
        }
      }
    }
  }
}

////////////////////////////////////////////////////////////////////////
// Main program
////////////////////////////////////////////////////////////////////////

int main(int argc, const char **argv){

  int       NX=1024, NY=1024, NZ=1024,
            REPEAT=20, bx, by, i, j, k;
  float    *h_u1, *h_u2,
           *d_u1, *d_u2, *d_foo;

  size_t    ind, bytes = sizeof(float) * NX*NY*NZ;

  printf("Grid dimensions: %d x %d x %d \n\n", NX, NY, NZ);

  // initialise card

  findCudaDevice(argc, argv);

  // initialise CUDA timing

  float milli;
  cudaEvent_t start, stop;
  cudaEventCreate(&start);
  cudaEventCreate(&stop);

  // allocate memory for arrays

  h_u1 = (float *)malloc(bytes);
  h_u2 = (float *)malloc(bytes);
  checkCudaErrors( cudaMalloc((void **)&d_u1, bytes) );
  checkCudaErrors( cudaMalloc((void **)&d_u2, bytes) );

  // initialise u1

  for (k=0; k<NZ; k++) {
    for (j=0; j<NY; j++) {
      for (i=0; i<NX; i++) {
        ind = i + j*NX + k*NX*NY;

        if (i==0 || i==NX-1 || j==0 || j==NY-1|| k==0 || k==NZ-1)
          h_u1[ind] = 1.0f;           // Dirichlet b.c.'s
        else
          h_u1[ind] = 0.0f;
      }
    }
  }

  // copy u1 to device

  cudaEventRecord(start);
  checkCudaErrors( cudaMemcpy(d_u1, h_u1, bytes,
                              cudaMemcpyHostToDevice) );
  cudaEventRecord(stop);
  cudaEventSynchronize(stop);
  cudaEventElapsedTime(&milli, start, stop);
  printf("Copy u1 to device: %.1f (ms) \n\n", milli);


  // Gold treatment (commented out for the 1024^3 run)

  // cudaEventRecord(start);
  // for (i=0; i<REPEAT; i++) {
    // Gold_laplace3d(NX, NY, NZ, h_u1, h_u2);
    // h_foo = h_u1; h_u1 = h_u2; h_u2 = h_foo;   // swap h_u1 and h_u2
  // }

  // cudaEventRecord(stop);
  // cudaEventSynchronize(stop);
  // cudaEventElapsedTime(&milli, start, stop);
  // printf("%dx Gold_laplace3d: %.1f (ms) \n\n", REPEAT, milli);

  // Set up the execution configuration


  // number of blocks in each direction, rounded up (integer ceiling of NX/BLOCK_X and NY/BLOCK_Y)
  bx = 1 + (NX-1)/BLOCK_X;
  by = 1 + (NY-1)/BLOCK_Y;

  dim3 dimGrid(bx,by);
  dim3 dimBlock(BLOCK_X,BLOCK_Y);

  // Execute GPU kernel

  cudaEventRecord(start);

  for (i=0; i<REPEAT; i++) {
    GPU_laplace3d<<<dimGrid, dimBlock>>>(NX, NY, NZ, d_u1, d_u2);
    getLastCudaError("GPU_laplace3d execution failed\n");

    d_foo = d_u1; d_u1 = d_u2; d_u2 = d_foo;   // swap d_u1 and d_u2
  }

  cudaEventRecord(stop);
  cudaEventSynchronize(stop);
  cudaEventElapsedTime(&milli, start, stop);
  printf("%dx GPU_laplace3d: %.1f (ms) \n\n", REPEAT, milli);

  // Read back GPU results

  cudaEventRecord(start);
  checkCudaErrors( cudaMemcpy(h_u2, d_u1, bytes, cudaMemcpyDeviceToHost) );
  // if you want to check the error between the original array and the Jacobi array
  // checkCudaErrors( cudaMemcpy(h_u1, d_u2, bytes, cudaMemcpyDeviceToHost) );

  cudaEventRecord(stop);
  cudaEventSynchronize(stop);
  cudaEventElapsedTime(&milli, start, stop);
  printf("Copy u2 to host: %.1f (ms) \n\n", milli);

  // error check

  // float err = 0.0;

  //for (k=0; k<NZ; k++) {
  //  for (j=0; j<NY; j++) {
  //    for (i=0; i<NX; i++) {
  //      ind = i + j*NX + k*NX*NY;
  //      err += (h_u1[ind]-h_u2[ind])*(h_u1[ind]-h_u2[ind]);
  //    }
  //  }
  // }

  // printf("rms error = %13f \n",sqrt(err/ (float)(NX*NY*NZ)));

 // Release GPU and CPU memory

  checkCudaErrors( cudaFree(d_u1) );
  checkCudaErrors( cudaFree(d_u2) );
  free(h_u1);
  free(h_u2);

  cudaDeviceReset();

}


Overwriting laplace3d.cu


In [7]:
!nvcc laplace3d.cu -o laplace3d -I. -lineinfo -arch=sm_70 --ptxas-options=-v --use_fast_math -lcudart

ptxas info    : 0 bytes gmem
ptxas info    : Compiling entry function '_Z13GPU_laplace3dxxxPKfPf' for 'sm_70'
ptxas info    : Function properties for _Z13GPU_laplace3dxxxPKfPf
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 64 registers, 392 bytes cmem[0]


In [8]:
!./laplace3d

Grid dimensions: 1024 x 1024 x 1024 

GPU Device 0: "Turing" with compute capability 7.5

Copy u1 to device: 906.6 (ms) 

20x GPU_laplace3d: 1951.3 (ms) 

Copy u2 to host: 3638.5 (ms) 



# Question 5
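Change the block size. The code below uses 128 × 8; the table after the run collects the timings for the other block sizes tried.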

In [9]:
%%writefile laplace3d.cu

////////////////////////////////////////////////////////////////////////
//
// Program to solve Laplace equation on a regular 3D grid
//
////////////////////////////////////////////////////////////////////////

#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <math.h>

#include <helper_cuda.h>

////////////////////////////////////////////////////////////////////////
// define kernel block size
////////////////////////////////////////////////////////////////////////

#define BLOCK_X 128
#define BLOCK_Y 8

////////////////////////////////////////////////////////////////////////
// kernel function
////////////////////////////////////////////////////////////////////////

__global__ void GPU_laplace3d(long long NX, long long NY, long long NZ,
                              const float* __restrict__ d_u1,
                                    float* __restrict__ d_u2) // __restrict__: d_u1 and d_u2 must not overlap
{
  long long i, j, k, indg, IOFF, JOFF, KOFF;
  float     u2, sixth=1.0f/6.0f;

  //
  // define global indices and array offsets
  //

  i    = threadIdx.x + blockIdx.x*BLOCK_X;
  j    = threadIdx.y + blockIdx.y*BLOCK_Y;
  indg = i + j*NX;

  IOFF = 1;
  JOFF = NX;
  KOFF = NX*NY;

  if ( i>=0 && i<=NX-1 && j>=0 && j<=NY-1 ) {

    for (k=0; k<NZ; k++) {

      if (i==0 || i==NX-1 || j==0 || j==NY-1 || k==0 || k==NZ-1) {
        u2 = d_u1[indg];  // Dirichlet b.c.'s
      }
      else {
        u2 = ( d_u1[indg-IOFF] + d_u1[indg+IOFF]
             + d_u1[indg-JOFF] + d_u1[indg+JOFF]
             + d_u1[indg-KOFF] + d_u1[indg+KOFF] ) * sixth;
      }
      d_u2[indg] = u2;

      indg += KOFF;
    }
  }
}



////////////////////////////////////////////////////////////////////////
// Main program
////////////////////////////////////////////////////////////////////////

int main(int argc, const char **argv){

  int       NX=1024, NY=1024, NZ=1024,
            REPEAT=20, bx, by, i, j, k;
  float    *h_u1, *h_u2,
           *d_u1, *d_u2, *d_foo;

  size_t    ind, bytes = sizeof(float) * NX*NY*NZ;

  printf("Grid dimensions: %d x %d x %d \n\n", NX, NY, NZ);

  // initialise card

  findCudaDevice(argc, argv);

  // initialise CUDA timing

  float milli;
  cudaEvent_t start, stop;
  cudaEventCreate(&start);
  cudaEventCreate(&stop);

  // allocate memory for arrays

  h_u1 = (float *)malloc(bytes);
  h_u2 = (float *)malloc(bytes);
  checkCudaErrors( cudaMalloc((void **)&d_u1, bytes) );
  checkCudaErrors( cudaMalloc((void **)&d_u2, bytes) );

  // initialise u1

  for (k=0; k<NZ; k++) {
    for (j=0; j<NY; j++) {
      for (i=0; i<NX; i++) {
        ind = i + j*NX + k*NX*NY;

        if (i==0 || i==NX-1 || j==0 || j==NY-1|| k==0 || k==NZ-1)
          h_u1[ind] = 1.0f;           // Dirichlet b.c.'s
        else
          h_u1[ind] = 0.0f;
      }
    }
  }

  // copy u1 to device

  cudaEventRecord(start);
  checkCudaErrors( cudaMemcpy(d_u1, h_u1, bytes,
                              cudaMemcpyHostToDevice) );
  cudaEventRecord(stop);
  cudaEventSynchronize(stop);
  cudaEventElapsedTime(&milli, start, stop);
  printf("Copy u1 to device: %.1f (ms) \n\n", milli);



  // integer ceiling division: number of blocks needed to cover NX and NY
  bx = 1 + (NX-1)/BLOCK_X;
  by = 1 + (NY-1)/BLOCK_Y;

  dim3 dimGrid(bx,by);
  dim3 dimBlock(BLOCK_X,BLOCK_Y);

  // Execute GPU kernel

  cudaEventRecord(start);

  for (i=0; i<REPEAT; i++) {
    GPU_laplace3d<<<dimGrid, dimBlock>>>(NX, NY, NZ, d_u1, d_u2);
    getLastCudaError("GPU_laplace3d execution failed\n");

    d_foo = d_u1; d_u1 = d_u2; d_u2 = d_foo;   // swap d_u1 and d_u2
  }

  cudaEventRecord(stop);
  cudaEventSynchronize(stop);
  cudaEventElapsedTime(&milli, start, stop);
  printf("%dx GPU_laplace3d: %.1f (ms) \n\n", REPEAT, milli);

  // Read back GPU results

  cudaEventRecord(start);
  checkCudaErrors( cudaMemcpy(h_u2, d_u1, bytes, cudaMemcpyDeviceToHost) );
  // if you want to check the error between the original array and the Jacobi array
  // checkCudaErrors( cudaMemcpy(h_u1, d_u2, bytes, cudaMemcpyDeviceToHost) );

  cudaEventRecord(stop);
  cudaEventSynchronize(stop);
  cudaEventElapsedTime(&milli, start, stop);
  printf("Copy u2 to host: %.1f (ms) \n\n", milli);

  // error check

  // float err = 0.0;

  //for (k=0; k<NZ; k++) {
  //  for (j=0; j<NY; j++) {
  //    for (i=0; i<NX; i++) {
  //      ind = i + j*NX + k*NX*NY;
  //      err += (h_u1[ind]-h_u2[ind])*(h_u1[ind]-h_u2[ind]);
  //    }
  //  }
  // }

  // printf("rms error = %13f \n",sqrt(err/ (float)(NX*NY*NZ)));

 // Release GPU and CPU memory

  checkCudaErrors( cudaFree(d_u1) );
  checkCudaErrors( cudaFree(d_u2) );
  free(h_u1);
  free(h_u2);

  cudaDeviceReset();

}


Overwriting laplace3d.cu


In [10]:
!nvcc laplace3d.cu -o laplace3d -I. -lineinfo -arch=sm_70 --ptxas-options=-v --use_fast_math -lcudart

ptxas info    : 0 bytes gmem
ptxas info    : Compiling entry function '_Z13GPU_laplace3dxxxPKfPf' for 'sm_70'
ptxas info    : Function properties for _Z13GPU_laplace3dxxxPKfPf
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 64 registers, 392 bytes cmem[0]


In [11]:
!./laplace3d

Grid dimensions: 1024 x 1024 x 1024 

GPU Device 0: "Turing" with compute capability 7.5

Copy u1 to device: 984.5 (ms) 

20x GPU_laplace3d: 1231.8 (ms) 

Copy u2 to host: 3036.1 (ms) 



$$
\begin{array}{|c|c|}
\hline
\textbf{Block Size (X × Y)} & \textbf{Execution Time (ms)} \\
\hline
16 \times 16  & 1956 \\
32 \times 32  & 1282 \\
32 \times 8   & 1259 \\
64 \times 8   & 1250 \\
128 \times 8  & 1231 \\
\hline
\end{array}
$$

**The best execution time is 1231 ms, obtained with a 128 × 8 block size.**
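Each block size in the table implies a different launch configuration. The sketch below reproduces the program's rounding-up formula `1 + (N-1)/BLOCK` on the host and reports the resulting grid. The struct `Launch2D` and the function names are hypothetical and introduced only for this illustration.

```cpp
// Launch configuration implied by a 2D block size (hypothetical helper type).
struct Launch2D {
  int bx, by;                   // number of blocks in x and y
  long long threads_per_block;  // block_x * block_y
  long long blocks;             // bx * by
};

// Integer ceiling of n / block, written the same way as in the program.
int blocks_needed(int n, int block)
{
  return 1 + (n - 1) / block;
}

// Grid dimensions and totals for an nx x ny plane of threads.
Launch2D launch_2d(int nx, int ny, int block_x, int block_y)
{
  Launch2D cfg;
  cfg.bx = blocks_needed(nx, block_x);
  cfg.by = blocks_needed(ny, block_y);
  cfg.threads_per_block = static_cast<long long>(block_x) * block_y;
  cfg.blocks = static_cast<long long>(cfg.bx) * cfg.by;
  return cfg;
}
```

For NX = NY = 1024, a 16 × 16 block gives a 64 × 64 grid of 4096 blocks with 256 threads each, while 128 × 8 gives an 8 × 128 grid of 1024 blocks with 1024 threads each, which is the maximum number of threads CUDA allows in one block.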

# Question 6

Use a 3D grid of thread blocks, with each thread handling a single grid point.

In [12]:
%%writefile laplace3d_new.cu

//
// Program to solve Laplace equation on a regular 3D grid
//

#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <math.h>

#include <helper_cuda.h>

////////////////////////////////////////////////////////////////////////
// define kernel block size
////////////////////////////////////////////////////////////////////////

#define BLOCK_X 64
#define BLOCK_Y 1
#define BLOCK_Z 4

////////////////////////////////////////////////////////////////////////
// kernel function
////////////////////////////////////////////////////////////////////////

__global__ void GPU_laplace3d(long long NX, long long NY, long long NZ,
                              const float* __restrict__ d_u1,
                                    float* __restrict__ d_u2)
{
  long long i, j, k, indg, IOFF, JOFF, KOFF;
  float     u2, sixth=1.0f/6.0f;

  //
  // define global indices and array offsets
  //

  i    = threadIdx.x + blockIdx.x*BLOCK_X;
  j    = threadIdx.y + blockIdx.y*BLOCK_Y;
  k    = threadIdx.z + blockIdx.z*BLOCK_Z;

  IOFF = 1;
  JOFF = NX;
  KOFF = NX*NY;

  indg = i + j*JOFF + k*KOFF;

  if (i>=0 && i<=NX-1 && j>=0 && j<=NY-1 && k>=0 && k<=NZ-1) {
    if (i==0 || i==NX-1 || j==0 || j==NY-1 || k==0 || k==NZ-1) {
      u2 = d_u1[indg];  // Dirichlet b.c.'s
    }
    else {
      u2 = ( d_u1[indg-IOFF] + d_u1[indg+IOFF]
           + d_u1[indg-JOFF] + d_u1[indg+JOFF]
           + d_u1[indg-KOFF] + d_u1[indg+KOFF] ) * sixth;
    }
    d_u2[indg] = u2;
  }
}

////////////////////////////////////////////////////////////////////////
// Gold routine -- reference C++ code
////////////////////////////////////////////////////////////////////////

void Gold_laplace3d(long long NX, long long NY, long long NZ, float* u1, float* u2)
{
  long long i, j, k, ind;
  float     sixth=1.0f/6.0f;  // predefining this improves performance more than 10%

  for (k=0; k<NZ; k++) {
    for (j=0; j<NY; j++) {
      for (i=0; i<NX; i++) {   // i loop innermost for sequential memory access
        ind = i + j*NX + k*NX*NY;

        if (i==0 || i==NX-1 || j==0 || j==NY-1|| k==0 || k==NZ-1) {
          u2[ind] = u1[ind];          // Dirichlet b.c.'s
        }
        else {
          u2[ind] = ( u1[ind-1    ] + u1[ind+1    ]
                    + u1[ind-NX   ] + u1[ind+NX   ]
                    + u1[ind-NX*NY] + u1[ind+NX*NY] ) * sixth;
        }
      }
    }
  }
}


////////////////////////////////////////////////////////////////////////
// Main program
////////////////////////////////////////////////////////////////////////

int main(int argc, const char **argv){

  int       NX=1024, NY=1024, NZ=1024,
            REPEAT=20, bx, by, bz, i, j, k;
  float    *h_u1, *h_u2, *h_foo,
           *d_u1, *d_u2, *d_foo;

  size_t    ind, bytes = sizeof(float) * NX*NY*NZ;

  printf("Grid dimensions: %d x %d x %d \n\n", NX, NY, NZ);

  // initialise card

  findCudaDevice(argc, argv);

  // initialise CUDA timing

  float milli;
  cudaEvent_t start, stop;
  cudaEventCreate(&start);
  cudaEventCreate(&stop);

  // allocate memory for arrays

  h_u1 = (float *)malloc(bytes);
  h_u2 = (float *)malloc(bytes);
  checkCudaErrors( cudaMalloc((void **)&d_u1, bytes) );
  checkCudaErrors( cudaMalloc((void **)&d_u2, bytes) );

  // initialise u1

  for (k=0; k<NZ; k++) {
    for (j=0; j<NY; j++) {
      for (i=0; i<NX; i++) {
        ind = i + j*NX + k*NX*NY;

        if (i==0 || i==NX-1 || j==0 || j==NY-1|| k==0 || k==NZ-1)
          h_u1[ind] = 1.0f;           // Dirichlet b.c.'s
        else
          h_u1[ind] = 0.0f;
      }
    }
  }

  // copy u1 to device

  cudaEventRecord(start);
  checkCudaErrors( cudaMemcpy(d_u1, h_u1, bytes,
                              cudaMemcpyHostToDevice) );
  cudaEventRecord(stop);
  cudaEventSynchronize(stop);
  cudaEventElapsedTime(&milli, start, stop);
  printf("Copy u1 to device: %.1f (ms) \n\n", milli);



  // Set up the execution configuration

  bx = 1 + (NX-1)/BLOCK_X;
  by = 1 + (NY-1)/BLOCK_Y;
  bz = 1 + (NZ-1)/BLOCK_Z;

  dim3 dimGrid(bx,by,bz);
  dim3 dimBlock(BLOCK_X,BLOCK_Y,BLOCK_Z);

  // Execute GPU kernel

  cudaEventRecord(start);

  for (i=0; i<REPEAT; i++) {
    GPU_laplace3d<<<dimGrid, dimBlock>>>(NX, NY, NZ, d_u1, d_u2);
    getLastCudaError("GPU_laplace3d execution failed\n");

    d_foo = d_u1; d_u1 = d_u2; d_u2 = d_foo;   // swap d_u1 and d_u2
  }

  cudaEventRecord(stop);
  cudaEventSynchronize(stop);
  cudaEventElapsedTime(&milli, start, stop);
  printf("%dx GPU_laplace3d_new: %.1f (ms) \n\n", REPEAT, milli);

  // Read back GPU results

  cudaEventRecord(start);
  checkCudaErrors( cudaMemcpy(h_u2, d_u1, bytes, cudaMemcpyDeviceToHost) );
  cudaEventRecord(stop);
  cudaEventSynchronize(stop);
  cudaEventElapsedTime(&milli, start, stop);
  printf("Copy u2 to host: %.1f (ms) \n\n", milli);



 // Release GPU and CPU memory

  checkCudaErrors( cudaFree(d_u1) );
  checkCudaErrors( cudaFree(d_u2) );
  free(h_u1);
  free(h_u2);

  cudaDeviceReset();
}


Writing laplace3d_new.cu


In [13]:
!nvcc laplace3d_new.cu -o laplace3d_new -I. -lineinfo -arch=sm_70 --ptxas-options=-v --use_fast_math -lcudart

    float *h_u1, *h_u2, *h_foo,
                         ^


ptxas info    : 0 bytes gmem
ptxas info    : Compiling entry function '_Z13GPU_laplace3dxxxPKfPf' for 'sm_70'
ptxas info    : Function properties for _Z13GPU_laplace3dxxxPKfPf
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 21 registers, 392 bytes cmem[0]


In [14]:
!./laplace3d_new

Grid dimensions: 1024 x 1024 x 1024 

GPU Device 0: "Turing" with compute capability 7.5

Copy u1 to device: 930.9 (ms) 

20x GPU_laplace3d_new: 900.9 (ms) 

Copy u2 to host: 3025.3 (ms) 



$$
\begin{array}{|c|c|}
\hline
\textbf{Block Size (X × Y × Z)} & \textbf{Execution Time (ms)} \\
\hline
8 \times 8 \times 8   & 1184 \\
8 \times 8 \times 4   & 1147 \\
16 \times 4 \times 4  & 960 \\
16 \times 8 \times 4  & 1036 \\
32 \times 4 \times 4  & 943 \\
32 \times 4 \times 2  & 1035 \\
32 \times 2 \times 4  & 905 \\
64 \times 2 \times 2  & 1030 \\
64 \times 4 \times 1  & 1304  \\
64 \times 1 \times 4  & 900 \\
\hline
\end{array}
$$


**The best execution time is 900 ms, obtained with a 64 × 1 × 4 block size.**
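In the 3D version every thread computes its own `(i, j, k)` and flattens it to a single array index. The host-only sketch below mirrors that index arithmetic so it can be checked without a GPU. The names `Index3`, `flatten`, `unflatten` and `total_blocks` are hypothetical. The sketch shows only the address mapping: it does not by itself explain every row of the timing table.

```cpp
// A grid point (hypothetical helper type for this illustration).
struct Index3 { long long i, j, k; };

// Same layout as the kernel: i varies fastest, then j, then k.
long long flatten(Index3 p, long long nx, long long ny)
{
  return p.i + p.j * nx + p.k * nx * ny;
}

// Inverse mapping from a flat array index back to (i, j, k).
Index3 unflatten(long long ind, long long nx, long long ny)
{
  Index3 p;
  p.k = ind / (nx * ny);
  p.j = (ind % (nx * ny)) / nx;
  p.i = ind % nx;
  return p;
}

// Number of thread blocks launched when each thread owns one grid point,
// using the program's rounding-up formula in each direction.
long long total_blocks(long long nx, long long ny, long long nz,
                       long long block_x, long long block_y, long long block_z)
{
  return (1 + (nx - 1) / block_x)
       * (1 + (ny - 1) / block_y)
       * (1 + (nz - 1) / block_z);
}
```

Neighbouring values of `i` map to adjacent array elements, so threads with consecutive `threadIdx.x` read consecutive addresses. For the 64 × 1 × 4 block on the 1024³ grid, `total_blocks` gives 16 × 1024 × 256 = 4194304 blocks of 256 threads each.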

# Question 7


---

The next commands count the FP32 and integer instructions executed by the two versions.

In [15]:
!ncu --metrics "smsp__sass_thread_inst_executed_op_fp32_pred_on.sum,smsp__sass_thread_inst_executed_op_integer_pred_on.sum" ./laplace3d
!ncu --metrics "smsp__sass_thread_inst_executed_op_fp32_pred_on.sum,smsp__sass_thread_inst_executed_op_integer_pred_on.sum" ./laplace3d_new

Grid dimensions: 1024 x 1024 x 1024 

==PROF== Connected to process 1958 (/content/laplace3d)
GPU Device 0: "Turing" with compute capability 7.5

Copy u1 to device: 966.5 (ms) 

==PROF== Profiling "GPU_laplace3d" - 0: 0%....50%....100% - 1 pass
==PROF== Profiling "GPU_laplace3d" - 1: 0%....50%....100% - 1 pass
==PROF== Profiling "GPU_laplace3d" - 2: 0%....50%....100% - 1 pass
==PROF== Profiling "GPU_laplace3d" - 3: 0%....50%....100% - 1 pass
==PROF== Profiling "GPU_laplace3d" - 4: 0%....50%....100% - 1 pass
==PROF== Profiling "GPU_laplace3d" - 5: 0%....50%....100% - 1 pass
==PROF== Profiling "GPU_laplace3d" - 6: 0%....50%....100% - 1 pass
==PROF== Profiling "GPU_laplace3d" - 7: 0%....50%....100% - 1 pass
==PROF== Profiling "GPU_laplace3d" - 8: 0%....50%....100% - 1 pass
==PROF== Profiling "GPU_laplace3d" - 9: 0%....50%....100% - 1 pass
==PROF== Profiling "GPU_laplace3d" - 10: 0%....50%....100% - 1 pass
==PROF== Profiling "GPU_laplace3d" - 11: 0%....50%....100% - 1 pass
==PROF== Profili

$$
\begin{array}{|c|c|c|}
\hline
\textbf{Version} & \textbf{FP32 Operations (Billion)} & \textbf{Integer Operations (Billion)} \\
\hline
\text{laplace3d} & 6.4  & 17  \\
\text{laplace3d_new} & 6.4  & 66  \\
\hline
\end{array}
$$
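The FP32 figure can be sanity-checked by hand. The sketch below assumes that the table reports one kernel launch and that each of the five additions and the one multiplication per interior point counts as a single FP32 instruction. Both are assumptions of this illustration, and the function names are hypothetical.

```cpp
#include <cstdint>

// Number of interior points of an n x n x n grid (boundary points are
// copied and perform no floating-point arithmetic).
std::uint64_t interior_points(std::uint64_t n)
{
  return (n - 2) * (n - 2) * (n - 2);
}

// Expected FP32 instructions in one sweep: 5 adds + 1 multiply per
// interior point, under the one-instruction-per-operation assumption.
std::uint64_t fp32_ops_per_sweep(std::uint64_t n)
{
  return 6 * interior_points(n);
}

// Average number of integer instructions per grid point, given a measured
// total for one sweep over an n x n x n grid.
double int_ops_per_point(double total_int_ops, std::uint64_t n)
{
  return total_int_ops / static_cast<double>(n * n * n);
}
```

For n = 1024 this gives 6 × 1022³ = 6404775888, in line with the 6.4 billion reported for both versions. Dividing the integer totals by 1024³ gives roughly 16 integer instructions per grid point for `laplace3d` and roughly 61 for `laplace3d_new`, so the one-thread-per-point version spends about four times as many integer instructions on the same arithmetic.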


# Question 8

In [16]:
!ncu ./laplace3d

Grid dimensions: 1024 x 1024 x 1024 

==PROF== Connected to process 2699 (/content/laplace3d)
GPU Device 0: "Turing" with compute capability 7.5

Copy u1 to device: 965.7 (ms) 

==PROF== Profiling "GPU_laplace3d" - 0: 0%....50%....100% - 9 passes
==PROF== Profiling "GPU_laplace3d" - 1: 0%....50%....100% - 9 passes
==PROF== Profiling "GPU_laplace3d" - 2: 0%....50%....100% - 9 passes
==PROF== Profiling "GPU_laplace3d" - 3: 0%....50%....100% - 9 passes
==PROF== Profiling "GPU_laplace3d" - 4: 0%....50%....100% - 9 passes
==PROF== Profiling "GPU_laplace3d" - 5: 0%....50%....100% - 9 passes
==PROF== Profiling "GPU_laplace3d" - 6: 0%....50%....100% - 9 passes
==PROF== Profiling "GPU_laplace3d" - 7: 0%....50%....100% - 9 passes
==PROF== Profiling "GPU_laplace3d" - 8: 0%....50%....100% - 9 passes
==PROF== Profiling "GPU_laplace3d" - 9: 0%....50%....100% - 9 passes
==PROF== Profiling "GPU_laplace3d" - 10: 0%....50%....100% - 9 passes
==PROF== Profiling "GPU_laplace3d" - 11: 0%....50%....100% - 

**laplace3d**

$$
\begin{array}{|c|c|c|}
\hline
\textbf{Metric Name} & \textbf{Metric Unit} & \textbf{Metric Value} \\
\hline
\text{DRAM Frequency} & \text{GHz} & 4.96 \\
\text{SM Frequency} & \text{MHz} & 585.00 \\
\text{Elapsed Cycles} & \text{cycles} & 36,426,208 \\
\text{Memory Throughput} & \text{%} & 79.92 \\
\text{DRAM Throughput} & \text{%} & 79.92 \\
\text{Duration} & \text{ms} & 62.27 \\
\text{L1/TEX Cache Throughput} & \text{%} & 45.86 \\
\text{L2 Cache Throughput} & \text{%} & 19.59 \\
\text{SM Active Cycles} & \text{cycles} & 35,981,727.65 \\
\text{Compute (SM) Throughput} & \text{%} & 36.03 \\
\hline
\end{array}
$$





$$
\begin{array}{|c|c|c|}
\hline
\textbf{Metric Name} & \textbf{Metric Unit} & \textbf{Metric Value} \\
\hline
\text{Block Size} & - & 1,024 \\
\text{Function Cache Configuration} & - & \text{CachePreferNone} \\
\text{Grid Size} & - & 1,024 \\
\text{Registers Per Thread} & \text{register/thread} & 64 \\
\text{Shared Memory Configuration Size} & \text{Kbyte} & 32.77 \\
\text{Driver Shared Memory Per Block} & \text{byte/block} & 0 \\
\text{Dynamic Shared Memory Per Block} & \text{byte/block} & 0 \\
\text{Static Shared Memory Per Block} & \text{byte/block} & 0 \\
\text{\# SMs} & \text{SM} & 40 \\
\text{Threads} & \text{thread} & 1,048,576 \\
\text{Uses Green Context} & - & 0 \\
\text{Waves Per SM} & - & 25.60 \\
\hline
\end{array}
$$






$$
\begin{array}{|c|c|c|}
\hline
\textbf{Metric Name} & \textbf{Metric Unit} & \textbf{Metric Value} \\
\hline
\text{Block Limit SM} & \text{block} & 16 \\
\text{Block Limit Registers} & \text{block} & 1 \\
\text{Block Limit Shared Mem} & \text{block} & 16 \\
\text{Block Limit Warps} & \text{block} & 1 \\
\text{Theoretical Active Warps per SM} & \text{warp} & 32 \\
\text{Theoretical Occupancy} & \text{\%} & 100 \\
\text{Achieved Occupancy} & \text{\%} & 99.11 \\
\text{Achieved Active Warps Per SM} & \text{warp} & 31.71 \\
\hline
\end{array}
$$


In [55]:
!ncu ./laplace3d_new

Grid dimensions: 1024 x 1024 x 1024 

==PROF== Connected to process 11932 (/content/laplace3d_new)
GPU Device 0: "Turing" with compute capability 7.5

Copy u1 to device: 1111.1 (ms) 

==PROF== Profiling "GPU_laplace3d" - 0: 0%
....50%....100% - 9 passes
==PROF== Profiling "GPU_laplace3d" - 1: 0%....50%....100% - 9 passes
==PROF== Profiling "GPU_laplace3d" - 2: 0%....50%....100% - 9 passes
==PROF== Profiling "GPU_laplace3d" - 3: 0%....50%....100% - 9 passes
==PROF== Profiling "GPU_laplace3d" - 4: 0%....50%....100% - 9 passes
==PROF== Profiling "GPU_laplace3d" - 5: 0%....50%....100% - 9 passes
==PROF== Profiling "GPU_laplace3d" - 6: 0%....50%....100% - 9 passes
==PROF== Profiling "GPU_laplace3d" - 7: 0%....50%....100% - 9 passes
==PROF== Profiling "GPU_laplace3d" - 8: 0%....50%....100% - 9 passes
==PROF== Profiling "GPU_laplace3d" - 9: 0%....50%....100% - 9 passes
==PROF== Profiling "GPU_laplace3d" - 10: 0%....50%....100% - 9 passes
==PROF== Profiling "GPU_laplace3d" - 11: 0%....50%....1

**Laplace 3d new**



### **General GPU Performance Metrics**
$$
\begin{array}{|c|c|c|}
\hline
\textbf{Metric Name} & \textbf{Metric Unit} & \textbf{Metric Value} \\
\hline
\text{DRAM Frequency} & \text{GHz} & 4.99 \\
\text{SM Frequency} & \text{MHz} & 585.00 \\
\text{Elapsed Cycles} & \text{cycles} & 32,077,015 \\
\text{Memory Throughput} & \text{\%} & 72.24 \\
\text{DRAM Throughput} & \text{\%} & 72.24 \\
\text{Duration} & \text{ms} & 54.83 \\
\text{L1/TEX Cache Throughput} & \text{\%} & 62.04 \\
\text{L2 Cache Throughput} & \text{\%} & 41.78 \\
\text{SM Active Cycles} & \text{cycles} & 32,062,723.23 \\
\text{Compute (SM) Throughput} & \text{\%} & 58.21 \\
\hline
\end{array}
$$
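
As a quick sanity check on the two reports, the durations appear consistent with the elapsed cycle counts divided by the SM clock. The sketch below assumes that relationship holds (it is inferred from the numbers, not taken from the ncu documentation); the names `runs` and `duration_ms` are illustrative only.

```python
# Values copied from the two ncu reports above.
SM_FREQ_HZ = 585.0e6  # "SM Frequency", 585 MHz in both runs

runs = {
    "laplace3d": {"elapsed_cycles": 36_426_208, "duration_ms": 62.27},
    "laplace3d_new": {"elapsed_cycles": 32_077_015, "duration_ms": 54.83},
}

def duration_ms(elapsed_cycles, sm_freq_hz=SM_FREQ_HZ):
    """Duration in ms implied by a cycle count at a fixed SM clock (assumed relation)."""
    return elapsed_cycles / sm_freq_hz * 1e3

for name, r in runs.items():
    implied = duration_ms(r["elapsed_cycles"])
    print(f"{name}: implied {implied:.2f} ms, ncu reports {r['duration_ms']} ms")

# Ratio of the two reported kernel durations.
speedup = runs["laplace3d"]["duration_ms"] / runs["laplace3d_new"]["duration_ms"]
print(f"laplace3d / laplace3d_new duration ratio: {speedup:.3f}")
```

With these inputs the new kernel is roughly 1.14 times faster per iteration than the original.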

---

### **Launch Statistics**
$$
\begin{array}{|c|c|c|}
\hline
\textbf{Metric Name} & \textbf{Metric Unit} & \textbf{Metric Value} \\
\hline
\text{Block Size} & - & 256 \\
\text{Function Cache Configuration} & - & \text{CachePreferNone} \\
\text{Grid Size} & - & 4,194,304 \\
\text{Registers Per Thread} & \text{register/thread} & 21 \\
\text{Shared Memory Configuration Size} & \text{Kbyte} & 32.77 \\
\text{Driver Shared Memory Per Block} & \text{byte/block} & 0 \\
\text{Dynamic Shared Memory Per Block} & \text{byte/block} & 0 \\
\text{Static Shared Memory Per Block} & \text{byte/block} & 0 \\
\text{\# SMs} & \text{SM} & 40 \\
\text{Threads} & \text{thread} & 1,073,741,824 \\
\text{Uses Green Context} & - & 0 \\
\text{Waves Per SM} & - & 26,214.40 \\
\hline
\end{array}
$$

---

### **Occupancy Metrics**
$$
\begin{array}{|c|c|c|}
\hline
\textbf{Metric Name} & \textbf{Metric Unit} & \textbf{Metric Value} \\
\hline
\text{Block Limit SM} & \text{block} & 16 \\
\text{Block Limit Registers} & \text{block} & 10 \\
\text{Block Limit Shared Mem} & \text{block} & 16 \\
\text{Block Limit Warps} & \text{block} & 4 \\
\text{Theoretical Active Warps per SM} & \text{warp} & 32 \\
\text{Theoretical Occupancy} & \text{\%} & 100 \\
\text{Achieved Occupancy} & \text{\%} & 83.08 \\
\text{Achieved Active Warps Per SM} & \text{warp} & 26.59 \\
\hline
\end{array}
$$
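
The launch and occupancy figures in the two reports can be cross-checked against each other. This is a rough sketch, assuming that "Waves Per SM" equals the grid size divided by (number of SMs × resident blocks per SM) and that the resident block count is the smallest of the reported block limits; the dictionary layout and the helper `summarize` are hypothetical names.

```python
WARP_SIZE = 32          # threads per warp on NVIDIA GPUs
MAX_WARPS_PER_SM = 32   # "Theoretical Active Warps per SM" in both reports
NUM_SMS = 40            # "# SMs" in the launch statistics

# Launch parameters and block limits copied from the ncu tables above.
kernels = {
    "laplace3d": dict(block_size=1024, grid_size=1024,
                      limits=dict(sm=16, registers=1, shared_mem=16),
                      achieved_warps=31.71),
    "laplace3d_new": dict(block_size=256, grid_size=4_194_304,
                          limits=dict(sm=16, registers=10, shared_mem=16),
                          achieved_warps=26.59),
}

def summarize(k):
    warps_per_block = k["block_size"] // WARP_SIZE
    # Blocks per SM allowed by the warp limit alone.
    block_limit_warps = MAX_WARPS_PER_SM // warps_per_block
    # Assumption: resident blocks per SM is the tightest of all block limits.
    resident_blocks = min(block_limit_warps, *k["limits"].values())
    theoretical_occupancy = resident_blocks * warps_per_block / MAX_WARPS_PER_SM
    waves_per_sm = k["grid_size"] / (NUM_SMS * resident_blocks)
    achieved_occupancy = k["achieved_warps"] / MAX_WARPS_PER_SM
    return block_limit_warps, theoretical_occupancy, waves_per_sm, achieved_occupancy

results = {name: summarize(k) for name, k in kernels.items()}
for name, (limit, theo, waves, ach) in results.items():
    print(f"{name}: block limit (warps) = {limit}, theoretical occupancy = {theo:.0%}, "
          f"waves per SM = {waves:.2f}, achieved occupancy = {ach:.2%}")
```

Both kernels reach 100% theoretical occupancy, but by different routes: one 1,024-thread block per SM in the original, four 256-thread blocks per SM in the new version.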


# Question 9

### Solution 1

**Estimate how much data is moved from the device memory into the GPU, and
from the GPU back to the device memory, in each iteration. You can assume
that the array u2 is first transferred to the GPU before being written to,
which leads to it being transferred back to the device memory.**



**We first calculate the overall grid points:**


N = NX * NY * NZ = 1024 × 1024 × 1024 = 1,073,741,824

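**Then we calculate the number of points on the boundary (each of the six faces is counted in full, so edge and corner points are counted more than once; the overcount is negligible for this estimate):**


N_boundary = 2 × NX × NY + 2 × NX × NZ + 2 × NY × NZ = 6 × 1024 × 1024 = 6,291,456


**After that we calculate the number of points inside the cube, excluding the boundary:**


N_interior = N - N_boundary = 1,073,741,824 - 6,291,456 = 1,067,450,368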
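
The point counts can be checked in a few lines. The sketch below, with illustrative variable names, computes the face-based boundary estimate used in this solution and, for comparison, the exact count obtained from the (NX-2) × (NY-2) × (NZ-2) interior.

```python
NX = NY = NZ = 1024

n_total = NX * NY * NZ

# Face-based estimate: six full faces, so edges and corners are counted more than once.
n_boundary_faces = 2 * NX * NY + 2 * NX * NZ + 2 * NY * NZ
n_interior_estimate = n_total - n_boundary_faces

# Exact counts: the interior is the cube with one layer removed on every side.
n_interior_exact = (NX - 2) * (NY - 2) * (NZ - 2)
n_boundary_exact = n_total - n_interior_exact

print(f"total points:               {n_total:,}")
print(f"boundary (six full faces):  {n_boundary_faces:,}")
print(f"interior (estimate):        {n_interior_estimate:,}")
print(f"boundary (exact):           {n_boundary_exact:,}")
print(f"interior (exact):           {n_interior_exact:,}")
print(f"overcount of boundary:      {n_boundary_faces - n_boundary_exact:,}")
```

The overcount is 12,280 points out of more than a billion, which does not affect the estimate at the precision used here.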



For each interior point we read the values of its 6 neighboring points in d_u1 and write a single updated value into d_u2. Because neighboring threads and blocks access overlapping parts of d_u1, the same elements are requested several times and may already be in the cache.


To simplify the estimate, we make a key assumption: once an element has been loaded into the cache, it is not fetched again from global memory (DRAM). Under this assumption each element of d_u1 is read from DRAM only once per iteration.


total data loaded per grid point  = 1 * 4 = 4 bytes

total data stored per grid point = 1 * 4 = 4 bytes


Since d_u2 is first transferred to the GPU before it is written, I add an extra 4 bytes to the calculation:


total data movement per grid point in the interior = 4 + 4 + 4 = 12 bytes


total data movement per grid point on the boundary = 4 + 4 + 4 = 12 bytes


total data moved per iteration in the interior = N_interior × 12 = 12,809,404,416 bytes ≈ 12.81 GB


total data moved per iteration on the boundary = N_boundary × 12 = 75,497,472 bytes ≈ 0.075 GB


total data moved per iteration = interior + boundary = 12,884,901,888 bytes ≈ 12.885 GB



The same estimate in table form, with the loads and stores per grid point listed separately:




$$
\begin{array}{|c|c|}
\hline
\textbf{Parameter} & \textbf{Value} \\
\hline
N = NX \times NY \times NZ & 1,073,741,824 \\
N_{\text{boundary}} = 2NXNY + 2NXNZ + 2NYNZ & 6,291,456 \\
N_{\text{interior}} = N - N_{\text{boundary}} & 1,067,450,368 \\
\hline
\textbf{Data Movement Per Grid Point} & \textbf{Size (bytes)} \\
\hline
\text{Loaded from } d_{u1} & 4 \\
\text{Stored to } d_{u2} & 4 \\
\text{Transfer of } d_{u2} \text{ before writing} & 4 \\
\textbf{Total (Interior)} & 12 \\
\textbf{Total (Boundary)} & 12 \\
\hline
\textbf{Total Data Moved Per Iteration} & \textbf{Size (GB)} \\
\hline
N_{\text{interior}} \times 12 & 12.81 \\
N_{\text{boundary}} \times 12 & 0.075 \\
\textbf{Total} & \textbf{12.885 GB} \\
\hline
\end{array}
$$
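
The Solution 1 estimate can be reproduced with exact integer arithmetic. This is a minimal sketch under the same assumptions as above (single-precision values, one DRAM read of d_u1 per point, d_u2 fetched once and written once); all variable names are illustrative.

```python
BYTES_PER_FLOAT = 4  # assumes single-precision (float) arrays

n_total = 1024 ** 3
n_boundary = 6 * 1024 * 1024          # face-based estimate from above
n_interior = n_total - n_boundary

# Per-point DRAM traffic under the "perfect cache reuse" assumption.
load_u1 = BYTES_PER_FLOAT    # each d_u1 element read from DRAM once
fetch_u2 = BYTES_PER_FLOAT   # d_u2 element brought in before being written
store_u2 = BYTES_PER_FLOAT   # d_u2 element written back
bytes_per_point = load_u1 + fetch_u2 + store_u2

interior_bytes = n_interior * bytes_per_point
boundary_bytes = n_boundary * bytes_per_point
total_bytes = interior_bytes + boundary_bytes

print(f"interior: {interior_bytes:,} bytes = {interior_bytes / 1e9:.3f} GB")
print(f"boundary: {boundary_bytes:,} bytes = {boundary_bytes / 1e9:.3f} GB")
print(f"total:    {total_bytes:,} bytes = {total_bytes / 1e9:.3f} GB")
```

Because interior and boundary points both move 12 bytes in this model, the total is simply 12 bytes times the number of grid points.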




**Given the execution time per iteration, what read and write device memory
bandwidth does this imply?**


Total data moved = 12.885 GB


execution time (from ncu, laplace3d_new run) = 54.83 ms = 0.05483 s


total memory bandwidth = $12.885 \times 10^9 / 0.05483$ ≈ 235 GB/s


estimated memory throughput = 235/300 × 100% ≈ 78%


measured memory throughput from ncu for the same run = 72.24%


$$
\begin{array}{|c|c|c|}
\hline
\textbf{Type} & \textbf{Calculation} & \textbf{Value (GB/s)} \\
\hline
\text{Read Bandwidth (8 bytes per point)} & \frac{8.590 \times 10^9}{0.05483} & 156.7 \\
\text{Write Bandwidth (4 bytes per point)} & \frac{4.295 \times 10^9}{0.05483} & 78.3 \\
\hline
\textbf{Total Bandwidth Used} & 156.7 + 78.3 & \textbf{235} \\
\hline
\end{array}
$$



**Memory Bandwidth & Throughput Analysis**
$$
\begin{array}{|c|c|}
\hline
\textbf{Metric} & \textbf{Value} \\
\hline
\text{Total Data Moved} & 12.885 \text{ GB} \\
\text{Execution Time} & 54.83 \text{ ms} = 0.05483 \text{ s} \\
\text{Total Memory Bandwidth} & \frac{12.885 \times 10^9}{0.05483} \approx 235 \text{ GB/s} \\
\text{Theoretical Peak Memory Bandwidth} & 300 \text{ GB/s} \\
\text{Estimated Memory Throughput} & \frac{235}{300} \times 100 \approx 78\% \\
\text{Measured Memory Throughput (NCU, new kernel)} & 72.24\% \\
\hline
\end{array}
$$
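
The read and write bandwidths implied by the Solution 1 model can be sketched as follows. It assumes 8 bytes read per grid point (d_u1 plus the initial fetch of d_u2), 4 bytes written per grid point, the 54.83 ms duration from the laplace3d_new report, and the 300 GB/s peak used in this notebook; the variable names are illustrative.

```python
N_POINTS = 1024 ** 3
DURATION_S = 54.83e-3   # "Duration" from the laplace3d_new ncu report
PEAK_GB_PER_S = 300.0   # theoretical peak assumed in this notebook

read_bytes = N_POINTS * 8    # 4 bytes of d_u1 + 4 bytes of d_u2 fetched before writing
write_bytes = N_POINTS * 4   # 4 bytes of d_u2 written back

read_bw = read_bytes / DURATION_S / 1e9     # GB/s
write_bw = write_bytes / DURATION_S / 1e9   # GB/s
total_bw = read_bw + write_bw
throughput_pct = total_bw / PEAK_GB_PER_S * 100

print(f"read bandwidth:  {read_bw:.1f} GB/s")
print(f"write bandwidth: {write_bw:.1f} GB/s")
print(f"total bandwidth: {total_bw:.1f} GB/s ({throughput_pct:.1f}% of assumed peak)")
```

In this model reads account for two thirds of the traffic and writes for one third.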


### Solution 2 (assuming every neighbor load goes to global memory (DRAM), with no reuse of cached data)

**Memory Data Movement Estimation Per Iteration**
$$
\begin{array}{|c|c|}
\hline
\textbf{Parameter} & \textbf{Value} \\
\hline
N = NX \times NY \times NZ & 1,073,741,824 \\
N_{\text{boundary}} = 2NXNY + 2NXNZ + 2NYNZ & 6,291,456 \\
N_{\text{interior}} = N - N_{\text{boundary}} & 1,067,450,368 \\
\hline
\textbf{Data Movement Per Grid Point} & \textbf{Size (bytes)} \\
\hline
\text{Loaded from } d_{u1} & 6 \times 4 = 24 \\
\text{Stored to } d_{u2} & 4 \\
\text{Transfer of } d_{u2} \text{ before writing} & 4 \\
\textbf{Total (Interior)} & 24 + 4 + 4 = 32 \\
\textbf{Total (Boundary)} & 4 + 4 + 4 = 12 \\
\hline
\textbf{Total Data Moved Per Iteration} & \textbf{Size (GB)} \\
\hline
N_{\text{interior}} \times 32 & 34.16 \\
N_{\text{boundary}} \times 12 & 0.075 \\
\textbf{Total} & \textbf{34.23 GB} \\
\hline
\end{array}
$$

**Memory Bandwidth & Throughput Analysis**
$$
\begin{array}{|c|c|}
\hline
\textbf{Metric} & \textbf{Value} \\
\hline
\text{Total Data Moved} & 34.23 \text{ GB} \\
\text{Execution Time} & 54.83 \text{ ms} = 0.05483 \text{ s} \\
\text{Total Memory Bandwidth} & \frac{34.23 \times 10^9}{0.05483} \approx 624.4 \text{ GB/s} \\
\text{Theoretical Peak Memory Bandwidth} & 300 \text{ GB/s} \\
\text{Estimated Memory Throughput} & \frac{624.4}{300} \times 100 \approx 208.1\% \\
\text{Measured Memory Throughput (NCU, new kernel)} & 72.24\% \\
\hline
\end{array}
$$


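**NVIDIA's quoted peak memory bandwidth is the combined total for reads and writes, so the 300 GB/s figure covers both directions.** An implied throughput above 100% of that peak is not physically possible, so the no-reuse assumption of Solution 2 cannot hold: most neighbor loads must be served from cache, and the Solution 1 estimate is the more realistic one.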
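
The Solution 2 figures can be checked the same way. This sketch assumes the no-reuse model above (24 bytes of d_u1 loads plus 8 bytes of d_u2 traffic per interior point, 12 bytes per boundary point), the 54.83 ms duration from the laplace3d_new report, and the 300 GB/s peak used in this notebook; the names are illustrative.

```python
N_TOTAL = 1024 ** 3
N_BOUNDARY = 6 * 1024 * 1024        # face-based estimate used in this notebook
N_INTERIOR = N_TOTAL - N_BOUNDARY

DURATION_S = 54.83e-3               # "Duration" from the laplace3d_new ncu report
PEAK_GB_PER_S = 300.0               # theoretical peak assumed in this notebook

INTERIOR_BYTES_PER_POINT = 6 * 4 + 4 + 4   # six neighbor loads, d_u2 fetch, d_u2 store
BOUNDARY_BYTES_PER_POINT = 4 + 4 + 4       # one d_u1 load, d_u2 fetch, d_u2 store

total_bytes = (N_INTERIOR * INTERIOR_BYTES_PER_POINT
               + N_BOUNDARY * BOUNDARY_BYTES_PER_POINT)
implied_bw = total_bytes / DURATION_S / 1e9          # GB/s
implied_pct = implied_bw / PEAK_GB_PER_S * 100

print(f"data moved:        {total_bytes / 1e9:.2f} GB")
print(f"implied bandwidth: {implied_bw:.1f} GB/s")
print(f"fraction of peak:  {implied_pct:.1f}%")
print("exceeds peak" if implied_bw > PEAK_GB_PER_S else "within peak")
```

The implied bandwidth is about twice the assumed peak, which is why the no-reuse model cannot describe what the hardware actually does.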

# Unassign

In [None]:
from google.colab import runtime
runtime.unassign()