# **Challenge 1**

Implement a **batched arbitrarily-sized matrix multiplication** kernel.

All matrix multiplications in the batch share the same dimensions $m, k, n \in \mathbb{N}$,<br>
while $batch \in \mathbb{N}$ is the batch size.

You are provided as input the following matrices:<br>
$N_0, N_1, N_2, ... N_{batch - 1} \in \mathbb{M}^{k \times n}$<br>
$M \in \mathbb{M}^{m \times k}$

You need to compute:<br>
$P_0, P_1, P_2, ... P_{batch - 1} \in \mathbb{M}^{m \times n}$

Where $P_i = M \otimes N_i$ for each $i \in \{0, ..., batch - 1\}$.
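
Spelled out element by element, and assuming $\otimes$ denotes the ordinary matrix product (which is what the reference code in this notebook computes), each output entry is:

$$(P_b)_{r,c} = \sum_{i=0}^{k-1} M_{r,i} \, (N_b)_{i,c} \qquad 0 \le r < m, \quad 0 \le c < n, \quad 0 \le b < batch$$

In the row-major flat arrays used by the provided code, $M_{r,i}$ is `M[r*k + i]`, $(N_b)_{i,c}$ is `N[b*(k*n) + i*n + c]`, and $(P_b)_{r,c}$ is `P[b*(m*n) + r*n + c]`.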

---

A baseline reference implementation is given. Implement your version to replace it. Rely on the provided host-side function to check the correctness of results.

To get a general performance metric, rely on the profiler: look at the `cuda_gpu_kern_sum` report and try to minimize the `Total Time (ns)` of your kernel. You may also want to reduce the memory-transfer time reported in `cuda_gpu_mem_time_sum`.

Step one is beating the reference implementation, which should be easy; after that, you can use every trick in the book to push it further.
Anything goes, but if you use "exotic" tricks we want an explanation.
When submitting your work, be sure to fill out the [report](#report) with brief insights into what you did.

General rules and advice:
- groups have at most 3 members, who should contribute as equally as possible to the project.
- you automagically get ~3 points in the 1st part of the exam, taking the form of 1-2 questions that you will be allowed to skip.
- deadline is 1 week (as of this writing, you will need to submit your work before October 23 at 23:59).
- submissions are to be made on WeBeep, where you need to upload a downloaded copy of this notebook (.ipynb file); for groups of multiple people, it's enough for one member to submit the file in the assignment, other members shall simply write their group's name in their submission, we will then infer groups from what you write in the report section.
- your code needs to work here on Colab with the T4 runtime.
- do not alter code sections delimited by "==="s in the final submission.
- we will change around matrix sizes arbitrarily while evaluating your work, so make sure to cover all edge cases and take care that your code is scalable (e.g. execution time grows as expected when doubling all dimensions).
- you can get the maximum grade just by using what was discussed during lectures or is present in the glossary shown during exercise sessions; still, if you wanna have "more fun" this guide is your best friend https://docs.nvidia.com/cuda/cuda-c-best-practices-guide.
- a piece of code that works is better than a supposedly faster piece of code that doesn't, so don't go overboard, but be ambitious.
- use LLMs (ChatGPT and friends) responsibly; the purpose of this challenge is for you to get your hands dirty and build up confidence in writing parallel code through trial and error. Having an LLM write your code may get you the challenge's points (unless it's so blatant that we notice), but you won't learn anything, and the next time you see some parallel code your mind will go blank. If you bang your head against the problem instead, and solve it, the solution will stick in the back of your mind for a long time. Similarly, if despite pushing yourself you can't find "that damn bug", then asking an LLM is fine, so long as you tried by yourself first and end up saying "ahhhhhh, so that's what it was!" once the LLM has helped you out. Long story short, AI is fine so long as it's a tool you **learn from** and **not** one you **blindly lean on**.
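
The scalability rule above can be made concrete with a little arithmetic. The sketch below assumes that "grows as expected" refers to the operation count of the plain algorithm (one multiply and one add per term of every dot product); the function names are illustrative and do not appear in the provided code.

```cpp
#include <cassert>
#include <cstdint>

// Floating-point operations of the whole batch: one multiply and one add
// for each of the k terms of each of the m*n*batch output elements.
std::uint64_t batchedMatMulFlops(std::uint64_t m, std::uint64_t k,
                                 std::uint64_t n, std::uint64_t batch) {
  return 2 * m * k * n * batch;
}

// Number of floats that cross the host-device boundary:
// M and all the N_i go in, all the P_i come out.
std::uint64_t batchedMatMulFloats(std::uint64_t m, std::uint64_t k,
                                  std::uint64_t n, std::uint64_t batch) {
  return m * k + k * n * batch + m * n * batch;
}
```

Under this assumption, doubling `m`, `k` and `n` at a fixed batch size multiplies the work by 8, while the transferred data grows by about 4; a kernel time that grows much faster than 8x on large inputs is worth investigating.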

If you need help or anything, please drop us an email:
- Dr. M. Ronzani: marco.ronzani@polimi.it
- Prof. F. Ferrandi: fabrizio.ferrandi@polimi.it

## **Colab Setup**

In [None]:
%%capture
!wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/nsight-systems-2023.2.3_2023.2.3.1001-1_amd64.deb
!apt update
!apt install ./nsight-systems-2023.2.3_2023.2.3.1001-1_amd64.deb
!apt --fix-broken install

In [None]:
!mkdir /home/cuda
%cd /home/cuda

## **Code**

In [None]:
# @title Original
%%writefile bmatmul.cpp
// DON'T CHANGE THIS ^^ FILENAME!
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

// utility for wrapping CUDA API calls and log any error they may return (use this for debugging)
#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
inline void gpuAssert(cudaError_t code, const char *file, int line, bool abort = true) {
  if (code != cudaSuccess) {
    fprintf(stderr,"GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
    if (abort)
      exit(code);
  }
}

// === DO NOT CHANGE THIS ===
// host-side version, used to validate results
__host__
void batchedMatMulHost(float* M, float* N, float* P, int m, int k, int n, int batch) {
  for (int b = 0; b < batch; b++) {
    for (int row = 0; row < m; row++) {
      for (int col = 0; col < n; col++) {
        float value = 0.0f;
        for (int i = 0; i < k; i++) {
          float a = M[row*k + i];
          float c = N[b*(k*n) + i*n + col];
          value += a * c;
        }
        P[b*(m*n) + row*n + col] = value;
      }
    }
  }
}

void initWith(float number, float* arr, int size) {
  for (int i = 0; i < size; i++)
    arr[i] = number;
}

void initRandom(float* arr, int size, unsigned int seed, float minVal = 0.0f, float maxVal = 1.0f) {
  srand(seed);
  for (int i = 0; i < size; i++) {
    float r = (float)rand() / RAND_MAX;
    arr[i] = minVal + r * (maxVal - minVal);
  }
}

void checkResult(float* arr1, float* arr2, int size) {
  const float atol = 1e-4f; // absolute tolerance for fp32 (lack of) associativity
  const float rtol = 1e-4f; // relative tolerance for fp32 (lack of) associativity
  for (int i = 0; i < size; i++) {
    float diff = fabs(arr1[i] - arr2[i]);
    float tol = atol + rtol*fabs(arr2[i]);
    if (diff > tol) {
      printf("Error at %d: %f != %f (diff=%e, tol=%e)\n", i, arr1[i], arr2[i], diff, tol);
      exit(1);
    }
  }
}
// ==========================

// this is the reference implementation
// you can change this to your heart's content
__global__
void batchedMatMul(float* M, float* N, float* P, int m, int k, int n, int batch) {
  int row = blockIdx.y*blockDim.y + threadIdx.y;
  int col = blockIdx.x*blockDim.x + threadIdx.x;

  if (row < m && col < n) {
    for (int b = 0; b < batch; b++) {
      float value = 0.0f;
      for (int i = 0; i < k; i++) {
        float a = M[row*k + i];
        float c = N[b*(k*n) + i*n + col];
        value += a * c;
      }
      P[b*(m*n) + row*n + col] = value;
    }
  }
}


int main(int argc, char** argv) {
  // === DO NOT CHANGE THIS ===
  if (argc != 6) {
    printf("Usage: %s <m> <k> <n> <batch> <seed>\n", argv[0]);
    exit(1);
  }

  int m = atoi(argv[1]); // rows of Ms and Ps
  int k = atoi(argv[2]); // cols of Ms, rows of Ns
  int n = atoi(argv[3]); // cols of Ns and Ps
  int batch = atoi(argv[4]); // number of matrix pairs
  unsigned int seed = (unsigned int)atoi(argv[5]); // seed for random initialization

  printf("Running batched matmul with m=%d, k=%d, n=%d, batch=%d, seed=%u\n", m, k, n, batch, seed);

  const int sizeM = m*k;
  const int sizeN = k*n*batch;
  const int sizeP = m*n*batch;

  float* M = (float*)malloc(sizeM * sizeof(float));
  float* N = (float*)malloc(sizeN * sizeof(float));
  float* P = (float*)malloc(sizeP * sizeof(float));

  initRandom(M, sizeM, seed);
  initRandom(N, sizeN, seed + 1);
  initWith(0.0f, P, sizeP);
  // ==========================

  // here, you can change anything
  float *M_d;
  float *N_d;
  float *P_d;

  gpuErrchk(cudaMalloc((void**)&M_d, sizeM * sizeof(float)));
  gpuErrchk(cudaMalloc((void**)&N_d, sizeN * sizeof(float)));
  gpuErrchk(cudaMalloc((void**)&P_d, sizeP * sizeof(float)));

  gpuErrchk(cudaMemcpy(M_d, M, sizeM * sizeof(float), cudaMemcpyHostToDevice));
  gpuErrchk(cudaMemcpy(N_d, N, sizeN * sizeof(float), cudaMemcpyHostToDevice));
  gpuErrchk(cudaMemcpy(P_d, P, sizeP * sizeof(float), cudaMemcpyHostToDevice));

  dim3 blockSize(16,16); // is 16x16 truly the best here?
  dim3 numBlocks((n + blockSize.x - 1) / blockSize.x, (m + blockSize.y - 1) / blockSize.y);

  batchedMatMul<<<numBlocks, blockSize>>>(M_d, N_d, P_d, m, k, n, batch);
  gpuErrchk(cudaDeviceSynchronize());

  gpuErrchk(cudaMemcpy(P, P_d, sizeP * sizeof(float), cudaMemcpyDeviceToHost));

  // === DO NOT CHANGE THIS ===
  // However: once you know results are correct, you can temporarily
  //          comment this out if you want to test performance on large
  //          matrices, since the evaluation on CPU can get pretty slow.
  printf("Checking results on CPU...\n");
  float* P_host = (float*)malloc(sizeP * sizeof(float));
  initWith(0.0f, P_host, sizeP);
  batchedMatMulHost(M, N, P_host, m, k, n, batch);
  checkResult(P, P_host, m*n*batch);
  printf("All results matched, success!");
  // ==========================

  // here, you can change anything, e.g. add some logging
  gpuErrchk(cudaFree(M_d));
  gpuErrchk(cudaFree(N_d));
  gpuErrchk(cudaFree(P_d));

  free(M);
  free(N);
  free(P);
  free(P_host);

  return 0;
}
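
The launch configuration in `main` rounds the grid size up, so matrices whose sides are not multiples of the block size are still fully covered; the `row < m && col < n` guard in the kernel then turns the surplus threads into no-ops. A minimal host-side sketch of that arithmetic follows, where `ceilDiv` and `launchedThreads` are illustrative helper names that are not part of the notebook:

```cpp
#include <cassert>

// Smallest number of blocks of `block` threads that covers `extent` elements.
int ceilDiv(int extent, int block) {
  return (extent + block - 1) / block;
}

// Threads launched along one axis, including those past the matrix edge
// that the bounds check in the kernel disables.
int launchedThreads(int extent, int block) {
  return ceilDiv(extent, block) * block;
}
```

For example, with 16 threads per axis a side of 17 needs 2 blocks, so 32 threads are launched and 15 of them fail the bounds check.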


In [None]:
# @title Group
%%writefile bmatmul.cpp
// DON'T CHANGE THIS ^^ FILENAME!
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

// utility for wrapping CUDA API calls and log any error they may return (use this for debugging)
#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
inline void gpuAssert(cudaError_t code, const char *file, int line, bool abort = true) {
  if (code != cudaSuccess) {
    fprintf(stderr,"GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
    if (abort)
      exit(code);
  }
}

// === DO NOT CHANGE THIS ===
// host-side version, used to validate results
__host__
void batchedMatMulHost(float* M, float* N, float* P, int m, int k, int n, int batch) {
  for (int b = 0; b < batch; b++) {
    for (int row = 0; row < m; row++) {
      for (int col = 0; col < n; col++) {
        float value = 0.0f;
        for (int i = 0; i < k; i++) {
          float a = M[row*k + i];
          float c = N[b*(k*n) + i*n + col];
          value += a * c;
        }
        P[b*(m*n) + row*n + col] = value;
      }
    }
  }
}

void initWith(float number, float* arr, int size) {
  for (int i = 0; i < size; i++)
    arr[i] = number;
}

void initRandom(float* arr, int size, unsigned int seed, float minVal = 0.0f, float maxVal = 1.0f) {
  srand(seed);
  for (int i = 0; i < size; i++) {
    float r = (float)rand() / RAND_MAX;
    arr[i] = minVal + r * (maxVal - minVal);
  }
}

void checkResult(float* arr1, float* arr2, int size) {
  const float atol = 1e-4f; // absolute tolerance for fp32 (lack of) associativity
  const float rtol = 1e-4f; // relative tolerance for fp32 (lack of) associativity
  for (int i = 0; i < size; i++) {
    float diff = fabs(arr1[i] - arr2[i]);
    float tol = atol + rtol*fabs(arr2[i]);
    if (diff > tol) {
      printf("Error at %d: %f != %f (diff=%e, tol=%e)\n", i, arr1[i], arr2[i], diff, tol);
      exit(1);
    }
  }
}
// ==========================

// reference kernel with the batch dimension spread over blockIdx.z:
// each block handles batchesPerBlock consecutive batches
__global__
void batchedMatMul(float* M, float* N, float* P, int m, int k, int n, int batch, int batchesPerBlock) {
  int row = blockIdx.y*blockDim.y + threadIdx.y;
  int col = blockIdx.x*blockDim.x + threadIdx.x;

  int batchStart = blockIdx.z * batchesPerBlock;
  int batchEnd   = min(batchStart + batchesPerBlock, batch);

  if (row < m && col < n) {
    for (int b = batchStart; b < batchEnd; ++b) {
      float value = 0.0f;
      for (int i = 0; i < k; ++i) {
        float a = M[row * k + i];
        float c = N[b * (k * n) + i * n + col];
        value += a * c;
      }
      P[b * (m * n) + row * n + col] = value;
    }
  }
}


int main(int argc, char** argv) {
  // === DO NOT CHANGE THIS ===
  if (argc != 6) {
    printf("Usage: %s <m> <k> <n> <batch> <seed>\n", argv[0]);
    exit(1);
  }

  int m = atoi(argv[1]); // rows of Ms and Ps
  int k = atoi(argv[2]); // cols of Ms, rows of Ns
  int n = atoi(argv[3]); // cols of Ns and Ps
  int batch = atoi(argv[4]); // number of matrix pairs
  unsigned int seed = (unsigned int)atoi(argv[5]); // seed for random initialization

  printf("Running batched matmul with m=%d, k=%d, n=%d, batch=%d, seed=%u\n", m, k, n, batch, seed);

  const int sizeM = m*k;
  const int sizeN = k*n*batch;
  const int sizeP = m*n*batch;

  float* M = (float*)malloc(sizeM * sizeof(float));
  float* N = (float*)malloc(sizeN * sizeof(float));
  float* P = (float*)malloc(sizeP * sizeof(float));

  initRandom(M, sizeM, seed);
  initRandom(N, sizeN, seed + 1);
  initWith(0.0f, P, sizeP);
  // ==========================

  // here, you can change anything
  float *M_d;
  float *N_d;
  float *P_d;
  const int batchesPerBlock = 1;

  gpuErrchk(cudaMalloc((void**)&M_d, sizeM * sizeof(float)));
  gpuErrchk(cudaMalloc((void**)&N_d, sizeN * sizeof(float)));
  gpuErrchk(cudaMalloc((void**)&P_d, sizeP * sizeof(float)));

  gpuErrchk(cudaMemcpy(M_d, M, sizeM * sizeof(float), cudaMemcpyHostToDevice));
  gpuErrchk(cudaMemcpy(N_d, N, sizeN * sizeof(float), cudaMemcpyHostToDevice));
  gpuErrchk(cudaMemcpy(P_d, P, sizeP * sizeof(float), cudaMemcpyHostToDevice));

  dim3 blockSize(32, 32); // 32x32 = 1024 threads, the per-block maximum
  dim3 numBlocks((n + blockSize.x - 1) / blockSize.x, (m + blockSize.y - 1) / blockSize.y, (batch + batchesPerBlock - 1) / batchesPerBlock);

  batchedMatMul<<<numBlocks, blockSize>>>(M_d, N_d, P_d, m, k, n, batch, batchesPerBlock);
  gpuErrchk(cudaDeviceSynchronize());

  gpuErrchk(cudaMemcpy(P, P_d, sizeP * sizeof(float), cudaMemcpyDeviceToHost));

  // === DO NOT CHANGE THIS ===
  // However: once you know results are correct, you can temporarily
  //          comment this out if you want to test performance on large
  //          matrices, since the evaluation on CPU can get pretty slow.
  printf("Checking results on CPU...\n");
  float* P_host = (float*)malloc(sizeP * sizeof(float));
  initWith(0.0f, P_host, sizeP);
  batchedMatMulHost(M, N, P_host, m, k, n, batch);
  checkResult(P, P_host, m*n*batch);
  printf("All results matched, success!");
  // ==========================

  // here, you can change anything, e.g. add some logging
  gpuErrchk(cudaFree(M_d));
  gpuErrchk(cudaFree(N_d));
  gpuErrchk(cudaFree(P_d));

  free(M);
  free(N);
  free(P);
  free(P_host);

  return 0;
}
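
The kernel above assigns the batches `[batchStart, batchEnd)` to each value of `blockIdx.z`, clamping the last range with `min`. The host-only sketch below replays that partitioning to check that every batch is handled exactly once, even when `batch` is not a multiple of `batchesPerBlock`; `batchCoverage` is a hypothetical helper introduced only for this illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Replays the blockIdx.z -> [start, end) mapping on the CPU and returns
// how many times each batch index is visited over the whole grid.
std::vector<int> batchCoverage(int batch, int batchesPerBlock) {
  std::vector<int> visits(batch, 0);
  const int gridZ = (batch + batchesPerBlock - 1) / batchesPerBlock;
  for (int z = 0; z < gridZ; ++z) {
    const int start = z * batchesPerBlock;
    const int end = std::min(start + batchesPerBlock, batch);
    for (int b = start; b < end; ++b)
      visits[b]++;
  }
  return visits;
}
```

With `batch = 10` and `batchesPerBlock = 4` the grid has 3 layers, covering batches 0-3, 4-7 and 8-9.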

In [None]:
# @title Claude
%%writefile bmatmul.cpp
// DON'T CHANGE THIS ^^ FILENAME!
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

// utility for wrapping CUDA API calls and log any error they may return (use this for debugging)
#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
inline void gpuAssert(cudaError_t code, const char *file, int line, bool abort = true) {
  if (code != cudaSuccess) {
    fprintf(stderr,"GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
    if (abort)
      exit(code);
  }
}

// === DO NOT CHANGE THIS ===
// host-side version, used to validate results
__host__
void batchedMatMulHost(float* M, float* N, float* P, int m, int k, int n, int batch) {
  for (int b = 0; b < batch; b++) {
    for (int row = 0; row < m; row++) {
      for (int col = 0; col < n; col++) {
        float value = 0.0f;
        for (int i = 0; i < k; i++) {
          float a = M[row*k + i];
          float c = N[b*(k*n) + i*n + col];
          value += a * c;
        }
        P[b*(m*n) + row*n + col] = value;
      }
    }
  }
}

void initWith(float number, float* arr, int size) {
  for (int i = 0; i < size; i++)
    arr[i] = number;
}

void initRandom(float* arr, int size, unsigned int seed, float minVal = 0.0f, float maxVal = 1.0f) {
  srand(seed);
  for (int i = 0; i < size; i++) {
    float r = (float)rand() / RAND_MAX;
    arr[i] = minVal + r * (maxVal - minVal);
  }
}

void checkResult(float* arr1, float* arr2, int size) {
  const float atol = 1e-4f; // absolute tolerance for fp32 (lack of) associativity
  const float rtol = 1e-4f; // relative tolerance for fp32 (lack of) associativity
  for (int i = 0; i < size; i++) {
    float diff = fabs(arr1[i] - arr2[i]);
    float tol = atol + rtol*fabs(arr2[i]);
    if (diff > tol) {
      printf("Error at %d: %f != %f (diff=%e, tol=%e)\n", i, arr1[i], arr2[i], diff, tol);
      exit(1);
    }
  }
}
// ==========================

// Optimized kernel with tiling and shared memory
#define TILE_DIM 32
#define BATCH_PER_THREAD 4

__global__
void batchedMatMul(float* M, float* N, float* P, int m, int k, int n, int batch) {
  __shared__ float tileM[TILE_DIM][TILE_DIM];
  __shared__ float tileN[TILE_DIM][TILE_DIM];

  int tx = threadIdx.x;
  int ty = threadIdx.y;

  int row = blockIdx.y * TILE_DIM + ty;
  int col = blockIdx.x * TILE_DIM + tx;

  // Process multiple batches per thread
  int batchStart = blockIdx.z * BATCH_PER_THREAD;
  int batchEnd = min(batchStart + BATCH_PER_THREAD, batch);

  // Accumulator for each batch
  float acc[BATCH_PER_THREAD];
  for (int i = 0; i < BATCH_PER_THREAD; i++) {
    acc[i] = 0.0f;
  }

  // Tile across k dimension
  int numTiles = (k + TILE_DIM - 1) / TILE_DIM;

  for (int t = 0; t < numTiles; t++) {
    // Load tile from M (shared across all batches)
    int mCol = t * TILE_DIM + tx;
    if (row < m && mCol < k) {
      tileM[ty][tx] = M[row * k + mCol];
    } else {
      tileM[ty][tx] = 0.0f;
    }

    __syncthreads();

    // Process each batch
    for (int b_idx = 0; b_idx < (batchEnd - batchStart); b_idx++) {
      int b = batchStart + b_idx;

      // Load tile from N for this batch
      int nRow = t * TILE_DIM + ty;
      if (nRow < k && col < n) {
        tileN[ty][tx] = N[b * (k * n) + nRow * n + col];
      } else {
        tileN[ty][tx] = 0.0f;
      }

      __syncthreads();

      // Compute partial dot product
      if (row < m && col < n) {
        for (int i = 0; i < TILE_DIM; i++) {
          acc[b_idx] += tileM[ty][i] * tileN[i][tx];
        }
      }

      __syncthreads();
    }
  }

  // Write results
  if (row < m && col < n) {
    for (int b_idx = 0; b_idx < (batchEnd - batchStart); b_idx++) {
      int b = batchStart + b_idx;
      P[b * (m * n) + row * n + col] = acc[b_idx];
    }
  }
}


int main(int argc, char** argv) {
  // === DO NOT CHANGE THIS ===
  if (argc != 6) {
    printf("Usage: %s <m> <k> <n> <batch> <seed>\n", argv[0]);
    exit(1);
  }

  int m = atoi(argv[1]); // rows of Ms and Ps
  int k = atoi(argv[2]); // cols of Ms, rows of Ns
  int n = atoi(argv[3]); // cols of Ns and Ps
  int batch = atoi(argv[4]); // number of matrix pairs
  unsigned int seed = (unsigned int)atoi(argv[5]); // seed for random initialization

  printf("Running batched matmul with m=%d, k=%d, n=%d, batch=%d, seed=%u\n", m, k, n, batch, seed);

  const int sizeM = m*k;
  const int sizeN = k*n*batch;
  const int sizeP = m*n*batch;

  float* M = (float*)malloc(sizeM * sizeof(float));
  float* N = (float*)malloc(sizeN * sizeof(float));
  float* P = (float*)malloc(sizeP * sizeof(float));

  initRandom(M, sizeM, seed);
  initRandom(N, sizeN, seed + 1);
  initWith(0.0f, P, sizeP);
  // ==========================

  float *M_d;
  float *N_d;
  float *P_d;

  gpuErrchk(cudaMalloc((void**)&M_d, sizeM * sizeof(float)));
  gpuErrchk(cudaMalloc((void**)&N_d, sizeN * sizeof(float)));
  gpuErrchk(cudaMalloc((void**)&P_d, sizeP * sizeof(float)));

  gpuErrchk(cudaMemcpy(M_d, M, sizeM * sizeof(float), cudaMemcpyHostToDevice));
  gpuErrchk(cudaMemcpy(N_d, N, sizeN * sizeof(float), cudaMemcpyHostToDevice));
  gpuErrchk(cudaMemcpy(P_d, P, sizeP * sizeof(float), cudaMemcpyHostToDevice));

  // Optimized launch configuration
  dim3 blockSize(TILE_DIM, TILE_DIM);
  dim3 numBlocks(
    (n + TILE_DIM - 1) / TILE_DIM,
    (m + TILE_DIM - 1) / TILE_DIM,
    (batch + BATCH_PER_THREAD - 1) / BATCH_PER_THREAD
  );

  batchedMatMul<<<numBlocks, blockSize>>>(M_d, N_d, P_d, m, k, n, batch);
  gpuErrchk(cudaDeviceSynchronize());

  gpuErrchk(cudaMemcpy(P, P_d, sizeP * sizeof(float), cudaMemcpyDeviceToHost));

  // === DO NOT CHANGE THIS ===
  // However: once you know results are correct, you can temporarily
  //          comment this out if you want to test performance on large
  //          matrices, since the evaluation on CPU can get pretty slow.
  printf("Checking results on CPU...\n");
  float* P_host = (float*)malloc(sizeP * sizeof(float));
  initWith(0.0f, P_host, sizeP);
  batchedMatMulHost(M, N, P_host, m, k, n, batch);
  checkResult(P, P_host, m*n*batch);
  printf("All results matched, success!");
  // ==========================


  gpuErrchk(cudaFree(M_d));
  gpuErrchk(cudaFree(N_d));
  gpuErrchk(cudaFree(P_d));

  free(M);
  free(N);
  free(P);
  free(P_host);

  return 0;
}
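
The kernel above walks the `k` dimension in slices of `TILE_DIM` and stores `0.0f` for out-of-range entries, so a partially filled last tile adds nothing to the sum. The CPU-only sketch below mimics that scheme and can be compared against the plain triple loop; `matMulNaive`, `matMulTiled` and the `tile` parameter are illustrative names, not part of the notebook, and a single matrix pair stands in for the batch.

```cpp
#include <cassert>
#include <vector>

// Plain reference: P = M (m x k) times N (k x n), all row-major.
std::vector<float> matMulNaive(const std::vector<float>& M,
                               const std::vector<float>& N,
                               int m, int k, int n) {
  std::vector<float> P(m * n, 0.0f);
  for (int r = 0; r < m; ++r)
    for (int c = 0; c < n; ++c)
      for (int i = 0; i < k; ++i)
        P[r * n + c] += M[r * k + i] * N[i * n + c];
  return P;
}

// Same product, consuming k in slices of `tile` elements and treating
// entries beyond k as zero, like the zero-padded shared-memory tiles.
std::vector<float> matMulTiled(const std::vector<float>& M,
                               const std::vector<float>& N,
                               int m, int k, int n, int tile) {
  std::vector<float> P(m * n, 0.0f);
  const int numTiles = (k + tile - 1) / tile;
  for (int t = 0; t < numTiles; ++t) {
    for (int r = 0; r < m; ++r) {
      for (int c = 0; c < n; ++c) {
        float partial = 0.0f;
        for (int i = 0; i < tile; ++i) {
          const int kk = t * tile + i;
          const float a = (kk < k) ? M[r * k + kk] : 0.0f;
          const float b = (kk < k) ? N[kk * n + c] : 0.0f;
          partial += a * b;
        }
        P[r * n + c] += partial;
      }
    }
  }
  return P;
}
```

Choosing a `k` that is not a multiple of `tile` (for instance `k = 5`, `tile = 4`) exercises the padded last slice, which is the edge case most likely to go wrong on the GPU.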

In [None]:
# @title Alone 1
%%writefile bmatmul.cpp
// DON'T CHANGE THIS ^^ FILENAME!
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

// utility for wrapping CUDA API calls and log any error they may return (use this for debugging)
#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
inline void gpuAssert(cudaError_t code, const char *file, int line, bool abort = true) {
  if (code != cudaSuccess) {
    fprintf(stderr,"GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
    if (abort)
      exit(code);
  }
}

// === DO NOT CHANGE THIS ===
// host-side version, used to validate results
__host__
void batchedMatMulHost(float* M, float* N, float* P, int m, int k, int n, int batch) {
  for (int b = 0; b < batch; b++) {
    for (int row = 0; row < m; row++) {
      for (int col = 0; col < n; col++) {
        float value = 0.0f;
        for (int i = 0; i < k; i++) {
          float a = M[row*k + i];
          float c = N[b*(k*n) + i*n + col];
          value += a * c;
        }
        P[b*(m*n) + row*n + col] = value;
      }
    }
  }
}

void initWith(float number, float* arr, int size) {
  for (int i = 0; i < size; i++)
    arr[i] = number;
}

void initRandom(float* arr, int size, unsigned int seed, float minVal = 0.0f, float maxVal = 1.0f) {
  srand(seed);
  for (int i = 0; i < size; i++) {
    float r = (float)rand() / RAND_MAX;
    arr[i] = minVal + r * (maxVal - minVal);
  }
}

void checkResult(float* arr1, float* arr2, int size) {
  const float atol = 1e-4f; // absolute tolerance for fp32 (lack of) associativity
  const float rtol = 1e-4f; // relative tolerance for fp32 (lack of) associativity
  for (int i = 0; i < size; i++) {
    float diff = fabs(arr1[i] - arr2[i]);
    float tol = atol + rtol*fabs(arr2[i]);
    if (diff > tol) {
      printf("Error at %d: %f != %f (diff=%e, tol=%e)\n", i, arr1[i], arr2[i], diff, tol);
      exit(1);
    }
  }
}
// ==========================

// M is read-only and shared by every batch, so it is kept in constant memory.
// Constant memory is limited to 64 KB (16384 floats): larger M matrices do not fit.
__constant__ float Mc[16384];

__global__
void batchedMatMul(float* N, float* P, int m, int k, int n, int batch) {
  int row = blockIdx.y * blockDim.y + threadIdx.y;
  int col = blockIdx.x * blockDim.x + threadIdx.x;
  int b = blockIdx.z * blockDim.z + threadIdx.z;  // Add batch dimension

  if (row < m && col < n && b < batch) {
    float value = 0.0f;
    for (int i = 0; i < k; i++) {
      float a = Mc[row*k + i];
      float c = N[b*(k*n) + i*n + col];
      value += a * c;
    }
    P[b*(m*n) + row*n + col] = value;
  }
}


int main(int argc, char** argv) {
  // === DO NOT CHANGE THIS ===
  if (argc != 6) {
    printf("Usage: %s <m> <k> <n> <batch> <seed>\n", argv[0]);
    exit(1);
  }

  int m = atoi(argv[1]); // rows of Ms and Ps
  int k = atoi(argv[2]); // cols of Ms, rows of Ns
  int n = atoi(argv[3]); // cols of Ns and Ps
  int batch = atoi(argv[4]); // number of matrix pairs
  unsigned int seed = (unsigned int)atoi(argv[5]); // seed for random initialization

  printf("Running batched matmul with m=%d, k=%d, n=%d, batch=%d, seed=%u\n", m, k, n, batch, seed);

  const int sizeM = m*k;
  const int sizeN = k*n*batch;
  const int sizeP = m*n*batch;

  float* M = (float*)malloc(sizeM * sizeof(float));
  float* N = (float*)malloc(sizeN * sizeof(float));
  float* P = (float*)malloc(sizeP * sizeof(float));

  initRandom(M, sizeM, seed);
  initRandom(N, sizeN, seed + 1);
  initWith(0.0f, P, sizeP);
  // ==========================

  // here, you can change anything
  float *N_d;
  float *P_d;

  gpuErrchk(cudaMalloc((void**)&N_d, sizeN * sizeof(float)));
  gpuErrchk(cudaMalloc((void**)&P_d, sizeP * sizeof(float)));

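  // Mc has a fixed capacity: refuse sizes that would overflow it
  if ((size_t)sizeM > sizeof(Mc) / sizeof(float)) {
    fprintf(stderr, "M has %d elements, but Mc holds at most %zu floats\n", sizeM, sizeof(Mc) / sizeof(float));
    exit(1);
  }
  gpuErrchk(cudaMemcpyToSymbol(Mc, M, sizeM * sizeof(float)));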

  gpuErrchk(cudaMemcpy(N_d, N, sizeN * sizeof(float), cudaMemcpyHostToDevice));
  gpuErrchk(cudaMemcpy(P_d, P, sizeP * sizeof(float), cudaMemcpyHostToDevice));

  dim3 blockSize(16, 16, 4); // 16*16*4 = 1024 threads per block; z indexes the batch
  dim3 numBlocks((n + blockSize.x - 1) / blockSize.x, (m + blockSize.y - 1) / blockSize.y, (batch + blockSize.z - 1) / blockSize.z);

  batchedMatMul<<<numBlocks, blockSize>>>(N_d, P_d, m, k, n, batch);
  gpuErrchk(cudaDeviceSynchronize());

  gpuErrchk(cudaMemcpy(P, P_d, sizeP * sizeof(float), cudaMemcpyDeviceToHost));

  // === DO NOT CHANGE THIS ===
  // However: once you know results are correct, you can temporarily
  //          comment this out if you want to test performance on large
  //          matrices, since the evaluation on CPU can get pretty slow.
  printf("Checking results on CPU...\n");
  float* P_host = (float*)malloc(sizeP * sizeof(float));
  initWith(0.0f, P_host, sizeP);
  batchedMatMulHost(M, N, P_host, m, k, n, batch);
  checkResult(P, P_host, m*n*batch);
  printf("All results matched, success!");
  // ==========================

  // here, you can change anything, e.g. add some logging
  gpuErrchk(cudaFree(N_d));
  gpuErrchk(cudaFree(P_d));

  free(M);
  free(N);
  free(P);
  free(P_host);

  return 0;
}
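
Keeping `M` in `__constant__` memory only works while the matrix fits: CUDA devices expose 64 KiB of constant memory in total. The host-side sketch below shows the resulting size check; `kConstantBytes`, `kMaxConstantFloats` and `fitsInConstantMemory` are hypothetical names used only for this illustration.

```cpp
#include <cassert>
#include <cstddef>

// Total __constant__ memory available on CUDA devices: 64 KiB.
constexpr std::size_t kConstantBytes = 64 * 1024;

// Largest float array that can be declared __constant__.
constexpr std::size_t kMaxConstantFloats = kConstantBytes / sizeof(float);

// True when an m x k float matrix fits entirely in constant memory.
bool fitsInConstantMemory(int m, int k) {
  return static_cast<std::size_t>(m) * static_cast<std::size_t>(k)
         <= kMaxConstantFloats;
}
```

A 128 x 128 matrix is exactly at the limit; since the evaluation changes matrix sizes arbitrarily, a solution built on constant memory probably needs a global-memory fallback for larger `M`.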


In [None]:
# @title Alone 2
%%writefile bmatmul.cpp
// DON'T CHANGE THIS ^^ FILENAME!
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

// utility for wrapping CUDA API calls and log any error they may return (use this for debugging)
#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
inline void gpuAssert(cudaError_t code, const char *file, int line, bool abort = true) {
  if (code != cudaSuccess) {
    fprintf(stderr,"GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
    if (abort)
      exit(code);
  }
}

// === DO NOT CHANGE THIS ===
// host-side version, used to validate results
__host__
void batchedMatMulHost(float* M, float* N, float* P, int m, int k, int n, int batch) {
  for (int b = 0; b < batch; b++) {
    for (int row = 0; row < m; row++) {
      for (int col = 0; col < n; col++) {
        float value = 0.0f;
        for (int i = 0; i < k; i++) {
          float a = M[row*k + i];
          float c = N[b*(k*n) + i*n + col];
          value += a * c;
        }
        P[b*(m*n) + row*n + col] = value;
      }
    }
  }
}

void initWith(float number, float* arr, int size) {
  for (int i = 0; i < size; i++)
    arr[i] = number;
}

void initRandom(float* arr, int size, unsigned int seed, float minVal = 0.0f, float maxVal = 1.0f) {
  srand(seed);
  for (int i = 0; i < size; i++) {
    float r = (float)rand() / RAND_MAX;
    arr[i] = minVal + r * (maxVal - minVal);
  }
}

void checkResult(float* arr1, float* arr2, int size) {
  const float atol = 1e-4f; // absolute tolerance for fp32 (lack of) associativity
  const float rtol = 1e-4f; // relative tolerance for fp32 (lack of) associativity
  for (int i = 0; i < size; i++) {
    float diff = fabs(arr1[i] - arr2[i]);
    float tol = atol + rtol*fabs(arr2[i]);
    if (diff > tol) {
      printf("Error at %d: %f != %f (diff=%e, tol=%e)\n", i, arr1[i], arr2[i], diff, tol);
      exit(1);
    }
  }
}
// ==========================

// Tuned kernel: each block of 16x16 threads computes a 32x32 output tile,
// with every thread accumulating a 2x2 output fragment in registers.
// The K dimension is consumed in slices of TILE_K = 16, so that the two loads
// each thread issues per operand fill the shared tiles completely.
#define TILE_MN 32
#define THREADS_PER_DIM 16
#define TILE_K THREADS_PER_DIM
static_assert(TILE_MN == 2 * THREADS_PER_DIM, "Each thread covers a 2x2 fragment of the tile");

__global__ __launch_bounds__(THREADS_PER_DIM*THREADS_PER_DIM, 2)
void batchedMatMul(const float* __restrict__ M,
                   const float* __restrict__ N,
                         float* __restrict__ P,
                   int m, int k, int n, int batch)
{
  // Which batch do we work on
  const int b = blockIdx.z;
  if (b >= batch) return;

  // Shared tiles: 32 rows of M by 16 along K, and 16 along K by 32 columns of N
  __shared__ float tileM[TILE_MN][TILE_K];
  __shared__ float tileN[TILE_K][TILE_MN];

  // Base pointers for this batch (offset computed in size_t to avoid int overflow)
  const float* __restrict__ Nb = N + (size_t)b * k * n;
  float* __restrict__ Pb = P + (size_t)b * m * n;

  // 2x2 register-tiling per thread
  const int row0 = blockIdx.y * TILE_MN + threadIdx.y;
  const int row1 = row0 + THREADS_PER_DIM; // +16
  const int col0 = blockIdx.x * TILE_MN + threadIdx.x;
  const int col1 = col0 + THREADS_PER_DIM; // +16

  float c00 = 0.f, c01 = 0.f, c10 = 0.f, c11 = 0.f;

  const int numTiles = (k + TILE_K - 1) / TILE_K;

  for (int t = 0; t < numTiles; ++t) {
    const int kBase = t * TILE_K;

    // --- Load tile of M (two rows per thread) ---
    int mCol = kBase + threadIdx.x; // along K
    // first half of tile rows
    if (row0 < m && mCol < k) {
      tileM[threadIdx.y][threadIdx.x] = M[row0 * k + mCol];
    } else {
      tileM[threadIdx.y][threadIdx.x] = 0.f;
    }
    // second half of tile rows (+16)
    if (row1 < m && mCol < k) {
      tileM[threadIdx.y + THREADS_PER_DIM][threadIdx.x] = M[row1 * k + mCol];
    } else {
      tileM[threadIdx.y + THREADS_PER_DIM][threadIdx.x] = 0.f;
    }

    // --- Load tile of N (two cols per thread) ---
    int nRow = kBase + threadIdx.y; // along K
    // first half of tile cols
    if (nRow < k && col0 < n) {
      tileN[threadIdx.y][threadIdx.x] = Nb[nRow * n + col0];
    } else {
      tileN[threadIdx.y][threadIdx.x] = 0.f;
    }
    // second half of tile cols (+16)
    if (nRow < k && col1 < n) {
      tileN[threadIdx.y][threadIdx.x + THREADS_PER_DIM] = Nb[nRow * n + col1];
    } else {
      tileN[threadIdx.y][threadIdx.x + THREADS_PER_DIM] = 0.f;
    }

    __syncthreads();

    // --- Compute this tile's contribution (fully unrolled over the K slice) ---
    #pragma unroll
    for (int i = 0; i < TILE_K; ++i) {
      float a0 = tileM[threadIdx.y][i];
      float a1 = tileM[threadIdx.y + THREADS_PER_DIM][i];
      float b0 = tileN[i][threadIdx.x];
      float b1 = tileN[i][threadIdx.x + THREADS_PER_DIM];

      c00 += a0 * b0;
      c01 += a0 * b1;
      c10 += a1 * b0;
      c11 += a1 * b1;
    }

    __syncthreads();
  }

  // --- Write 2x2 results ---
  if (row0 < m && col0 < n) Pb[row0 * n + col0] = c00;
  if (row0 < m && col1 < n) Pb[row0 * n + col1] = c01;
  if (row1 < m && col0 < n) Pb[row1 * n + col0] = c10;
  if (row1 < m && col1 < n) Pb[row1 * n + col1] = c11;
}



int main(int argc, char** argv) {
  // === DO NOT CHANGE THIS ===
  if (argc != 6) {
    printf("Usage: %s <m> <k> <n> <batch> <seed>\n", argv[0]);
    exit(1);
  }

  int m = atoi(argv[1]); // rows of Ms and Ps
  int k = atoi(argv[2]); // cols of Ms, rows of Ns
  int n = atoi(argv[3]); // cols of Ns and Ps
  int batch = atoi(argv[4]); // number of matrix pairs
  unsigned int seed = (unsigned int)atoi(argv[5]); // seed for random initialization

  printf("Running batched matmul with m=%d, k=%d, n=%d, batch=%d, seed=%u\n", m, k, n, batch, seed);

  const int sizeM = m*k;
  const int sizeN = k*n*batch;
  const int sizeP = m*n*batch;

  float* M = (float*)malloc(sizeM * sizeof(float));
  float* N = (float*)malloc(sizeN * sizeof(float));
  float* P = (float*)malloc(sizeP * sizeof(float));

  initRandom(M, sizeM, seed);
  initRandom(N, sizeN, seed + 1);
  initWith(0.0f, P, sizeP);
  // ==========================

  // here, you can change anything
  float *M_d = nullptr;
  float *N_d = nullptr;
  float *P_d = nullptr;

  gpuErrchk(cudaMalloc((void**)&M_d, sizeM * sizeof(float)));
  gpuErrchk(cudaMalloc((void**)&N_d, sizeN * sizeof(float)));
  gpuErrchk(cudaMalloc((void**)&P_d, sizeP * sizeof(float)));

  gpuErrchk(cudaMemcpy(M_d, M, sizeM * sizeof(float), cudaMemcpyHostToDevice));
  gpuErrchk(cudaMemcpy(N_d, N, sizeN * sizeof(float), cudaMemcpyHostToDevice));
  // No need to copy P to device; kernel overwrites all computed elements

  // Grid/Block configuration: 16x16 threads; 32x32 tile per block
  dim3 blockSize(THREADS_PER_DIM, THREADS_PER_DIM);  // 256 threads
  dim3 numBlocks(
    (n + TILE_MN - 1) / TILE_MN,   // x: columns
    (m + TILE_MN - 1) / TILE_MN,   // y: rows
    batch                          // z: batches (parallelize!)
  );

  batchedMatMul<<<numBlocks, blockSize>>>(M_d, N_d, P_d, m, k, n, batch);
  gpuErrchk(cudaDeviceSynchronize());

  gpuErrchk(cudaMemcpy(P, P_d, sizeP * sizeof(float), cudaMemcpyDeviceToHost));

  // === DO NOT CHANGE THIS ===
  // However: once you know results are correct, you can temporarily
  //          comment this out if you want to test performance on large
  //          matrices, since the evaluation on CPU can get pretty slow.
  printf("Checking results on CPU...\n");
  float* P_host = (float*)malloc(sizeP * sizeof(float));
  initWith(0.0f, P_host, sizeP);
  batchedMatMulHost(M, N, P_host, m, k, n, batch);
  checkResult(P, P_host, m*n*batch);
  printf("All results matched, success!");
  // ==========================

  // here, you can change anything, e.g. add some logging
  gpuErrchk(cudaFree(M_d));
  gpuErrchk(cudaFree(N_d));
  gpuErrchk(cudaFree(P_d));

  free(M);
  free(N);
  free(P);
  free(P_host);

  return 0;
}
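
The launch configuration above pairs a 16x16 thread block with a 32x32 output tile, so the grid is sized by ceiling division over the tile width rather than over the block width. A minimal host-side sketch of that arithmetic follows, assuming `TILE_MN` is 32 as the launch comment states. The names `ceilDiv`, `Grid3` and `tiledGrid` are hypothetical and not part of the notebook; `Grid3` stands in for `dim3` so the snippet compiles without CUDA.

```cpp
#include <cassert>

// Hypothetical helper: number of tiles of width `tile` needed to cover `extent`.
inline int ceilDiv(int extent, int tile) {
  assert(extent >= 0 && tile > 0);
  return (extent + tile - 1) / tile;
}

// Stand-in for dim3 so that the sketch builds with a plain C++ compiler.
struct Grid3 {
  int x, y, z;
};

// Mirrors the numBlocks computation: x covers columns, y covers rows, z covers batches.
inline Grid3 tiledGrid(int m, int n, int batch, int tile = 32) {
  return Grid3{ceilDiv(n, tile), ceilDiv(m, tile), batch};
}
```

For the sizes used later in this notebook (m=256, n=384, batch=16) this gives a 12 x 8 x 16 grid. Sizes that are not multiples of 32 round up, which is why the kernel guards its final writes with `row < m && col < n`.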


In [None]:
# @title small & medium matrices, large batch (baseline): one matrix-sized 2D block per batch
%%writefile bmatmul.cpp
// DON'T CHANGE THIS ^^ FILENAME!
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

// utility for wrapping CUDA API calls and log any error they may return (use this for debugging)
#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
inline void gpuAssert(cudaError_t code, const char *file, int line, bool abort = true) {
  if (code != cudaSuccess) {
    fprintf(stderr,"GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
    if (abort)
      exit(code);
  }
}

// === DO NOT CHANGE THIS ===
// host-side version, used to validate results
__host__
void batchedMatMulHost(float* M, float* N, float* P, int m, int k, int n, int batch) {
  for (int b = 0; b < batch; b++) {
    for (int row = 0; row < m; row++) {
      for (int col = 0; col < n; col++) {
        float value = 0.0f;
        for (int i = 0; i < k; i++) {
          float a = M[row*k + i];
          float c = N[b*(k*n) + i*n + col];
          value += a * c;
        }
        P[b*(m*n) + row*n + col] = value;
      }
    }
  }
}

void initWith(float number, float* arr, int size) {
  for (int i = 0; i < size; i++)
    arr[i] = number;
}

void initRandom(float* arr, int size, unsigned int seed, float minVal = 0.0f, float maxVal = 1.0f) {
  srand(seed);
  for (int i = 0; i < size; i++) {
    float r = (float)rand() / RAND_MAX;
    arr[i] = minVal + r * (maxVal - minVal);
  }
}

void checkResult(float* arr1, float* arr2, int size) {
  const float atol = 1e-4f; // absolute tolerance for fp32 (lack of) associativity
  const float rtol = 1e-4f; // relative tolerance for fp32 (lack of) associativity
  for (int i = 0; i < size; i++) {
    float diff = fabs(arr1[i] - arr2[i]);
    float tol = atol + rtol*fabs(arr2[i]);
    if (diff > tol) {
      printf("Error at %d: %f != %f (diff=%e, tol=%e)\n", i, arr1[i], arr2[i], diff, tol);
      exit(1);
    }
  }
}
// ==========================

// this is the reference implementation
// you can change this to your heart's content
__global__
void batchedMatMul(float* M, float* N, float* P, int m, int k, int n, int batch) {

    int row = threadIdx.y;
    int col = threadIdx.x;

    if (row < m && col < n) {
      for (int b = 0; b < batch; b++) {
        float value = 0.0f;
        for (int i = 0; i < k; i++) {
          float a = M[row*k + i];
          float c = N[b*(k*n) + i*n + col];
          value += a * c;
        }
        P[b*(m*n) + row*n + col] = value;
      }
    }

}

int main(int argc, char** argv) {
  // === DO NOT CHANGE THIS ===
  if (argc != 6) {
    printf("Usage: %s <m> <k> <n> <batch> <seed>\n", argv[0]);
    exit(1);
  }

  int m = atoi(argv[1]); // rows of Ms and Ps
  int k = atoi(argv[2]); // cols of Ms, rows of Ns
  int n = atoi(argv[3]); // cols of Ns and Ps
  int batch = atoi(argv[4]); // number of matrix pairs
  unsigned int seed = (unsigned int)atoi(argv[5]); // seed for random initialization

  printf("Running batched matmul with m=%d, k=%d, n=%d, batch=%d, seed=%u\n", m, k, n, batch, seed);

  const int sizeM = m*k;
  const int sizeN = k*n*batch;
  const int sizeP = m*n*batch;

  float* M = (float*)malloc(sizeM * sizeof(float));
  float* N = (float*)malloc(sizeN * sizeof(float));
  float* P = (float*)malloc(sizeP * sizeof(float));

  initRandom(M, sizeM, seed);
  initRandom(N, sizeN, seed + 1);
  initWith(0.0f, P, sizeP);
  // ==========================

  // here, you can change anything
  float *M_d;
  float *N_d;
  float *P_d;

  gpuErrchk(cudaMalloc((void**)&M_d, sizeM * sizeof(float)));
  gpuErrchk(cudaMalloc((void**)&N_d, sizeN * sizeof(float)));
  gpuErrchk(cudaMalloc((void**)&P_d, sizeP * sizeof(float)));

  gpuErrchk(cudaMemcpy(M_d, M, sizeM * sizeof(float), cudaMemcpyHostToDevice));
  gpuErrchk(cudaMemcpy(N_d, N, sizeN * sizeof(float), cudaMemcpyHostToDevice));
  gpuErrchk(cudaMemcpy(P_d, P, sizeP * sizeof(float), cudaMemcpyHostToDevice));

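  dim3 blockSize(n, m);  // n × m threads per block; requires n*m <= 1024
  int numBlocks = batch; // note: each block loops over every batch, so the work is repeated in all blocks
  batchedMatMul<<<numBlocks, blockSize>>>(M_d, N_d, P_d, m, k, n, batch);
  gpuErrchk(cudaGetLastError());  // report an invalid launch configuration, if any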

  gpuErrchk(cudaDeviceSynchronize());

  gpuErrchk(cudaMemcpy(P, P_d, sizeP * sizeof(float), cudaMemcpyDeviceToHost));

  // === DO NOT CHANGE THIS ===
  // However: once you know results are correct, you can temporarily
  //          comment this out if you want to test performance on large
  //          matrices, since the evaluation on CPU can get pretty slow.
  printf("Checking results on CPU...\n");
  float* P_host = (float*)malloc(sizeP * sizeof(float));
  initWith(0.0f, P_host, sizeP);
  batchedMatMulHost(M, N, P_host, m, k, n, batch);
  checkResult(P, P_host, m*n*batch);
  printf("All results matched, success!");
  // ==========================

  // here, you can change anything, e.g. add some logging
  gpuErrchk(cudaFree(M_d));
  gpuErrchk(cudaFree(N_d));
  gpuErrchk(cudaFree(P_d));

  free(M);
  free(N);
  free(P);
  free(P_host);

  return 0;
}
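
In the baseline above, `numBlocks = batch` blocks are launched, yet the kernel indexes only by `threadIdx` and loops over every `b`, so each block appears to recompute the whole output. The following sketch counts the multiply-adds under that reading. The function names are hypothetical, and the count ignores everything except the inner-loop arithmetic.

```cpp
#include <cassert>
#include <cstdint>

// Multiply-adds the batched product actually needs:
// m*n outputs per batch, k terms per output.
inline std::int64_t usefulMacs(int m, int k, int n, int batch) {
  assert(m >= 0 && k >= 0 && n >= 0 && batch >= 0);
  return static_cast<std::int64_t>(m) * k * n * batch;
}

// Multiply-adds executed when each of `numBlocks` blocks runs the full
// loop over all batches, as the baseline kernel does.
inline std::int64_t baselineMacs(int m, int k, int n, int batch, int numBlocks) {
  return static_cast<std::int64_t>(numBlocks) * usefulMacs(m, k, n, batch);
}
```

With `numBlocks == batch` the executed work is `batch` times the useful work, which is the redundancy that the later variants remove by giving each batch its own `threadIdx.z` or `blockIdx.z`.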


In [None]:
# @title small matrices, large batch (90% kernel time cut): a single 3D block with the batch on the z dimension
%%writefile bmatmul.cpp
// DON'T CHANGE THIS ^^ FILENAME!
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

// utility for wrapping CUDA API calls and log any error they may return (use this for debugging)
#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
inline void gpuAssert(cudaError_t code, const char *file, int line, bool abort = true) {
  if (code != cudaSuccess) {
    fprintf(stderr,"GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
    if (abort)
      exit(code);
  }
}

// === DO NOT CHANGE THIS ===
// host-side version, used to validate results
__host__
void batchedMatMulHost(float* M, float* N, float* P, int m, int k, int n, int batch) {
  for (int b = 0; b < batch; b++) {
    for (int row = 0; row < m; row++) {
      for (int col = 0; col < n; col++) {
        float value = 0.0f;
        for (int i = 0; i < k; i++) {
          float a = M[row*k + i];
          float c = N[b*(k*n) + i*n + col];
          value += a * c;
        }
        P[b*(m*n) + row*n + col] = value;
      }
    }
  }
}

void initWith(float number, float* arr, int size) {
  for (int i = 0; i < size; i++)
    arr[i] = number;
}

void initRandom(float* arr, int size, unsigned int seed, float minVal = 0.0f, float maxVal = 1.0f) {
  srand(seed);
  for (int i = 0; i < size; i++) {
    float r = (float)rand() / RAND_MAX;
    arr[i] = minVal + r * (maxVal - minVal);
  }
}

void checkResult(float* arr1, float* arr2, int size) {
  const float atol = 1e-4f; // absolute tolerance for fp32 (lack of) associativity
  const float rtol = 1e-4f; // relative tolerance for fp32 (lack of) associativity
  for (int i = 0; i < size; i++) {
    float diff = fabs(arr1[i] - arr2[i]);
    float tol = atol + rtol*fabs(arr2[i]);
    if (diff > tol) {
      printf("Error at %d: %f != %f (diff=%e, tol=%e)\n", i, arr1[i], arr2[i], diff, tol);
      exit(1);
    }
  }
}
// ==========================

// this is the reference implementation
// you can change this to your heart's content

__global__
void batchedMatMul(float* M, float* N, float* P, int m, int k, int n, int batch) {

    int row = threadIdx.y;
    int col = threadIdx.x;
    int b   = threadIdx.z;  // use z dimension for batch index

    if (row < m && col < n && b < batch) {
      float value = 0.0f;
      for (int i = 0; i < k; i++) {
        float a = M[row*k + i];
        float c = N[b*(k*n) + i*n + col];
        value += a * c;
      }
      P[b*(m*n) + row*n + col] = value;
    }

}

int main(int argc, char** argv) {
  // === DO NOT CHANGE THIS ===
  if (argc != 6) {
    printf("Usage: %s <m> <k> <n> <batch> <seed>\n", argv[0]);
    exit(1);
  }

  int m = atoi(argv[1]); // rows of Ms and Ps
  int k = atoi(argv[2]); // cols of Ms, rows of Ns
  int n = atoi(argv[3]); // cols of Ns and Ps
  int batch = atoi(argv[4]); // number of matrix pairs
  unsigned int seed = (unsigned int)atoi(argv[5]); // seed for random initialization

  printf("Running batched matmul with m=%d, k=%d, n=%d, batch=%d, seed=%u\n", m, k, n, batch, seed);

  const int sizeM = m*k;
  const int sizeN = k*n*batch;
  const int sizeP = m*n*batch;

  float* M = (float*)malloc(sizeM * sizeof(float));
  float* N = (float*)malloc(sizeN * sizeof(float));
  float* P = (float*)malloc(sizeP * sizeof(float));

  initRandom(M, sizeM, seed);
  initRandom(N, sizeN, seed + 1);
  initWith(0.0f, P, sizeP);
  // ==========================

  // here, you can change anything

  float *M_d;
  float *N_d;
  float *P_d;

  gpuErrchk(cudaMalloc((void**)&M_d, sizeM * sizeof(float)));
  gpuErrchk(cudaMalloc((void**)&N_d, sizeN * sizeof(float)));
  gpuErrchk(cudaMalloc((void**)&P_d, sizeP * sizeof(float)));

  gpuErrchk(cudaMemcpy(M_d, M, sizeM * sizeof(float), cudaMemcpyHostToDevice));
  gpuErrchk(cudaMemcpy(N_d, N, sizeN * sizeof(float), cudaMemcpyHostToDevice));
  gpuErrchk(cudaMemcpy(P_d, P, sizeP * sizeof(float), cudaMemcpyHostToDevice));

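  dim3 blockSize(n, m, batch);  // one block of n × m × batch threads; requires n*m*batch <= 1024 and batch <= 64
  dim3 gridSize(1, 1, 1);
  batchedMatMul<<<gridSize, blockSize>>>(M_d, N_d, P_d, m, k, n, batch);
  gpuErrchk(cudaGetLastError());  // report an invalid launch configuration, if any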

  gpuErrchk(cudaDeviceSynchronize());

  gpuErrchk(cudaMemcpy(P, P_d, sizeP * sizeof(float), cudaMemcpyDeviceToHost));

  // === DO NOT CHANGE THIS ===
  // However: once you know results are correct, you can temporarily
  //          comment this out if you want to test performance on large
  //          matrices, since the evaluation on CPU can get pretty slow.
  printf("Checking results on CPU...\n");
  float* P_host = (float*)malloc(sizeP * sizeof(float));
  initWith(0.0f, P_host, sizeP);
  batchedMatMulHost(M, N, P_host, m, k, n, batch);
  checkResult(P, P_host, m*n*batch);
  printf("All results matched, success!");
  // ==========================

  // here, you can change anything, e.g. add some logging
  gpuErrchk(cudaFree(M_d));
  gpuErrchk(cudaFree(N_d));
  gpuErrchk(cudaFree(P_d));

  free(M);
  free(N);
  free(P);
  free(P_host);

  return 0;
}
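
The single-block variant above only launches when the whole problem fits inside one thread block. A small host-side check of that condition is sketched below, assuming the per-block limits of compute capability 7.5 (the `-arch=sm_75` target used in this notebook): at most 1024 threads per block, at most 1024 along x and y, and at most 64 along z. `fitsInOneBlock` is a hypothetical helper, not part of the assignment code.

```cpp
#include <cassert>

// Assumed per-block limits for compute capability 7.5.
constexpr int kMaxThreadsPerBlock = 1024;
constexpr int kMaxBlockDimXY = 1024;
constexpr int kMaxBlockDimZ = 64;

// Hypothetical check: can dim3 blockSize(n, m, batch) be launched as one block?
inline bool fitsInOneBlock(int m, int n, int batch) {
  assert(m > 0 && n > 0 && batch > 0);
  if (n > kMaxBlockDimXY || m > kMaxBlockDimXY) return false;
  if (batch > kMaxBlockDimZ) return false;
  return static_cast<long long>(m) * n * batch <= kMaxThreadsPerBlock;
}
```

For example, 8x8 matrices with a batch of 16 fill the block exactly, while a batch of 17 no longer fits. This is why the approach is limited to small matrices and modest batch sizes.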


In [None]:
# @title medium matrices, large batch: one block per batch, distributed along the grid z dimension
%%writefile bmatmul.cpp
// DON'T CHANGE THIS ^^ FILENAME!
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

// utility for wrapping CUDA API calls and log any error they may return (use this for debugging)
#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
inline void gpuAssert(cudaError_t code, const char *file, int line, bool abort = true) {
  if (code != cudaSuccess) {
    fprintf(stderr,"GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
    if (abort)
      exit(code);
  }
}

// === DO NOT CHANGE THIS ===
// host-side version, used to validate results
__host__
void batchedMatMulHost(float* M, float* N, float* P, int m, int k, int n, int batch) {
  for (int b = 0; b < batch; b++) {
    for (int row = 0; row < m; row++) {
      for (int col = 0; col < n; col++) {
        float value = 0.0f;
        for (int i = 0; i < k; i++) {
          float a = M[row*k + i];
          float c = N[b*(k*n) + i*n + col];
          value += a * c;
        }
        P[b*(m*n) + row*n + col] = value;
      }
    }
  }
}

void initWith(float number, float* arr, int size) {
  for (int i = 0; i < size; i++)
    arr[i] = number;
}

void initRandom(float* arr, int size, unsigned int seed, float minVal = 0.0f, float maxVal = 1.0f) {
  srand(seed);
  for (int i = 0; i < size; i++) {
    float r = (float)rand() / RAND_MAX;
    arr[i] = minVal + r * (maxVal - minVal);
  }
}

void checkResult(float* arr1, float* arr2, int size) {
  const float atol = 1e-4f; // absolute tolerance for fp32 (lack of) associativity
  const float rtol = 1e-4f; // relative tolerance for fp32 (lack of) associativity
  for (int i = 0; i < size; i++) {
    float diff = fabs(arr1[i] - arr2[i]);
    float tol = atol + rtol*fabs(arr2[i]);
    if (diff > tol) {
      printf("Error at %d: %f != %f (diff=%e, tol=%e)\n", i, arr1[i], arr2[i], diff, tol);
      exit(1);
    }
  }
}
// ==========================

// this is the reference implementation
// you can change this to your heart's content

__global__
void batchedMatMul(float* M, float* N, float* P, int m, int k, int n, int batch) {

    int row = threadIdx.y;
    int col = threadIdx.x;
    int b   = blockIdx.z;  // use z dimension for batch index

    if (row < m && col < n && b < batch) {
      float value = 0.0f;
      for (int i = 0; i < k; i++) {
        float a = M[row*k + i];
        float c = N[b*(k*n) + i*n + col];
        value += a * c;
      }
      P[b*(m*n) + row*n + col] = value;
    }

}

int main(int argc, char** argv) {
  // === DO NOT CHANGE THIS ===
  if (argc != 6) {
    printf("Usage: %s <m> <k> <n> <batch> <seed>\n", argv[0]);
    exit(1);
  }

  int m = atoi(argv[1]); // rows of Ms and Ps
  int k = atoi(argv[2]); // cols of Ms, rows of Ns
  int n = atoi(argv[3]); // cols of Ns and Ps
  int batch = atoi(argv[4]); // number of matrix pairs
  unsigned int seed = (unsigned int)atoi(argv[5]); // seed for random initialization

  printf("Running batched matmul with m=%d, k=%d, n=%d, batch=%d, seed=%u\n", m, k, n, batch, seed);

  const int sizeM = m*k;
  const int sizeN = k*n*batch;
  const int sizeP = m*n*batch;

  float* M = (float*)malloc(sizeM * sizeof(float));
  float* N = (float*)malloc(sizeN * sizeof(float));
  float* P = (float*)malloc(sizeP * sizeof(float));

  initRandom(M, sizeM, seed);
  initRandom(N, sizeN, seed + 1);
  initWith(0.0f, P, sizeP);
  // ==========================

  // here, you can change anything

  float *M_d;
  float *N_d;
  float *P_d;

  gpuErrchk(cudaMalloc((void**)&M_d, sizeM * sizeof(float)));
  gpuErrchk(cudaMalloc((void**)&N_d, sizeN * sizeof(float)));
  gpuErrchk(cudaMalloc((void**)&P_d, sizeP * sizeof(float)));

  gpuErrchk(cudaMemcpy(M_d, M, sizeM * sizeof(float), cudaMemcpyHostToDevice));
  gpuErrchk(cudaMemcpy(N_d, N, sizeN * sizeof(float), cudaMemcpyHostToDevice));
  gpuErrchk(cudaMemcpy(P_d, P, sizeP * sizeof(float), cudaMemcpyHostToDevice));

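  dim3 blockSize(n, m);        // n × m threads per block; requires n*m <= 1024
  dim3 gridSize(1, 1, batch);  // one block per batch along grid z
  batchedMatMul<<<gridSize, blockSize>>>(M_d, N_d, P_d, m, k, n, batch);
  gpuErrchk(cudaGetLastError());  // report an invalid launch configuration, if any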

  gpuErrchk(cudaDeviceSynchronize());

  gpuErrchk(cudaMemcpy(P, P_d, sizeP * sizeof(float), cudaMemcpyDeviceToHost));

  // === DO NOT CHANGE THIS ===
  // However: once you know results are correct, you can temporarily
  //          comment this out if you want to test performance on large
  //          matrices, since the evaluation on CPU can get pretty slow.
  printf("Checking results on CPU...\n");
  float* P_host = (float*)malloc(sizeP * sizeof(float));
  initWith(0.0f, P_host, sizeP);
  batchedMatMulHost(M, N, P_host, m, k, n, batch);
  checkResult(P, P_host, m*n*batch);
  printf("All results matched, success!");
  // ==========================

  // here, you can change anything, e.g. add some logging
  gpuErrchk(cudaFree(M_d));
  gpuErrchk(cudaFree(N_d));
  gpuErrchk(cudaFree(P_d));

  free(M);
  free(N);
  free(P);
  free(P_host);

  return 0;
}


In [None]:
# @title medium matrices large batch - M loaded in shared memory (no benefit)
%%writefile bmatmul.cpp
// DON'T CHANGE THIS ^^ FILENAME!
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

// utility for wrapping CUDA API calls and log any error they may return (use this for debugging)
#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
inline void gpuAssert(cudaError_t code, const char *file, int line, bool abort = true) {
  if (code != cudaSuccess) {
    fprintf(stderr,"GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
    if (abort)
      exit(code);
  }
}

// === DO NOT CHANGE THIS ===
// host-side version, used to validate results
__host__
void batchedMatMulHost(float* M, float* N, float* P, int m, int k, int n, int batch) {
  for (int b = 0; b < batch; b++) {
    for (int row = 0; row < m; row++) {
      for (int col = 0; col < n; col++) {
        float value = 0.0f;
        for (int i = 0; i < k; i++) {
          float a = M[row*k + i];
          float c = N[b*(k*n) + i*n + col];
          value += a * c;
        }
        P[b*(m*n) + row*n + col] = value;
      }
    }
  }
}

void initWith(float number, float* arr, int size) {
  for (int i = 0; i < size; i++)
    arr[i] = number;
}

void initRandom(float* arr, int size, unsigned int seed, float minVal = 0.0f, float maxVal = 1.0f) {
  srand(seed);
  for (int i = 0; i < size; i++) {
    float r = (float)rand() / RAND_MAX;
    arr[i] = minVal + r * (maxVal - minVal);
  }
}

void checkResult(float* arr1, float* arr2, int size) {
  const float atol = 1e-4f; // absolute tolerance for fp32 (lack of) associativity
  const float rtol = 1e-4f; // relative tolerance for fp32 (lack of) associativity
  for (int i = 0; i < size; i++) {
    float diff = fabs(arr1[i] - arr2[i]);
    float tol = atol + rtol*fabs(arr2[i]);
    if (diff > tol) {
      printf("Error at %d: %f != %f (diff=%e, tol=%e)\n", i, arr1[i], arr2[i], diff, tol);
      exit(1);
    }
  }
}
// ==========================

// this is the reference implementation
// you can change this to your heart's content

__global__
void batchedMatMul(const float* M, const float* N, float* P,
                   int m, int k, int n, int batch) {

    __shared__ float Ms[1024];  // holds all of M; only valid when m*k <= 1024

    int row = threadIdx.y;
    int col = threadIdx.x;
    int b   = blockIdx.z;  // batch index (one block per batch along grid z)

    // Step 1. Cooperative load of M into shared memory
    // Each thread participates in loading M once
    for (int idx = threadIdx.y * blockDim.x + threadIdx.x;
         idx < m*k;
         idx += blockDim.x * blockDim.y)
    {
        Ms[idx] = M[idx];
    }

    __syncthreads();  // ensure M is fully loaded

    // Step 2. Compute this block's batch of P
    if (row < m && col < n && b < batch) {
        float value = 0.0f;
        for (int i = 0; i < k; ++i) {
            float a = Ms[row*k + i];               // from shared memory
            float c = N[b*(k*n) + i*n + col];      // from global memory
            value += a * c;
        }
        P[b*(m*n) + row*n + col] = value;
    }
}

int main(int argc, char** argv) {
  // === DO NOT CHANGE THIS ===
  if (argc != 6) {
    printf("Usage: %s <m> <k> <n> <batch> <seed>\n", argv[0]);
    exit(1);
  }

  int m = atoi(argv[1]); // rows of Ms and Ps
  int k = atoi(argv[2]); // cols of Ms, rows of Ns
  int n = atoi(argv[3]); // cols of Ns and Ps
  int batch = atoi(argv[4]); // number of matrix pairs
  unsigned int seed = (unsigned int)atoi(argv[5]); // seed for random initialization

  printf("Running batched matmul with m=%d, k=%d, n=%d, batch=%d, seed=%u\n", m, k, n, batch, seed);

  const int sizeM = m*k;
  const int sizeN = k*n*batch;
  const int sizeP = m*n*batch;

  float* M = (float*)malloc(sizeM * sizeof(float));
  float* N = (float*)malloc(sizeN * sizeof(float));
  float* P = (float*)malloc(sizeP * sizeof(float));

  initRandom(M, sizeM, seed);
  initRandom(N, sizeN, seed + 1);
  initWith(0.0f, P, sizeP);
  // ==========================

  // here, you can change anything
  float *M_d;
  float *N_d;
  float *P_d;

  gpuErrchk(cudaMalloc((void**)&M_d, sizeM * sizeof(float)));
  gpuErrchk(cudaMalloc((void**)&N_d, sizeN * sizeof(float)));
  gpuErrchk(cudaMalloc((void**)&P_d, sizeP * sizeof(float)));

  gpuErrchk(cudaMemcpy(M_d, M, sizeM * sizeof(float), cudaMemcpyHostToDevice));
  gpuErrchk(cudaMemcpy(N_d, N, sizeN * sizeof(float), cudaMemcpyHostToDevice));
  gpuErrchk(cudaMemcpy(P_d, P, sizeP * sizeof(float), cudaMemcpyHostToDevice));

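  dim3 blockSize(n, m);  // n × m threads per block; requires n*m <= 1024
  dim3 gridSize(1,1,batch);  // one block per batch along grid z
  batchedMatMul<<<gridSize, blockSize>>>(M_d, N_d, P_d, m, k, n, batch);
  gpuErrchk(cudaGetLastError());  // report an invalid launch configuration, if any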

  gpuErrchk(cudaDeviceSynchronize());

  gpuErrchk(cudaMemcpy(P, P_d, sizeP * sizeof(float), cudaMemcpyDeviceToHost));

  // === DO NOT CHANGE THIS ===
  // However: once you know results are correct, you can temporarily
  //          comment this out if you want to test performance on large
  //          matrices, since the evaluation on CPU can get pretty slow.
  printf("Checking results on CPU...\n");
  float* P_host = (float*)malloc(sizeP * sizeof(float));
  initWith(0.0f, P_host, sizeP);
  batchedMatMulHost(M, N, P_host, m, k, n, batch);
  checkResult(P, P_host, m*n*batch);
  printf("All results matched, success!");
  // ==========================

  // here, you can change anything, e.g. add some logging
  gpuErrchk(cudaFree(M_d));
  gpuErrchk(cudaFree(N_d));
  gpuErrchk(cudaFree(P_d));

  free(M);
  free(N);
  free(P);
  free(P_host);

  return 0;
}
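
The cooperative load in the kernel above is a block-stride loop: thread `tid` copies elements `tid`, `tid + nthreads`, `tid + 2*nthreads`, and so on. The sketch below simulates that loop on the host to show that every element is written exactly once, and adds the capacity condition implied by `Ms[1024]`. Both function names are hypothetical, and the simulation runs the "threads" sequentially, which says nothing about timing.

```cpp
#include <cassert>
#include <vector>

// Simulates the block-stride load: returns how many times each of the
// `total` elements is written when `nthreads` threads share the copy.
inline std::vector<int> strideLoadCounts(int total, int nthreads) {
  assert(total >= 0 && nthreads > 0);
  std::vector<int> counts(total, 0);
  for (int tid = 0; tid < nthreads; ++tid) {
    for (int idx = tid; idx < total; idx += nthreads) {
      ++counts[idx];
    }
  }
  return counts;
}

// The kernel's shared array holds 1024 floats, so all of M must fit in it.
inline bool fitsInShared(int m, int k, int capacity = 1024) {
  return static_cast<long long>(m) * k <= capacity;
}
```

A host-side call such as `fitsInShared(m, k)` before the launch would catch shapes like m=256, k=512, for which the load loop would write past the end of `Ms`.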


In [None]:
# @title large matrices small batch - failure
%%writefile bmatmul.cpp
// DON'T CHANGE THIS ^^ FILENAME!
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

// utility for wrapping CUDA API calls and log any error they may return (use this for debugging)
#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
inline void gpuAssert(cudaError_t code, const char *file, int line, bool abort = true) {
  if (code != cudaSuccess) {
    fprintf(stderr,"GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
    if (abort)
      exit(code);
  }
}

// === DO NOT CHANGE THIS ===
// host-side version, used to validate results
__host__
void batchedMatMulHost(float* M, float* N, float* P, int m, int k, int n, int batch) {
  for (int b = 0; b < batch; b++) {
    for (int row = 0; row < m; row++) {
      for (int col = 0; col < n; col++) {
        float value = 0.0f;
        for (int i = 0; i < k; i++) {
          float a = M[row*k + i];
          float c = N[b*(k*n) + i*n + col];
          value += a * c;
        }
        P[b*(m*n) + row*n + col] = value;
      }
    }
  }
}

void initWith(float number, float* arr, int size) {
  for (int i = 0; i < size; i++)
    arr[i] = number;
}

void initRandom(float* arr, int size, unsigned int seed, float minVal = 0.0f, float maxVal = 1.0f) {
  srand(seed);
  for (int i = 0; i < size; i++) {
    float r = (float)rand() / RAND_MAX;
    arr[i] = minVal + r * (maxVal - minVal);
  }
}

void checkResult(float* arr1, float* arr2, int size) {
  const float atol = 1e-4f; // absolute tolerance for fp32 (lack of) associativity
  const float rtol = 1e-4f; // relative tolerance for fp32 (lack of) associativity
  for (int i = 0; i < size; i++) {
    float diff = fabs(arr1[i] - arr2[i]);
    float tol = atol + rtol*fabs(arr2[i]);
    if (diff > tol) {
      printf("Error at %d: %f != %f (diff=%e, tol=%e)\n", i, arr1[i], arr2[i], diff, tol);
      exit(1);
    }
  }
}
// ==========================

// this is the reference implementation
// you can change this to your heart's content

__global__
void batchedMatMul(float* M, float* N, float* P, int m, int k, int n, int batch) {

    int row = threadIdx.x;  // one thread per row of P
    int col = blockIdx.y;   // one block per column of P
    int b   = blockIdx.z;   // use z dimension for batch index

    if (row < m && col < n && b < batch) {
      float value = 0.0f;
      for (int i = 0; i < k; i++) {
        float a = M[row*k + i];
        float c = N[b*(k*n) + i*n + col];
        value += a * c;
      }
      P[b*(m*n) + row*n + col] = value;
    }

}

int main(int argc, char** argv) {
  // === DO NOT CHANGE THIS ===
  if (argc != 6) {
    printf("Usage: %s <m> <k> <n> <batch> <seed>\n", argv[0]);
    exit(1);
  }

  int m = atoi(argv[1]); // rows of Ms and Ps
  int k = atoi(argv[2]); // cols of Ms, rows of Ns
  int n = atoi(argv[3]); // cols of Ns and Ps
  int batch = atoi(argv[4]); // number of matrix pairs
  unsigned int seed = (unsigned int)atoi(argv[5]); // seed for random initialization

  printf("Running batched matmul with m=%d, k=%d, n=%d, batch=%d, seed=%u\n", m, k, n, batch, seed);

  const int sizeM = m*k;
  const int sizeN = k*n*batch;
  const int sizeP = m*n*batch;

  float* M = (float*)malloc(sizeM * sizeof(float));
  float* N = (float*)malloc(sizeN * sizeof(float));
  float* P = (float*)malloc(sizeP * sizeof(float));

  initRandom(M, sizeM, seed);
  initRandom(N, sizeN, seed + 1);
  initWith(0.0f, P, sizeP);
  // ==========================

  // here, you can change anything

  float *M_d;
  float *N_d;
  float *P_d;

  gpuErrchk(cudaMalloc((void**)&M_d, sizeM * sizeof(float)));
  gpuErrchk(cudaMalloc((void**)&N_d, sizeN * sizeof(float)));
  gpuErrchk(cudaMalloc((void**)&P_d, sizeP * sizeof(float)));

  gpuErrchk(cudaMemcpy(M_d, M, sizeM * sizeof(float), cudaMemcpyHostToDevice));
  gpuErrchk(cudaMemcpy(N_d, N, sizeN * sizeof(float), cudaMemcpyHostToDevice));
  gpuErrchk(cudaMemcpy(P_d, P, sizeP * sizeof(float), cudaMemcpyHostToDevice));

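  dim3 blockSize(m, 1, 1);     // one thread per row; requires m <= 1024
  dim3 gridSize(1, n, batch);  // one block per (column, batch) pair
  batchedMatMul<<<gridSize, blockSize>>>(M_d, N_d, P_d, m, k, n, batch);
  gpuErrchk(cudaGetLastError());  // report an invalid launch configuration, if any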

  gpuErrchk(cudaDeviceSynchronize());

  gpuErrchk(cudaMemcpy(P, P_d, sizeP * sizeof(float), cudaMemcpyDeviceToHost));

  // === DO NOT CHANGE THIS ===
  // However: once you know results are correct, you can temporarily
  //          comment this out if you want to test performance on large
  //          matrices, since the evaluation on CPU can get pretty slow.
  printf("Checking results on CPU...\n");
  float* P_host = (float*)malloc(sizeP * sizeof(float));
  initWith(0.0f, P_host, sizeP);
  batchedMatMulHost(M, N, P_host, m, k, n, batch);
  checkResult(P, P_host, m*n*batch);
  printf("All results matched, success!");
  // ==========================

  // here, you can change anything, e.g. add some logging
  gpuErrchk(cudaFree(M_d));
  gpuErrchk(cudaFree(N_d));
  gpuErrchk(cudaFree(P_d));

  free(M);
  free(N);
  free(P);
  free(P_host);

  return 0;
}
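
The title does not say how this variant failed, but one property of its indexing is visible from the code: `threadIdx.x` selects the row, so neighbouring threads write to `P` addresses that are `n` elements apart instead of adjacent ones. The sketch below computes that distance from the flat index used by every kernel in this notebook. The function names are hypothetical, and whether the stride explains the observed failure is an assumption.

```cpp
#include <cassert>

// Flat index into P, as used by the kernels: P[b*(m*n) + row*n + col].
inline long long pIndex(int b, int row, int col, int m, int n) {
  return static_cast<long long>(b) * m * n + static_cast<long long>(row) * n + col;
}

// Distance in elements between the P addresses of threads t and t+1
// when threadIdx.x is mapped to the row, as in the kernel above.
inline long long strideRowMapped(int m, int n) {
  return pIndex(0, 1, 0, m, n) - pIndex(0, 0, 0, m, n);
}

// The same distance when threadIdx.x is mapped to the column instead.
inline long long strideColMapped(int m, int n) {
  return pIndex(0, 0, 1, m, n) - pIndex(0, 0, 0, m, n);
}
```

For n=384 the row mapping gives a stride of 384 floats and the column mapping a stride of 1, which is the layout that lets a warp touch one contiguous segment of memory.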


In [None]:
# @title large matrices: 2D grid of blocks, batch loop inside each thread
%%writefile bmatmul.cpp
// DON'T CHANGE THIS ^^ FILENAME!
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

// utility for wrapping CUDA API calls and log any error they may return (use this for debugging)
#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
inline void gpuAssert(cudaError_t code, const char *file, int line, bool abort = true) {
  if (code != cudaSuccess) {
    fprintf(stderr,"GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
    if (abort)
      exit(code);
  }
}

// === DO NOT CHANGE THIS ===
// host-side version, used to validate results
__host__
void batchedMatMulHost(float* M, float* N, float* P, int m, int k, int n, int batch) {
  for (int b = 0; b < batch; b++) {
    for (int row = 0; row < m; row++) {
      for (int col = 0; col < n; col++) {
        float value = 0.0f;
        for (int i = 0; i < k; i++) {
          float a = M[row*k + i];
          float c = N[b*(k*n) + i*n + col];
          value += a * c;
        }
        P[b*(m*n) + row*n + col] = value;
      }
    }
  }
}

void initWith(float number, float* arr, int size) {
  for (int i = 0; i < size; i++)
    arr[i] = number;
}

void initRandom(float* arr, int size, unsigned int seed, float minVal = 0.0f, float maxVal = 1.0f) {
  srand(seed);
  for (int i = 0; i < size; i++) {
    float r = (float)rand() / RAND_MAX;
    arr[i] = minVal + r * (maxVal - minVal);
  }
}

void checkResult(float* arr1, float* arr2, int size) {
  const float atol = 1e-4f; // absolute tolerance for fp32 (lack of) associativity
  const float rtol = 1e-4f; // relative tolerance for fp32 (lack of) associativity
  for (int i = 0; i < size; i++) {
    float diff = fabs(arr1[i] - arr2[i]);
    float tol = atol + rtol*fabs(arr2[i]);
    if (diff > tol) {
      printf("Error at %d: %f != %f (diff=%e, tol=%e)\n", i, arr1[i], arr2[i], diff, tol);
      exit(1);
    }
  }
}
// ==========================

// this is the reference implementation
// you can change this to your heart's content
__global__
void batchedMatMul(float* M, float* N, float* P, int m, int k, int n, int batch) {
  int row = blockIdx.y*blockDim.y + threadIdx.y;
  int col = blockIdx.x*blockDim.x + threadIdx.x;

  if (row < m && col < n) {
    for (int b = 0; b < batch; b++) {
      float value = 0.0f;
      for (int i = 0; i < k; i++) {
        float a = M[row*k + i];
        float c = N[b*(k*n) + i*n + col];
        value += a * c;
      }
      P[b*(m*n) + row*n + col] = value;
    }
  }
}


int main(int argc, char** argv) {
  // === DO NOT CHANGE THIS ===
  if (argc != 6) {
    printf("Usage: %s <m> <k> <n> <batch> <seed>\n", argv[0]);
    exit(1);
  }

  int m = atoi(argv[1]); // rows of Ms and Ps
  int k = atoi(argv[2]); // cols of Ms, rows of Ns
  int n = atoi(argv[3]); // cols of Ns and Ps
  int batch = atoi(argv[4]); // number of matrix pairs
  unsigned int seed = (unsigned int)atoi(argv[5]); // seed for random initialization

  printf("Running batched matmul with m=%d, k=%d, n=%d, batch=%d, seed=%u\n", m, k, n, batch, seed);

  const int sizeM = m*k;
  const int sizeN = k*n*batch;
  const int sizeP = m*n*batch;

  float* M = (float*)malloc(sizeM * sizeof(float));
  float* N = (float*)malloc(sizeN * sizeof(float));
  float* P = (float*)malloc(sizeP * sizeof(float));

  initRandom(M, sizeM, seed);
  initRandom(N, sizeN, seed + 1);
  initWith(0.0f, P, sizeP);
  // ==========================

  // here, you can change anything
  float *M_d;
  float *N_d;
  float *P_d;

  gpuErrchk(cudaMalloc((void**)&M_d, sizeM * sizeof(float)));
  gpuErrchk(cudaMalloc((void**)&N_d, sizeN * sizeof(float)));
  gpuErrchk(cudaMalloc((void**)&P_d, sizeP * sizeof(float)));

  gpuErrchk(cudaMemcpy(M_d, M, sizeM * sizeof(float), cudaMemcpyHostToDevice));
  gpuErrchk(cudaMemcpy(N_d, N, sizeN * sizeof(float), cudaMemcpyHostToDevice));
  gpuErrchk(cudaMemcpy(P_d, P, sizeP * sizeof(float), cudaMemcpyHostToDevice));

  dim3 blockSize(54,18); // 54*18 = 972 threads per block; is 16x16 truly the best here?
  dim3 numBlocks((n + blockSize.x - 1) / blockSize.x, (m + blockSize.y - 1) / blockSize.y);

  batchedMatMul<<<numBlocks, blockSize>>>(M_d, N_d, P_d, m, k, n, batch);
  gpuErrchk(cudaDeviceSynchronize());

  gpuErrchk(cudaMemcpy(P, P_d, sizeP * sizeof(float), cudaMemcpyDeviceToHost));

  // === DO NOT CHANGE THIS ===
  // However: once you know results are correct, you can temporarily
  //          comment this out if you want to test performance on large
  //          matrices, since the evaluation on CPU can get pretty slow.
  printf("Checking results on CPU...\n");
  float* P_host = (float*)malloc(sizeP * sizeof(float));
  initWith(0.0f, P_host, sizeP);
  batchedMatMulHost(M, N, P_host, m, k, n, batch);
  checkResult(P, P_host, m*n*batch);
  printf("All results matched, success!");
  // ==========================

  // here, you can change anything, e.g. add some logging
  gpuErrchk(cudaFree(M_d));
  gpuErrchk(cudaFree(N_d));
  gpuErrchk(cudaFree(P_d));

  free(M);
  free(N);
  free(P);
  free(P_host);

  return 0;
}
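
The launch configuration in `main` rounds the grid up so that it covers the whole output, which means some threads may fall outside the matrix and some warps may be only partly filled. The sketch below makes that arithmetic explicit in plain host-side C++. It assumes the kernel maps one thread to one element of an $m \times n$ output (the kernel body is not shown here, so this is an assumption), and `LaunchShape` and `describeLaunch` are hypothetical helpers, not part of the notebook.

```cpp
#include <cassert>

// Rounding-up division, the same formula used for numBlocks in main.
int ceilDiv(int total, int block) {
  assert(total >= 0 && block > 0);
  return (total + block - 1) / block;
}

// Hypothetical summary of a 2D launch (not part of the notebook).
struct LaunchShape {
  int gridX, gridY;       // blocks along n (x) and along m (y)
  int threadsPerBlock;    // blockX * blockY
  int warpsPerBlock;      // warps needed to hold one block
  int lanesInLastWarp;    // active threads in the last warp of a block
  long long launched;     // threads launched over the whole grid
  long long useful;       // threads that land inside the m x n output
};

// Assumption: one thread per output element, x indexes columns (n),
// y indexes rows (m), as suggested by the numBlocks computation.
LaunchShape describeLaunch(int m, int n, int blockX, int blockY) {
  const int warpSize = 32;  // warp size on current NVIDIA GPUs
  LaunchShape s;
  s.gridX = ceilDiv(n, blockX);
  s.gridY = ceilDiv(m, blockY);
  s.threadsPerBlock = blockX * blockY;
  s.warpsPerBlock = ceilDiv(s.threadsPerBlock, warpSize);
  s.lanesInLastWarp = s.threadsPerBlock - (s.warpsPerBlock - 1) * warpSize;
  s.launched = (long long)s.gridX * s.gridY * s.threadsPerBlock;
  s.useful = (long long)m * n;
  return s;
}
```

For m=256 and n=384, `describeLaunch(256, 384, 54, 18)` gives an 8x15 grid of 972-thread blocks, that is 31 warps per block with only 12 active lanes in the last one, and 116,640 launched threads for 98,304 output elements. `describeLaunch(256, 384, 32, 32)` gives a 12x8 grid with full warps and no out-of-range threads. Fewer idle threads does not by itself guarantee a faster kernel, so the profiler remains the reference.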


## **Compile, Run, Profile**

Compile and run:

In [None]:
!mv bmatmul.cpp bmatmul.cu
!nvcc -arch=sm_75 bmatmul.cu -o bmatmul

In [None]:
!./bmatmul 256 512 384 16 119

Profile:

In [None]:
!nsys profile --stats=true ./bmatmul 256 512 384 16 119

Running batched matmul with m=256, k=512, n=384, batch=16, seed=119
Checking results on CPU...
All results matched, success!Generating '/tmp/nsys-report-3b93.qdstrm'
[1/8] [========================100%] report1.nsys-rep
[2/8] [========================100%] report1.sqlite
[3/8] Executing 'nvtx_sum' stats report
SKIPPED: /home/cuda/report1.sqlite does not contain NV Tools Extension (NVTX) data.
[4/8] Executing 'osrt_sum' stats report

 Time (%)  Total Time (ns)  Num Calls    Avg (ns)      Med (ns)     Min (ns)   Max (ns)    StdDev (ns)            Name         
 --------  ---------------  ---------  ------------  -------------  --------  -----------  ------------  ----------------------
     98.8    5,701,475,935         66  86,385,999.0  100,135,831.5    57,103  284,139,110  43,718,945.9  poll                  
      1.2       67,028,728        543     123,441.5       14,584.0       384   17,662,203   1,022,129.7  ioctl                 
      0.0        1,897,438         31      61,207.7       11,773.0     7,961    1,199,774     212,278.6  mmap64                
      0.0        1,108,245         10     110,824.5       56,247.0    25,249      382,985     116,560.1  sem_timedwait         
      0.0          490,750          1     490,750.0      490,750.0   490,750      490,750           0.0  pthread_cond_wait     
      0.0          430,390         49       8,783.5        7,673.0     1,866       29,380       4,288.5  open64                
      0.0          237,317         40       5,932.9        3,314.5     1,415       33,907       6,802.4  fopen                 
      0.0          167,330         15      11,155.3        5,610.0     1,530       57,251      13,956.3  mmap                  
      0.0          107,897          2      53,948.5       53,948.5    47,654       60,243       8,901.8  pthread_create        
      0.0           71,287         12       5,940.6        6,140.5     3,128        8,985       1,806.8  write                 
      0.0           52,856         33       1,601.7        1,177.0       729        6,200       1,268.2  fclose                
      0.0           37,455         20       1,872.8           45.5        45       36,517       8,154.4  fgets                 
      0.0           35,695         15       2,379.7        1,371.0       769       13,179       3,103.4  read                  
      0.0           32,629         64         509.8          510.5       160        1,232         205.2  fcntl                 
      0.0           31,885          6       5,314.2        5,309.0     1,402        8,591       2,624.8  open                  
      0.0           21,243          2      10,621.5       10,621.5     5,560       15,683       7,158.0  socket                
      0.0           20,519          5       4,103.8        4,564.0     2,171        5,648       1,342.4  munmap                
      0.0           18,954          3       6,318.0        6,960.0     3,692        8,302       2,371.1  pipe2                 
      0.0            9,431          1       9,431.0        9,431.0     9,431        9,431           0.0  connect               
      0.0            7,183          2       3,591.5        3,591.5     2,290        4,893       1,840.6  pthread_cond_broadcast
      0.0            5,886          2       2,943.0        2,943.0     2,853        3,033         127.3  fwrite                
      0.0            2,898          8         362.3          349.0       264          533          78.1  dup                   
      0.0            1,464          1       1,464.0        1,464.0     1,464        1,464           0.0  bind                  
      0.0              972          1         972.0          972.0       972          972           0.0  listen                

[5/8] Executing 'cuda_api_sum' stats report

 Time (%)  Total Time (ns)  Num Calls    Avg (ns)     Med (ns)    Min (ns)    Max (ns)   StdDev (ns)            Name         
 --------  ---------------  ---------  ------------  -----------  ---------  ----------  ------------  ----------------------
     86.2       93,793,097          3  31,264,365.7     89,168.0     75,632  93,628,297  54,008,749.2  cudaMalloc            
      6.9        7,521,965          1   7,521,965.0  7,521,965.0  7,521,965   7,521,965           0.0  cudaDeviceSynchronize
      5.7        6,247,854          4   1,561,963.5  1,595,584.5    142,418   2,914,267   1,132,750.6  cudaMemcpy            
      0.9        1,023,654          3     341,218.0    277,734.0    140,895     605,025     238,488.6  cudaFree              
      0.2          179,158          1     179,158.0    179,158.0    179,158     179,158           0.0  cudaLaunchKernel      
      0.0            1,820          1       1,820.0      1,820.0      1,820       1,820           0.0  cuModuleGetLoadingMode

[6/8] Executing 'cuda_gpu_kern_sum' stats report

 Time (%)  Total Time (ns)  Instances   Avg (ns)     Med (ns)    Min (ns)   Max (ns)   StdDev (ns)                              Name                            
 --------  ---------------  ---------  -----------  -----------  ---------  ---------  -----------  ------------------------------------------------------------
    100.0        7,513,035          1  7,513,035.0  7,513,035.0  7,513,035  7,513,035          0.0  batchedMatMul(float *, float *, float *, int, int, int, int)

[7/8] Executing 'cuda_gpu_mem_time_sum' stats report

 Time (%)  Total Time (ns)  Count   Avg (ns)     Med (ns)    Min (ns)   Max (ns)   StdDev (ns)      Operation     
 --------  ---------------  -----  -----------  -----------  ---------  ---------  -----------  ------------------
     78.3        4,121,069      3  1,373,689.7  1,364,036.0     45,983  2,711,050  1,332,559.7  [CUDA memcpy HtoD]
     21.7        1,143,626      1  1,143,626.0  1,143,626.0  1,143,626  1,143,626          0.0  [CUDA memcpy DtoH]

[8/8] Executing 'cuda_gpu_mem_size_sum' stats report

 Total (MB)  Count  Avg (MB)  Med (MB)  Min (MB)  Max (MB)  StdDev (MB)      Operation     
 ----------  -----  --------  --------  --------  --------  -----------  ------------------
     19.399      3     6.466     6.291     0.524    12.583        6.031  [CUDA memcpy HtoD]
      6.291      1     6.291     6.291     6.291     6.291        0.000  [CUDA memcpy DtoH]

Generated:
    /home/cuda/report1.nsys-rep
    /home/cuda/report1.sqlite
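
The raw times in the report are easier to compare across matrix sizes once they are turned into throughput. The sketch below is plain host-side C++ with hypothetical helper names (`batchedMatMulFlop`, `gigaPerSecond`). It assumes the usual count of $2 \cdot m \cdot k \cdot n$ floating-point operations per matrix product and relies on the fact that one unit per nanosecond equals one giga-unit per second.

```cpp
#include <cassert>

// Floating-point operations of the whole batch: each of the m*n outputs
// of each product needs k multiplications and k additions.
double batchedMatMulFlop(int m, int k, int n, int batch) {
  assert(m > 0 && k > 0 && n > 0 && batch > 0);
  return 2.0 * m * k * n * batch;  // computed in double to avoid int overflow
}

// amount / nanoseconds: FLOP/ns is GFLOP/s, bytes/ns is GB/s.
double gigaPerSecond(double amount, double nanoseconds) {
  assert(nanoseconds > 0.0);
  return amount / nanoseconds;
}
```

With the numbers of the run above, `gigaPerSecond(batchedMatMulFlop(256, 512, 384, 16), 7513035.0)` is roughly 214 GFLOP/s for the kernel. The transfers come out at about 4.7 GB/s host to device (19,398,656 bytes in 4,121,069 ns) and about 5.5 GB/s device to host (6,291,456 bytes in 1,143,626 ns).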

<a name="report"></a>
## **Brief Report**

**You must fill in this section!!**

Group information:
- member-1: NAME, SURNAME, PERSON CODE
- member-2: NAME, SURNAME, PERSON CODE
- member-3: NAME, SURNAME, PERSON CODE
- I BULBASAUR<br><img src="https://raw.githubusercontent.com/PokeAPI/sprites/master/sprites/pokemon/other/official-artwork/1.png" alt="Bulbasaur" width="120" border="0">

*Note: yes, groups can now have a logo - this is optional and merely for fun, if you don't feel like having one, no worries, in which case you may delete that itemize entry alongside this note :(*
<!-- if you reeeeeally don't have ideas for a logo, before giving up, check this out: https://picrew.me/en/image_maker/47882 -->

Bullet points describing what you did, each with a short motivation (some arguably stupid examples are given):
- used supershared L99 cache: this was the fastest way to raise the temperature and cook an egg on the GPU's heatsink
- pinned DRAM chips to the wall and asked them to be faster: they did not comply
- misread the assignment and implemented matrix diagonalization: now I have my own version of cuBLAS
- relied so heavily on blockIdx.z that the results tried to escape the HBM stack: we politely asked them to stay
- broke isolation and achieved privilege escalation in Colab by kindly asking Google's chief sysadmin, we can now "tweak" the profiler's report: this was outside what was discussed during lectures, but social engineering is easier than writing good code

*Note: ideally fewer than 8 entries of ~32 words each. More isn't necessarily better if nobody will read it.*

*Note: the subject is "the main things you came up with to improve the kernel".*




- Block size tuning (32x32 = 1024 threads per block): each block is made of 32 full warps and reaches the per-block thread limit, aiming at higher occupancy and better SM utilization than 16x16 = 256 threads