This project implements a high-performance matrix multiplication kernel in CUDA, comparing a naïve GPU implementation and a shared-memory tiled implementation against a CPU baseline.
On an NVIDIA Tesla T4, the optimized kernel achieves a 2,712x speedup over the CPU baseline for large matrix sizes (N = 4096).
Performance measured on an NVIDIA Tesla T4 (Google Colab) vs. an Intel Xeon CPU.
| Matrix Size (N) | CPU Baseline (ms) | GPU Naïve (ms) | GPU Tiled (ms) | Speedup (vs CPU) |
|---|---|---|---|---|
| 512 | 200.71 | 102.95 | 2.12 | 94x |
| 1024 | 3,495.47 | 13.06 | 9.41 | 371x |
| 2048 | 64,302.17 | 88.10 | 59.43 | 1,081x |
| 4096 | 712,979.44 | 461.99 | 262.83 | 2,712x |
Note: The CPU implementation is $O(N^3)$ and suffers from cache thrashing at large sizes. The GPU tiled implementation leverages shared memory to reuse data, effectively minimizing the memory-bandwidth bottleneck.
```
├── CMakeLists.txt        # Build configuration (detects CUDA/CXX)
├── README.md             # Documentation
├── benchmark/
│   ├── plot_results.py   # Python script to generate performance plots
│   └── benchmark_plot.png # Generated plot
├── include/
│   ├── matmul_cpu.hpp    # Header for CPU functions
│   └── matmul_gpu.hpp    # Header for GPU wrappers
└── src/
    ├── benchmark.cpp     # Main driver: runs tests & verifies correctness
    ├── matmul_cpu.cpp    # Standard O(N^3) CPU implementation
    └── matmul_gpu.cu     # CUDA kernels (naïve & tiled)
```
A standard triple-loop implementation. While accurate, it is limited by the serial nature of the CPU and by the high latency of DRAM access when handling large matrices.
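The baseline boils down to three nested loops. A minimal sketch (function and variable names are illustrative, not necessarily the project's actual API), assuming row-major square matrices:

```cpp
#include <cstddef>
#include <vector>

// Naive O(N^3) matmul: C = A * B for square N x N matrices in row-major order.
void matmul_cpu(const std::vector<float>& A, const std::vector<float>& B,
                std::vector<float>& C, std::size_t N) {
    for (std::size_t i = 0; i < N; ++i) {
        for (std::size_t j = 0; j < N; ++j) {
            float acc = 0.0f;
            for (std::size_t k = 0; k < N; ++k) {
                // Walking down a column of B strides by N floats per step,
                // which is what thrashes the cache at large N.
                acc += A[i * N + k] * B[k * N + j];
            }
            C[i * N + j] = acc;
        }
    }
}
```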
Maps one thread to one output element of C.
- Pros: Massive parallelism (thousands of threads active).
- Cons: Memory Bound. Each thread reads row A and column B directly from Global Memory, leading to redundant data fetches.
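The one-thread-per-element mapping can be sketched as follows (a hedged sketch; the kernel name and signature are illustrative, not the project's exact code):

```cuda
// Naive kernel: one thread computes one element of C. Every operand comes
// straight from global memory, so each row of A and each column of B is
// re-fetched by N different threads.
__global__ void matmul_naive(const float* A, const float* B, float* C, int N) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < N && col < N) {
        float acc = 0.0f;
        for (int k = 0; k < N; ++k)
            acc += A[row * N + k] * B[k * N + col];
        C[row * N + col] = acc;
    }
}
```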
Uses Shared Memory (L1 Cache equivalent) to reduce global memory traffic.
- Technique: Threads cooperatively load a tile of A and a tile of B into shared memory.
- Benefit: Data in shared memory is accessed ~16 times faster than global memory. This drastically improves the arithmetic intensity of the kernel.
- Synchronization: Uses `__syncthreads()` to ensure data consistency within the block during tile loading.
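Putting the tiling and synchronization steps together, a hedged sketch of the tiled kernel (the tile width and names are assumptions; the project's actual values may differ):

```cuda
#define TILE 16  // tile width; the project's actual value may differ

__global__ void matmul_tiled(const float* A, const float* B, float* C, int N) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < (N + TILE - 1) / TILE; ++t) {
        // Each thread stages one element of A and one of B into shared memory,
        // zero-padding when the tile hangs off the matrix edge.
        As[threadIdx.y][threadIdx.x] = (row < N && t * TILE + threadIdx.x < N)
            ? A[row * N + t * TILE + threadIdx.x] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] = (t * TILE + threadIdx.y < N && col < N)
            ? B[(t * TILE + threadIdx.y) * N + col] : 0.0f;
        __syncthreads();  // tile fully loaded before any thread reads it

        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();  // done with this tile before the next load overwrites it
    }

    if (row < N && col < N) C[row * N + col] = acc;
}
```

Each element of A and B is now read from global memory once per tile instead of once per thread, which is where the improved arithmetic intensity comes from.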
- CMake >= 3.10
- NVIDIA CUDA Toolkit (nvcc)
- C++ Compiler (g++ or clang)
```bash
# 1. Create build directory
mkdir build && cd build

# 2. Configure project (CMake will auto-detect the target GPU, e.g. T4)
cmake ..

# 3. Compile
make

# 4. Run the benchmark
./matmul_bench
```
- Implement cuBLAS comparison to see how close we are to vendor-optimized libraries.
- Add support for rectangular matrices (currently only square N×N matrices are supported).
- Implement Double Buffering to hide memory latency even further.
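For the cuBLAS comparison, the reference point would be `cublasSgemm`. A hedged sketch (pointer names are illustrative; since cuBLAS is column-major, a row-major C = A·B can be computed as Cᵀ = Bᵀ·Aᵀ by swapping the operand order):

```cuda
#include <cublas_v2.h>

// Row-major C = A * B via column-major cuBLAS, by swapping the operands.
// dA, dB, dC are device pointers to N*N floats; handle is a live cuBLAS handle.
void matmul_cublas(cublasHandle_t handle, const float* dA, const float* dB,
                   float* dC, int N) {
    const float alpha = 1.0f, beta = 0.0f;
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                N, N, N,
                &alpha, dB, N, dA, N,
                &beta, dC, N);
}
```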
Author: Devansh Singh Sijwali
