This project implements a high-performance matrix multiplication kernel in CUDA, comparing a naïve GPU implementation and a shared-memory tiled implementation against a CPU baseline.
On an NVIDIA Tesla T4, the optimized kernel achieves a 2,712x speedup over the CPU baseline for large matrix sizes (N = 4096).
Performance measured on an NVIDIA Tesla T4 (Google Colab) vs. an Intel Xeon CPU.
| Matrix Size (N) | CPU Baseline (ms) | GPU Naïve (ms) | GPU Tiled (ms) | Speedup (vs CPU) |
|---|---|---|---|---|
| 512 | 200.71 | 102.95 | 2.12 | 94x |
| 1024 | 3,495.47 | 13.06 | 9.41 | 371x |
| 2048 | 64,302.17 | 88.10 | 59.43 | 1,081x |
| 4096 | 712,979.44 | 461.99 | 262.83 | 2,712x |
Note: The CPU implementation is $O(N^3)$ and suffers from cache thrashing at large sizes. The GPU tiled implementation leverages shared memory to reuse data, effectively minimizing the memory-bandwidth bottleneck.
```
├── CMakeLists.txt        # Build configuration (detects CUDA/CXX)
├── README.md             # Documentation
├── benchmark/
│   ├── plot_results.py   # Python script to generate performance plots
│   └── benchmark_plot.png # Generated plot
├── include/
│   ├── matmul_cpu.hpp    # Header for CPU functions
│   └── matmul_gpu.hpp    # Header for GPU wrappers
└── src/
    ├── benchmark.cpp     # Main driver: runs tests & verifies correctness
    ├── matmul_cpu.cpp    # Standard O(N^3) CPU implementation
    └── matmul_gpu.cu     # CUDA kernels (naïve & tiled)
```
A standard triple-loop implementation. While accurate, it is limited by the serial nature of the CPU and by the high latency of DRAM access when handling large matrices.
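The baseline boils down to three nested loops. A minimal sketch (function and variable names are illustrative, not necessarily the project's actual API), assuming row-major square matrices:

```cpp
#include <cstddef>
#include <vector>

// Naive O(N^3) matmul: C = A * B for square N x N matrices in row-major order.
void matmul_cpu(const std::vector<float>& A, const std::vector<float>& B,
                std::vector<float>& C, std::size_t N) {
    for (std::size_t i = 0; i < N; ++i) {
        for (std::size_t j = 0; j < N; ++j) {
            float acc = 0.0f;
            for (std::size_t k = 0; k < N; ++k) {
                // Walking down a column of B strides by N floats per step,
                // which is what thrashes the cache at large N.
                acc += A[i * N + k] * B[k * N + j];
            }
            C[i * N + j] = acc;
        }
    }
}
```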
Maps one thread to one output element of C.
- Pros: Massive parallelism (thousands of threads active).
- Cons: Memory Bound. Each thread reads row A and column B directly from Global Memory, leading to redundant data fetches.
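The one-thread-per-element mapping can be sketched as follows (a hedged sketch; the kernel name and signature are illustrative, not the project's exact code):

```cuda
// Naive kernel: one thread computes one element of C. Every operand comes
// straight from global memory, so each row of A and each column of B is
// re-fetched by N different threads.
__global__ void matmul_naive(const float* A, const float* B, float* C, int N) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < N && col < N) {
        float acc = 0.0f;
        for (int k = 0; k < N; ++k)
            acc += A[row * N + k] * B[k * N + col];
        C[row * N + col] = acc;
    }
}
```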
Uses Shared Memory (L1 Cache equivalent) to reduce global memory traffic.
- Technique: Threads cooperatively load a tile of A and a tile of B into shared memory.
- Benefit: Data in shared memory is accessed ~16 times faster than global memory. This drastically improves the arithmetic intensity of the kernel.
- Synchronization: Uses `__syncthreads()` to ensure data consistency within the block during tile loading.
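Putting the tiling and synchronization steps together, a hedged sketch of the tiled kernel (the tile width and names are assumptions; the project's actual values may differ):

```cuda
#define TILE 16  // tile width; the project's actual value may differ

__global__ void matmul_tiled(const float* A, const float* B, float* C, int N) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < (N + TILE - 1) / TILE; ++t) {
        // Each thread stages one element of A and one of B into shared memory,
        // zero-padding when the tile hangs off the matrix edge.
        As[threadIdx.y][threadIdx.x] = (row < N && t * TILE + threadIdx.x < N)
            ? A[row * N + t * TILE + threadIdx.x] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] = (t * TILE + threadIdx.y < N && col < N)
            ? B[(t * TILE + threadIdx.y) * N + col] : 0.0f;
        __syncthreads();  // tile fully loaded before any thread reads it

        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();  // done with this tile before the next load overwrites it
    }

    if (row < N && col < N) C[row * N + col] = acc;
}
```

Each element of A and B is now read from global memory once per tile instead of once per thread, which is where the improved arithmetic intensity comes from.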
- CMake >= 3.10
- NVIDIA CUDA Toolkit (nvcc)
- C++ Compiler (g++ or clang)
```bash
# 1. Create build directory
mkdir build && cd build

# 2. Configure project (CMake will auto-detect the target GPU, e.g. T4)
cmake ..

# 3. Compile
make

# 4. Run the benchmark
./matmul_bench
```
- Implement cuBLAS comparison to see how close we are to vendor-optimized libraries.
- Add support for rectangular matrices (currently only square N×N matrices are supported).
- Implement Double Buffering to hide memory latency even further.
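For the cuBLAS comparison, the reference point would be `cublasSgemm`. A hedged sketch (pointer names are illustrative; since cuBLAS is column-major, a row-major C = A·B can be computed as Cᵀ = Bᵀ·Aᵀ by swapping the operand order):

```cuda
#include <cublas_v2.h>

// Row-major C = A * B via column-major cuBLAS, by swapping the operands.
// dA, dB, dC are device pointers to N*N floats; handle is a live cuBLAS handle.
void matmul_cublas(cublasHandle_t handle, const float* dA, const float* dB,
                   float* dC, int N) {
    const float alpha = 1.0f, beta = 0.0f;
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                N, N, N,
                &alpha, dB, N, dA, N,
                &beta, dC, N);
}
```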
Author: Devansh Singh Sijwali
