
GPU-Accelerated Matrix Multiplication (C++ / CUDA)


🚀 Executive Summary

This project implements a high-performance matrix multiplication kernel ($C = A \times B$) using CUDA C++. It benchmarks a standard multithreaded CPU implementation against two GPU strategies: a Naïve Global Memory approach and an Optimized Shared Memory Tiling approach.

On an NVIDIA Tesla T4, the optimized kernel achieves a 2,712x speedup over the CPU baseline for large matrix sizes ($N=4096$), reducing execution time from ~11.8 minutes to 0.26 seconds.


📊 Benchmark Results

Performance measured on NVIDIA Tesla T4 (Google Colab) vs Intel Xeon CPU.

| Matrix Size (N) | CPU Baseline (ms) | GPU Naïve (ms) | GPU Tiled (ms) | Speedup (vs CPU) |
|---|---|---|---|---|
| 512 | 200.71 | 102.95 | 2.12 | 94x |
| 1024 | 3,495.47 | 13.06 | 9.41 | 371x |
| 2048 | 64,302.17 | 88.10 | 59.43 | 1,081x |
| 4096 | 712,979.44 | 461.99 | 262.83 | 2,712x |

Note: The CPU implementation is $O(N^3)$ and suffers from cache thrashing at large sizes. The GPU Tiled implementation leverages shared memory to reuse data, effectively minimizing the memory bandwidth bottleneck.

Benchmark Plot


📂 Project Structure

├── CMakeLists.txt          # Build configuration (detects CUDA/CXX)
├── README.md               # Documentation
├── benchmark/
│   ├── plot_results.py     # Python script to generate performance plots
│   └── benchmark_plot.png  # Generated plot
├── include/
│   ├── matmul_cpu.hpp      # Header for CPU functions
│   └── matmul_gpu.hpp      # Header for GPU wrappers
└── src/
    ├── benchmark.cpp       # Main driver: runs tests & verifies correctness
    ├── matmul_cpu.cpp      # Standard O(N^3) CPU implementation
    └── matmul_gpu.cu       # CUDA Kernels (Naïve & Tiled)


🛠️ Implementation Details

1. CPU Baseline

A standard triple-loop implementation. While accurate, it is limited by the serial nature of the CPU and by the high latency of DRAM access when handling large matrices.

2. GPU Naïve Kernel

Maps one thread to one output element of $C$.

  • Pros: Massive parallelism (thousands of threads active).
  • Cons: Memory Bound. Each thread reads a full row of $A$ and a full column of $B$ directly from Global Memory, leading to redundant data fetches.
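The naïve mapping can be sketched as follows (names and launch geometry are illustrative; the actual kernel is in src/matmul_gpu.cu):

```cuda
// Naïve kernel sketch: one thread computes one element of C.
// All operands come straight from global memory, so each element of
// A and B is re-read N times across the grid.
__global__ void matmul_naive(const float* A, const float* B, float* C, int N) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < N && col < N) {
        float acc = 0.0f;
        for (int k = 0; k < N; ++k)
            acc += A[row * N + k] * B[k * N + col];
        C[row * N + col] = acc;
    }
}
```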

3. GPU Tiled Kernel (Optimized)

Uses Shared Memory (L1 Cache equivalent) to reduce global memory traffic.

  • Technique: Threads load a tile of $A$ and a tile of $B$ into shared memory.
  • Benefit: Data in shared memory is accessed ~16 times faster than global memory. This drastically improves the arithmetic intensity of the kernel.
  • Synchronization: Uses __syncthreads() to ensure data consistency within the block during tile loading.
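The three points above fit together roughly like this (a sketch assuming a 16x16 tile; the actual tile width and kernel live in src/matmul_gpu.cu):

```cuda
#define TILE 16  // assumed tile width; 16 or 32 are common choices on a T4

// Tiled kernel sketch: each block cooperatively stages a TILE x TILE
// sub-block of A and B into shared memory, then every thread accumulates
// its partial dot product from the fast on-chip copies.
__global__ void matmul_tiled(const float* A, const float* B, float* C, int N) {
    __shared__ float sA[TILE][TILE];
    __shared__ float sB[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < (N + TILE - 1) / TILE; ++t) {
        // Stage one tile of A and one tile of B (guarded for edge tiles).
        int aCol = t * TILE + threadIdx.x;
        int bRow = t * TILE + threadIdx.y;
        sA[threadIdx.y][threadIdx.x] = (row < N && aCol < N) ? A[row * N + aCol] : 0.0f;
        sB[threadIdx.y][threadIdx.x] = (bRow < N && col < N) ? B[bRow * N + col] : 0.0f;
        __syncthreads();  // tile fully loaded before any thread reads it

        for (int k = 0; k < TILE; ++k)
            acc += sA[threadIdx.y][k] * sB[k][threadIdx.x];
        __syncthreads();  // all reads done before the next tile overwrites
    }
    if (row < N && col < N)
        C[row * N + col] = acc;
}
```

Each global-memory element is now loaded once per tile pass instead of once per thread, which is where the higher arithmetic intensity comes from.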

⚙️ Build & Run

Prerequisites

  • CMake >= 3.10
  • NVIDIA CUDA Toolkit (nvcc)
  • C++ Compiler (g++ or clang)

Compilation

# 1. Create build directory
mkdir build && cd build

# 2. Configure project (CMake will auto-detect T4/Target GPU)
cmake ..

# 3. Compile
make

Running the Benchmark

./matmul_bench

🔗 Future Work

  • Implement cuBLAS comparison to see how close we are to vendor-optimized libraries.
  • Add support for Rectangular Matrices (currently only square $N \times N$ matrices are supported).
  • Implement Double Buffering to hide memory latency even further.
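For reference, the planned cuBLAS comparison would look roughly like the sketch below (not yet in the repo; function name is hypothetical). One subtlety worth noting: cuBLAS assumes column-major storage, so for row-major data the usual trick is to ask cuBLAS for $B \times A$, which it then writes out as the row-major product $A \times B$:

```cuda
#include <cublas_v2.h>

// Hypothetical cuBLAS baseline: d_A, d_B, d_C are device pointers to
// row-major N x N matrices. Swapping the A/B operands compensates for
// cuBLAS's column-major convention.
void matmul_cublas(const float* d_A, const float* d_B, float* d_C, int N) {
    cublasHandle_t handle;
    cublasCreate(&handle);
    const float alpha = 1.0f, beta = 0.0f;
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                N, N, N,
                &alpha, d_B, N, d_A, N,
                &beta, d_C, N);
    cublasDestroy(handle);
}
```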

Author: Devansh Singh Sijwali
