This repository contains a collection of CUDA programming exercises completed as part of a GPU Computing course. The exercises progress from basic CUDA kernel development to advanced GPU optimization techniques, including parallel algorithms, memory optimization, streams, and Tensor Core programming.
The goal of these projects is to gain hands-on experience with GPU architecture, parallel programming models, and performance optimization on NVIDIA GPUs.
- CUDA kernel programming
- Thread and block hierarchy
- GPU memory model (global, shared, registers)
- Performance benchmarking and profiling
- Parallel scan algorithms
- Asynchronous execution and CUDA streams
- Shared-memory tiling
- Tensor Cores (WMMA)
- Stencil-based physical simulations
Description: Converted a sequential C++ matrix/vector computation into a parallel CUDA implementation and evaluated performance across different kernel launch configurations.
Key Learnings:
- Mapping computations to CUDA threads
- Kernel launch configuration tuning
- GPU benchmarking and performance analysis
- Understanding GPU occupancy and parallel execution
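The assignment code itself isn't shown here; as a minimal sketch of the thread-mapping and launch-configuration ideas above, a matrix–vector product might map one thread per output row (all names and the row-major layout are assumptions, not the course code):

```cuda
#include <cuda_runtime.h>

// One thread per output row: each thread computes y[row] = dot(A[row, :], x).
// A is row-major, n x n.
__global__ void matVecKernel(const float* A, const float* x, float* y, int n) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < n) {                      // guard against the partial final block
        float sum = 0.0f;
        for (int j = 0; j < n; ++j)
            sum += A[row * n + j] * x[j];
        y[row] = sum;
    }
}

// The block size is the tunable launch parameter benchmarked in the
// exercise (e.g. 128 / 256 / 512 threads per block).
void launchMatVec(const float* dA, const float* dx, float* dy, int n, int blockSize) {
    int gridSize = (n + blockSize - 1) / blockSize;  // ceil(n / blockSize)
    matVecKernel<<<gridSize, blockSize>>>(dA, dx, dy, n);
}
```

Varying `blockSize` while timing the kernel exposes the occupancy trade-offs mentioned above.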
Description: Implemented a parallel inclusive prefix scan for complex numbers using the Kogge–Stone algorithm, with complex multiplication as the scan operation.
Key Learnings:
- Parallel scan algorithms
- Warp divergence reduction
- Shared memory optimization
- Thread coarsening and memory coalescing
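A single-block sketch of the Kogge–Stone inclusive scan with complex multiplication as the operator (the full assignment also handles multi-block inputs; the two-phase `__syncthreads()` pattern avoids read/write races within the shared buffer):

```cuda
#include <cuComplex.h>

#define BLOCK 256

// Inclusive Kogge-Stone scan of one block of complex numbers, using
// complex multiplication (cuCmulf) as the associative scan operator.
__global__ void koggeStoneScan(const cuFloatComplex* in, cuFloatComplex* out, int n) {
    __shared__ cuFloatComplex buf[BLOCK];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // Pad out-of-range slots with the multiplicative identity (1 + 0i).
    buf[threadIdx.x] = (i < n) ? in[i] : make_cuFloatComplex(1.0f, 0.0f);
    __syncthreads();

    for (int stride = 1; stride < blockDim.x; stride *= 2) {
        cuFloatComplex prev;
        if (threadIdx.x >= stride)
            prev = buf[threadIdx.x - stride];   // read phase
        __syncthreads();
        if (threadIdx.x >= stride)
            buf[threadIdx.x] = cuCmulf(prev, buf[threadIdx.x]);  // write phase
        __syncthreads();
    }
    if (i < n) out[i] = buf[threadIdx.x];
}
```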
Description: Part 1 overlaps host–device memory transfers with kernel execution using CUDA streams. Part 2 optimizes a naive matrix multiplication with shared-memory tiling, guided by profiling in NVIDIA Nsight.
Key Learnings:
- CUDA streams and asynchronous execution
- Overlapping memory transfers with computation
- Shared memory tiling strategies
- Performance-driven optimization using profiling tools
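The stream-overlap pattern from Part 1 can be sketched as follows (a generic illustration, not the assignment code; `someKernel` and the even-chunking assumption are hypothetical, and the host buffers must be pinned with `cudaMallocHost` for the async copies to actually overlap):

```cuda
#include <cuda_runtime.h>

__global__ void someKernel(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 2.0f * in[i];   // placeholder computation
}

// Split the input into chunks; each chunk's H2D copy, kernel launch, and
// D2H copy go on their own stream so transfers overlap with compute.
void processInChunks(float* hIn, float* hOut, float* dIn, float* dOut,
                     int n, int nStreams) {
    cudaStream_t streams[8];              // assumes nStreams <= 8
    int chunk = n / nStreams;             // assumes n divisible by nStreams
    for (int s = 0; s < nStreams; ++s) cudaStreamCreate(&streams[s]);

    for (int s = 0; s < nStreams; ++s) {
        int off = s * chunk;
        cudaMemcpyAsync(dIn + off, hIn + off, chunk * sizeof(float),
                        cudaMemcpyHostToDevice, streams[s]);
        someKernel<<<(chunk + 255) / 256, 256, 0, streams[s]>>>(
            dIn + off, dOut + off, chunk);
        cudaMemcpyAsync(hOut + off, dOut + off, chunk * sizeof(float),
                        cudaMemcpyDeviceToHost, streams[s]);
    }
    for (int s = 0; s < nStreams; ++s) {
        cudaStreamSynchronize(streams[s]);
        cudaStreamDestroy(streams[s]);
    }
}
```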
Description: Implemented a 1D convolution by reformulating it as a matrix multiplication and executing it using Tensor Core WMMA instructions.
Key Learnings:
- Convolution-to-matrix multiplication transformation
- WMMA API and Tensor Core usage
- Warp-level programming
- Hardware constraints of Tensor Cores
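The WMMA core of this exercise looks roughly like the following warp-level tile product (a sketch, not the assignment code: here `A` would hold the im2col-style matrix built from the 1D signal and `B` the filter matrix; dimensions are assumed padded to multiples of 16, inputs are `half`, and accumulation is in `float`, reflecting the hardware constraints listed above):

```cuda
#include <mma.h>
using namespace nvcuda;

// One warp computes one 16x16 tile of C = A * B on Tensor Cores.
// WMMA fixes the tile shape (16x16x16 here) and the input precision.
__global__ void wmmaTile(const half* A, const half* B, float* C,
                         int M, int N, int K) {
    int tileM = blockIdx.y * 16;
    int tileN = blockIdx.x * 16;

    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> aFrag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> bFrag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> cFrag;
    wmma::fill_fragment(cFrag, 0.0f);

    for (int k = 0; k < K; k += 16) {
        wmma::load_matrix_sync(aFrag, A + tileM * K + k, K);   // leading dim K
        wmma::load_matrix_sync(bFrag, B + k * N + tileN, N);   // leading dim N
        wmma::mma_sync(cFrag, aFrag, bFrag, cFrag);
    }
    wmma::store_matrix_sync(C + tileM * N + tileN, cFrag, N, wmma::mem_row_major);
}
```

Note that every thread in the warp must execute the `wmma` calls together, which is what "warp-level programming" refers to above.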
Description: Reimplemented a sequential 2D heat flow simulation as a CUDA-accelerated stencil computation using shared memory and double buffering.
Key Learnings:
- Stencil-based GPU computations
- Double buffering on the GPU
- Shared memory optimization
- Efficient handling of iterative simulations
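The double-buffering idea can be sketched with a 5-point stencil update and a host loop that swaps the two device buffers each iteration (a global-memory version for brevity; the assignment additionally tiles the stencil through shared memory, and `a` is the hypothetical diffusion coefficient):

```cuda
#include <cuda_runtime.h>

// One heat-flow time step: next[] is computed from cur[], so reads and
// writes never alias. Boundary cells are held fixed.
__global__ void heatStep(const float* cur, float* next, int nx, int ny, float a) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x > 0 && x < nx - 1 && y > 0 && y < ny - 1) {
        int i = y * nx + x;
        next[i] = cur[i] + a * (cur[i - 1] + cur[i + 1]
                              + cur[i - nx] + cur[i + nx] - 4.0f * cur[i]);
    }
}

// Host loop: swap the buffer pointers instead of copying between steps.
void simulate(float* dA, float* dB, int nx, int ny, int steps, float a) {
    dim3 block(16, 16);
    dim3 grid((nx + 15) / 16, (ny + 15) / 16);
    for (int t = 0; t < steps; ++t) {
        heatStep<<<grid, block>>>(dA, dB, nx, ny, a);
        float* tmp = dA; dA = dB; dB = tmp;  // double-buffer swap
    }
    cudaDeviceSynchronize();
}
```

Swapping pointers makes each iteration read the previous step's state while writing the next, with no device-to-device copies.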
- CUDA / C++
- NVIDIA GPUs
- CUDA Streams
- Shared Memory & WMMA
- NVIDIA Nsight (profiling)