MNIST in CUDA

This is the instruction manual for understanding and using the MNIST training run in CUDA.

Setup

DISCLAIMER: ensure you have a GPU with compute capability 5.0 or greater (at least the Maxwell architecture). See the compatibility guide: https://docs.nvidia.com/deeplearning/cudnn/latest/reference/support-matrix.html
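
If you are unsure what compute capability your card has, the short standalone CUDA snippet below (not part of this repo; compile it with nvcc) queries it via cudaGetDeviceProperties. Anything reporting a major version of 5 or higher (Maxwell or newer) will work.

// Standalone check of GPU compute capability (illustrative, not part of the repo).
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int d = 0; d < count; ++d) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, d);
        // major >= 5 means Maxwell or newer
        printf("Device %d: %s, compute capability %d.%d\n",
               d, prop.name, prop.major, prop.minor);
    }
    return 0;
}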

git clone https://github.com/Infatoshi/mnist-cuda
cd mnist-cuda
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt

Purpose

We train an MLP on the MNIST dataset. We first implement the batched training run in PyTorch, then translate it to CUDA C/C++ using iteratively optimized GPU kernels. I purposely left out batchnorm, residual blocks, lower precision, and other optimizations to keep the code simple and easy to understand; they would also take far longer to implement and explain.
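
For orientation, here is a minimal sketch of the network's shapes. The 784-dimensional input and 10 output classes come from MNIST itself; the hidden width and batch size below are illustrative assumptions, not the repo's exact hyperparameters.

// Shape-level sketch of the MLP (assumed hidden width and batch size).
#include <cstdio>

int main() {
    const int batch_size = 64;   // assumed batch size
    const int input_dim  = 784;  // 28x28 MNIST images, flattened
    const int hidden_dim = 256;  // assumed hidden layer width
    const int output_dim = 10;   // one logit per digit class

    // Forward pass, expressed as matrix shapes (layout is illustrative):
    //   hidden = relu(X @ W1^T + b1) : (batch, 784) x (784, hidden) -> (batch, hidden)
    //   logits = hidden @ W2^T + b2  : (batch, hidden) x (hidden, 10) -> (batch, 10)
    //   probs  = softmax(logits)     : row-wise over the 10 classes
    printf("X: %d x %d -> hidden: %d x %d -> logits: %d x %d\n",
           batch_size, input_dim, batch_size, hidden_dim, batch_size, output_dim);
    return 0;
}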

What to watch out for and pay attention to:

Accelerate data transfer via prefetching

  • Unified vs Explicit Memory in CUDA
  • Maximizing Unified Memory Performance
  • Prefetching is handled automatically by unified memory via streams (this is why it shows lower latency in the GitHub link above)
    • CUDA streams - Lei Mao
    • NVIDIA Docs
    • Streams allow for overlapping data transfer (prefetching) with computation.
    • While one stream is executing a kernel, another stream can be transferring data for the next computation.
    • This technique is often called "double buffering" or "multi-buffering" when extended to more buffers; a sketch follows this list.
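
As a concrete illustration, the sketch below overlaps compute and transfer with two CUDA streams and unified memory (cudaMallocManaged + cudaMemPrefetchAsync). The kernel, batch sizes, and data layout are placeholders rather than the repo's actual training code.

// Double buffering with unified memory: prefetch batch b+1 while computing batch b.
#include <cuda_runtime.h>
#include <cstdio>

__global__ void process_batch(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * 2.0f;  // stand-in for the real forward pass
}

int main() {
    const int batch_elems = 1 << 20;
    const int num_batches = 4;

    float *in, *out;
    cudaMallocManaged(&in,  num_batches * batch_elems * sizeof(float));
    cudaMallocManaged(&out, num_batches * batch_elems * sizeof(float));
    for (int i = 0; i < num_batches * batch_elems; ++i) in[i] = 1.0f;

    int device = 0;
    cudaGetDevice(&device);

    cudaStream_t compute, copy;
    cudaStreamCreate(&compute);
    cudaStreamCreate(&copy);

    // While the compute stream runs batch b, the copy stream prefetches
    // batch b+1 to the GPU so it is resident before the next kernel launch.
    cudaMemPrefetchAsync(in, batch_elems * sizeof(float), device, compute);
    for (int b = 0; b < num_batches; ++b) {
        if (b + 1 < num_batches) {
            cudaMemPrefetchAsync(in + (b + 1) * batch_elems,
                                 batch_elems * sizeof(float), device, copy);
        }
        process_batch<<<(batch_elems + 255) / 256, 256, 0, compute>>>(
            in + b * batch_elems, out + b * batch_elems, batch_elems);
    }
    cudaDeviceSynchronize();

    printf("out[0] = %f\n", out[0]);
    cudaStreamDestroy(compute);
    cudaStreamDestroy(copy);
    cudaFree(in);
    cudaFree(out);
    return 0;
}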

Kernel Conversion

We will convert the following functions into CUDA kernels:

  • matmul_a_bt and matmul_at_b
  • relu_forward and relu_backward
  • bias_forward and bias_backward
  • softmax
  • compute_grad_output
  • compute_output_gradients
  • compute_hidden_gradients
  • update_gradients
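
To give a flavor of what these conversions look like, here is a hedged sketch of relu_forward as a CUDA kernel with a tiny host-side test. The signature and launch configuration are assumptions for illustration and may differ from the repo's actual interface.

// relu_forward as a one-thread-per-element CUDA kernel (illustrative signature).
#include <cuda_runtime.h>
#include <cstdio>

__global__ void relu_forward(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = fmaxf(in[i], 0.0f);  // elementwise max(x, 0)
}

int main() {
    const int n = 8;
    float h_in[n] = {-2.f, -1.f, -0.5f, 0.f, 0.5f, 1.f, 2.f, 3.f};
    float h_out[n];

    float *d_in, *d_out;
    cudaMalloc(&d_in,  n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);

    // One thread per element; grid sized to cover n.
    relu_forward<<<(n + 255) / 256, 256>>>(d_in, d_out, n);
    cudaMemcpy(h_out, d_out, n * sizeof(float), cudaMemcpyDeviceToHost);

    for (int i = 0; i < n; ++i) printf("%.1f ", h_out[i]);
    printf("\n");

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}

The same one-thread-per-element pattern carries over to relu_backward and the bias kernels; the matmul and softmax kernels additionally need shared-memory tiling and row-wise reductions, which is typically where most of the iterative optimization effort goes.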