diff --git a/README.md b/README.md index 3195e32..40f86f3 100644 --- a/README.md +++ b/README.md @@ -11,7 +11,22 @@ *From beginner fundamentals to production-ready optimization techniques* -**Quick Navigation:** [πŸš€ Quick Start](#-quick-start) β€’ [πŸ“š Modules](#-modules) β€’ [🐳 Docker Setup](#-docker-development) β€’ [🀝 Contributing](CONTRIBUTING.md) +## πŸ“‘ Table of Contents + +- [πŸ“‹ Project Overview](#-project-overview) +- [πŸ—οΈ GPU Programming Architecture](#️-gpu-programming-architecture) +- [✨ Key Features](#-key-features) +- [πŸš€ Quick Start](#-quick-start) +- [🎯 Learning Path](#-learning-path) +- [πŸ“š Modules](#-modules) +- [πŸ› οΈ Prerequisites](#️-prerequisites) +- [🐳 Docker Development](#-docker-development) +- [πŸ”§ Build System](#-build-system) +- [πŸ“Š Performance Expectations](#-performance-expectations) +- [πŸ› Troubleshooting](#-troubleshooting) +- [πŸ“– Documentation](#-documentation) +- [🀝 Contributing](#-contributing) +- [πŸ“„ License](#-license) --- @@ -27,6 +42,155 @@ Perfect for students, researchers, and developers looking to master GPU computing. +## πŸ—οΈ GPU Programming Architecture + +Understanding how GPU programming works, from high-level code down to hardware execution, is crucial for effective GPU development. This section provides a comprehensive overview of the CUDA and HIP/ROCm software-hardware stacks.
+ +### Architecture Overview Diagram + +``` +β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” +β”‚ APPLICATION LAYER β”‚ +β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€ +β”‚ High-Level Code (C++/CUDA/HIP) β”‚ +β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ +β”‚ β”‚ CUDA C++ Code β”‚ β”‚ HIP C++ Code β”‚ β”‚ OpenCL/SYCL β”‚ β”‚ +β”‚ β”‚ (.cu files) β”‚ β”‚ (.hip files) β”‚ β”‚ (Cross-platform) β”‚ β”‚ +β”‚ β”‚ β”‚ β”‚ β”‚ β”‚ β”‚ β”‚ +β”‚ β”‚ __global__ kernels β”‚ β”‚ __global__ kernels β”‚ β”‚ kernel functions β”‚ β”‚ +β”‚ β”‚ cudaMalloc() β”‚ β”‚ hipMalloc() β”‚ β”‚ clCreateBuffer() β”‚ β”‚ +β”‚ β”‚ cudaMemcpy() β”‚ β”‚ hipMemcpy() β”‚ β”‚ clEnqueueNDRange() β”‚ β”‚ +β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚ +β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ + β”‚ + β–Ό 
+β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” +β”‚ COMPILATION LAYER β”‚ +β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€ +β”‚ Compiler Frontend β”‚ +β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ +β”‚ β”‚ NVCC β”‚ β”‚ HIP Clang β”‚ β”‚ LLVM/Clang β”‚ β”‚ +β”‚ β”‚ (NVIDIA Compiler) β”‚ β”‚ (AMD Compiler) β”‚ β”‚ (Open Standard) β”‚ β”‚ +β”‚ β”‚ β”‚ β”‚ β”‚ β”‚ β”‚ β”‚ +β”‚ β”‚ β€’ Parse CUDA syntax β”‚ β”‚ β€’ Parse HIP syntax β”‚ β”‚ β€’ Parse OpenCL/SYCL β”‚ β”‚ +β”‚ β”‚ β€’ Host/Device split β”‚ β”‚ β€’ Host/Device split β”‚ β”‚ β€’ Generate SPIR-V β”‚ β”‚ +β”‚ β”‚ β€’ Generate PTX β”‚ β”‚ β€’ Generate GCN ASM β”‚ β”‚ β€’ Target backends β”‚ β”‚ +β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚ +β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ + β”‚ + β–Ό 
+β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” +β”‚ INTERMEDIATE REPRESENTATION β”‚ +β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€ +β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ +β”‚ β”‚ PTX β”‚ β”‚ GCN ASM β”‚ β”‚ SPIR-V β”‚ β”‚ +β”‚ β”‚ (Parallel Thread β”‚ β”‚ (Graphics Core β”‚ β”‚ (Standard Portable β”‚ β”‚ +β”‚ β”‚ Execution) β”‚ β”‚ Next Assembly) β”‚ β”‚ IR - Vulkan) β”‚ β”‚ +β”‚ β”‚ β”‚ β”‚ β”‚ β”‚ β”‚ β”‚ +β”‚ β”‚ β€’ Virtual ISA β”‚ β”‚ β€’ AMD GPU ISA β”‚ β”‚ β€’ Cross-platform β”‚ β”‚ +β”‚ β”‚ β€’ Device agnostic β”‚ β”‚ β€’ RDNA/CDNA arch β”‚ β”‚ β€’ Vendor neutral β”‚ β”‚ +β”‚ β”‚ β€’ JIT compilation β”‚ β”‚ β€’ Direct execution β”‚ β”‚ β€’ Multiple targets β”‚ β”‚ +β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚ +β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ + β”‚ + β–Ό 
+β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” +β”‚ DRIVER LAYER β”‚ +β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€ +β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ +β”‚ β”‚ CUDA Driver β”‚ β”‚ ROCm Driver β”‚ β”‚ OpenCL Driver β”‚ β”‚ +β”‚ β”‚ β”‚ β”‚ β”‚ β”‚ β”‚ β”‚ +β”‚ β”‚ β€’ PTX β†’ SASS JIT β”‚ β”‚ β€’ GCN β†’ Machine β”‚ β”‚ β€’ SPIR-V β†’ Native β”‚ β”‚ +β”‚ β”‚ β€’ Memory management β”‚ β”‚ β€’ Memory management β”‚ β”‚ β€’ Memory management β”‚ β”‚ +β”‚ β”‚ β€’ Kernel launch β”‚ β”‚ β€’ Kernel launch β”‚ β”‚ β€’ Kernel launch β”‚ β”‚ +β”‚ β”‚ β€’ Context mgmt β”‚ β”‚ β€’ Context mgmt β”‚ β”‚ β€’ Context mgmt β”‚ β”‚ +β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚ +β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ + β”‚ + β–Ό +β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” +β”‚ 
HARDWARE LAYER β”‚ +β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€ +β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ +β”‚ β”‚ NVIDIA GPU β”‚ β”‚ AMD GPU β”‚ β”‚ +β”‚ β”‚ β”‚ β”‚ β”‚ β”‚ +β”‚ β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ +β”‚ β”‚ β”‚ SM (Cores) β”‚ β”‚ β”‚ β”‚ CU (Cores) β”‚ β”‚ β”‚ Intel Xe Cores β”‚ β”‚ +β”‚ β”‚ β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ β”‚ β”‚ β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ β”‚ β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ β”‚ +β”‚ β”‚ β”‚ β”‚FP32 | INT32 β”‚ β”‚ β”‚ β”‚ β”‚ β”‚FP32 | INT32 β”‚ β”‚ β”‚ β”‚ β”‚ Vector Engines β”‚ β”‚ β”‚ +β”‚ β”‚ β”‚ β”‚FP64 | BF16 β”‚ β”‚ β”‚ β”‚ β”‚ β”‚FP64 | BF16 β”‚ β”‚ β”‚ β”‚ β”‚ Matrix Engines β”‚ β”‚ β”‚ +β”‚ β”‚ β”‚ β”‚Tensor Cores β”‚ β”‚ β”‚ β”‚ β”‚ β”‚Matrix Cores β”‚ β”‚ β”‚ β”‚ β”‚ Ray Tracing β”‚ β”‚ β”‚ +β”‚ β”‚ β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚ β”‚ β”‚ β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚ β”‚ β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚ β”‚ +β”‚ β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚ β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚ +β”‚ β”‚ β”‚ β”‚ β”‚ β”‚ +β”‚ β”‚ Memory Hierarchy: β”‚ β”‚ Memory Hierarchy: β”‚ Memory Hierarchy: β”‚ +β”‚ β”‚ β€’ L1 Cache (KB) β”‚ β”‚ β€’ L1 Cache (KB) β”‚ β€’ L1 Cache β”‚ +β”‚ β”‚ β€’ L2 Cache (MB) β”‚ β”‚ β€’ L2 Cache (MB) β”‚ β€’ L2 Cache β”‚ +β”‚ β”‚ 
β€’ Global Mem (GB) β”‚ β”‚ β€’ Global Mem (GB) β”‚ β€’ Global Memory β”‚ +β”‚ β”‚ β€’ Shared Memory β”‚ β”‚ β€’ LDS (Local Data β”‚ β€’ Shared Local Memory β”‚ +β”‚ β”‚ β€’ Constant Memory β”‚ β”‚ Store) β”‚ β€’ Constant Memory β”‚ +β”‚ β”‚ β€’ Texture Memory β”‚ β”‚ β€’ Constant Memory β”‚ β”‚ +β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚ +β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ +``` + +### Compilation Pipeline Deep Dive + +#### 1. **Source Code β†’ Frontend Parsing** +- **CUDA**: NVCC separates host (CPU) and device (GPU) code, parses CUDA extensions +- **HIP**: Clang-based compiler with HIP runtime API that maps to either CUDA or ROCm +- **OpenCL/SYCL**: LLVM-based compilation with cross-platform intermediate representation + +#### 2. **Frontend β†’ Intermediate Representation** +``` +High-Level Code Intermediate Form +───────────────── ─────────────────── +__global__ void kernel() β†’ PTX (NVIDIA) +{ GCN Assembly (AMD) + int id = threadIdx.x; SPIR-V (OpenCL/Vulkan) + output[id] = input[id] * 2; LLVM IR (SYCL) +} +``` + +#### 3. **Runtime Compilation & Optimization** +- **NVIDIA**: PTX β†’ SASS (GPU-specific machine code) via JIT compilation +- **AMD**: GCN Assembly β†’ GPU microcode via ROCm runtime +- **Optimizations**: Register allocation, memory coalescing, instruction scheduling + +#### 4. 
**Hardware Execution Model** + +| Abstraction Level | NVIDIA Term | AMD Term | Description | +|------------------|-------------|----------|-------------| +| **Thread** | Thread | Work-item | Single execution unit | +| **Thread Group** | Warp (32 threads) | Wavefront (64 threads) | SIMD execution group | +| **Thread Block** | Block | Work-group | Shared memory + synchronization | +| **Grid** | Grid | NDRange | Collection of all thread blocks | + +#### 5. **Memory Architecture Mapping** + +``` +Programming Model Hardware Implementation +───────────────── ───────────────────────── +Global Memory β†’ GPU DRAM (HBM/GDDR) +Shared Memory β†’ On-chip SRAM (48-164KB per SM/CU) +Local Memory β†’ GPU DRAM (spilled registers) +Constant Memory β†’ Cached read-only GPU DRAM +Texture Memory β†’ Cached GPU DRAM with interpolation +Registers β†’ On-chip register file (32K-64K per SM/CU) +``` + +### Performance Implications + +Understanding this architecture helps optimize GPU code: + +1. **Memory Coalescing**: Access patterns that align with hardware memory buses +2. **Occupancy**: Balancing registers, shared memory, and thread blocks per SM/CU +3. **Divergence**: Minimizing different execution paths within warps/wavefronts +4. **Latency Hiding**: Using enough threads to hide memory access latency +5. **Memory Hierarchy**: Optimal use of each memory type based on access patterns + +This architectural knowledge is essential for writing efficient GPU code and is covered progressively throughout our modules. 
+ ## ✨ Key Features | Feature | Description | @@ -75,46 +239,49 @@ cd modules/module1/examples ./01_vector_addition_cuda ``` -### 🎯 What You'll Learn - -**πŸ‘Ά Beginner Track** - Start here if you're new to GPU programming -- GPU architecture fundamentals -- Writing your first CUDA/HIP kernels -- Memory management between CPU and GPU -- Basic parallel algorithms -- Debugging and profiling basics - -**πŸ”₯ Intermediate Track** - For developers with some parallel programming experience -- Advanced memory optimization techniques -- Multi-dimensional data processing -- GPU architecture deep dive -- Performance engineering -- Multi-GPU programming - -**πŸš€ Advanced Track** - For experts seeking production-level skills -- Fundamental parallel algorithms (reduction, scan, convolution) -- Advanced algorithmic patterns (sorting, sparse matrices) -- Domain-specific applications (ML, scientific computing) -- Production deployment and optimization -- Next-generation GPU architectures +## 🎯 Learning Path + +Choose your track based on your experience level: + +**πŸ‘Ά Beginner Track** (Modules 1-3) - GPU fundamentals, memory management, first kernels +**πŸ”₯ Intermediate Track** (Modules 4-5) - Advanced programming, performance optimization +**πŸš€ Advanced Track** (Modules 6-9) - Parallel algorithms, domain applications, production deployment + +*Each track builds on the previous one, so start with the appropriate level for your background.* ## πŸ“š Modules -| Module | Level | Duration | Topics | Examples | -|--------|-------|----------|--------|----------| -| [**Module 1**](modules/module1/) | Beginner | 4-6h | GPU Fundamentals, CUDA/HIP Basics | 13 | -| [**Module 2**](modules/module2/) | Beginner-Intermediate | 6-8h | Multi-Dimensional Data Processing | 10 | -| [**Module 3**](modules/module3/) | Intermediate | 6-8h | GPU Architecture & Execution Models | 12 | -| [**Module 4**](modules/module4/) | Intermediate-Advanced | 8-10h | Advanced GPU Programming | 9 | -| [**Module 
5**](modules/module5/) | Advanced | 6-8h | Performance Engineering | 5 | -| [**Module 6**](modules/module6/) | Intermediate-Advanced | 8-10h | Fundamental Parallel Algorithms | 10 | -| [**Module 7**](modules/module7/) | Advanced | 8-10h | Advanced Algorithmic Patterns | 4 | -| [**Module 8**](modules/module8/) | Advanced | 10-12h | Domain-Specific Applications | 4 | -| [**Module 9**](modules/module9/) | Expert | 6-8h | Production GPU Programming | 4 | +Our comprehensive curriculum progresses from fundamental concepts to production-ready optimization techniques: + +| Module | Level | Duration | Focus Area | Key Topics | Examples | +|--------|-------|----------|------------|------------|----------| +| [**Module 1**](modules/module1/) | πŸ‘Ά Beginner | 4-6h | **GPU Fundamentals** | Architecture, Memory, First Kernels | 13 | +| [**Module 2**](modules/module2/) | πŸ‘Άβ†’πŸ”₯ | 6-8h | **Memory Optimization** | Coalescing, Shared Memory, Texture | 10 | +| [**Module 3**](modules/module3/) | πŸ”₯ Intermediate | 6-8h | **Execution Models** | Warps, Occupancy, Synchronization | 12 | +| [**Module 4**](modules/module4/) | πŸ”₯β†’πŸš€ | 8-10h | **Advanced Programming** | Streams, Multi-GPU, Unified Memory | 9 | +| [**Module 5**](modules/module5/) | πŸš€ Advanced | 6-8h | **Performance Engineering** | Profiling, Bottleneck Analysis | 5 | +| [**Module 6**](modules/module6/) | πŸš€ Advanced | 8-10h | **Parallel Algorithms** | Reduction, Scan, Convolution | 10 | +| [**Module 7**](modules/module7/) | πŸš€ Expert | 8-10h | **Algorithmic Patterns** | Sorting, Graph Algorithms | 4 | +| [**Module 8**](modules/module8/) | πŸš€ Expert | 10-12h | **Domain Applications** | ML, Scientific Computing | 4 | +| [**Module 9**](modules/module9/) | πŸš€ Expert | 6-8h | **Production Deployment** | Libraries, Integration, Scaling | 4 | **πŸ“ˆ Progressive Learning Path: 70+ Examples β€’ 50+ Hours β€’ Beginner to Expert** -**[οΏ½ View Learning Modules β†’](modules/)** +### Learning Progression + +``` 
+Module 1: Hello GPU World Module 6: Parallel Algorithms + ↓ ↓ +Module 2: Memory Mastery Module 7: Advanced Patterns + ↓ ↓ +Module 3: Execution Deep Dive Module 8: Real Applications + ↓ ↓ +Module 4: Advanced Features Module 9: Production Ready + ↓ +Module 5: Performance Tuning +``` + +**[πŸ“š View All Modules β†’](modules/)** ## πŸ› οΈ Prerequisites @@ -236,33 +403,7 @@ make profile # Performance profiling make debug # Debug builds with extra checks ``` -## 🚦 Getting Started Guide - -### 1. Choose Your Path -- **🐳 Docker**: No setup required, works everywhere β†’ [Docker Guide](docker/README.md) -- **πŸ’» Native**: Direct system installation β†’ [Installation Guide](#option-2-native-installation) - -### 2. Start Learning -```bash -# Begin with Module 1 -cd modules/module1 -cat README.md # Read learning objectives -cd examples && make # Build examples -./01_vector_addition_cuda # Run your first GPU program! -``` - -### 3. Progress Through Modules -- Each module builds on previous concepts -- Complete examples and exercises in order -- Use profiling tools to understand performance -- Experiment with different optimizations - -### 4. Advanced Topics -- Modules 6-9 cover production-level techniques -- Focus on algorithms and applications relevant to your domain -- Contribute back with improvements and new examples - -## πŸ“Š Performance Expectations +## πŸ“Š Performance Expectations | Module Level | Typical GPU Speedup | Memory Efficiency | Code Quality | |--------------|-------------------|------------------|--------------| @@ -306,8 +447,6 @@ make check-hip ./docker/scripts/build.sh --clean --all ``` -**[οΏ½ Need Help?
Check Common Issues β†’](README.md#-troubleshooting)** - ## πŸ“– Documentation | Document | Description | diff --git a/docker/README.md b/docker/README.md index 5596db9..91a363c 100644 --- a/docker/README.md +++ b/docker/README.md @@ -37,9 +37,6 @@ This directory contains Docker configurations for comprehensive GPU programming # Or specify platform ./docker/scripts/run.sh cuda # For NVIDIA GPUs ./docker/scripts/run.sh rocm # For AMD GPUs - -# Start with Jupyter Lab -./docker/scripts/run.sh cuda --jupyter ``` ## πŸ“ Directory Structure @@ -106,16 +103,6 @@ docker/ [CUDA-DEV] /workspace/gpu-programming-101/modules/module1/examples $ ./01_vector_addition_cuda ``` -### Jupyter Lab Development -```bash -# Start with Jupyter Lab -./docker/scripts/run.sh cuda --jupyter - -# Access Jupyter at: -# http://localhost:8888 (CUDA) -# http://localhost:8889 (ROCm) -``` - ### Background Services ```bash # Run in detached mode @@ -237,14 +224,6 @@ nsys profile -t cuda ./05_performance_comparison rocprof --hip-trace --stats ./02_vector_addition_hip ``` -## 🌐 Port Mappings - -| Service | Host Port | Container Port | Purpose | -|---------|-----------|----------------|---------| -| CUDA Jupyter | 8888 | 8888 | Jupyter Lab | -| ROCm Jupyter | 8889 | 8888 | Jupyter Lab | -| Development Tools | 8890 | 8888 | Documentation | - ## πŸ’Ύ Volume Mounts ### Project Files @@ -369,7 +348,7 @@ docker stats gpu101-cuda-dev 5. **Practice**: `cd examples && make && ./01_vector_addition_cuda` ### Development Workflow -1. **Background**: `./docker/scripts/run.sh cuda --jupyter` +1. **Background**: `./docker/scripts/run.sh cuda --detach` 2. **Code**: Edit files in your host IDE 3. **Test**: Compile and run in container 4. 
**Debug**: Use integrated debugging tools diff --git a/docker/docker-compose.yml b/docker/docker-compose.yml index 6afe173..72e1a7d 100644 --- a/docker/docker-compose.yml +++ b/docker/docker-compose.yml @@ -23,12 +23,6 @@ services: - ../:/workspace/gpu-programming-101:rw - cuda-home:/root - cuda-cache:/root/.cache - - cuda-jupyter:/root/.jupyter - - # Port mappings - ports: - - "8888:8888" # Jupyter Lab - - "6006:6006" # TensorBoard (if needed) # Interactive mode stdin_open: true @@ -73,11 +67,6 @@ services: - ../:/workspace/gpu-programming-101:rw - rocm-home:/root - rocm-cache:/root/.cache - - rocm-jupyter:/root/.jupyter - - # Port mappings - ports: - - "8889:8888" # Jupyter Lab (different port to avoid conflict) # Interactive mode stdin_open: true @@ -113,10 +102,6 @@ services: - ../:/workspace/gpu-programming-101:rw - dev-home:/root - # Port mappings - ports: - - "8890:8888" # Jupyter Lab for documentation - # Interactive mode stdin_open: true tty: true @@ -134,14 +119,10 @@ volumes: driver: local cuda-cache: driver: local - cuda-jupyter: - driver: local rocm-home: driver: local rocm-cache: driver: local - rocm-jupyter: - driver: local dev-home: driver: local