Merged
40 changes: 20 additions & 20 deletions README.md
Original file line number Diff line number Diff line change
@@ -9,7 +9,7 @@

**A comprehensive, hands-on educational project for mastering GPU programming with CUDA and HIP**

-*From beginner fundamentals to production-ready optimization techniques*
+*From beginner fundamentals to professional-grade optimization techniques*

## πŸ“‘ Table of Contents

@@ -37,7 +37,7 @@
- **9 comprehensive modules** covering beginner to expert topics
- **71 working code examples** in both CUDA and HIP
- **Cross-platform support** for NVIDIA and AMD GPUs
-- **Production-ready development environment** with Docker
+- **Comprehensive development environment** with Docker
- **Professional tooling** including profilers, debuggers, and CI/CD

Perfect for students, researchers, and developers looking to master GPU computing.
@@ -198,7 +198,7 @@ This architectural knowledge is essential for writing efficient GPU code and is
| 🎯 **Complete Curriculum** | 9 progressive modules from basics to advanced topics |
| πŸ’» **Cross-Platform** | Full CUDA and HIP support for NVIDIA and AMD GPUs |
| 🐳 **Docker Ready** | Complete containerized development environment with CUDA 12.9.1 & ROCm 7.0 |
-| πŸ”§ **Production Quality** | Professional build systems, auto-detection, testing, and profiling |
+| πŸ”§ **Professional Quality** | Professional build systems, auto-detection, testing, and profiling |
| πŸ“Š **Performance Focus** | Optimization techniques and benchmarking throughout |
| 🌐 **Community Driven** | Open source with comprehensive contribution guidelines |
| πŸ§ͺ **Advanced Libraries** | Support for Thrust, MIOpen, and production ML frameworks |
@@ -252,21 +252,21 @@ Choose your track based on your experience level:

## πŸ“š Modules

-Our comprehensive curriculum progresses from fundamental concepts to production-ready optimization techniques:
+Our comprehensive curriculum progresses from fundamental concepts to advanced optimization techniques:

-| Module | Level | Duration | Focus Area | Key Topics | Examples |
-|--------|-------|----------|------------|------------|----------|
-| [**Module 1**](modules/module1/) | πŸ‘Ά Beginner | 4-6h | **GPU Fundamentals** | Architecture, Memory, First Kernels | 13 |
-| [**Module 2**](modules/module2/) | πŸ‘Άβ†’πŸ”₯ | 6-8h | **Memory Optimization** | Coalescing, Shared Memory, Texture | 10 |
-| [**Module 3**](modules/module3/) | πŸ”₯ Intermediate | 6-8h | **Execution Models** | Warps, Occupancy, Synchronization | 12 |
-| [**Module 4**](modules/module4/) | πŸ”₯β†’πŸš€ | 8-10h | **Advanced Programming** | Streams, Multi-GPU, Unified Memory | 9 |
-| [**Module 5**](modules/module5/) | πŸš€ Advanced | 6-8h | **Performance Engineering** | Profiling, Bottleneck Analysis | 5 |
-| [**Module 6**](modules/module6/) | πŸš€ Advanced | 8-10h | **Parallel Algorithms** | Reduction, Scan, Convolution | 10 |
-| [**Module 7**](modules/module7/) | πŸš€ Expert | 8-10h | **Algorithmic Patterns** | Sorting, Graph Algorithms | 4 |
-| [**Module 8**](modules/module8/) | πŸš€ Expert | 10-12h | **Domain Applications** | ML, Scientific Computing | 4 |
-| [**Module 9**](modules/module9/) | πŸš€ Expert | 6-8h | **Production Deployment** | Libraries, Integration, Scaling | 4 |
+| Module | Level | Focus Area | Key Topics | Examples |
+|--------|-------|------------|------------|----------|
+| [**Module 1**](modules/module1/) | πŸ‘Ά Beginner | **GPU Fundamentals** | Architecture, Memory, First Kernels | 13 |
+| [**Module 2**](modules/module2/) | πŸ‘Άβ†’πŸ”₯ | **Memory Optimization** | Coalescing, Shared Memory, Texture | 10 |
+| [**Module 3**](modules/module3/) | πŸ”₯ Intermediate | **Execution Models** | Warps, Occupancy, Synchronization | 12 |
+| [**Module 4**](modules/module4/) | πŸ”₯β†’πŸš€ | **Advanced Programming** | Streams, Multi-GPU, Unified Memory | 9 |
+| [**Module 5**](modules/module5/) | πŸš€ Advanced | **Performance Engineering** | Profiling, Bottleneck Analysis | 5 |
+| [**Module 6**](modules/module6/) | πŸš€ Advanced | **Parallel Algorithms** | Reduction, Scan, Convolution | 10 |
+| [**Module 7**](modules/module7/) | πŸš€ Expert | **Algorithmic Patterns** | Sorting, Graph Algorithms | 4 |
+| [**Module 8**](modules/module8/) | πŸš€ Expert | **Domain Applications** | ML, Scientific Computing | 4 |
+| [**Module 9**](modules/module9/) | πŸš€ Expert | **Production Deployment** | Libraries, Integration, Scaling | 4 |

-**πŸ“ˆ Progressive Learning Path: 71 Examples β€’ 50+ Hours β€’ Beginner to Expert**
+**πŸ“ˆ Progressive Learning Path: 71 Examples β€’ Beginner to Expert**

### Learning Progression

@@ -387,7 +387,7 @@ Experience the full development environment with zero setup:
**Container Specifications:**
- **CUDA**: NVIDIA CUDA 12.9.1 on Ubuntu 22.04
- **ROCm**: AMD ROCm 7.0 on Ubuntu 24.04
-- **Libraries**: Production-ready toolchains with debugging support
+- **Libraries**: Professional toolchains with debugging support

**[πŸ“– Complete Docker Guide β†’](docker/README.md)**

@@ -415,7 +415,7 @@ make debug # Debug builds with extra checks

### Advanced Build Features
- **Automatic GPU Detection**: Detects NVIDIA/AMD hardware and builds accordingly
-- **Production Optimization**: `-O3`, fast math, architecture-specific optimizations
+- **Professional Optimization**: `-O3`, fast math, architecture-specific optimizations
- **Debug Support**: Full debugging symbols and validation checks
- **Library Management**: Automatic detection of optional dependencies (NVML, MIOpen)
- **Cross-Platform**: Single Makefile supports both CUDA and HIP builds
@@ -426,7 +426,7 @@
|--------------|-------------------|------------------|--------------|
| **Beginner** | 10-100x | 60-80% | Educational |
| **Intermediate** | 50-500x | 80-95% | Optimized |
-| **Advanced** | 100-1000x | 85-95% | Production |
+| **Advanced** | 100-1000x | 85-95% | Professional |
| **Expert** | 500-5000x | 95%+ | Library-Quality |

## πŸ› Troubleshooting
@@ -507,7 +507,7 @@ If you use this project in your research, education, or publications, please cit
author={{Stephen Shao}},
year={2025},
howpublished={\url{https://github.com/AIComputing101/gpu-programming-101}},
-note={A complete GPU programming educational resource with 70+ production-ready examples covering fundamentals through advanced optimization techniques for NVIDIA CUDA and AMD HIP platforms}
+note={A complete GPU programming educational resource with 71 comprehensive examples covering fundamentals through advanced optimization techniques for NVIDIA CUDA and AMD HIP platforms}
}
```

2 changes: 1 addition & 1 deletion modules/module3/README.md
@@ -334,4 +334,4 @@ ncu --metrics l1tex__data_bank_conflicts_pipe_lsu_mem_shared_op_ld.sum ./02_scan

---

-**Note**: This module provides both educational implementations (showing algorithm progression) and production-ready optimized versions. Focus on understanding the concepts before optimizing for specific use cases.
+**Note**: This module provides both educational implementations (showing algorithm progression) and optimized versions. Focus on understanding the concepts before optimizing for specific use cases.
4 changes: 2 additions & 2 deletions modules/module5/README.md
@@ -436,11 +436,11 @@ Module 5 represents the pinnacle of GPU performance optimization, covering:
- **Memory Subsystem Optimization** across all levels of the GPU memory hierarchy
- **Compute Optimization Strategies** for maximum algorithmic efficiency
- **Cross-Platform Performance** considerations for portable high-performance code
-- **Production-Ready Optimization** techniques used in industry applications
+- **Professional Optimization** techniques used in industry applications

These skills are essential for:
- Achieving maximum performance from GPU investments
-- Building production-quality high-performance applications
+- Building professional-quality high-performance applications
- Understanding performance trade-offs in GPU algorithm design
- Developing performance-portable code across GPU architectures

2 changes: 1 addition & 1 deletion modules/module6/README.md
@@ -367,4 +367,4 @@ These algorithms form the foundation for more complex applications covered in su
**Difficulty**: Intermediate-Advanced
**Prerequisites**: Modules 1-5 completion, parallel algorithm concepts

-**Note**: This module emphasizes both educational understanding and production-ready implementations. Focus on mastering the algorithmic concepts before diving into platform-specific optimizations.
+**Note**: This module emphasizes both educational understanding and optimized implementations. Focus on mastering the algorithmic concepts before diving into platform-specific optimizations.
2 changes: 1 addition & 1 deletion modules/module7/README.md
@@ -372,4 +372,4 @@ Master these concepts to tackle the most demanding computational challenges and
**Difficulty**: Advanced
**Prerequisites**: Modules 1-6 completion, advanced algorithm knowledge

-**Note**: This module focuses on production-level implementations of sophisticated algorithms. Emphasis is placed on understanding both the theoretical foundations and practical optimization techniques required for real-world deployment.
+**Note**: This module focuses on advanced-level implementations of sophisticated algorithms. Emphasis is placed on understanding both the theoretical foundations and practical optimization techniques required for real-world deployment.
8 changes: 4 additions & 4 deletions modules/module8/README.md
@@ -30,7 +30,7 @@ By completing this module, you will:

#### 1. Deep Learning Inference Kernels (`01_deep_learning_*.cu/.cpp`)

-Production-quality neural network inference implementations:
+Professional-quality neural network inference implementations:

- **Custom Convolution Kernels**: Optimized for specific layer configurations
- **GEMM Optimization**: High-performance matrix multiplication for fully connected layers
@@ -214,7 +214,7 @@ make monte_carlo # Monte Carlo simulations
make finance # Computational finance
make library_integration # Library integration examples

-# Production builds with optimizations
+# Professional builds with optimizations
make production

# Debug builds for development
@@ -396,7 +396,7 @@ make scaling_analysis
Module 8 bridges the gap between GPU programming techniques and real-world applications:

- **Domain Expertise**: Apply GPU techniques to solve actual industry problems
-- **Production Quality**: Build applications that meet real-world performance and accuracy requirements
+- **Professional Quality**: Build applications that meet real-world performance and accuracy requirements
- **Integration Skills**: Successfully integrate GPU computing into existing workflows and systems
- **Optimization Mastery**: Achieve optimal performance for domain-specific computational patterns

@@ -414,4 +414,4 @@ Master these domain-specific applications to become a complete GPU computing exp
**Difficulty**: Advanced
**Prerequisites**: Modules 1-7 completion, domain-specific knowledge

-**Note**: This module emphasizes real-world application development with production-quality implementations. Students should focus on both technical excellence and practical deployment considerations.
+**Note**: This module emphasizes real-world application development with professional-quality implementations. Students should focus on both technical excellence and practical deployment considerations.
2 changes: 1 addition & 1 deletion modules/module8/examples/01_deep_learning_cuda.cu
@@ -1,7 +1,7 @@
/**
* Module 8: Domain-Specific Applications - Deep Learning Inference Kernels (CUDA)
*
-* Production-quality neural network inference implementations optimized for NVIDIA GPU architectures.
+* Professional-quality neural network inference implementations optimized for NVIDIA GPU architectures.
* This example demonstrates custom convolution kernels, GEMM optimization, activation functions,
* and mixed precision inference with Tensor Core utilization.
*
2 changes: 1 addition & 1 deletion modules/module8/examples/01_deep_learning_hip.cpp
@@ -13,7 +13,7 @@

const int WAVEFRONT_SIZE = 64;earning Inference Kernels (HIP)
*
-* Production-quality neural network inference implementations optimized for AMD GPU architectures.
+* Professional-quality neural network inference implementations optimized for AMD GPU architectures.
* This example demonstrates deep learning kernels adapted for ROCm/HIP with wavefront-aware
* optimizations and LDS utilization patterns specific to AMD hardware.
*
6 changes: 3 additions & 3 deletions modules/module8/examples/Makefile
@@ -24,7 +24,7 @@ BUILD_HIP = 0
GPU_VENDOR = NONE
endif

-# Compiler flags for production-quality applications
+# Compiler flags for professional-quality applications
CUDA_FLAGS = -std=c++17 -O3 -arch=sm_70 -lineinfo --use_fast_math
CUDA_DEBUG_FLAGS = -std=c++17 -g -G -arch=sm_70
HIP_FLAGS = -std=c++17 -O3 -ffast-math
@@ -186,7 +186,7 @@ debug: CUDA_FLAGS = $(CUDA_DEBUG_FLAGS)
debug: HIP_FLAGS = $(HIP_DEBUG_FLAGS)
debug: all

-# Production builds with maximum optimization
+# Professional builds with maximum optimization
.PHONY: production
production: CUDA_FLAGS += -DNDEBUG -Xptxas -O3
production: HIP_FLAGS += -DNDEBUG
@@ -589,7 +589,7 @@ help:
@echo " cuda - Build CUDA applications only"
@echo " hip - Build HIP applications only"
@echo " debug - Build with debug flags"
-@echo " production - Build with maximum optimization"
+@echo " professional - Build with maximum optimization"
@echo " clean - Remove build artifacts"
@echo ""
@echo "Domain Application Targets:"
4 changes: 2 additions & 2 deletions modules/module9/README.md
@@ -1,6 +1,6 @@
# Module 9: Production GPU Programming

-This module focuses on building enterprise-grade GPU applications with emphasis on deployment, maintenance, scalability, and integration with production systems. Learn how to transition from prototype to production-ready GPU software.
+This module focuses on building enterprise-grade GPU applications with emphasis on deployment, maintenance, scalability, and integration with production systems. Learn how to transition from prototype to professional-grade GPU software.

## Learning Objectives

@@ -357,7 +357,7 @@ make cost_analysis
- [ ] Monitoring and observability built into the application

### Infrastructure
-- [ ] Production-grade Kubernetes cluster with GPU support
+- [ ] Enterprise-grade Kubernetes cluster with GPU support
- [ ] Monitoring and alerting infrastructure deployed
- [ ] Backup and disaster recovery procedures implemented
- [ ] Security scanning and vulnerability management in place
6 changes: 3 additions & 3 deletions modules/module9/content.md
@@ -1,6 +1,6 @@
-# Production GPU Programming: Enterprise-Grade Implementation Guide
+# Professional GPU Programming: Enterprise-Grade Implementation Guide

-> Environment note: Production examples and deployment references assume development using Docker images with CUDA 12.9.1 (Ubuntu 22.04) and ROCm 7.0 (Ubuntu 24.04) for parity between environments. Enhanced build system supports production-grade optimizations.
+> Environment note: Professional examples and deployment references assume development using Docker images with CUDA 12.9.1 (Ubuntu 22.04) and ROCm 7.0 (Ubuntu 24.04) for parity between environments. Enhanced build system supports professional-grade optimizations.

This comprehensive guide covers all aspects of deploying, maintaining, and scaling GPU applications in production environments, from architecture design to operational excellence.

@@ -1469,6 +1469,6 @@ This comprehensive guide covers all essential aspects of production GPU programm
4. **Monitoring**: Comprehensive observability for GPU workloads
5. **Scalability**: Auto-scaling and load balancing strategies
6. **Security**: Enterprise-grade security and compliance
-7. **Best Practices**: Production-ready configuration and health monitoring
+7. **Best Practices**: Professional configuration and health monitoring

These concepts enable the development of enterprise-grade GPU applications that meet the demanding requirements of production environments while maintaining high performance, reliability, and security standards.
8 changes: 4 additions & 4 deletions modules/module9/examples/01_architecture_cuda.cu
@@ -1,12 +1,12 @@
/**
* Module 9: Production GPU Programming - Production Architecture Patterns (CUDA)
*
-* Enterprise-grade GPU application architecture demonstrating production-ready patterns
+* Enterprise-grade GPU application architecture demonstrating professional patterns
* including microservices design, error handling, monitoring integration, and scalable
* deployment strategies. This example showcases real-world production requirements.
*
* Topics Covered:
-* - Production-grade error handling and recovery mechanisms
+* - Professional-grade error handling and recovery mechanisms
* - Comprehensive logging and monitoring integration
* - Resource management and memory pools
* - Health checks and service discovery integration
@@ -37,7 +37,7 @@
#include <random>
#include <functional>

-// Production-grade error handling macros
+// Professional-grade error handling macros
#define CUDA_CHECK_PROD(call, context) \
do { \
cudaError_t error = call; \
@@ -841,7 +841,7 @@ int main(int argc, char* argv[]) {
// Demo mode - show capabilities
std::cout << "Production GPU Architecture Features:\n";
std::cout << "β€’ Comprehensive error handling and recovery\n";
-std::cout << "β€’ Production-grade logging and monitoring\n";
+std::cout << "β€’ Professional-grade logging and monitoring\n";
std::cout << "β€’ Resource management and memory pools\n";
std::cout << "β€’ Health checks and service discovery\n";
std::cout << "β€’ Configuration management\n";
6 changes: 3 additions & 3 deletions modules/module9/examples/01_architecture_hip.cpp
@@ -1,7 +1,7 @@
/**
* Module 9: Production GPU Programming - Production Architecture Patterns (HIP)
*
-* Enterprise-grade GPU application architecture demonstrating production-ready patterns
+* Enterprise-grade GPU application architecture demonstrating professional patterns
* adapted for AMD GPU architectures using ROCm/HIP. This example showcases real-world
* production requirements optimized for AMD hardware and ROCm ecosystem.
*
@@ -34,7 +34,7 @@
#include <random>
#include <functional>

-// Production-grade error handling macros for HIP
+// Professional-grade error handling macros for HIP
#define HIP_CHECK_PROD(call, context) \
do { \
hipError_t error = call; \
@@ -828,7 +828,7 @@ int main(int argc, char* argv[]) {
std::cout << "β€’ Wavefront-aware resource management (64-thread wavefronts)\n";
std::cout << "β€’ NUMA-aware memory allocation for multi-GPU systems\n";
std::cout << "β€’ AMD GPU specific error handling and recovery\n";
-std::cout << "β€’ Production-grade logging optimized for ROCm ecosystem\n";
+std::cout << "β€’ Professional-grade logging optimized for ROCm ecosystem\n";
std::cout << "β€’ Multi-tenant resource isolation for AMD GPUs\n";
std::cout << "β€’ Real-time health monitoring with AMD-specific thresholds\n";

4 changes: 2 additions & 2 deletions modules/module9/examples/Makefile
@@ -25,7 +25,7 @@ BUILD_HIP = 0
GPU_VENDOR = NONE
endif

-# Compiler flags for production-ready applications
+# Compiler flags for professional applications
CUDA_FLAGS = -std=c++17 -O3 -arch=sm_70 -lineinfo --use_fast_math -DPRODUCTION_BUILD
CUDA_DEBUG_FLAGS = -std=c++17 -g -G -arch=sm_70 -DDEBUG_BUILD
HIP_FLAGS = -std=c++17 -O3 -ffast-math -DPRODUCTION_BUILD
@@ -659,7 +659,7 @@ help:
@echo ""
@echo "Build Targets:"
@echo " all - Build all production applications"
-@echo " production - Build with production optimization and hardening"
+@echo " professional - Build with professional optimization and hardening"
@echo " debug - Build with debug information"
@echo " clean - Remove all build artifacts"
@echo ""