From 675cb301d4b284b4dc1b64b1967c20f671aa811b Mon Sep 17 00:00:00 2001 From: Stephen Shao Date: Mon, 22 Sep 2025 11:15:09 -0400 Subject: [PATCH 1/9] Updated docker images --- CONTRIBUTING.md | 16 ++- CUDA_ROCM_FEATURES.md | 195 ++++++++++++++++++++++++++++++ README.md | 8 +- docker/README.md | 26 ++-- docker/cuda/Dockerfile | 16 ++- docker/docker-compose.yml | 2 +- docker/scripts/build.sh | 4 +- modules/module1/README.md | 2 +- modules/module1/content.md | 2 +- modules/module2/README.md | 2 +- modules/module2/content.md | 2 +- modules/module3/content.md | 2 +- modules/module4/content.md | 2 +- modules/module5/content.md | 2 +- modules/module6/content.md | 2 +- modules/module7/content.md | 2 +- modules/module8/content.md | 2 +- modules/module9/content.md | 6 +- modules/module9/examples/Makefile | 2 +- 19 files changed, 248 insertions(+), 47 deletions(-) create mode 100644 CUDA_ROCM_FEATURES.md diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md index a03c978..c739738 100644 --- a/CONTRIBUTING.md +++ b/CONTRIBUTING.md @@ -41,7 +41,7 @@ docker-compose up -d cuda-dev # For NVIDIA GPUs docker-compose up -d rocm-dev # For AMD GPUs # Option 2: Native development -# Install CUDA Toolkit 12.9.1+ or ROCm latest +# Install CUDA Toolkit 13.0.1+ or ROCm latest # See modules/module1/README.md for detailed setup instructions # Build all examples @@ -242,7 +242,7 @@ When reporting bugs, please include: - **Operating System**: (Ubuntu 22.04, Windows 11, etc.) - **GPU**: (RTX 4090, RX 7900 XTX, etc.) - **Driver Version**: (NVIDIA 535.x, ROCm latest, etc.) -- **CUDA/HIP Version**: (12.9.1, 7.0, etc.) +- **CUDA/HIP Version**: (13.0.1, 7.0.1, etc.) - **Docker**: (if using containerized development) ### Bug Description @@ -258,7 +258,7 @@ When reporting bugs, please include: **Environment:** - OS: Ubuntu 22.04 - GPU: RTX 4080 -- CUDA: 12.9.1 +- CUDA: 13.0.1 - Driver: 535.98 **Description:** @@ -313,4 +313,12 @@ By contributing, you agree that your contributions will be licensed under the MI --- -Thank you for contributing to GPU Programming 101! Your efforts help make GPU computing more accessible to developers worldwide. šŸš€ \ No newline at end of file +Thank you for contributing to GPU Programming 101! Your efforts help make GPU computing more accessible to developers worldwide. šŸš€ + +## 🧩 Maintaining feature docs + +If you update examples or module content to use new CUDA or ROCm capabilities, please also: + +- Bump the versions in `CUDA_ROCM_FEATURES.md` and re‑scan the official release notes. +- Update module READMEs to mention any new minimum driver/toolkit requirements. +- Avoid marketing claims; prefer links to vendor docs and measured results in our own benchmarks. \ No newline at end of file diff --git a/CUDA_ROCM_FEATURES.md b/CUDA_ROCM_FEATURES.md new file mode 100644 index 0000000..6812740 --- /dev/null +++ b/CUDA_ROCM_FEATURES.md @@ -0,0 +1,195 @@ +# CUDA and ROCm Feature Guide (Living Document) + +Last updated: 2025-09-22 + +This guide summarizes current, officially documented features of NVIDIA CUDA and AMD ROCm that we leverage across this project. It is designed to be easy to maintain as new versions ship. Where possible, we link to authoritative sources instead of restating volatile details. + +Tip: Prefer the linked release notes and programming guides for exact, version-specific behavior. Update checklist is at the end of this document. 
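+
+To sanity-check a local environment against the versions tracked here, one might run (a sketch; assumes default install locations):
+
+```bash
+nvcc --version                                                   # CUDA toolkit release
+nvidia-smi --query-gpu=driver_version,compute_cap --format=csv   # driver and GPU compute capability
+hipcc --version                                                  # HIP/ROCm compiler release
+cat /opt/rocm/.info/version                                      # ROCm version file (path may vary by install)
+```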
+ +--- + +## Current Versions at a Glance + +- CUDA: 13.0 Update 1 (13.0.U1) + - Source of truth: NVIDIA CUDA Toolkit Release Notes + - Driver requirement overview: CUDA Compatibility Guide for Drivers +- ROCm: 7.0.1 + - Source of truth: ROCm Release History and ROCm docs index + +Reference links are provided at the bottom for maintenance. + +--- + +## CUDA 13.x overview + +Highlights pulled from NVIDIA’s official docs (see links): + +- General platform + - CUDA 13.x is ABI-stable within the major series; requires r580+ driver on Linux. + - Increased MPS server client limits on Ampere and newer architectures (subject to architectural limits). +- Compiler and runtime + - NVCC/NVRTC updates; PTX ISA updates (see PTX 9.0 notes in release docs). + - Programmatic Dependent Launch (PDL) support in select library kernels on sm_90+. +- Developer tools + - Nsight Systems and Nsight Compute continue as the primary profilers. + - Compute Sanitizer updates; Visual Profiler and nvprof are removed in 13.0. +- Deprecations and removals + - Dropped offline compilation/library support for pre-Turing architectures (Maxwell, Pascal, Volta) in CUDA 13.0. Continue to use 12.x to target these. + - Windows Toolkit no longer bundles a display driver (install separately). + - Removed multi-device cooperative group launch APIs; several legacy headers removed. + +Architectures and typical use cases (non-exhaustive): + +- Blackwell/Blackwell Ultra (SM110+): next‑gen AI/HPC; FP4/FP8 workflows via libraries. +- Hopper (H100/H200, SM90): transformer engine, thread block clusters, DPX; AI training/HPC. +- Ada (RTX 40): workstation/development; AV1 encode; content creation/AI dev. +- Ampere (A100/RTX 30): MIG, 3rd‑gen tensor cores; research/mixed workloads. + +Core libraries snapshot (examples; see library release notes for specifics): + +- cuBLAS/cuBLASLt: autotuning options; improvements on newer architectures; mixed precision and block‑scaled formats. +- cuFFT: new error codes; performance changes; dropped pre‑Turing support. +- cuSPARSE: generic API enhancements; 64‑bit indices in SpGEMM; various bug fixes. +- Math/NPP/nvJPEG: targeted perf/accuracy improvements and API cleanups. + +Authoritative references: + +- CUDA Toolkit Release Notes (13.0 U1) +- CUDA Compatibility Guide for Drivers +- Nsight Systems Release Notes; Nsight Compute Release Notes +- CUDA C++ Programming Guide changelog + +--- + +## ROCm 7.0.x overview + +Highlights from AMD’s official docs (see links): + +- ROCm 7.0.1 is the latest as of 2025‑09‑17; consult the release history for point updates. +- HIP as the primary programming model, with CUDA‑like APIs and HIP‑Clang toolchain. +- Windows support targets HIP SDK for development; full ROCm stack targets Linux. +- Libraries are provided under the ROCm organization (rocBLAS/hipBLAS, rocFFT/hipFFT, rocSPARSE/hipSPARSE, rocRAND/hipRAND, rocSOLVER/hipSOLVER, rocPRIM/hipCUB, rocThrust, etc.). +- Tooling and system components: ROCr runtime, ROCm SMI, rocprof/rocprofiler, rocgdb/rocm‑debug‑agent. + +Architectures (illustrative, not exhaustive): + +- CDNA3 (MI300 family): AI training and HPC; unified memory on APUs (MI300A), large HBM configs (MI300X). +- RDNA3 (Radeon 7000 series): workstation/gaming; AV1 encode/decode; hardware ray tracing. + +Common libraries (see ROCm Libraries reference): + +- rocBLAS / hipBLAS; rocFFT / hipFFT; rocRAND / hipRAND; rocSPARSE / hipSPARSE; rocSOLVER / hipSOLVER; rocPRIM/hipCUB; rocThrust. +- ML/DL: MIOpen; framework integrations via the ROCm for AI guide. 
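+
+As a concrete taste of the HIP programming model described above, here is a minimal vector-add sketch (illustrative only, not taken from the course modules; compile with `hipcc`):
+
+```cpp
+#include <hip/hip_runtime.h>
+#include <cstdio>
+#include <vector>
+
+__global__ void vectorAdd(const float* a, const float* b, float* c, int n) {
+    int i = blockIdx.x * blockDim.x + threadIdx.x;  // same indexing model as CUDA
+    if (i < n) c[i] = a[i] + b[i];
+}
+
+int main() {
+    const int n = 1 << 20;
+    const size_t bytes = n * sizeof(float);
+    std::vector<float> ha(n, 1.0f), hb(n, 2.0f), hc(n);
+    float *da, *db, *dc;
+    hipMalloc(&da, bytes);  // mirrors cudaMalloc
+    hipMalloc(&db, bytes);
+    hipMalloc(&dc, bytes);
+    hipMemcpy(da, ha.data(), bytes, hipMemcpyHostToDevice);
+    hipMemcpy(db, hb.data(), bytes, hipMemcpyHostToDevice);
+    dim3 block(256), grid((n + block.x - 1) / block.x);
+    hipLaunchKernelGGL(vectorAdd, grid, block, 0, 0, da, db, dc, n);  // hip-clang also accepts vectorAdd<<<grid, block>>>(...)
+    hipMemcpy(hc.data(), dc, bytes, hipMemcpyDeviceToHost);
+    printf("hc[0] = %.1f (expect 3.0)\n", hc[0]);
+    hipFree(da); hipFree(db); hipFree(dc);
+    return 0;
+}
+```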
+
+Authoritative references:
+
+- ROCm Docs index (What is ROCm?, install, reference)
+- ROCm Release History (7.0.1, 7.0.0, …)
+- ROCm libraries reference; tools/compilers/runtimes reference
+
+---
+
+## Cross‑platform mapping (CUDA ⇄ HIP)
+
+Quick mapping for common concepts. Always check specific APIs for support and behavior differences.
+
+- Kernel launch
+  - CUDA: `kernel<<<grid, block>>>(args)`; HIP: hipLaunchKernelGGL (hip-clang also accepts the triple-chevron syntax)
+- Memory management
+  - CUDA: cudaMalloc/cudaMemcpy/etc.; HIP: hipMalloc/hipMemcpy/etc.
+- Streams and events
+  - CUDA: cudaStream_t/cudaEvent_t; HIP: hipStream_t/hipEvent_t
+- Graphs
+  - CUDA: cudaGraph_t/cudaGraphExec_t; HIP: hipGraph_t and equivalents; feature coverage evolves, verify against ROCm docs.
+- Cooperative groups
+  - CUDA: cooperative_groups; HIP: HIP cooperative groups header; multi‑device variants differ (and some CUDA multi‑device APIs were removed in 13.0).
+- Libraries
+  - cuBLAS ↔ hipBLAS/rocBLAS; cuFFT ↔ hipFFT/rocFFT; cuSPARSE ↔ hipSPARSE/rocSPARSE; Thrust/CUB ↔ rocThrust/hipCUB/rocPRIM.
+
+Porting aids:
+
+- hipify tools (hipify-perl, hipify-clang) for source translation; hip‑clang for compilation.
+
+---
+
+## Compatibility and supported platforms
+
+- CUDA drivers and OS
+  - See the CUDA Compatibility Guide for minimum driver versions by toolkit series (e.g., 13.x requires r580+ on Linux). The Windows driver is no longer bundled starting with 13.0.
+- CUDA architectures
+  - 13.0 drops offline compilation/library support for Maxwell/Pascal/Volta; continue to use 12.x for those targets.
+- ROCm OS/GPU support
+  - See ROCm install guides and GPU/accelerator support references for Linux and Windows HIP SDK system requirements.
+
+---
+
+## Educational integration (this repository)
+
+This course demonstrates both CUDA and HIP across modules. Key tool updates to note:
+
+- Profiling and analysis
+  - NVIDIA: Nsight Systems, Nsight Compute, CUPTI changes in 13.x, Compute Sanitizer
+  - AMD: rocprof/rocprofiler, ROCm SMI
+- Memory and graphs
+  - CUDA: CUDA Graphs; memory pools and VMM; asynchronous copy
+  - ROCm: HIP graph APIs (coverage evolves); ROCr runtime memory features
+
+Example module alignment (indicative; see each module’s README for details):
+
+- Module 1: Runtime APIs, device queries, build/tooling
+- Module 2: Memory management (device, pinned, unified/coherent where available)
+- Module 3: Synchronization and cooperation (warp/wavefront‑level, cooperative groups)
+- Module 4: Streams, events, graphs, and multi‑GPU basics
+- Module 5: Profiling and debugging (Nsight Tools, Compute Sanitizer, rocprof, rocm‑smi)
+- Module 6+: Libraries (BLAS/FFT/SPARSE) and domain examples (AI/HPC)
+
+### New features by module (CUDA 13.x and ROCm 7.0.x)
+
+| Module | CUDA (what you’ll learn) | ROCm/HIP (what you’ll learn) |
+|---|---|---|
+| Module 1: Getting Started | Toolchain (nvcc), project layout, kernel launch basics (grid/block/thread indexing), device vs host code, cudaMalloc/cudaMemcpy, device query and error handling | Toolchain (hipcc/hip-clang), hipLaunchKernelGGL, hipMalloc/hipMemcpy, hipGetDeviceProperties, mapping CUDA concepts to HIP |
+| Module 2: Memory & Data Movement | Global/shared/constant/texture memory usage, coalesced access, pinned memory, unified memory and prefetch, async copies and measuring bandwidth | HIP memory APIs and ROCr memory model, pinned host buffers, unified/coherent memory notes, async transfers, using rocm-smi/rocprof to observe bandwidth |
+| Module 3: Parallel Patterns & Sync | Reductions, scans, sorting; warp-level primitives; cooperative groups; shared memory
tiling; atomics and barriers; occupancy considerations | rocPRIM/hipCUB/rocThrust equivalents; wavefront-level ops; HIP cooperative groups; LDS usage; atomics and synchronization semantics | +| Module 4: Concurrency, Streams & Multi‑GPU | Streams/events, priorities, CUDA Graphs (capture/instantiate/launch), peer-to-peer (UVA/P2P), basic multi‑GPU patterns | hipStream/hipEvent, HIP Graph API coverage and usage, peer access where supported, multi‑GPU fundamentals with ROCm tools | +| Module 5: Profiling, Debugging & Sanitizers | Nsight Systems (timeline/tracing), Nsight Compute (kernel analysis), Compute Sanitizer (racecheck/memcheck), intro to CUPTI-based profiling | rocprof/rocprofiler for traces and metrics, rocm-smi telemetry, rocgdb/ROCm Debug Agent basics, best practices for profiling | +| Module 6: Math & Core Libraries | cuBLAS/cuBLASLt (GEMM, batched ops, mixed precision), cuFFT, cuSPARSE, Thrust/CUB algorithms, choosing/tuning library routines | rocBLAS/hipBLAS, rocFFT/hipFFT, rocSPARSE/hipSPARSE, rocThrust/hipCUB/rocPRIM; Tensile-backed tuning in rocBLAS; API parity tips | +| Module 7: Advanced Algorithms & Optimization | Tiling and cache use, shared memory bank conflicts, cooperative groups for complex patterns, intro to memory pools/VMM, kernel fusion patterns | Wavefront-aware tuning, LDS patterns, rocPRIM building blocks, HIP-specific perf tips, memory behavior across devices | +| Module 8: AI/ML Workflows | cuDNN basics, TensorRT concepts (dynamic shapes/precision), mixed precision (FP16/BF16/FP8 via libs), graphs for inference pipelines | MIOpen basics, framework setup on ROCm (PyTorch/TF where supported), MIGraphX or framework runtimes, mixed precision support | +| Module 9: Packaging, Deployment & Containers | CUDA containers (base/runtime-devel), driver/runtime compatibility, minimal deployment artifacts, reproducible builds | ROCm container bases (rocm/dev), runtime setup (kernel modules, groups/permissions), compatibility guidance and reproducibility | + +--- + +## Maintenance: how to update this document + +When CUDA or ROCm releases a new version, follow this checklist: + +1) Update versions at the top + - CUDA: consult CUDA Toolkit Release Notes page; record the latest major.minor (e.g., 13.0 Update 1) and driver requirements. + - ROCm: consult ROCm Release History; record latest (e.g., 7.0.1). +2) Scan notable changes + - CUDA: skim ā€œNew Featuresā€, ā€œDeprecated or Dropped Featuresā€, and library sections (cuBLAS/cuFFT/…); note any course‑impacting changes. + - ROCm: skim ā€œWhat is ROCm?ā€, ā€œROCm librariesā€, and ā€œTools/Compilers/Runtimesā€ sections for new features or renamed packages. +3) Verify cross‑platform notes + - Confirm HIP Graph API coverage and any caveats; update mapping if needed. +4) Update references + - Keep the link reference list (below) current; avoid copying long tables—link out to authoritative docs. +5) Record the date in ā€œLast updatedā€. + +Tip: Avoid claiming specific percentage speedups unless you include a citation. 
Prefer phrasing like ā€œperformance improvements in X; see release notes.ā€ + +--- + +## Reference links (authoritative sources) + +- NVIDIA + - CUDA Toolkit Release Notes: https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html + - CUDA Compatibility Guide (drivers): https://docs.nvidia.com/deploy/cuda-compatibility/index.html + - CUDA C++ Programming Guide (changelog): https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#changelog + - Nsight Systems Release Notes: https://docs.nvidia.com/nsight-systems/ReleaseNotes/index.html + - Nsight Compute Release Notes: https://docs.nvidia.com/nsight-compute/ReleaseNotes/index.html +- AMD + - ROCm docs index: https://rocm.docs.amd.com/en/latest/index.html + - ROCm release history: https://rocm.docs.amd.com/en/latest/release/versions.html + - ROCm libraries reference: https://rocm.docs.amd.com/en/latest/reference/api-libraries.html + - ROCm tools/compilers/runtimes: https://rocm.docs.amd.com/en/latest/reference/rocm-tools.html + - HIP documentation: https://rocm.docs.amd.com/projects/HIP/en/latest/index.html \ No newline at end of file diff --git a/README.md b/README.md index a833a65..884fd26 100644 --- a/README.md +++ b/README.md @@ -1,7 +1,7 @@ # GPU Programming 101 šŸš€ [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT) -[![CUDA](https://img.shields.io/badge/CUDA-12.9.1-76B900?logo=nvidia)](https://developer.nvidia.com/cuda-toolkit) +[![CUDA](https://img.shields.io/badge/CUDA-13.0.1-76B900?logo=nvidia)](https://developer.nvidia.com/cuda-toolkit) [![ROCm](https://img.shields.io/badge/ROCm-7.0-red?logo=amd)](https://rocmdocs.amd.com/) [![Docker](https://img.shields.io/badge/Docker-Ready-2496ED?logo=docker)](https://www.docker.com/) [![Examples](https://img.shields.io/badge/Examples-71-green)](modules/) @@ -197,7 +197,7 @@ This architectural knowledge is essential for writing efficient GPU code and is |---------|-------------| | šŸŽÆ **Complete Curriculum** | 9 progressive modules from basics to advanced topics | | šŸ’» **Cross-Platform** | Full CUDA and HIP support for NVIDIA and AMD GPUs | -| 🐳 **Docker Ready** | Complete containerized development environment with CUDA 12.9.1 & ROCm 7.0 | +| 🐳 **Docker Ready** | Complete containerized development environment with CUDA 13.0.1 & ROCm 7.0.1 | | šŸ”§ **Professional Quality** | Professional build systems, auto-detection, testing, and profiling | | šŸ“Š **Performance Focus** | Optimization techniques and benchmarking throughout | | 🌐 **Community Driven** | Open source with comprehensive contribution guidelines | @@ -319,7 +319,7 @@ Module 5: Performance Tuning - **macOS**: macOS 12+ (Metal Performance Shaders for basic GPU compute) #### GPU Computing Platforms -- **CUDA Toolkit**: 12.0+ (Docker uses CUDA 12.9.1) +- **CUDA Toolkit**: 13.0+ recommended (Docker uses CUDA 13.0.1) - **Driver Requirements**: - Linux: 550.54.14+ for CUDA 12.4+ - Windows: 551.61+ for CUDA 12.4+ @@ -385,7 +385,7 @@ Experience the full development environment with zero setup: - 🧹 Easy cleanup when done **Container Specifications:** -- **CUDA**: NVIDIA CUDA 12.9.1 on Ubuntu 22.04 +- **CUDA**: NVIDIA CUDA 13.0.1 on Ubuntu 24.04 - **ROCm**: AMD ROCm 7.0 on Ubuntu 24.04 - **Libraries**: Professional toolchains with debugging support diff --git a/docker/README.md b/docker/README.md index 2c5497e..8109529 100644 --- a/docker/README.md +++ b/docker/README.md @@ -4,10 +4,10 @@ This directory contains Docker configurations for comprehensive GPU programming ## 
šŸš€ Latest Versions (2025) -- **CUDA**: 12.9.1 (Latest stable release) -- **ROCm**: 7.0 (Latest stable release) -- **Ubuntu**: 22.04 LTS -- **Nsight Tools**: 2025.1.1 +- **CUDA**: 13.0.1 (Toolkit 13.0 U1) +- **ROCm**: 7.0.1 (Latest stable release) +- **Ubuntu**: 24.04 LTS +- **Nsight Tools**: 2025.3.x ## šŸš€ Quick Start @@ -58,27 +58,27 @@ docker/ ### CUDA Development Container **Image**: `gpu-programming-101:cuda` -**Base**: `nvidia/cuda:12.9.1-devel-ubuntu22.04` +**Base**: `nvidia/cuda:13.0.1-cudnn-devel-ubuntu24.04` **Features**: -- CUDA 12.9.1 with development tools +- CUDA 13.0.1 (cuDNN devel) with development tools - NVIDIA Nsight Systems & Compute profilers - Python 3 with scientific libraries - GPU monitoring and debugging tools **GPU Requirements**: -- NVIDIA GPU with compute capability 3.5+ -- NVIDIA drivers 535+ +- NVIDIA GPU supported by CUDA 13.x (Turing and newer recommended for new toolchain features) +- NVIDIA drivers r580+ - nvidia-container-toolkit ### ROCm Development Container **Image**: `gpu-programming-101:rocm` -**Base**: `rocm/dev-ubuntu-22.04:7.0-complete` +**Base**: `rocm/dev-ubuntu-24.04:7.0.1-complete` **Features**: - ROCm 7.0 with HIP development environment - Cross-platform GPU programming (AMD/NVIDIA) -- ROCm profiling tools (rocprof, roctracer) +- ROCm profiling tools (rocprof, rocprofiler) - Python 3 with scientific libraries **GPU Requirements**: @@ -282,7 +282,7 @@ nvidia-smi # For NVIDIA rocm-smi # For AMD # Verify Docker GPU support -docker run --rm --gpus all nvidia/cuda:12.9.1-base nvidia-smi +docker run --rm --gpus all nvidia/cuda:13.0.1-base-ubuntu24.04 nvidia-smi # Check container runtime docker run --rm --device=/dev/kfd rocm/dev-ubuntu-22.04:7.0 rocminfo @@ -297,8 +297,8 @@ docker system prune -a sudo apt update && sudo apt upgrade docker-ce docker-compose # Check base image availability -docker pull nvidia/cuda:12.9.1-devel-ubuntu22.04 -docker pull rocm/dev-ubuntu-22.04:7.0-complete +docker pull nvidia/cuda:13.0.1-cudnn-devel-ubuntu24.04 +docker pull rocm/dev-ubuntu-24.04:7.0.1-complete ``` **"Permission denied errors"** diff --git a/docker/cuda/Dockerfile b/docker/cuda/Dockerfile index e320889..d9202d7 100644 --- a/docker/cuda/Dockerfile +++ b/docker/cuda/Dockerfile @@ -1,14 +1,14 @@ # GPU Programming 101 - CUDA Development Container -# Based on NVIDIA's official CUDA 12.9.1 development image (latest stable as of 2025) +# Based on NVIDIA's official CUDA 13.0.1 cuDNN development image (Ubuntu 24.04) -FROM nvidia/cuda:12.9.1-devel-ubuntu22.04 +FROM nvidia/cuda:13.0.1-cudnn-devel-ubuntu24.04 # Metadata LABEL maintainer="GPU Programming 101" LABEL description="CUDA development environment for GPU programming course" LABEL version="2.0" -LABEL cuda.version="12.9.1" -LABEL ubuntu.version="22.04" +LABEL cuda.version="13.0.1" +LABEL ubuntu.version="24.04" # Avoid interactive prompts during package installation ARG DEBIAN_FRONTEND=noninteractive @@ -32,8 +32,6 @@ RUN apt-get update && apt-get install -y \ # Additional utilities pkg-config \ software-properties-common \ - # GPU monitoring tools (installed but won't work during build) - nvidia-utils-535 \ # Debugging and profiling tools gdb \ valgrind \ @@ -45,7 +43,7 @@ RUN apt-get update && apt-get install -y \ # Install optional CUDA tools if available RUN apt-get update && \ - (apt-get install -y cuda-tools-12-9 || apt-get install -y cuda-tools || true) && \ + (apt-get install -y cuda-tools-13-0 || apt-get install -y cuda-tools || true) && \ rm -rf /var/lib/apt/lists/* # Install minimal Python 
packages for basic development (no heavy data science libs) @@ -58,7 +56,7 @@ ENV PATH=/usr/local/cuda/bin:${PATH} ENV LD_LIBRARY_PATH=/usr/local/cuda/lib64:${LD_LIBRARY_PATH} ENV CUDA_HOME=/usr/local/cuda ENV CUDA_ROOT=/usr/local/cuda -ENV CUDA_VERSION=12.9.1 +ENV CUDA_VERSION=13.0.1 ENV NVIDIA_VISIBLE_DEVICES=all ENV NVIDIA_DRIVER_CAPABILITIES=compute,utility @@ -156,7 +154,7 @@ RUN chmod +x /workspace/test-gpu.sh RUN cd /workspace && \ git clone https://github.com/NVIDIA/cuda-samples.git && \ cd cuda-samples && \ - git checkout v12.9 + git checkout v13.0 # Default command CMD ["/bin/bash"] diff --git a/docker/docker-compose.yml b/docker/docker-compose.yml index 87df707..388fdd6 100644 --- a/docker/docker-compose.yml +++ b/docker/docker-compose.yml @@ -1,6 +1,6 @@ # GPU Programming 101 - Docker Compose Configuration # Supports both NVIDIA CUDA and AMD ROCm platforms -# Updated for CUDA 12.9.1 and ROCm 7.0 (2025) +# Updated for CUDA 13.0.1 and ROCm 7.0.x (2025) services: # NVIDIA CUDA Development Environment diff --git a/docker/scripts/build.sh b/docker/scripts/build.sh index 2aee5e0..c62f347 100755 --- a/docker/scripts/build.sh +++ b/docker/scripts/build.sh @@ -211,8 +211,8 @@ main() { if [ "$pull" = true ]; then log "Pulling base images..." - docker pull nvidia/cuda:12.4-devel-ubuntu22.04 || warning "Failed to pull CUDA base image" - docker pull rocm/dev-ubuntu-24.04:latest || warning "Failed to pull ROCm base image" + docker pull nvidia/cuda:13.0.1-cudnn-devel-ubuntu24.04 || warning "Failed to pull CUDA base image" + docker pull rocm/dev-ubuntu-24.04:7.0.1-complete || warning "Failed to pull ROCm base image" fi local success_count=0 diff --git a/modules/module1/README.md b/modules/module1/README.md index 5a5eddd..82bd86d 100644 --- a/modules/module1/README.md +++ b/modules/module1/README.md @@ -20,7 +20,7 @@ After completing this module, you will be able to: ### Prerequisites - NVIDIA GPU with CUDA support OR AMD GPU with ROCm support -- CUDA Toolkit 12.0+ or ROCm 6.0+ (Docker images provide CUDA 12.9.1 and ROCm latest) +- CUDA Toolkit 13.0+ or ROCm 7.0+ (Docker images provide CUDA 13.0.1 and ROCm 7.0.1) - C/C++ compiler (GCC, Clang, or MSVC) Tip: You can skip native installs by using our Docker environment (recommended): diff --git a/modules/module1/content.md b/modules/module1/content.md index 61ea826..804c04b 100644 --- a/modules/module1/content.md +++ b/modules/module1/content.md @@ -1,7 +1,7 @@ # Module 1: Foundations of GPU Programming with CUDA and HIP *Heterogeneous Data Parallel Computing* -> Environment note: Examples are validated in containers using CUDA 12.9.1 (Ubuntu 22.04) and ROCm 7.0 (Ubuntu 24.04). The advanced build system automatically detects your GPU vendor and optimizes accordingly. Using Docker is recommended for a consistent setup. +> Environment note: Examples are validated in containers using CUDA 13.0.1 (Ubuntu 24.04) and ROCm 7.0.1 (Ubuntu 24.04). The advanced build system automatically detects your GPU vendor and optimizes accordingly. Using Docker is recommended for a consistent setup. 
## Learning Objectives After completing this module, you will be able to: diff --git a/modules/module2/README.md b/modules/module2/README.md index 0456d99..5e66906 100644 --- a/modules/module2/README.md +++ b/modules/module2/README.md @@ -19,7 +19,7 @@ After completing this module, you will be able to: ### Prerequisites - NVIDIA GPU with CUDA support OR AMD GPU with ROCm support -- CUDA Toolkit 12.0+ or ROCm 6.0+ (Docker images provide CUDA 12.9.1 and ROCm latest) +- CUDA Toolkit 13.0+ or ROCm 7.0+ (Docker images provide CUDA 13.0.1 and ROCm 7.0.1) - C/C++ compiler (GCC, Clang, or MSVC) Recommended: use our Docker dev environment diff --git a/modules/module2/content.md b/modules/module2/content.md index 0d4cfdc..ce86349 100644 --- a/modules/module2/content.md +++ b/modules/module2/content.md @@ -1,7 +1,7 @@ # Module 2: Advanced GPU Memory Management and Optimization *Mastering GPU Memory Hierarchies and Performance Optimization* -> Environment note: Examples are tested in Docker containers with CUDA 12.9.1 (Ubuntu 22.04) and ROCm 7.0 (Ubuntu 24.04). The improved build system automatically optimizes memory access patterns. Prefer Docker for reproducible builds. +> Environment note: Examples are tested in Docker containers with CUDA 13.0.1 (Ubuntu 24.04) and ROCm 7.0.1 (Ubuntu 24.04). The improved build system automatically optimizes memory access patterns. Prefer Docker for reproducible builds. ## Learning Objectives After completing this module, you will be able to: diff --git a/modules/module3/content.md b/modules/module3/content.md index 6c6e321..c0b9f68 100644 --- a/modules/module3/content.md +++ b/modules/module3/content.md @@ -1,7 +1,7 @@ # Module 3: Advanced GPU Algorithms and Parallel Patterns *Mastering High-Performance Parallel Computing Algorithms* -> Environment note: Use the provided Docker images (CUDA 12.9.1 on Ubuntu 22.04, ROCm 7.0 on Ubuntu 24.04) with automatic GPU detection for consistent toolchains across platforms. +> Environment note: Use the provided Docker images (CUDA 13.0.1 on Ubuntu 24.04, ROCm 7.0.1 on Ubuntu 24.04) with automatic GPU detection for consistent toolchains across platforms. ## Learning Objectives After completing this module, you will be able to: diff --git a/modules/module4/content.md b/modules/module4/content.md index 33628a0..07e2758 100644 --- a/modules/module4/content.md +++ b/modules/module4/content.md @@ -1,6 +1,6 @@ # Module 4: Advanced GPU Programming - Multi-GPU, Streams, and Scalability -> Environment note: Examples are validated with CUDA 12.9.1 (Ubuntu 22.04) and ROCm 7.0 (Ubuntu 24.04) in Docker containers. Multi-GPU sections may require appropriate hardware and drivers. Auto-detection build system optimizes for your platform. +> Environment note: Examples are validated with CUDA 13.0.1 (Ubuntu 24.04) and ROCm 7.0.1 (Ubuntu 24.04) in Docker containers. Multi-GPU sections may require appropriate hardware and drivers. Auto-detection build system optimizes for your platform. ## Overview diff --git a/modules/module5/content.md b/modules/module5/content.md index 78a1de1..802c1d5 100644 --- a/modules/module5/content.md +++ b/modules/module5/content.md @@ -1,6 +1,6 @@ # Module 5: Performance Considerations and GPU Optimization -> Environment note: Examples and profiling workflows are validated using Docker images with CUDA 12.9.1 (Ubuntu 22.04) and ROCm 7.0 (Ubuntu 24.04) for consistent toolchains. Enhanced build system includes profiling integrations. 
+> Environment note: Examples and profiling workflows are validated using Docker images with CUDA 13.0.1 (Ubuntu 24.04) and ROCm 7.0.1 (Ubuntu 24.04) for consistent toolchains. Enhanced build system includes profiling integrations. ## Table of Contents 1. [Introduction to GPU Performance Optimization](#introduction) diff --git a/modules/module6/content.md b/modules/module6/content.md index 61d2c2c..1c61608 100644 --- a/modules/module6/content.md +++ b/modules/module6/content.md @@ -1,6 +1,6 @@ # Module 6: Fundamental Parallel Algorithms - Comprehensive Guide -> Environment note: The examples and benchmarks in this module are tested in Docker with CUDA 12.9.1 (Ubuntu 22.04) and ROCm 7.0 (Ubuntu 24.04) to ensure reproducibility. Recent algorithm fixes improve performance. +> Environment note: The examples and benchmarks in this module are tested in Docker with CUDA 13.0.1 (Ubuntu 24.04) and ROCm 7.0.1 (Ubuntu 24.04) to ensure reproducibility. Recent algorithm fixes improve performance. ## Introduction diff --git a/modules/module7/content.md b/modules/module7/content.md index c971355..59a4d65 100644 --- a/modules/module7/content.md +++ b/modules/module7/content.md @@ -1,6 +1,6 @@ # Module 7: Advanced Algorithmic Patterns - Comprehensive Guide -> Environment note: Use the provided Docker environment (CUDA 12.9.1 on Ubuntu 22.04, ROCm 7.0 on Ubuntu 24.04) for consistent builds and tools across platforms. Recent algorithmic pattern fixes included. +> Environment note: Use the provided Docker environment (CUDA 13.0.1 on Ubuntu 24.04, ROCm 7.0.1 on Ubuntu 24.04) for consistent builds and tools across platforms. Recent algorithmic pattern fixes included. ## Introduction diff --git a/modules/module8/content.md b/modules/module8/content.md index e9e37d4..de47cb7 100644 --- a/modules/module8/content.md +++ b/modules/module8/content.md @@ -1,6 +1,6 @@ # Module 8: Domain-Specific Applications - Comprehensive Guide -> Environment note: The examples and integrations in this module assume Docker images with CUDA 12.9.1 (Ubuntu 22.04) and ROCm 7.0 (Ubuntu 24.04) are used for consistent library/tool availability. Includes Thrust and MIOpen support. +> Environment note: The examples and integrations in this module assume Docker images with CUDA 13.0.1 (Ubuntu 24.04) and ROCm 7.0.1 (Ubuntu 24.04) are used for consistent library/tool availability. Includes Thrust and MIOpen support. ## Introduction diff --git a/modules/module9/content.md b/modules/module9/content.md index 6179894..9573dfa 100644 --- a/modules/module9/content.md +++ b/modules/module9/content.md @@ -1,6 +1,6 @@ # Professional GPU Programming: Enterprise-Grade Implementation Guide -> Environment note: Professional examples and deployment references assume development using Docker images with CUDA 12.9.1 (Ubuntu 22.04) and ROCm 7.0 (Ubuntu 24.04) for parity between environments. Enhanced build system supports professional-grade optimizations. +> Environment note: Professional examples and deployment references assume development using Docker images with CUDA 13.0.1 (Ubuntu 24.04) and ROCm 7.0.1 (Ubuntu 24.04) for parity between environments. Enhanced build system supports professional-grade optimizations. This comprehensive guide covers all aspects of deploying, maintaining, and scaling GPU applications in production environments, from architecture design to operational excellence. 
@@ -317,7 +317,7 @@ public: ```dockerfile # Multi-stage build for production GPU applications -FROM nvidia/cuda:12.9.1-devel-ubuntu22.04 AS builder +FROM nvidia/cuda:13.0.1-cudnn-devel-ubuntu24.04 AS builder # Install dependencies RUN apt-get update && apt-get install -y \ @@ -334,7 +334,7 @@ RUN mkdir build && cd build && \ make -j$(nproc) # Production runtime image -FROM nvidia/cuda:12.9.1-runtime-ubuntu22.04 +FROM nvidia/cuda:13.0.1-runtime-ubuntu24.04 # Install runtime dependencies only RUN apt-get update && apt-get install -y \ diff --git a/modules/module9/examples/Makefile b/modules/module9/examples/Makefile index 48f397b..6c280a5 100644 --- a/modules/module9/examples/Makefile +++ b/modules/module9/examples/Makefile @@ -358,7 +358,7 @@ package_production: production build_containers: production @echo "Building production containers..." @if command -v docker > /dev/null 2>&1; then \ - echo "FROM nvidia/cuda:11.8-runtime-ubuntu20.04" > $(DEPLOY_DIR)/Dockerfile.cuda; \ + echo "FROM nvidia/cuda:13.0.1-runtime-ubuntu24.04" > $(DEPLOY_DIR)/Dockerfile.cuda; \ echo "COPY build/ /app/" >> $(DEPLOY_DIR)/Dockerfile.cuda; \ echo "WORKDIR /app" >> $(DEPLOY_DIR)/Dockerfile.cuda; \ echo "EXPOSE 8080" >> $(DEPLOY_DIR)/Dockerfile.cuda; \ From 493b7b1e280fa6ec1f7eff7d1625b0571512314e Mon Sep 17 00:00:00 2001 From: Stephen Shao Date: Mon, 22 Sep 2025 11:56:54 -0400 Subject: [PATCH 2/9] Updated the Dockerfile --- docker/cuda/Dockerfile | 13 ++----------- 1 file changed, 2 insertions(+), 11 deletions(-) diff --git a/docker/cuda/Dockerfile b/docker/cuda/Dockerfile index d9202d7..d624894 100644 --- a/docker/cuda/Dockerfile +++ b/docker/cuda/Dockerfile @@ -25,10 +25,6 @@ RUN apt-get update && apt-get install -y \ nano \ htop \ tree \ - # Minimal Python for basic scripting (not data science) - python3 \ - python3-pip \ - python3-dev \ # Additional utilities pkg-config \ software-properties-common \ @@ -46,10 +42,6 @@ RUN apt-get update && \ (apt-get install -y cuda-tools-13-0 || apt-get install -y cuda-tools || true) && \ rm -rf /var/lib/apt/lists/* -# Install minimal Python packages for basic development (no heavy data science libs) -RUN pip3 install --no-cache-dir \ - numpy \ - matplotlib # Set up CUDA environment variables ENV PATH=/usr/local/cuda/bin:${PATH} @@ -152,9 +144,8 @@ RUN chmod +x /workspace/test-gpu.sh # Install CUDA samples for learning and reference RUN cd /workspace && \ - git clone https://github.com/NVIDIA/cuda-samples.git && \ - cd cuda-samples && \ - git checkout v13.0 + git clone --depth 1 --branch v13.0 https://github.com/NVIDIA/cuda-samples.git && \ + cd cuda-samples # Default command CMD ["/bin/bash"] From 5b3e406b0d496fda87d524f35c6e455bcec623ca Mon Sep 17 00:00:00 2001 From: Stephen Shao Date: Mon, 22 Sep 2025 12:10:35 -0400 Subject: [PATCH 3/9] Updated the Makefile of each module to support CUDA arch detection --- modules/module1/examples/Makefile | 11 +++++++++-- modules/module2/examples/Makefile | 11 +++++++++-- modules/module3/examples/Makefile | 11 +++++++++-- modules/module4/examples/Makefile | 13 ++++++++++--- modules/module5/examples/Makefile | 13 ++++++++++--- modules/module6/examples/Makefile | 11 +++++++++-- modules/module7/examples/Makefile | 11 +++++++++-- modules/module8/examples/Makefile | 11 +++++++++-- modules/module9/examples/Makefile | 11 +++++++++-- 9 files changed, 83 insertions(+), 20 deletions(-) diff --git a/modules/module1/examples/Makefile b/modules/module1/examples/Makefile index fe99f79..2b41dcc 100644 --- a/modules/module1/examples/Makefile +++ 
b/modules/module1/examples/Makefile @@ -25,9 +25,16 @@ BUILD_HIP = 0 GPU_VENDOR = NONE endif +# CUDA architecture detection (prefer actual GPU via nvidia-smi; fallback sm_90) +CUDA_ARCH ?= $(shell nvidia-smi --query-gpu=compute_cap --format=csv,noheader,nounits 2>/dev/null | head -1 | awk -F. '/^[0-9]+\.[0-9]+$/ {printf "sm_%d%d", $$1, $$2}') +ifeq ($(strip $(CUDA_ARCH)),) + CUDA_ARCH := sm_90 +endif +CUDA_ARCH_FLAG := -arch=$(CUDA_ARCH) + # Compiler flags -CUDA_FLAGS = -std=c++17 -O2 -arch=sm_70 -CUDA_DEBUG_FLAGS = -std=c++17 -g -G -arch=sm_70 +CUDA_FLAGS = -std=c++17 -O2 $(CUDA_ARCH_FLAG) +CUDA_DEBUG_FLAGS = -std=c++17 -g -G $(CUDA_ARCH_FLAG) HIP_FLAGS = -std=c++17 -O2 HIP_DEBUG_FLAGS = -std=c++17 -g diff --git a/modules/module2/examples/Makefile b/modules/module2/examples/Makefile index e563026..8e67072 100644 --- a/modules/module2/examples/Makefile +++ b/modules/module2/examples/Makefile @@ -25,9 +25,16 @@ BUILD_HIP = 0 GPU_VENDOR = NONE endif +# CUDA architecture detection (prefer actual GPU via nvidia-smi; fallback sm_90) +CUDA_ARCH ?= $(shell nvidia-smi --query-gpu=compute_cap --format=csv,noheader,nounits 2>/dev/null | head -1 | awk -F. '/^[0-9]+\.[0-9]+$/ {printf "sm_%d%d", $$1, $$2}') +ifeq ($(strip $(CUDA_ARCH)),) + CUDA_ARCH := sm_90 +endif +CUDA_ARCH_FLAG := -arch=$(CUDA_ARCH) + # Compiler flags -CUDA_FLAGS = -std=c++17 -O2 -arch=sm_75 -lcudart -lcuda -CUDA_DEBUG_FLAGS = -std=c++17 -g -G -arch=sm_75 -lcudart -lcuda +CUDA_FLAGS = -std=c++17 -O2 $(CUDA_ARCH_FLAG) -lcudart -lcuda +CUDA_DEBUG_FLAGS = -std=c++17 -g -G $(CUDA_ARCH_FLAG) -lcudart -lcuda HIP_FLAGS = -std=c++17 -O2 HIP_DEBUG_FLAGS = -std=c++17 -g diff --git a/modules/module3/examples/Makefile b/modules/module3/examples/Makefile index 43893ad..f1dcace 100644 --- a/modules/module3/examples/Makefile +++ b/modules/module3/examples/Makefile @@ -25,9 +25,16 @@ BUILD_HIP = 0 GPU_VENDOR = NONE endif +# CUDA architecture detection (prefer actual GPU via nvidia-smi; fallback sm_90) +CUDA_ARCH ?= $(shell nvidia-smi --query-gpu=compute_cap --format=csv,noheader,nounits 2>/dev/null | head -1 | awk -F. '/^[0-9]+\.[0-9]+$/ {printf "sm_%d%d", $$1, $$2}') +ifeq ($(strip $(CUDA_ARCH)),) + CUDA_ARCH := sm_90 +endif +CUDA_ARCH_FLAG := -arch=$(CUDA_ARCH) + # Compiler flags -CUDA_FLAGS = -std=c++17 -O2 -arch=sm_75 -lcudart -lcuda -CUDA_DEBUG_FLAGS = -std=c++17 -g -G -arch=sm_75 -lcudart -lcuda +CUDA_FLAGS = -std=c++17 -O2 $(CUDA_ARCH_FLAG) -lcudart -lcuda +CUDA_DEBUG_FLAGS = -std=c++17 -g -G $(CUDA_ARCH_FLAG) -lcudart -lcuda HIP_FLAGS = -std=c++17 -O2 HIP_DEBUG_FLAGS = -std=c++17 -g diff --git a/modules/module4/examples/Makefile b/modules/module4/examples/Makefile index 53e3863..9417bdd 100644 --- a/modules/module4/examples/Makefile +++ b/modules/module4/examples/Makefile @@ -25,10 +25,17 @@ BUILD_HIP = 0 GPU_VENDOR = NONE endif +# CUDA architecture detection (prefer actual GPU via nvidia-smi; fallback sm_90) +CUDA_ARCH ?= $(shell nvidia-smi --query-gpu=compute_cap --format=csv,noheader,nounits 2>/dev/null | head -1 | awk -F. 
'/^[0-9]+\.[0-9]+$/ {printf "sm_%d%d", $$1, $$2}') +ifeq ($(strip $(CUDA_ARCH)),) + CUDA_ARCH := sm_90 +endif +CUDA_ARCH_FLAG := -arch=$(CUDA_ARCH) + # Compiler flags -CUDA_FLAGS = -std=c++17 -O2 -arch=sm_75 -rdc=true -lcudart -lcuda -CUDA_DP_FLAGS = -std=c++17 -O2 -arch=sm_75 -rdc=true -lcudadevrt -lcudart -lcuda -CUDA_DEBUG_FLAGS = -std=c++17 -g -G -arch=sm_75 -rdc=true -lcudart -lcuda +CUDA_FLAGS = -std=c++17 -O2 $(CUDA_ARCH_FLAG) -rdc=true -lcudart -lcuda +CUDA_DP_FLAGS = -std=c++17 -O2 $(CUDA_ARCH_FLAG) -rdc=true -lcudadevrt -lcudart -lcuda +CUDA_DEBUG_FLAGS = -std=c++17 -g -G $(CUDA_ARCH_FLAG) -rdc=true -lcudart -lcuda HIP_FLAGS = -std=c++17 -O2 -fopenmp HIP_DEBUG_FLAGS = -std=c++17 -g -fopenmp diff --git a/modules/module5/examples/Makefile b/modules/module5/examples/Makefile index 6f83c3e..3e12ec0 100644 --- a/modules/module5/examples/Makefile +++ b/modules/module5/examples/Makefile @@ -25,9 +25,16 @@ BUILD_HIP = 0 GPU_VENDOR = NONE endif +# CUDA architecture detection (prefer actual GPU via nvidia-smi; fallback sm_90) +CUDA_ARCH ?= $(shell nvidia-smi --query-gpu=compute_cap --format=csv,noheader,nounits 2>/dev/null | head -1 | awk -F. '/^[0-9]+\.[0-9]+$/ {printf "sm_%d%d", $$1, $$2}') +ifeq ($(strip $(CUDA_ARCH)),) + CUDA_ARCH := sm_90 +endif +CUDA_ARCH_FLAG := -arch=$(CUDA_ARCH) + # Compiler flags -CUDA_FLAGS = -std=c++17 -O3 -arch=sm_90 -lineinfo -CUDA_DEBUG_FLAGS = -std=c++17 -g -G -arch=sm_90 +CUDA_FLAGS = -std=c++17 -O3 $(CUDA_ARCH_FLAG) -lineinfo +CUDA_DEBUG_FLAGS = -std=c++17 -g -G $(CUDA_ARCH_FLAG) HIP_FLAGS = -std=c++17 -O3 HIP_DEBUG_FLAGS = -std=c++17 -g @@ -241,7 +248,7 @@ validate: all @echo "Validating optimization implementations..." @echo "This will run examples with different optimization levels to verify correctness" @$(MAKE) clean - @$(MAKE) CUDA_FLAGS="-std=c++17 -O0 -arch=sm_70" HIP_FLAGS="-std=c++17 -O0" all + @$(MAKE) CUDA_FLAGS="-std=c++17 -O0 $(CUDA_ARCH_FLAG)" HIP_FLAGS="-std=c++17 -O0" all @echo "Running unoptimized versions for correctness baseline..." @for target in $(ALL_TARGETS); do \ if [ -f $$target ]; then \ diff --git a/modules/module6/examples/Makefile b/modules/module6/examples/Makefile index c91787f..48199d4 100644 --- a/modules/module6/examples/Makefile +++ b/modules/module6/examples/Makefile @@ -24,9 +24,16 @@ BUILD_HIP = 0 GPU_VENDOR = NONE endif +# CUDA architecture detection (prefer actual GPU via nvidia-smi; fallback sm_90) +CUDA_ARCH ?= $(shell nvidia-smi --query-gpu=compute_cap --format=csv,noheader,nounits 2>/dev/null | head -1 | awk -F. '/^[0-9]+\.[0-9]+$/ {printf "sm_%d%d", $$1, $$2}') +ifeq ($(strip $(CUDA_ARCH)),) + CUDA_ARCH := sm_90 +endif +CUDA_ARCH_FLAG := -arch=$(CUDA_ARCH) + # Compiler flags -CUDA_FLAGS = -std=c++17 -O3 -arch=sm_90 -lineinfo -CUDA_DEBUG_FLAGS = -std=c++17 -g -G -arch=sm_90 +CUDA_FLAGS = -std=c++17 -O3 $(CUDA_ARCH_FLAG) -lineinfo +CUDA_DEBUG_FLAGS = -std=c++17 -g -G $(CUDA_ARCH_FLAG) HIP_FLAGS = -std=c++17 -O3 HIP_DEBUG_FLAGS = -std=c++17 -g diff --git a/modules/module7/examples/Makefile b/modules/module7/examples/Makefile index 48c803c..6604cd3 100644 --- a/modules/module7/examples/Makefile +++ b/modules/module7/examples/Makefile @@ -24,9 +24,16 @@ BUILD_HIP = 0 GPU_VENDOR = NONE endif +# CUDA architecture detection (prefer actual GPU via nvidia-smi; fallback sm_90) +CUDA_ARCH ?= $(shell nvidia-smi --query-gpu=compute_cap --format=csv,noheader,nounits 2>/dev/null | head -1 | awk -F. 
'/^[0-9]+\.[0-9]+$/ {printf "sm_%d%d", $$1, $$2}') +ifeq ($(strip $(CUDA_ARCH)),) + CUDA_ARCH := sm_90 +endif +CUDA_ARCH_FLAG := -arch=$(CUDA_ARCH) + # Compiler flags -CUDA_FLAGS = -std=c++17 -O3 -arch=sm_75 -lineinfo -CUDA_DEBUG_FLAGS = -std=c++17 -g -G -arch=sm_75 +CUDA_FLAGS = -std=c++17 -O3 $(CUDA_ARCH_FLAG) -lineinfo +CUDA_DEBUG_FLAGS = -std=c++17 -g -G $(CUDA_ARCH_FLAG) HIP_FLAGS = -std=c++17 -O3 HIP_DEBUG_FLAGS = -std=c++17 -g diff --git a/modules/module8/examples/Makefile b/modules/module8/examples/Makefile index 12244a2..7c0a73e 100644 --- a/modules/module8/examples/Makefile +++ b/modules/module8/examples/Makefile @@ -24,9 +24,16 @@ BUILD_HIP = 0 GPU_VENDOR = NONE endif +# CUDA architecture detection (prefer actual GPU via nvidia-smi; fallback sm_90) +CUDA_ARCH ?= $(shell nvidia-smi --query-gpu=compute_cap --format=csv,noheader,nounits 2>/dev/null | head -1 | awk -F. '/^[0-9]+\.[0-9]+$/ {printf "sm_%d%d", $$1, $$2}') +ifeq ($(strip $(CUDA_ARCH)),) + CUDA_ARCH := sm_90 +endif +CUDA_ARCH_FLAG := -arch=$(CUDA_ARCH) + # Compiler flags for professional-quality applications -CUDA_FLAGS = -std=c++17 -O3 -arch=sm_70 -lineinfo --use_fast_math -CUDA_DEBUG_FLAGS = -std=c++17 -g -G -arch=sm_70 +CUDA_FLAGS = -std=c++17 -O3 $(CUDA_ARCH_FLAG) -lineinfo --use_fast_math +CUDA_DEBUG_FLAGS = -std=c++17 -g -G $(CUDA_ARCH_FLAG) HIP_FLAGS = -std=c++17 -O3 -ffast-math HIP_DEBUG_FLAGS = -std=c++17 -g diff --git a/modules/module9/examples/Makefile b/modules/module9/examples/Makefile index 6c280a5..1daf2f4 100644 --- a/modules/module9/examples/Makefile +++ b/modules/module9/examples/Makefile @@ -25,9 +25,16 @@ BUILD_HIP = 0 GPU_VENDOR = NONE endif +# CUDA architecture detection (prefer actual GPU via nvidia-smi; fallback sm_90) +CUDA_ARCH ?= $(shell nvidia-smi --query-gpu=compute_cap --format=csv,noheader,nounits 2>/dev/null | head -1 | awk -F. '/^[0-9]+\.[0-9]+$/ {printf "sm_%d%d", $$1, $$2}') +ifeq ($(strip $(CUDA_ARCH)),) + CUDA_ARCH := sm_90 +endif +CUDA_ARCH_FLAG := -arch=$(CUDA_ARCH) + # Compiler flags for professional applications -CUDA_FLAGS = -std=c++17 -O3 -arch=sm_70 -lineinfo --use_fast_math -DPRODUCTION_BUILD -CUDA_DEBUG_FLAGS = -std=c++17 -g -G -arch=sm_70 -DDEBUG_BUILD +CUDA_FLAGS = -std=c++17 -O3 $(CUDA_ARCH_FLAG) -lineinfo --use_fast_math -DPRODUCTION_BUILD +CUDA_DEBUG_FLAGS = -std=c++17 -g -G $(CUDA_ARCH_FLAG) -DDEBUG_BUILD HIP_FLAGS = -std=c++17 -O3 -ffast-math -DPRODUCTION_BUILD HIP_DEBUG_FLAGS = -std=c++17 -g -DDEBUG_BUILD CXX_FLAGS = -std=c++17 -O3 -DPRODUCTION_BUILD From d0860cb1cafe11dbcdc567ef999e5c5a7173e277 Mon Sep 17 00:00:00 2001 From: Stephen Shao Date: Mon, 22 Sep 2025 12:31:21 -0400 Subject: [PATCH 4/9] Fix the container build error: remove pip installs from the CUDA Dockerfile to avoid PEP 668. Fix the nvcc compile error: stop forcing hardcoded -arch flags (sm_70, sm_75, etc.) and detect the actual GPU arch at build time. Clean up source code broken by CUDA 13 deprecations: replace uses of memoryClockRate and memoryBusWidth from cudaDeviceProp with cudaDeviceGetAttribute. 
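
The replacement pattern, sketched (this is the shape used throughout the
hunks below; `dev` is the CUDA device ordinal):

    int memClockKHz = 0, busWidthBits = 0;
    cudaDeviceGetAttribute(&memClockKHz, cudaDevAttrMemoryClockRate, dev);
    cudaDeviceGetAttribute(&busWidthBits, cudaDevAttrGlobalMemoryBusWidth, dev);
    // 2 transfers per clock (DDR); kHz -> GHz; bits -> bytes => GB/s
    double peakGBs = 2.0 * (memClockKHz / 1e6) * (busWidthBits / 8.0);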
--- .../module1/examples/03_matrix_multiplication_cuda.cu | 9 +++++++-- modules/module1/examples/04_device_info_cuda.cu | 10 +++++++--- .../examples/05_performance_comparison_cuda.cu | 5 ++++- modules/module1/examples/Makefile | 2 +- modules/module2/examples/02_memory_coalescing_cuda.cu | 9 ++++++--- .../examples/05_memory_bandwidth_optimization_cuda.cu | 9 ++++++--- modules/module2/examples/Makefile | 2 +- modules/module3/examples/Makefile | 2 +- modules/module4/examples/01_cuda_streams_basics.cu | 7 +++++-- modules/module4/examples/02_multi_gpu_programming.cu | 11 +++++++---- modules/module4/examples/Makefile | 2 +- modules/module5/examples/Makefile | 2 +- modules/module6/examples/Makefile | 2 +- modules/module7/examples/Makefile | 2 +- modules/module8/examples/Makefile | 2 +- modules/module9/examples/Makefile | 2 +- 16 files changed, 51 insertions(+), 27 deletions(-) diff --git a/modules/module1/examples/03_matrix_multiplication_cuda.cu b/modules/module1/examples/03_matrix_multiplication_cuda.cu index 332d8cc..bb85507 100644 --- a/modules/module1/examples/03_matrix_multiplication_cuda.cu +++ b/modules/module1/examples/03_matrix_multiplication_cuda.cu @@ -210,8 +210,13 @@ int main() { cudaDeviceProp props; CUDA_CHECK(cudaGetDeviceProperties(&props, 0)); printf("Running on: %s\n", props.name); - printf("Peak memory bandwidth: %.1f GB/s\n", - 2.0 * props.memoryClockRate * (props.memoryBusWidth / 8) / 1.0e6); + // CUDA 13: use cudaDeviceGetAttribute for memory metrics + int memClockKHz = 0; + int busWidthBits = 0; + cudaDeviceGetAttribute(&memClockKHz, cudaDevAttrMemoryClockRate, 0); + cudaDeviceGetAttribute(&busWidthBits, cudaDevAttrGlobalMemoryBusWidth, 0); + double peakGBs = 2.0 * (memClockKHz / 1e6) * (busWidthBits / 8.0); + printf("Peak memory bandwidth: %.1f GB/s\n", peakGBs); // Cleanup free(h_A); free(h_B); free(h_C); free(h_C_ref); diff --git a/modules/module1/examples/04_device_info_cuda.cu b/modules/module1/examples/04_device_info_cuda.cu index 88e4110..06411f0 100644 --- a/modules/module1/examples/04_device_info_cuda.cu +++ b/modules/module1/examples/04_device_info_cuda.cu @@ -27,10 +27,14 @@ int main() { props.maxThreadsDim[0], props.maxThreadsDim[1], props.maxThreadsDim[2]); printf(" Max Grid Size: (%d, %d, %d)\n", props.maxGridSize[0], props.maxGridSize[1], props.maxGridSize[2]); - printf(" Memory Clock Rate: %.2f GHz\n", props.memoryClockRate / 1e6); - printf(" Memory Bus Width: %d bits\n", props.memoryBusWidth); + int memClockKHz = 0; + int busWidthBits = 0; + cudaDeviceGetAttribute(&memClockKHz, cudaDevAttrMemoryClockRate, i); + cudaDeviceGetAttribute(&busWidthBits, cudaDevAttrGlobalMemoryBusWidth, i); + printf(" Memory Clock Rate: %.2f GHz\n", memClockKHz / 1e6); + printf(" Memory Bus Width: %d bits\n", busWidthBits); printf(" Peak Memory Bandwidth: %.2f GB/s\n", - 2.0 * props.memoryClockRate * (props.memoryBusWidth / 8) / 1.0e6); + 2.0 * (memClockKHz / 1e6) * (busWidthBits / 8.0)); printf(" Multiprocessor Count: %d\n", props.multiProcessorCount); printf(" L2 Cache Size: %d bytes\n", props.l2CacheSize); printf(" Max Threads per Multiprocessor: %d\n", props.maxThreadsPerMultiProcessor); diff --git a/modules/module1/examples/05_performance_comparison_cuda.cu b/modules/module1/examples/05_performance_comparison_cuda.cu index c84efff..a6b7cf2 100644 --- a/modules/module1/examples/05_performance_comparison_cuda.cu +++ b/modules/module1/examples/05_performance_comparison_cuda.cu @@ -160,8 +160,11 @@ int main() { cudaDeviceProp props; CUDA_CHECK(cudaGetDeviceProperties(&props, 
0)); printf("GPU: %s\n", props.name); + int memClockKHz = 0, busWidthBits = 0; + cudaDeviceGetAttribute(&memClockKHz, cudaDevAttrMemoryClockRate, 0); + cudaDeviceGetAttribute(&busWidthBits, cudaDevAttrGlobalMemoryBusWidth, 0); printf("Peak Memory Bandwidth: %.2f GB/s\n", - 2.0 * props.memoryClockRate * (props.memoryBusWidth / 8) / 1.0e6); + 2.0 * (memClockKHz / 1e6) * (busWidthBits / 8.0)); return 0; } \ No newline at end of file diff --git a/modules/module1/examples/Makefile b/modules/module1/examples/Makefile index 2b41dcc..14b8f74 100644 --- a/modules/module1/examples/Makefile +++ b/modules/module1/examples/Makefile @@ -26,7 +26,7 @@ GPU_VENDOR = NONE endif # CUDA architecture detection (prefer actual GPU via nvidia-smi; fallback sm_90) -CUDA_ARCH ?= $(shell nvidia-smi --query-gpu=compute_cap --format=csv,noheader,nounits 2>/dev/null | head -1 | awk -F. '/^[0-9]+\.[0-9]+$/ {printf "sm_%d%d", $$1, $$2}') +CUDA_ARCH ?= $(shell nvidia-smi --query-gpu=compute_cap --format=csv,noheader,nounits 2>/dev/null | head -1 | tr -dc '0-9' | sed -e 's/^/sm_/') ifeq ($(strip $(CUDA_ARCH)),) CUDA_ARCH := sm_90 endif diff --git a/modules/module2/examples/02_memory_coalescing_cuda.cu b/modules/module2/examples/02_memory_coalescing_cuda.cu index 687c27f..f47140b 100644 --- a/modules/module2/examples/02_memory_coalescing_cuda.cu +++ b/modules/module2/examples/02_memory_coalescing_cuda.cu @@ -354,10 +354,13 @@ int main() { cudaDeviceProp props; CUDA_CHECK(cudaGetDeviceProperties(&props, 0)); printf("Running on: %s\n", props.name); + int memClockKHz = 0, busWidthBits = 0; + cudaDeviceGetAttribute(&memClockKHz, cudaDevAttrMemoryClockRate, 0); + cudaDeviceGetAttribute(&busWidthBits, cudaDevAttrGlobalMemoryBusWidth, 0); printf("Global memory bandwidth: %.1f GB/s\n", - 2.0 * props.memoryClockRate * (props.memoryBusWidth / 8) / 1.0e6); - printf("Memory bus width: %d bits\n", props.memoryBusWidth); - printf("Memory clock rate: %d MHz\n", props.memoryClockRate / 1000); + 2.0 * (memClockKHz / 1e6) * (busWidthBits / 8.0)); + printf("Memory bus width: %d bits\n", busWidthBits); + printf("Memory clock rate: %d MHz\n", memClockKHz / 1000); printf("Warp size: %d threads\n", props.warpSize); // Run benchmarks diff --git a/modules/module2/examples/05_memory_bandwidth_optimization_cuda.cu b/modules/module2/examples/05_memory_bandwidth_optimization_cuda.cu index 831aea0..6fb1ac7 100644 --- a/modules/module2/examples/05_memory_bandwidth_optimization_cuda.cu +++ b/modules/module2/examples/05_memory_bandwidth_optimization_cuda.cu @@ -417,10 +417,13 @@ int main() { CUDA_CHECK(cudaGetDeviceProperties(&props, 0)); printf("Running on: %s\n", props.name); - double theoretical_bandwidth = 2.0 * props.memoryClockRate * (props.memoryBusWidth / 8) / 1.0e6; + int memClockKHz = 0, busWidthBits = 0; + cudaDeviceGetAttribute(&memClockKHz, cudaDevAttrMemoryClockRate, 0); + cudaDeviceGetAttribute(&busWidthBits, cudaDevAttrGlobalMemoryBusWidth, 0); + double theoretical_bandwidth = 2.0 * (memClockKHz / 1e6) * (busWidthBits / 8.0); printf("Theoretical peak bandwidth: %.1f GB/s\n", theoretical_bandwidth); - printf("Memory clock rate: %d MHz\n", props.memoryClockRate / 1000); - printf("Memory bus width: %d bits\n", props.memoryBusWidth); + printf("Memory clock rate: %d MHz\n", memClockKHz / 1000); + printf("Memory bus width: %d bits\n", busWidthBits); printf("L2 cache size: %d MB\n", props.l2CacheSize / (1024 * 1024)); // Run benchmarks diff --git a/modules/module2/examples/Makefile b/modules/module2/examples/Makefile index 8e67072..6ca3118 100644 
--- a/modules/module2/examples/Makefile +++ b/modules/module2/examples/Makefile @@ -26,7 +26,7 @@ GPU_VENDOR = NONE endif # CUDA architecture detection (prefer actual GPU via nvidia-smi; fallback sm_90) -CUDA_ARCH ?= $(shell nvidia-smi --query-gpu=compute_cap --format=csv,noheader,nounits 2>/dev/null | head -1 | awk -F. '/^[0-9]+\.[0-9]+$/ {printf "sm_%d%d", $$1, $$2}') +CUDA_ARCH ?= $(shell nvidia-smi --query-gpu=compute_cap --format=csv,noheader,nounits 2>/dev/null | head -1 | tr -dc '0-9' | sed -e 's/^/sm_/') ifeq ($(strip $(CUDA_ARCH)),) CUDA_ARCH := sm_90 endif diff --git a/modules/module3/examples/Makefile b/modules/module3/examples/Makefile index f1dcace..5c5b92a 100644 --- a/modules/module3/examples/Makefile +++ b/modules/module3/examples/Makefile @@ -26,7 +26,7 @@ GPU_VENDOR = NONE endif # CUDA architecture detection (prefer actual GPU via nvidia-smi; fallback sm_90) -CUDA_ARCH ?= $(shell nvidia-smi --query-gpu=compute_cap --format=csv,noheader,nounits 2>/dev/null | head -1 | awk -F. '/^[0-9]+\.[0-9]+$/ {printf "sm_%d%d", $$1, $$2}') +CUDA_ARCH ?= $(shell nvidia-smi --query-gpu=compute_cap --format=csv,noheader,nounits 2>/dev/null | head -1 | tr -dc '0-9' | sed -e 's/^/sm_/') ifeq ($(strip $(CUDA_ARCH)),) CUDA_ARCH := sm_90 endif diff --git a/modules/module4/examples/01_cuda_streams_basics.cu b/modules/module4/examples/01_cuda_streams_basics.cu index d0a00a7..e400d44 100644 --- a/modules/module4/examples/01_cuda_streams_basics.cu +++ b/modules/module4/examples/01_cuda_streams_basics.cu @@ -330,8 +330,11 @@ int main() { printf("Compute Capability: %d.%d\n", prop.major, prop.minor); printf("Concurrent Kernels: %s\n", prop.concurrentKernels ? "Yes" : "No"); printf("Async Engine Count: %d\n", prop.asyncEngineCount); - printf("Memory Bus Width: %d bits\n", prop.memoryBusWidth); - printf("Memory Clock Rate: %.2f MHz\n\n", prop.memoryClockRate / 1000.0f); + int memClockKHz = 0, busWidthBits = 0; + cudaDeviceGetAttribute(&memClockKHz, cudaDevAttrMemoryClockRate, 0); + cudaDeviceGetAttribute(&busWidthBits, cudaDevAttrGlobalMemoryBusWidth, 0); + printf("Memory Bus Width: %d bits\n", busWidthBits); + printf("Memory Clock Rate: %.2f MHz\n\n", memClockKHz / 1000.0f); // Allocate host memory const int totalSize = TOTAL_SIZE; diff --git a/modules/module4/examples/02_multi_gpu_programming.cu b/modules/module4/examples/02_multi_gpu_programming.cu index bae59b5..9d1f418 100644 --- a/modules/module4/examples/02_multi_gpu_programming.cu +++ b/modules/module4/examples/02_multi_gpu_programming.cu @@ -83,10 +83,13 @@ void printDeviceInfo() { printf("Device %d: %s\n", i, prop.name); printf(" Compute Capability: %d.%d\n", prop.major, prop.minor); printf(" Global Memory: %.2f GB\n", prop.totalGlobalMem / (1024.0*1024.0*1024.0)); - printf(" Memory Clock Rate: %.2f MHz\n", prop.memoryClockRate / 1000.0); - printf(" Memory Bus Width: %d bits\n", prop.memoryBusWidth); - printf(" Peak Memory Bandwidth: %.2f GB/s\n", - 2.0 * prop.memoryClockRate * (prop.memoryBusWidth / 8) / 1.0e6); + int memClockKHz = 0, busWidthBits = 0; + cudaDeviceGetAttribute(&memClockKHz, cudaDevAttrMemoryClockRate, i); + cudaDeviceGetAttribute(&busWidthBits, cudaDevAttrGlobalMemoryBusWidth, i); + printf(" Memory Clock Rate: %.2f MHz\n", memClockKHz / 1000.0); + printf(" Memory Bus Width: %d bits\n", busWidthBits); + printf(" Peak Memory Bandwidth: %.2f GB/s\n", + 2.0 * (memClockKHz / 1e6) * (busWidthBits / 8.0)); printf(" Multiprocessors: %d\n", prop.multiProcessorCount); printf(" Concurrent Kernels: %s\n", prop.concurrentKernels ? 
"Yes" : "No"); diff --git a/modules/module4/examples/Makefile b/modules/module4/examples/Makefile index 9417bdd..ed8ba7d 100644 --- a/modules/module4/examples/Makefile +++ b/modules/module4/examples/Makefile @@ -26,7 +26,7 @@ GPU_VENDOR = NONE endif # CUDA architecture detection (prefer actual GPU via nvidia-smi; fallback sm_90) -CUDA_ARCH ?= $(shell nvidia-smi --query-gpu=compute_cap --format=csv,noheader,nounits 2>/dev/null | head -1 | awk -F. '/^[0-9]+\.[0-9]+$/ {printf "sm_%d%d", $$1, $$2}') +CUDA_ARCH ?= $(shell nvidia-smi --query-gpu=compute_cap --format=csv,noheader,nounits 2>/dev/null | head -1 | tr -dc '0-9' | sed -e 's/^/sm_/') ifeq ($(strip $(CUDA_ARCH)),) CUDA_ARCH := sm_90 endif diff --git a/modules/module5/examples/Makefile b/modules/module5/examples/Makefile index 3e12ec0..4877b24 100644 --- a/modules/module5/examples/Makefile +++ b/modules/module5/examples/Makefile @@ -26,7 +26,7 @@ GPU_VENDOR = NONE endif # CUDA architecture detection (prefer actual GPU via nvidia-smi; fallback sm_90) -CUDA_ARCH ?= $(shell nvidia-smi --query-gpu=compute_cap --format=csv,noheader,nounits 2>/dev/null | head -1 | awk -F. '/^[0-9]+\.[0-9]+$/ {printf "sm_%d%d", $$1, $$2}') +CUDA_ARCH ?= $(shell nvidia-smi --query-gpu=compute_cap --format=csv,noheader,nounits 2>/dev/null | head -1 | tr -dc '0-9' | sed -e 's/^/sm_/') ifeq ($(strip $(CUDA_ARCH)),) CUDA_ARCH := sm_90 endif diff --git a/modules/module6/examples/Makefile b/modules/module6/examples/Makefile index 48199d4..5392d17 100644 --- a/modules/module6/examples/Makefile +++ b/modules/module6/examples/Makefile @@ -25,7 +25,7 @@ GPU_VENDOR = NONE endif # CUDA architecture detection (prefer actual GPU via nvidia-smi; fallback sm_90) -CUDA_ARCH ?= $(shell nvidia-smi --query-gpu=compute_cap --format=csv,noheader,nounits 2>/dev/null | head -1 | awk -F. '/^[0-9]+\.[0-9]+$/ {printf "sm_%d%d", $$1, $$2}') +CUDA_ARCH ?= $(shell nvidia-smi --query-gpu=compute_cap --format=csv,noheader,nounits 2>/dev/null | head -1 | tr -dc '0-9' | sed -e 's/^/sm_/') ifeq ($(strip $(CUDA_ARCH)),) CUDA_ARCH := sm_90 endif diff --git a/modules/module7/examples/Makefile b/modules/module7/examples/Makefile index 6604cd3..b74ef72 100644 --- a/modules/module7/examples/Makefile +++ b/modules/module7/examples/Makefile @@ -25,7 +25,7 @@ GPU_VENDOR = NONE endif # CUDA architecture detection (prefer actual GPU via nvidia-smi; fallback sm_90) -CUDA_ARCH ?= $(shell nvidia-smi --query-gpu=compute_cap --format=csv,noheader,nounits 2>/dev/null | head -1 | awk -F. '/^[0-9]+\.[0-9]+$/ {printf "sm_%d%d", $$1, $$2}') +CUDA_ARCH ?= $(shell nvidia-smi --query-gpu=compute_cap --format=csv,noheader,nounits 2>/dev/null | head -1 | tr -dc '0-9' | sed -e 's/^/sm_/') ifeq ($(strip $(CUDA_ARCH)),) CUDA_ARCH := sm_90 endif diff --git a/modules/module8/examples/Makefile b/modules/module8/examples/Makefile index 7c0a73e..c11f3a5 100644 --- a/modules/module8/examples/Makefile +++ b/modules/module8/examples/Makefile @@ -25,7 +25,7 @@ GPU_VENDOR = NONE endif # CUDA architecture detection (prefer actual GPU via nvidia-smi; fallback sm_90) -CUDA_ARCH ?= $(shell nvidia-smi --query-gpu=compute_cap --format=csv,noheader,nounits 2>/dev/null | head -1 | awk -F. 
'/^[0-9]+\.[0-9]+$/ {printf "sm_%d%d", $$1, $$2}') +CUDA_ARCH ?= $(shell nvidia-smi --query-gpu=compute_cap --format=csv,noheader,nounits 2>/dev/null | head -1 | tr -dc '0-9' | sed -e 's/^/sm_/') ifeq ($(strip $(CUDA_ARCH)),) CUDA_ARCH := sm_90 endif diff --git a/modules/module9/examples/Makefile b/modules/module9/examples/Makefile index 1daf2f4..630cffd 100644 --- a/modules/module9/examples/Makefile +++ b/modules/module9/examples/Makefile @@ -26,7 +26,7 @@ GPU_VENDOR = NONE endif # CUDA architecture detection (prefer actual GPU via nvidia-smi; fallback sm_90) -CUDA_ARCH ?= $(shell nvidia-smi --query-gpu=compute_cap --format=csv,noheader,nounits 2>/dev/null | head -1 | awk -F. '/^[0-9]+\.[0-9]+$/ {printf "sm_%d%d", $$1, $$2}') +CUDA_ARCH ?= $(shell nvidia-smi --query-gpu=compute_cap --format=csv,noheader,nounits 2>/dev/null | head -1 | tr -dc '0-9' | sed -e 's/^/sm_/') ifeq ($(strip $(CUDA_ARCH)),) CUDA_ARCH := sm_90 endif From 5ba90235f82c9df6883ac223223b418d6f40dfd5 Mon Sep 17 00:00:00 2001 From: Stephen Shao Date: Mon, 22 Sep 2025 14:01:03 -0400 Subject: [PATCH 5/9] Fix CUDA arch-flags build error, remove Python pip from Dockerfile, and sweep repo for hardcoded sm_XX --- modules/module2/content.md | 1 - modules/module2/examples/03_texture_memory_cuda.cu | 1 - modules/module5/examples/01_gpu_profiling_cuda.cu | 14 ++++++++++---- 3 files changed, 10 insertions(+), 6 deletions(-) diff --git a/modules/module2/content.md b/modules/module2/content.md index ce86349..1629262 100644 --- a/modules/module2/content.md +++ b/modules/module2/content.md @@ -302,7 +302,6 @@ Texture memory provides: ```cuda #include -#include __global__ void textureKernel(cudaTextureObject_t texObj, float *output, int width, int height) { diff --git a/modules/module2/examples/03_texture_memory_cuda.cu b/modules/module2/examples/03_texture_memory_cuda.cu index 889def4..ca211bc 100644 --- a/modules/module2/examples/03_texture_memory_cuda.cu +++ b/modules/module2/examples/03_texture_memory_cuda.cu @@ -1,5 +1,4 @@ #include -#include #include #include #include diff --git a/modules/module5/examples/01_gpu_profiling_cuda.cu b/modules/module5/examples/01_gpu_profiling_cuda.cu index 9f7472a..2648cef 100644 --- a/modules/module5/examples/01_gpu_profiling_cuda.cu +++ b/modules/module5/examples/01_gpu_profiling_cuda.cu @@ -272,10 +272,13 @@ void analyzeDeviceProperties() { printf("Cores per MP: %d (estimated)\n", _ConvertSMVer2Cores(prop.major, prop.minor)); printf("Total Cores: %d (estimated)\n", prop.multiProcessorCount * _ConvertSMVer2Cores(prop.major, prop.minor)); printf("GPU Clock Rate: %.2f GHz\n", prop.clockRate / 1e6); - printf("Memory Clock Rate: %.2f GHz\n", prop.memoryClockRate / 1e6); - printf("Memory Bus Width: %d bits\n", prop.memoryBusWidth); + int memClockKHz = 0, busWidthBits = 0; + cudaDeviceGetAttribute(&memClockKHz, cudaDevAttrMemoryClockRate, device); + cudaDeviceGetAttribute(&busWidthBits, cudaDevAttrGlobalMemoryBusWidth, device); + printf("Memory Clock Rate: %.2f GHz\n", memClockKHz / 1e6); + printf("Memory Bus Width: %d bits\n", busWidthBits); printf("Peak Memory Bandwidth: %.1f GB/s\n", - 2.0 * prop.memoryClockRate * (prop.memoryBusWidth / 8) / 1.0e6); + 2.0 * (memClockKHz / 1e6) * (busWidthBits / 8.0)); printf("Global Memory: %.1f GB\n", prop.totalGlobalMem / (1024.0 * 1024.0 * 1024.0)); printf("Shared Memory per Block: %zu KB\n", prop.sharedMemPerBlock / 1024); printf("Max Threads per Block: %d\n", prop.maxThreadsPerBlock); @@ -409,7
+412,10 @@ void calculateTheoreticalLimits() { printf("=== Theoretical Performance Limits ===\n"); // Memory bandwidth calculation - double memoryBandwidth = 2.0 * prop.memoryClockRate * (prop.memoryBusWidth / 8) / 1.0e6; // GB/s + int memClockKHz = 0, busWidthBits = 0; + cudaDeviceGetAttribute(&memClockKHz, cudaDevAttrMemoryClockRate, device); + cudaDeviceGetAttribute(&busWidthBits, cudaDevAttrGlobalMemoryBusWidth, device); + double memoryBandwidth = 2.0 * (memClockKHz / 1e6) * (busWidthBits / 8.0); // GB/s printf("Peak Memory Bandwidth: %.1f GB/s\n", memoryBandwidth); // Compute throughput estimation From 2b9ac28551100343b3df3994c23373aef86bbb6e Mon Sep 17 00:00:00 2001 From: Stephen Shao Date: Mon, 22 Sep 2025 14:23:19 -0400 Subject: [PATCH 6/9] Update outdated cudaMemAdvise/cudaMemPrefetchAsync calls to the new cudaMemLocation API --- modules/module2/content.md | 14 +++++--- .../examples/04_unified_memory_cuda.cu | 27 +++++++++------ modules/module4/README.md | 15 +++++---- modules/module4/content.md | 13 +++++--- modules/module4/examples/03_unified_memory.cu | 33 ++++++++++++------- 5 files changed, 66 insertions(+), 36 deletions(-) diff --git a/modules/module2/content.md b/modules/module2/content.md index 1629262..37b0b78 100644 --- a/modules/module2/content.md +++ b/modules/module2/content.md @@ -470,11 +470,14 @@ __global__ void processData(float *data, size_t n) { ```cuda void optimizedUnifiedMemory(float *data, size_t n, int device) { // Prefetch data to GPU before kernel launch - cudaMemPrefetchAsync(data, n * sizeof(float), device); + cudaMemLocation loc{}; + loc.type = cudaMemLocationTypeDevice; + loc.id = device; + cudaMemPrefetchAsync(data, n * sizeof(float), loc, /*flags=*/0, /*stream=*/0); // Set memory usage hints - cudaMemAdvise(data, n * sizeof(float), cudaMemAdviseSetReadMostly, device); - cudaMemAdvise(data, n * sizeof(float), cudaMemAdviseSetPreferredLocation, device); + cudaMemAdvise(data, n * sizeof(float), cudaMemAdviseSetReadMostly, loc); + cudaMemAdvise(data, n * sizeof(float), cudaMemAdviseSetPreferredLocation, loc); // Launch kernel int blockSize = 256; @@ -482,7 +485,10 @@ void optimizedUnifiedMemory(float *data, size_t n, int device) { processData<<>>(data, n); // Prefetch back to CPU if needed - cudaMemPrefetchAsync(data, n * sizeof(float), cudaCpuDeviceId); + cudaMemLocation hostLoc{}; + hostLoc.type = cudaMemLocationTypeHost; + hostLoc.id = 0; + cudaMemPrefetchAsync(data, n * sizeof(float), hostLoc, /*flags=*/0, /*stream=*/0); } ``` diff --git a/modules/module2/examples/04_unified_memory_cuda.cu b/modules/module2/examples/04_unified_memory_cuda.cu index db8caab..96feb8d 100644 --- a/modules/module2/examples/04_unified_memory_cuda.cu +++ b/modules/module2/examples/04_unified_memory_cuda.cu @@ -222,16 +222,23 @@ void demonstrateMemoryMigration() { } int device = 0; + // CUDA 13 updated UM APIs use cudaMemLocation instead of raw int device IDs + cudaMemLocation locDevice{}; + locDevice.type = cudaMemLocationTypeDevice; + locDevice.id = device; + cudaMemLocation locHost{}; + locHost.type = cudaMemLocationTypeHost; + locHost.id = 0; // host id is unused printf("Testing memory migration with prefetching and hints...\n"); - // Set memory advice - CUDA_CHECK(cudaMemAdvise(data, bytes, cudaMemAdviseSetReadMostly, device)); - CUDA_CHECK(cudaMemAdvise(data, bytes, cudaMemAdviseSetPreferredLocation, device)); + // Set memory advice (location-aware in CUDA 13) + CUDA_CHECK(cudaMemAdvise(data, bytes, cudaMemAdviseSetReadMostly, locDevice)); + CUDA_CHECK(cudaMemAdvise(data, bytes, cudaMemAdviseSetPreferredLocation, locDevice)); - // Prefetch to GPU + // Prefetch to GPU (location-aware, zero flags, explicit stream) printf("Prefetching to GPU...\n"); - CUDA_CHECK(cudaMemPrefetchAsync(data, bytes, device)); + CUDA_CHECK(cudaMemPrefetchAsync(data, bytes, locDevice, /*flags=*/0, /*stream=*/0)); CUDA_CHECK(cudaDeviceSynchronize()); int blockSize = 256; @@ -250,9 +257,9 @@ void demonstrateMemoryMigration() { float gpu_time; CUDA_CHECK(cudaEventElapsedTime(&gpu_time, start, stop)); - // Prefetch to CPU + // Prefetch to CPU (location-aware, zero flags, explicit stream) printf("Prefetching to CPU...\n"); - CUDA_CHECK(cudaMemPrefetchAsync(data, bytes, cudaCpuDeviceId)); + CUDA_CHECK(cudaMemPrefetchAsync(data, bytes, locHost, /*flags=*/0, /*stream=*/0)); CUDA_CHECK(cudaDeviceSynchronize()); // CPU computation (data already on CPU) @@ -274,9 +281,9 @@ void demonstrateMemoryMigration() { // Test without prefetching for comparison printf("\nTesting without prefetching...\n"); - // Reset memory advice - CUDA_CHECK(cudaMemAdvise(data, bytes, cudaMemAdviseUnsetReadMostly, device)); - CUDA_CHECK(cudaMemAdvise(data, bytes, cudaMemAdviseUnsetPreferredLocation, device)); + // Reset memory advice (location-aware in CUDA 13) + CUDA_CHECK(cudaMemAdvise(data, bytes, cudaMemAdviseUnsetReadMostly, locDevice)); + CUDA_CHECK(cudaMemAdvise(data, bytes, cudaMemAdviseUnsetPreferredLocation, locDevice)); CUDA_CHECK(cudaEventRecord(start)); computeIntensive<<>>(data, n); diff --git a/modules/module4/README.md b/modules/module4/README.md index 5818cd2..8b5efb0 100644 --- a/modules/module4/README.md +++ b/modules/module4/README.md @@ -262,12 +262,15 @@ for (int chunk = 0; chunk < numChunks; chunk++) { **Memory Hints:** ```cuda -// Guide data placement -cudaMemAdvise(data, size, cudaMemAdviseSetReadMostly, deviceId); -cudaMemAdvise(data, size, cudaMemAdviseSetPreferredLocation, deviceId); - -// Prefetch data proactively -cudaMemPrefetchAsync(data, size, deviceId); +// Guide data placement (CUDA 13+) +cudaMemLocation loc{}; +loc.type = cudaMemLocationTypeDevice; +loc.id = deviceId; +cudaMemAdvise(data, size, cudaMemAdviseSetReadMostly, loc); +cudaMemAdvise(data, size, cudaMemAdviseSetPreferredLocation, loc); + +// Prefetch data proactively (CUDA 13+) +cudaMemPrefetchAsync(data, size, loc, /*flags=*/0, /*stream=*/0); ``` ### 4. P2P Communication Patterns diff --git a/modules/module4/content.md b/modules/module4/content.md index 07e2758..9c3b5d9 100644 --- a/modules/module4/content.md +++ b/modules/module4/content.md @@ -135,11 +135,14 @@ kernel<<>>(data, n); #### Memory Access Patterns ```cuda -// Prefetch data to GPU -cudaMemPrefetchAsync(data, size, deviceId); - -// Provide memory access hints -cudaMemAdvise(data, size, cudaMemAdviseSetReadMostly, deviceId); +// Prefetch data to GPU (CUDA 13+) +cudaMemLocation loc{}; +loc.type = cudaMemLocationTypeDevice; +loc.id = deviceId; +cudaMemPrefetchAsync(data, size, loc, /*flags=*/0, /*stream=*/0); + +// Provide memory access hints (CUDA 13+) +cudaMemAdvise(data, size, cudaMemAdviseSetReadMostly, loc); ``` #### Unified Memory Best Practices diff --git a/modules/module4/examples/03_unified_memory.cu b/modules/module4/examples/03_unified_memory.cu index a095c73..24b834b 100644 --- a/modules/module4/examples/03_unified_memory.cu +++ b/modules/module4/examples/03_unified_memory.cu @@ -204,14 +204,21 @@ double optimizedUnifiedMemory(int n) { auto start = std::chrono::high_resolution_clock::now(); - // Provide memory hints + // Provide memory hints (CUDA 13: use cudaMemLocation) int deviceId = 0; - CUDA_CHECK(cudaMemAdvise(a, bytes, cudaMemAdviseSetReadMostly, deviceId)); - CUDA_CHECK(cudaMemAdvise(b, bytes, cudaMemAdviseSetReadMostly, deviceId)); + cudaMemLocation locDevice{}; + locDevice.type = cudaMemLocationTypeDevice; + locDevice.id = deviceId; + cudaMemLocation locHost{}; + locHost.type = cudaMemLocationTypeHost; + locHost.id = 0; + + CUDA_CHECK(cudaMemAdvise(a, bytes, cudaMemAdviseSetReadMostly, locDevice)); + CUDA_CHECK(cudaMemAdvise(b, bytes, cudaMemAdviseSetReadMostly, locDevice)); - // Prefetch data to GPU - CUDA_CHECK(cudaMemPrefetchAsync(a, bytes, deviceId)); - CUDA_CHECK(cudaMemPrefetchAsync(b, bytes, deviceId)); + // Prefetch data to GPU (location-aware, zero flags, explicit stream) + CUDA_CHECK(cudaMemPrefetchAsync(a, bytes, locDevice, /*flags=*/0, /*stream=*/0)); + CUDA_CHECK(cudaMemPrefetchAsync(b, bytes, locDevice, /*flags=*/0, /*stream=*/0)); // Launch kernel dim3 block(BLOCK_SIZE); @@ -219,8 +226,8 @@ double optimizedUnifiedMemory(int n) { vectorAdd<<>>(a, b, c, n); CUDA_CHECK(cudaGetLastError()); - // Prefetch result back to CPU - CUDA_CHECK(cudaMemPrefetchAsync(c, bytes, cudaCpuDeviceId)); + // Prefetch result back to CPU (location-aware, zero flags, explicit stream) + CUDA_CHECK(cudaMemPrefetchAsync(c, bytes, locHost, /*flags=*/0, /*stream=*/0)); CUDA_CHECK(cudaDeviceSynchronize()); // Access result on CPU @@ -376,9 +383,13 @@ void multiGPUUnifiedMemory(int n) { int offset = gpu * chunkSize; int currentChunkSize = (gpu == deviceCount - 1) ? n - offset : chunkSize; - // Prefetch chunk to current GPU - CUDA_CHECK(cudaMemPrefetchAsync(data + offset, - currentChunkSize * sizeof(float), gpu)); + // Prefetch chunk to current GPU (location-aware, zero flags, explicit stream) + cudaMemLocation locGpu{}; + locGpu.type = cudaMemLocationTypeDevice; + locGpu.id = gpu; + CUDA_CHECK(cudaMemPrefetchAsync(data + offset, + currentChunkSize * sizeof(float), + locGpu, /*flags=*/0, /*stream=*/0)); // Process on this GPU dim3 block(BLOCK_SIZE); From 8301996262dc9b1d5c76f3ae8b517d71f1a44fc4 Mon Sep 17 00:00:00 2001 From: Stephen Shao Date: Mon, 22 Sep 2025 14:33:06 -0400 Subject: [PATCH 7/9] Fix deprecated clockRate property usage --- modules/module4/examples/02_multi_gpu_programming.cu | 4 +++- modules/module5/examples/01_gpu_profiling_cuda.cu | 8 ++++++-- 2 files changed, 9 insertions(+), 3 deletions(-) diff --git a/modules/module4/examples/02_multi_gpu_programming.cu b/modules/module4/examples/02_multi_gpu_programming.cu index 9d1f418..06d4cbe 100644 --- a/modules/module4/examples/02_multi_gpu_programming.cu +++ b/modules/module4/examples/02_multi_gpu_programming.cu @@ -209,7 +209,9 @@ double runMultiGPUWeighted(float *h_data, int size, int numGPUs) { CUDA_CHECK(cudaGetDeviceProperties(&prop, gpu)); // Simple weight based on SM count and clock rate - weights[gpu] = prop.multiProcessorCount * (prop.clockRate / 1000.0); + int gpuClockKHz = 0; + cudaDeviceGetAttribute(&gpuClockKHz, cudaDevAttrClockRate, gpu); + weights[gpu] = prop.multiProcessorCount * (gpuClockKHz / 1000.0); totalWeight += weights[gpu]; } diff --git a/modules/module5/examples/01_gpu_profiling_cuda.cu b/modules/module5/examples/01_gpu_profiling_cuda.cu index 2648cef..9b1b4cc 100644 --- a/modules/module5/examples/01_gpu_profiling_cuda.cu +++ b/modules/module5/examples/01_gpu_profiling_cuda.cu @@ -271,7 +271,9 @@ void analyzeDeviceProperties() { printf("Multiprocessors: %d\n", prop.multiProcessorCount); printf("Cores per MP: %d (estimated)\n", _ConvertSMVer2Cores(prop.major, prop.minor)); printf("Total Cores: %d (estimated)\n", prop.multiProcessorCount * _ConvertSMVer2Cores(prop.major, prop.minor)); - printf("GPU Clock Rate: %.2f GHz\n", prop.clockRate / 1e6); + int gpuClockKHz = 0; + cudaDeviceGetAttribute(&gpuClockKHz, cudaDevAttrClockRate, device); + printf("GPU Clock Rate: %.2f GHz\n", gpuClockKHz / 1e6); int memClockKHz = 0, busWidthBits = 0; cudaDeviceGetAttribute(&memClockKHz, cudaDevAttrMemoryClockRate, device); cudaDeviceGetAttribute(&busWidthBits, cudaDevAttrGlobalMemoryBusWidth, device); @@ -421,7 +423,9 @@ void calculateTheoreticalLimits() { // Compute throughput estimation int coresPerSM = _ConvertSMVer2Cores(prop.major, prop.minor); int totalCores = prop.multiProcessorCount * coresPerSM; - double computeThroughput = totalCores * prop.clockRate / 1e6; // GFLOPS (single precision) + int gpuClockKHz = 0; + cudaDeviceGetAttribute(&gpuClockKHz, cudaDevAttrClockRate, device); + double computeThroughput = totalCores * gpuClockKHz / 1e6; // GFLOPS (single precision) printf("Estimated Peak Compute (SP): %.1f GFLOPS\n", computeThroughput); // Roofline model breakpoint From 9a04824a34e37b107e010b9bdef2d9e20ab7f3ee Mon Sep 17 00:00:00 2001 From: Stephen Shao Date: Mon, 22 Sep 2025 14:37:40 -0400 Subject: [PATCH 8/9] Add missing #include headers to fix compilation errors --- modules/module6/examples/04_reduction_cuda.cu | 1 + modules/module6/examples/05_prefix_sum_cuda.cu | 1 + modules/module7/examples/01_sorting_cuda.cu | 1 + 3 files changed, 3 insertions(+) diff --git
a/modules/module6/examples/04_reduction_cuda.cu b/modules/module6/examples/04_reduction_cuda.cu index 7a06883..51c0422 100644 --- a/modules/module6/examples/04_reduction_cuda.cu +++ b/modules/module6/examples/04_reduction_cuda.cu @@ -29,6 +29,7 @@ #include #include #include +#include namespace cg = cooperative_groups; diff --git a/modules/module6/examples/05_prefix_sum_cuda.cu b/modules/module6/examples/05_prefix_sum_cuda.cu index a74f4f5..dd1c58a 100644 --- a/modules/module6/examples/05_prefix_sum_cuda.cu +++ b/modules/module6/examples/05_prefix_sum_cuda.cu @@ -30,6 +30,7 @@ #include #include #include +#include namespace cg = cooperative_groups; diff --git a/modules/module7/examples/01_sorting_cuda.cu b/modules/module7/examples/01_sorting_cuda.cu index 6556ab4..6b902b2 100644 --- a/modules/module7/examples/01_sorting_cuda.cu +++ b/modules/module7/examples/01_sorting_cuda.cu @@ -30,6 +30,7 @@ #include #include #include +#include namespace cg = cooperative_groups; From 2146721a2156a00071feef96f009b5f3a3c61c91 Mon Sep 17 00:00:00 2001 From: Stephen Shao Date: Mon, 22 Sep 2025 14:53:13 -0400 Subject: [PATCH 9/9] Update the CUDA and ROCm features doc --- CUDA_ROCM_FEATURES.md | 11 ++++++++--- 1 file changed, 8 insertions(+), 3 deletions(-) diff --git a/CUDA_ROCM_FEATURES.md b/CUDA_ROCM_FEATURES.md index 6812740..f05a4e1 100644 --- a/CUDA_ROCM_FEATURES.md +++ b/CUDA_ROCM_FEATURES.md @@ -68,17 +68,21 @@ Highlights from AMD's official docs (see links): - ROCm 7.0.1 is the latest as of 2025‑09‑17; consult the release history for point updates. - HIP as the primary programming model, with CUDA‑like APIs and HIP‑Clang toolchain. - Windows support targets HIP SDK for development; full ROCm stack targets Linux. -- Libraries are provided under the ROCm organization (rocBLAS/hipBLAS, rocFFT/hipFFT, rocSPARSE/hipSPARSE, rocRAND/hipRAND, rocSOLVER/hipSOLVER, rocPRIM/hipCUB, rocThrust, etc.). +- ROCm Libraries monorepo: multiple core math and support libraries are consolidated in the ROCm Libraries monorepo for unified CI/build. Projects included (as of rocm‑7.0.1): composablekernel, hipblas, hipblas-common, hipblaslt, hipcub, hipfft, hiprand, hipsolver, hipsparse, hipsparselt, miopen, rocblas, rocfft, rocprim, rocrand, rocsolver, rocsparse, rocthrust. Shared components: rocroller, tensile, mxdatagenerator. Most of these are marked "Completed" in the monorepo migration status and the monorepo is the source of truth; see its README for current status. - Tooling and system components: ROCr runtime, ROCm SMI, rocprof/rocprofiler, rocgdb/rocm‑debug‑agent. +Nomenclature: project names in the monorepo are standardized to match released package names (for example, hipblas/hipfft/rocsparse instead of mixed casing). + Architectures (illustrative, not exhaustive): - CDNA3 (MI300 family): AI training and HPC; unified memory on APUs (MI300A), large HBM configs (MI300X). - RDNA3 (Radeon 7000 series): workstation/gaming; AV1 encode/decode; hardware ray tracing. -Common libraries (see ROCm Libraries reference): +Common libraries (see ROCm Libraries reference and monorepo): -- rocBLAS / hipBLAS; rocFFT / hipFFT; rocRAND / hipRAND; rocSPARSE / hipSPARSE; rocSOLVER / hipSOLVER; rocPRIM/hipCUB; rocThrust. +- BLAS/solver/sparse: rocBLAS / hipBLAS, hipBLASLt, rocSOLVER / hipSOLVER, rocSPARSE / hipSPARSE, hipSPARSElt. +- FFT/random/core: rocFFT / hipFFT, rocRAND / hipRAND, rocPRIM / hipCUB, rocThrust. +- Kernel building blocks: composablekernel; shared dependencies like Tensile and rocRoller (used by rocBLAS/hipBLASLt).
- ML/DL: MIOpen; framework integrations via the ROCm for AI guide. Authoritative references: @@ -86,6 +90,7 @@ Authoritative references: - ROCm Docs index (What is ROCm?, install, reference) - ROCm Release History (7.0.1, 7.0.0, …) - ROCm libraries reference; tools/compilers/runtimes reference +- ROCm Libraries monorepo (status, structure, releases): https://github.com/ROCm/rocm-libraries ---
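
A worked example of the revised CUDA_ARCH detection in the Makefiles above: nvidia-smi prints compute_cap values such as 8.9 or 12.0; `tr -dc '0-9'` strips the dot (giving 89 or 120) and `sed -e 's/^/sm_/'` prepends the prefix, yielding sm_89 or sm_120. If nvidia-smi is absent or reports no GPU, the pipeline produces an empty string and the `ifeq` fallback selects sm_90.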
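
The device-property refactors in the patches above all follow one pattern: fields no longer available in `cudaDeviceProp` are queried with `cudaDeviceGetAttribute` instead. Below is a minimal, self-contained sketch of that pattern; it uses only the attribute enums already present in the diffs, and error checking is omitted for brevity.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int device = 0;
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, device);

    // Clock and bus-width values are queried as attributes rather than read
    // from cudaDeviceProp; clock rates are reported in kHz, bus width in bits.
    int memClockKHz = 0, busWidthBits = 0, coreClockKHz = 0;
    cudaDeviceGetAttribute(&memClockKHz, cudaDevAttrMemoryClockRate, device);
    cudaDeviceGetAttribute(&busWidthBits, cudaDevAttrGlobalMemoryBusWidth, device);
    cudaDeviceGetAttribute(&coreClockKHz, cudaDevAttrClockRate, device);

    // DDR transfers twice per clock: GB/s = 2 * GHz * (bits / 8).
    double peakGBs = 2.0 * (memClockKHz / 1e6) * (busWidthBits / 8.0);

    printf("%s: %.2f GHz core, %.2f MHz memory, %d-bit bus, %.1f GB/s peak\n",
           prop.name, coreClockKHz / 1e6, memClockKHz / 1000.0,
           busWidthBits, peakGBs);
    return 0;
}
```

The kHz reporting is why the diffs divide by 1e3 for MHz and 1e6 for GHz, and the factor of 2 in the bandwidth estimate models DDR's two transfers per clock.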
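
The unified-memory migration in patch 6 can be exercised end to end with a small managed-memory program. This is a sketch under one stated assumption: that the CUDA 13 prototypes are `cudaMemAdvise(ptr, count, advice, location)` and `cudaMemPrefetchAsync(ptr, count, location, flags, stream)` with `flags` currently required to be zero; verify the exact signatures against the CUDA Runtime API reference for your toolkit.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void scale(float *data, size_t n) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main() {
    const size_t n = 1 << 20;
    const size_t bytes = n * sizeof(float);
    float *data = nullptr;
    cudaMallocManaged(&data, bytes);
    for (size_t i = 0; i < n; ++i) data[i] = 1.0f;

    // Describe the destination as a location struct, not a raw device id.
    cudaMemLocation gpuLoc{};
    gpuLoc.type = cudaMemLocationTypeDevice;
    gpuLoc.id = 0;
    cudaMemAdvise(data, bytes, cudaMemAdviseSetPreferredLocation, gpuLoc);
    cudaMemPrefetchAsync(data, bytes, gpuLoc, /*flags=*/0, /*stream=*/0);

    scale<<<(n + 255) / 256, 256>>>(data, n);

    // Migrate the result back to host memory before touching it on the CPU.
    cudaMemLocation hostLoc{};
    hostLoc.type = cudaMemLocationTypeHost;
    hostLoc.id = 0;  // id is ignored for host locations
    cudaMemPrefetchAsync(data, bytes, hostLoc, /*flags=*/0, /*stream=*/0);
    cudaDeviceSynchronize();

    printf("data[0] = %.1f (expect 2.0)\n", data[0]);
    cudaFree(data);
    return 0;
}
```

Prefetching back to the host before the CPU read avoids demand page faults on first access, which is the same motivation as the prefetch-to-CPU calls in 04_unified_memory_cuda.cu.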