- covers/: Book cover images
- blurbs/: Promotional blurbs
- infographics/: Marketing visuals
- source_code/: Code samples
- manuscript/: Drafts and format.txt for TOC
- marketing/: Ads and press releases
- additional_resources/: Extras

View the live site at burstbookspublishing.github.io/advanced-cuda-programming

- Advanced CUDA Programming: High Performance Computing with GPUs
- Ampere/Hopper Architecture Details
- Streaming Multiprocessor Internals
- Memory Controller Design
- Warp Scheduling Mechanisms
- Branch Prediction and Divergence
- Instruction-Level Parallelism
- Cache Hierarchy Implementation
- Memory Coalescing Mechanisms
- L2 Cache Optimization Strategies
- Custom Memory Allocators
- Memory Pool Implementation
- Zero-Copy Memory Strategies
- Page Migration Engines
- Prefetch Optimization
- System-Wide Memory Access
- Bank Conflict Resolution
- Shared Memory Access Patterns
- Cache Line Utilization
- Multi-Stream Scheduling
- Stream Priority Control
- Event-Based Synchronization
- Overlapping Data Transfers
- Pinned Memory Usage
- Asynchronous Prefetching
- Inter-Stream Dependencies
- CPU-GPU Synchronization
- Multi-GPU Coordination
- Dynamic Block Sizing
- Occupancy-Driven Design
- Resource Utilization
- Warp Primitives
- Cooperative Groups
- Shuffle Instructions
- Recursive Kernel Launch
- Parent-Child Synchronization
- Resource Management
- Assembly Analysis
- PTX Optimization
- Register Pressure Management
- Memory Access Patterns
- Texture Memory Usage
- Constant Memory Optimization
- Arithmetic Intensity
- Thread Coarsening
- Loop Unrolling Strategies
- Lock-Free Data Structures
- Concurrent Hash Tables
- Priority Queues
- Slab Allocators
- Memory Pools
- Defragmentation Techniques
- Compressed Formats
- Dynamic Updates
- Efficient Traversal
- Custom BLAS Operations
- Sparse Matrix Operations
- Eigenvalue Solvers
- FFT Implementation
- Differential Equations
- Monte Carlo Methods
- Parallel Sort Implementation
- Graph Algorithms
- Numerical Optimization
- Custom GEMM Implementation
- Convolution Optimization
- Tensor Core Programming
- Mixed Precision Training
- Memory-Efficient Training
- Multi-GPU Training
- Quantization Techniques
- Kernel Fusion
- Batch Processing
- P2P Communication
- NVLink Optimization
- Remote Memory Access
- Load Balancing Strategies
- Memory Distribution
- Synchronization Methods
- MPI Integration
- Multi-Node Systems
- Cluster Programming
- Nsight Compute Usage
- Roofline Analysis
- Memory Access Patterns
- CUDA-GDB Techniques
- Memory Checking Tools
- Race Detection
- Metrics Collection
- Visual Profiler
- Custom Profiling
- Kernel Scheduling
- Memory Management
- Pipeline Optimization
- Deterministic Execution
- Deadline Scheduling
- Resource Management
- Data Pipeline Design
- Continuous Processing
- Buffer Management
- Ray Tracing Cores
- New Memory Technologies
- Next-Gen Architecture
- Graph Programming
- Quantum Simulation
- Domain-Specific Languages
- Case Studies
- Performance Analysis
- Best Practices
