A collection of practical OpenCL examples demonstrating GPU computing on Windows using Visual Studio 2019 and CMake.
- OS: Windows 10/11
- Compiler: Visual Studio 2019 Build Tools (MSVC 19.29+)
- CMake: 3.15+ (bundled with VS Build Tools)
- OpenCL Runtimes: At least one of:
  - NVIDIA CUDA Toolkit 12.x (for NVIDIA GPUs)
  - Intel Graphics Driver (for Intel integrated GPUs)
  - Intel oneAPI Base Toolkit (for CPU execution)

Test hardware used for the results quoted below:
- GPU: NVIDIA RTX A2000 Laptop GPU
- iGPU: Intel UHD Graphics (11th Gen)
- CPU: Intel Core i7-11850H @ 2.50GHz
opencl-windows-cpp/
├── examples/
│   ├── 000_device_enumeration/         # List all OpenCL platforms and devices
│   ├── 001_hello_opencl/               # Simple "Hello World" kernel
│   ├── 002_vector_addition/            # CPU vs GPU performance comparison
│   ├── 003_breakeven_analysis/         # Find OpenCL performance crossover points
│   ├── 004_async_multidevice/          # Concurrent execution across devices
│   ├── 005_parallelization_comparison/ # Serial vs std::execution vs OpenMP vs OpenCL
│   ├── 006_matrix_multiply/            # Compute-bound matrix multiplication
│   └── 007_image_convolution/          # Separable image convolution
├── setup/
│   ├── check_opencl_installed.bat      # Verify OpenCL installation
│   └── detect_opencl_hardware.bat      # Hardware detection script
└── README.md
cd setup
check_opencl_installed.bat
Expected output: Lists installed OpenCL runtimes and detects your hardware.
cd examples\001_hello_opencl
build.bat
Each example includes:
- `main.cpp` - Host code
- `*.cl` - OpenCL kernel(s)
- `CMakeLists.txt` - CMake configuration
- `build.bat` - Build and run script
- `README.md` - Example documentation
Purpose: Detect all OpenCL platforms and devices on your system.
Key Concepts: Platform querying, device properties
cd examples\000_device_enumeration
build.bat
Output: Lists all GPUs and CPUs with their capabilities.
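The core of device enumeration is just two query loops: one over platforms, one over each platform's devices. A minimal sketch of that idea (not the repo's exact `main.cpp`), assuming an OpenCL SDK providing `CL/cl.h` and linking against `OpenCL.lib`:

```cpp
// Minimal platform/device enumeration sketch. Link with OpenCL.lib.
#define CL_TARGET_OPENCL_VERSION 120
#include <CL/cl.h>
#include <cstdio>
#include <vector>

int main() {
    cl_uint numPlatforms = 0;
    clGetPlatformIDs(0, nullptr, &numPlatforms);
    std::vector<cl_platform_id> platforms(numPlatforms);
    clGetPlatformIDs(numPlatforms, platforms.data(), nullptr);

    for (cl_platform_id p : platforms) {
        char name[256] = {};
        clGetPlatformInfo(p, CL_PLATFORM_NAME, sizeof(name), name, nullptr);
        std::printf("Platform: %s\n", name);

        cl_uint numDevices = 0;
        if (clGetDeviceIDs(p, CL_DEVICE_TYPE_ALL, 0, nullptr, &numDevices) != CL_SUCCESS)
            continue;                       // platform with no usable devices
        std::vector<cl_device_id> devices(numDevices);
        clGetDeviceIDs(p, CL_DEVICE_TYPE_ALL, numDevices, devices.data(), nullptr);

        for (cl_device_id d : devices) {
            char devName[256] = {};
            cl_uint cus = 0;
            clGetDeviceInfo(d, CL_DEVICE_NAME, sizeof(devName), devName, nullptr);
            clGetDeviceInfo(d, CL_DEVICE_MAX_COMPUTE_UNITS, sizeof(cus), &cus, nullptr);
            std::printf("  Device: %s (%u compute units)\n", devName, cus);
        }
    }
    return 0;
}
```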
Purpose: Simplest possible GPU kernel - write "Hello from GPU!" to a buffer.
Key Concepts: Context creation, kernel compilation, buffer management
cd examples\001_hello_opencl
build.bat
Expected Output:
Using device: NVIDIA RTX A2000 Laptop GPU
Kernel output: Hello from GPU!
Success!
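For orientation, the whole host-side workflow fits in one short file: pick a device, create a context and queue, build the kernel from source, run it, and read the buffer back. A condensed sketch of that flow (not the repo's exact source; error checks trimmed, and it assumes the first platform exposes a GPU):

```cpp
// Condensed hello-world host flow. Link with OpenCL.lib; error handling trimmed.
#define CL_TARGET_OPENCL_VERSION 120
#include <CL/cl.h>
#include <cstdio>

// Kernel source embedded for brevity; the repo loads it from a .cl file instead.
static const char* kSrc = R"CLC(
__constant char kMsg[] = "Hello from GPU!";
__kernel void hello(__global char* out) {
    size_t i = get_global_id(0);
    out[i] = kMsg[i];
}
)CLC";

int main() {
    cl_platform_id platform; cl_device_id device; cl_int err;
    clGetPlatformIDs(1, &platform, nullptr);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, nullptr);

    cl_context ctx = clCreateContext(nullptr, 1, &device, nullptr, nullptr, &err);
    cl_command_queue q = clCreateCommandQueue(ctx, device, 0, &err);

    cl_program prog = clCreateProgramWithSource(ctx, 1, &kSrc, nullptr, &err);
    clBuildProgram(prog, 1, &device, nullptr, nullptr, nullptr);
    cl_kernel kernel = clCreateKernel(prog, "hello", &err);

    char result[16] = {};                       // "Hello from GPU!" + '\0'
    cl_mem buf = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, sizeof(result), nullptr, &err);
    clSetKernelArg(kernel, 0, sizeof(buf), &buf);

    size_t global = sizeof(result);             // one work-item per character
    clEnqueueNDRangeKernel(q, kernel, 1, nullptr, &global, nullptr, 0, nullptr, nullptr);
    clEnqueueReadBuffer(q, buf, CL_TRUE, 0, sizeof(result), result, 0, nullptr, nullptr);

    std::printf("Kernel output: %s\n", result);

    clReleaseMemObject(buf); clReleaseKernel(kernel); clReleaseProgram(prog);
    clReleaseCommandQueue(q); clReleaseContext(ctx);
    return 0;
}
```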
Purpose: Compare CPU vs GPU performance for vector addition.
Key Concepts: Memory transfer overhead, parallel execution
cd examples\002_vector_addition
build.bat
Results (10M elements):
- Serial C++: 6.12ms (baseline)
- OpenCL GPU: 24.23ms (SLOWER - memory-bound operation)
Lesson: Simple operations don't benefit from GPUs due to transfer overhead.
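The kernel behind those numbers is essentially one add per element against three global memory accesses, plus two host-to-device transfers in and one back, so there is nothing for the GPU to amortize. A sketch of the kernel's shape, shown here as an embedded source string (the repo keeps kernels in separate `.cl` files, and its version may differ slightly):

```cpp
// Vector-add kernel sketch: one add per work-item vs. three global memory
// accesses, memory-bound even before PCIe transfer cost is counted.
static const char* kVecAddSrc = R"CLC(
__kernel void vec_add(__global const float* a,
                      __global const float* b,
                      __global float* c,
                      const int n) {
    int i = get_global_id(0);
    if (i < n) c[i] = a[i] + b[i];
}
)CLC";
```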
Purpose: Find the vector size where OpenCL becomes faster than serial C++.
Key Concepts: Performance profiling, scaling analysis
cd examples\003_breakeven_analysis
build.bat
Key Findings:
- NVIDIA RTX A2000: Faster at 64K elements
- Intel UHD Graphics: Faster at 256K elements
Lesson: GPUs require sufficient workload to amortize overhead.
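The measurement itself boils down to timing both paths at doubling sizes until the GPU wins. A simplified, self-contained sketch of that sweep (`gpuAdd` below is a hypothetical stand-in that just calls the serial path; in the real example it would be the OpenCL host code from 002):

```cpp
// Breakeven sweep sketch (not the repo's 003 source).
#include <chrono>
#include <cstdio>
#include <vector>

static void serialAdd(const std::vector<float>& a, const std::vector<float>& b,
                      std::vector<float>& c) {
    for (size_t i = 0; i < a.size(); ++i) c[i] = a[i] + b[i];
}

// Hypothetical placeholder: wire this to the OpenCL path from example 002.
static void gpuAdd(const std::vector<float>& a, const std::vector<float>& b,
                   std::vector<float>& c) { serialAdd(a, b, c); }

template <typename F>
static double msOf(F&& f) {
    auto t0 = std::chrono::steady_clock::now();
    f();
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

int main() {
    for (size_t n = 1 << 10; n <= (1 << 24); n <<= 1) {   // 1K .. 16M elements
        std::vector<float> a(n, 1.0f), b(n, 2.0f), c(n);
        double cpu = msOf([&] { serialAdd(a, b, c); });
        double gpu = msOf([&] { gpuAdd(a, b, c); });
        std::printf("n=%zu  serial %.3f ms  opencl %.3f ms %s\n",
                    n, cpu, gpu, gpu < cpu ? "<-- GPU wins" : "");
    }
    return 0;
}
```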
Purpose: Execute kernels simultaneously across multiple devices.
Key Concepts: Multiple contexts, asynchronous execution, cross-platform limitations
cd examples\004_async_multidevice
build.bat
Important: Timing analysis across platforms is unreliable - each vendor uses different time references.
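The structural point of this example is one context and one queue per device, created per platform, so NVIDIA and Intel devices never share a context. A trimmed sketch of that setup (kernels, buffers, and the actual per-device work are omitted; error handling trimmed):

```cpp
// One context + one queue per device, grouped by platform. Link with OpenCL.lib.
#define CL_TARGET_OPENCL_VERSION 120
#include <CL/cl.h>
#include <cstdio>
#include <vector>

struct DeviceSlot {
    cl_device_id     device;
    cl_context       context;
    cl_command_queue queue;
};

int main() {
    cl_uint numPlatforms = 0;
    clGetPlatformIDs(0, nullptr, &numPlatforms);
    std::vector<cl_platform_id> platforms(numPlatforms);
    clGetPlatformIDs(numPlatforms, platforms.data(), nullptr);

    std::vector<DeviceSlot> slots;
    for (cl_platform_id p : platforms) {
        cl_uint numDevices = 0;
        if (clGetDeviceIDs(p, CL_DEVICE_TYPE_ALL, 0, nullptr, &numDevices) != CL_SUCCESS)
            continue;
        std::vector<cl_device_id> devices(numDevices);
        clGetDeviceIDs(p, CL_DEVICE_TYPE_ALL, numDevices, devices.data(), nullptr);
        for (cl_device_id d : devices) {
            cl_int err = CL_SUCCESS;
            cl_context ctx = clCreateContext(nullptr, 1, &d, nullptr, nullptr, &err);
            cl_command_queue q = clCreateCommandQueue(ctx, d, 0, &err);
            slots.push_back({d, ctx, q});
        }
    }

    // Per device: build its own program/kernel/buffers against s.context and
    // enqueue that device's chunk of work on s.queue (all enqueues non-blocking).
    // Only after everything is queued do we wait, so the devices run concurrently:
    for (DeviceSlot& s : slots) clFinish(s.queue);

    for (DeviceSlot& s : slots) {
        clReleaseCommandQueue(s.queue);
        clReleaseContext(s.context);
    }
    std::printf("Created %zu device slots\n", slots.size());
    return 0;
}
```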
Purpose: Compare Serial C++, C++17 std::execution, OpenMP, and OpenCL for matrix-vector multiplication.
Key Concepts: Technology trade-offs, memory bandwidth limits
cd examples\005_parallelization_comparison
build.bat
Results (4096×4096 matrix):
- Serial: 16.4ms
- OpenMP: 1.5ms (11x speedup) - Winner
- OpenCL GPU: 23.4ms (SLOWER - still memory-bound)
Lesson: OpenMP dominates moderate-intensity operations.
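For comparison, the OpenMP path that wins here is a single `parallel for` over rows. A minimal sketch (not the repo's exact 005 source; build with MSVC's `/openmp` flag):

```cpp
// OpenMP matrix-vector multiply sketch: one parallel-for over rows.
#include <omp.h>
#include <cstdio>
#include <vector>

int main() {
    const int N = 4096;
    std::vector<float> A(static_cast<size_t>(N) * N, 1.0f), x(N, 1.0f), y(N, 0.0f);

    double t0 = omp_get_wtime();
    #pragma omp parallel for
    for (int i = 0; i < N; ++i) {
        float sum = 0.0f;
        for (int j = 0; j < N; ++j)
            sum += A[static_cast<size_t>(i) * N + j] * x[j];
        y[i] = sum;
    }
    double t1 = omp_get_wtime();

    std::printf("y[0]=%.1f  %.2f ms\n", y[0], (t1 - t0) * 1000.0);
    return 0;
}
```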
Purpose: Demonstrate compute-intensive operations where GPUs provide massive speedups.
Key Concepts: O(n³) complexity, tiling optimization, GFLOPS
cd examples\006_matrix_multiply
build.bat
Results (2048×2048 matrices):
- Serial: 16,306ms
- OpenMP: 5,242ms (3.1x)
- NVIDIA GPU (tiled): 105ms (155x speedup) - Winner
Lesson: Matrix multiply is the canonical GPU-suitable computation.
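The large speedup comes from the tiled variant: each work-group stages TILE x TILE blocks of A and B in local memory so every loaded element is reused TILE times. A sketch of that kernel shape, shown as an embedded source string (the repo's `.cl` may differ in details; assumes N is a multiple of TILE, which holds for 2048):

```cpp
// Tiled matrix-multiply kernel sketch.
// Launch with global size (N, N) and local size (TILE, TILE).
static const char* kMatMulTiledSrc = R"CLC(
#define TILE 16
__kernel void matmul_tiled(__global const float* A,
                           __global const float* B,
                           __global float* C,
                           const int N) {
    __local float Asub[TILE][TILE];
    __local float Bsub[TILE][TILE];
    const int row = get_global_id(1);
    const int col = get_global_id(0);
    const int lr  = get_local_id(1);
    const int lc  = get_local_id(0);
    float acc = 0.0f;
    for (int t = 0; t < N / TILE; ++t) {             // walk tiles along k
        Asub[lr][lc] = A[row * N + t * TILE + lc];   // stage one tile of A
        Bsub[lr][lc] = B[(t * TILE + lr) * N + col]; // and one tile of B
        barrier(CLK_LOCAL_MEM_FENCE);
        for (int k = 0; k < TILE; ++k)
            acc += Asub[lr][k] * Bsub[k][lc];        // each load reused TILE times
        barrier(CLK_LOCAL_MEM_FENCE);
    }
    C[row * N + col] = acc;
}
)CLC";
```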
Purpose: Show when/where/why OpenCL becomes the optimal choice for image processing.
Key Concepts: Arithmetic intensity, separable filters, local memory optimization
cd examples\007_image_convolution
build.bat
Results (4096×4096 image, 15×15 kernel):
- Serial: 3,370ms
- OpenMP: 526ms (6.4x)
- Intel CPU OpenCL (separable): 22.5ms (150x speedup) - Winner
- NVIDIA GPU (separable): 28.7ms (117x)
Lesson: Image processing with large kernels is OpenCL's sweet spot; separable decomposition is critical.
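Separability is what turns 225 multiply-adds per pixel into 15 + 15: the 2D filter is applied as a horizontal pass followed by a vertical pass over the intermediate image. A sketch of the row pass (the column pass is symmetric), shown as an embedded source string; the repo's kernels may differ, and clamp-to-edge border handling is assumed:

```cpp
// Row pass of a separable convolution: a 15x15 filter with radius 7 needs only
// 15 taps here plus 15 more in the column pass.
static const char* kConvRowSrc = R"CLC(
__kernel void convolve_row(__global const float* src,
                           __global float* dst,
                           __constant float* filt,   // 2*radius+1 taps
                           const int width,
                           const int height,
                           const int radius) {
    int x = get_global_id(0);
    int y = get_global_id(1);
    if (x >= width || y >= height) return;
    float sum = 0.0f;
    for (int k = -radius; k <= radius; ++k) {
        int xx = clamp(x + k, 0, width - 1);          // clamp-to-edge border
        sum += src[y * width + xx] * filt[k + radius];
    }
    dst[y * width + x] = sum;
}
)CLC";
```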
| Operation Type | Arithmetic Intensity | Winner | Best Speedup |
|---|---|---|---|
| Vector addition | Very low (1 op/access) | CPU (Serial) | 1x |
| Matrix-vector | Low (10 ops/access) | OpenMP | 11x |
| Matrix multiply | High (2000 ops/access) | OpenCL GPU | 155x |
| Convolution (small kernel) | Low (9 ops/access) | OpenMP | 1.4x |
| Convolution (large kernel) | Very high (225 ops/access) | OpenCL | 150x |
Key Insight: GPU advantage grows steeply with arithmetic intensity: the more arithmetic per memory access, the larger the win.
Use Serial C++ when:
- Dataset is tiny (< 1K elements)
- Algorithm has poor parallelism
- Prototyping and validation
Use OpenMP when:
- Arithmetic intensity is moderate (5-50 ops per memory access)
- Quick parallelization needed
- Expect 6-12x speedup across diverse workloads
Use OpenCL GPU when:
- Arithmetic intensity is high (> 50 ops per memory access)
- Large datasets (> 10M elements)
- 100x+ speedup justifies development complexity
Use OpenCL CPU when:
- Data must stay in CPU memory
- Cache-friendly with data reuse
- Can outperform discrete GPUs for specific patterns
All examples use the same build pattern:
cd examples\<example_name>
build.bat
The `build.bat` script:
- Creates a `build/` directory
- Runs CMake to generate Visual Studio projects
- Builds in Release configuration
- Copies kernel files (`.cl`) to the output directory
- Runs the executable
mkdir build && cd build
cmake .. -G "Visual Studio 16 2019" -A x64
cmake --build . --config Release
cd Release
<executable>.exe
- Install at least one OpenCL runtime (NVIDIA CUDA, Intel Graphics Driver, or Intel oneAPI)
- Run `setup\check_opencl_installed.bat` to verify
- Ensure `.cl` files are in the same directory as the executable
- Check that `build.bat` copies kernel files correctly
- Update Intel Graphics drivers to latest version
- Remove legacy Intel OpenCL CPU Runtime if installed alongside Intel oneAPI
- See: https://github.com/intel/compute-runtime
- Install Visual Studio 2019 Build Tools with "Desktop development with C++"
- Or install standalone CMake from https://cmake.org
- Memory-bound operations (like vector addition) don't benefit much from GPUs due to transfer overhead
- Compute-intensive operations (matrix multiplication, image processing) show significant GPU speedup
- Breakeven point varies by hardware - test with realistic data sizes
- Multi-device execution works best when devices have independent work chunks
Recommended order:
- 000_device_enumeration - Understand your hardware
- 001_hello_opencl - Learn basic OpenCL workflow
- 002_vector_addition - See why simple operations are slow
- 003_breakeven_analysis - Find when GPU acceleration helps
- 004_async_multidevice - Advanced: use all devices simultaneously
Cause: Conflicting Intel OpenCL runtimes
Solution: Uninstall legacy "Intel OpenCL CPU Runtime 16.x", keep only Intel oneAPI
Cause: Missing `<string>` header
Solution: Add `#include <string>` at the top of the file
Cause: Trying to use devices from NVIDIA + Intel in single context
Solution: Create separate contexts per platform (see example 004)
- OpenCL Programming Guide: https://www.khronos.org/opencl/
- NVIDIA OpenCL Best Practices: https://docs.nvidia.com/cuda/opencl-best-practices-guide/
- Intel OpenCL Documentation: https://www.intel.com/content/www/us/en/developer/tools/opencl-sdk/overview.html
MIT License - See LICENSE file for details
Contributions welcome! Please ensure:
- Code follows existing style
- Examples build cleanly on Windows + MSVC 2019
- Include README.md for new examples
- Test on at least one GPU platform
Examples developed and tested on Windows 10 with:
- Visual Studio 2019 Build Tools
- NVIDIA CUDA Toolkit 12.9
- Intel oneAPI 2025.1