A collection of practical OpenCL examples demonstrating GPU computing on Windows using Visual Studio 2019 and CMake.
- OS: Windows 10/11
- Compiler: Visual Studio 2019 Build Tools (MSVC 19.29+)
- CMake: 3.15+ (bundled with VS Build Tools)
- OpenCL Runtimes: At least one of:
  - NVIDIA CUDA Toolkit 12.x (for NVIDIA GPUs)
  - Intel Graphics Driver (for Intel integrated GPUs)
  - Intel oneAPI Base Toolkit (for CPU execution)

Test hardware used for the results quoted below:
- GPU: NVIDIA RTX A2000 Laptop GPU
- iGPU: Intel UHD Graphics (11th Gen)
- CPU: Intel Core i7-11850H @ 2.50GHz
opencl-windows-cpp/
├── examples/
│   ├── 000_device_enumeration/         # List all OpenCL platforms and devices
│   ├── 001_hello_opencl/               # Simple "Hello World" kernel
│   ├── 002_vector_addition/            # CPU vs GPU performance comparison
│   ├── 003_breakeven_analysis/         # Find OpenCL performance crossover points
│   ├── 004_async_multidevice/          # Concurrent execution across devices
│   ├── 005_parallelization_comparison/ # Serial vs std::execution vs OpenMP vs OpenCL
│   ├── 006_matrix_multiply/            # Compute-bound matrix multiplication
│   └── 007_image_convolution/          # Separable image convolution
├── setup/
│   ├── check_opencl_installed.bat      # Verify OpenCL installation
│   └── detect_opencl_hardware.bat      # Hardware detection script
└── README.md
cd setup
check_opencl_installed.bat
Expected output: Lists installed OpenCL runtimes and detects your hardware.
cd examples\001_hello_opencl
build.bat
Each example includes:
- `main.cpp` - Host code
- `*.cl` - OpenCL kernel(s)
- `CMakeLists.txt` - CMake configuration
- `build.bat` - Build and run script
- `README.md` - Example documentation
Purpose: Detect all OpenCL platforms and devices on your system.
Key Concepts: Platform querying, device properties
cd examples\000_device_enumeration
build.bat
Output: Lists all GPUs and CPUs with their capabilities.
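The core of device enumeration is just two query loops: one over platforms, one over each platform's devices. A minimal sketch of that idea (not the repo's exact `main.cpp`), assuming an OpenCL SDK providing `CL/cl.h` and linking against `OpenCL.lib`:

```cpp
// Minimal platform/device enumeration sketch. Link with OpenCL.lib.
#define CL_TARGET_OPENCL_VERSION 120
#include <CL/cl.h>
#include <cstdio>
#include <vector>

int main() {
    cl_uint numPlatforms = 0;
    clGetPlatformIDs(0, nullptr, &numPlatforms);
    std::vector<cl_platform_id> platforms(numPlatforms);
    clGetPlatformIDs(numPlatforms, platforms.data(), nullptr);

    for (cl_platform_id p : platforms) {
        char name[256] = {};
        clGetPlatformInfo(p, CL_PLATFORM_NAME, sizeof(name), name, nullptr);
        std::printf("Platform: %s\n", name);

        cl_uint numDevices = 0;
        if (clGetDeviceIDs(p, CL_DEVICE_TYPE_ALL, 0, nullptr, &numDevices) != CL_SUCCESS)
            continue;                       // platform with no usable devices
        std::vector<cl_device_id> devices(numDevices);
        clGetDeviceIDs(p, CL_DEVICE_TYPE_ALL, numDevices, devices.data(), nullptr);

        for (cl_device_id d : devices) {
            char devName[256] = {};
            cl_uint cus = 0;
            clGetDeviceInfo(d, CL_DEVICE_NAME, sizeof(devName), devName, nullptr);
            clGetDeviceInfo(d, CL_DEVICE_MAX_COMPUTE_UNITS, sizeof(cus), &cus, nullptr);
            std::printf("  Device: %s (%u compute units)\n", devName, cus);
        }
    }
    return 0;
}
```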
Purpose: Simplest possible GPU kernel - write "Hello from GPU!" to a buffer.
Key Concepts: Context creation, kernel compilation, buffer management
cd examples\001_hello_opencl
build.bat
Expected Output:
Using device: NVIDIA RTX A2000 Laptop GPU
Kernel output: Hello from GPU!
Success!
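For orientation, the whole host-side workflow fits in one short file: pick a device, create a context and queue, build the kernel from source, run it, and read the buffer back. A condensed sketch of that flow (not the repo's exact source; error checks trimmed, and it assumes the first platform exposes a GPU):

```cpp
// Condensed hello-world host flow. Link with OpenCL.lib; error handling trimmed.
#define CL_TARGET_OPENCL_VERSION 120
#include <CL/cl.h>
#include <cstdio>

// Kernel source embedded for brevity; the repo loads it from a .cl file instead.
static const char* kSrc = R"CLC(
__constant char kMsg[] = "Hello from GPU!";
__kernel void hello(__global char* out) {
    size_t i = get_global_id(0);
    out[i] = kMsg[i];
}
)CLC";

int main() {
    cl_platform_id platform; cl_device_id device; cl_int err;
    clGetPlatformIDs(1, &platform, nullptr);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, nullptr);

    cl_context ctx = clCreateContext(nullptr, 1, &device, nullptr, nullptr, &err);
    cl_command_queue q = clCreateCommandQueue(ctx, device, 0, &err);

    cl_program prog = clCreateProgramWithSource(ctx, 1, &kSrc, nullptr, &err);
    clBuildProgram(prog, 1, &device, nullptr, nullptr, nullptr);
    cl_kernel kernel = clCreateKernel(prog, "hello", &err);

    char result[16] = {};                       // "Hello from GPU!" + '\0'
    cl_mem buf = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, sizeof(result), nullptr, &err);
    clSetKernelArg(kernel, 0, sizeof(buf), &buf);

    size_t global = sizeof(result);             // one work-item per character
    clEnqueueNDRangeKernel(q, kernel, 1, nullptr, &global, nullptr, 0, nullptr, nullptr);
    clEnqueueReadBuffer(q, buf, CL_TRUE, 0, sizeof(result), result, 0, nullptr, nullptr);

    std::printf("Kernel output: %s\n", result);

    clReleaseMemObject(buf); clReleaseKernel(kernel); clReleaseProgram(prog);
    clReleaseCommandQueue(q); clReleaseContext(ctx);
    return 0;
}
```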
Purpose: Compare CPU vs GPU performance for vector addition.
Key Concepts: Memory transfer overhead, parallel execution
cd examples\002_vector_addition
build.bat
Results (10M elements):
- Serial C++: 6.12ms (baseline)
- OpenCL GPU: 24.23ms (SLOWER - memory-bound operation)
Lesson: Simple operations don't benefit from GPUs due to transfer overhead.
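The kernel behind those numbers is essentially one add per element against three global memory accesses, plus two host-to-device transfers in and one back, so there is nothing for the GPU to amortize. A sketch of the kernel's shape, shown here as an embedded source string (the repo keeps kernels in separate `.cl` files, and its version may differ slightly):

```cpp
// Vector-add kernel sketch: one add per work-item vs. three global memory
// accesses, memory-bound even before PCIe transfer cost is counted.
static const char* kVecAddSrc = R"CLC(
__kernel void vec_add(__global const float* a,
                      __global const float* b,
                      __global float* c,
                      const int n) {
    int i = get_global_id(0);
    if (i < n) c[i] = a[i] + b[i];
}
)CLC";
```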
Purpose: Find the vector size where OpenCL becomes faster than serial C++.
Key Concepts: Performance profiling, scaling analysis
cd examples\003_breakeven_analysis
build.bat
Key Findings:
- NVIDIA RTX A2000: Faster at 64K elements
- Intel UHD Graphics: Faster at 256K elements
Lesson: GPUs require sufficient workload to amortize overhead.
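The measurement itself boils down to timing both paths at doubling sizes until the GPU wins. A simplified, self-contained sketch of that sweep (`gpuAdd` below is a hypothetical stand-in that just calls the serial path; in the real example it would be the OpenCL host code from 002):

```cpp
// Breakeven sweep sketch (not the repo's 003 source).
#include <chrono>
#include <cstdio>
#include <vector>

static void serialAdd(const std::vector<float>& a, const std::vector<float>& b,
                      std::vector<float>& c) {
    for (size_t i = 0; i < a.size(); ++i) c[i] = a[i] + b[i];
}

// Hypothetical placeholder: wire this to the OpenCL path from example 002.
static void gpuAdd(const std::vector<float>& a, const std::vector<float>& b,
                   std::vector<float>& c) { serialAdd(a, b, c); }

template <typename F>
static double msOf(F&& f) {
    auto t0 = std::chrono::steady_clock::now();
    f();
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

int main() {
    for (size_t n = 1 << 10; n <= (1 << 24); n <<= 1) {   // 1K .. 16M elements
        std::vector<float> a(n, 1.0f), b(n, 2.0f), c(n);
        double cpu = msOf([&] { serialAdd(a, b, c); });
        double gpu = msOf([&] { gpuAdd(a, b, c); });
        std::printf("n=%zu  serial %.3f ms  opencl %.3f ms %s\n",
                    n, cpu, gpu, gpu < cpu ? "<-- GPU wins" : "");
    }
    return 0;
}
```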
Purpose: Execute kernels simultaneously across multiple devices.
Key Concepts: Multiple contexts, asynchronous execution, cross-platform limitations
cd examples\004_async_multidevice
build.bat
Important: Timing analysis across platforms is unreliable - each vendor uses different time references.
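The structural point of this example is one context and one queue per device, created per platform, so NVIDIA and Intel devices never share a context. A trimmed sketch of that setup (kernels, buffers, and the actual per-device work are omitted; error handling trimmed):

```cpp
// One context + one queue per device, grouped by platform. Link with OpenCL.lib.
#define CL_TARGET_OPENCL_VERSION 120
#include <CL/cl.h>
#include <cstdio>
#include <vector>

struct DeviceSlot {
    cl_device_id     device;
    cl_context       context;
    cl_command_queue queue;
};

int main() {
    cl_uint numPlatforms = 0;
    clGetPlatformIDs(0, nullptr, &numPlatforms);
    std::vector<cl_platform_id> platforms(numPlatforms);
    clGetPlatformIDs(numPlatforms, platforms.data(), nullptr);

    std::vector<DeviceSlot> slots;
    for (cl_platform_id p : platforms) {
        cl_uint numDevices = 0;
        if (clGetDeviceIDs(p, CL_DEVICE_TYPE_ALL, 0, nullptr, &numDevices) != CL_SUCCESS)
            continue;
        std::vector<cl_device_id> devices(numDevices);
        clGetDeviceIDs(p, CL_DEVICE_TYPE_ALL, numDevices, devices.data(), nullptr);
        for (cl_device_id d : devices) {
            cl_int err = CL_SUCCESS;
            cl_context ctx = clCreateContext(nullptr, 1, &d, nullptr, nullptr, &err);
            cl_command_queue q = clCreateCommandQueue(ctx, d, 0, &err);
            slots.push_back({d, ctx, q});
        }
    }

    // Per device: build its own program/kernel/buffers against s.context and
    // enqueue that device's chunk of work on s.queue (all enqueues non-blocking).
    // Only after everything is queued do we wait, so the devices run concurrently:
    for (DeviceSlot& s : slots) clFinish(s.queue);

    for (DeviceSlot& s : slots) {
        clReleaseCommandQueue(s.queue);
        clReleaseContext(s.context);
    }
    std::printf("Created %zu device slots\n", slots.size());
    return 0;
}
```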
Purpose: Compare Serial C++, C++17 std::execution, OpenMP, and OpenCL for matrix-vector multiplication.
Key Concepts: Technology trade-offs, memory bandwidth limits
cd examples\005_parallelization_comparison
build.bat
Results (4096×4096 matrix):
- Serial: 16.4ms
- OpenMP: 1.5ms (11x speedup) - Winner
- OpenCL GPU: 23.4ms (SLOWER - still memory-bound)
Lesson: OpenMP dominates moderate-intensity operations.
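For comparison, the OpenMP path that wins here is a single `parallel for` over rows. A minimal sketch (not the repo's exact 005 source; build with MSVC's `/openmp` flag):

```cpp
// OpenMP matrix-vector multiply sketch: one parallel-for over rows.
#include <omp.h>
#include <cstdio>
#include <vector>

int main() {
    const int N = 4096;
    std::vector<float> A(static_cast<size_t>(N) * N, 1.0f), x(N, 1.0f), y(N, 0.0f);

    double t0 = omp_get_wtime();
    #pragma omp parallel for
    for (int i = 0; i < N; ++i) {
        float sum = 0.0f;
        for (int j = 0; j < N; ++j)
            sum += A[static_cast<size_t>(i) * N + j] * x[j];
        y[i] = sum;
    }
    double t1 = omp_get_wtime();

    std::printf("y[0]=%.1f  %.2f ms\n", y[0], (t1 - t0) * 1000.0);
    return 0;
}
```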
Purpose: Demonstrate compute-intensive operations where GPUs provide massive speedups.
Key Concepts: O(n³) complexity, tiling optimization, GFLOPS
cd examples\006_matrix_multiply
build.bat
Results (2048×2048 matrices):
- Serial: 16,306ms
- OpenMP: 5,242ms (3.1x)
- NVIDIA GPU (tiled): 105ms (155x speedup) - Winner
Lesson: Matrix multiply is the canonical GPU-suitable computation.
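The large speedup comes from the tiled variant: each work-group stages TILE x TILE blocks of A and B in local memory so every loaded element is reused TILE times. A sketch of that kernel shape, shown as an embedded source string (the repo's `.cl` may differ in details; assumes N is a multiple of TILE, which holds for 2048):

```cpp
// Tiled matrix-multiply kernel sketch.
// Launch with global size (N, N) and local size (TILE, TILE).
static const char* kMatMulTiledSrc = R"CLC(
#define TILE 16
__kernel void matmul_tiled(__global const float* A,
                           __global const float* B,
                           __global float* C,
                           const int N) {
    __local float Asub[TILE][TILE];
    __local float Bsub[TILE][TILE];
    const int row = get_global_id(1);
    const int col = get_global_id(0);
    const int lr  = get_local_id(1);
    const int lc  = get_local_id(0);
    float acc = 0.0f;
    for (int t = 0; t < N / TILE; ++t) {             // walk tiles along k
        Asub[lr][lc] = A[row * N + t * TILE + lc];   // stage one tile of A
        Bsub[lr][lc] = B[(t * TILE + lr) * N + col]; // and one tile of B
        barrier(CLK_LOCAL_MEM_FENCE);
        for (int k = 0; k < TILE; ++k)
            acc += Asub[lr][k] * Bsub[k][lc];        // each load reused TILE times
        barrier(CLK_LOCAL_MEM_FENCE);
    }
    C[row * N + col] = acc;
}
)CLC";
```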
Purpose: Show when/where/why OpenCL becomes the optimal choice for image processing.
Key Concepts: Arithmetic intensity, separable filters, local memory optimization
cd examples\007_image_convolution
build.bat
Results (4096×4096 image, 15×15 kernel):
- Serial: 3,370ms
- OpenMP: 526ms (6.4x)
- Intel CPU OpenCL (separable): 22.5ms (150x speedup) - Winner
- NVIDIA GPU (separable): 28.7ms (117x)
Lesson: Image processing with large kernels is OpenCL's sweet spot; separable decomposition is critical.
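Separability is what turns 225 multiply-adds per pixel into 15 + 15: the 2D filter is applied as a horizontal pass followed by a vertical pass over the intermediate image. A sketch of the row pass (the column pass is symmetric), shown as an embedded source string; the repo's kernels may differ, and clamp-to-edge border handling is assumed:

```cpp
// Row pass of a separable convolution: a 15x15 filter with radius 7 needs only
// 15 taps here plus 15 more in the column pass.
static const char* kConvRowSrc = R"CLC(
__kernel void convolve_row(__global const float* src,
                           __global float* dst,
                           __constant float* filt,   // 2*radius+1 taps
                           const int width,
                           const int height,
                           const int radius) {
    int x = get_global_id(0);
    int y = get_global_id(1);
    if (x >= width || y >= height) return;
    float sum = 0.0f;
    for (int k = -radius; k <= radius; ++k) {
        int xx = clamp(x + k, 0, width - 1);          // clamp-to-edge border
        sum += src[y * width + xx] * filt[k + radius];
    }
    dst[y * width + x] = sum;
}
)CLC";
```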
| Operation Type | Arithmetic Intensity | Winner | Best Speedup |
|---|---|---|---|
| Vector addition | Very low (1 op/access) | CPU (Serial) | 1x |
| Matrix-vector | Low (10 ops/access) | OpenMP | 11x |
| Matrix multiply | High (2000 ops/access) | OpenCL GPU | 155x |
| Convolution (small kernel) | Low (9 ops/access) | OpenMP | 1.4x |
| Convolution (large kernel) | Very high (225 ops/access) | OpenCL | 150x |
Key Insight: GPU advantage grows steeply with arithmetic intensity: the more arithmetic per memory access, the larger the win.
Use Serial C++ when:
- Dataset is tiny (< 1K elements)
- Algorithm has poor parallelism
- Prototyping and validation
Use OpenMP when:
- Arithmetic intensity is moderate (5-50 ops per memory access)
- Quick parallelization needed
- Expect 6-12x speedup across diverse workloads
Use OpenCL GPU when:
- Arithmetic intensity is high (> 50 ops per memory access)
- Large datasets (> 10M elements)
- 100x+ speedup justifies development complexity
Use OpenCL CPU when:
- Data must stay in CPU memory
- Cache-friendly with data reuse
- Can outperform discrete GPUs for specific patterns
All examples use the same build pattern:
cd examples\<example_name>
build.bat
The `build.bat` script:
- Creates a `build/` directory
- Runs CMake to generate Visual Studio projects
- Builds in Release configuration
- Copies kernel files (`.cl`) to the output directory
- Runs the executable
mkdir build && cd build
cmake .. -G "Visual Studio 16 2019" -A x64
cmake --build . --config Release
cd Release
<executable>.exe
- Install at least one OpenCL runtime (NVIDIA CUDA, Intel Graphics Driver, or Intel oneAPI)
- Run `setup\check_opencl_installed.bat` to verify
- Ensure `.cl` files are in the same directory as the executable
- Check that `build.bat` copies kernel files correctly
- Update Intel Graphics drivers to latest version
- Remove legacy Intel OpenCL CPU Runtime if installed alongside Intel oneAPI
- See: https://github.com/intel/compute-runtime
- Install Visual Studio 2019 Build Tools with "Desktop development with C++"
- Or install standalone CMake from https://cmake.org
- Memory-bound operations (like vector addition) don't benefit much from GPUs due to transfer overhead
- Compute-intensive operations (matrix multiplication, image processing) show significant GPU speedup
- Breakeven point varies by hardware - test with realistic data sizes
- Multi-device execution works best when devices have independent work chunks
Recommended order:
- 000_device_enumeration - Understand your hardware
- 001_hello_opencl - Learn basic OpenCL workflow
- 002_vector_addition - See why simple operations are slow
- 003_breakeven_analysis - Find when GPU acceleration helps
- 004_async_multidevice - Advanced: use all devices simultaneously
Cause: Conflicting Intel OpenCL runtimes
Solution: Uninstall legacy "Intel OpenCL CPU Runtime 16.x", keep only Intel oneAPI
Cause: Missing `<string>` header
Solution: Add `#include <string>` at the top of the file
Cause: Trying to use devices from NVIDIA + Intel in single context
Solution: Create separate contexts per platform (see example 004)
- OpenCL Programming Guide: https://www.khronos.org/opencl/
- NVIDIA OpenCL Best Practices: https://docs.nvidia.com/cuda/opencl-best-practices-guide/
- Intel OpenCL Documentation: https://www.intel.com/content/www/us/en/developer/tools/opencl-sdk/overview.html
MIT License - See LICENSE file for details
Contributions welcome! Please ensure:
- Code follows existing style
- Examples build cleanly on Windows + MSVC 2019
- Include README.md for new examples
- Test on at least one GPU platform
Examples developed and tested on Windows 10 with:
- Visual Studio 2019 Build Tools
- NVIDIA CUDA Toolkit 12.9
- Intel oneAPI 2025.1