Practical OpenCL examples for Windows with C/C++, demonstrating GPU computing across NVIDIA and Intel hardware

Foadsf/opencl-windows-examples

OpenCL Examples for Windows (C/C++)

A collection of practical OpenCL examples demonstrating GPU computing on Windows using Visual Studio 2019 and CMake.

System Requirements

  • OS: Windows 10/11
  • Compiler: Visual Studio 2019 Build Tools (MSVC 19.29+)
  • CMake: 3.15+ (bundled with VS Build Tools)
  • OpenCL Runtimes: At least one of:
    • NVIDIA CUDA Toolkit 12.x (for NVIDIA GPUs)
    • Intel Graphics Driver (for Intel integrated GPUs)
    • Intel oneAPI Base Toolkit (for CPU execution)

Hardware Tested

  • GPU: NVIDIA RTX A2000 Laptop GPU
  • iGPU: Intel UHD Graphics (11th Gen)
  • CPU: Intel Core i7-11850H @ 2.50GHz

Repository Structure

opencl-windows-cpp/
├── examples/
│   ├── 000_device_enumeration/    # List all OpenCL platforms and devices
│   ├── 001_hello_opencl/          # Simple "Hello World" kernel
│   ├── 002_vector_addition/       # CPU vs GPU performance comparison
│   ├── 003_breakeven_analysis/    # Find OpenCL performance crossover points
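│   ├── 004_async_multidevice/     # Concurrent execution across devices
│   ├── 005_parallelization_comparison/ # Serial vs std::execution vs OpenMP vs OpenCL
│   ├── 006_matrix_multiply/       # Compute-intensive matrix multiplication
│   └── 007_image_convolution/     # Image convolution with separable filters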
├── setup/
│   ├── check_opencl_installed.bat # Verify OpenCL installation
│   └── detect_opencl_hardware.bat # Hardware detection script
└── README.md

Quick Start

1. Verify OpenCL Installation

cd setup
check_opencl_installed.bat

Expected output: Lists installed OpenCL runtimes and detects your hardware.

2. Build and Run an Example

cd examples\001_hello_opencl
build.bat

Each example includes:

  • main.cpp - Host code
  • *.cl - OpenCL kernel(s)
  • CMakeLists.txt - CMake configuration
  • build.bat - Build and run script
  • README.md - Example documentation

Examples Overview

000: Device Enumeration

Purpose: Detect all OpenCL platforms and devices on your system.

Key Concepts: Platform querying, device properties

cd examples\000_device_enumeration
build.bat

Output: Lists all GPUs and CPUs with their capabilities.


001: Hello OpenCL

Purpose: Simplest possible GPU kernel - write "Hello from GPU!" to a buffer.

Key Concepts: Context creation, kernel compilation, buffer management

cd examples\001_hello_opencl
build.bat

Expected Output:

Using device: NVIDIA RTX A2000 Laptop GPU
Kernel output: Hello from GPU!
Success!

002: Vector Addition

Purpose: Compare CPU vs GPU performance for vector addition.

Key Concepts: Memory transfer overhead, parallel execution

cd examples\002_vector_addition
build.bat

Results (10M elements):

  • Serial C++: 6.12ms (baseline)
  • OpenCL GPU: 24.23ms (SLOWER - memory-bound operation)

Lesson: Simple operations don't benefit from GPUs due to transfer overhead.


003: Breakeven Analysis

Purpose: Find the vector size where OpenCL becomes faster than serial C++.

Key Concepts: Performance profiling, scaling analysis

cd examples\003_breakeven_analysis
build.bat

Key Findings:

  • NVIDIA RTX A2000: Faster at 64K elements
  • Intel UHD Graphics: Faster at 256K elements

Lesson: GPUs require sufficient workload to amortize overhead.


004: Async Multi-Device

Purpose: Execute kernels simultaneously across multiple devices.

Key Concepts: Multiple contexts, asynchronous execution, cross-platform limitations

cd examples\004_async_multidevice
build.bat

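Important: Comparing timings across platforms is unreliable, because each vendor's runtime uses its own time reference. Compare durations measured per device rather than raw timestamps from different platforms.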


005: Parallelization Technologies Comparison

Purpose: Compare Serial C++, C++17 std::execution, OpenMP, and OpenCL for matrix-vector multiplication.

Key Concepts: Technology trade-offs, memory bandwidth limits

cd examples\005_parallelization_comparison
build.bat

Results (4096×4096 matrix):

  • Serial: 16.4ms
  • OpenMP: 1.5ms (11x speedup) - Winner
  • OpenCL GPU: 23.4ms (SLOWER - still memory-bound)

Lesson: OpenMP dominates moderate-intensity operations.


006: Matrix Multiplication - Where GPUs Shine

Purpose: Demonstrate compute-intensive operations where GPUs provide massive speedups.

Key Concepts: O(n³) complexity, tiling optimization, GFLOPS

cd examples\006_matrix_multiply
build.bat

Results (2048×2048 matrices):

  • Serial: 16,306ms
  • OpenMP: 5,242ms (3.1x)
  • NVIDIA GPU (tiled): 105ms (155x speedup) - Winner

Lesson: Matrix multiply is the canonical GPU-suitable computation.


007: Image Convolution - GPU Performance Sweet Spot

Purpose: Show when and why OpenCL becomes the optimal choice for image processing.

Key Concepts: Arithmetic intensity, separable filters, local memory optimization

cd examples\007_image_convolution
build.bat

Results (4096×4096 image, 15×15 kernel):

  • Serial: 3,370ms
  • OpenMP: 526ms (6.4x)
  • Intel CPU OpenCL (separable): 22.5ms (150x speedup) - Winner
  • NVIDIA GPU (separable): 28.7ms (117x)

Lesson: Image processing with large kernels is OpenCL's sweet spot. Separable decomposition is critical.


Performance Summary Across All Examples

Operation Type             | Arithmetic Intensity       | Winner       | Best Speedup
Vector addition            | Very low (1 op/access)     | CPU (Serial) | 1x
Matrix-vector              | Low (10 ops/access)        | OpenMP       | 11x
Matrix multiply            | High (2000 ops/access)     | OpenCL GPU   | 155x
Convolution (small kernel) | Low (9 ops/access)         | OpenMP       | 1.4x
Convolution (large kernel) | Very high (225 ops/access) | OpenCL       | 150x

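Key Insight: The payoff from OpenCL grows steeply with arithmetic intensity. In the measurements above, OpenCL loses at 1-10 ops per access and wins by roughly 150x at 225 ops per access and above.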

When to Use Each Technology

Use Serial C++ when:

  • Dataset is tiny (< 1K elements)
  • Algorithm has poor parallelism
  • Prototyping and validation

Use OpenMP when:

  • Arithmetic intensity is moderate (5-50 ops per memory access)
  • Quick parallelization needed
  • Expect 6-12x speedup across diverse workloads

Use OpenCL GPU when:

  • Arithmetic intensity is high (> 50 ops per memory access)
  • Large datasets (> 10M elements)
  • 100x+ speedup justifies development complexity

Use OpenCL CPU when:

  • Data must stay in CPU memory
  • Cache-friendly with data reuse
  • Workload is one where a CPU device can outperform a discrete GPU (example 007: 22.5ms vs 28.7ms)

Building from Source

All examples use the same build pattern:

cd examples\<example_name>
build.bat

The build.bat script:

  1. Creates a build/ directory
  2. Runs CMake to generate Visual Studio projects
  3. Builds in Release configuration
  4. Copies kernel files (.cl) to output directory
  5. Runs the executable

Manual Build (Advanced)

mkdir build && cd build
cmake .. -G "Visual Studio 16 2019" -A x64
cmake --build . --config Release
cd Release
<executable>.exe

Troubleshooting

"No OpenCL platforms found"

  • Install at least one OpenCL runtime (NVIDIA CUDA, Intel Graphics Driver, or Intel oneAPI)
  • Run setup\check_opencl_installed.bat to verify

"Failed to open kernel file"

  • Ensure .cl files are in the same directory as the executable
  • Check that build.bat copies kernel files correctly

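Device enumeration hangs

  • Check for conflicting Intel OpenCL runtimes: uninstall the legacy "Intel OpenCL CPU Runtime 16.x" and keep only Intel oneAPI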

CMake not found

  • Install Visual Studio 2019 Build Tools with "Desktop development with C++"
  • Or install standalone CMake from https://cmake.org

Performance Tips

  1. Memory-bound operations (like vector addition) don't benefit much from GPUs due to transfer overhead
  2. Compute-intensive operations (matrix multiplication, image processing) show significant GPU speedup
  3. Breakeven point varies by hardware - test with realistic data sizes
  4. Multi-device execution works best when devices have independent work chunks

Learning Path

Recommended order:

  1. 000_device_enumeration - Understand your hardware
  2. 001_hello_opencl - Learn basic OpenCL workflow
  3. 002_vector_addition - See why simple operations are slow
  4. 003_breakeven_analysis - Find when GPU acceleration helps
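  5. 004_async_multidevice - Advanced: use all devices simultaneously
  6. 005_parallelization_comparison - Compare Serial C++, std::execution, OpenMP, and OpenCL
  7. 006_matrix_multiply - See where GPUs shine
  8. 007_image_convolution - See when OpenCL is the best choice for image processing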

Common Issues & Solutions

Issue: OpenCL hangs during platform enumeration

Cause: Conflicting Intel OpenCL runtimes
Solution: Uninstall legacy "Intel OpenCL CPU Runtime 16.x", keep only Intel oneAPI

Issue: Build errors about std::to_string

Cause: Missing <string> header
Solution: Add #include <string> at top of file

Issue: "Cannot create context with devices from multiple platforms"

Cause: Trying to use NVIDIA and Intel devices in a single context; an OpenCL context can only contain devices from one platform
Solution: Create separate contexts per platform (see example 004)

Resources

License

MIT License - See LICENSE file for details

Contributing

Contributions welcome! Please ensure:

  • Code follows the existing style
  • Examples build cleanly on Windows + MSVC 2019
  • New examples include a README.md
  • Changes are tested on at least one GPU platform

Acknowledgments

Examples developed and tested on Windows 10 with:

  • Visual Studio 2019 Build Tools
  • NVIDIA CUDA Toolkit 12.9
  • Intel oneAPI 2025.1
