# Workgroups in PyOpenCL

Elwin van 't Wout

PUC Chile

25-9-2024

This tutorial shows the functionality of workgroups in OpenCL.

First, we need to configure the virtual machine and install PyOpenCL.

In [1]:
!sudo apt update
!sudo apt install -y nvidia-cuda-toolkit pocl-opencl-icd
!pip install pyopencl

[33m0% [Working][0m            Get:1 https://cloud.r-project.org/bin/linux/ubuntu jammy-cran40/ InRelease [3,626 B]
Get:2 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  InRelease [1,581 B]
Get:3 http://security.ubuntu.com/ubuntu jammy-security InRelease [129 kB]
Get:4 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  Packages [1,001 kB]
Hit:5 http://archive.ubuntu.com/ubuntu jammy InRelease
Hit:6 https://ppa.launchpadcontent.net/deadsnakes/ppa/ubuntu jammy InRelease
Get:7 http://archive.ubuntu.com/ubuntu jammy-updates InRelease [128 kB]
Hit:8 https://ppa.launchpadcontent.net/graphics-drivers/ppa/ubuntu jammy InRelease
Ign:9 https://r2u.stat.illinois.edu/ubuntu jammy InRelease
Hit:10 https://ppa.launchpadcontent.net/ubuntugis/ppa/ubuntu jammy InRelease
Get:11 https://r2u.stat.illinois.edu/ubuntu jammy Release [5,713 B]
Get:12 https://r2u.stat.illinois.edu/ubuntu jammy Release.gpg [793 B]
Hit:13 http://archive.ubuntu.com/ubuntu 

In [2]:
import numpy as np
import pyopencl as cl
import pyopencl.array as cl_array



OpenCL always needs a `context` object, which stores information on the programming environment. Let us create a default context, where OpenCL decides which device to use. See the tutorial `1_platform_context.ipynb` for instructions on choosing a specific device type.

In [3]:
ctx = cl.create_some_context()

In [4]:
device = ctx.devices[0]
print("Platform name:", device.platform.name)
print("Device name:", device.name)
print("Device type:", cl.device_type.to_string(device.type))

Platform name: NVIDIA CUDA
Device name: Tesla T4
Device type: ALL | GPU


Also, a `queue` is needed to store the sequence of instructions that are sent to the device.

In [5]:
queue = cl.CommandQueue(ctx)

OpenCL divides the data into workgroups. Each workgroup contains many threads (called work-items in OpenCL terminology) that perform the same operation on different data (the SIMT model). The maximum size of a workgroup depends on the compute device.

In [6]:
print("Maximum work group size:", device.max_work_group_size)

Maximum work group size: 1024


The code that needs to be executed by OpenCL is called a 'kernel'. The kernel needs to be written as a piece of C code, where one can use the functionality of OpenCL as well. This piece of C code is stored in a text string and will be compiled by the program. Each function that is launched from the host as a kernel needs to start with the qualifier `__kernel`.

In OpenCL, the code is written for a single thread. That is, one needs to program the instructions for a single thread and OpenCL will assign threads to the data array.

The 'local ID' is the location of the thread in the workgroup.

The 'group ID' is the workgroup to which the thread belongs.

The 'global ID' is the location of the thread in the global grid, which in this tutorial coincides with the index of the data point in the data array.

The input variable '0' of these functions refers to the first dimension of the grid. In OpenCL, you can use 2D and 3D grids as well.
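
The relation between the three identifiers can be sketched on the host in plain Python, without any OpenCL device. This is only an emulation under the assumption of a 1D grid with zero global offset; the function name `emulate_ids` is hypothetical and not part of PyOpenCL.

```python
def emulate_ids(workgroup_size, n_workgroups):
    """Return (global_id, local_id, group_id) tuples, computed on the host.

    Hypothetical helper: it mimics what get_global_id(0), get_local_id(0)
    and get_group_id(0) return for a 1D grid with zero global offset.
    """
    ids = []
    for group_id in range(n_workgroups):
        for local_id in range(workgroup_size):
            # In 1D, the global ID is the group offset plus the local ID.
            global_id = group_id * workgroup_size + local_id
            ids.append((global_id, local_id, group_id))
    return ids


# Small example: two workgroups of three threads each.
for global_id, local_id, group_id in emulate_ids(workgroup_size=3, n_workgroups=2):
    print(f"global {global_id}: group {group_id}, local {local_id}")
```

On a real device the threads run concurrently and in no guaranteed order; the nested loops here only enumerate the identifiers.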

In [7]:
kernel = """
__kernel void get_id(__global int *a,
                     __global int *b,
                     __global int *c,
                     __global int *d)
{
  int id = get_global_id(0);

  a[id] = get_global_id(0);
  b[id] = get_local_id(0);
  c[id] = get_group_id(0);
  d[id] = get_local_size(0);
}
"""

The kernel needs to be compiled to build a program.

In [8]:
prg = cl.Program(ctx, kernel).build()

Now, let us specify the size of the workgroup and the number of workgroups we wish to use later on.

In [9]:
workgroup_size = 6
n_workgroups = 4
n_vector = workgroup_size * n_workgroups

Let us create four OpenCL arrays that will store the IDs of each thread.

In [10]:
cl_global_id = cl_array.empty(queue, n_vector, dtype=np.int32)
cl_local_id = cl_array.empty(queue, n_vector, dtype=np.int32)
cl_group_id = cl_array.empty(queue, n_vector, dtype=np.int32)
cl_local_size = cl_array.empty(queue, n_vector, dtype=np.int32)

With both the program and data available, we can execute the kernel on the compute device. For this, we will create an 'event' by specifying which kernel function of the program we wish to execute and providing the input variables.

The arguments of the kernel call are as follows:
 1. the command queue
 1. the global size of the data array, possibly multi-dimensional
 1. the size of the workgroups, which needs to divide the global size evenly
 1. the buffers for the input variables of the kernel function
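
When the length of the data array is not a multiple of the workgroup size, a common workaround is to round the global size up to the next multiple. The sketch below shows only the host-side arithmetic; the helper name `padded_global_size` is hypothetical and not part of PyOpenCL. With a padded grid, the kernel itself would also need a bounds check (for example `if (id < n)`) so that the extra threads do not write outside the arrays.

```python
def padded_global_size(n_items, workgroup_size):
    """Smallest multiple of workgroup_size that is at least n_items.

    Hypothetical helper for illustration; it only performs integer arithmetic.
    """
    if workgroup_size <= 0:
        raise ValueError("workgroup_size must be positive")
    # Ceiling division written with integer operations only.
    n_workgroups = -(-n_items // workgroup_size)
    return n_workgroups * workgroup_size


for n_items in (24, 25, 1000):
    print(n_items, "->", padded_global_size(n_items, workgroup_size=6))
```

In this tutorial the global size is constructed as `workgroup_size * n_workgroups`, so no padding is needed.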

In [11]:
event = prg.get_id(queue,
                   (n_vector,),
                   (workgroup_size,),
                   cl_global_id.data,
                   cl_local_id.data,
                   cl_group_id.data,
                   cl_local_size.data)

In [12]:
print("Global ID:")
print(cl_global_id)
print("Local ID:")
print(cl_local_id)
print("Group ID:")
print(cl_group_id)
print("Local size:")
print(cl_local_size)

Global ID:
[ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23]
Local ID:
[0 1 2 3 4 5 0 1 2 3 4 5 0 1 2 3 4 5 0 1 2 3 4 5]
Group ID:
[0 0 0 0 0 0 1 1 1 1 1 1 2 2 2 2 2 2 3 3 3 3 3 3]
Local size:
[6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6]


The output displays the identifiers of each thread within the global vector and the local workgroup. It also prints out the workgroup identifier and its size.
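
As a cross-check, the group and local IDs can be recovered from the global ID on the host with NumPy. The following is a minimal sketch that assumes the same sizes as above and does not touch the device; it rebuilds the expected arrays rather than reading them back.

```python
import numpy as np

workgroup_size = 6
n_workgroups = 4

# Expected global IDs for a 1D grid with zero offset.
global_id = np.arange(workgroup_size * n_workgroups, dtype=np.int32)

# Integer division gives the workgroup, the remainder gives the position inside it.
group_id, local_id = np.divmod(global_id, workgroup_size)

print("Local ID:")
print(local_id)
print("Group ID:")
print(group_id)
```

In the notebook, these host arrays could be compared with the device results, for instance with `np.array_equal(local_id, cl_local_id.get())`, where `get()` copies the PyOpenCL array back to a NumPy array.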