Commit 09a5915

[OpenMP][libomptarget][NFC] Add documentation regarding NextGen plugins
Differential Revision: https://reviews.llvm.org/D144975
1 parent f80a976 commit 09a5915

1 file changed, +112 -1 lines changed

openmp/docs/design/Runtimes.rst

Lines changed: 112 additions & 1 deletion
@@ -1123,8 +1123,119 @@ transformed and loaded back into the JIT pipeline via
LLVM/OpenMP Target Host Runtime Plugins (``libomptarget.rtl.XXXX``)
-------------------------------------------------------------------

1126-
.. _device_runtime:
The LLVM/OpenMP target host runtime plugins were recently re-implemented,
temporarily renamed the NextGen plugins, and made the default and only plugin
implementation. These plugins currently support NVIDIA and AMDGPU devices as
well as the GenericELF64bit host-simulated device.

The source code of the common infrastructure and the vendor-specific plugins is
in the ``openmp/libomptarget/nextgen-plugins`` directory in the LLVM project
repository. The plugin infrastructure aims to unify the plugin code and logic
behind a generic interface using object-oriented C++. The plugin interface is
composed of multiple generic C++ classes that implement the common logic
every vendor-specific plugin should provide. In turn, the specific plugins
inherit from those generic classes and implement the required functions that
depend on the specific vendor API. For example, the plugin interface defines
generic classes that represent a device, a device image, an efficient resource
manager, etc.

With this common plugin infrastructure, several tasks have been simplified:
adding a new vendor-specific plugin, adding generic features or optimizations
to all plugins, debugging plugins, etc.

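The split between generic classes and vendor-specific subclasses described
above can be pictured with a small sketch. This is only an illustration of the
inheritance pattern: ``GenericDevice``, ``ToyVendorDevice`` and their member
functions are hypothetical names invented here, not the classes defined by the
plugin interface.

```cpp
#include <cstddef>
#include <string>

// Hypothetical generic class: holds the logic shared by every plugin.
class GenericDevice {
public:
  virtual ~GenericDevice() = default;

  // Common logic lives here: validate the request once, then defer to the
  // vendor-specific hook.
  bool allocate(std::size_t Bytes) {
    if (Bytes == 0)
      return false;
    return allocateImpl(Bytes);
  }

  virtual std::string name() const = 0;

protected:
  // Vendor-specific hook that each plugin has to implement.
  virtual bool allocateImpl(std::size_t Bytes) = 0;
};

// Hypothetical vendor plugin: only the vendor-dependent part is written here.
class ToyVendorDevice final : public GenericDevice {
public:
  std::string name() const override { return "toy-vendor"; }
  std::size_t allocatedBytes() const { return Allocated; }

protected:
  bool allocateImpl(std::size_t Bytes) override {
    Allocated += Bytes; // A real plugin would call the vendor API here.
    return true;
  }

private:
  std::size_t Allocated = 0;
};
```

With this shape, a check added to ``GenericDevice::allocate`` applies to every
subclass at once, which is the kind of simplification the text refers to.
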

Environment Variables
^^^^^^^^^^^^^^^^^^^^^

There are several environment variables to change the behavior of the plugins:

* ``LIBOMPTARGET_SHARED_MEMORY_SIZE``
* ``LIBOMPTARGET_STACK_SIZE``
* ``LIBOMPTARGET_HEAP_SIZE``
* ``LIBOMPTARGET_NUM_INITIAL_STREAMS``
* ``LIBOMPTARGET_NUM_INITIAL_EVENTS``
* ``LIBOMPTARGET_LOCK_MAPPED_HOST_BUFFERS``
* ``LIBOMPTARGET_AMDGPU_NUM_HSA_QUEUES``
* ``LIBOMPTARGET_AMDGPU_HSA_QUEUE_SIZE``
* ``LIBOMPTARGET_AMDGPU_TEAMS_PER_CU``
* ``LIBOMPTARGET_AMDGPU_MAX_ASYNC_COPY_BYTES``
* ``LIBOMPTARGET_AMDGPU_NUM_INITIAL_HSA_SIGNALS``

The environment variables ``LIBOMPTARGET_SHARED_MEMORY_SIZE``,
``LIBOMPTARGET_STACK_SIZE`` and ``LIBOMPTARGET_HEAP_SIZE`` are described in
:ref:`libopenmptarget_environment_vars`.

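As a rough illustration of how a numeric variable from the list above could be
read with a fallback default, consider the sketch below. The helper
``readUIntEnv`` is hypothetical and uses only the C++ standard library; it is
not the parsing code used by the plugins, and its handling of malformed input
is an assumption made for this example.

```cpp
#include <cstdint>
#include <cstdlib>

// Hypothetical helper: returns the value of the environment variable Name,
// or Default when the variable is unset, empty, or has trailing non-digits.
// Note that std::strtoull also accepts leading whitespace and a sign.
inline std::uint64_t readUIntEnv(const char *Name, std::uint64_t Default) {
  const char *Text = std::getenv(Name);
  if (!Text || *Text == '\0')
    return Default;
  char *End = nullptr;
  unsigned long long Value = std::strtoull(Text, &End, 10);
  return (*End == '\0') ? Value : Default;
}
```

A caller could then write, for example,
``readUIntEnv("LIBOMPTARGET_NUM_INITIAL_STREAMS", 32)`` to obtain either the
user-provided value or the documented default.
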

LIBOMPTARGET_NUM_INITIAL_STREAMS
""""""""""""""""""""""""""""""""

This environment variable sets the number of streams pre-created in the plugin
(if supported) at initialization. More streams are created dynamically during
execution if needed. A stream is a queue of asynchronous operations (e.g.,
kernel launches and memory copies) that are executed sequentially. Parallelism
is achieved by using multiple streams. ``libomptarget`` leverages streams to
exploit parallelism between plugin operations. The default value is ``32``.

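The "pre-create some, grow on demand" behavior described above can be sketched
as a simple pool. ``ToyStream`` and ``ToyStreamPool`` are hypothetical names
used only for this illustration; the real resource manager in the plugin
infrastructure may be organized differently.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a vendor stream handle.
struct ToyStream {
  int Id;
};

// Minimal pool: pre-creates NumInitial streams and creates more on demand.
class ToyStreamPool {
public:
  explicit ToyStreamPool(std::size_t NumInitial) {
    for (std::size_t I = 0; I < NumInitial; ++I)
      Free.push_back(ToyStream{NextId++});
  }

  // Hands out a pre-created stream if one is free, otherwise creates one.
  ToyStream acquire() {
    if (Free.empty())
      return ToyStream{NextId++};
    ToyStream S = Free.back();
    Free.pop_back();
    return S;
  }

  // Returns a stream to the pool so that it can be reused.
  void release(ToyStream S) { Free.push_back(S); }

  // Number of streams created so far (pre-created plus dynamic).
  int totalCreated() const { return NextId; }

private:
  std::vector<ToyStream> Free;
  int NextId = 0;
};
```

In this sketch, a larger initial count moves creation cost to initialization,
while a smaller one defers it until the streams are actually requested.
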

LIBOMPTARGET_NUM_INITIAL_EVENTS
"""""""""""""""""""""""""""""""

This environment variable sets the number of events pre-created in the
plugin (if supported) at initialization. More events are created dynamically
during execution if needed. An event is used to efficiently synchronize one
stream with another. The default value is ``32``.

LIBOMPTARGET_LOCK_MAPPED_HOST_BUFFERS
"""""""""""""""""""""""""""""""""""""

This environment variable indicates whether the host buffers mapped by the user
should be automatically locked/pinned by the plugin. Pinned host buffers allow
true asynchronous copies between the host and devices. Enabling this feature can
increase the performance of applications that are intensive in host-device
memory transfers. The default value is ``false``.

LIBOMPTARGET_AMDGPU_NUM_HSA_QUEUES
""""""""""""""""""""""""""""""""""

This environment variable controls the number of HSA queues per device in the
AMDGPU plugin. An HSA queue is a runtime-allocated resource that contains an
AQL (Architected Queuing Language) packet buffer and is associated with an AQL
packet processor. HSA queues are used for inserting kernel packets to launch
kernel executions. A high number of HSA queues may degrade performance. The
default value is ``4``.

LIBOMPTARGET_AMDGPU_HSA_QUEUE_SIZE
""""""""""""""""""""""""""""""""""

This environment variable controls the size of each HSA queue in the AMDGPU
plugin. The size is the number of AQL packets an HSA queue is expected to hold.
It is also the number of AQL packets that can be pushed into each queue without
waiting for the driver to process them. The default value is ``512``.

LIBOMPTARGET_AMDGPU_TEAMS_PER_CU
""""""""""""""""""""""""""""""""

This environment variable controls the default number of teams relative to the
number of compute units (CUs) of the AMDGPU device. The default number of teams
is ``#default_teams = #teams_per_CU * #CUs``. The default value of teams per CU
is ``4``.

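The formula above is a plain multiplication, shown here as a tiny function.
The name ``defaultNumTeams`` and the CU count in the usage note are assumptions
chosen for illustration only.

```cpp
#include <cstdint>

// Hypothetical helper mirroring #default_teams = #teams_per_CU * #CUs.
constexpr std::uint32_t defaultNumTeams(std::uint32_t TeamsPerCU,
                                        std::uint32_t NumCUs) {
  return TeamsPerCU * NumCUs;
}
```

For instance, on a device assumed to have 104 CUs, the default of 4 teams per
CU gives ``defaultNumTeams(4, 104)``, that is, 416 teams.
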

LIBOMPTARGET_AMDGPU_MAX_ASYNC_COPY_BYTES
""""""""""""""""""""""""""""""""""""""""

This environment variable specifies the maximum size in bytes up to which
memory copies are asynchronous operations in the AMDGPU plugin. Up to this
transfer size, memory copies are asynchronous operations pushed to the
corresponding stream. Larger transfers are synchronous. Memory copies
involving already locked/pinned host buffers are always asynchronous. The
default value is ``1*1024*1024`` bytes (1 MiB).

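The decision rule described above can be written as a single predicate. This
is a sketch under stated assumptions: ``isAsyncCopy`` is a hypothetical name,
and treating the threshold as inclusive is this example's reading of "up to
this transfer size", not something the text states explicitly.

```cpp
#include <cstddef>

// Documented default threshold: 1*1024*1024 bytes.
constexpr std::size_t DefaultMaxAsyncCopyBytes = 1 * 1024 * 1024;

// Hypothetical predicate: true if the copy would be issued asynchronously.
// Copies involving already pinned host buffers are always asynchronous;
// otherwise the transfer size is compared with the threshold (assumed
// inclusive here).
constexpr bool
isAsyncCopy(std::size_t Bytes, bool HostBufferPinned,
            std::size_t MaxAsyncCopyBytes = DefaultMaxAsyncCopyBytes) {
  return HostBufferPinned || Bytes <= MaxAsyncCopyBytes;
}
```

Under these assumptions, a 4 KiB copy from an unpinned buffer is asynchronous,
while a 2 MiB copy is synchronous unless the host buffer is already pinned.
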

LIBOMPTARGET_AMDGPU_NUM_INITIAL_HSA_SIGNALS
"""""""""""""""""""""""""""""""""""""""""""

This environment variable controls the initial number of HSA signals per device
in the AMDGPU plugin. Each device has one signal resource manager, which
manages several pre-created signals. These signals are mainly used by AMDGPU
streams. More HSA signals are created dynamically during execution if needed.
The default value is ``64``.

.. _remote_offloading_plugin:
