.. highlight:: bash

Back-ends
=========

Accelerator Implementations
```````````````````````````

The following table shows which native implementation or feature is used to provide each alpaka functionality.

.. table::
+---------------------------------------------------------------+-----------------------------------------------+---------------------------------------------------------------------------------+-------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------+
| alpaka | Serial | std::thread | OpenMP 2.0 | OpenMP 4.0 | CUDA 9.0+ |
+===============================================================+===============================================+=================================================================================+=====================================================================================+=======================================================================================================================================+==================================================+
| Devices | Host Core | Host Cores | Host Cores | Host Cores | NVIDIA GPUs |
+---------------------------------------------------------------+-----------------------------------------------+---------------------------------------------------------------------------------+-------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------+
| Lib/API | standard C++ | std::thread | OpenMP 2.0 | OpenMP 4.0 | CUDA 9.0+ |
+---------------------------------------------------------------+-----------------------------------------------+---------------------------------------------------------------------------------+-------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------+
| Kernel execution | sequential | std::thread(kernel) | omp_set_dynamic(0), #pragma omp parallel num_threads(iNumKernelsInBlock) | #pragma omp target, #pragma omp teams num_teams(...) thread_limit(...), #pragma omp distribute, #pragma omp parallel num_threads(...) | cudaConfigureCall, cudaSetupArgument, cudaLaunch |
+---------------------------------------------------------------+-----------------------------------------------+---------------------------------------------------------------------------------+-------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------+
| Execution strategy grid-blocks | sequential | sequential | sequential | undefined | undefined |
+---------------------------------------------------------------+-----------------------------------------------+---------------------------------------------------------------------------------+-------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------+
| Execution strategy block-kernels | sequential | preemptive multitasking | preemptive multitasking | preemptive multitasking | lock-step within warps |
+---------------------------------------------------------------+-----------------------------------------------+---------------------------------------------------------------------------------+-------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------+
| getIdx | emulated | block-kernel: mapping of std::this_thread::get_id() grid-block: member variable | block-kernel: omp_get_num_threads() to 3D index mapping grid-block: member variable | block-kernel: omp_get_num_threads() to 3D index mapping grid-block: member variable | threadIdx, blockIdx |
+---------------------------------------------------------------+-----------------------------------------------+---------------------------------------------------------------------------------+-------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------+
| getExtents | member variables | member variables | member variables | member variables | gridDim, blockDim |
+---------------------------------------------------------------+-----------------------------------------------+---------------------------------------------------------------------------------+-------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------+
| getBlockSharedMemDynSizeBytes | allocated in memory prior to kernel execution | allocated in memory prior to kernel execution | allocated in memory prior to kernel execution | allocated in memory prior to kernel execution | __shared__ |
+---------------------------------------------------------------+-----------------------------------------------+---------------------------------------------------------------------------------+-------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------+
| allocBlockSharedMem | master thread allocates | syncBlockKernels -> master thread allocates -> syncBlockKernels | syncBlockKernels -> master thread allocates -> syncBlockKernels | syncBlockKernels -> master thread allocates -> syncBlockKernels | __shared__ |
+---------------------------------------------------------------+-----------------------------------------------+---------------------------------------------------------------------------------+-------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------+
| syncBlockKernels | not required | barrier | #pragma omp barrier | #pragma omp barrier | __syncthreads |
+---------------------------------------------------------------+-----------------------------------------------+---------------------------------------------------------------------------------+-------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------+
| atomicOp                                                      | hierarchy dependent                           | std::lock_guard< std::mutex >                                                   | #pragma omp critical                                                                | #pragma omp critical                                                                                                                  | atomicXXX                                        |
+---------------------------------------------------------------+-----------------------------------------------+---------------------------------------------------------------------------------+-------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------+
| ALPAKA_FN_HOST_ACC, ALPAKA_FN_ACC, ALPAKA_FN_HOST | inline | inline | inline | inline | __device__, __host__, __forceinline__ |
+---------------------------------------------------------------+-----------------------------------------------+---------------------------------------------------------------------------------+-------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------+

Serial
``````

The serial accelerator only allows blocks with exactly one thread.
Therefore it does not implement real synchronization or atomic primitives.

Threads
```````

Execution
+++++++++

To prevent recreating the threads between the execution of different blocks in the grid, the threads are stored inside a thread pool.
This thread pool is local to the invocation: tying it to the KernelExecutor instead could lead to heavy memory usage and many idling kernel-threads when multiple KernelExecutors are around.
Because the default policy of the threads in the pool is to yield instead of to wait, this would also slow down the system immensely.

OpenMP
``````

Execution
+++++++++

Parallel execution of the kernels in a block is required because when ``syncBlockThreads`` is called, all of them must have finished their work up to this line.
Therefore we have to spawn one real thread per kernel in a block.
``omp for`` is not usable because it is meant for cases where multiple iterations are executed by one thread, whereas our case requires a 1:1 mapping.
Therefore we use ``omp parallel`` with the number of threads in a block specified explicitly.
Another reason for not using ``omp for``, e.g. ``#pragma omp parallel for collapse(3) num_threads(blockDim.x*blockDim.y*blockDim.z)``, is that the ``#pragma omp barrier`` used for intra-block synchronization is not allowed inside ``omp for`` blocks.
Because OpenMP is designed for a 1:1 abstraction of hardware to software threads, the block size is restricted by the number of OpenMP threads allowed by the runtime.
This can be as few as 2 or 4 threads, but on a system with 4 cores and hyper-threading OpenMP may also allow 64 threads.

Index
+++++

OpenMP only provides a linear thread index. This index is converted to a three-dimensional index at runtime.

Atomic
++++++

We cannot use ``#pragma omp atomic`` because neither braces nor calls to other functions are allowed directly after ``#pragma omp atomic``.
Because we are implementing the CUDA atomic operations, which return the old value, ``#pragma omp critical`` has to be used instead.
``omp_set_lock`` would be an alternative but is usually slower.

CUDA
````

Nearly all CUDA functionality can be directly mapped to alpaka function calls.
A major difference is that CUDA requires the block and grid sizes to be given in (x, y, z) order, while alpaka uses the C/C++ array indexing scheme [z][y][x]. In both cases x is the innermost / fastest running index.
Furthermore, alpaka does not require the indices and extents to be 3-dimensional.
The accelerators are templatized on and support arbitrary dimensionality.

.. note::

   Currently the CUDA implementation is restricted to a maximum of 3 dimensions!

.. note::

   You have to be careful when mixing alpaka and non-alpaka CUDA code. The CUDA-accelerator back-end can change the current CUDA device and will NOT set the device back to the one active prior to the invocation of the alpaka function.

Programming Interface
---------------------

**Function Attributes**

Depending on the CMake argument ``ALPAKA_ACC_GPU_CUDA_ONLY_MODE`` the function attributes are defined differently.

*ALPAKA_ACC_GPU_CUDA_ONLY_MODE=OFF* (default)

.. table::
+-----------------------------------------------------+---------------------------------------------------------+
| CUDA | alpaka |
+=====================================================+=========================================================+
| ``__host__`` | ``ALPAKA_FN_HOST`` |
+-----------------------------------------------------+---------------------------------------------------------+
| ``__device__`` | -- |
+-----------------------------------------------------+---------------------------------------------------------+
| ``__global__`` | -- |
+-----------------------------------------------------+---------------------------------------------------------+
| ``__host__ __device__`` | ``ALPAKA_FN_HOST_ACC``, ``ALPAKA_FN_ACC`` |
+-----------------------------------------------------+---------------------------------------------------------+

*ALPAKA_ACC_GPU_CUDA_ONLY_MODE=ON*

.. table::
+-----------------------------------------------------+---------------------------------------------------------+
| CUDA | alpaka |
+=====================================================+=========================================================+
| ``__host__`` | ``ALPAKA_FN_HOST`` |
+-----------------------------------------------------+---------------------------------------------------------+
| ``__device__`` | ``ALPAKA_FN_ACC`` |
+-----------------------------------------------------+---------------------------------------------------------+
| ``__global__`` | -- |
+-----------------------------------------------------+---------------------------------------------------------+
| ``__host__ __device__`` | ``ALPAKA_FN_HOST_ACC`` |
+-----------------------------------------------------+---------------------------------------------------------+
.. note::

   There is no alpaka equivalent to ``__global__`` because the design of alpaka does not allow it. When running an alpaka kernel, alpaka creates a ``__global__`` kernel that performs some setup, such as creating the ``acc`` object, and then runs the user kernel, which must be a CUDA ``__device__`` function.

.. note::

   You cannot call CUDA-only methods, except when ``ALPAKA_ACC_GPU_CUDA_ONLY_MODE`` is enabled.

.. note::

   When calling a ``constexpr`` function from inside a device function, also mark the called function as a device function, e.g. by prepending ``ALPAKA_FN_ACC``.
   Some compilers do this by default, but not all.
   For details please refer to `#1580 <https://github.com/alpaka-group/alpaka/issues/1580>`_.

**Memory**

.. table::
+-----------------------------------------------------+----------------------------------------------------------------------------+
| CUDA | alpaka |
+=====================================================+============================================================================+
| ``__shared__`` | ``alpaka::declareSharedVar<std::uint32_t, __COUNTER__>(acc)`` |
+-----------------------------------------------------+----------------------------------------------------------------------------+
| ``__constant__`` | ``ALPAKA_STATIC_ACC_MEM_CONSTANT`` |
+-----------------------------------------------------+----------------------------------------------------------------------------+
| ``__device__`` | ``ALPAKA_STATIC_ACC_MEM_GLOBAL`` |
+-----------------------------------------------------+----------------------------------------------------------------------------+

.. doxygenfunction:: alpaka::declareSharedVar
   :project: alpaka

.. doxygendefine:: ALPAKA_STATIC_ACC_MEM_CONSTANT
   :project: alpaka

.. doxygendefine:: ALPAKA_STATIC_ACC_MEM_GLOBAL
   :project: alpaka

*Index / Work Division*

.. table::
+---------------------------------+----------------------------------------------------------------------------------+
| CUDA | alpaka |
+=================================+==================================================================================+
| ``threadIdx`` | ``alpaka::getIdx<alpaka::Block, alpaka::Threads>(acc)`` |
+---------------------------------+----------------------------------------------------------------------------------+
| ``blockIdx`` | ``alpaka::getIdx<alpaka::Grid, alpaka::Blocks>(acc)`` |
+---------------------------------+----------------------------------------------------------------------------------+
| ``blockDim`` | ``alpaka::getWorkDiv<alpaka::Block, alpaka::Threads>(acc)`` |
+---------------------------------+----------------------------------------------------------------------------------+
| ``gridDim`` | ``alpaka::getWorkDiv<alpaka::Grid, alpaka::Blocks>(acc)`` |
+---------------------------------+----------------------------------------------------------------------------------+
| ``warpSize`` | ``alpaka::warp::getSize(acc)`` |
+---------------------------------+----------------------------------------------------------------------------------+

*Types*

.. table::
+----------+-------------------------------------+
| CUDA | alpaka |
+==========+=====================================+
| ``dim3`` | ``alpaka::Vec< TDim, TVal >`` |
+----------+-------------------------------------+

CUDA Runtime API
++++++++++++++++

The following tables list the functions available in the `CUDA Runtime API <https://docs.nvidia.com/cuda/cuda-runtime-api/modules.html#modules>`_ and their equivalent alpaka functions:

*Device Management*

.. table::
+---------------------------------+-----------------------------------------------------------------------+
| CUDA | alpaka |
+=================================+=======================================================================+
| cudaChooseDevice | -- |
+---------------------------------+-----------------------------------------------------------------------+
| cudaDeviceGetAttribute | -- |
+---------------------------------+-----------------------------------------------------------------------+
| cudaDeviceGetByPCIBusId | -- |
+---------------------------------+-----------------------------------------------------------------------+
| cudaDeviceGetCacheConfig | -- |
+---------------------------------+-----------------------------------------------------------------------+
| cudaDeviceGetLimit | -- |
+---------------------------------+-----------------------------------------------------------------------+
| cudaDeviceGetP2PAttribute | -- |
+---------------------------------+-----------------------------------------------------------------------+
| cudaDeviceGetPCIBusId | -- |
+---------------------------------+-----------------------------------------------------------------------+
| cudaDeviceGetSharedMemConfig | -- |
+---------------------------------+-----------------------------------------------------------------------+
| cudaDeviceGetQueuePriorityRange | -- |
+---------------------------------+-----------------------------------------------------------------------+
| cudaDeviceReset | alpaka::reset(device) |
+---------------------------------+-----------------------------------------------------------------------+
| cudaDeviceSetCacheConfig | -- |
+---------------------------------+-----------------------------------------------------------------------+
| cudaDeviceSetLimit | -- |
+---------------------------------+-----------------------------------------------------------------------+
| cudaDeviceSetSharedMemConfig | -- |
+---------------------------------+-----------------------------------------------------------------------+
| cudaDeviceSynchronize | void alpaka::wait(device) |
+---------------------------------+-----------------------------------------------------------------------+
| cudaGetDevice | n/a (no current device) |
+---------------------------------+-----------------------------------------------------------------------+
| cudaGetDeviceCount              | std::size_t alpaka::getDevCount< TPlatform >()                        |
+---------------------------------+-----------------------------------------------------------------------+
| cudaGetDeviceFlags | -- |
+---------------------------------+-----------------------------------------------------------------------+
| cudaGetDeviceProperties | alpaka::getAccDevProps(dev) (Only some properties available) |
+---------------------------------+-----------------------------------------------------------------------+
| cudaIpcCloseMemHandle | -- |
+---------------------------------+-----------------------------------------------------------------------+
| cudaIpcGetEventHandle | -- |
+---------------------------------+-----------------------------------------------------------------------+
| cudaIpcGetMemHandle | -- |
+---------------------------------+-----------------------------------------------------------------------+
| cudaIpcOpenEventHandle | -- |
+---------------------------------+-----------------------------------------------------------------------+
| cudaIpcOpenMemHandle | -- |
+---------------------------------+-----------------------------------------------------------------------+
| cudaSetDevice | n/a (no current device) |
+---------------------------------+-----------------------------------------------------------------------+
| cudaSetDeviceFlags | -- |
+---------------------------------+-----------------------------------------------------------------------+
| cudaSetValidDevices | -- |
+---------------------------------+-----------------------------------------------------------------------+

*Error Handling*

.. table::
+---------------------+----------------------------------------------------------+
| CUDA | alpaka |
+=====================+==========================================================+
| cudaGetErrorName | n/a (handled internally, available in exception message) |
+---------------------+----------------------------------------------------------+
| cudaGetErrorString | n/a (handled internally, available in exception message) |
+---------------------+----------------------------------------------------------+
| cudaGetLastError | n/a (handled internally) |
+---------------------+----------------------------------------------------------+
| cudaPeekAtLastError | n/a (handled internally) |
+---------------------+----------------------------------------------------------+

*Queue Management*

.. table::
+------------------------------+---------------------------------------------------------+
| CUDA | alpaka |
+==============================+=========================================================+
| cudaLaunchHostFunc | alpaka::enqueue(queue, [](){dosomething();}) |
| | |
| cudaStreamAddCallback | \ |
+------------------------------+---------------------------------------------------------+
| cudaStreamAttachMemAsync | -- |
+------------------------------+---------------------------------------------------------+
| cudaStreamCreate | - queue=alpaka::QueueCudaRtNonBlocking(device); |
| \ | - queue=alpaka::QueueCudaRtBlocking(device); |
+------------------------------+---------------------------------------------------------+
| cudaStreamCreateWithFlags | see cudaStreamCreate (cudaStreamNonBlocking hard coded) |
+------------------------------+---------------------------------------------------------+
| cudaStreamCreateWithPriority | -- |
+------------------------------+---------------------------------------------------------+
| cudaStreamDestroy | n/a (Destructor) |
+------------------------------+---------------------------------------------------------+
| cudaStreamGetFlags | -- |
+------------------------------+---------------------------------------------------------+
| cudaStreamGetPriority | -- |
+------------------------------+---------------------------------------------------------+
| cudaStreamQuery | bool alpaka::empty(queue) |
+------------------------------+---------------------------------------------------------+
| cudaStreamSynchronize | void alpaka::wait(queue) |
+------------------------------+---------------------------------------------------------+
| cudaStreamWaitEvent | void alpaka::wait(queue, event) |
+------------------------------+---------------------------------------------------------+

*Event Management*

.. table::
+--------------------------+--------------------------------------------+
| CUDA | alpaka |
+==========================+============================================+
| cudaEventCreate | alpaka::Event< TQueue > event(dev); |
+--------------------------+--------------------------------------------+
| cudaEventCreateWithFlags | -- |
+--------------------------+--------------------------------------------+
| cudaEventDestroy | n/a (Destructor) |
+--------------------------+--------------------------------------------+
| cudaEventElapsedTime | -- |
+--------------------------+--------------------------------------------+
| cudaEventQuery | bool alpaka::isComplete(event) |
+--------------------------+--------------------------------------------+
| cudaEventRecord | void alpaka::enqueue(queue, event) |
+--------------------------+--------------------------------------------+
| cudaEventSynchronize | void alpaka::wait(event) |
+--------------------------+--------------------------------------------+

*Memory Management*

.. table::
+----------------------------+--------------------------------------------------------------------------------------------+
| CUDA | alpaka |
+============================+============================================================================================+
| cudaArrayGetInfo | -- |
+----------------------------+--------------------------------------------------------------------------------------------+
| cudaFree | n/a (automatic memory management with reference counted memory handles) |
+----------------------------+--------------------------------------------------------------------------------------------+
| cudaFreeArray | -- |
+----------------------------+--------------------------------------------------------------------------------------------+
| cudaFreeAsync | n/a (automatic memory management with reference counted memory handles) |
+----------------------------+--------------------------------------------------------------------------------------------+
| cudaFreeHost | n/a (automatic memory management with reference counted memory handles) |
+----------------------------+--------------------------------------------------------------------------------------------+
| cudaFreeMipmappedArray | -- |
+----------------------------+--------------------------------------------------------------------------------------------+
| cudaGetMipmappedArrayLevel | -- |
+----------------------------+--------------------------------------------------------------------------------------------+
| cudaGetSymbolAddress | -- |
+----------------------------+--------------------------------------------------------------------------------------------+
| cudaGetSymbolSize | -- |
+----------------------------+--------------------------------------------------------------------------------------------+
| cudaHostAlloc | alpaka::allocMappedBuf<TPlatform, TElement>(host, extents) 1D, 2D, 3D supported! |
+----------------------------+--------------------------------------------------------------------------------------------+
| cudaHostGetDevicePointer | -- |
+----------------------------+--------------------------------------------------------------------------------------------+
| cudaHostGetFlags | -- |
+----------------------------+--------------------------------------------------------------------------------------------+
| cudaHostRegister | -- |
+----------------------------+--------------------------------------------------------------------------------------------+
| cudaHostUnregister | -- |
+----------------------------+--------------------------------------------------------------------------------------------+
| cudaMalloc | alpaka::allocBuf<TElement>(device, extents1D) |
+----------------------------+--------------------------------------------------------------------------------------------+
| cudaMalloc3D | alpaka::allocBuf<TElement>(device, extents3D) |
+----------------------------+--------------------------------------------------------------------------------------------+
| cudaMalloc3DArray | -- |
+----------------------------+--------------------------------------------------------------------------------------------+
| cudaMallocArray | -- |
+----------------------------+--------------------------------------------------------------------------------------------+
| cudaMallocAsync | alpaka::allocAsyncBuf<TElement>(queue, extents1D) |
+----------------------------+--------------------------------------------------------------------------------------------+
| cudaMallocHost | alpaka::allocMappedBuf<TPlatform, TElement>(host, extents) 1D, 2D, 3D supported! |
+----------------------------+--------------------------------------------------------------------------------------------+
| cudaMallocManaged | -- |
+----------------------------+--------------------------------------------------------------------------------------------+
| cudaMallocMipmappedArray | -- |
+----------------------------+--------------------------------------------------------------------------------------------+
| cudaMallocPitch | alpaka::allocBuf<TElement>(device, extents2D) |
+----------------------------+--------------------------------------------------------------------------------------------+
| cudaMemAdvise | -- |
+----------------------------+--------------------------------------------------------------------------------------------+
| cudaMemGetInfo | - alpaka::getMemBytes |
| | - alpaka::getFreeMemBytes |
+----------------------------+--------------------------------------------------------------------------------------------+
| cudaMemPrefetchAsync | -- |
+----------------------------+--------------------------------------------------------------------------------------------+
| cudaMemRangeGetAttribute | -- |
+----------------------------+--------------------------------------------------------------------------------------------+
| cudaMemRangeGetAttributes | -- |
+----------------------------+--------------------------------------------------------------------------------------------+
| cudaMemcpy | alpaka::memcpy(queue, memBufDst, memBufSrc, extents1D) |
+----------------------------+--------------------------------------------------------------------------------------------+
| cudaMemcpy2D | alpaka::memcpy(queue, memBufDst, memBufSrc, extents2D) |
+----------------------------+--------------------------------------------------------------------------------------------+
| cudaMemcpy2DArrayToArray | -- |
+----------------------------+--------------------------------------------------------------------------------------------+
| cudaMemcpy2DAsync | alpaka::memcpy(queue, memBufDst, memBufSrc, extents2D) |
+----------------------------+--------------------------------------------------------------------------------------------+
| cudaMemcpy2DFromArray | -- |
+----------------------------+--------------------------------------------------------------------------------------------+
| cudaMemcpy2DFromArrayAsync | -- |
+----------------------------+--------------------------------------------------------------------------------------------+
| cudaMemcpy2DToArray | -- |
+----------------------------+--------------------------------------------------------------------------------------------+
| cudaMemcpy2DToArrayAsync | -- |
+----------------------------+--------------------------------------------------------------------------------------------+
| cudaMemcpy3D | alpaka::memcpy(queue, memBufDst, memBufSrc, extents3D) |
+----------------------------+--------------------------------------------------------------------------------------------+
| cudaMemcpy3DAsync | alpaka::memcpy(queue, memBufDst, memBufSrc, extents3D) |
+----------------------------+--------------------------------------------------------------------------------------------+
| cudaMemcpy3DPeer | alpaka::memcpy(queue, memBufDst, memBufSrc, extents3D) |
+----------------------------+--------------------------------------------------------------------------------------------+
| cudaMemcpy3DPeerAsync | alpaka::memcpy(queue, memBufDst, memBufSrc, extents3D) |
+----------------------------+--------------------------------------------------------------------------------------------+
| cudaMemcpyArrayToArray | -- |
+----------------------------+--------------------------------------------------------------------------------------------+
| cudaMemcpyAsync | alpaka::memcpy(queue, memBufDst, memBufSrc, extents1D) |
+----------------------------+--------------------------------------------------------------------------------------------+
| cudaMemcpyFromArray | -- |
+----------------------------+--------------------------------------------------------------------------------------------+
| cudaMemcpyFromArrayAsync | -- |
+----------------------------+--------------------------------------------------------------------------------------------+
| cudaMemcpyFromSymbol | -- |
+----------------------------+--------------------------------------------------------------------------------------------+
| cudaMemcpyFromSymbolAsync | -- |
+----------------------------+--------------------------------------------------------------------------------------------+
| cudaMemcpyPeer | alpaka::memcpy(queue, memBufDst, memBufSrc, extents1D) |
+----------------------------+--------------------------------------------------------------------------------------------+
| cudaMemcpyPeerAsync | alpaka::memcpy(queue, memBufDst, memBufSrc, extents1D) |
+----------------------------+--------------------------------------------------------------------------------------------+
| cudaMemcpyToArray | -- |
+----------------------------+--------------------------------------------------------------------------------------------+
| cudaMemcpyToArrayAsync | -- |
+----------------------------+--------------------------------------------------------------------------------------------+
| cudaMemcpyToSymbol | -- |
+----------------------------+--------------------------------------------------------------------------------------------+
| cudaMemcpyToSymbolAsync | -- |
+----------------------------+--------------------------------------------------------------------------------------------+
| cudaMemset | alpaka::memset(queue, memBufDst, byte, extents1D) |
+----------------------------+--------------------------------------------------------------------------------------------+
| cudaMemset2D | alpaka::memset(queue, memBufDst, byte, extents2D) |
+----------------------------+--------------------------------------------------------------------------------------------+
| cudaMemset2DAsync          | alpaka::memset(queue, memBufDst, byte, extents2D)                                          |
+----------------------------+--------------------------------------------------------------------------------------------+
| cudaMemset3D | alpaka::memset(queue, memBufDst, byte, extents3D) |
+----------------------------+--------------------------------------------------------------------------------------------+
| cudaMemset3DAsync | alpaka::memset(queue, memBufDst, byte, extents3D) |
+----------------------------+--------------------------------------------------------------------------------------------+
| cudaMemsetAsync | alpaka::memset(queue, memBufDst, byte, extents1D) |
+----------------------------+--------------------------------------------------------------------------------------------+
| make_cudaExtent            | --                                                                                         |
+----------------------------+--------------------------------------------------------------------------------------------+
| make_cudaPitchedPtr        | --                                                                                         |
+----------------------------+--------------------------------------------------------------------------------------------+
| make_cudaPos               | --                                                                                         |
+----------------------------+--------------------------------------------------------------------------------------------+
| cudaMemcpyHostToDevice | n/a (direction of copy is determined automatically) |
+----------------------------+--------------------------------------------------------------------------------------------+
| cudaMemcpyDeviceToHost | n/a (direction of copy is determined automatically) |
+----------------------------+--------------------------------------------------------------------------------------------+
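The memory-management mappings above can be illustrated with a short sketch. This is a hedged example, not part of the table: it assumes the alpaka headers are available, picks the CUDA GPU accelerator, and uses current-style signatures (``getDevByIdx``, ``allocBuf<TElem, TIdx>``); exact spellings vary between alpaka versions.

```cpp
// Sketch: CUDA memory-management calls and their alpaka counterparts.
// Assumes alpaka is installed and built with the CUDA back-end enabled.
#include <alpaka/alpaka.hpp>
#include <cstddef>

void memoryMappingSketch()
{
    using Dim = alpaka::DimInt<1u>;
    using Idx = std::size_t;
    using Acc = alpaka::AccGpuCudaRt<Dim, Idx>;

    auto const platformAcc = alpaka::Platform<Acc>{};
    auto const platformHost = alpaka::PlatformCpu{};
    auto const devAcc = alpaka::getDevByIdx(platformAcc, 0);
    auto const devHost = alpaka::getDevByIdx(platformHost, 0);
    auto queue = alpaka::Queue<Acc, alpaka::Blocking>{devAcc};

    auto const extent = alpaka::Vec<Dim, Idx>{1024u};

    // cudaMalloc / host-side allocation counterparts
    auto bufDev = alpaka::allocBuf<float, Idx>(devAcc, extent);
    auto bufHost = alpaka::allocBuf<float, Idx>(devHost, extent);

    // cudaMemset counterpart: fill the device buffer with a byte value
    alpaka::memset(queue, bufDev, 0u);

    // cudaMemcpy counterparts: the copy direction is deduced from the
    // source and destination buffers, so no cudaMemcpyHostToDevice flag
    alpaka::memcpy(queue, bufDev, bufHost);
    alpaka::memcpy(queue, bufHost, bufDev);

    alpaka::wait(queue);
}
```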
*Execution Control*
.. table::
+----------------------------+--------------------------------------------------------------------------------------------------------------+
| CUDA | alpaka |
+============================+==============================================================================================================+
| cudaFuncGetAttributes | -- |
+----------------------------+--------------------------------------------------------------------------------------------------------------+
| cudaFuncSetCacheConfig | -- |
+----------------------------+--------------------------------------------------------------------------------------------------------------+
| cudaFuncSetSharedMemConfig | -- |
+----------------------------+--------------------------------------------------------------------------------------------------------------+
| cudaLaunchKernel | - alpaka::exec<TAcc>(queue, workDiv, kernel, params...) |
| \ | - auto byteDynSharedMem = alpaka::getBlockSharedMemDynSizeBytes(kernel, ...) |
+----------------------------+--------------------------------------------------------------------------------------------------------------+
| cudaSetDoubleForDevice | n/a (alpaka assumes double support) |
+----------------------------+--------------------------------------------------------------------------------------------------------------+
| cudaSetDoubleForHost | n/a (alpaka assumes double support) |
+----------------------------+--------------------------------------------------------------------------------------------------------------+
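The ``cudaLaunchKernel`` row above can be made concrete with a hedged sketch: in alpaka a kernel is a callable whose ``operator()`` takes the accelerator as its first argument, and it is launched via ``alpaka::exec``. The kernel name ``AxpyKernel`` and the work-division values are illustrative only.

```cpp
// Sketch of a kernel definition and launch; assumes the Acc, Dim, Idx,
// devAcc and queue definitions from the memory-management sketch above.
struct AxpyKernel
{
    template<typename TAcc>
    ALPAKA_FN_ACC void operator()(
        TAcc const& acc, float a, float const* x, float* y, std::size_t n) const
    {
        // global thread index, first (and only) dimension
        auto const i = alpaka::getIdx<alpaka::Grid, alpaka::Threads>(acc)[0];
        if(i < n)
            y[i] += a * x[i];
    }
};

// Launch (cf. cudaLaunchKernel); blocksPerGrid, threadsPerBlock and
// elemsPerThread are chosen by the application:
//
//   auto const workDiv = alpaka::WorkDivMembers<Dim, Idx>{
//       blocksPerGrid, threadsPerBlock, elemsPerThread};
//   alpaka::exec<Acc>(queue, workDiv, AxpyKernel{}, a, x, y, n);
```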
*Occupancy*
.. table::
+--------------------------------------------------------+--------+
| CUDA | alpaka |
+========================================================+========+
| cudaOccupancyMaxActiveBlocksPerMultiprocessor | -- |
+--------------------------------------------------------+--------+
| cudaOccupancyMaxActiveBlocksPerMultiprocessorWithFlags | -- |
+--------------------------------------------------------+--------+
*Unified Addressing*
.. table::
+--------------------------+--------+
| CUDA | alpaka |
+==========================+========+
| cudaPointerGetAttributes | -- |
+--------------------------+--------+
*Peer Device Memory Access*
.. table::
+-----------------------------+----------------------------------+
| CUDA | alpaka |
+=============================+==================================+
| cudaDeviceCanAccessPeer | -- |
+-----------------------------+----------------------------------+
| cudaDeviceDisablePeerAccess | -- |
+-----------------------------+----------------------------------+
| cudaDeviceEnablePeerAccess | automatically done when required |
+-----------------------------+----------------------------------+
**OpenGL, Direct3D, VDPAU, EGL, Graphics Interoperability**
*not available*
**Texture/Surface Reference/Object Management**
*not available*
**Version Management**
*not available*
HIP
```
.. warning::
This HIP documentation is outdated and needs to be reworked.
Current Restrictions on HCC platform
++++++++++++++++++++++++++++++++++++
- Workaround for unsupported ``syncthreads_{count|and|or}``.
- Uses temporary shared value and atomics
- Workaround for buggy ``hipStreamQuery``, ``hipStreamSynchronize``.
- Introduces own queue management
- ``hipStreamQuery`` and ``hipStreamSynchronize`` do not work in a multithreaded environment
- Workaround for missing ``cuStreamWaitValue32``.
- Polls the value every 10 ms
- Device constant memory not supported yet
- Note that ``printf`` in kernels is still not supported in HIP
- Do not call ``hipMalloc3D`` or ``hipMallocPitch`` with a size of zero, otherwise they throw an unknown error
- ``TestAccs`` excludes 3D specialization of HIP back-end for now because ``verifyBytesSet`` fails in ``memView`` for 3D specialization
- ``dim3`` structure is not available on device (use ``alpaka::Vec`` instead)
- Constructors' host/device attributes are unified with destructors'.
- Host/device signature must match in HIP(HCC)
- A chain of functions must also provide correct host-device signatures
- E.g. a host function cannot be called from a host-device function
- Recompile your target when the HCC linker returns the error:
  "File format not recognized
  clang-7: error: linker command failed with exit code 1"
- If a compile error occurred, the linker may still link, but without the device code
- AMD device architecture currently hardcoded in ``alpakaConfig.cmake``
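The ``dim3`` restriction above deserves a short illustration. This is a hedged sketch only: ``alpaka::Vec`` is the portable replacement usable in both host and device code, and note that alpaka orders dimensions slowest-varying first, unlike ``dim3``'s ``x, y, z`` members.

```cpp
// dim3 is not available in HIP(HCC) device code; alpaka::Vec provides an
// equivalent fixed-size index vector on host and device alike.
#include <alpaka/alpaka.hpp>
#include <cstddef>

using Dim3 = alpaka::DimInt<3u>;
using Idx = std::size_t;

// cf. dim3{16, 8, 4} -- alpaka lists the slowest-varying dimension first
auto const extent = alpaka::Vec<Dim3, Idx>{4u, 8u, 16u};
auto const slowest = extent[0]; // z-like dimension, value 4
```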
Compiling HIP from Source
+++++++++++++++++++++++++
Follow `HIP Installation`_ guide for installing HIP.
HIP requires either *nvcc* or *hcc* to be installed on your system (see guide for further details).
.. _HIP Installation: https://github.com/ROCm-Developer-Tools/HIP/blob/master/INSTALL.md
- If you want the HIP binaries to be located in a directory that does not require superuser access, be sure to change the install directory of HIP by modifying the ``CMAKE_INSTALL_PREFIX`` cmake variable.
- Also, after the installation is complete, add the following line to the ``.profile`` file in your home directory to add the path of the HIP binaries to ``PATH``: ``PATH=$PATH:<path_to_binaries>``
.. code-block::
git clone --recursive https://github.com/ROCm-Developer-Tools/HIP.git
cd HIP
mkdir -p build
cd build
cmake -DCMAKE_BUILD_TYPE="${CMAKE_BUILD_TYPE}" -DCMAKE_INSTALL_PREFIX=${YOUR_HIP_INSTALL_DIR} -DBUILD_TESTING=OFF ..
make
make install
- Set the appropriate paths (edit ``${YOUR_**}`` variables)
.. code-block::
# HIP_PATH required by HIP tools
export HIP_PATH=${YOUR_HIP_INSTALL_DIR}
# Paths required by HIP tools
export CUDA_PATH=${YOUR_CUDA_ROOT}
# - if required, path to HCC compiler. Default /opt/rocm/hcc.
export HCC_HOME=${YOUR_HCC_ROOT}
# - if required, path to HSA include, lib. Default /opt/rocm/hsa.
export HSA_PATH=${YOUR_HSA_PATH}
# HIP binaries and libraries
export PATH=${HIP_PATH}/bin:$PATH
export LD_LIBRARY_PATH=${HIP_PATH}/lib64:${LD_LIBRARY_PATH}
- Test the HIP binaries
.. code-block::
# calls nvcc or hcc
which hipcc
hipcc -V
which hipconfig
hipconfig -v
Verifying HIP Installation
++++++++++++++++++++++++++
- If ``PATH`` points to the location of the HIP binaries, the following command should list several relevant environment variables as well as the compiler selected on your system: ``hipconfig -f``
- Compile and run the `square sample`_, as pointed out in the original `HIP install guide`_.
.. _square sample: https://github.com/ROCm-Developer-Tools/HIP/tree/master/samples/0_Intro/square
.. _HIP install guide: https://github.com/ROCm-Developer-Tools/HIP/blob/master/INSTALL.md#user-content-verify-your-installation
Compiling Examples with HIP Back End
++++++++++++++++++++++++++++++++++++
As of now, the back-end has only been tested on the NVIDIA platform.
* NVIDIA Platform
* One issue in this branch of alpaka is that the host compiler flags don't propagate to the device compiler, as they do in CUDA. This is because a counterpart to the ``CUDA_PROPAGATE_HOST_FLAGS`` cmake variable has not been defined in the FindHIP.cmake file.
alpaka forwards the host compiler flags in cmake to the ``HIP_NVCC_FLAGS`` cmake variable, which also takes user-given flags. To add flags to this variable, toggle the advanced mode in ``ccmake``.
Random Number Generator Library rocRAND for HIP Back End
++++++++++++++++++++++++++++++++++++++++++++++++++++++++
*rocRAND* provides an interface for HIP, where the cuRAND or rocRAND API is called depending on the chosen HIP platform (can be configured with cmake in alpaka).
Clone the rocRAND repository, then build and install it
.. code-block::
git clone https://github.com/ROCmSoftwarePlatform/rocRAND
cd rocRAND
mkdir -p build
cd build
cmake -DCMAKE_INSTALL_PREFIX=${HIP_PATH} -DBUILD_BENCHMARK=OFF -DBUILD_TEST=OFF -DCMAKE_MODULE_PATH=${HIP_PATH}/cmake ..
make
``CMAKE_MODULE_PATH`` is a cmake variable used to locate module-finding scripts such as *FindHIP.cmake*.
The paths to the *rocRAND* library and include directories should be appended to the ``CMAKE_PREFIX_PATH`` variable.