…n DTD and PTG.
This branch is based on common_gpu and should be merged only after common_gpu.
Add a new level_zero device (WIP)
- copy device_cuda into device_level_zero and rename things
- module_init and module_fini for level_zero
Need to factorize a little bit more.
Factorizing (need to do it in base)
Port above new common
Add DPC++ to the loop...
- Add multiple CMake logic files and commands
- jdf2c.c now generates dpcpp output files when needed
- make DEV_DPCPP an alias for DEV_LEVEL_ZERO
- Command Lists for I/O (streams of id 0 and 1) are still immediate
- Command Lists for computations (streams of id >= 2) are now normal lists connected to a queue; that queue exists both as a compute Level Zero queue and as a DPC++ queue
- Missing compilation logic to compile generated dpc++ code and link it with the target binary
Risk: it is unclear whether the user can still push orders / events into a command list after it is closed, yet closing it is necessary to force the orders onto the queue. I might need to create a new command list after each close, and attach the old command list to its completion event for garbage collection.
Adapt findlevel-zero.cmake to support systems where pkg-config is broken
Re-enable Level Zero test; update to latest Level Zero / oneAPI API
Update wrapper to allow testing both CUDA and Level Zero with new Level Zero update
use_cuda / use_cuda_index have been renamed to follow the proper naming scheme; do the same for level_zero
Try to automate compilation of the DPCPP-generated code; fix the ordinal of the memory allocation request in the wrapper.
Command Lists need to be sent to the Command Queue if they are not created immediate (and they cannot be immediate if we want to get their Command Queue, which is necessary for the DPC++ interface)
Typo and multiple CMake fixes to make CMake link with DPCPP generated files
Add a standalone test for Level Zero capability and integration with DPC++ kernels
Rebase the entire Level Zero driver on the subsystem test
Buffer interface is not required: we can use the USM oneMKL interface, and it seems to work OK. Need to check performance.
Apparently we cannot mix immediate and non-immediate command lists; or at least doing so makes the passing of command queues unreliable
There is an exception in data.c in how we handle GPU copies; it must be ported to Level Zero too.
The Level Zero runtime has an atexit procedure to delete command queues, and this seems to conflict with our own deletion of those command queues...
Porting of the DTD GEMM test to Level Zero
NULL is not a valid MPI datatype when compiling with an MPICH clone; the value doesn't matter in this case, so just cast it.
Manage LEVEL_ZERO devices in DTD
Accept LEVEL_ZERO devices in the PTG generated code
Some fixes in device level_zero
Temp fix for termination detection -- tag size must be made portable. TODO!
Support LEVEL_ZERO devices in the DSL tests
Fix the subsystem test. Need to backport fixes in the MCA device
Fully functional sketch for level zero
Use Level Zero fences to synchronize command lists and command queues: command lists (or work) submitted to the command queues by SYCL (typically oneMKL) can complete in parallel with events belonging to other command lists.
Define the set of globals in DPC++ code after the includes to avoid polluting their namespace; clean up some unused variables
Install Level Zero driver files; set up the environment in PaRSECConfig.cmake to find the same Level Zero library as at compile time