Introduce the DPC++ and LevelZero device driver #486
Open
therault wants to merge 3 commits into ICLDisco:master from therault:level_zero_master_rebase
@abouteiller can you test this PR on a HIP machine? Is the HIP port still working? I'm testing on CUDA and LevelZero machines.
…y, so we must point to the source ones
…n DTD and PTG.
This branch is based on common_gpu and should be merged only after common_gpu
Add a new level_zero device (WIP)
- copy device_cuda in device_level_zero and rename things
- module_init and module_fini for level_zero
Need to factorize a little bit more.
Factorizing (need to do it in base)
Port above new common
Add DPC++ to the loop...
- Add multiple CMake logic files and commands
- jdf2c.c now generates dpcpp output files when needed
- make DEV_DPCPP be an alias to DEV_LEVEL_ZERO
- Command Lists for I/O (streams of id 0 and 1) are still immediate
- Command Lists for computations (streams of id >= 2) are now normal lists connected to a queue; that queue exists as a compute level-zero queue and as a DPC++ queue
- Missing compilation logic to compile generated dpc++ code and link it with the target binary
Risk: it is unclear whether the user can still push orders / events in the command list after it is closed, and it is necessary to close it to force the orders to be pushed on the queue. I might need to create a new command list after each close, and attach the command list to the event for garbage collection.
Adapt findlevel-zero.cmake to support systems where pkg-config is broken
Re-enable Level Zero test; update to latest level zero / oneAPI API
Update wrapper to allow testing both CUDA and Level Zero with new Level Zero update
use_cuda / use_cuda_index have been renamed to follow proper naming scheme; do the same for level_zero
Try to automate DPCPP generated code compilation; fix ordinal of memory allocation request in wrapper.
Command Lists need to be sent to the Command Queue if they are not created immediate (and they cannot be immediate if we want to get their Command Queue, which is necessary for the DPC++ interface)
Typo and multiple CMake fixes to make CMake link with DPCPP generated files
Add a standalone test for Level Zero capability and integration with DPC++ kernels
Rebase the entire Level Zero driver based on the subsystem test
Buffer interface is not required. We can use the USM oneMKL interface, it seems to work ok. Need to check for performance.
We cannot mix immediate and non-immediate command lists apparently. Or at least it makes the passing of command queues unreliable
There is an exception in data.c in how we handle GPU copies; it must be ported to Level Zero too.
The Level Zero runtime has an atexit procedure to delete command queues, and this seems to conflict with our own actions to delete the command queues...
Porting of the DTD GEMM test to Level Zero
NULL is not a valid MPI datatype when compiling with a clone of MPICH. The value doesn't matter in this case, just cast
Manage LEVEL_ZERO devices in DTD
Accept LEVEL_ZERO devices in the PTG generated code
Some fixes in device level_zero
Temp fix for termination detection -- tag size must be made portable. TODO!
Support LEVEL_ZERO devices in the DSL tests
Fix the subsystem test. Need to backport fixes in the MCA device
Fully functional sketch for level zero
Use level-zero fences to synchronize command lists and command queues, because command lists (or work) submitted to the command queues by SYCL (typically oneMKL) can complete in parallel with events belonging to other command lists.
Define the set of globals in DPC++ code after the includes happen to avoid polluting their namespace; cleanup some unused variables
Install LevelZero driver files; set up the environment to find the same LevelZero library as at compile time in PaRSECConfig.cmake
therault force-pushed the level_zero_master_rebase branch from ea932be to 60ed5a1 on February 13, 2023 23:56
Closed
therault added a commit to therault/parsec that referenced this pull request on Aug 30, 2023
therault added a commit to therault/parsec that referenced this pull request on Oct 23, 2023
Now that #570 has been merged, is this PR still necessary?
Not quite: I still need to import the PTG support for DPC++.
Introduce the DPC++ and LevelZero device driver and enable this device in DTD and PTG.
This PR supersedes PR #483, as the changes due to the HIP integration made the port on top of PaRSEC master complicated. All commits of the level_zero branch are squashed into one commit to simplify this port, and many changes are added to that commit to factor the code generation shared between HIP, CUDA, and Level Zero.
As the target BODY is in DPC++, there are two devices: the dpcpp device (natural for PTG, as this is the target BODY language), and the level_zero device (natural for the device, as all operations go through the Level Zero interface). I made these two devices synonyms, which causes some duplication for the default_stage_in/out and the kernel_submit.
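To illustrate what "synonyms" means here, a minimal sketch of a device-type bit mask in which the DPC++ name is an alias of the Level Zero one. The enumerator names, their values, and the helper `body_matches_device` are hypothetical and are not taken from the PaRSEC headers; only the aliasing idea comes from this PR.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical device-type bit flags. Names and values are illustrative
// assumptions, not the actual PaRSEC definitions.
enum DeviceType : std::uint8_t {
    DEV_CPU        = 1u << 0,
    DEV_CUDA       = 1u << 1,
    DEV_HIP        = 1u << 2,
    DEV_LEVEL_ZERO = 1u << 3,
    DEV_DPCPP      = DEV_LEVEL_ZERO  // synonym: same bit, two names
};

// Hypothetical helper: true when a BODY whose accepted device types are
// given by body_mask can run on a device of type `device`.
constexpr bool body_matches_device(std::uint8_t body_mask, DeviceType device) {
    return (body_mask & device) != 0;
}

// The alias makes a dpcpp BODY and a level_zero device compare equal.
static_assert(DEV_DPCPP == DEV_LEVEL_ZERO, "dpcpp and level_zero share one bit");
```

With this layout, a BODY declared for the dpcpp device is selected for a level_zero device without any extra mapping, while the duplication mentioned above would come from hooks that are registered once per name.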
This branch is based on common_gpu and should be merged only after common_gpu
Add a new level_zero device (WIP)
Need to factorize a little bit more.
Factorizing (need to do it in base)
Port above new common
Add DPC++ to the loop...
Risk: it is unclear whether the user can still push orders / events in the command list after it is closed, and it is necessary to close it to force the orders to be pushed on the queue. I might need to create a new command list after each close, and attach the command list to the event for garbage collection.
Adapt findlevel-zero.cmake to support systems where pkg-config is broken
Re-enable Level Zero test; update to latest level zero / oneAPI API
Update wrapper to allow testing both CUDA and Level Zero with new Level Zero update
use_cuda / use_cuda_index have been renamed to follow proper naming scheme; do the same for level_zero
Try to automate DPCPP generated code compilation; fix ordinal of memory allocation request in wrapper.
Command Lists need to be sent to the Command Queue if they are not created immediate (and they cannot be immediate if we want to get their Command Queue, which is necessary for the DPC++ interface)
Typo and multiple CMake fixes to make CMake link with DPCPP generated files
Add a standalone test for Level Zero capability and integration with DPC++ kernels
Rebase the entire Level Zero driver based on the subsystem test
Buffer interface is not required. We can use the USM oneMKL interface, it seems to work ok. Need to check for performance.
We cannot mix immediate and non-immediate command lists apparently. Or at least it makes the passing of command queues unreliable
There is an exception in data.c in how we handle GPU copies; it must be ported to Level Zero too.
The Level Zero runtime has an atexit procedure to delete command queues, and this seems to conflict with our own actions to delete the command queues...
Porting of the DTD GEMM test to Level Zero
NULL is not a valid MPI datatype when compiling with a clone of MPICH. The value doesn't matter in this case, just cast
Manage LEVEL_ZERO devices in DTD
Accept LEVEL_ZERO devices in the PTG generated code
Some fixes in device level_zero
Temp fix for termination detection -- tag size must be made portable. TODO!
Support LEVEL_ZERO devices in the DSL tests
Fix the subsystem test. Need to backport fixes in the MCA device
Fully functional sketch for level zero
Use level-zero fences to synchronize command lists and command queues, because command lists (or work) submitted to the command queues by SYCL (typically oneMKL) can complete in parallel with events belonging to other command lists.
Define the set of globals in DPC++ code after the includes happen to avoid polluting their namespace; cleanup some unused variables
Install LevelZero driver files; set up the environment to find the same LevelZero library as at compile time in PaRSECConfig.cmake
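The "Risk" note above proposes creating a new command list after each close and attaching the closed list to the event for garbage collection. A rough model of that lifecycle, using plain C++ stand-ins rather than the Level Zero API: `MockCommandList`, `MockEvent`, and `MockStream` are hypothetical types, and the rule that a closed list rejects new commands is an assumption taken from the risk note, not confirmed behavior of the runtime.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <memory>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Toy stand-in for a command list. Assumption: once closed, it accepts
// no further commands.
struct MockCommandList {
    std::vector<std::string> commands;
    bool closed = false;
    void append(const std::string& cmd) {
        if (closed) throw std::logic_error("append on a closed command list");
        commands.push_back(cmd);
    }
    void close() { closed = true; }
};

// Toy event that owns the command list submitted with it, so the list is
// released when the completed event is collected.
struct MockEvent {
    std::unique_ptr<MockCommandList> list;
    bool complete = false;
};

class MockStream {
public:
    // Commands always go to the current (open) list.
    void append(const std::string& cmd) { current_->append(cmd); }

    // Close the current list, attach it to a new in-flight event, and
    // start a fresh list so the caller can keep appending.
    MockEvent* submit() {
        current_->close();
        auto ev = std::make_unique<MockEvent>();
        ev->list = std::move(current_);
        current_ = std::make_unique<MockCommandList>();
        in_flight_.push_back(std::move(ev));
        return in_flight_.back().get();
    }

    // Free completed events together with the lists they own.
    // Returns the number of events released.
    std::size_t collect() {
        const std::size_t before = in_flight_.size();
        in_flight_.erase(
            std::remove_if(in_flight_.begin(), in_flight_.end(),
                           [](const std::unique_ptr<MockEvent>& e) {
                               return e->complete;
                           }),
            in_flight_.end());
        return before - in_flight_.size();
    }

    std::size_t in_flight() const { return in_flight_.size(); }

private:
    std::unique_ptr<MockCommandList> current_ =
        std::make_unique<MockCommandList>();
    std::vector<std::unique_ptr<MockEvent>> in_flight_;
};
```

In the real driver the close and submit steps would map to `zeCommandListClose` and `zeCommandQueueExecuteCommandLists`, and completion would be detected through an event or fence; none of those calls appear in this model.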