[LLVM][GPU] Added CUDADriver to execute benchmark on GPU #829
Conversation
Now, the CLI has two options, `cpu` and `gpu`, that allow users to target different platforms. For example:
```
bin/nmodl mod/test.mod -o out llvm --ir
bin/nmodl mod/test.mod -o out llvm --ir cpu --name skylake --vector-width 2
bin/nmodl mod/test.mod -o out llvm --ir gpu --name cuda
```
Moreover, the `assume_no_alias` option was dropped and its behaviour made the default (it did not affect the computation in our experiments). The new CLI looks like:
```
llvm
  LLVM code generation option
  Options:
    --ir REQUIRED                 Generate LLVM IR (false)
    --no-debug                    Disable debug information (false)
    --opt-level-ir INT:{0,1,2,3}  LLVM IR optimisation level (O0)
    --single-precision            Use single precision floating-point types (false)
    --fmf TEXT:{afn,arcp,contract,ninf,nnan,nsz,reassoc,fast} ...
                                  Fast math flags for floating-point optimizations (none)

cpu
  LLVM CPU option
  Options:
    --name TEXT                   Name of CPU platform to use
    --math-library TEXT:{Accelerate,libmvec,libsystem_m,MASSV,SLEEF,SVML,none}
                                  Math library for SIMD code generation (none)
    --vector-width INT            Explicit vectorization width for IR generation (1)

gpu
  LLVM GPU option
  Options:
    --name TEXT                   Name of GPU platform to use
    --math-library TEXT:{libdevice}
                                  Math library for GPU code generation (none)

benchmark
  LLVM benchmark option
  Options:
    --run                         Run LLVM benchmark (false)
    --opt-level-codegen INT:{0,1,2,3}
                                  Machine code optimisation level (O0)
    --libs TEXT:FILE ...          Shared libraries to link IR against
    --instance-size INT           Instance struct size (10000)
    --repeat INT                  Number of experiments for benchmarking (100)
```
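For illustration, the subcommands can also be combined in a single invocation. The exact composition below is an assumption pieced together from the examples and help text above, not a command taken from the PR:
```
# Hypothetical end-to-end invocation (composition assumed from the help text above):
# generate IR for a CUDA GPU, link libdevice, and run the benchmark 100 times.
bin/nmodl mod/test.mod -o out \
    llvm --ir \
    gpu --name cuda --math-library libdevice \
    benchmark --run --instance-size 10000 --repeat 100
```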
This commit introduces a handy `Platform` class that is designed to incorporate target information for IR generation, such as precision, vectorization width (if applicable), type of target (CPU/GPU), etc. In the future, more functionality can be added to `Platform`, e.g. we can move the handling of `llvm::Target`, math SIMD libraries, etc. into it. Note: this is just a very basic implementation that enables easier integration of GPU code generation.
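As a rough illustration of the idea (the actual interface in the PR may differ; all member and method names below are assumptions), such a class could look like:
```cpp
// Sketch only: a minimal Platform abstraction, assuming this shape of API.
// The real class in the PR may expose different members and methods.
#include <string>

enum class PlatformType { CPU, GPU };

class Platform {
  public:
    Platform(PlatformType type, std::string name, bool single_precision, int vector_width)
        : type_(type)
        , name_(std::move(name))
        , single_precision_(single_precision)
        , vector_width_(vector_width) {}

    bool is_gpu() const { return type_ == PlatformType::GPU; }
    bool is_cpu_with_simd() const { return type_ == PlatformType::CPU && vector_width_ > 1; }
    int vector_width() const { return vector_width_; }
    bool use_single_precision() const { return single_precision_; }
    const std::string& name() const { return name_; }

  private:
    PlatformType type_;       // CPU or GPU target
    std::string name_;        // e.g. "skylake" or "cuda"
    bool single_precision_;   // float vs. double IR types
    int vector_width_;        // explicit SIMD width (CPU only)
};
```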
This commit adds a new AST node, `CodegenThreadId`, that represents the thread id used in GPU computation. Thanks to the new platform class abstraction, the code that generates the compute body of the NEURON block was adapted to support the AST transformations needed for GPU. Example of the transformation:
```
GPU_ID id
INTEGER node_id
DOUBLE v
IF (id < mech->node_count) {
    node_id = mech->node_index[id]
    v = mech->voltage[node_id]
    mech->m[id] = mech->y[id]+2
}
```
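For intuition, a hand-written equivalent of the transformed body would look roughly like the sketch below. The struct and field names are taken from the example above, while the function signature and the comment about how `id` is obtained are assumptions, not the code the visitor actually generates:
```cpp
// Sketch: what the transformed NEURON block body corresponds to conceptually.
// On the GPU, `id` is the global thread id (e.g. blockIdx.x * blockDim.x + threadIdx.x);
// here it is passed in as a parameter so the snippet stays plain C++.
struct Mechanism {
    int node_count;
    int* node_index;
    double* voltage;
    double* m;
    double* y;
};

void state_body(Mechanism* mech, int id) {
    if (id < mech->node_count) {             // guard against out-of-range thread ids
        int node_id = mech->node_index[id];  // gather the node this instance belongs to
        double v = mech->voltage[node_id];   // read the voltage for that node
        (void) v;                            // v would be used by the real kernel body
        mech->m[id] = mech->y[id] + 2.0;     // the state update from the example above
    }
}
```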
Logfiles from GitLab pipeline #51498 (:white_check_mark:) have been uploaded here!
Logfiles from GitLab pipeline #52306 (:no_entry:) have been uploaded here!
* Rearrange vec_rhs and vec_d to allocate memory properly
* Setup rhs, d and their shadow vectors
* Fix test

Co-authored-by: Ioannis Magkanaris <ioannis.magkanaris@epfl.ch>
Logfiles from GitLab pipeline #52484 (:white_check_mark:) have been uploaded here!
Logfiles from GitLab pipeline #52530 (:no_entry:) have been uploaded here!
Looks good!
Logfiles from GitLab pipeline #52564 (:white_check_mark:) have been uploaded here!
I quickly skimmed through the changes; there is nothing major I can point out. Apart from the clarification comments, this is good from my side.
```cpp
jitOptions[1] = CU_JIT_INFO_LOG_BUFFER;
char* jitLogBuffer = new char[jitLogBufferSize];
jitOptVals[1] = jitLogBuffer;
```
Are objects that are explicitly allocated, like `jitLogBuffer` and `jitOptions`, internally freed by LLVM?
I'm not sure about this, actually. After their usage I free them explicitly to make sure they are freed.
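To make the ownership explicit: the CUDA driver API only writes into buffers the caller passes in via `CU_JIT_INFO_LOG_BUFFER` and never takes ownership of them, so freeing them after the call is the right thing to do. A minimal sketch of that pattern follows; the buffer size, variable names and error handling are assumptions, not the PR's exact code:
```cpp
#include <cuda.h>
#include <cstdint>
#include <cstdio>
#include <string>

// Sketch: JIT-load a PTX string and capture the JIT info log in a caller-owned buffer.
// The driver copies messages into `jitLogBuffer`; it does not free it, so we must.
CUmodule loadPtx(const std::string& ptx) {
    const unsigned int jitLogBufferSize = 8192;  // assumed size
    char* jitLogBuffer = new char[jitLogBufferSize];

    CUjit_option jitOptions[2];
    void* jitOptVals[2];
    jitOptions[0] = CU_JIT_INFO_LOG_BUFFER_SIZE_BYTES;
    jitOptVals[0] = reinterpret_cast<void*>(static_cast<std::uintptr_t>(jitLogBufferSize));
    jitOptions[1] = CU_JIT_INFO_LOG_BUFFER;
    jitOptVals[1] = jitLogBuffer;

    CUmodule module = nullptr;
    CUresult status = cuModuleLoadDataEx(&module, ptx.c_str(), 2, jitOptions, jitOptVals);
    if (status != CUDA_SUCCESS) {
        std::fprintf(stderr, "JIT log: %s\n", jitLogBuffer);
    }

    delete[] jitLogBuffer;  // caller-owned: neither LLVM nor the CUDA driver frees this
    return module;
}
```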
…workflow and made clang-format happier
… magkanar/gpu-runner
Logfiles from GitLab pipeline #53000 (:white_check_mark:) have been uploaded here!
Logfiles from GitLab pipeline #53001 (:white_check_mark:) have been uploaded here!
Logfiles from GitLab pipeline #53525 (:no_entry:) have been uploaded here!
- Added CUDADriver to compile the LLVM IR string generated from CodegenLLVMVisitor to a PTX string and then execute it using the CUDA API (see the sketch after this list)
- Ability to select the compilation GPU architecture and then set the proper GPU architecture based on the GPU that is going to be used
- Link the `libdevice` math library with the GPU LLVM module
- Handle kernel and wrapper function attributes properly for GPU execution (the wrapper function is `kernel` and the kernel attribute is `device`)
- Small fixes in the InstanceStruct declaration and setup to allocate the pointer variables properly, including the shadow variables
- Add tests in the CI that run small benchmarks on CPU and GPU on BB5
- Add replacement of the `log` math function for SLEEF and libdevice, and of `pow` and `fabs` for libdevice
- Add GPU execution ability in PyJIT
- Small improvement in the PyJIT benchmark Python script to handle arguments and GPU execution
- Separate benchmark info from the benchmark driver
- Add hh and expsyn mod files in benchmarking tests
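The overall driver flow can be pictured roughly as follows. This is a minimal sketch of the standard CUDA driver API sequence, assuming the PTX string is already generated; it does not reproduce the PR's actual CUDADriver class, and the kernel name, launch configuration and argument marshalling are assumptions:
```cpp
#include <cuda.h>
#include <string>

// Sketch: execute a PTX string produced from LLVM IR via the CUDA driver API.
// All names ("nrn_state_hh", grid/block sizes, kernel arguments) are hypothetical;
// error checking is omitted for brevity.
void runPtxKernel(const std::string& ptx, void* instance_struct, int node_count) {
    cuInit(0);

    CUdevice device;
    cuDeviceGet(&device, 0);  // pick the first visible GPU

    CUcontext context;
    cuCtxCreate(&context, 0, device);

    CUmodule module;
    cuModuleLoadData(&module, ptx.c_str());  // JIT-compile the PTX for this GPU

    CUfunction kernel;
    cuModuleGetFunction(&kernel, module, "nrn_state_hh");  // hypothetical kernel name

    void* args[] = {&instance_struct};
    unsigned int threads = 128;
    unsigned int blocks = (node_count + threads - 1) / threads;
    cuLaunchKernel(kernel, blocks, 1, 1, threads, 1, 1,
                   /*sharedMemBytes=*/0, /*stream=*/nullptr, args, nullptr);
    cuCtxSynchronize();

    cuModuleUnload(module);
    cuCtxDestroy(context);
}
```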
`CUDA_ERROR_INVALID_PTX` error with the `libdevice` library: to link `libdevice` we need to run the following.

WIP:
- `nvvmAddModuleToProgram`: passing a string is deprecated
- `Triplet` and `DataLayout` for GPUDriver in a non-hardcoded way (?)

This is the output of NVVM when it tries to compile the kernel with the default attributes:

I am using CUDA 11.4.2.

The related PTX code:

You can reproduce the issue if you remove the `--no-debug` option in the command above.
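One common way to avoid invalid PTX from unresolved math calls is to link the `libdevice` bitcode into the module before PTX emission and to set the NVPTX triple and data layout explicitly. The sketch below shows that approach with the LLVM C++ API; the libdevice path and the data layout string are assumptions, and this is not necessarily how the PR's GPUDriver does it:
```cpp
#include <memory>

#include "llvm/IR/Module.h"
#include "llvm/IRReader/IRReader.h"
#include "llvm/Linker/Linker.h"
#include "llvm/Support/SourceMgr.h"

// Sketch: link NVIDIA's libdevice bitcode into a GPU module and set the NVPTX
// target triple / data layout. The libdevice path and layout string are assumed.
bool linkLibdevice(llvm::Module& module, llvm::LLVMContext& context) {
    // Non-hardcoded in a real driver; hardcoded here for illustration only.
    const char* libdevice_path = "/usr/local/cuda/nvvm/libdevice/libdevice.10.bc";

    module.setTargetTriple("nvptx64-nvidia-cuda");
    module.setDataLayout("e-i64:64-i128:128-v16:16-v32:32-n16:32:64");  // assumed NVPTX64 layout

    llvm::SMDiagnostic error;
    std::unique_ptr<llvm::Module> libdevice = llvm::parseIRFile(libdevice_path, error, context);
    if (!libdevice) {
        return false;
    }

    // Only pull in the __nv_* functions the kernel actually calls.
    return !llvm::Linker::linkModules(module,
                                      std::move(libdevice),
                                      llvm::Linker::Flags::LinkOnlyNeeded);
}
```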