
[LLVM][GPU] Added CUDADriver to execute benchmark on GPU #829

Merged: 125 commits, May 9, 2022

Conversation

@iomaganaris iomaganaris commented Mar 21, 2022

  • Added CUDADriver to compile the LLVM IR string generated by CodegenLLVMVisitor to a PTX string and then execute it using the CUDA API
  • Tested the CUDADriver with an external example of LLVM IR code and it works fine
  • The CUDA version used to build NMODL should match the CUDA driver version installed on the machine where the benchmark is executed; otherwise a CUDA_ERROR_INVALID_PTX error is raised
  • Added linkage to libdevice library
  • To execute the benchmark with libdevice, run the following command:
./bin/nmodl ../test.mod --output "llvm_cuda" --verbose debug llvm --no-debug --ir --opt-level-ir 3 gpu --target-arch "sm_80" --name "nvptx64" --math-library libdevice benchmark --run --libs "${CUDA_ROOT}/nvvm/libdevice/libdevice.10.bc" --opt-level-codegen 3 --instance-size 10000000 --repeat 2 --grid-dim-x 4096 --block-dim-x 256

WIP:

  • At the moment the LLVM IR generated by NMODL doesn't produce correct PTX code because the address spaces are not set properly. Edit: the generated LLVM IR now contains instructions that convert pointers from the generic address space to global pointers, so the generated code is now executable on GPU using this PR
  • The kernel name in the NVVM annotation should be changed to the real kernel name instead of "kernel"
  • Handle compilation options
  • Transform the LLVM IR to bitcode before passing it to nvvmAddModuleToProgram; passing a string is deprecated
  • Find a way to set the Triple and DataLayout for the GPUDriver in a non-hardcoded way (?)
  • Check why code cannot be compiled on GPU with kernel attributes
    This is the output of NVVM when it tries to compile the kernel with the default attributes:
terminate called after throwing an instance of 'std::runtime_error'
  what():  Compilation Log:
 nmodl_kernel: parse Unknown attribute kind (62) (Producer: 'LLVM13.0.0' Reader: 'LLVM 7.0.1')

I am using CUDA 11.4.2.

  • Check why code cannot be compiled on GPU with debug flags. This is the output with debug flags:
[NMODL] [info] :: CUDA JIT ERROR LOG: ptxas application ptx input, line 264; fatal   : Parsing error near '-': syntax error
ptxas fatal   : Ptx assembly aborted due to errors

The related PTX code:

...
   .section	.debug_pubnames
   {
.b32 LpubNames_end0-LpubNames_start0
LpubNames_start0:
.b8 2
...
LpubNames_end0:
   }
   .section	.debug_pubtypes
   {
.b32 LpubTypes_end0-LpubTypes_start0
LpubTypes_start0:
.b8 2
.b8 0
.b32 .debug_info
.b32 182
.b32 0
LpubTypes_end0:
   }
   .section	.debug_loc	{	}

You can reproduce the issue by removing the --no-debug option from the command above

  • Added a GitLab CI test that executes the benchmark on a Cascade Lake CPU node and a V100 GPU node

georgemitenkov and others added 30 commits March 11, 2022 08:46
Now, CLI has two options: `cpu` and `gpu` that allow
users to target different platforms. For example,

```
bin/nmodl mod/test.mod -o out llvm --ir

bin/nmodl mod/test.mod -o out llvm --ir cpu --name skylake --vector-width 2

bin/nmodl mod/test.mod -o out llvm --ir gpu --name cuda
```

Moreover, the `assume_no_alias` option was dropped and
its behaviour made the default (it didn't affect the
computation in our experiments).

The new CLI looks like:
```
llvm
  LLVM code generation option
  Options:
    --ir REQUIRED                         Generate LLVM IR (false)
    --no-debug                            Disable debug information (false)
    --opt-level-ir INT:{0,1,2,3}          LLVM IR optimisation level (O0)
    --single-precision                    Use single precision floating-point types (false)
    --fmf TEXT:{afn,arcp,contract,ninf,nnan,nsz,reassoc,fast} ...
                                          Fast math flags for floating-point optimizations (none)

cpu
  LLVM CPU option
  Options:
    --name TEXT                           Name of CPU platform to use
    --math-library TEXT:{Accelerate,libmvec,libsystem_m,MASSV,SLEEF,SVML,none}
                                          Math library for SIMD code generation (none)
    --vector-width INT                    Explicit vectorization width for IR generation (1)

gpu
  LLVM GPU option
  Options:
    --name TEXT                           Name of GPU platform to use
    --math-library TEXT:{libdevice}       Math library for GPU code generation (none)

benchmark
  LLVM benchmark option
  Options:
    --run                                 Run LLVM benchmark (false)
    --opt-level-codegen INT:{0,1,2,3}     Machine code optimisation level (O0)
    --libs TEXT:FILE ...                  Shared libraries to link IR against
    --instance-size INT                   Instance struct size (10000)
    --repeat INT                          Number of experiments for benchmarking (100)
```
This commit introduces a handy `Platform` class
that is designed to incorporate target information
for IR generation, such as precision, vectorization
width (if applicable), type of target (CPU/GPU), etc.

In the future, more functionality can be added to
`Platform`, e.g. handling of `llvm::Target`,
math SIMD libraries, etc.

Note: this is just a very basic implementation that enables
easier integration of GPU code generation.
This commit adds a new AST node, `CodegenThreadId`, that
represents the thread id used in GPU computation. Thanks to
the new platform class abstraction, the code that generates
the compute body of the NEURON block was adapted to support
the AST transformations needed for GPU.

Example of the transformation:
```
GPU_ID id
INTEGER node_id
DOUBLE v
IF (id<mech->node_count) {
    node_id = mech->node_index[id]
    v = mech->voltage[node_id]
    mech->m[id] = mech->y[id]+2
}
```
@bbpbuildbot
Logfiles from GitLab pipeline #51498 (:white_check_mark:) have been uploaded here!

@bbpbuildbot
Logfiles from GitLab pipeline #52306 (:no_entry:) have been uploaded here!

* Rearrange vec_rhs and vec_d to allocate memory properly
* Setup rhs, d and their shadow vectors
* Fix test

Co-authored-by: Ioannis Magkanaris <ioannis.magkanaris@epfl.ch>
@bbpbuildbot
Logfiles from GitLab pipeline #52484 (:white_check_mark:) have been uploaded here!

@bbpbuildbot
Logfiles from GitLab pipeline #52530 (:no_entry:) have been uploaded here!

@ohm314 (Contributor) left a comment

Looks good!

@bbpbuildbot
Logfiles from GitLab pipeline #52564 (:white_check_mark:) have been uploaded here!


@pramodk (Contributor) left a comment

I quickly skimmed through the changes and there is nothing major I can point out. Apart from the clarification comments, this is good from my side.

test/benchmark/cuda_driver.cpp (resolved)
jitOptions[1] = CU_JIT_INFO_LOG_BUFFER;
char* jitLogBuffer = new char[jitLogBufferSize];
jitOptVals[1] = jitLogBuffer;

Are objects that are explicitly allocated, like jitLogBuffer and jitOptions, freed internally by LLVM?

Contributor Author

I'm not sure about this, actually. After their use I free them explicitly to make sure they are released.

@bbpbuildbot
Logfiles from GitLab pipeline #53000 (:white_check_mark:) have been uploaded here!

@bbpbuildbot
Logfiles from GitLab pipeline #53001 (:white_check_mark:) have been uploaded here!

test/benchmark/cuda_driver.cpp (outdated, resolved)
@bbpbuildbot
Logfiles from GitLab pipeline #53525 (:no_entry:) have been uploaded here!

@iomaganaris iomaganaris merged commit 95782bc into llvm May 9, 2022
@iomaganaris iomaganaris deleted the magkanar/gpu-runner branch May 9, 2022 12:00
iomaganaris added a commit that referenced this pull request May 10, 2022
- Added CUDADriver to compile the LLVM IR string generated by CodegenLLVMVisitor to a PTX string and then execute it using the CUDA API
- Ability to select the GPU architecture for compilation and then set the proper architecture based on the GPU that is going to be used
- Link the `libdevice` math library with the GPU LLVM module
- Handles kernel and wrapper function attributes properly for GPU execution (the wrapper function is `kernel` and the kernel attribute is `device`)
- Small fixes in the InstanceStruct declaration and setup to allocate the pointer variables properly, including the shadow variables
- Adds CI tests that run small benchmarks on CPU and GPU on BB5
- Adds replacement of the `log` math function for SLEEF and libdevice, and of `pow` and `fabs` for libdevice
- Adds GPU execution ability to PyJIT
- Small improvement in the PyJIT benchmark Python script to handle arguments and GPU execution
- Separated benchmark info from the benchmark driver
- Added hh and expsyn mod files to the benchmarking tests
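The `log`/`pow`/`fabs` replacement for libdevice amounts to redirecting those calls to the `__nv_`-prefixed symbols that `libdevice.10.bc` exports (`__nv_log`, `__nv_pow`, and `__nv_fabs` are the actual libdevice names). A toy sketch of such a rewrite over an IR-like string; the replacement map and helper function are illustrative, not NMODL's actual implementation:

```python
import re

# Scalar math calls mapped to their libdevice equivalents.
# The __nv_* names are real libdevice symbols; the map itself is a sketch.
LIBDEVICE_MAP = {"log": "__nv_log", "pow": "__nv_pow", "fabs": "__nv_fabs"}

def replace_with_libdevice(ir_text: str) -> str:
    """Rewrite `@log(...)`-style call targets to libdevice names."""
    pattern = re.compile(r"@(%s)\b" % "|".join(LIBDEVICE_MAP))
    return pattern.sub(lambda m: "@" + LIBDEVICE_MAP[m.group(1)], ir_text)

ir = "%1 = call double @log(double %0)\n" \
     "%2 = call double @pow(double %1, double 2.0)"
# @log -> @__nv_log, @pow -> @__nv_pow
print(replace_with_libdevice(ir))
```

After the rewrite, linking the module against `libdevice.10.bc` resolves the `__nv_*` symbols, which is why the benchmark command above passes the bitcode file via `--libs`.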
iomaganaris added a commit that referenced this pull request May 12, 2022
iomaganaris added a commit that referenced this pull request Sep 15, 2022
iomaganaris added a commit that referenced this pull request Sep 15, 2022