
[LLVM][GPU] Added CUDADriver to execute benchmark on GPU #829

Merged: 125 commits, May 9, 2022

Conversation

@iomaganaris iomaganaris commented Mar 21, 2022

  • Added CUDADriver to compile the LLVM IR string generated by CodegenLLVMVisitor to a PTX string and then execute it using the CUDA API
  • Tested the CUDADriver with an external example of LLVM IR code and it works fine
  • The CUDA version used to build NMODL should match the CUDA driver version installed on the machine where the benchmark is executed; otherwise a CUDA_ERROR_INVALID_PTX error is raised
  • Added linkage to libdevice library
  • To execute the benchmark with libdevice, run the following command:
./bin/nmodl ../test.mod --output "llvm_cuda" --verbose debug llvm --no-debug --ir --opt-level-ir 3 gpu --target-arch "sm_80" --name "nvptx64" --math-library libdevice benchmark --run --libs "${CUDA_ROOT}/nvvm/libdevice/libdevice.10.bc" --opt-level-codegen 3 --instance-size 10000000 --repeat 2 --grid-dim-x 4096 --block-dim-x 256

WIP:

  • At the moment the LLVM IR generated by NMODL doesn't produce correct PTX code because the address spaces are not set properly. Edit: the generated LLVM IR now contains instructions that convert pointers from the generic address space to global pointers, so the generated code is now executable on GPU using this PR
  • The kernel name in the NVVM annotation should be changed to the real kernel name instead of "kernel"
  • Handle compilation options
  • Transform the LLVM IR to bitcode before passing it to nvvmAddModuleToProgram; passing a string is deprecated
  • Find a way to set the Triple and DataLayout for the GPUDriver in a non-hardcoded way (?)
  • Check why code cannot be compiled on GPU with kernel attributes
    This is the output of NVVM when it tries to compile the kernel with the default attributes:
terminate called after throwing an instance of 'std::runtime_error'
  what():  Compilation Log:
 nmodl_kernel: parse Unknown attribute kind (62) (Producer: 'LLVM13.0.0' Reader: 'LLVM 7.0.1')

I am using CUDA 11.4.2.

  • Check why code cannot be compiled on GPU with debug flags. This is the output with debug flags:
[NMODL] [info] :: CUDA JIT ERROR LOG: ptxas application ptx input, line 264; fatal   : Parsing error near '-': syntax error
ptxas fatal   : Ptx assembly aborted due to errors

The related PTX code:

...
   .section	.debug_pubnames
   {
.b32 LpubNames_end0-LpubNames_start0
LpubNames_start0:
.b8 2
...
LpubNames_end0:
   }
   .section	.debug_pubtypes
   {
.b32 LpubTypes_end0-LpubTypes_start0
LpubTypes_start0:
.b8 2
.b8 0
.b32 .debug_info
.b32 182
.b32 0
LpubTypes_end0:
   }
   .section	.debug_loc	{	}

You can reproduce the issue by removing the --no-debug option from the command above

  • Added a GitLab CI test that executes the benchmark on a Cascade Lake CPU node and a V100 GPU node

georgemitenkov and others added 30 commits March 11, 2022 08:46
Now, CLI has two options: `cpu` and `gpu` that allow
users to target different platforms. For example,

```
bin/nmodl mod/test.mod -o out llvm --ir

bin/nmodl mod/test.mod -o out llvm --ir cpu --name skylake --vector-width 2

bin/nmodl mod/test.mod -o out llvm --ir gpu --name cuda
```

Moreover, the `assume_no_alias` option was dropped and
its behaviour made the default (it didn't affect the
computation in our experiments).

The new CLI looks like:
```
llvm
  LLVM code generation option
  Options:
    --ir REQUIRED                         Generate LLVM IR (false)
    --no-debug                            Disable debug information (false)
    --opt-level-ir INT:{0,1,2,3}          LLVM IR optimisation level (O0)
    --single-precision                    Use single precision floating-point types (false)
    --fmf TEXT:{afn,arcp,contract,ninf,nnan,nsz,reassoc,fast} ...
                                          Fast math flags for floating-point optimizations (none)

cpu
  LLVM CPU option
  Options:
    --name TEXT                           Name of CPU platform to use
    --math-library TEXT:{Accelerate,libmvec,libsystem_m,MASSV,SLEEF,SVML,none}
                                          Math library for SIMD code generation (none)
    --vector-width INT                    Explicit vectorization width for IR generation (1)

gpu
  LLVM GPU option
  Options:
    --name TEXT                           Name of GPU platform to use
    --math-library TEXT:{libdevice}       Math library for GPU code generation (none)

benchmark
  LLVM benchmark option
  Options:
    --run                                 Run LLVM benchmark (false)
    --opt-level-codegen INT:{0,1,2,3}     Machine code optimisation level (O0)
    --libs TEXT:FILE ...                  Shared libraries to link IR against
    --instance-size INT                   Instance struct size (10000)
    --repeat INT                          Number of experiments for benchmarking (100)
```
This commit introduces a handy `Platform` class
that is designed to incorporate target information
for IR generation, such as precision, vectorization
width (if applicable), type of target (CPU/GPU), etc.

In the future, more functionality can be added to
`Platform`, e.g. handling of `llvm::Target`,
math SIMD libraries, etc.

Note: this is just a very basic implementation that enables
easier integration of GPU code generation.
This commit adds a new AST node, `CodegenThreadId`, that
represents the thread id used in GPU computation. Thanks to
the new platform class abstraction, the code that generates
the compute body of the NEURON block was adapted to support
the AST transformations needed for GPU.

Example of the transformation:
```
GPU_ID id
INTEGER node_id
DOUBLE v
IF (id<mech->node_count) {
    node_id = mech->node_index[id]
    v = mech->voltage[node_id]
    mech->m[id] = mech->y[id]+2
}
```
@bbpbuildbot
Logfiles from GitLab pipeline #51498 (:white_check_mark:) have been uploaded here!

@bbpbuildbot
Logfiles from GitLab pipeline #52306 (:no_entry:) have been uploaded here!

* Rearrange vec_rhs and vec_d to allocate memory properly
* Setup rhs, d and their shadow vectors
* Fix test

Co-authored-by: Ioannis Magkanaris <ioannis.magkanaris@epfl.ch>
@bbpbuildbot
Logfiles from GitLab pipeline #52484 (:white_check_mark:) have been uploaded here!

@bbpbuildbot
Logfiles from GitLab pipeline #52530 (:no_entry:) have been uploaded here!

@ohm314 (Contributor) left a comment

Looks good!

@bbpbuildbot
Logfiles from GitLab pipeline #52564 (:white_check_mark:) have been uploaded here!


@pramodk (Contributor) left a comment

I quickly skimmed through the changes and there is nothing major I can point out. Apart from the clarification comments, this is good from my side.

test/benchmark/cuda_driver.cpp (resolved)
jitOptions[1] = CU_JIT_INFO_LOG_BUFFER;
char* jitLogBuffer = new char[jitLogBufferSize];
jitOptVals[1] = jitLogBuffer;

Are objects that are explicitly allocated, like jitLogBuffer and jitOptions, freed internally by LLVM?

Contributor Author

I'm not sure about this, actually. After their use I free them explicitly to make sure they are released.

@bbpbuildbot
Logfiles from GitLab pipeline #53000 (:white_check_mark:) have been uploaded here!

@bbpbuildbot
Logfiles from GitLab pipeline #53001 (:white_check_mark:) have been uploaded here!

test/benchmark/cuda_driver.cpp (outdated, resolved)
@bbpbuildbot
Logfiles from GitLab pipeline #53525 (:no_entry:) have been uploaded here!

@iomaganaris iomaganaris merged commit 95782bc into llvm May 9, 2022
@iomaganaris iomaganaris deleted the magkanar/gpu-runner branch May 9, 2022 12:00
iomaganaris added a commit that referenced this pull request May 10, 2022
- Added CUDADriver to compile the LLVM IR string generated by CodegenLLVMVisitor to a PTX string and then execute it using the CUDA API
- Ability to select the GPU architecture for compilation and then set the proper architecture based on the GPU that is going to be used
- Link the `libdevice` math library with the GPU LLVM module
- Handles kernel and wrapper function attributes properly for GPU execution (the wrapper function is `kernel` and the kernel attribute is `device`)
- Small fixes in the InstanceStruct declaration and setup to allocate the pointer variables properly, including the shadow variables
- Adds CI tests that run small benchmarks on CPU and GPU on BB5
- Adds replacement of the `log` math function for SLEEF and libdevice, and of `pow` and `fabs` for libdevice
- Adds GPU execution ability to PyJIT
- Small improvement in the PyJIT benchmark Python script to handle arguments and GPU execution
- Separated benchmark info from the benchmark driver
- Added hh and expsyn mod files to the benchmarking tests
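The `log`/`pow`/`fabs` replacement for libdevice amounts to redirecting those calls to the `__nv_`-prefixed symbols that `libdevice.10.bc` exports (`__nv_log`, `__nv_pow`, and `__nv_fabs` are the actual libdevice names). A toy sketch of such a rewrite over an IR-like string; the replacement map and helper function are illustrative, not NMODL's actual implementation:

```python
import re

# Scalar math calls mapped to their libdevice equivalents.
# The __nv_* names are real libdevice symbols; the map itself is a sketch.
LIBDEVICE_MAP = {"log": "__nv_log", "pow": "__nv_pow", "fabs": "__nv_fabs"}

def replace_with_libdevice(ir_text: str) -> str:
    """Rewrite `@log(...)`-style call targets to libdevice names."""
    pattern = re.compile(r"@(%s)\b" % "|".join(LIBDEVICE_MAP))
    return pattern.sub(lambda m: "@" + LIBDEVICE_MAP[m.group(1)], ir_text)

ir = "%1 = call double @log(double %0)\n" \
     "%2 = call double @pow(double %1, double 2.0)"
# @log -> @__nv_log, @pow -> @__nv_pow
print(replace_with_libdevice(ir))
```

After the rewrite, linking the module against `libdevice.10.bc` resolves the `__nv_*` symbols, which is why the benchmark command above passes the bitcode file via `--libs`.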
iomaganaris added a commit that referenced this pull request May 12, 2022
iomaganaris added a commit that referenced this pull request Sep 15, 2022
iomaganaris added a commit that referenced this pull request Sep 15, 2022