Optimise Jitify Preprocessor. #602

Open
Robadob opened this issue Jul 26, 2021 · 6 comments

Robadob commented Jul 26, 2021

Compilation with the main GLM include leads to a 63 second call to the jitify::Program constructor, of which it appears only 600 milliseconds is spent inside NVRTC (createProgram, compileProgram, ..., destroyProgram). This suggests the Jitify preprocessor is to blame. We either need to profile and optimise it heavily, or add aggressive caching of processed headers.

It might be worth raising an issue on https://github.com/NVIDIA/jitify to see if they have any thoughts on the matter, but it appears most of their attention has moved to Jitify2, so it's unlikely they would do any work directly on optimising the preprocessor. If we take the aggressive caching approach, it might be worth asking Ben whether it's something they'd be interested in merging, so we can decide whether to make our header cache internal or external to Jitify.
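
For reference, a minimal sketch of how this sort of timing can be captured around the generic jitify v1 API (FLAMEGPU's actual call site is JitifyCache::compileKernel(); the function name below is illustrative):

    #include <chrono>
    #include <cstdio>
    #include <string>
    #include "jitify.hpp"  // include path may differ per project layout

    // Time the whole jitify::Program construction. This covers jitify's own
    // header resolution/preprocessing as well as the underlying NVRTC
    // createProgram/compileProgram calls, so comparing it against an
    // NVRTC-only timing isolates the preprocessor's share of the cost.
    void time_rtc_build(const std::string &kernel_src) {
        static jitify::JitCache kernel_cache;
        const auto start = std::chrono::steady_clock::now();
        jitify::Program program = kernel_cache.program(kernel_src);
        const auto stop = std::chrono::steady_clock::now();
        std::printf("jitify::Program construction: %.1f ms\n",
                    std::chrono::duration<double, std::milli>(stop - start).count());
    }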

Robadob commented Jul 27, 2021

OK, so having looked into this further, my initial hypothesis was incorrect. The time is being taken up by NVRTC, but this is because Jitify is hammering NVRTC with many calls to jitify::detail::compile_kernel per agent function. It appears Jitify silently catches compile errors to detect include files, rather than parsing the includes itself. Replacing this with a tiny preprocessor that detects include files recursively might be the solution (or precaching our include files so it doesn't have to follow them). Writing a legitimate preprocessor that can appropriately handle #if with arithmetic expressions is probably a bad idea, so there are two options:

  1. Recursively collect #include statements, treating every file as if it were #pragma once and, as Jitify does, commenting out includes we can't find (a rough sketch follows this list).
  2. Use a tool like pcpp to pre-process the headers; this will actually evaluate the defines, at least limiting the non-RTC headers pulled in from fgpu2.
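
A rough sketch of option 1, under some simplifying assumptions (quoted includes only, a flat search-path list, no includes resolved relative to the including file); flatten is an illustrative name, not an existing FLAMEGPU helper:

    #include <filesystem>
    #include <fstream>
    #include <regex>
    #include <set>
    #include <sstream>
    #include <string>
    #include <vector>

    namespace fs = std::filesystem;

    // Recursively inline #include "..." directives, treating every file as if
    // it were #pragma once, and commenting out any include that cannot be
    // located (mirroring what Jitify does when NVRTC reports a missing header).
    std::string flatten(const fs::path &file,
                        const std::vector<fs::path> &search_paths,
                        std::set<std::string> &seen) {
        if (!seen.insert(file.lexically_normal().string()).second)
            return "";  // Already emitted once: the "treat as once" behaviour.
        std::ifstream in(file);
        std::ostringstream out;
        static const std::regex inc(R"rgx(^\s*#\s*include\s*"([^"]+)")rgx");
        std::string line;
        while (std::getline(in, line)) {
            std::smatch m;
            if (!std::regex_search(line, m, inc)) {
                out << line << '\n';
                continue;
            }
            bool found = false;
            for (const auto &dir : search_paths) {
                const fs::path candidate = dir / m[1].str();
                if (fs::exists(candidate)) {
                    out << flatten(candidate, search_paths, seen);
                    found = true;
                    break;
                }
            }
            if (!found)
                out << "// [include not found] " << line << '\n';
        }
        return out.str();
    }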

This is the offending block (line numbers might be off):

[screenshot of the offending code block]

Robadob commented Jul 27, 2021

So I've run a test file through the Python preprocessor library pcpp; the input/output is visible here: https://gist.github.com/Robadob/53701a5217dc9089800f5a37716fc69b

In short:

  • It took a total of 3.0 s to parse the full include hierarchy (assuming it did so correctly).
  • pcpp time per bottom-level file was anywhere from 50 ms to 500 ms (higher-level files account for their child files' times too, so are less useful to compare).
  • In comparison, Jitify costs 130-600 ms per file (a quick count showed compile_kernel was called 198 times with a silent error). Each failure leads to an include being added and the whole tree being re-parsed, so the time gradually increases; the final successful call only takes around a second, but the preceding calls add up to ~60 s+. (The loop responsible is sketched below.)
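
For context, the pattern that makes this so slow is roughly the following (a paraphrase, not Jitify's actual source; try_nvrtc_compile and load_or_stub are hypothetical wrappers): compile, scrape the missing header's name out of the NVRTC log, register that header, then recompile the entire program.

    #include <map>
    #include <regex>
    #include <string>

    // Hypothetical wrappers, for illustration only:
    bool try_nvrtc_compile(const std::string &src,
                           const std::map<std::string, std::string> &headers,
                           std::string *log);
    std::string load_or_stub(const std::string &header_name);

    // Paraphrase of Jitify's include discovery: each failed compile names one
    // missing header, which is loaded (or stubbed), and the *entire* program
    // is recompiled. ~198 retries at 130-600 ms each is where the ~60 s goes.
    bool discover_headers(const std::string &src,
                          std::map<std::string, std::string> &headers) {
        static const std::regex missing("cannot open source file \"([^\"]+)\"");
        for (;;) {
            std::string log;
            if (try_nvrtc_compile(src, headers, &log))
                return true;
            std::smatch m;
            if (!std::regex_search(log, m, missing))
                return false;  // A genuine compile error, not a missing include.
            headers[m[1].str()] = load_or_stub(m[1].str());
        }
    }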

Robadob commented Jul 27, 2021

Created a Jitify issue to see their thoughts here: NVIDIA/jitify#90

From our perspective, we should consider whether we want to go as far as adding a complex library build step which generates flattened header(s) for inclusion within RTC (using something like pcpp or clang, with us manually specifying all the defines). I think this would give us better RTC performance than even the best case scenario, in which NVRTC adds support for loading include files straight from disk. I'm not too sure of the license implications of flattening the GLM headers.

ptheywood commented

GLM is MIT licensed, so as long as we include the GLM licence file with it, it's not an issue.

Robadob commented Jul 28, 2021

So, I've managed to get RTC to build with partially flattened headers.

Create a file test.cpp:

    // Defines that mimic the state the real compiler / NVRTC would provide,
    // so pcpp evaluates the headers the way NVRTC will see them:
    #define SEATBELTS 1
    #define USE_GLM
    #define NDEBUG
    #define __CUDACC_RTC__
    #define __CUDACC__
    #define __CUDA_ARCH__ 50
    #define __CUDACC_VER_MAJOR__ 11
    #define __CUDACC_VER_MINOR__ 1
    #define __CUDACC_VER_BUILD__
    #define NULL nullptr
    #define __cplusplus
    #define _WIN64 1
    #define __cdecl
    #define __ptr64
    // Pre-defining this include guard keeps DeviceEnvironment.cuh out of the output:
    #define INCLUDE_FLAMEGPU_RUNTIME_UTILITY_DEVICEENVIRONMENT_CUH_

    // The headers to flatten:
    #include "flamegpu/exception/FLAMEGPUDeviceException.cuh"
    #include "flamegpu/runtime/DeviceAPI.cuh"
    #include "flamegpu/runtime/messaging/None/NoneDevice.cuh"
    #include "flamegpu/runtime/messaging/Bucket/BucketDevice.cuh"
    #include "flamegpu/runtime/messaging/BruteForce/BruteForceDevice.cuh"
    #include "flamegpu/runtime/messaging/Array/ArrayDevice.cuh"
    #include "flamegpu/runtime/messaging/Array2D/Array2DDevice.cuh"
    #include "flamegpu/runtime/messaging/Array3D/Array3DDevice.cuh"
    #include "flamegpu/runtime/messaging/Spatial2D/Spatial2DDevice.cuh"
    #include "flamegpu/runtime/messaging/Spatial3D/Spatial3DDevice.cuh"

Run:

    pcpp -o test.h -I"C:\Users\Robadob\fgpu2\include" -I"C:\Users\Robadob\fgpu2\build\_deps\glm-src" --time --passthru-defines test.cpp

(You can install pcpp with pip install pcpp.)

Edit the output file test.h and delete its first 16 lines: we don't want the defines we manually created to mimic the compiler's state, nor the DeviceEnvironment include guard, passed through to RTC, as these will cause problems.

Move the edited test.h to the flamegpu include directory.

Update AgentDescription::newRTCFunction() so it includes test.h instead of DeviceAPI.h, and comment out the dynamic message includes.

Now RTC can build agent functions with GLM in ~8 seconds, rather than 60+ seconds.

This could be further improved by flattening the remaining system/CUDA/cuRAND headers, and by flattening the device environment header into the dynamic curve header (because it has to be included late).

Robadob commented Feb 18, 2022

Managed to improve RTC compile times (albeit not yet extended to GLM) with some test code.

Here are the tests and their improvements:

  • TestCUDASimulation.RTCElapsedTime : 7865ms -> 1083ms
  • TestCUDASimulationConcurrency.RTCLayerConcurrency : 29905ms -> 4498ms
  • RTCDeviceEnvironmentTest.Get_array_glm : 50297ms -> 41798ms

The hacky fix I used was to add this long block of code, pre-populating the headers list, inside JitifyCache::compileKernel():

    // Add known headers from hierarchy
    headers.push_back("algorithm");
    headers.push_back("assert.h");
    headers.push_back("cassert");
    headers.push_back("cfloat");
    headers.push_back("climits");
    headers.push_back("cmath");
    headers.push_back("cstddef");
    headers.push_back("cstdint");
    headers.push_back("cstring");
    headers.push_back("cuda_runtime.h");
    headers.push_back("curand.h");
    headers.push_back("curand_discrete.h");
    headers.push_back("curand_discrete2.h");
    headers.push_back("curand_globals.h");
    headers.push_back("curand_kernel.h");
    headers.push_back("curand_lognormal.h");
    headers.push_back("curand_mrg32k3a.h");
    headers.push_back("curand_mtgp32.h");
    headers.push_back("curand_mtgp32_kernel.h");
    headers.push_back("curand_normal.h");
    headers.push_back("curand_normal_static.h");
    headers.push_back("curand_philox4x32_x.h");
    headers.push_back("curand_poisson.h");
    headers.push_back("curand_precalc.h");
    headers.push_back("curand_uniform.h");
    headers.push_back("device_launch_parameters.h");
    //headers.push_back("dynamic/curve_rtc_dynamic.h");  // This is included proper below, having this makes a vague compile err
    headers.push_back("flamegpu/defines.h");
    headers.push_back("flamegpu/exception/FLAMEGPUDeviceException.cuh");
    headers.push_back("flamegpu/exception/FLAMEGPUDeviceException_device.cuh");
    headers.push_back("flamegpu/gpu/CUDAScanCompaction.h");
    headers.push_back("flamegpu/runtime/AgentFunction.cuh");
    headers.push_back("flamegpu/runtime/AgentFunctionCondition.cuh");
    headers.push_back("flamegpu/runtime/AgentFunctionCondition_shim.cuh");
    headers.push_back("flamegpu/runtime/AgentFunction_shim.cuh");
    headers.push_back("flamegpu/runtime/DeviceAPI.cuh");
    headers.push_back("flamegpu/runtime/messaging/MessageArray.h");
    headers.push_back("flamegpu/runtime/messaging/MessageArray/MessageArrayDevice.cuh");
    headers.push_back("flamegpu/runtime/messaging/MessageArray2D.h");
    headers.push_back("flamegpu/runtime/messaging/MessageArray2D/MessageArray2DDevice.cuh");
    headers.push_back("flamegpu/runtime/messaging/MessageArray3D.h");
    headers.push_back("flamegpu/runtime/messaging/MessageArray3D/MessageArray3DDevice.cuh");
    headers.push_back("flamegpu/runtime/messaging/MessageBruteForce.h");
    headers.push_back("flamegpu/runtime/messaging/MessageBruteForce/MessageBruteForceDevice.cuh");
    headers.push_back("flamegpu/runtime/messaging/MessageBucket.h");
    headers.push_back("flamegpu/runtime/messaging/MessageBucket/MessageBucketDevice.cuh");
    headers.push_back("flamegpu/runtime/messaging/MessageSpatial2D.h");
    headers.push_back("flamegpu/runtime/messaging/MessageSpatial2D/MessageSpatial2DDevice.cuh");
    headers.push_back("flamegpu/runtime/messaging/MessageSpatial3D.h");
    headers.push_back("flamegpu/runtime/messaging/MessageSpatial3D/MessageSpatial3DDevice.cuh");
    headers.push_back("flamegpu/runtime/messaging/MessageNone.h");
    headers.push_back("flamegpu/runtime/utility/AgentRandom.cuh");
    headers.push_back("flamegpu/runtime/utility/DeviceEnvironment.cuh");
    headers.push_back("flamegpu/runtime/utility/DeviceMacroProperty.cuh");
    headers.push_back("flamegpu/util/detail/StaticAssert.h");
    //headers.push_back("jitify_preinclude.h");  // I think Jitify adds this itself
    headers.push_back("limits");
    headers.push_back("limits.h");
    headers.push_back("math.h");
    headers.push_back("memory.h");
    headers.push_back("stddef.h");
    headers.push_back("stdint.h");
    headers.push_back("stdio.h");
    headers.push_back("stdlib.h");
    headers.push_back("string");
    headers.push_back("string.h");
    headers.push_back("time.h");
    headers.push_back("type_traits");

These are all the headers reported by the keys in jitify::experimental::Program::_sources.

The issue with adding GLM to this is that internally GLM has many relative-path includes, many of which map to duplicate absolute paths. It might be possible to address that by giving Jitify lots of bad include paths, but this seems grim. I think the optimal solution for GLM would be to feed it through pcpp, as done in the above comment, to flatten it; this could presumably be automated at CMake time. Although that wouldn't solve the case where users want tertiary GLM includes of their own, which would pull core GLM headers back in.

As Pete has pointed out on Slack, we probably want to automate detection of the fgpu/curand include hierarchies, so the list stays stable as the library changes. The best method for that requires discussion; one possible shape is sketched below.
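
One possible shape for that automation, sketched here under the assumption that jitify's internal _sources map can be exposed (it is not public API, so this presumes a local patch or friend accessor, called sources() below):

    #include <cstdio>
    #include "jitify.hpp"  // include path may differ

    // One-off generator: after a (slow) successful compile, dump every header
    // name jitify discovered, so the hard-coded headers.push_back() list above
    // can be regenerated whenever the include hierarchy changes.
    void dump_known_headers(const jitify::experimental::Program &program) {
        for (const auto &kv : program.sources())  // hypothetical accessor over _sources
            std::printf("headers.push_back(\"%s\");\n", kv.first.c_str());
        // Note: one key will be the program's own source rather than a header;
        // filter that entry out of the generated list.
    }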
