Optimise Jitify Preprocessor. #602

Open
Robadob opened this issue Jul 26, 2021 · 6 comments

Robadob commented Jul 26, 2021

Compilation with the main GLM include leads to a 63 second call to the jitify::Program constructor, of which it appears only 600 milliseconds is spent inside NVRTC (createProgram, compileProgram, ..., destroyProgram). This suggests the Jitify preprocessor is to blame. We either need to profile and optimise it heavily, or add aggressive caching of processed headers.

It might be worth raising an issue on https://github.com/NVIDIA/jitify to see if they have any thoughts on the matter, but it appears most of their attention has moved to Jitify2, so it's unlikely they would do any work directly on optimising the preprocessor. If we take the aggressive caching approach, it might be worth asking Ben whether it's something they'd be interested in merging, so we can decide whether to make our header cache internal or external to Jitify.
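
For reference, a minimal sketch of how this sort of timing can be captured around the generic jitify v1 API (FLAMEGPU's actual call site is JitifyCache::compileKernel(); the function name below is illustrative):

    #include <chrono>
    #include <cstdio>
    #include <string>
    #include "jitify.hpp"  // include path may differ per project layout

    // Time the whole jitify::Program construction. This covers jitify's own
    // header resolution/preprocessing as well as the underlying NVRTC
    // createProgram/compileProgram calls, so comparing it against an
    // NVRTC-only timing isolates the preprocessor's share of the cost.
    void time_rtc_build(const std::string &kernel_src) {
        static jitify::JitCache kernel_cache;
        const auto start = std::chrono::steady_clock::now();
        jitify::Program program = kernel_cache.program(kernel_src);
        const auto stop = std::chrono::steady_clock::now();
        std::printf("jitify::Program construction: %.1f ms\n",
                    std::chrono::duration<double, std::milli>(stop - start).count());
    }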

Robadob commented Jul 27, 2021

OK, so having looked into this further, my initial hypothesis was incorrect. The time is being taken up by NVRTC, but this is because Jitify is hammering NVRTC with many calls to jitify::detail::compile_kernel per agent function. It appears Jitify silently catches compile errors to detect include files, rather than parsing the includes itself. Replacing this with a tiny preprocessor that detects include files recursively might be the solution (or precaching our include files so it doesn't have to follow them). Writing a legitimate preprocessor that can appropriately handle #if with arithmetic expressions is probably a bad idea, so there are two options:

  1. Recursively collect #include statements, treating every file as if it were #pragma once and, as Jitify does, commenting out includes we can't find (a rough sketch follows this list).
  2. Use a tool like pcpp to pre-process the headers; this will actually evaluate the defines, at least limiting the non-RTC headers pulled in from fgpu2.
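
A rough sketch of option 1, under some simplifying assumptions (quoted includes only, a flat search-path list, no includes resolved relative to the including file); flatten is an illustrative name, not an existing FLAMEGPU helper:

    #include <filesystem>
    #include <fstream>
    #include <regex>
    #include <set>
    #include <sstream>
    #include <string>
    #include <vector>

    namespace fs = std::filesystem;

    // Recursively inline #include "..." directives, treating every file as if
    // it were #pragma once, and commenting out any include that cannot be
    // located (mirroring what Jitify does when NVRTC reports a missing header).
    std::string flatten(const fs::path &file,
                        const std::vector<fs::path> &search_paths,
                        std::set<std::string> &seen) {
        if (!seen.insert(file.lexically_normal().string()).second)
            return "";  // Already emitted once: the "treat as once" behaviour.
        std::ifstream in(file);
        std::ostringstream out;
        static const std::regex inc(R"rgx(^\s*#\s*include\s*"([^"]+)")rgx");
        std::string line;
        while (std::getline(in, line)) {
            std::smatch m;
            if (!std::regex_search(line, m, inc)) {
                out << line << '\n';
                continue;
            }
            bool found = false;
            for (const auto &dir : search_paths) {
                const fs::path candidate = dir / m[1].str();
                if (fs::exists(candidate)) {
                    out << flatten(candidate, search_paths, seen);
                    found = true;
                    break;
                }
            }
            if (!found)
                out << "// [include not found] " << line << '\n';
        }
        return out.str();
    }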

This is the offending block (line numbers might be off):

[screenshot of the offending code block]

Robadob commented Jul 27, 2021

So I've run a test file through the Python preprocessor library pcpp; the input/output is visible here: https://gist.github.com/Robadob/53701a5217dc9089800f5a37716fc69b

In short:

  • It took a total of 3.0 s to parse the full include hierarchy (assuming it did so correctly).
  • pcpp time per bottom-level file was anywhere from 50 ms to 500 ms (higher-level files account for their child files' times too, so are less useful to compare).
  • In comparison, Jitify costs 130-600 ms per file (a quick count showed compile_kernel was called 198 times with a silent error). Each failure leads to an include being added and the whole tree being re-parsed, so the time gradually increases; the final successful call only takes around a second, but the preceding calls add up to ~60 s+. (The loop responsible is sketched below.)
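
For context, the pattern that makes this so slow is roughly the following (a paraphrase, not Jitify's actual source; try_nvrtc_compile and load_or_stub are hypothetical wrappers): compile, scrape the missing header's name out of the NVRTC log, register that header, then recompile the entire program.

    #include <map>
    #include <regex>
    #include <string>

    // Hypothetical wrappers, for illustration only:
    bool try_nvrtc_compile(const std::string &src,
                           const std::map<std::string, std::string> &headers,
                           std::string *log);
    std::string load_or_stub(const std::string &header_name);

    // Paraphrase of Jitify's include discovery: each failed compile names one
    // missing header, which is loaded (or stubbed), and the *entire* program
    // is recompiled. ~198 retries at 130-600 ms each is where the ~60 s goes.
    bool discover_headers(const std::string &src,
                          std::map<std::string, std::string> &headers) {
        static const std::regex missing("cannot open source file \"([^\"]+)\"");
        for (;;) {
            std::string log;
            if (try_nvrtc_compile(src, headers, &log))
                return true;
            std::smatch m;
            if (!std::regex_search(log, m, missing))
                return false;  // A genuine compile error, not a missing include.
            headers[m[1].str()] = load_or_stub(m[1].str());
        }
    }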

Robadob commented Jul 27, 2021

Created a Jitify issue to see their thoughts here: NVIDIA/jitify#90

From our perspective, we should consider whether we want to go as far as adding a complex library build step which generates flattened header(s) for inclusion within RTC (using something like pcpp or clang, with us manually specifying all the defines). I think this would give us better RTC performance than even the best case scenario, in which NVRTC adds support for loading include files straight from disk. I'm not too sure of the license implications of flattening the GLM headers.

ptheywood commented

GLM is MIT licensed, so as long as we include the GLM licence file with it, it's not an issue.

Robadob commented Jul 28, 2021

So, I've managed to get RTC to build with partially flattened headers.

Create a file test.cpp:

    // Defines that mimic the state the real compiler / NVRTC would provide,
    // so pcpp evaluates the headers the way NVRTC will see them:
    #define SEATBELTS 1
    #define USE_GLM
    #define NDEBUG
    #define __CUDACC_RTC__
    #define __CUDACC__
    #define __CUDA_ARCH__ 50
    #define __CUDACC_VER_MAJOR__ 11
    #define __CUDACC_VER_MINOR__ 1
    #define __CUDACC_VER_BUILD__
    #define NULL nullptr
    #define __cplusplus
    #define _WIN64 1
    #define __cdecl
    #define __ptr64
    // Pre-defining this include guard keeps DeviceEnvironment.cuh out of the output:
    #define INCLUDE_FLAMEGPU_RUNTIME_UTILITY_DEVICEENVIRONMENT_CUH_

    // The headers to flatten:
    #include "flamegpu/exception/FLAMEGPUDeviceException.cuh"
    #include "flamegpu/runtime/DeviceAPI.cuh"
    #include "flamegpu/runtime/messaging/None/NoneDevice.cuh"
    #include "flamegpu/runtime/messaging/Bucket/BucketDevice.cuh"
    #include "flamegpu/runtime/messaging/BruteForce/BruteForceDevice.cuh"
    #include "flamegpu/runtime/messaging/Array/ArrayDevice.cuh"
    #include "flamegpu/runtime/messaging/Array2D/Array2DDevice.cuh"
    #include "flamegpu/runtime/messaging/Array3D/Array3DDevice.cuh"
    #include "flamegpu/runtime/messaging/Spatial2D/Spatial2DDevice.cuh"
    #include "flamegpu/runtime/messaging/Spatial3D/Spatial3DDevice.cuh"

Run:

    pcpp -o test.h -I"C:\Users\Robadob\fgpu2\include" -I"C:\Users\Robadob\fgpu2\build\_deps\glm-src" --time --passthru-defines test.cpp

(You can install pcpp with pip install pcpp.)

Edit the output file test.h and delete its first 16 lines: we don't want the defines we manually created to mimic the compiler's state, nor the DeviceEnvironment include guard, passed through to RTC, as these will cause problems.

Move the edited test.h to the flamegpu include directory.

Update AgentDescription::newRTCFunction() so it includes test.h instead of DeviceAPI.h, and comment out the dynamic message includes.

Now RTC can build agent functions with GLM in ~8 seconds, rather than 60+ seconds.

This could be further improved by flattening the remaining system/CUDA/cuRAND headers, and by flattening the device environment header into the dynamic curve header (because it has to be included late).

Robadob commented Feb 18, 2022

Managed to improve RTC compile times (albeit not yet extended to GLM) with some test code.

Here are the tests and their improvements:

  • TestCUDASimulation.RTCElapsedTime : 7865ms -> 1083ms
  • TestCUDASimulationConcurrency.RTCLayerConcurrency : 29905ms -> 4498ms
  • RTCDeviceEnvironmentTest.Get_array_glm : 50297ms -> 41798ms

The hacky fix I used was to add this long block of code, pre-populating the headers list, inside JitifyCache::compileKernel():

    // Add known headers from hierarchy
    headers.push_back("algorithm");
    headers.push_back("assert.h");
    headers.push_back("cassert");
    headers.push_back("cfloat");
    headers.push_back("climits");
    headers.push_back("cmath");
    headers.push_back("cstddef");
    headers.push_back("cstdint");
    headers.push_back("cstring");
    headers.push_back("cuda_runtime.h");
    headers.push_back("curand.h");
    headers.push_back("curand_discrete.h");
    headers.push_back("curand_discrete2.h");
    headers.push_back("curand_globals.h");
    headers.push_back("curand_kernel.h");
    headers.push_back("curand_lognormal.h");
    headers.push_back("curand_mrg32k3a.h");
    headers.push_back("curand_mtgp32.h");
    headers.push_back("curand_mtgp32_kernel.h");
    headers.push_back("curand_normal.h");
    headers.push_back("curand_normal_static.h");
    headers.push_back("curand_philox4x32_x.h");
    headers.push_back("curand_poisson.h");
    headers.push_back("curand_precalc.h");
    headers.push_back("curand_uniform.h");
    headers.push_back("device_launch_parameters.h");
    //headers.push_back("dynamic/curve_rtc_dynamic.h");  // This is included proper below, having this makes a vague compile err
    headers.push_back("flamegpu/defines.h");
    headers.push_back("flamegpu/exception/FLAMEGPUDeviceException.cuh");
    headers.push_back("flamegpu/exception/FLAMEGPUDeviceException_device.cuh");
    headers.push_back("flamegpu/gpu/CUDAScanCompaction.h");
    headers.push_back("flamegpu/runtime/AgentFunction.cuh");
    headers.push_back("flamegpu/runtime/AgentFunctionCondition.cuh");
    headers.push_back("flamegpu/runtime/AgentFunctionCondition_shim.cuh");
    headers.push_back("flamegpu/runtime/AgentFunction_shim.cuh");
    headers.push_back("flamegpu/runtime/DeviceAPI.cuh");
    headers.push_back("flamegpu/runtime/messaging/MessageArray.h");
    headers.push_back("flamegpu/runtime/messaging/MessageArray/MessageArrayDevice.cuh");
    headers.push_back("flamegpu/runtime/messaging/MessageArray2D.h");
    headers.push_back("flamegpu/runtime/messaging/MessageArray2D/MessageArray2DDevice.cuh");
    headers.push_back("flamegpu/runtime/messaging/MessageArray3D.h");
    headers.push_back("flamegpu/runtime/messaging/MessageArray3D/MessageArray3DDevice.cuh");
    headers.push_back("flamegpu/runtime/messaging/MessageBruteForce.h");
    headers.push_back("flamegpu/runtime/messaging/MessageBruteForce/MessageBruteForceDevice.cuh");
    headers.push_back("flamegpu/runtime/messaging/MessageBucket.h");
    headers.push_back("flamegpu/runtime/messaging/MessageBucket/MessageBucketDevice.cuh");
    headers.push_back("flamegpu/runtime/messaging/MessageSpatial2D.h");
    headers.push_back("flamegpu/runtime/messaging/MessageSpatial2D/MessageSpatial2DDevice.cuh");
    headers.push_back("flamegpu/runtime/messaging/MessageSpatial3D.h");
    headers.push_back("flamegpu/runtime/messaging/MessageSpatial3D/MessageSpatial3DDevice.cuh");
    headers.push_back("flamegpu/runtime/messaging/MessageNone.h");
    headers.push_back("flamegpu/runtime/utility/AgentRandom.cuh");
    headers.push_back("flamegpu/runtime/utility/DeviceEnvironment.cuh");
    headers.push_back("flamegpu/runtime/utility/DeviceMacroProperty.cuh");
    headers.push_back("flamegpu/util/detail/StaticAssert.h");
    //headers.push_back("jitify_preinclude.h");  // I think Jitify adds this itself
    headers.push_back("limits");
    headers.push_back("limits.h");
    headers.push_back("math.h");
    headers.push_back("memory.h");
    headers.push_back("stddef.h");
    headers.push_back("stdint.h");
    headers.push_back("stdio.h");
    headers.push_back("stdlib.h");
    headers.push_back("string");
    headers.push_back("string.h");
    headers.push_back("time.h");
    headers.push_back("type_traits");

These are all the headers reported by the keys in jitify::experimental::Program::_sources.

The issue with adding GLM to this is that internally GLM has many relative-path includes, many of which map to duplicate absolute paths. It might be possible to address that by giving Jitify lots of bad include paths, but this seems grim. I think the optimal solution for GLM would be to feed it through pcpp, as done in the above comment, to flatten it; this could presumably be automated at CMake time. Although that wouldn't solve the case where users want tertiary GLM includes of their own, which would pull core GLM headers back in.

As Pete has pointed out on Slack, we probably want to automate detection of the fgpu/curand include hierarchies, so the list stays stable as the library changes. The best method for that requires discussion; one possible shape is sketched below.
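
One possible shape for that automation, sketched here under the assumption that jitify's internal _sources map can be exposed (it is not public API, so this presumes a local patch or friend accessor, called sources() below):

    #include <cstdio>
    #include "jitify.hpp"  // include path may differ

    // One-off generator: after a (slow) successful compile, dump every header
    // name jitify discovered, so the hard-coded headers.push_back() list above
    // can be regenerated whenever the include hierarchy changes.
    void dump_known_headers(const jitify::experimental::Program &program) {
        for (const auto &kv : program.sources())  // hypothetical accessor over _sources
            std::printf("headers.push_back(\"%s\");\n", kv.first.c_str());
        // Note: one key will be the program's own source rather than a header;
        // filter that entry out of the generated list.
    }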
