Reducing RTC compilation time #402

Open
ptheywood opened this issue Oct 21, 2020 · 7 comments

Comments

@ptheywood
Member

RTC compilation takes ~ 5 seconds per agent function.

Currently each agent function is its own jitify program, so template expansion etc. happens every time.

Jitify/nvrtc appears to support passing multiple kernels (agent functions) to a single program compilation, and then calling individual kernels from that program.

A quick test (adding multiple agent functions to the same string) shows that this has a negligible effect on compilation time, so creating only one program (per agent?) could massively reduce compilation time for non-trivial models. However, this will require big changes to RTC Curve due to per-agent-function specialisation within CURVE.
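As a rough illustration of the "one program per agent" idea, the sketch below groups agent function sources by agent and emits the shared preamble once per combined source string. All names here (`RtcFunction`, `buildProgramsPerAgent`) are hypothetical and not part of FLAME GPU or jitify, and the sketch does not address the per-agent-function CURVE specialisation mentioned above.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical record: one RTC agent function and the agent that owns it.
struct RtcFunction {
    std::string agent;   // owning agent name
    std::string name;    // agent function name
    std::string source;  // source text of this function only
};

// Build one source string per agent: the shared preamble (includes, templates)
// appears once, followed by every agent function belonging to that agent.
std::map<std::string, std::string> buildProgramsPerAgent(
        const std::string& preamble, const std::vector<RtcFunction>& functions) {
    std::map<std::string, std::string> programs;
    for (const RtcFunction& fn : functions) {
        assert(!fn.agent.empty() && !fn.name.empty());
        std::string& program = programs[fn.agent];
        if (program.empty()) {
            program = preamble;  // headers are parsed once per program
        }
        program += "\n// agent function: " + fn.name + "\n" + fn.source + "\n";
    }
    return programs;
}
```

Each resulting string would then be compiled as a single program, with the individual kernels looked up by name afterwards.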

@ptheywood
Member Author

While waiting for the Python test suite to run, I've manually looked through the header includes for RTC.
Reducing some of these may improve RTC build times.

Note that this was done mid namespaces branch, so filenames / location / includes may be a little incorrect / out of date when coming back to this.

  • curve_rtc_dynamic.cu
    • FGPUDeviceException.h
      • "flamegpu/gpu/CUDAScanCompaction.h"
        • Doesn't have any includes, doesn't need CUDA currently? (no CUDA symbols in the header).
      • "flamegpu/exception/FLAMEGPUDeviceException_device.h" (if seatbelts)
        • cuda_runtime - Doesn't seem required?
        • device_launch_parameters - Doesn't seem required?
        • <cstring>
    • DeviceEnvironment.h
      • <cstdint>
      • <string>
      • <cassert>
  • _impl.cu
    • flamegpu/runtime/DeviceAPI.h
      • <cassert>
      • <cstdint>
      • <limits>
      • "dynamic/curve_rtc_dynamic.h" ifdef __CUDACC_RTC__
        • "flamegpu/runtime/cuRVE/curve.h" ifndef __CUDACC_RTC__
      • "flamegpu/runtime/utility/AgentRandom.cuh"
        • <curand_kernel.h>
        • <cassert>
        • "flamegpu/util/StaticAssert.h"
          • <cstdint>
        • "flamegpu/exception/FLAMEGPUDeviceException.h"
          • <cuda_runtime.h> - contains cudaStream_t.
          • <device_launch_parameters.h> - not required?
          • <string>
          • <type_traits>
          • "flamegpu/gpu/CUDAScanCompaction.h"
            • None
          • "flamegpu/exception/FLAMEGPUDeviceException_device.h" if SEATBELTS
            • cuda_runtime - Doesn't seem required?
            • device_launch_parameters - Doesn't seem required?
            • <cstring>
      • "flamegpu/runtime/utility/DeviceEnvironment.cuh"
        • <cstdint>
        • <string>
        • <cassert>
      • "flamegpu/gpu/CUDAScanCompaction.h"
        • None
      • "flamegpu/runtime/AgentFunction.cuh"
        • <cuda_runtime.h> - __syncthreads() when built by cuda in a templated method.
        • <device_launch_parameters.h> - don't think this is needed in header
        • <curand_kernel.h> - needed.
        • "flamegpu/defines.h"
        • "flamegpu/exception/FLAMEGPUDeviceException.h"
          • <cuda_runtime.h> - contains cudaStream_t.
          • <device_launch_parameters.h> - not required?
          • <string>
          • <type_traits>
          • "flamegpu/gpu/CUDAScanCompaction.h"
            • None
          • "flamegpu/exception/FLAMEGPUDeviceException_device.h" if SEATBELTS
            • cuda_runtime - Doesn't seem required?
            • device_launch_parameters - Doesn't seem required?
            • <cstring>
        • "flamegpu/runtime/AgentFunction_shim.h"
          • None
        • "flamegpu/gpu/CUDAScanCompaction.h"
          • None
        • Uses Curve::NamespaceHash but doesn't directly include the source of that?
      • "flamegpu/runtime/AgentFunctionCondition.h"
        • <cuda_runtime.h> - __syncthreads in templated method.
        • <device_launch_parameters.h> - not needed?
        • "flamegpu/runtime/DeviceAPI.h"
          • Include cycle, but guarded.
        • "flamegpu/runtime/AgentFunctionCondition_shim.h"
          • None, just defines a macro (which needs flamegpu::ReadOnlyDeviceAPI) + __device__ / __forceinline__
        • "flamegpu/gpu/CUDAScanCompaction.h"
          • None
        • Uses Curve::NamespaceHash but not directly included.
        • Curand type used, but not directly included.
      • "flamegpu/runtime/messaging_device.h"
        • "flamegpu/runtime/messaging/None/NoneDevice.h"
          • "flamegpu/runtime/cuRVE/curve.h" - If __CUDACC_RTC__
          • "flamegpu/runtime/messaging/None.h"
            • No includes
        • "flamegpu/runtime/messaging/BruteForce/BruteForceDevice.h"
          • "flamegpu/runtime/messaging/None.h"
          • "flamegpu/runtime/messaging/BruteForce.h"
            • "flamegpu/runtime/messaging/None.h" - not actually inherited from / required?
        • "flamegpu/runtime/messaging/Spatial2D/Spatial2DDevice.h"
          • "flamegpu/runtime/messaging/Spatial2D.h"
            • "flamegpu/runtime/messaging/BruteForce.h" - not actually inherited from / required?
          • "flamegpu/runtime/messaging/BruteForce/BruteForceDevice.h"
            • Above
        • "flamegpu/runtime/messaging/Spatial3D/Spatial3DDevice.h"
          • "flamegpu/runtime/messaging/Spatial3D.h"
            • "flamegpu/runtime/messaging/BruteForce.h" - not actually inherited from / required?
          • "flamegpu/runtime/messaging/Spatial2D/Spatial2DDevice.h" - not required? only needs bruteDevice
          • "flamegpu/runtime/messaging/BruteForce/BruteForceDevice.h"
        • "flamegpu/runtime/messaging/Array/ArrayDevice.h"
          • "flamegpu/runtime/messaging/Array.h"
            • #include "flamegpu/runtime/messaging/BruteForce.h" - not actually required.
          • "flamegpu/runtime/messaging/BruteForce/BruteForceDevice.h" - not actually required?
        • "flamegpu/runtime/messaging/Array2D/Array2DDevice.h"
          • "flamegpu/runtime/messaging/Array2D.h"
            • "flamegpu/runtime/messaging/BruteForce.h" - not needed
          • "flamegpu/runtime/messaging/BruteForce/BruteForceDevice.h" - not needed
        • "flamegpu/runtime/messaging/Array3D/Array3DDevice.h"
          • "flamegpu/runtime/messaging/Array3D.h"
            • "flamegpu/runtime/messaging/BruteForce.h" - not required
            • "flamegpu/runtime/messaging/Array2D.h"
          • "flamegpu/runtime/messaging/BruteForce/BruteForceDevice.h" - not required
        • "flamegpu/runtime/messaging/Bucket/BucketDevice.h"
          • "flamegpu/runtime/messaging/Bucket.h"
            • #ifndef __CUDACC_RTC__
              • <memory>
              • <string>
              • "flamegpu/runtime/cuRVE/curve.h"
            • "flamegpu/runtime/messaging/None.h" - used for size type
            • "flamegpu/runtime/messaging/BruteForce.h"
              • not required?
          • "flamegpu/runtime/messaging/BruteForce/BruteForceDevice.h"
            • This one actually inherits.
      • "flamegpu/defines.h"

@Robadob
Member

Robadob commented Jun 29, 2021 via email

@ptheywood
Member Author

The inheritance is mostly in the MsgXHost header though, not the MsgXDevice or MsgX headers, hence not required in the RTC path.

Yes, compile time shouldn't matter (significantly) with the macro guards, but I'm assuming there will still be some cost associated (as a minimum, the file has to be parsed again line by line during preprocessing), and when the two test suites take 1500s, most of which is RTC compilation, any saving will add up.

@ptheywood
Member Author

As an incredibly rough check I ran boids_rtc_bruteforce (2 agent fns) on my office threadripper, which takes roughly 17s to RTC.

I then commented out all messaging other than None and Bruteforce from messaging_device.h. This brought the build time down to less than 13s (~30%).
Obviously this isn't the actual fix, as it just removes features but it shows how much time is spent in those headers alone.

If we could save 30% from RTC time, then running both test suites would be over 5 mins quicker, so from a development / testing perspective it seems worthwhile, and will help with large RTC agent models.

Alternatively, given the message type is provided in a macro, we could potentially only include the relevant message types in RTC to achieve this saving (if duplicate includes are not the issue; hard to know without trying).
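A minimal sketch of that alternative, assuming the device headers follow the `flamegpu/runtime/messaging/<Type>/<Type>Device.h` pattern listed earlier in this thread. The helper `messagingIncludes` is hypothetical and only shows how the include lines for one agent function's input and output message types could be generated; it ignores dependencies between message types (e.g. the Bucket device header inheriting from BruteForce).

```cpp
#include <cassert>
#include <set>
#include <string>

// Emit include lines only for the message types a single agent function uses.
// Assumption: header paths follow the <Type>/<Type>Device.h layout.
std::string messagingIncludes(const std::string& msgIn, const std::string& msgOut) {
    assert(!msgIn.empty() && !msgOut.empty());
    const std::set<std::string> types = {msgIn, msgOut};  // set removes duplicates
    std::string out;
    for (const std::string& type : types) {
        out += "#include \"flamegpu/runtime/messaging/" + type + "/" + type + "Device.h\"\n";
    }
    return out;
}
```

For a function that reads and writes brute force messages this yields a single include line, so the Spatial, Array and Bucket headers are never seen by the RTC compiler.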

@Robadob
Member

Robadob commented Jun 29, 2021

That's a surprising saving imo.

We could possibly write something hacky to generate a compact header for RTC if really necessary.
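One hacky way this could look is a flattener that inlines every quoted include exactly once into a single string. This is only a sketch under loud assumptions: headers are supplied as an in-memory map, the names `HeaderMap`, `flatten` and `flattenHeader` are made up for illustration, and conditional compilation such as `#ifdef __CUDACC_RTC__` is not evaluated, so a real tool would need to handle that.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <set>
#include <sstream>
#include <string>

using HeaderMap = std::map<std::string, std::string>;  // file name -> contents

// Recursively inline `#include "name"` lines, emitting each header at most once.
void flatten(const std::string& name, const HeaderMap& headers,
             std::set<std::string>& seen, std::string& out) {
    if (!seen.insert(name).second) return;  // already emitted; also breaks include cycles
    const auto it = headers.find(name);
    assert(it != headers.end() && "unknown header");
    std::istringstream lines(it->second);
    std::string line;
    const std::string prefix = "#include \"";
    while (std::getline(lines, line)) {
        if (line.compare(0, prefix.size(), prefix) == 0) {
            const std::size_t end = line.find('"', prefix.size());
            if (end != std::string::npos) {
                flatten(line.substr(prefix.size(), end - prefix.size()), headers, seen, out);
                continue;
            }
        }
        out += line + "\n";  // keep everything else, including <system> includes
    }
}

std::string flattenHeader(const std::string& root, const HeaderMap& headers) {
    std::set<std::string> seen;
    std::string out;
    flatten(root, headers, seen, out);
    return out;
}
```

The `seen` set plays the role of the include guards, so the guarded include cycle noted in the list above would terminate rather than recurse forever.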

@ptheywood
Member Author

ptheywood commented Jul 1, 2021

A couple of simple changes to AgentDescription::newRTCFunction and messaging_device.h made it easy to only include the required messaging types for a given agent function, reducing RTC time for brute force from 17s to 13s, as described above, for boids bruteforce on my office machine (CUDA 11.2, SM70, SEATBELTS=ON for reference).

Running the C++ test suite took 355s (compared to ~500s before?), 29s with RTC cache hits.

Running pytest took 870s (compared to > 1000s before? One CUDA context per test, so ~100s of context creation extra in Python + other Python overheads I guess?), 101s with RTC (and Python?) cache hits. 144s with the __pycache__ dirs in tests/swig deleted.

@ptheywood ptheywood mentioned this issue Jul 7, 2021
27 tasks
@Robadob
Member

Robadob commented Aug 12, 2021

My issue seems like a dupe of this: #602
