In-Layer Concurrency #379
The offending memset (blocking concurrency with NO_SEATBELTS=OFF) is in DeviceExceptionManager::getDevicePtr. The reset needs moving outside the loop. Probably need to make sure that device exceptions work with concurrency too.
No reason they shouldn't. Each stream should have its own output buffer.
…On Fri, 11 Sep 2020, 18:08 Peter Heywood, ***@***.***> wrote:
The offending memset blocking NO_SEATBELTS=OFF is in
DeviceExceptionManager::getDevicePtr.
The reset needs moving outside the loop. Probably need to make sure
that device exceptions work with concurrency too.
Detecting concurrency within the post agent function steps is going to be very difficult, as I can't artificially slow down these methods (they are not user defined), so I can't get reliable timing info, even if I add extra timers to measure the cost of the post-step phase rather than the simulation/layers themselves. E.g. for 512 agents on a 1070, the PBM reconstruction kernel takes ~14us, vs ~200us for a message iteration kernel which only contains return true (with seatbelts on). I can't reliably time such a short-running kernel, and can't (currently) separate the timing out without (potentially) adding runtime costs to the entire library through excessive timing.
Currently getting 2 test failures on Windows. This requires some investigation (but the other new tests passing is promising).
The test issue has been resolved under WDDM. Long term we should check this further, and consider not using cudaEvent_t timers under Windows on WDDM devices, instead using a lower precision timer. Otherwise this should be good to review and merge.
This involves significant restructuring of CUDASimulation, and use of cudaStream_t in many (but not all) places. An (incomplete) set of tests designed to detect concurrency is provided, controlled through a new CUDASimulation CUDAConfig parameter (inLayerConcurrency).

Utility classes for timers are provided (CudaEvent and chrono::steady_clock based), and used to time RTC init, init functions, step functions, exit functions, calls to simulate, and per-step timing. These are programmatically accessible through methods on the CUDASimulation object. WDDM and cudaEvent based timers can be a little odd; they may ignore some time spent on the host. Further investigation is required.

There is still room to make further improvements through streams (e.g. random init) and to reduce the number of explicit and implicit synchronisation points required. Pinned memory will also help in places. An issue will be created to highlight remaining things to do.

Adds some disabled tests to detect concurrency in the code executing after an agent condition or agent function. These cannot be fixed without pinned memory + a lot of refactoring, for minimal perf gain.
Implement and detect in-layer concurrency via streams.
This is now ready for review. It is not complete, but it is most of the way there and provides the majority of the intended benefit, although things can still be improved.
Need to check that it has worked as intended on Windows (i.e. do the tests pass) in Release mode.
Issues
Closes #189
Outstanding improvements detailed in #449
Completed bits
- CUDAConfig::inLayerConcurrency property to prevent stream-based concurrency
- -t/--timing (CUDASimulation? SEATBELTS on)

This has primarily been tested / implemented under Linux.
Ideally the problem sizes should be small enough that individual agent functions will not fill the device, but this requires blocksize information to accurately target. Rather than exposing this, just use very small populations.
Simple test case
4 agent types, 1 function each in the same layer, with small populations (4k?) each doing lots of pointless global memory accesses and floating point ops to ensure larger run time so that concurrency is (almost) guaranteed to occur.
Time each function N times, and compare average with and without streams enabled.
nvprof output for the simple test case, with NO_SEATBELTS=OFF, showing that concurrency is already being achieved. My test correctly detects this! ~70ms with streams, ~200ms without.