In-Layer Concurrency #379

Merged: 1 commit merged from concurrency into master on Feb 24, 2021

Conversation

@ptheywood ptheywood commented Sep 11, 2020

Implement and detect in-layer concurrency via streams.

This is now ready for review. It is not complete, but it is most of the way there and provides the majority of the intended benefit, although things can still be improved.

Need to check that it works as intended on Windows (i.e. that the tests pass) in Release mode.

Issues

Closes #189

Outstanding improvements detailed in #449

Completed bits

  • add a CUDAConfig::inLayerConcurrency property to allow stream-based concurrency to be disabled
    • required for testing, and potentially useful for debugging.
  • Programmatic Step Timing, init, exit and RTC timing.
    • Plus output via -t/--timing?
    • tests
  • Simulate() NVTX Range
  • NVTX Improvements CUDASimulation
  • Refactor Step() into smaller chunks.
    • Make Init functions user-callable
    • Make Exit functions user-callable
  • Better CUDAStream management/re-use (see the stream sketch after this list)
  • Concurrent post-processing via streams (PBM construction, Scattering, CUB/Thrust calls...)
  • Add tests which detect concurrency, through the average runtime of steps:
    • Naive agent function, different agent types
    • Message Output to independent lists
    • Message Input from independent lists
    • Message Input from the same message list
    • Optional Message Output
    • Agent Birth (independent agent types)
    • Agent Death
    • Device Exception check (only runs with SEATBELTS on)
  • Add RTC tests which detect concurrency, through the average runtime of steps:
    • Naive agent function, different agent types
    • Others (see issue)
  • Check synchronisation is still performed as appropriate
  • Disable performance related tests in debug mode.
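
As a rough illustration of the mechanism (plain CUDA with made-up kernels and buffers; not the library's actual internals): each agent function in a layer is issued into its own cudaStream_t, and the layer boundary synchronises before dependent work such as PBM construction runs.

```cpp
// Minimal sketch of in-layer concurrency via streams. Illustrative only;
// kernel names and buffers are hypothetical, not FLAMEGPU2 internals.
#include <cuda_runtime.h>
#include <vector>

__global__ void agentFunctionA(float *data, unsigned int n) {
    const unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}
__global__ void agentFunctionB(float *data, unsigned int n) {
    const unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

void executeLayer(float *dA, float *dB, unsigned int n, bool inLayerConcurrency) {
    // Real code would create these once and re-use them across steps.
    std::vector<cudaStream_t> streams(2);
    for (auto &s : streams)
        cudaStreamCreate(&s);
    const unsigned int block = 256;
    const unsigned int grid = (n + block - 1) / block;
    // With concurrency enabled each function gets its own stream; otherwise
    // everything is issued into the default stream and runs sequentially.
    agentFunctionA<<<grid, block, 0, inLayerConcurrency ? streams[0] : cudaStream_t(0)>>>(dA, n);
    agentFunctionB<<<grid, block, 0, inLayerConcurrency ? streams[1] : cudaStream_t(0)>>>(dB, n);
    // Layer boundary: every function in the layer must complete before the
    // next layer (or post-processing such as PBM construction) begins.
    cudaDeviceSynchronize();
    for (auto &s : streams)
        cudaStreamDestroy(s);
}
```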

This has primarily been tested / implemented under Linux.

  • Validate on Linux
  • Validate on Windows (WDDM); there may be discrepancies.

Ideally the problem sizes should be small enough that individual agent functions will not fill the device, but targeting that accurately requires block-size information. Rather than exposing this, just use very small populations.

Simple test case

4 agent types, each with 1 function in the same layer and a small population (4k?), each doing lots of pointless global memory accesses and floating-point ops to inflate the run time so that concurrency is (almost) guaranteed to occur.
Time each function N times, and compare the averages with and without streams enabled.
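
A sketch of how the timing comparison can be structured, with made-up kernel and helper names (busyAgentFunction, averageStepMs); this is not the repository's test code.

```cpp
// Average the wall-clock time of N identical steps, with and without
// per-function streams, and compare. Names here are illustrative only.
#include <chrono>
#include <cuda_runtime.h>

// Deliberately slow "agent function": pointless FLOPs and global memory
// traffic so each kernel runs long enough for overlap to be measurable.
__global__ void busyAgentFunction(float *data, unsigned int n) {
    const unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float v = data[i];
    for (int k = 0; k < 4096; ++k)
        v = v * 1.0000001f + 0.0000001f;
    data[i] = v;
}

// 4 agent types, 1 function each, all in the same layer.
double averageStepMs(float *pop[4], unsigned int n, bool useStreams, int steps) {
    cudaStream_t streams[4];
    for (auto &s : streams) cudaStreamCreate(&s);
    const unsigned int block = 256, grid = (n + block - 1) / block;
    const auto start = std::chrono::steady_clock::now();
    for (int step = 0; step < steps; ++step) {
        for (int f = 0; f < 4; ++f)
            busyAgentFunction<<<grid, block, 0, useStreams ? streams[f] : cudaStream_t(0)>>>(pop[f], n);
        cudaDeviceSynchronize();  // layer boundary
    }
    const auto end = std::chrono::steady_clock::now();
    for (auto &s : streams) cudaStreamDestroy(s);
    return std::chrono::duration<double, std::milli>(end - start).count() / steps;
}
```

A streamed average clearly below the unstreamed one indicates the four functions overlapped; comparable averages suggest something is serialising the layer.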

nvprof output for the simple test case with NO_SEATBELTS=OFF, showing that concurrency is already being achieved. My test correctly detects this: ~70 ms with streams, ~200 ms without.
(nvprof timeline screenshot)

@ptheywood (Member Author)

The offending memset blocking concurrency with NO_SEATBELTS=OFF is in DeviceExceptionManager::getDevicePtr.

The reset needs moving outside the loop. We probably also need to make sure that device exceptions work with concurrency.
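
The shape of that fix, sketched with hypothetical names (ExceptionData, exceptionBuffer, launchAgentFunction are placeholders; this is not the actual DeviceExceptionManager code): reset each buffer asynchronously in its function's stream before the launches, rather than with a synchronous cudaMemset inside the launch loop.

```cpp
#include <cuda_runtime.h>

// Hypothetical types/names; only the shape of the fix is shown.
struct ExceptionData { int count; };

void launchLayer(ExceptionData **exceptionBuffer, cudaStream_t *streams,
                 int functionCount,
                 void (*launchAgentFunction)(int, cudaStream_t)) {
    // A synchronous cudaMemset inside the per-function loop would drain every
    // stream and serialise the layer. Issuing the resets asynchronously, each
    // in its function's stream, keeps the subsequent launches overlappable.
    for (int f = 0; f < functionCount; ++f)
        cudaMemsetAsync(exceptionBuffer[f], 0, sizeof(ExceptionData), streams[f]);
    for (int f = 0; f < functionCount; ++f)
        launchAgentFunction(f, streams[f]);
}
```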

@Robadob (Member) commented Sep 11, 2020 via email

@ptheywood (Member Author)

Detecting concurrency within the post-agent-function steps is going to be very difficult, as I can't artificially slow down these methods (they are not user defined), so I can't get reliable timing info, even if I add extra timers to measure the cost of the post-step phase rather than the simulation/layers themselves.

E.g. for 512 agents on a 1070, the PBM reconstruction kernel takes ~14 us, vs ~200 us for a message iteration kernel which only contains return true (with seatbelts on). I can't reliably time such a short-running kernel, and can't (currently) separate the timing out without (potentially) adding runtime costs to the entire library through excessive timing.

@ptheywood ptheywood changed the base branch from rename-cuda-agent-model to master October 7, 2020 17:44
@ptheywood ptheywood force-pushed the concurrency branch 2 times, most recently from d9b52f4 to abf9be7 on October 22, 2020 15:12
@ptheywood (Member Author)

Now rebased onto current master (d9c3fc5).

Currently failing the 12 concurrency tests, due to additional syncs added during the rebase that need moving.
There is 1 additional test failure which also occurs on master on Linux (#438).

@ptheywood ptheywood force-pushed the concurrency branch 5 times, most recently from 35f4e80 to e59bd50 on February 17, 2021 17:58
@ptheywood ptheywood mentioned this pull request Feb 17, 2021
@ptheywood ptheywood marked this pull request as ready for review February 17, 2021 18:18
@ptheywood (Member Author) commented Feb 18, 2021

Currently getting 2 test failures on Windows with SEATBELTS=OFF:

[  FAILED  ] 2 tests, listed below:
[  FAILED  ] TestCUDASimulationConcurrency.ConditionConcurrencyAllDisabled
[  FAILED  ] TestUtilCUDAEventTimer.CUDAEventTimer

This requires some investigation (but the fact that the other new tests pass is promising).

TestCUDASimulationConcurrency.ConditionConcurrencyAllDisabled also fails on my office machine (Linux). The kernel itself is very short-running, so the test is essentially showing that scans/scatters are not being executed concurrently. There's a synchronous memset I'd missed, plus the lack of pinned memory for the async copies, which prevents concurrency.
Even after addressing those, WDDM might still cause issues that require restructuring the code further (lots of short for loops to ensure that potentially concurrent work is in the same command buffer, though cudaStreamQuery can be used in some places to force flushes at the right time).
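
For reference, the two ingredients mentioned above as a generic CUDA sketch (buffer names are placeholders, not library code): cudaMemcpyAsync only overlaps when the host buffer is pinned, and cudaStreamQuery can be used to encourage WDDM to flush its command buffer without synchronising.

```cpp
#include <cuda_runtime.h>

// Generic illustration only; buffers and sizes are placeholders.
void asyncReadback(const float *dSrc, size_t n, cudaStream_t stream) {
    // cudaMemcpyAsync is only truly asynchronous when the host buffer is
    // pinned (page-locked); with pageable memory the copy behaves synchronously.
    float *hPinned = nullptr;
    cudaMallocHost(&hPinned, n * sizeof(float));
    cudaMemcpyAsync(hPinned, dSrc, n * sizeof(float), cudaMemcpyDeviceToHost, stream);
    // On WDDM, issued work can sit in the driver's command buffer; querying
    // the stream is a cheap way to nudge a flush without blocking the host.
    cudaStreamQuery(stream);
    // ... later, wait for the copy before reading hPinned on the host.
    cudaStreamSynchronize(stream);
    cudaFreeHost(hPinned);
}
```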

TestUtilCUDAEventTimer.CUDAEventTimer fails because the elapsed time measured is less than the 1000 ms requested (900 ms threshold currently), apparently taking 0.001 ms on WDDM.
This appears to be a WDDM issue, in which case I'm not sure the cudaEvent-based timers are reliable.
We could use std::chrono::steady_clock instead in some places (where ~20 ns of precision is not required), such as for the duration of a simulation (or an ensemble, which already uses this?).
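
A minimal sketch of the kind of host-side fallback being suggested (a made-up helper, not the existing utility class), using std::chrono::steady_clock where ~20 ns precision is not required:

```cpp
#include <chrono>

// Made-up minimal host-side timer; not the library's existing utility class.
// Millisecond-level accuracy is plenty for whole-simulation durations, but
// not for short individual kernels, where cudaEvent_t timers are still needed.
class SteadyClockTimer {
    std::chrono::steady_clock::time_point start_;
 public:
    void start() { start_ = std::chrono::steady_clock::now(); }
    double elapsedMilliseconds() const {
        return std::chrono::duration<double, std::milli>(
            std::chrono::steady_clock::now() - start_).count();
    }
};
```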

@ptheywood (Member Author)

The test issue has been resolved under WDDM, by adding a cudaDeviceSynchronize() prior to the thread sleep (for WDDM devices).

Long term we should check this further, and consider not using cudaEvent_t timers on Windows WDDM devices, instead using a lower-precision chrono::steady_clock timer (a util class is already provided and used for ensemble timing).
This should be precise enough for total simulation timing, but may not be very useful at the per-step or per-layer-per-step timing levels.
I'll open a new very low priority issue.

Otherwise this should be good to review and merge.

This involves significant restructuring of CUDASimulation, and use of cudaStream_t in many (but not all) places.

An (incomplete) set of tests designed to detect concurrency is provided, controlled through a new CUDASimulation CUDAConfig parameter (inLayerConcurrency).

Utility classes for timers are provided (cudaEvent- and chrono::steady_clock-based), and are used to time RTC initialisation, init functions, step functions, exit functions, calls to simulate, and per-step timing.
These are programmatically accessible through methods on the CUDASimulation object.
WDDM and cudaEvent-based timers can be a little odd; they may ignore some time spent on the host. Further investigation is required.

There is still room to make further improvements through streams (e.g. random init) and to reduce the number of explicit and implicit synchronisation points required.
Pinned memory will also help in places. An issue will be created to highlight remaining things to do.

Adds some disabled tests to detect concurrency in the code executing after an agent condition or agent function. These cannot be fixed without pinned memory + a lot of refactoring, for minimal perf gain.
@mondus mondus merged commit 6a40189 into master Feb 24, 2021
@mondus mondus deleted the concurrency branch February 24, 2021 16:05