In-Layer Concurrency #379
The offending memset (blocking concurrency with NO_SEATBELTS=OFF) is in DeviceExceptionManager::getDevicePtr. The reset needs moving outside the loop. Probably need to make sure that device exceptions work with concurrency too.
No reason they shouldn't. Each stream should have its own output buffer.
…On Fri, 11 Sep 2020, 18:08 Peter Heywood, ***@***.***> wrote:
The offending memset blocking NO_SEATBELTS=OFF is in
DeviceExceptionManager::getDevicePtr.
The reset needs moving outside the loop. Probably need to make sure
that device exceptions work with concurrency too.
Detecting concurrency within the post agent function steps is going to be very difficult, as I can't artificially slow down these methods (they are not user defined), so I can't get reliable timing info, even if I add extra timers to measure the cost of the post-step phase rather than the simulation/layers themselves. E.g. for 512 agents on a 1070, the PBM reconstruction kernel takes ~14us, vs ~200us for a message iteration kernel which only contains return true (with seatbelts on). I can't reliably time such a short-running kernel, and can't (currently) separate the timing out without (potentially) adding runtime costs to the entire library through excessive timing.
Currently getting 2 test failures on Windows. This requires some investigation (but the other new tests passing is promising).
The test issue has been resolved under WDDM. Long term we should check this further, and consider not using cudaEvent_t timers under Windows on WDDM devices, instead using a lower precision timer. Otherwise this should be good to review and merge.
This involves significant restructuring of CUDASimulation, and use of cudaStream_t in many (but not all) places. An (incomplete) set of tests designed to detect concurrency is provided, controlled through a new CUDASimulation CUDAConfig parameter (inLayerConcurrency).

Utility classes for timers are provided (CudaEvent and chrono::steady_clock based), and used to time RTC init, init functions, step functions, exit functions, calls to simulate, and per-step timing. These are programmatically accessible through methods on the CUDASimulation object. WDDM and cudaEvent based timers can be a little odd; they may ignore some time spent on the host. Further investigation is required.

There is still room to make further improvements through streams (e.g. random init) and to reduce the number of explicit and implicit synchronisation points required. Pinned memory will also help in places. An issue will be created to highlight remaining things to do.

Adds some disabled tests to detect concurrency in the code executing after an agent condition or agent function. These cannot be fixed without pinned memory + a lot of refactoring, for minimal perf gain.
Implement and detect in-layer concurrency via streams.
This is now ready for review. It is not complete, but it is most of the way there and provides the majority of the intended benefit, although things can still be improved.
Need to check that it has worked as intended on Windows (i.e. do the tests pass) in Release mode.
Issues
Closes #189
Outstanding improvements detailed in #449
Completed bits
- CUDAConfig::inLayerConcurrency property to prevent stream-based concurrency
- -t/--timing (CUDASimulation? SEATBELTS on)

This has primarily been tested / implemented under Linux.
Ideally the problem sizes should be small enough that individual agent functions will not fill the device, but this requires blocksize information to accurately target. Rather than exposing this, just use very small populations.
Simple test case
4 agent types, 1 function each in the same layer, with small populations (4k?) each doing lots of pointless global memory accesses and floating point ops to ensure larger run time so that concurrency is (almost) guaranteed to occur.
Time each function N times, and compare average with and without streams enabled.
nvprof output for the simple test case, with NO_SEATBELTS=OFF, showing that concurrency is already being achieved. My test correctly detects this! ~70ms with streams, ~200ms without.