Concurrency/Stream Improvements #449

Open
16 tasks
ptheywood opened this issue Feb 17, 2021 · 1 comment

@ptheywood
Member

ptheywood commented Feb 17, 2021

Once #379 is merged, there is still room to improve performance through better use of streams, both within a single simulation and when a simulation runs as part of an ensemble.

Some but not all of the potential improvements / outstanding todos

  • Use non-default streams in more places

    • RandomManager::resizeDeviceArray / RandomManager::resize
    • CUDAScanCompaction::zero
    • mapNewRuntimeVariables
    • the use of various CUDAScatter methods which are currently just passed the default stream (0).
  • Better ways of passing streams around, where the stream belongs to a simulation (or an ensemble?).

  • Memory Pinning

    • Async memcpys block unless the host memory is pinned (page-locked).
    • Cannot pin everything, as pinning too much memory can cause systems to lock up (by preventing the OS from paging anything)
  • _async variants of some methods (e.g. some CUDAScatter methods)

    • Allows these to be used without synchronisation when streams are passed. Fewer syncs are better (where possible), but this should be opt-in (and clear).
    • The non async methods can just call the _async version + add a stream sync, so minimal overhead of maintaining this.
    • Some of these return values which are copied back from the device, so they require the sync. In that case, switching to a batch operation which processes N reductions concurrently may be required.
  • Expanded Testing

    • Test(s) for each communication strategy
    • Make the tests check for more than just performance
    • More RTC test coverage
    • Performance test(s) within an ensemble
    • Attempt to test the concurrency of pre/post processing (i.e. scatter) although this may be difficult to time accurately
  • More refactoring of stepLayer - it's still a huge method.

    • Possibly use methods in an unnamed namespace to prevent them being called by users.
  • Per layer timing

    • Additional syncing/events might have a negative impact on perf, + potentially high memory requirements (one element per layer per step (per simulation in an ensemble)). May be inaccurate on WDDM devices?
  • Timing within Ensembles (Logging)

    • Timing of individual parts of individual simulations is less important when part of an ensemble, but might still be useful.
    • It should be made accessible through logging (or as part of the ensemble object?)
  • Use a dynamic range of per-stream elements, rather than a hard cap at 128. The cap was chosen naively as it is the limit on the number of streams which can execute concurrently, but a model could launch more than 128 individual kernels within a layer; the extra kernels would just be serialised.
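
As a rough illustration of the `_async` idea and the pinning note above, a minimal sketch might look like the following. All names here (`zero_kernel`, `zero_async`, `zero`, `zero_round_trip`) are hypothetical stand-ins and not the existing FLAME GPU API; only the CUDA runtime calls are real. It assumes the caller owns the stream and passes it in.

```cuda
#include <cuda_runtime.h>
#include <cassert>
#include <cstddef>

// Hypothetical kernel standing in for something like a "zero" operation.
__global__ void zero_kernel(unsigned int *data, unsigned int n) {
    const unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = 0u;
}

// Opt-in async variant: enqueues work on the caller's stream and returns
// without synchronising. The caller is responsible for the sync.
cudaError_t zero_async(unsigned int *d_data, unsigned int n, cudaStream_t stream) {
    if (n == 0) return cudaSuccess;  // a grid of 0 blocks is an invalid launch
    const unsigned int block = 128;
    const unsigned int grid = (n + block - 1) / block;
    zero_kernel<<<grid, block, 0, stream>>>(d_data, n);
    return cudaGetLastError();
}

// Synchronous variant: just the async version plus a stream sync.
cudaError_t zero(unsigned int *d_data, unsigned int n, cudaStream_t stream) {
    const cudaError_t err = zero_async(d_data, n, stream);
    return err != cudaSuccess ? err : cudaStreamSynchronize(stream);
}

// Round trip on a non-default stream with a pinned host buffer:
// dirty the device buffer, zero it, copy it back, then sync once.
// Returns true if every element read back is zero. Assumes n > 0.
bool zero_round_trip(unsigned int n) {
    cudaStream_t stream = nullptr;
    unsigned int *d_data = nullptr;
    unsigned int *h_pinned = nullptr;
    const size_t bytes = static_cast<size_t>(n) * sizeof(unsigned int);
    bool ok = cudaStreamCreate(&stream) == cudaSuccess
        && cudaMalloc(reinterpret_cast<void **>(&d_data), bytes) == cudaSuccess
        // Pinned allocation, so the device-to-host copy below can be truly async.
        && cudaMallocHost(reinterpret_cast<void **>(&h_pinned), bytes) == cudaSuccess
        && cudaMemsetAsync(d_data, 0xFF, bytes, stream) == cudaSuccess
        && zero_async(d_data, n, stream) == cudaSuccess
        && cudaMemcpyAsync(h_pinned, d_data, bytes, cudaMemcpyDeviceToHost, stream) == cudaSuccess
        // One sync for the whole batch of work queued on this stream.
        && cudaStreamSynchronize(stream) == cudaSuccess;
    for (unsigned int i = 0; ok && i < n; ++i) ok = (h_pinned[i] == 0u);
    if (h_pinned) cudaFreeHost(h_pinned);
    if (d_data) cudaFree(d_data);
    if (stream) cudaStreamDestroy(stream);
    return ok;
}
```

Keeping the synchronous method as a thin wrapper means there is only one code path to maintain, and callers that already batch work on a stream can pick the `_async` form and sync once at the end.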

@ptheywood ptheywood mentioned this issue Feb 17, 2021
24 tasks
@ptheywood
Member Author

See the cineca_concurrency branch for some steps towards this, focussed so far on fixing the regression introduced by automatic IDs.

@ptheywood ptheywood added this to the v2.0.0-alpha.N milestone Aug 11, 2021
@ptheywood ptheywood removed this from the v2.0.0-alpha.N milestone Dec 13, 2022