
Fix a scaling issue in MagneticField/VolumeBasedEngine #28180

Closed
wants to merge 2 commits

Conversation

@amadio

amadio commented Oct 15, 2019

PR description:

In commit 0f77253, MagGeometry was made thread-safe by using an std::atomic<> to cache the last volume used. This means that all threads share a single cached volume, even though they work on unrelated events, so the cached volume is usually not the right one for the thread that reads it. The atomic also hurts concurrency: whenever a thread updates the cached volume, the CPU cache line holding it is invalidated, forcing all other threads to re-fetch the value from memory and pay the extra latency. Since the cached volume is usually not suitable for them, each thread is in turn likely to replace it with its own value, which again penalizes the other threads, and so on. This significantly slows down the whole simulation, because the magnetic field is queried many times per track to integrate its trajectory.
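For illustration, here is a minimal sketch of the two caching patterns described above, with hypothetical, simplified names (this is not the actual MagGeometry code): the shared std::atomic cache that all threads contend on, and a per-thread cache of the kind introduced by this pull request.

#include <atomic>

struct MagVolume;                                              // stands in for the real volume type
const MagVolume* fullVolumeSearch(float x, float y, float z);  // hypothetical slow volume lookup
bool contains(const MagVolume* v, float x, float y, float z);  // hypothetical containment test

class CachedGeometry {
public:
  // Before: one cached volume shared by every thread through an std::atomic.
  // Every store invalidates the cache line on all other cores, and the cached
  // volume usually belongs to another thread's unrelated event anyway.
  const MagVolume* findVolumeShared(float x, float y, float z) const {
    const MagVolume* v = lastVolume_.load(std::memory_order_acquire);
    if (v && contains(v, x, y, z))
      return v;
    v = fullVolumeSearch(x, y, z);
    lastVolume_.store(v, std::memory_order_release);
    return v;
  }

  // After: each thread keeps its own cached volume, so there is no sharing
  // between threads and no cache-line ping-pong.
  const MagVolume* findVolumeLocal(float x, float y, float z) const {
    static thread_local const MagVolume* lastVolume = nullptr;
    if (lastVolume && contains(lastVolume, x, y, z))
      return lastVolume;
    lastVolume = fullVolumeSearch(x, y, z);
    return lastVolume;
  }

private:
  mutable std::atomic<const MagVolume*> lastVolume_{nullptr};
};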

This problem has been profiled and benchmarked with Intel VTune Amplifier against the cmssw branch CMSSW_11_0_GEANT4_X_2019-10-10-2300. The results are shown below, labeled before and after the changes from this pull request. In all cases, hyper-threading was disabled and the threads were pinned using taskset with an appropriate number of CPUs. All runs were performed on a dual-socket Intel Xeon E5-2698 v3 machine with 64 GB of RAM. We simulate minimum bias events at Ecm = 13 TeV, and vary the total number of events and the number of threads in each analysis. The cmsDriver command used to create the configuration, and the cmsRun command used to run it, are shown below:

cmsDriver.py MinBias_13TeV_cfi --era Run2_2018 -s GEN,SIM --pileup=NoPileUp \
  --conditions auto:phase1_2018_realistic --geometry DB:Extended \
  --eventcontent=RAWSIM --datatier GEN-SIM --dirout=./ -n $nevents --nThreads $nthreads \
  --fileout file:scaling.root --mc --no_exec

cmsRun MinBias_13TeV_cfi_GEN_SIM.py

PR validation:

There is no functional change introduced by this pull request; it is purely a performance optimization. The only other difference is that some barrel properties are now computed once at initialization rather than at every call to the magnetic field, which seemed like a sensible extra optimization.

Here is a plot showing the scaling behavior before and after the changes:

A throughput table before and after the changes is shown below. Throughput is calculated as (1024 events) / Δt, where Δt = t₂ − t₁, t₁ is the start time of processing of the first event, and t₂ is the end time of the job (i.e. an estimate of the total time minus the initialization time).

Threads   Throughput (events/s) [before]   Throughput (events/s) [after]   Change [%]
      1                            0.290                           0.289        -0.34
      2                            0.609                           0.594        -2.46
      4                            1.166                           1.206        +3.43
      8                            1.903                           1.950        +2.47
     12                            2.767                           2.868        +3.65
     16                            3.368                           3.390        +0.65
     24                            4.675                           5.120        +9.52
     32                            4.471                           6.059       +35.52
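For example, at 32 threads the "after" throughput of 6.059 events/s corresponds to Δt = 1024/6.059 ≈ 169 s, compared with 1024/4.471 ≈ 229 s before the change.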

Comparison of hotspots analysis at 32 threads in Intel VTune Amplifier (before - after):

Comparison of micro-architecture analysis at 32 threads in Intel VTune Amplifier (before - after):
[screenshot]

Bottom-up micro-architecture comparison of before and after for 32 threads in Intel VTune Amplifier:
[screenshot]

Best regards,
—Guilherme

@cmsbuild
Contributor

The code-checks are being triggered in jenkins.

@amadio
Author

amadio commented Oct 15, 2019

@civanch Please review. Thank you.

@cmsbuild
Contributor

-code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-28180/12257

  • This PR adds an extra 20KB to the repository

Code checks have found code style and quality issues which could be resolved by applying the following patch(es).

@civanch
Contributor

civanch commented Oct 15, 2019

@amadio , this is a very good fix. You need to fix the code-format issues, and maybe remove the "cout" that is left in the code.

@amadio
Author

amadio commented Oct 15, 2019

@civanch, thank you. I wondered if it was OK to remove it, since it can be a performance issue and it is easy to add back if needed, but I decided to make only minimal changes. I will fix the formatting and make other improvements, like removing the printout calls.

@cmsbuild
Contributor

The code-checks are being triggered in jenkins.

@cmsbuild
Contributor

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-28180/12278

  • This PR adds an extra 20KB to the repository

@cmsbuild
Contributor

A new Pull Request was created by @amadio (Guilherme Amadio) for master.

It involves the following packages:

MagneticField/GeomBuilder
MagneticField/VolumeBasedEngine

@perrotta, @cmsbuild, @slava77 can you please review it and eventually sign? Thanks.
@namapane this is something you requested to watch as well.
@davidlange6, @slava77, @fabiocos you are the release manager for this.

cms-bot commands are listed here

@perrotta
Contributor

please test

@cmsbuild
Contributor

cmsbuild commented Oct 16, 2019

The tests are being triggered in jenkins.
https://cmssdt.cern.ch/jenkins/job/ib-run-pr-tests/2984/console Started: 2019/10/16 14:59

@amadio
Author

amadio commented Oct 16, 2019

@perrotta I have a question: is the field in SimG4Core/MagneticField/interface/Field.h a global, or is a Field instance owned by each thread? The oldx and oldb cached values should probably not be shared between threads either.

@namapane
Contributor

namapane commented Apr 24, 2020

So, as anticipated, I made #29556 with minor optimizations from this one (revisited) plus the couts cleanup discussed above.
I think it would be interesting to run some profiling of SIM with @VinInn 's proposal while the discussion evolves. I can implement it in my cmssw repo on top of the above PR if anybody from SIM is willing to give it a try.

@VinInn
Contributor

VinInn commented Apr 24, 2020

maybe we should move the discussion to an "issue"

@civanch
Contributor

civanch commented Apr 24, 2020

@namapane , whatever you propose as a PR, we will try to test it for CPU efficiency promptly.

@VinInn , I am trying to convince my Geant4 colleagues to remove as many thread_local variables as possible, but in fact this clean-up is going slowly. Maybe some quantitative analysis can help.

@namapane
Contributor

namapane commented Apr 25, 2020

@civanch , I pushed a branch in my clone:
https://github.com/namapane/cmssw/tree/MF_TLScache
I am not sure it is appropriate to make a PR for it, given the ongoing discussion.
But if that makes it technically easier for you to pick it up, and if there is no objection, I can do so.

Some technicalities:

  • It is made on top of Magnetic Field optimization #29556 (i.e. it includes it)
  • It passes the MF regression tests
  • I added a test that creates several different IOVs in the same job and verified that the instance counters behave correctly, at least when IOV changes happen sequentially in a single thread.

I don't know how to test IOV transitions across multiple threads. I tried setting several events per run (i.e. repeating the regression test once for each event) so that different events in the same run and IOV can be processed in parallel. I think this is not happening, at least according to the output:

Begin processing the 1st record. Run 300001, Event 1, LumiSection 1 on stream 0 at 25-Apr-2020 11:40:17.085 CEST
Begin processing the 2nd record. Run 300001, Event 2, LumiSection 1 on stream 0 at 25-Apr-2020 11:40:24.426 CEST
Begin processing the 3rd record. Run 300001, Event 3, LumiSection 1 on stream 0 at 25-Apr-2020 11:40:25.546 CEST
Begin processing the 4th record. Run 300001, Event 4, LumiSection 1 on stream 0 at 25-Apr-2020 11:40:26.676 CEST
Begin processing the 5th record. Run 300001, Event 5, LumiSection 1 on stream 0 at 25-Apr-2020 11:40:27.793 CEST
Begin processing the 6th record. Run 300002, Event 1, LumiSection 2 on stream 0 at 25-Apr-2020 11:40:28.927 CEST
Begin processing the 7th record. Run 300002, Event 2, LumiSection 2 on stream 0 at 25-Apr-2020 11:40:35.809 CEST
Begin processing the 8th record. Run 300002, Event 3, LumiSection 2 on stream 0 at 25-Apr-2020 11:40:36.995 CEST
Begin processing the 9th record. Run 300002, Event 4, LumiSection 2 on stream 0 at 25-Apr-2020 11:40:38.185 CEST
...

I interpret "on stream 0" to mean they are all in the same thread (right?)

Do I need to/can I instruct the FW to run different events in parallel?

@civanch
Contributor

civanch commented Apr 25, 2020

@namapane , these days it would be easier for me if this were a PR. Can I simply test #29556?

@namapane
Contributor

No, #29556 does not contain the TLS cache, only some minor improvements.
I will make a PR so that you can test it.

@slava77
Contributor

slava77 commented Apr 25, 2020

Strictly speaking, in cmssw we need just ONE global thread_local (a small integer) that identifies which thread one is in.
This thread_local variable can be located in the main program and accessed through a simple function.
It can be set by the framework before a module is called, or just autonomously in the function itself (if (0==threadId) threadId = ++atomicThreadCounter;).
Then everybody can build their preferred container indexed by such an identifier. (If the FW provides a global function telling how many threads there are, even better.)

Would std::this_thread::get_id() work?
If it doesn't cycle up to infinity due to task start/stop cycles,
then even a map (or a vector?) of caches would work.
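A minimal sketch of this idea, with hypothetical names (not framework code): a single thread_local integer assigned lazily from an atomic counter, which any client can then use to index its own per-thread container.

#include <atomic>
#include <cassert>
#include <vector>

// One global thread index per thread, assigned lazily the first time the
// thread asks for it (equivalent to the "if (0==threadId)" idiom above).
inline unsigned threadIndex() {
  static std::atomic<unsigned> counter{0};
  thread_local unsigned id = counter++;
  return id;
}

// A client can then keep one cache slot per thread, e.g. a cached volume pointer.
struct PerThreadVolumeCache {
  explicit PerThreadVolumeCache(unsigned maxThreads) : slots(maxThreads, nullptr) {}
  const void*& slot() {
    assert(threadIndex() < slots.size());  // assumes the FW can tell us how many threads exist
    return slots[threadIndex()];
  }
  std::vector<const void*> slots;
};

In practice the slots would also need padding to cache-line size to avoid false sharing between adjacent entries.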

@namapane
Contributor

@civanch, done: it's #29561.

@makortel
Contributor

@namapane

I interpret "on stream 0" to mean they are all in the same thread (right?)

Strictly speaking, no. It means that there is one concurrent event in flight (or one concurrent "EDM stream" processing events). The default is to use one thread (which is likely what you observe), although technically multiple threads can also be used with a single stream.

Do I need to/can I instruct the FW to run different events in parallel?

The number of threads to use can be set on the command line with cmsRun -n <NT> config.py. The framework will then use the same number of streams (concurrent events) as threads. The two (threads and streams) can also be controlled separately in the configuration with

process.options.numberOfThreads = <NT>
process.options.numberOfStreams = <NS> # default is 0, which means to use the number of threads

If you want to try out concurrent lumis, that can be enabled with

process.options.numberOfConcurrentLuminosityBlocks = 2

@makortel
Contributor

@Dr15Jones did a little test of crafting a virtual interface and an implementation that does thread_local caching in a shared library, and compared that to local caching (code here). The test confirms the observation mentioned e.g. in this blog post: in conjunction with shared libraries, accessing a thread_local is slower than accessing a regular variable. In Chris' test the thread_local access was 1.5-2x slower than accessing a "local cache" (in the case where all accesses are cache hits).
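For concreteness, here is a sketch of the two access patterns being compared, with hypothetical code (not Chris' actual test): a virtual method that caches in a thread_local versus the same method caching in storage owned by the caller.

// Hypothetical sketch, not the actual benchmark code.
struct Cache {
  double x = 0, y = 0;
  bool valid = false;
};

struct Interface {
  virtual ~Interface() = default;
  virtual double valueThreadLocal(double x) const = 0;               // caches in a thread_local
  virtual double valueLocalCache(double x, Cache& cache) const = 0;  // caches in caller-owned storage
};

struct Impl final : Interface {
  double valueThreadLocal(double x) const override {
    thread_local Cache cache;  // TLS lookup on every call, slower from a shared library
    return lookup(x, cache);
  }
  double valueLocalCache(double x, Cache& cache) const override {
    return lookup(x, cache);   // plain memory access to the caller's cache
  }
  static double lookup(double x, Cache& cache) {
    if (!cache.valid || cache.x != x) {
      cache.x = x;
      cache.y = x * x;  // stand-in for the expensive computation
      cache.valid = true;
    }
    return cache.y;
  }
};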

Design-wise a local cache suits expressing parallelism in terms of tasks better than a per-thread cache (implemented in whatever way). For example, if a job uses fewer streams than threads and has algorithm(s) using e.g. tbb::parallel_for, some threads are necessarily running tasks from different events. With a per-thread cache, it can be that a thread processes a task from event 1 (likely starting with a cache miss), then goes on to process a task from event 2 (likely starting with a cache miss), then a task from event 3 (likely starting with a cache miss), and so on. With a local cache, the cache can be copied for each task by the code creating the task, so tasks using the same volume as the caller start their work with a cache hit, as in the sketch below.
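A sketch of that last point, with hypothetical names and a minimal stand-in for the proposed copyable wrapper (see the local::MagneticField proposal later in this thread): each tbb task takes its own copy of the caller's cache, so tasks that stay in the same volume as the caller start with a cache hit and no cache state is ever shared between tasks.

#include <cstddef>
#include <vector>
#include <tbb/blocked_range.h>
#include <tbb/parallel_for.h>

struct Track {};  // hypothetical placeholder

namespace local {
  // Minimal stand-in for the proposed copyable wrapper: it carries a pointer to
  // the shared field plus its own cache state, so copies never share the cache.
  class MagneticField {
  public:
    explicit MagneticField(const void* field = nullptr) : field_(field) {}
  private:
    const void* field_;
    const void* cachedVolume_ = nullptr;
  };
}  // namespace local

void propagate(Track&, const local::MagneticField&) {}  // hypothetical placeholder

void propagateAll(const local::MagneticField& mf, std::vector<Track>& tracks) {
  tbb::parallel_for(tbb::blocked_range<std::size_t>(0, tracks.size()),
                    [&](const tbb::blocked_range<std::size_t>& range) {
                      local::MagneticField localMF = mf;  // one private cache copy per task
                      for (std::size_t i = range.begin(); i != range.end(); ++i)
                        propagate(tracks[i], localMF);
                    });
}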

Our memory needs typically grow with the number of streams (concurrent events) rather than with the number of threads. Even if today we use the same number of threads and streams, it may be that in the future we need to (or can) use fewer streams than threads. In such a case per-thread objects would consume more memory (or resources in general) than per-stream objects.

In general our experience so far is that getting per-thread work/storage to work properly with tasks is tricky (Geant4 integration being an extreme example). Also, local work/storage is easier to reason about, and therefore likely easier to maintain in the long term, than per-thread work/storage.

From the discussion in #29561 (comment) it seems that we are looking for a general pattern rather than a one-off workaround (#28180 (comment)), so Core advises against using a per-thread cache.

@slava77
Contributor

slava77 commented Apr 27, 2020

From the discussion in #29561 (comment) it seems that we are looking for a general pattern rather than a one-off workaround (#28180 (comment)), so Core advises against using a per-thread cache.

do I read this correctly as: we are back to #28284,
namely, no to thread_local or other variants like an array indexed by "int(this_thread::get_id())"?

@makortel
Contributor

From the discussion in #29561 (comment) it seems that we are looking for a general pattern rather than a one-off workaround (#28180 (comment)), so Core advises against using a per-thread cache.

do I read this correctly as: we are back to #28284,
namely, no to thread_local or other variants like an array indexed by "int(this_thread::get_id())"?

Yes, the "local cache" is preferred by Core as the most future-proof solution (in terms of both performance and maintenance).

@namapane
Contributor

namapane commented Apr 27, 2020

@makortel ,

  1. The overhead of TLS is obviously there; it has been measured by @amadio to be 0.34% in a REAL WORLD test of SIM (first row of the table). Thus, negligible.
    In a callgrind test of non-cached MF calls, I see an overhead of about 2.3% (MF calls only). This is the worst-case scenario, a job doing MF calls only, and it looks quite negligible as well.

  2. Your argument about the cache miss at the start of the event is, frankly, nonsense.
    The start of the event is likely to be in the tracker region, where we use a parametrization, for which we don't even look at the cache. For whatever happens next, or for any other scenario that does not start from the tracker, it is extremely unlikely that two events will make their first non-tracker MF call into the same volume, be it in SIM, RECO or HLT. There is simply no realistic scenario where this can happen.
    Besides, the real advantage of the cache is that it saves the volume search dozens or hundreds of times when stepping through a meters-long volume. Saving one search more or less is pretty much irrelevant.
    On the other hand, as I already pointed out, fragmenting the cache into hundreds of contexts may well result in more cache misses if some disconnected contexts work, e.g., sequentially on the same objects.

  3. Regarding memory growing with the number of streams, are we talking about an issue of 16 bytes per stream?? Or did I miss something?

  4. Finally, I disagree that a very invasive design that has to be implemented in 286 contexts in 130 packages is easier to maintain. It will make something as simple as calling the MF a very complicated matter for the average user.

@namapane
Contributor

namapane commented Apr 27, 2020

To elaborate: the advantage of the cache is essentially within a single "task", when that task has to call the MF a very large number of times in a limited region. I can't imagine any scenario where the first call of a task is not a cache miss; not only across events, but even for tasks repeated within the same event (e.g. tracking different particles in parallel).

@makortel
Contributor

@namapane
To repeat from #28180 (comment): for a one-shot use in the MagneticField the thread_local can be tolerated, but as a general pattern we advise against it.

@namapane
Contributor

Understood; but Slava's comment #28180 (comment) was not about general patterns, on which we all agree; it was about this specific use case. In this context your answer reads as a suggestion to reject #29561.

@makortel
Contributor

Sorry that I was unclear about the context. I had understood the discussion here such that a one-shot solution would be on the table. But then I got confused in #29561 (comment) by the discussion about a "precedent" and what exactly that would imply. Aiming not to spread the discussion too much, I thought this PR (which by now is very close to really being an issue) would be better for comments concerning general use beyond #29561.

@VinInn
Contributor

VinInn commented Apr 28, 2020

Fine, all this.

Can the Core team make a concrete proposal for a code pattern that avoids percolation?
I think it is the responsibility of the Core team to provide such a solution without dumping the burden on the end users.

@makortel
Contributor

I'm assuming the MagneticField is already percolated everywhere it is needed. The "local cache" proposal would then only change

// from
void function(const MagneticField& mf);
// to
void function(const local::MagneticField& mf);

(although for intra-module concurrent constructs like tbb::parallel_for, a copy of the local::MagneticField would need to be passed, as it should not itself be accessed concurrently).

The code getting the MagneticField from the EventSetup would change

// from
const MagneticField& mf = eventSetup.getData(magneticFieldToken_);
// to
local::MagneticField mf(&eventSetup.getData(magneticFieldToken_));

Core team can help in the migration.
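To make the ownership explicit, here is a hypothetical sketch of what such a wrapper could look like; it is only an illustration of the idea that the caller owns the cache while the underlying MagneticField stays shared and immutable, not the interface that was eventually adopted in #29631.

#include "MagneticField/Engine/interface/MagneticField.h"
#include "DataFormats/GeometryVector/interface/GlobalPoint.h"
#include "DataFormats/GeometryVector/interface/GlobalVector.h"

namespace local {
  class MagneticField {
  public:
    MagneticField() = default;
    explicit MagneticField(const ::MagneticField* field) : field_(field) {}

    // Re-point to a new field (e.g. after an IOV change) and drop the cache.
    void reset(const ::MagneticField* field) {
      field_ = field;
      cachedVolume_ = nullptr;
    }

    GlobalVector inTesla(const GlobalPoint& p) const {
      // A real implementation would consult and update cachedVolume_ here
      // instead of the shared cache inside the engine; this sketch just forwards.
      return field_->inTesla(p);
    }

  private:
    const ::MagneticField* field_ = nullptr;
    mutable const void* cachedVolume_ = nullptr;  // placeholder for the cached volume
  };
}  // namespace local

Since the wrapper is cheap to copy, intra-module tasks can each take their own copy, as in the tbb::parallel_for sketch earlier in the thread.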

@VinInn
Contributor

VinInn commented Apr 28, 2020

I'm assuming the MagneticField is already percolated everywhere it is needed.
and it was a pain: at the limit of not scaling (and there are still places where it is somehow locally cached).

@makortel
Contributor

(and there are still places where it is somehow locally cached).

But even there the local::MagneticField should work. The interface supports updating the cache, so a helper keeping the MagneticField const* as a member can be updated in the same manner:

// from
struct Helper {
  void update(EventSetup const& eventSetup) {
    mf = &eventSetup.getData(mfToken);
  }

  MagneticField const *mf;
};

// to
struct Helper {
  void update(EventSetup const& eventSetup) {
    mf.reset(&eventSetup.getData(mfToken));
  }

  local::MagneticField mf;
};
// and "->" to "." for mf access

@slava77
Contributor

slava77 commented Apr 28, 2020

I'm assuming the MagneticField is already percolated everywhere it is needed. The "local cache" proposal would then only change

Please clarify whether this proposal requires every user of the MF to change to it, or whether it can be done gradually while both interfaces are supported.
From the discussion last fall (at least) I was still thinking that both local and global caching can coexist.

@makortel
Contributor

I'm assuming the MagneticField is already percolated everywhere it is needed. The "local cache" proposal would then only change

Please clarify whether this proposal requires every user of the MF to change to it, or whether it can be done gradually while both interfaces are supported.
From the discussion last fall (at least) I was still thinking that both local and global caching can coexist.

Correct, the "local cache" and the current "shared std::atomic cache" can coexist and a migration can be done gradually.

@smuzaffar
Contributor

smuzaffar commented Apr 28, 2020 via email

@namapane
Contributor

The code getting the MagneticField from the EventSetup would change

// from
const MagneticField& mf = eventSetup.getData(magneticFieldToken_);
// to
local::MagneticField mf(&eventSetup.getData(magneticFieldToken_));

Doesn't the caller have to own the cache? At least, in the original proposal (#28180 (comment))
the local::MagneticField(::MagneticField const* field) seems to be cache-less.

@makortel
Contributor

The code getting the MagneticField from the EventSetup would change

// from
const MagneticField& mf = eventSetup.getData(magneticFieldToken_);
// to
local::MagneticField mf(&eventSetup.getData(magneticFieldToken_));

Doesn't the caller have to own the cache? At least, in the original proposal (#28180 (comment))
the local::MagneticField(::MagneticField const* field) seems to be cache-less.

Yes, the caller (or the local::MagneticField, to be precise) owns the cache (MagneticFieldCache). In the example above it is alive until the scope ends. It can also be made a member of a helper class (or even of an EDModule, but I'd expect that to be unnecessary, in most cases at least).

@makortel
Contributor

makortel commented May 4, 2020

I implemented the "local cache" approach in #29631.
