
Fix a scaling issue in MagneticField/VolumeBasedEngine #28180

Closed
wants to merge 2 commits

Conversation

@amadio

amadio commented Oct 15, 2019

PR description:

In commit 0f77253, MagGeometry was made thread-safe by using an std::atomic<> to cache the last volume used. This means that all threads share a single cached volume, even though they work on unrelated events, so the cached volume is usually not the right one for the thread that reads it. The atomic also hurts concurrency: whenever a thread updates the cached volume, the CPU cache line holding it is invalidated, forcing all other threads to re-fetch the value from memory and pay the extra latency. Since the cached volume is usually not suitable for them, each thread is in turn likely to replace it with its own value, which again penalizes the other threads, and so on. This significantly slows down the whole simulation, because the magnetic field is queried many times per track to integrate its trajectory.
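For illustration, here is a minimal sketch of the two caching patterns described above, with hypothetical, simplified names (this is not the actual MagGeometry code): the shared std::atomic cache that all threads contend on, and a per-thread cache of the kind introduced by this pull request.

#include <atomic>

struct MagVolume;                                              // stands in for the real volume type
const MagVolume* fullVolumeSearch(float x, float y, float z);  // hypothetical slow volume lookup
bool contains(const MagVolume* v, float x, float y, float z);  // hypothetical containment test

class CachedGeometry {
public:
  // Before: one cached volume shared by every thread through an std::atomic.
  // Every store invalidates the cache line on all other cores, and the cached
  // volume usually belongs to another thread's unrelated event anyway.
  const MagVolume* findVolumeShared(float x, float y, float z) const {
    const MagVolume* v = lastVolume_.load(std::memory_order_acquire);
    if (v && contains(v, x, y, z))
      return v;
    v = fullVolumeSearch(x, y, z);
    lastVolume_.store(v, std::memory_order_release);
    return v;
  }

  // After: each thread keeps its own cached volume, so there is no sharing
  // between threads and no cache-line ping-pong.
  const MagVolume* findVolumeLocal(float x, float y, float z) const {
    static thread_local const MagVolume* lastVolume = nullptr;
    if (lastVolume && contains(lastVolume, x, y, z))
      return lastVolume;
    lastVolume = fullVolumeSearch(x, y, z);
    return lastVolume;
  }

private:
  mutable std::atomic<const MagVolume*> lastVolume_{nullptr};
};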

This problem has been profiled and benchmarked with Intel VTune Amplifier against the cmssw branch CMSSW_11_0_GEANT4_X_2019-10-10-2300. The results are shown below, labeled before and after the changes from this pull request. In all cases, hyper-threading was disabled and the threads were pinned using taskset with an appropriate number of CPUs. All runs were performed on a dual-socket Intel Xeon E5-2698 v3 machine with 64 GB of RAM. We simulate minimum bias events at Ecm = 13 TeV, and vary the total number of events and the number of threads in each analysis. The cmsDriver command used to create the configuration, and the cmsRun command used to run it, are shown below:

cmsDriver.py MinBias_13TeV_cfi --era Run2_2018 -s GEN,SIM --pileup=NoPileUp \
  --conditions auto:phase1_2018_realistic --geometry DB:Extended \
  --eventcontent=RAWSIM --datatier GEN-SIM --dirout=./ -n $nevents --nThreads $nthreads \
  --fileout file:scaling.root --mc --no_exec

cmsRun MinBias_13TeV_cfi_GEN_SIM.py

PR validation:

There is no functional change introduced by this pull request; it is purely a performance optimization. The only other difference is that some barrel properties are now computed once at initialization rather than at every call to the magnetic field, which seemed like a sensible extra optimization.

Here is a plot showing the scaling behavior before and after the changes:

A throughput table before and after the changes is shown below. Throughput is calculated as (1024 events) / Δt, where Δt = t₂ − t₁, t₁ is the start time of processing of the first event, and t₂ is the end time of the job (i.e. an estimate of the total time minus the initialization time).

Threads   Throughput (events/s) [before]   Throughput (events/s) [after]   Change [%]
      1                            0.290                           0.289        -0.34
      2                            0.609                           0.594        -2.46
      4                            1.166                           1.206        +3.43
      8                            1.903                           1.950        +2.47
     12                            2.767                           2.868        +3.65
     16                            3.368                           3.390        +0.65
     24                            4.675                           5.120        +9.52
     32                            4.471                           6.059       +35.52
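For example, at 32 threads the "after" throughput of 6.059 events/s corresponds to Δt = 1024/6.059 ≈ 169 s, compared with 1024/4.471 ≈ 229 s before the change.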

Comparison of hotspots analysis at 32 threads in Intel VTune Amplifier (before - after):

Comparison of micro-architecture analysis at 32 threads in Intel VTune Amplifier (before - after):
[screenshot]

Bottom-up micro-architecture comparison of before and after for 32 threads in Intel VTune Amplifier:
[screenshot]

Best regards,
—Guilherme

@cmsbuild
Contributor

The code-checks are being triggered in jenkins.

@amadio
Author

amadio commented Oct 15, 2019

@civanch Please review. Thank you.

@cmsbuild
Contributor

-code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-28180/12257

  • This PR adds an extra 20KB to the repository

Code checks have found code style and quality issues which could be resolved by applying the following patch(es).

@civanch
Contributor

civanch commented Oct 15, 2019

@amadio , this is a very good fix. You need to fix the code-format issues, and maybe remove the "cout" that is left in the code.

@amadio
Author

amadio commented Oct 15, 2019

@civanch, thank you. I wondered if it was OK to remove it, since it can be a performance issue and it is easy to add back if needed, but I decided to make only minimal changes. I will fix the formatting and make other improvements, like removing the printout calls.

@cmsbuild
Contributor

The code-checks are being triggered in jenkins.

@cmsbuild
Contributor

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-28180/12278

  • This PR adds an extra 20KB to the repository

@cmsbuild
Contributor

A new Pull Request was created by @amadio (Guilherme Amadio) for master.

It involves the following packages:

MagneticField/GeomBuilder
MagneticField/VolumeBasedEngine

@perrotta, @cmsbuild, @slava77 can you please review it and eventually sign? Thanks.
@namapane this is something you requested to watch as well.
@davidlange6, @slava77, @fabiocos you are the release manager for this.

cms-bot commands are listed here

@perrotta
Contributor

please test

@cmsbuild
Contributor

cmsbuild commented Oct 16, 2019

The tests are being triggered in jenkins.
https://cmssdt.cern.ch/jenkins/job/ib-run-pr-tests/2984/console Started: 2019/10/16 14:59

@amadio
Author

amadio commented Oct 16, 2019

@perrotta I have a question: is the field in SimG4Core/MagneticField/interface/Field.h a global, or is a Field instance owned by each thread? The oldx and oldb cached values should probably not be shared between threads either.

@namapane
Contributor

namapane commented Apr 24, 2020

So, as anticipated, I made #29556 with minor optimizations from this one (revisited) plus the couts cleanup discussed above.
I think it would be interesting to run some profiling of SIM with @VinInn 's proposal while the discussion evolves. I can implement it in my cmssw repo on top of the above PR if anybody from SIM is willing to give it a try.

@VinInn
Contributor

VinInn commented Apr 24, 2020

maybe we should move the discussion to an "issue"

@civanch
Contributor

civanch commented Apr 24, 2020

@namapane , whatever you propose as a PR, we will try to test it for CPU efficiency promptly.

@VinInn , I am trying to convince my Geant4 colleagues to remove as many thread_local variables as possible, but in fact this clean-up is going slowly. Maybe some quantitative analysis can help.

@namapane
Contributor

namapane commented Apr 25, 2020

@civanch , I pushed a branch in my clone:
https://github.com/namapane/cmssw/tree/MF_TLScache
I am not sure it is appropriate to make a PR for it, given the ongoing discussion.
But if that makes it technically easier for you to pick it up, and if there is no objection, I can do so.

Some technicalities:

  • It is made on top of Magnetic Field optimization #29556 (i.e. it includes it)
  • It passes the MF regression tests
  • I added a test that creates several different IOVs in the same job and verified that the instance counters behave correctly, at least when IOV changes happen sequentially in a single thread.

I don't know how to test IOV transitions across multiple threads. I tried setting several events per run (i.e. repeating the regression test once for each event) so that different events in the same run and IOV can be processed in parallel. I think this is not happening, at least according to the output:

Begin processing the 1st record. Run 300001, Event 1, LumiSection 1 on stream 0 at 25-Apr-2020 11:40:17.085 CEST
Begin processing the 2nd record. Run 300001, Event 2, LumiSection 1 on stream 0 at 25-Apr-2020 11:40:24.426 CEST
Begin processing the 3rd record. Run 300001, Event 3, LumiSection 1 on stream 0 at 25-Apr-2020 11:40:25.546 CEST
Begin processing the 4th record. Run 300001, Event 4, LumiSection 1 on stream 0 at 25-Apr-2020 11:40:26.676 CEST
Begin processing the 5th record. Run 300001, Event 5, LumiSection 1 on stream 0 at 25-Apr-2020 11:40:27.793 CEST
Begin processing the 6th record. Run 300002, Event 1, LumiSection 2 on stream 0 at 25-Apr-2020 11:40:28.927 CEST
Begin processing the 7th record. Run 300002, Event 2, LumiSection 2 on stream 0 at 25-Apr-2020 11:40:35.809 CEST
Begin processing the 8th record. Run 300002, Event 3, LumiSection 2 on stream 0 at 25-Apr-2020 11:40:36.995 CEST
Begin processing the 9th record. Run 300002, Event 4, LumiSection 2 on stream 0 at 25-Apr-2020 11:40:38.185 CEST
...

I interpret "on stream 0" to mean they are all in the same thread (right?)

Do I need to/can I instruct the FW to run different events in parallel?

@civanch
Contributor

civanch commented Apr 25, 2020

@namapane , these days it would be easier for me if this were a PR. Can I simply test #29556?

@namapane
Contributor

No, #29556 does not contain the TLS cache, only some minor improvements.
I will make a PR so that you can test it.

@slava77
Contributor

slava77 commented Apr 25, 2020

Strictly speaking, in cmssw we need just ONE global thread_local (a small integer) that identifies which thread one is in.
This thread_local variable can be located in the main program and accessed through a simple function.
It can be set by the framework before a module is called, or just autonomously in the function itself (if (0==threadId) threadId = ++atomicThreadCounter;).
Then everybody can build their preferred container indexed by such an identifier. (If the FW provides a global function telling how many threads there are, even better.)

Would std::this_thread::get_id() work?
If it doesn't cycle up to infinity due to task start/stop cycles,
then even a map (or a vector?) of caches would work.
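A minimal sketch of this idea, with hypothetical names (not framework code): a single thread_local integer assigned lazily from an atomic counter, which any client can then use to index its own per-thread container.

#include <atomic>
#include <cassert>
#include <vector>

// One global thread index per thread, assigned lazily the first time the
// thread asks for it (equivalent to the "if (0==threadId)" idiom above).
inline unsigned threadIndex() {
  static std::atomic<unsigned> counter{0};
  thread_local unsigned id = counter++;
  return id;
}

// A client can then keep one cache slot per thread, e.g. a cached volume pointer.
struct PerThreadVolumeCache {
  explicit PerThreadVolumeCache(unsigned maxThreads) : slots(maxThreads, nullptr) {}
  const void*& slot() {
    assert(threadIndex() < slots.size());  // assumes the FW can tell us how many threads exist
    return slots[threadIndex()];
  }
  std::vector<const void*> slots;
};

In practice the slots would also need padding to cache-line size to avoid false sharing between adjacent entries.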

@namapane
Contributor

@civanch, done: it's #29561.

@makortel
Contributor

@namapane

I interpret "on stream 0" to mean they are all in the same thread (right?)

Strictly speaking, no. It means that there is one concurrent event in flight (or one concurrent "EDM stream" processing events). The default is to use one thread (which is likely what you observe), although technically multiple threads can also be used with a single stream.

Do I need to/can I instruct the FW to run different events in parallel?

The number of threads to use can be set on the command line with cmsRun -n <NT> config.py. The framework will then use the same number of streams (concurrent events) as threads. The two (threads and streams) can also be controlled separately in the configuration with

process.options.numberOfThreads = <NT>
process.options.numberOfStreams = <NS> # default is 0, which means to use the number of threads

If you want to try out concurrent lumis, that can be enabled with

process.options.numberOfConcurrentLuminosityBlocks = 2

@makortel
Contributor

@Dr15Jones did a little test of crafting a virtual interface and an implementation that does thread_local caching in a shared library, and compared that to local caching (code here). The test confirms the observation mentioned e.g. in this blog post: in conjunction with shared libraries, accessing a thread_local is slower than accessing a regular variable. In Chris' test the thread_local access was 1.5-2x slower than accessing a "local cache" (in the case where all accesses are cache hits).
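For concreteness, here is a sketch of the two access patterns being compared, with hypothetical code (not Chris' actual test): a virtual method that caches in a thread_local versus the same method caching in storage owned by the caller.

// Hypothetical sketch, not the actual benchmark code.
struct Cache {
  double x = 0, y = 0;
  bool valid = false;
};

struct Interface {
  virtual ~Interface() = default;
  virtual double valueThreadLocal(double x) const = 0;               // caches in a thread_local
  virtual double valueLocalCache(double x, Cache& cache) const = 0;  // caches in caller-owned storage
};

struct Impl final : Interface {
  double valueThreadLocal(double x) const override {
    thread_local Cache cache;  // TLS lookup on every call, slower from a shared library
    return lookup(x, cache);
  }
  double valueLocalCache(double x, Cache& cache) const override {
    return lookup(x, cache);   // plain memory access to the caller's cache
  }
  static double lookup(double x, Cache& cache) {
    if (!cache.valid || cache.x != x) {
      cache.x = x;
      cache.y = x * x;  // stand-in for the expensive computation
      cache.valid = true;
    }
    return cache.y;
  }
};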

Design-wise a local cache suits expressing parallelism in terms of tasks better than a per-thread cache (implemented in whatever way). For example, if a job uses fewer streams than threads and has algorithm(s) using e.g. tbb::parallel_for, some threads are necessarily running tasks from different events. With a per-thread cache, it can be that a thread processes a task from event 1 (likely starting with a cache miss), then goes on to process a task from event 2 (likely starting with a cache miss), then a task from event 3 (likely starting with a cache miss), and so on. With a local cache, the cache can be copied for each task by the code creating the task, so tasks using the same volume as the caller start their work with a cache hit, as in the sketch below.
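A sketch of that last point, with hypothetical names and a minimal stand-in for the proposed copyable wrapper (see the local::MagneticField proposal later in this thread): each tbb task takes its own copy of the caller's cache, so tasks that stay in the same volume as the caller start with a cache hit and no cache state is ever shared between tasks.

#include <cstddef>
#include <vector>
#include <tbb/blocked_range.h>
#include <tbb/parallel_for.h>

struct Track {};  // hypothetical placeholder

namespace local {
  // Minimal stand-in for the proposed copyable wrapper: it carries a pointer to
  // the shared field plus its own cache state, so copies never share the cache.
  class MagneticField {
  public:
    explicit MagneticField(const void* field = nullptr) : field_(field) {}
  private:
    const void* field_;
    const void* cachedVolume_ = nullptr;
  };
}  // namespace local

void propagate(Track&, const local::MagneticField&) {}  // hypothetical placeholder

void propagateAll(const local::MagneticField& mf, std::vector<Track>& tracks) {
  tbb::parallel_for(tbb::blocked_range<std::size_t>(0, tracks.size()),
                    [&](const tbb::blocked_range<std::size_t>& range) {
                      local::MagneticField localMF = mf;  // one private cache copy per task
                      for (std::size_t i = range.begin(); i != range.end(); ++i)
                        propagate(tracks[i], localMF);
                    });
}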

Our memory needs typically grow with the number of streams (concurrent events) rather than with the number of threads. Even if today we use the same number of threads and streams, it may be that in the future we need to (or can) use fewer streams than threads. In such a case per-thread objects would consume more memory (or resources in general) than per-stream objects.

In general our experience so far is that getting per-thread work/storage to work properly with tasks is tricky (Geant4 integration being an extreme example). Also, local work/storage is easier to reason about, and therefore likely easier to maintain in the long term, than per-thread work/storage.

From the discussion in #29561 (comment) it seems that we are looking for a general pattern rather than a one-off workaround (#28180 (comment)), so Core advises against using a per-thread cache.

@slava77
Contributor

slava77 commented Apr 27, 2020

From the discussion in #29561 (comment) it seems that we are looking for a general pattern rather than a one-off workaround (#28180 (comment)), so Core advises against using a per-thread cache.

do I read this correctly as: we are back to #28284,
namely, no to thread_local or other variants like an array indexed by "int(this_thread::get_id())"?

@makortel
Contributor

From the discussion in #29561 (comment) it seems that we are looking for a general pattern rather than a one-off workaround (#28180 (comment)), so Core advises against using a per-thread cache.

do I read this correctly as: we are back to #28284,
namely, no to thread_local or other variants like an array indexed by "int(this_thread::get_id())"?

Yes, the "local cache" is preferred by Core as the most future-proof solution (in terms of both performance and maintenance).

@namapane
Contributor

namapane commented Apr 27, 2020

@makortel ,

  1. The overhead of TLS is obviously there; it has been measured by @amadio to be 0.34% in a REAL WORLD test of SIM (first row of the table). Thus, negligible.
    In a callgrind test of non-cached MF calls, I see an overhead of about 2.3% (MF calls only). This is the worst-case scenario, a job doing MF calls only, and it looks quite negligible as well.

  2. Your argument about the cache miss at the start of the event is, frankly, nonsense.
    The start of the event is likely to be in the tracker region, where we use a parametrization, for which we don't even look at the cache. For whatever happens next, or for any other scenario that does not start from the tracker, it is extremely unlikely that two events will make their first non-tracker MF call into the same volume, be it in SIM, RECO or HLT. There is simply no realistic scenario where this can happen.
    Besides, the real advantage of the cache is that it saves the volume search dozens or hundreds of times when stepping through a meters-long volume. Saving one search more or less is pretty much irrelevant.
    On the other hand, as I already pointed out, fragmenting the cache into hundreds of contexts may well result in more cache misses if some disconnected contexts work, e.g., sequentially on the same objects.

  3. Regarding memory growing with the number of streams, are we talking about an issue of 16 bytes per stream?? Or did I miss something?

  4. Finally, I disagree that a very invasive design that has to be implemented in 286 contexts in 130 packages is easier to maintain. It will make something as simple as calling the MF a very complicated matter for the average user.

@namapane
Contributor

namapane commented Apr 27, 2020

To elaborate: the advantage of the cache is essentially within a single "task", when that task has to call the MF a very large number of times in a limited region. I can't imagine any scenario where the first call of a task is not a cache miss; not only across events, but even for tasks repeated within the same event (e.g. tracking different particles in parallel).

@makortel
Contributor

@namapane
To repeat from #28180 (comment): for a one-shot use in the MagneticField the thread_local can be tolerated, but as a general pattern we advise against it.

@namapane
Contributor

Understood; but Slava's comment #28180 (comment) was not about general patterns, on which we all agree; it was about this specific use case. In this context your answer reads as a suggestion to reject #29561.

@makortel
Contributor

Sorry that I was unclear about the context. I had understood the discussion here such that a one-shot solution would be on the table. But then I got confused in #29561 (comment) by the discussion about a "precedent" and what exactly that would imply. Aiming not to spread the discussion too much, I thought this PR (which by now is very close to really being an issue) would be better for comments concerning general use beyond #29561.

@VinInn
Contributor

VinInn commented Apr 28, 2020

Fine, all this.

Can the Core team make a concrete proposal for a code pattern that avoids percolation?
I think it is the responsibility of the Core team to provide such a solution without dumping the burden on the end users.

@makortel
Contributor

I'm assuming the MagneticField is already percolated everywhere it is needed. The "local cache" proposal would then only change

// from
void function(const MagneticField& mf);
// to
void function(const local::MagneticField& mf);

(although for intra-module concurrent constructs like tbb::parallel_for, a copy of the local::MagneticField would need to be passed, as it should not itself be accessed concurrently).

The code getting the MagneticField from the EventSetup would change

// from
const MagneticField& mf = eventSetup.getData(magneticFieldToken_);
// to
local::MagneticField mf(&eventSetup.getData(magneticFieldToken_));

Core team can help in the migration.
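To make the ownership explicit, here is a hypothetical sketch of what such a wrapper could look like; it is only an illustration of the idea that the caller owns the cache while the underlying MagneticField stays shared and immutable, not the interface that was eventually adopted in #29631.

#include "MagneticField/Engine/interface/MagneticField.h"
#include "DataFormats/GeometryVector/interface/GlobalPoint.h"
#include "DataFormats/GeometryVector/interface/GlobalVector.h"

namespace local {
  class MagneticField {
  public:
    MagneticField() = default;
    explicit MagneticField(const ::MagneticField* field) : field_(field) {}

    // Re-point to a new field (e.g. after an IOV change) and drop the cache.
    void reset(const ::MagneticField* field) {
      field_ = field;
      cachedVolume_ = nullptr;
    }

    GlobalVector inTesla(const GlobalPoint& p) const {
      // A real implementation would consult and update cachedVolume_ here
      // instead of the shared cache inside the engine; this sketch just forwards.
      return field_->inTesla(p);
    }

  private:
    const ::MagneticField* field_ = nullptr;
    mutable const void* cachedVolume_ = nullptr;  // placeholder for the cached volume
  };
}  // namespace local

Since the wrapper is cheap to copy, intra-module tasks can each take their own copy, as in the tbb::parallel_for sketch earlier in the thread.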

@VinInn
Contributor

VinInn commented Apr 28, 2020

I'm assuming the MagneticField is already percolated everywhere it is needed.
and it was a pain: at the limit of not scaling (and there are still places where it is somehow locally cached).

@makortel
Contributor

(and there are still places where it is somehow locally cached).

But even there the local::MagneticField should work. The interface supports updating the cache, so a helper keeping the MagneticField const* as a member can be updated in the same manner:

// from
struct Helper {
  void update(EventSetup const& eventSetup) {
    mf = &eventSetup.getData(mfToken);
  }

  MagneticField const *mf;
};

// to
struct Helper {
  void update(EventSetup const& eventSetup) {
    mf.reset(&eventSetup.getData(mfToken));
  }

  local::MagneticField mf;
};
// and "->" to "." for mf access

@slava77
Contributor

slava77 commented Apr 28, 2020

I'm assuming the MagneticField is already percolated everywhere it is needed. The "local cache" proposal would then only change

Please clarify whether this proposal requires every user of the MF to change to it, or whether it can be done gradually while both interfaces are supported.
From the discussion last fall (at least) I was still thinking that both local and global caching can coexist.

@makortel
Contributor

I'm assuming the MagneticField is already percolated everywhere it is needed. The "local cache" proposal would then only change

Please clarify whether this proposal requires every user of the MF to change to it, or whether it can be done gradually while both interfaces are supported.
From the discussion last fall (at least) I was still thinking that both local and global caching can coexist.

Correct, the "local cache" and the current "shared std::atomic cache" can coexist and a migration can be done gradually.

@smuzaffar
Contributor

smuzaffar commented Apr 28, 2020 via email

@namapane
Contributor

The code getting the MagneticField from the EventSetup would change

// from
const MagneticField& mf = eventSetup.getData(magneticFieldToken_);
// to
local::MagneticField mf(&eventSetup.getData(magneticFieldToken_));

Doesn't the caller have to own the cache? At least, in the original proposal (#28180 (comment))
the local::MagneticField(::MagneticField const* field) seems to be cache-less.

@makortel
Contributor

The code getting the MagneticField from the EventSetup would change

// from
const MagneticField& mf = eventSetup.getData(magneticFieldToken_);
// to
local::MagneticField mf(&eventSetup.getData(magneticFieldToken_));

Doesn't the caller have to own the cache? At least, in the original proposal (#28180 (comment))
the local::MagneticField(::MagneticField const* field) seems to be cache-less.

Yes, the caller (or the local::MagneticField, to be precise) owns the cache (MagneticFieldCache). In the example above it is alive until the scope ends. It can also be made a member of a helper class (or even of an EDModule, but I'd expect that to be unnecessary, in most cases at least).

@makortel
Contributor

makortel commented May 4, 2020

I implemented the "local cache" approach in #29631.
