Fix a scaling issue in MagneticField/VolumeBasedEngine #28180
Conversation
The code-checks are being triggered in jenkins. |
@civanch Please review. Thank you. |
-code-checks Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-28180/12257
Code check has found code style and quality issues which could be resolved by applying the following patch(es). |
@amadio , this is a very good fix. You need to fix the code-format issues and maybe remove the "cout" which is left in the code. |
@civanch, thank you. I wondered if it was OK to remove it, since it can be a performance issue and it's easy to add back if needed, but decided to make only the minimal changes. I will fix the formatting and make other improvements like removing the printout calls. |
Force-pushed from 04120d7 to 510a62d
The code-checks are being triggered in jenkins. |
+code-checks Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-28180/12278 |
A new Pull Request was created by @amadio (Guilherme Amadio) for master. It involves the following packages: MagneticField/GeomBuilder. @perrotta, @cmsbuild, @slava77 can you please review it and eventually sign? Thanks. cms-bot commands are listed here |
please test |
The tests are being triggered in jenkins. |
@perrotta I have a question. Is the field in |
So as anticipated I made #29556 with minor optimizations from this one (revisited) plus the cout cleanup discussed above. |
maybe we should move the discussion to an "issue" |
@civanch , I pushed a branch in my clone: Some technicalities:
I don't know how to test the IOV transition across multiple threads. I tried to set several events per run (i.e. repeating the regression test once for each event) so that different events in the same run and IOV can be processed in parallel. I think this is not happening, at least according to the output:
I interpret "on stream 0" to mean they are all in the same thread (right?) Do I need to/can I instruct the FW to run different events in parallel? |
No, #29556 does not contain the TLS cache, only some minor improvements. |
would |
Strictly speaking no. It means that there is one concurrent event in flight (or one concurrent "EDM stream" processing events). The default is to use one thread (which is likely what you observe), although technically multiple threads can also be used with a single stream.
The number of threads and streams to use can be set in the job configuration:
process.options.numberOfThreads = <NT>
process.options.numberOfStreams = <NS>  # default is 0, which means to use the number of threads
If you want to try out concurrent lumis, that can be enabled with
process.options.numberOfConcurrentLuminosityBlocks = 2 |
@Dr15Jones did a little test of crafting a virtual interface and an implementation that does a
Design-wise a local cache fits better with expressing parallelism in terms of tasks than a per-thread cache (implemented in whatever way). For example, if a job uses fewer streams than threads and has algorithm(s) using e.g.
Our memory needs typically grow with the number of streams (concurrent events) rather than the number of threads. Even if today we use the same number of threads and streams, it may be that in the future we need to (or can) use fewer streams than threads. In such a case per-thread objects would consume more memory (or resources in general) than per-stream objects.
In general our experience so far is that getting per-thread work/storage to work properly with tasks is tricky (Geant4 integration being an extreme example). Also, local work/storage is easier to reason about, and therefore likely easier to maintain in the long term, than per-thread work/storage.
From the discussion in #29561 (comment) it seems that we are looking for a general pattern rather than a one-off workaround (#28180 (comment)), so core advises against using a per-thread cache. |
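To make the contrast concrete, here is a minimal sketch of the two alternatives; MagVolume and propagateTrack are hypothetical stand-ins, not the real CMSSW classes:

struct MagVolume {};  // stand-in for the real volume class

// (a) per-thread cache: the cached state is tied to the thread, so it is
// shared between whatever tasks happen to be scheduled on that thread,
// and its lifetime is decoupled from the work that filled it
thread_local const MagVolume* tlsLastVolume = nullptr;

// (b) local cache: the caller owns the cached state, so its lifetime and
// visibility follow the task rather than the thread
void propagateTrack() {
  const MagVolume* lastVolume = nullptr;  // lives only for this unit of work
  // ... repeated field lookups read and update lastVolume ...
  (void)lastVolume;
}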
do I read this correctly as we are back to #28284, |
Yes, the "local cache" is the solution preferred by core as the most future proof solution (in terms of both performance and maintenance). |
To elaborate: the advantage of the cache is essentially within a single "task" when that task has to call the MF a very large number of times in a limited region. I can't imagine any scenario where the first call of a task is not a cache miss; not only across events, but even for tasks repeated within the same event (e.g. tracking different particles in parallel). |
@namapane |
Understood; but Slava's comment #28180 (comment) was not about general patterns, on which we all agree; it was about this specific use case. In this context your answer reads as a suggestion to reject #29561. |
Sorry that I was unclear about the context. I understood the discussion here to mean that a one-shot solution would be on the table. But then I got confused in #29561 (comment) by the discussion about "precedent" and what exactly that would imply. Aiming not to spread the discussion too much, I thought this PR (which is very close to really being an issue now) would be better for comments concerning general use beyond #29561. |
All this is fine. Can the Core team make a concrete proposal for a code pattern that avoids percolation? |
I'm assuming the MagneticField is already percolated everywhere it is needed. The "local cache" proposal would then only change

// from
void function(const MagneticField& mf);
// to
void function(const local::MagneticField& mf);

(although for intra-module concurrent constructs like …). The code getting the MagneticField from the EventSetup would change

// from
const MagneticField& mf = eventSetup.getData(magneticFieldToken_);
// to
local::MagneticField mf(&eventSetup.getData(magneticFieldToken_));

Core team can help in the migration. |
But even there the change would be

// from
struct Helper {
  void update(EventSetup const& eventSetup) {
    mf = &eventSetup.getData(mfToken);
  }
  MagneticField const* mf;
};
// to
struct Helper {
  void update(EventSetup const& eventSetup) {
    mf.reset(&eventSetup.getData(mfToken));
  }
  local::MagneticField mf;
};
// and "->" to "." for mf access |
Please clarify if this proposal requires that every user of the MF change to it, or whether it can be done gradually while both interfaces are supported. From the discussion last fall (at least) I was still thinking that both local and global caching can coexist. |
Correct, the "local cache" and the current "shared |
Thanks. For one of the sim warnings I do not know what the correct fix should be, so I had opened issue #29569: https://github.com/cms-sw/cmssw/issues/29569 |
Doesn't the caller have to own the cache? At least, in the original proposal (#28180 (comment)) |
Yes, the caller (or |
I implemented the "local cache" approach in #29631. |
PR description:
In commit 0f77253, the MagGeometry was made thread-safe by caching the last volume used in an std::atomic<>. This means that all threads, despite working on unrelated events, share a common cached volume, which is unsuitable most of the time. The atomic also hurts concurrency: whenever a thread caches a volume, the CPU cache line is invalidated, forcing all other threads to fetch the cached volume from memory every time and suffer the higher latency. Since that volume is usually unsuitable, each thread is then likely to replace it with a new value, which again invalidates it for the other threads, and so on. This significantly slows down the whole simulation, since the magnetic field is called many times for each track to integrate its trajectory.
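To illustrate the pattern in question (simplified, with hypothetical names; not the actual MagGeometry code):

#include <atomic>

struct MagVolume;  // opaque stand-in for the real volume class

class SharedCacheGeometry {
public:
  const MagVolume* findVolume(/* const GlobalPoint& gp */) const {
    const MagVolume* v = lastVolume_.load(std::memory_order_acquire);
    if (v /* && v->inside(gp) */)  // usually fails: another thread overwrote it
      return v;
    v = fullSearch(/* gp */);
    // this store invalidates the cache line holding lastVolume_ on every
    // other core, forcing them all to re-fetch it from memory
    lastVolume_.store(v, std::memory_order_release);
    return v;
  }

private:
  const MagVolume* fullSearch() const;  // declaration only in this sketch
  mutable std::atomic<const MagVolume*> lastVolume_{nullptr};
};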
This problem has been profiled and benchmarked in Intel's VTune Amplifier against cmssw branch CMSSW_11_0_GEANT4_X_2019-10-10-2300. The results are shown below with labels before and after the changes from this pull request. In all cases, hyper-threading was disabled, and the threads were pinned using taskset with an appropriate number of CPUs. The machine where all runs were performed is a dual-socket Intel Xeon E5-2698 v3 with 64 GB of RAM. We simulate minimum-bias events at Ecm = 13 TeV, and vary the total number of events and the number of threads in each analysis. The cmsDriver command used to create the input file is shown below:
PR validation:
There is no functional change introduced by this pull request; it is purely a performance optimization. The only difference is that some barrel properties are now computed only once at initialization rather than at every call to the magnetic field, which seemed like a sensible extra optimization.
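As an illustration of that kind of change (hypothetical names, not the actual code): quantities that depend only on the geometry are moved from the per-call path into the constructor.

class BarrelVolume {
public:
  BarrelVolume(double rMin, double rMax)
      : rMin2_(rMin * rMin), rMax2_(rMax * rMax) {}  // computed once at initialization

  // the per-call path no longer repeats the multiplications
  bool containsR2(double r2) const { return r2 >= rMin2_ && r2 < rMax2_; }

private:
  const double rMin2_;
  const double rMax2_;
};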
Here is a plot showing the scaling before and after the changes:
A throughput table before and after the changes is shown below. Throughput is calculated as (1024 events) / Δt, where Δt = (t₂ - t₁), t₁ is the start time for processing the first event, and t₂ is the end of the job (i.e. estimate of total time minus initialization time).
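For an illustrative calculation (invented numbers, not taken from the table): if the first event starts at t₁ = 100 s and the job ends at t₂ = 612 s, then Δt = 512 s and the throughput is 1024 events / 512 s = 2 events/s.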
Comparison of hotspots analysis at 32 threads in Intel VTune Amplifier (before - after):
Comparison of micro-architecture analysis at 32 threads in Intel VTune Amplifier (before - after):
Bottom-up micro-architecture comparison of before and after for 32 threads in Intel VTune Amplifier:
Best regards,
—Guilherme