
Tentative implementation of TLS cache for MF #29561

Merged
3 commits merged on May 12, 2020

Conversation

namapane
Contributor

namapane commented Apr 25, 2020

PR description:

Tentative implementation of a per-thread MF volume cache, as suggested by @VinInn in the discussion of #28180 (#28180 (comment)), intended to allow further testing.
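
For readers not familiar with the idea, here is a minimal self-contained sketch (not the code in this PR) of the caching pattern under discussion: each thread remembers the last volume that contained its query point and checks it before falling back to the full search. Volume, Point and findVolumeSlow are stand-ins for the corresponding MagneticField classes and lookup.

// Sketch only: thread-local cache of the last volume that answered a query.
struct Point { double x, y, z; };
struct Volume {
  bool inside(Point const& p) const;  // stand-in for MagVolume::inside
};

Volume const* findVolumeSlow(Point const& p);  // the full (expensive) volume search

namespace {
  thread_local Volume const* lastVolume = nullptr;
}

Volume const* findVolume(Point const& p) {
  if (lastVolume != nullptr && lastVolume->inside(p))
    return lastVolume;              // hit: consecutive queries usually land in the same volume
  lastVolume = findVolumeSlow(p);   // miss: do the full search and remember the result
  return lastVolume;
}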

PR validation:

Regression-tested for identical values against reference files for all maps/currents/eras.

Also verified in a multithreaded job spanning multiple IOVs that the cache gets invalidated when necessary, as expected.

@cmsbuild
Contributor

The code-checks are being triggered in jenkins.

@cmsbuild
Contributor

-code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-29561/14859

The code check has found code style and quality issues which could be resolved by applying the following patch(es):

namespace {
std::atomic<int> instanceCounter(0);
thread_local int localInstance = 0;
thread_local MagVolume const* lastVolume = nullptr;
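
For readers following along, my reading (not necessarily the exact code of this PR) of how the three variables above can give the per-IOV invalidation described in the PR description: each newly built geometry takes a unique id from instanceCounter, and a thread drops its cached volume as soon as it notices that the id has changed.

// Sketch: the engine could record its id at construction, e.g.
//   cacheInstance_ = ++instanceCounter;
// and each thread compares it with its own localInstance before reuse.
MagVolume const* cachedVolume(int cacheInstance) {
  if (localInstance != cacheInstance) {
    lastVolume = nullptr;        // stale: cached for a previous geometry/IOV
    localInstance = cacheInstance;
  }
  return lastVolume;             // nullptr forces a full volume search
}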
Contributor

Considering that thread_local is restrictive in our environment and should be avoided, how would this differ from using e.g. `std::map<std::thread::id, MagVolume const*> lastVolume;` and then accessing it via `lastVolume[std::this_thread::get_id()]`?
The cost of navigating the map is likely still much smaller than calling the volume's inTesla.
A slightly faster version could use a predetermined array size, with the thread id converted to an int and taken modulo the array size.

@makortel @VinInn @Dr15Jones

Contributor

Every access to an std::map shared across threads would need to be protected e.g. with a mutex. I would rather not add synchronization points.
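
To make the concern concrete, a minimal self-contained sketch of what the map-based variant looks like once the required locking is added (Volume is a stand-in for MagVolume): every read and every insertion has to take the mutex, serializing the threads at exactly the hot spot the cache is meant to speed up.

#include <map>
#include <mutex>
#include <thread>

struct Volume {};  // stand-in for MagVolume

namespace {
  std::mutex cacheMutex;
  std::map<std::thread::id, Volume const*> lastVolume;
}

Volume const* cachedVolume() {
  // Even a pure read must hold the lock: another thread inserting its own
  // entry may be rebalancing the tree at the same time.
  std::lock_guard<std::mutex> lock(cacheMutex);
  auto it = lastVolume.find(std::this_thread::get_id());
  return it == lastVolume.end() ? nullptr : it->second;
}

void setCachedVolume(Volume const* v) {
  std::lock_guard<std::mutex> lock(cacheMutex);
  lastVolume[std::this_thread::get_id()] = v;
}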

Contributor

unordered_map with a large enough reserve wouldn't need a sync, or would it?

Contributor

I'd be happy to stay with thread_local; however, if it is acceptable here, it should be acceptable in all packages where it's reasonably needed.
Can we go for this precedent?

Contributor Author

In the meantime I finally managed to test the cache reset on several threads running in parallel (besides specifying process.options.numberOfThreads and numberOfStreams, I also had to turn my analyzer into a stream analyzer; thanks to @VinInn for the tip).

So I verified that the cache is reset correctly in each of the active threads at each IOV transition.

@slava77: I'm not the one who can answer your last question, but my two cents: I do understand your concern about setting a precedent. However, MF is a fundamental service for pretty much anything in SIM, RECO and HLT, not some obscure analysis package.
If it sets a precedent, I think it also sets a pretty high standard on what is to be considered reasonable...

Regarding the (unordered) map: a nice idea as far as I am concerned, but I don't have enough experience to tell whether it's better than the TLS one. If experts agree it is, I can try implementing it as well.

Contributor

> but I'd save the effort if the sync issue is considered to be a showstopper...

In general we should avoid adding synchronization points that the framework is unaware of (they will decrease CPU efficiency).

Contributor

Can the framework provide us with a thread id -> int index map object (best if the range of the int is small and can be used as an array index) that would not require synchronization on the client side?
This way the framework would be aware of how it is filled, and we would just have an array of cache pointers (possibly atomic) to read from in the MF code instead of a thread_local.

Contributor

I replied on the general "per-thread" question in #28180 (comment).

An array with per-thread indexing would technically work, but for good performance one has to give each thread a separate cache line to prevent false sharing (so a simple `std::array<MagVolume const*, N>` would not be sufficient). How would you decide the array size? Strictly speaking we do not guarantee to keep the same threads through a job. And allowing a resize leads to the need to lock/synchronize (there may be ways to do the update without locking, at the price of complicated code).
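
To illustrate the false-sharing point with a small self-contained sketch (my own illustration, not a concrete proposal): padding each per-thread slot to a full cache line keeps one thread's updates from invalidating its neighbours' entries, at the cost of 64 bytes per slot.

#include <array>
#include <atomic>
#include <cstddef>

struct Volume {};  // stand-in for MagVolume

// One cache line (64 bytes on x86) per slot, so threads do not share lines.
struct alignas(64) CacheSlot {
  std::atomic<Volume const*> volume{nullptr};
};
static_assert(sizeof(CacheSlot) == 64, "exactly one cache line per slot");

constexpr std::size_t kMaxThreads = 256;
std::array<CacheSlot, kMaxThreads> volumeCache;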

Contributor

For the array I was thinking (if a fixed thread indexing is not practical) of keeping it as `std::array<std::atomic<MagVolume const*>, N>` and indexing with `(int)thread_index % N`,
where N can be a reasonable maximum (like 256).
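
As a rough sketch of that idea, reusing the padded CacheSlot array from the sketch above: std::thread::id is not directly convertible to an int, so a hash modulo N is used here; two threads falling into the same slot would merely evict each other's cached volume, which costs performance but not correctness.

#include <functional>
#include <thread>

// Map this thread's id onto one of the kMaxThreads padded slots.
inline CacheSlot& mySlot() {
  std::size_t idx = std::hash<std::thread::id>{}(std::this_thread::get_id()) % kMaxThreads;
  return volumeCache[idx];
}

Volume const* cachedVolume() { return mySlot().volume.load(); }
void setCachedVolume(Volume const* v) { mySlot().volume.store(v); }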

Contributor

WAIT. The framework should guarantee NOT to exceed the number of "processing" threads in the configuration. I do not think that the framework should be free to add a couple of threads to overcome a temporary shortage. If the framework wants to have such freedom, it should reserve those threads in advance.

I proposed somewhere above a possible implementation of all this.
Essentially the framework should provide a global function that returns a struct with all the info (process status?) that may be of interest during processing (event, setup, module, etc., as pointers or just identifiers), including an "index" that is per thread and per stream, is sequential, and has a known maximum (returned by another global function or stored in the struct above). Internally it would use a thread_local, and I suggest locating these functions in the "main" program (statically linked).
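
A rough sketch of the kind of interface being described; all of these names are hypothetical, nothing like this exists in the framework today. The framework fills a thread_local context when it hands work to a thread, and clients read it through cheap free functions:

// Hypothetical framework API (sketch only).
struct ProcessingContext {
  unsigned threadIndex = 0;  // sequential per-thread index, 0 .. maxIndices()-1
  unsigned streamIndex = 0;  // stream currently processed by this thread
  // ... plus pointers/identifiers for event, setup, module, etc.
};

namespace {
  thread_local ProcessingContext tlsContext;  // written only by the framework
}

// Known maximum for threadIndex, fixed at configuration time (toy value here).
unsigned maxIndices() { return 16; }

// Called by the framework when a worker thread picks up a stream.
void setCurrentContext(ProcessingContext const& ctx) { tlsContext = ctx; }

// Cheap accessor for client code such as the MF engine, e.g.
//   std::vector<MagVolume const*> cache(maxIndices());
//   cache[currentContext().threadIndex] = ...;
ProcessingContext const& currentContext() { return tlsContext; }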

@cmsbuild
Contributor

The code-checks are being triggered in jenkins.

@cmsbuild
Contributor

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-29561/14860

@cmsbuild
Contributor

A new Pull Request was created by @namapane (Nicola Amapane) for master.

It involves the following packages:

MagneticField/Engine
MagneticField/GeomBuilder
MagneticField/Layers
MagneticField/VolumeBasedEngine

@perrotta, @cmsbuild, @slava77 can you please review it and eventually sign? Thanks.
@silviodonato, @dpiparo you are the release manager for this.

cms-bot commands are listed here

@civanch
Contributor

civanch commented Apr 27, 2020

@namapane, the effect on CPU for the SIM step is smaller than expected. To be sure, I have done the test for MinBias and ttbar events:
[plot: timing/scaling results for MinBias and ttbar events]

@civanch
Contributor

civanch commented Apr 27, 2020

There is a speed-up of 5% for both types of events at 16 threads. For MinBias events the scaling is not good; for ttbar it is better. This plot is different from the one from @amadio shown in #28180. The difference may be connected with a different Geant4 version or a different way of caching.

@slava77
Contributor

slava77 commented Apr 27, 2020

assign core

@cmsbuild
Contributor

New categories assigned: core

@Dr15Jones, @smuzaffar, @makortel you have been requested to review this Pull request/Issue and eventually sign? Thanks

@slava77
Contributor

slava77 commented Apr 27, 2020

@cmsbuild please test

@cmsbuild
Contributor

cmsbuild commented Apr 27, 2020

The tests are being triggered in jenkins.
https://cmssdt.cern.ch/jenkins/job/ib-run-pr-tests/5867/console Started: 2020/04/27 16:19

@slava77
Contributor

slava77 commented Apr 27, 2020

> There is a speed-up of 5% for both types of events at 16 threads. For MinBias events the scaling is not good; for ttbar it is better. This plot is different from the one from @amadio shown in #28180. The difference may be connected with a different Geant4 version or a different way of caching.

Which machine did you use?
Was the test based on the same command as in #28180 (comment)?

cmsDriver.py MinBias_13TeV_cfi --era Run2_2018 -s GEN,SIM --pileup=NoPileUp \
  --conditions auto:phase1_2018_realistic --geometry DB:Extended \
  --eventcontent=RAWSIM --datatier GEN-SIM --dirout=./ -n $nevents --nThreads $nthreads \
  --fileout file:scaling.root --mc --no_exec

cmsRun MinBias_13TeV_cfi_GEN_SIM.py

@cmsbuild
Contributor

Pull request #29561 was updated. @perrotta, @smuzaffar, @Dr15Jones, @makortel, @cmsbuild, @slava77 can you please check and sign again.

@makortel
Contributor

@cmsbuild, please test

Thanks!

@cmsbuild
Contributor

cmsbuild commented May 11, 2020

The tests are being triggered in jenkins.
https://cmssdt.cern.ch/jenkins/job/ib-run-pr-tests/6225/console Started: 2020/05/11 17:37

@cmsbuild
Contributor

+1
Tested at: df4ab9b
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-9a9cfb/6225/summary.html
CMSSW: CMSSW_11_1_X_2020-05-11-1100
SCRAM_ARCH: slc7_amd64_gcc820

@cmsbuild
Contributor

Comparison job queued.

@cmsbuild
Contributor

Comparison is ready
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-9a9cfb/6225/summary.html

Comparison Summary:

  • No significant changes to the logs found
  • Reco comparison results: 0 differences found in the comparisons
  • DQMHistoTests: Total files compared: 34
  • DQMHistoTests: Total histograms compared: 2697527
  • DQMHistoTests: Total failures: 1
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 2697207
  • DQMHistoTests: Total skipped: 319
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 33 files compared)
  • Checked 147 log files, 16 edm output root files, 34 DQM output files

@makortel
Contributor

+1

@slava77
Contributor

slava77 commented May 11, 2020

+1

for #29561 df4ab9b

  • added comments in the code since the last reco signoff

@cmsbuild
Contributor

This pull request is fully signed and it will be integrated in one of the next master IBs (tests are also fine). This pull request will now be reviewed by the release team before it's merged. @silviodonato, @dpiparo (and backports should be raised in the release meeting by the corresponding L2)

@silviodonato
Contributor

+1
