Tentative implementation of TLS cache for MF #29561
Conversation
The code-checks are being triggered in jenkins.
-code-checks Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-29561/14859
Code check has found code style and quality issues which could be resolved by applying the following patch(es):
namespace {
  std::atomic<int> instanceCounter(0);
  thread_local int localInstance = 0;
  thread_local MagVolume const* lastVolume = nullptr;
considering thread_local is restrictive in our environment and should be avoided,
how would this differ from using e.g. std::map<std::thread::id, MagVolume const*> lastVolume;
and then accessing it via lastVolume[std::this_thread::get_id()]
?
the cost of navigating the map is likely still much smaller than calling the volume inTesla.
A slightly faster version could use an array of predetermined size, with the thread id converted to an int index modulo the array size.
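The map-based lookup proposed here could look like the following minimal sketch. MagVolume is a stand-in type for the real CMSSW class, and the lock reflects the synchronization concern discussed below: a bare std::map is not safe for concurrent insertion.

```cpp
#include <map>
#include <mutex>
#include <thread>

struct MagVolume {};  // hypothetical stand-in for the real CMSSW class

// Minimal sketch of the map-based per-thread cache.  std::map is not
// safe for concurrent insertion, so every access takes a lock here.
class PerThreadVolumeCache {
public:
  MagVolume const* get() {
    std::lock_guard<std::mutex> lock(mutex_);
    return lastVolume_[std::this_thread::get_id()];  // default-inserts nullptr
  }
  void set(MagVolume const* v) {
    std::lock_guard<std::mutex> lock(mutex_);
    lastVolume_[std::this_thread::get_id()] = v;
  }

private:
  std::mutex mutex_;
  std::map<std::thread::id, MagVolume const*> lastVolume_;
};
```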
Every access to an std::map shared across threads would need to be protected, e.g. with a mutex. I would rather not add synchronization points.
unordered_map with a large enough reserve wouldn't need a sync, or would it?
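One way an unordered_map could reduce locking to a single registration step (a sketch, not verified against the real use case): node-based containers like std::unordered_map never relocate their elements, so each thread can take the lock once to create its entry, keep a stable pointer to its own slot, and use it lock-free afterwards. Names and the MagVolume stand-in are hypothetical.

```cpp
#include <mutex>
#include <thread>
#include <unordered_map>

struct MagVolume {};  // hypothetical stand-in for the real CMSSW class

class RegisteredCache {
public:
  // Called once per thread at startup.  The insert is locked, but the
  // returned pointer stays valid (unordered_map never moves its
  // elements), so later reads/writes through it need no lock.
  MagVolume const** threadSlot() {
    std::lock_guard<std::mutex> lock(mutex_);
    return &cache_[std::this_thread::get_id()];
  }

private:
  std::mutex mutex_;
  std::unordered_map<std::thread::id, MagVolume const*> cache_;
};
```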
I'd be happy to stay with thread_local, however if this is acceptable, it should be acceptable in all packages where it's reasonably needed.
Can we go for this precedent?
In the meanwhile I finally managed to test the cache re-set on several threads running in parallel (besides specifying process.options.numberOfThreads and numberOfStreams I also had to turn my analyzer into a stream analyzer; thanks to @VinInn for the tip).
So I verified that the cache is re-set correctly in each of the active threads at each IOV transition.
@slava77 : I'm not the one who can answer your last question, but my 2 cents: I do understand your concern about setting a precedent. However, MF is a fundamental service for pretty much anything in SIM, RECO and HLT, not some obscure analysis package.
If it sets a precedent, I think it also sets a pretty high standard on what is to be considered reasonable...
Regarding the (unordered) map: nice idea as far as I am concerned, but I don't have enough experience to tell whether it's better than the TLS one. If experts agree it is, I can try implementing it as well.
but I'd save the effort if the sync issue is considered to be a showstopper...
In general we should avoid adding synchronization points that the framework is unaware of (they will decrease CPU efficiency).
can the framework provide us with a thread id -> int index map object (best if the range of int is small and can be used as an array index) that would not require synchronization on the client side?
This way the framework will be aware of how it's filled and we'll just have an array of cache pointers (possibly atomic) to read from in the MF code instead of a thread_local.
I replied on the general "per-thread" question in #28180 (comment).
An array with per-thread indexing would technically work, but for good performance one has to give each thread a separate cache line to prevent false sharing (so a simple `std::array<MagVolume const*, N>` would not be sufficient). How would you decide the array size? Strictly speaking we do not guarantee to keep the same threads through a job. And allowing resizes leads to the need to lock/synchronize (there may be ways to do an update without locking, at the price of complicated code).
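The false-sharing point can be illustrated with explicit padding. This is only a sketch with a hard-coded 64-byte line size and a stand-in MagVolume type:

```cpp
#include <array>
#include <atomic>

struct MagVolume {};  // hypothetical stand-in for the real CMSSW class

// Eight pointers of a plain std::array<MagVolume const*, N> share one
// 64-byte cache line, so a write by one thread invalidates the line
// used by its neighbours.  Padding each slot to a full line avoids that.
struct alignas(64) PaddedSlot {
  std::atomic<MagVolume const*> volume{nullptr};
};
static_assert(sizeof(PaddedSlot) == 64, "one slot per cache line");

template <std::size_t N>
using PaddedCache = std::array<PaddedSlot, N>;
```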
for the array I was thinking (if a fixed thread indexing is not practical) to keep it std::array<std::atomic<MagVolume const*>, N>
and do (int)thread_index%N
where N can be a reasonable max (like 256)
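A sketch of that scheme; std::thread::id is not directly convertible to int, so a hash is used here instead, and MagVolume and N = 256 are assumptions from the discussion above. Two threads can collide on one slot, which is safe (atomic access) but would make them evict each other's cached volume.

```cpp
#include <array>
#include <atomic>
#include <functional>
#include <thread>

struct MagVolume {};  // hypothetical stand-in for the real CMSSW class

constexpr std::size_t N = 256;  // "reasonable max" suggested above

// Fixed array of atomic slots, indexed by hashing the current thread id.
std::array<std::atomic<MagVolume const*>, N> lastVolume{};

inline std::size_t slotIndex() {
  return std::hash<std::thread::id>()(std::this_thread::get_id()) % N;
}

MagVolume const* getCached() {
  return lastVolume[slotIndex()].load(std::memory_order_relaxed);
}

void setCached(MagVolume const* v) {
  lastVolume[slotIndex()].store(v, std::memory_order_relaxed);
}
```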
WAIT. the framework should guarantee NOT to exceed the number of "processing" threads in the configuration. I do not think that the framework should be free to add a couple of threads to overcome a temporary shortage. If the framework wants to have such a freedom it should reserve those thread in advance.
I proposed somewhere above a possible implementation of all this.
Essentially the framework should provide a global function that returns a struct with all the info (process-status?) that may be of interest during processing (event, setup, module etc. as pointers or just identifiers), including an "index" that is per thread and per stream, is sequential, and has a known maximum (returned by another global function or stored in the struct above). Internally it will use a thread_local, and I suggest locating these functions in the "main" program (statically linked).
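A minimal sketch of such an index function (names, the maximum, and the placement are all hypothetical, not an actual framework API; internally it uses a thread_local exactly as suggested):

```cpp
#include <atomic>
#include <cassert>

// Hypothetical framework-provided global: a sequential per-thread index
// with a known maximum.  The real version would live in the statically
// linked "main" program and take its maximum from the job configuration.
constexpr int kMaxThreads = 256;

inline int threadIndex() {
  static std::atomic<int> nextIndex{0};
  // Each thread draws its index on first call and keeps it for life.
  thread_local int myIndex = nextIndex.fetch_add(1);
  assert(myIndex < kMaxThreads);
  return myIndex;
}
```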
The code-checks are being triggered in jenkins.
+code-checks Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-29561/14860
A new Pull Request was created by @namapane (Nicola Amapane) for master.
It involves the following packages: MagneticField/Engine
@perrotta, @cmsbuild, @slava77 can you please review it and eventually sign? Thanks.
cms-bot commands are listed here
@namapane, effect on CPU for the SIM step is less than expected. To be sure, I have done the test for MinBias and ttbar events:
assign core
New categories assigned: core
@Dr15Jones, @smuzaffar, @makortel you have been requested to review this Pull request/Issue and eventually sign. Thanks
@cmsbuild please test
The tests are being triggered in jenkins.
which machine did you use?
Pull request #29561 was updated. @perrotta, @smuzaffar, @Dr15Jones, @makortel, @cmsbuild, @slava77 can you please check and sign again.
@cmsbuild, please test Thanks!
The tests are being triggered in jenkins.
+1
Comparison job queued.
Comparison is ready. Comparison Summary:
+1
This pull request is fully signed and it will be integrated in one of the next master IBs (tests are also fine). This pull request will now be reviewed by the release team before it's merged. @silviodonato, @dpiparo (and backports should be raised in the release meeting by the corresponding L2)
+1
PR description:
Tentative implementation of a per-thread MF volume cache as suggested by @VinInn in the discussion of #28180 (#28180 (comment)), intended to allow further testing.
PR validation:
Regression-tested for identical values against reference files for all maps/currents/eras.
Also verified in a multithreaded job spanning multiple IOVs that the cache gets invalidated when expected.