Introduce a local cache for any caching MagneticField functions, and use them in Oscar(MT)Producer #29631
Conversation
The code-checks are being triggered in jenkins.
+code-checks Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-29631/15010
A new Pull Request was created by @makortel (Matti Kortelainen) for master. It involves the following packages: MagneticField/Engine. @perrotta, @cmsbuild, @civanch, @mdhildreth, @slava77 can you please review it and eventually sign? Thanks. cms-bot commands are listed here
@cmsbuild, please test
The tests are being triggered in jenkins.
@civanch Could you run the SIM scaling test also for this PR, please?
+1
Comparison job queued.
Comparison is ready. Comparison Summary:
CPU results (events/s) for comparison:

| WF | Nthreads | pre6 | 29561 | 29631 |
|---|---|---|---|---|
| MinBias | 8 | 1.991 | 2.072 | 2.068 |

There is some systematic effect, but in general both #29561 and #29631 provide similar CPU improvements; #29631 is not better than #29561. What I do not understand is why in this PR the Cache is passed by value rather than by reference.
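A minimal, self-contained sketch, with hypothetical names rather than the PR's actual API, of the distinction the question above refers to: a cache parameter passed by value is copied, so any update made inside the call is lost, while a cache passed by reference persists across calls.

```cpp
// Hypothetical stand-ins; not the real MagGeometry/Cache types from the PR.
struct Volume {
  double bz;  // stands in for a field volume
};

struct Cache {
  Volume const* lastVolume = nullptr;  // "previous volume" remembered between calls
};

// By value: only a local copy is updated; the caller's cache stays cold.
double fieldByValue(double z, Cache cache, Volume const& vol) {
  cache.lastVolume = &vol;  // caller never sees this update
  return vol.bz * z;
}

// By reference: the caller's cache is updated in place and reused next call.
double fieldByReference(double z, Cache& cache, Volume const& vol) {
  cache.lastVolume = &vol;  // update survives the call
  return vol.bz * z;
}
```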
@namapane |
curiously, this was posted within seconds of #29631 (comment) |
Yes but I was quicker. :-) |
Just to be clear, I am perfectly fine with this implementation if core believes it is technically superior. I just want to point out that a workaround may be needed to migrate SHP and other odd cases (although only SHP comes to mind right away), and that it would be disturbing to discover a degradation due to cache fragmentation at a point where going back is no longer an option (although that may not necessarily happen, and there would probably be workarounds as well).
The (SteppingHelix)propagator caches the volume pointer internally and will call its inTesla interface. In situations where a propagation crosses volume boundaries, calling code that also uses the field at about the same time (before calling the propagator, or after getting a return value from it) would very likely reuse the same thread_local pointer; it would perhaps similarly (but somewhat less) likely hit the "global" (current atomic) cache; and it would likely have a cache miss if it implements the caching as in this PR but does not pass the cache to the propagator call.
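A self-contained sketch, using hypothetical types rather than the real SteppingHelixPropagator interfaces, of the scenario described above: the calling code owns and warms up a local cache, but the propagator only holds a bare field pointer, so its internal field lookups cannot see that cache.

```cpp
struct FieldCache {
  int lastVolumeId = -1;  // stands in for the "previous volume" state
};

struct Field {
  // cache-aware interface, as in the local-cache design
  double inTesla(double z, FieldCache& cache) const {
    cache.lastVolumeId = static_cast<int>(z);  // pretend volume lookup
    return 3.8;
  }
  // cache-less interface, as seen by code holding only a bare field pointer
  double inTesla(double z) const {
    FieldCache scratch;          // starts cold on every call
    return inTesla(z, scratch);
  }
};

struct PropagatorLike {
  Field const* field;
  // the caller's FieldCache is not visible here, so lookups miss it
  double propagate(double z) const { return field->inTesla(z); }
};

double example(Field const& field, PropagatorLike const& prop, double z) {
  FieldCache cache;                     // caller's local cache
  double b1 = field.inTesla(z, cache);  // warms the caller's cache
  double b2 = prop.propagate(z);        // propagator's lookup cannot reuse it
  return b1 + b2;
}
```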
I agree with your analysis for the specific case of SHP, but my point is that I am not sure that "percolation through all the calls depending on the field in the context of the muon track reco" is really possible; e.g. SteppingHelixPropagatorESProducer.cc creates a propagator with a local:: field with its own cache. |
@namapane I certainly share the pain on the implied code changes, but the argumentation again comes down to the "special case" vs. "generally acceptable pattern" question. I think it is possible to deal with the ESProducts holding a MagneticField (Propagators, NavigationSchool, TrackingRecHitPropagator, TransientTrackBuilder; did I miss any?) with the "local cache design", but it will need a bit more thought.
@makortel, I agree that the local-cache solution cannot be adopted in cases like this one, where SHP is created as an ESProduct.
BTW: of course, SHP is protected against the failure of the dynamic_cast that will occur if it is constructed with a local::MagneticField instead of the field itself. Explicit construction of SHP is a marginal use pattern compared to serving it as an ESProduct, but other odd use cases can of course exist. In my simplified understanding, not having to deal with all this, with a solution that is fully internal to the MF (i.e. a TLS cache), is a clear advantage. But I am completely fine if you think these complications are not significant and are worth addressing for the gain of not using TLS.
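A minimal sketch, with hypothetical names rather than the real SHP code, of the kind of dynamic_cast guard being referred to: code that wants the concrete volume-based field checks the result of the cast and falls back to the generic interface when handed something else, such as a wrapper like local::MagneticField.

```cpp
class MagneticFieldBase {
public:
  virtual ~MagneticFieldBase() = default;
  virtual double inTesla(double z) const = 0;
};

// "the field itself": offers an extra, volume-aware path (hypothetical name)
class VolumeBasedField : public MagneticFieldBase {
public:
  double inTesla(double z) const override { return 3.8; }
  double inTeslaPerVolume(double z) const { return 3.8; }
};

class PropagatorLikeSHP {
public:
  explicit PropagatorLikeSHP(MagneticFieldBase const* field)
      : field_(field), vbField_(dynamic_cast<VolumeBasedField const*>(field)) {}

  double fieldAt(double z) const {
    // protected against the cast failing: vbField_ is null when a wrapper was
    // passed in, and the generic interface is used instead
    return vbField_ ? vbField_->inTeslaPerVolume(z) : field_->inTesla(z);
  }

private:
  MagneticFieldBase const* field_;
  VolumeBasedField const* vbField_;
};
```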
I expect that not everything needs a migration, for lack of performance benefits.
IIUC, the cost-benefit analysis for TLS in the MF, compared to local caching:
Unless the last con is misrepresented, I think that it trumps all benefits of convenience. |
I would agree if we had a figure of what the limit is, how far we are from hitting it, and how significant the addition of TLS for the MF is with respect to hitting that limit. Forgive me for over-simplifying: say we are at 80% of the limit (in whatever relevant metric) and the MF adds 5%? Then I agree. Also, nothing prevents moving to external caching the day we figure out we need to add some new external that makes extensive use of TLS. This is a very hypothetical scenario that may well never happen, so why forward-optimize for it? This said, I start becoming uncomfortable with this discussion; as I already mentioned, if the experts say TLS must be avoided at all costs, I have no objection.
For the record, if you refer to the overhead of accessing TLS variables, that is included in @civanch's profiling, which shows it is at least as small as the overhead of external caching. So I think this does not hold as an argument either.
We are already hitting limits (on the number of shared objects with TLS) on ARM and POWER (albeit somewhat randomly). The risks of letting the use of TLS (or of memory usage that scales with the number of (worker) threads) spread include missing (large) allocation opportunities at HPC sites, because we would need months of work to get our code to run there. By being proactive we try hard to be as flexible as we can for the unknown future of our computing infrastructure.
From earlier discussion I understood that there is a limit on the number of libraries using TLS, and that is the limit we are hitting on ARM/POWER. @makortel, please clarify this.
Why don't we then remove the cases where TLS usage is not a sensible choice instead? Just looking at random places in CMSSW, I come across these (among others):
I find it hard to believe that TLS is appropriate in any of these, and I believe it would make much more sense to invest work in a campaign of fixing these specific pieces of code than to migrate all usage of a fundamental, ubiquitous service like the MF to an external cache.
BTW, for my own curiosity and education:
Thank you for your patience in the comments.
Hi all, as I mentioned before, it would be good to discuss numbers. It would be good to know the HPC numbers for CMSSW. From my memory, when we introduced Geant4 MT the TLS limit was about 4032 B, while Geant4 10.0 used about ~10 MB of TLS. The numbers may not be exact, but we realized that we had no chance to fit within the limit. Because of that, we started to use "dynamic" build options, which bring a ~10% slowdown of the simulation. We have reduced TLS usage in Geant4, but we never set up milestones. I do not know the TLS value for Geant4 10.6, but we definitely need numbers and limits for different hardware.
@civanch |
@namapane You are exactly right that we should go over all
This is correct. While it is true that the library count limit could be mitigated by hiding the TLS within the framework, such a move would send the message that it is acceptable to craft data structures whose size scales with the number of (worker) threads instead of the number of concurrent events. For this specific case the overall effect would be tiny, but if we set such a pattern as an example, no one knows how bad the situation will become until it is too late (and fixing it would then require much more work than what we are discussing now). @civanch Essentially TLS and the stack share the same memory block: the stack grows downwards and TLS upwards. When TLS "goes over the limit", the first symptom would be stack corruption, which in the best case would lead to a crash. I am not aware of any tool that would monitor the size of the TLS and the stack (I'd expect valgrind to be able to spot the stack corruptions, though).
That said, how about we look at this from a different angle? As discussed several times, the impact of this particular case on either the TLS size or the number of shared objects with TLS is tiny. Core's main concern is the spread of using TLS, or of using data structures whose overall size depends on the number of threads instead of the number of concurrent events. How about we accept the
the "reasonably needed" would mean something along "existing ESProduct needs an internal cache for performance reasons, and that ESProduct is used by other existing ESProducts via member data, and the use of all these ESProducts is so widespread that the cost to migrate them all is deemed very costly"? |
Perhaps this can be proposed for the coding rules.
We don't explain tolerated non-preferred violations for any other rule, but I agree the asterisk (for "exceptional use cases where the rule may be violated with good justification") can be added alongside the "avoid thread_local" rule (when I get there).
-1 based on #29631 (comment) |
Closing in favor of #29561 (see discussion above). |
PR description:
Following the discussion in #28180, this PR proposes a local cache for the caching of the "previous volume" in `MagGeometry`, following the sketch in #28180 (comment), and `local::MagneticField` as an easy replacement of `MagneticField const*`.

This PR also migrates Oscar(MT)Producer to use the `local::MagneticField` to enable testing for the most affected use case.

PR validation:
Limited matrix runs. I also tested with one workflow that the SIM step runs with multiple threads.
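A self-contained sketch, with hypothetical and heavily simplified types rather than the actual CMSSW classes, of the replacement the description refers to: a producer holding a small local::MagneticField value that bundles the field pointer with its own cache, instead of a bare MagneticField const*.

```cpp
// Stand-in for the real MagneticField interface.
struct MagneticField {
  double inTesla(double /*z*/) const { return 3.8; }
};

namespace local {
  // Value type pairing the field pointer with a per-owner cache (sketch only).
  class MagneticField {
  public:
    explicit MagneticField(::MagneticField const* field) : field_(field) {}
    double inTesla(double z) const {
      // a real implementation would consult and update the cached volume here
      return field_->inTesla(z);
    }
  private:
    ::MagneticField const* field_;
    mutable int cachedVolumeId_ = -1;  // stands in for the "previous volume" cache
  };
}

// Simplified producer: the member replaces a bare MagneticField const*.
class OscarProducerLike {
public:
  explicit OscarProducerLike(::MagneticField const* field) : field_(field) {}
  double fieldAt(double z) const { return field_.inTesla(z); }
private:
  local::MagneticField field_;
};
```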