restrict FastTimerService to default arena #33126
Conversation
+code-checks Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-33126/21479
A new Pull Request was created by @dan131riley (Dan Riley) for master. It involves the following packages: HLTrigger/Timer. @Martin-Grunewald, @cmsbuild, @fwyzard can you please review it and eventually sign? Thanks. cms-bot commands are listed here.
@cmsbuild, please test
Please test
hold
Pull request has been put on hold by @fwyzard
+1
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-e703e6/13390/summary.html
Comparison Summary:
The purpose of the FTS is to track all the resources used in a job, not
just those that happen to use a particular subset of the TBB threads.
In addition to the crashes of a few workflows in the IBs, the global observer is marked as obsolete in TBB 2020.3:

    /** TODO: Obsolete.
        Global observer semantics is obsolete as it violates master thread isolation
        guarantees and is not composable. Thus the current default behavior of the
        constructor is obsolete too and will be changed in one of the future versions
        of the library. **/
    explicit task_scheduler_observer( bool local = false ) {

https://github.com/oneapi-src/oneTBB/blob/v2020.3/include/tbb/task_scheduler_observer.h#L112-L117

This PR seems the best we can do for now.
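For reference, a minimal sketch of what the "global semantics" named in that TODO provides, written against the classic TBB 2020.3 header quoted above (the class name and the counting logic are illustrative, not the FastTimerService code): a global observer gets an entry notification for every thread that becomes available to TBB anywhere in the process, which is what allows once-per-thread setup of accounting.

```cpp
#include <mutex>
#include <set>
#include <thread>
#include <tbb/task_scheduler_observer.h>

// Illustrative sketch (not the FastTimerService code), assuming the classic
// TBB 2020.3 API: a global observer built with the obsolete bool constructor.
// on_scheduler_entry fires once for each thread that becomes available to TBB,
// so per-thread accounting can be initialized exactly once per thread.
class GlobalThreadWatcher : public tbb::task_scheduler_observer {
public:
  GlobalThreadWatcher() : tbb::task_scheduler_observer(/*local=*/false) { observe(true); }
  ~GlobalThreadWatcher() { observe(false); }

  void on_scheduler_entry(bool /*is_worker*/) override {
    std::lock_guard<std::mutex> lock(mutex_);
    threads_.insert(std::this_thread::get_id());  // record each distinct thread once
  }
  void on_scheduler_exit(bool /*is_worker*/) override {
    // a real service would finalize this thread's measurements here
  }

  std::size_t distinctThreads() const {
    std::lock_guard<std::mutex> lock(mutex_);
    return threads_.size();
  }

private:
  mutable std::mutex mutex_;
  std::set<std::thread::id> threads_;
};
```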
Sorry, I don't buy the "best we can do for now", since that basically always means "we will leave it at that".
I would appreciate instead a plan to maintain the existing functionality.
I don't know how we could maintain functionality of TBB that gets removed from TBB (without forking, which would lead to other issues). Maybe ask TBB developers for such hooks? Maybe the desired functionality could be achieved with some use of thread locals (that admittedly would impact all threads and not just TBB threads)? Maybe there is some other solution? In the meantime it would be important to get this PR in so that we can see if there are any remaining issues in the IBs.
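For concreteness, a rough sketch of the thread-local idea (not a proposal in the codebase; ThreadGuard and touchThreadAccounting are hypothetical names), with the caveat already stated above that it fires for every thread that calls into it, not only TBB worker threads:

```cpp
#include <atomic>
#include <chrono>
#include <cstdio>

namespace {
  std::atomic<unsigned> activeThreads{0};

  // A thread_local guard: its constructor runs the first time a given thread
  // reaches the accounting code, and its destructor runs when that thread exits.
  struct ThreadGuard {
    std::chrono::steady_clock::time_point start = std::chrono::steady_clock::now();
    ThreadGuard() { activeThreads.fetch_add(1, std::memory_order_relaxed); }
    ~ThreadGuard() {
      auto seconds =
          std::chrono::duration<double>(std::chrono::steady_clock::now() - start).count();
      std::printf("thread was alive for %.3f s\n", seconds);
      activeThreads.fetch_sub(1, std::memory_order_relaxed);
    }
  };

  // Hypothetical hook to be called from each measurement point.
  void touchThreadAccounting() {
    thread_local ThreadGuard guard;  // constructed once per thread, destroyed at thread exit
    (void)guard;
  }
}
```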
@fwyzard Could you actually describe what exact behavior FastTimerService is after with the current use of the global task_scheduler_observer?
The issue that is solved by inheriting from
Summary of what was discussed in the ORP + core software meeting: this PR would likely impact the correctness of the accounting of FastTimerService, so a different solution/workaround is needed (Dan is looking into it). In the meantime, we would merge this PR (or a copy of it) into the DEVEL IB to check if there are any further, rarer issues left from the TBB migration.
Merging in DEVEL makes sense, of course, and thanks for investigating this further.
Note that the
If there is an alternative to
OK, I am changing the base branch for this PR to be the DEVEL branch.
@silviodonato, @qliphy can you please merge it (note that it will go in DEVEL IBs only)?
merge |
Thanks, I am starting a 12h DEVEL IB now.
This PR is definitely wrong. With the new task_group structure, there are a lot of entries and exits of the primary arena in between measurement points, which will incorrectly reset the thread counters. The behavior of the global observer is an enter when a thread becomes available to TBB and an exit when it leaves, which in the normal case means enter & exit only get called once per thread.

This could be closely approximated by initializing only on the first enter for a thread and then iterating over the enumerable_thread_specific at post-endjob, except for the reuse issue that @fwyzard points out, and finalization for threads that leave mid-job. I believe this is not a practical problem on Linux, but it would be better if we could detect that a thread ID has been reused. There's probably a low-level way to do this with pthread_key_create().
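To illustrate the pthread_key_create() idea at the end of the previous comment (a sketch only; registerThisThread and the slot handling are hypothetical, not FastTimerService code): a key registered with a destructor gives a callback when a thread that touched the service terminates, so its counters can be finalized and a later thread that happens to reuse the same ID is not mistaken for the old one.

```cpp
#include <pthread.h>
#include <cstdio>

namespace {
  pthread_key_t threadExitKey;
  pthread_once_t keyOnce = PTHREAD_ONCE_INIT;

  // Invoked by the pthread runtime when a thread that set the key terminates.
  void onThreadExit(void* slot) {
    std::printf("finalizing per-thread counters for slot %p\n", slot);
    // ...fold this thread's partial measurements into the job-level accounting...
  }

  void makeKey() { pthread_key_create(&threadExitKey, onThreadExit); }

  // Hypothetical hook, called from the first measurement point reached on a thread;
  // 'perThreadSlot' would be that thread's entry in the enumerable_thread_specific.
  void registerThisThread(void* perThreadSlot) {
    pthread_once(&keyOnce, makeKey);
    if (pthread_getspecific(threadExitKey) == nullptr) {
      pthread_setspecific(threadExitKey, perThreadSlot);
    }
  }
}
```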
It seems unlikely that TBB will be adding/removing threads while we are running with our recent changes. We call
In the old implementation, if we told TBB to use 1 thread but then used the 'external work' of the framework, TBB would still create a new thread. I don't think it does that anymore, but we could test it again.
How is this different from the old implementation, when we configured a job to run with 4 or 8 threads?
The addition of
I would still do more testing to be sure TBB is actually behaving in such a way. Such testing would be a combination of
(now that I finally found it) #31483 shows a case where a job configured to use 256 threads uses (at least) 323 distinct TBB worker threads (that end up running
@dan131riley could you point me to some documentation of how "moving" the threads from one arena to another works, and how we are (going to be) using different arenas in CMSSW (and ROOT)? Maybe we can use the different arenas as macro categories for accounting the resource usage?
@fwyzard I'm not the expert on this, but there are a few places where we create transitory task arenas in order to manage waiting for asynchronous events. One example is in cmssw/Mixing/Base/src/SecondaryEventProvider.cc (lines 25 to 32 at dbeb22f).
I doubt it would be very informative to create a task_scheduler_observer for every transitory task arena.
Event setup also currently uses an arena for asynchronous operations, but that may be going away.
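For readers unfamiliar with the pattern, a minimal sketch of such a transitory arena (an illustration only, not the actual SecondaryEventProvider code; runAndWait is a made-up name): a short-lived tbb::task_arena plus a tbb::task_group used to run some work and block until it completes. The calling thread enters and leaves the arena around the execute() call, which is why a task_scheduler_observer attached to arenas like this would mostly record churn rather than useful per-thread accounting.

```cpp
#include <functional>
#include <tbb/task_arena.h>
#include <tbb/task_group.h>

// Illustrative sketch of a transitory arena used to wait for asynchronous work
// (not the actual SecondaryEventProvider code).
void runAndWait(std::function<void()> work) {
  tbb::task_arena arena;        // short-lived arena, destroyed on return
  arena.execute([&work] {
    tbb::task_group group;
    group.run_and_wait(work);   // block inside the arena until 'work' finishes
  });
}
```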
To clarify, the framework internally uses 2
As for ROOT, it looks like their use of a
PR description:
This makes the FastTimerService tbb::task_scheduler_observer local to the primary arena. This fixes the problem with crashes at the end of the job following #32804 and covered in issue #33107.
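A minimal sketch of what "local to the primary arena" means in practice, assuming the arena-bound task_scheduler_observer(task_arena&) constructor (the class here is illustrative, not the actual FastTimerService change): the observer is attached to one arena and only receives entry/exit notifications for threads joining that arena, rather than the obsolete global notifications for every thread available to TBB.

```cpp
#include <tbb/task_arena.h>
#include <tbb/task_scheduler_observer.h>

// Illustrative sketch (not the actual FastTimerService code): an observer bound
// to a single arena via the task_scheduler_observer(task_arena&) constructor.
class ArenaLocalObserver : public tbb::task_scheduler_observer {
public:
  explicit ArenaLocalObserver(tbb::task_arena& arena) : tbb::task_scheduler_observer(arena) {
    observe(true);   // start receiving notifications for this arena only
  }
  ~ArenaLocalObserver() { observe(false); }

  void on_scheduler_entry(bool /*is_worker*/) override {
    // initialize this thread's measurements for work done in this arena
  }
  void on_scheduler_exit(bool /*is_worker*/) override {
    // collect this thread's measurements when it leaves the arena
  }
};
```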
PR validation:
It compiles, and running step 3 of 136.885501, which was failing 20-50% of the time depending on platform, has run without crashes in dozens of test runs.