
restrict FastTimerService to default arena #33126

Merged

Conversation

@dan131riley (Author)

PR description:

This makes the FastTimerService tbb::task_scheduler_observer local to the primary arena. This fixes the crashes at the end of the job that followed #32804 and are covered in issue #33107.
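(For reference, making the observer arena-local amounts to passing the arena to the tbb::task_scheduler_observer base-class constructor instead of using the obsolete global form. A minimal sketch of that pattern is below; PrimaryArenaObserver and the callback bodies are illustrative, not the actual FastTimerService code.)

    #include <tbb/task_arena.h>
    #include <tbb/task_scheduler_observer.h>

    // Hypothetical sketch, not the actual FastTimerService code: passing the arena
    // to the base-class constructor makes the callbacks fire only for threads
    // entering or leaving that arena, instead of the whole scheduler.
    class PrimaryArenaObserver : public tbb::task_scheduler_observer {
    public:
      explicit PrimaryArenaObserver(tbb::task_arena& arena) : tbb::task_scheduler_observer(arena) {
        observe(true);   // start receiving callbacks for this arena
      }
      ~PrimaryArenaObserver() override { observe(false); }

      void on_scheduler_entry(bool /*isWorker*/) override { /* initialise per-thread state */ }
      void on_scheduler_exit(bool /*isWorker*/) override { /* account per-thread state */ }
    };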

PR validation:

It compiles, and running step 3 of workflow 136.885501, which was failing 20-50% of the time depending on platform, has run without crashes in dozens of test runs.

@cmsbuild (Contributor) commented Mar 9, 2021

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-33126/21479

  • This PR adds an extra 24KB to the repository

@cmsbuild (Contributor) commented Mar 9, 2021

A new Pull Request was created by @dan131riley (Dan Riley) for master.

It involves the following packages:

HLTrigger/Timer

@Martin-Grunewald, @cmsbuild, @fwyzard can you please review it and eventually sign? Thanks.
@Martin-Grunewald this is something you requested to watch as well.
@silviodonato, @dpiparo, @qliphy you are the release manager for this.

cms-bot commands are listed here

@makortel (Contributor) commented Mar 9, 2021

@cmsbuild, please test

@Dr15Jones (Contributor)

Please test

@fwyzard (Contributor) commented Mar 10, 2021 via email

@cmsbuild (Contributor)

Pull request has been put on hold by @fwyzard
They need to issue an unhold command to remove the hold state or L1 can unhold it for all

@cmsbuild added the hold label Mar 10, 2021
@cmsbuild (Contributor)

+1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-e703e6/13390/summary.html
COMMIT: c1886c2
CMSSW: CMSSW_11_3_X_2021-03-09-1100/slc7_amd64_gcc900
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmssw/33126/13390/install.sh to create a dev area with all the needed externals and cmssw changes.

Comparison Summary

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 2950 differences found in the comparisons
  • DQMHistoTests: Total files compared: 38
  • DQMHistoTests: Total histograms compared: 2849195
  • DQMHistoTests: Total failures: 9571
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 2839602
  • DQMHistoTests: Total skipped: 22
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB (37 files compared)
  • Checked 160 log files, 37 edm output root files, 38 DQM output files

@fwyzard (Contributor) commented Mar 10, 2021 via email

@makortel (Contributor)

In addition to the crashes of a few workflows in IBs, the global observer is marked as obsolete in TBB 2020.3:

    /** TODO: Obsolete.
        Global observer semantics is obsolete as it violates master thread isolation
        guarantees and is not composable. Thus the current default behavior of the
        constructor is obsolete too and will be changed in one of the future versions
        of the library. **/
    explicit task_scheduler_observer( bool local = false ) {

https://github.com/oneapi-src/oneTBB/blob/v2020.3/include/tbb/task_scheduler_observer.h#L112-L117
and in fact that mode appears to have been removed in 2021.1.1:
https://github.com/oneapi-src/oneTBB/blob/v2021.1.1/include/oneapi/tbb/task_scheduler_observer.h

This PR seems the best we can do for now.

@fwyzard (Contributor) commented Mar 11, 2021 via email

@makortel (Contributor)

I don't know how we could maintain TBB functionality that gets removed from TBB (without forking, which would lead to other issues).

Maybe ask TBB developers for such hooks? Maybe the desired functionality could be achieved with some use of thread locals (that admittedly would impact all threads and not just TBB threads)? Maybe there is some other solution?
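As one illustration of the thread-local idea (a sketch only; ThreadMeasurement and the result registry are hypothetical): a thread_local object would be initialized on first use in any thread and could publish its totals from its destructor when the thread exits, with the caveat noted above that this covers all threads, not just TBB workers.

    #include <mutex>
    #include <vector>

    // Hypothetical sketch of per-thread accounting without a scheduler observer.
    struct ThreadMeasurement {
      double total = 0.;
      ThreadMeasurement() { /* initialise counters for this thread */ }
      ~ThreadMeasurement() { publish(); }  // runs automatically when the thread exits
      void publish() {
        static std::mutex m;
        static std::vector<double> results;
        std::lock_guard<std::mutex> lock(m);
        results.push_back(total);
      }
    };

    // First access from any thread constructs the object; thread exit destroys it.
    thread_local ThreadMeasurement measurement;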

In the meantime it would be important to get this PR in so that we can see if there are any remaining issues in the IBs.

@makortel (Contributor)

@fwyzard Could you describe the exact behavior FastTimerService is after with its current use of tbb::task_scheduler_observer?

@fwyzard (Contributor) commented Mar 12, 2021

The issue that is solved by inheriting from tbb::task_scheduler_observer and implementing on_scheduler_entry()/on_scheduler_exit() is that, during a job, a TBB pool may relinquish a worker thread and/or acquire a new one; when this happens, the per-thread information needs to be properly initialised (for a new thread) or accounted for (for a thread leaving the pool).
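A minimal self-contained sketch of that pattern, with illustrative bookkeeping (the counters below are stand-ins, not the real FastTimerService data structures), might look like:

    #include <atomic>
    #include <tbb/task_scheduler_observer.h>

    // Illustrative sketch of the pattern described above, not the real FastTimerService:
    // set up per-thread bookkeeping when a thread joins the pool, fold it into the
    // job-wide totals when the thread leaves.
    class PoolObserver : public tbb::task_scheduler_observer {
    public:
      PoolObserver() { observe(true); }
      ~PoolObserver() override { observe(false); }

      void on_scheduler_entry(bool /*isWorker*/) override {
        perThreadWork_ = 0;            // initialise this thread's counters
      }
      void on_scheduler_exit(bool /*isWorker*/) override {
        totalWork_ += perThreadWork_;  // account for the departing thread
      }

      void addWork(unsigned long n) { perThreadWork_ += n; }  // called from tasks

    private:
      static thread_local unsigned long perThreadWork_;
      std::atomic<unsigned long> totalWork_{0};
    };

    thread_local unsigned long PoolObserver::perThreadWork_ = 0;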

@makortel (Contributor)

Summary of what was discussed in ORP+core software meeting: this PR would likely impact the correctness of the accounting of FastTimerService, so a different solution/workaround is needed (Dan is looking into it). In the meantime, we would merge this PR (or a copy of it) into DEVEL IB to check if there are any further, more rare issues left from the migration to tbb::task_group.

@fwyzard (Contributor) commented Mar 17, 2021

Merging in DEVEL makes sense, of course, and thanks for investigating this further.

Note that the tbb::enumerable_thread_specific<Measurement, tbb::cache_aligned_allocator<Measurement>, tbb::ets_key_per_instance> threads_ member automatically adds new elements as needed; however, according to the TBB documentation, it does not guarantee that an element is not reused for a new thread (see the second NOTE at https://software.intel.com/content/www/us/en/develop/documentation/tbb-documentation/top/intel-threading-building-blocks-developer-reference/thread-local-storage/enumerablethreadspecific-template-class.html); it also does not remove/destroy the elements for threads that leave the pool.
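For context, a reduced sketch of such a member and its end-of-job iteration is shown below (Measurement is collapsed to a stub here; the free functions are illustrative):

    #include <tbb/cache_aligned_allocator.h>
    #include <tbb/enumerable_thread_specific.h>

    // Stub standing in for the real FastTimerService Measurement type.
    struct Measurement { double time = 0.; };

    using ThreadLocalMeasurements =
        tbb::enumerable_thread_specific<Measurement,
                                        tbb::cache_aligned_allocator<Measurement>,
                                        tbb::ets_key_per_instance>;

    ThreadLocalMeasurements threads_;

    void accumulate(double dt) {
      threads_.local().time += dt;     // creates this thread's element on first use
    }

    double endJobTotal() {
      double total = 0.;
      for (auto const& m : threads_)   // iterate over all per-thread elements at end of job
        total += m.time;
      return total;
    }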

If there is an alternative to tbb::enumerable_thread_specific that takes care of the proper initialisation and "finalisation" of TLS objects for TBB worker threads, we could avoid the use of the tbb::task_scheduler_observer.

@smuzaffar (Contributor)

OK, I am changing the base branch for this PR to be CMSSW_11_3_DEVEL_X and then we can merge it.

@smuzaffar changed the base branch from master to CMSSW_11_3_DEVEL_X on March 17, 2021 at 10:38
@smuzaffar (Contributor)

@silviodonato, @qliphy can you please merge it? (Note that it will go into DEVEL IBs only.)

@silviodonato (Contributor)

merge
(this will go only in CMSSW_11_3_DEVEL_X for test/debugging)

@cmsbuild merged commit cb6fcb1 into cms-sw:CMSSW_11_3_DEVEL_X on Mar 17, 2021
@smuzaffar (Contributor)

thanks, I am starting a 12h DEVEL IB now

@dan131riley (Author)

This PR is definitely wrong. With the new task_group structure, there are many entries into and exits from the primary arena between measurement points, which will incorrectly reset the thread counters. The behavior of the global observer is an enter when a thread becomes available to TBB and an exit when it leaves, which in the normal case means enter and exit each get called only once per thread. This could be closely approximated by initializing only on the first enter for a thread and then iterating over the enumerable_thread_specific at post-endjob, except for the reuse issue that @fwyzard points out and the finalization of threads that leave mid-job. I believe this is not a practical problem on Linux, but it would be better if we could detect that a thread ID has been reused. There is probably a low-level way to do this with pthread_key_create().
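As a purely illustrative sketch of that low-level approach (not part of this PR), pthread_key_create() allows registering a destructor that runs when a thread exits, which could be used to finalize a thread's measurements and to recognize that a later thread with the same ID is in fact a new thread:

    #include <pthread.h>
    #include <cstdio>

    // Hypothetical sketch: use a pthread TLS destructor to notice thread exit.
    static pthread_key_t key;
    static pthread_once_t once = PTHREAD_ONCE_INIT;

    static void onThreadExit(void* value) {
      // Called automatically when a thread with a non-null value exits:
      // finalize this thread's measurements here.
      std::printf("thread slot %p retired\n", value);
    }

    static void makeKey() { pthread_key_create(&key, onThreadExit); }

    void markThreadSeen(void* perThreadData) {
      pthread_once(&once, makeKey);
      if (pthread_getspecific(key) == nullptr)
        pthread_setspecific(key, perThreadData);  // first time this thread is seen
    }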

@Dr15Jones (Contributor)

It seems unlikely that TBB will be adding/removing threads while we are running with our recent changes. We call tbb::global_control at the beginning of the job and tell TBB that it can only use a set number of threads (taken from the configuration option process.options.numberOfThreads). We then wrap all our processing in a tbb::task_arena that we told to use that same number of threads. So at that point, TBB is constrained to not use more than N threads AND we have explicitly told TBB to use N threads in the 'main' arena. Implementation-wise, TBB could still delete a thread and bring up another one to replace it and still honor the constraints, but that seems pretty unlikely since it would be unnecessarily inefficient.

In the old implementation, if we told TBB to use 1 thread but then used the framework's 'external work' feature, TBB would still create a new thread. I don't think it does that anymore, but we could test it again.
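For reference, the constraints described above boil down to roughly the following pattern (a simplified sketch with a hard-coded thread count, not the actual cmsRun code):

    #include <tbb/global_control.h>
    #include <tbb/parallel_for.h>
    #include <tbb/task_arena.h>

    int main() {
      constexpr int nThreads = 4;  // stand-in for process.options.numberOfThreads

      // Hard upper limit on the number of threads TBB may use anywhere in the process.
      tbb::global_control limit(tbb::global_control::max_allowed_parallelism, nThreads);

      // Explicit arena with the same concurrency; all processing runs inside it.
      tbb::task_arena mainArena(nThreads);
      mainArena.execute([] {
        tbb::parallel_for(0, 100, [](int) { /* event processing would go here */ });
      });
      return 0;
    }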

@fwyzard (Contributor) commented Mar 17, 2021

> Implementation-wise, TBB could still delete a thread and bring up another one to replace it and still honor the constraints, but that seems pretty unlikely since it would be unnecessarily inefficient.

How is this different from the old implementation, when we configured a job to run with 4 or 8 threads?
In that scenario, TBB rarely did delete a thread and bring in a new one.

@Dr15Jones (Contributor)

> How is this different from the old implementation, when we configured a job to run with 4 or 8 threads?
> In that scenario, TBB rarely did delete a thread and bring in a new one.

The addition of tbb::global_control seems to be much more constraining than the old tbb::task_scheduler_init mechanism. The latter seemed to be more of a suggestion for the number of threads, while global_control sets a hard limit. The explicit use of task_arena also seems to be more constraining than the old implicit usage.

I would still do more testing to be sure TBB is actually behaving in such a way. Such testing would be a combination of:

  1. 1 thread and >1 thread
  2. with and without 'external work' (since this calls tbb::task::enqueue) using a non-TBB thread
  3. with and without ES module being run (since they have their own tbb::task_arena)
  4. with and without ROOT IMT (since it uses its own tbb::task_arena)

@makortel (Contributor) commented Mar 17, 2021

> > Implementation-wise, TBB could still delete a thread and bring up another one to replace it and still honor the constraints, but that seems pretty unlikely since it would be unnecessarily inefficient.
>
> How is this different from the old implementation, when we configured a job to run with 4 or 8 threads?
> In that scenario, TBB rarely did delete a thread and bring in a new one.

(now that I finally found it) #31483 shows a case where a job configured to use 256 threads uses (at least) 323 distinct TBB worker threads (that end up running OscarMTProducer) during the job.

@fwyzard (Contributor) commented Mar 18, 2021

@dan131riley could you point me to some documentation of how "moving" the threads from one arena to another works, and how we are (going to be) using different arenas in CMSSW (and ROOT)?

Maybe we could use the different arenas as a kind of macro-category for accounting the resource usage?

@dan131riley (Author)

@fwyzard I'm not the expert on this, but there are a few places we create transitory task arenas in order to manage waiting for asynchronous events. One example is the WaitingTaskWithArenaHolder used for external work, another more recent example is in the secondary event provider:

    // we need the arena to guarantee that the syncWait will return to this thread
    // and not cause this callstack to possibly be moved to a new thread
    tbb::task_arena localArena{tbb::this_task_arena::max_concurrency()};
    std::exception_ptr exceptPtr = localArena.execute([&]() {
      return edm::syncWait([&](edm::WaitingTaskHolder&& iHolder) {
        manager.processOneOccurrenceAsync<T, U>(std::move(iHolder), info, token, streamID, topContext, context);
      });
    });

I doubt it would be very informative to create a task_scheduler_observer for every transitory task arena.

Event setup also currently uses an arena for asynchronous operations, but that may be going away.

@Dr15Jones (Contributor)

To clarify, the framework internally uses 2 tbb::task_arenas.

  1. the main arena which is started in main() and surrounds all our data processing. It is explicitly told to use as many threads as were set in the cmsRun configuration.
  2. the EventSetup arena which is used whenever an ED module requests an EventSetup module to be run. This is done to avoid having a synchronous EventSetup request from an ED module causing another ED module to run on that same thread. This arena will go away once we have finished the esConsumes migration so we can always prefetch the EventSetup data.

WaitingTaskWithArenaHolder doesn't actually use a new tbb::task_arena; instead, it attaches itself to the arena that is active in the thread where it is created. Since WaitingTaskWithArenaHolders are only used by ED modules, they only attach to the main arena.

As for ROOT, it looks like their use of a tbb::task_arena is a hidden internal detail of their implementation.
