
restrict FastTimerService to default arena #33126

Merged

Conversation

@dan131riley (Author)

PR description:

This makes the FastTimerService tbb::task_scheduler_observer local to the primary arena. This fixes the crashes at the end of the job that followed #32804 and are covered in issue #33107.
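(For reference, making the observer arena-local amounts to passing the arena to the tbb::task_scheduler_observer base-class constructor instead of using the obsolete global form. A minimal sketch of that pattern is below; PrimaryArenaObserver and the callback bodies are illustrative, not the actual FastTimerService code.)

    #include <tbb/task_arena.h>
    #include <tbb/task_scheduler_observer.h>

    // Hypothetical sketch, not the actual FastTimerService code: passing the arena
    // to the base-class constructor makes the callbacks fire only for threads
    // entering or leaving that arena, instead of the whole scheduler.
    class PrimaryArenaObserver : public tbb::task_scheduler_observer {
    public:
      explicit PrimaryArenaObserver(tbb::task_arena& arena) : tbb::task_scheduler_observer(arena) {
        observe(true);   // start receiving callbacks for this arena
      }
      ~PrimaryArenaObserver() override { observe(false); }

      void on_scheduler_entry(bool /*isWorker*/) override { /* initialise per-thread state */ }
      void on_scheduler_exit(bool /*isWorker*/) override { /* account per-thread state */ }
    };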

PR validation:

It compiles, and running step 3 of workflow 136.885501, which was failing 20-50% of the time depending on platform, has run without crashes in dozens of test runs.

@cmsbuild (Contributor) commented Mar 9, 2021

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-33126/21479

  • This PR adds an extra 24KB to the repository

@cmsbuild (Contributor) commented Mar 9, 2021

A new Pull Request was created by @dan131riley (Dan Riley) for master.

It involves the following packages:

HLTrigger/Timer

@Martin-Grunewald, @cmsbuild, @fwyzard can you please review it and eventually sign? Thanks.
@Martin-Grunewald this is something you requested to watch as well.
@silviodonato, @dpiparo, @qliphy you are the release manager for this.

cms-bot commands are listed here

@makortel (Contributor) commented Mar 9, 2021

@cmsbuild, please test

@Dr15Jones (Contributor)

Please test

@fwyzard (Contributor) commented Mar 10, 2021 via email

@cmsbuild (Contributor)

Pull request has been put on hold by @fwyzard
They need to issue an unhold command to remove the hold state or L1 can unhold it for all

@cmsbuild added the hold label Mar 10, 2021
@cmsbuild (Contributor)

+1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-e703e6/13390/summary.html
COMMIT: c1886c2
CMSSW: CMSSW_11_3_X_2021-03-09-1100/slc7_amd64_gcc900
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmssw/33126/13390/install.sh to create a dev area with all the needed externals and cmssw changes.

Comparison Summary

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 2950 differences found in the comparisons
  • DQMHistoTests: Total files compared: 38
  • DQMHistoTests: Total histograms compared: 2849195
  • DQMHistoTests: Total failures: 9571
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 2839602
  • DQMHistoTests: Total skipped: 22
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB (37 files compared)
  • Checked 160 log files, 37 edm output root files, 38 DQM output files

@fwyzard (Contributor) commented Mar 10, 2021 via email

@makortel (Contributor)

In addition to the crashes of a few workflows in IBs, the global observer is marked as obsolete in TBB 2020.3:

    /** TODO: Obsolete.
        Global observer semantics is obsolete as it violates master thread isolation
        guarantees and is not composable. Thus the current default behavior of the
        constructor is obsolete too and will be changed in one of the future versions
        of the library. **/
    explicit task_scheduler_observer( bool local = false ) {

https://github.com/oneapi-src/oneTBB/blob/v2020.3/include/tbb/task_scheduler_observer.h#L112-L117
and in fact that mode appears to have been removed in 2021.1.1:
https://github.com/oneapi-src/oneTBB/blob/v2021.1.1/include/oneapi/tbb/task_scheduler_observer.h

This PR seems the best we can do for now.

@fwyzard (Contributor) commented Mar 11, 2021 via email

@makortel (Contributor)

I don't know how we could maintain TBB functionality that gets removed from TBB (without forking, which would lead to other issues).

Maybe ask TBB developers for such hooks? Maybe the desired functionality could be achieved with some use of thread locals (that admittedly would impact all threads and not just TBB threads)? Maybe there is some other solution?
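As one illustration of the thread-local idea (a sketch only; ThreadMeasurement and the result registry are hypothetical): a thread_local object would be initialized on first use in any thread and could publish its totals from its destructor when the thread exits, with the caveat noted above that this covers all threads, not just TBB workers.

    #include <mutex>
    #include <vector>

    // Hypothetical sketch of per-thread accounting without a scheduler observer.
    struct ThreadMeasurement {
      double total = 0.;
      ThreadMeasurement() { /* initialise counters for this thread */ }
      ~ThreadMeasurement() { publish(); }  // runs automatically when the thread exits
      void publish() {
        static std::mutex m;
        static std::vector<double> results;
        std::lock_guard<std::mutex> lock(m);
        results.push_back(total);
      }
    };

    // First access from any thread constructs the object; thread exit destroys it.
    thread_local ThreadMeasurement measurement;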

In the meantime it would be important to get this PR in so that we can see if there are any remaining issues in the IBs.

@makortel (Contributor)

@fwyzard Could you describe the exact behavior FastTimerService is after with its current use of tbb::task_scheduler_observer?

@fwyzard (Contributor) commented Mar 12, 2021

The issue that is solved by inheriting from tbb::task_scheduler_observer and implementing on_scheduler_entry()/on_scheduler_exit() is that, during a job, a TBB pool may relinquish a worker thread and/or acquire a new one; when this happens, the per-thread information needs to be properly initialised (for a new thread) or accounted for (for a thread leaving the pool).
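A minimal self-contained sketch of that pattern, with illustrative bookkeeping (the counters below are stand-ins, not the real FastTimerService data structures), might look like:

    #include <atomic>
    #include <tbb/task_scheduler_observer.h>

    // Illustrative sketch of the pattern described above, not the real FastTimerService:
    // set up per-thread bookkeeping when a thread joins the pool, fold it into the
    // job-wide totals when the thread leaves.
    class PoolObserver : public tbb::task_scheduler_observer {
    public:
      PoolObserver() { observe(true); }
      ~PoolObserver() override { observe(false); }

      void on_scheduler_entry(bool /*isWorker*/) override {
        perThreadWork_ = 0;            // initialise this thread's counters
      }
      void on_scheduler_exit(bool /*isWorker*/) override {
        totalWork_ += perThreadWork_;  // account for the departing thread
      }

      void addWork(unsigned long n) { perThreadWork_ += n; }  // called from tasks

    private:
      static thread_local unsigned long perThreadWork_;
      std::atomic<unsigned long> totalWork_{0};
    };

    thread_local unsigned long PoolObserver::perThreadWork_ = 0;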

@makortel (Contributor)

Summary of what was discussed in ORP+core software meeting: this PR would likely impact the correctness of the accounting of FastTimerService, so a different solution/workaround is needed (Dan is looking into it). In the meantime, we would merge this PR (or a copy of it) into DEVEL IB to check if there are any further, more rare issues left from the migration to tbb::task_group.

@fwyzard (Contributor) commented Mar 17, 2021

Merging in DEVEL makes sense, of course, and thanks for investigating this further.

Note that the tbb::enumerable_thread_specific<Measurement, tbb::cache_aligned_allocator<Measurement>, tbb::ets_key_per_instance> threads_ member automatically adds new elements as needed; however, according to the TBB documentation, it does not guarantee that an element is not reused for a new thread (see the second NOTE at https://software.intel.com/content/www/us/en/develop/documentation/tbb-documentation/top/intel-threading-building-blocks-developer-reference/thread-local-storage/enumerablethreadspecific-template-class.html); it also does not remove/destroy the elements for threads that leave the pool.
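For context, a reduced sketch of such a member and its end-of-job iteration is shown below (Measurement is collapsed to a stub here; the free functions are illustrative):

    #include <tbb/cache_aligned_allocator.h>
    #include <tbb/enumerable_thread_specific.h>

    // Stub standing in for the real FastTimerService Measurement type.
    struct Measurement { double time = 0.; };

    using ThreadLocalMeasurements =
        tbb::enumerable_thread_specific<Measurement,
                                        tbb::cache_aligned_allocator<Measurement>,
                                        tbb::ets_key_per_instance>;

    ThreadLocalMeasurements threads_;

    void accumulate(double dt) {
      threads_.local().time += dt;     // creates this thread's element on first use
    }

    double endJobTotal() {
      double total = 0.;
      for (auto const& m : threads_)   // iterate over all per-thread elements at end of job
        total += m.time;
      return total;
    }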

If there is an alternative to tbb::enumerable_thread_specific that takes care of the proper initialisation and "finalisation" of TLS objects for TBB worker threads, we could avoid the use of the tbb::task_scheduler_observer.

@smuzaffar (Contributor)

OK, I am changing the base branch for this PR to be CMSSW_11_3_DEVEL_X and then we can merge it.

@smuzaffar changed the base branch from master to CMSSW_11_3_DEVEL_X on March 17, 2021 at 10:38
@smuzaffar (Contributor)

@silviodonato, @qliphy can you please merge it? (Note that it will go into DEVEL IBs only.)

@silviodonato (Contributor)

merge
(this will go only in CMSSW_11_3_DEVEL_X for test/debugging)

@cmsbuild merged commit cb6fcb1 into cms-sw:CMSSW_11_3_DEVEL_X on Mar 17, 2021
@smuzaffar (Contributor)

thanks, I am starting a 12h DEVEL IB now

@dan131riley (Author)

This PR is definitely wrong. With the new task_group structure, there are many entries into and exits from the primary arena between measurement points, which will incorrectly reset the thread counters. The behavior of the global observer is an enter when a thread becomes available to TBB and an exit when it leaves, which in the normal case means enter and exit each get called only once per thread. This could be closely approximated by initializing only on the first enter for a thread and then iterating over the enumerable_thread_specific at post-endjob, except for the reuse issue that @fwyzard points out and the finalization of threads that leave mid-job. I believe this is not a practical problem on Linux, but it would be better if we could detect that a thread ID has been reused. There is probably a low-level way to do this with pthread_key_create().
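As a purely illustrative sketch of that low-level approach (not part of this PR), pthread_key_create() allows registering a destructor that runs when a thread exits, which could be used to finalize a thread's measurements and to recognize that a later thread with the same ID is in fact a new thread:

    #include <pthread.h>
    #include <cstdio>

    // Hypothetical sketch: use a pthread TLS destructor to notice thread exit.
    static pthread_key_t key;
    static pthread_once_t once = PTHREAD_ONCE_INIT;

    static void onThreadExit(void* value) {
      // Called automatically when a thread with a non-null value exits:
      // finalize this thread's measurements here.
      std::printf("thread slot %p retired\n", value);
    }

    static void makeKey() { pthread_key_create(&key, onThreadExit); }

    void markThreadSeen(void* perThreadData) {
      pthread_once(&once, makeKey);
      if (pthread_getspecific(key) == nullptr)
        pthread_setspecific(key, perThreadData);  // first time this thread is seen
    }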

@Dr15Jones (Contributor)

It seems unlikely that TBB will be adding/removing threads while we are running with our recent changes. We call tbb::global_control at the beginning of the job and tell TBB that it can only use a set number of threads (taken from the configuration option process.options.numberOfThreads). We then wrap all our processing in a tbb::task_arena that we told to use that same number of threads. So at that point, TBB is constrained to not use more than N threads AND we have explicitly told TBB to use N threads in the 'main' arena. Implementation-wise, TBB could still delete a thread and bring up another one to replace it and still honor the constraints, but that seems pretty unlikely since it would be unnecessarily inefficient.

In the old implementation, if we told TBB to use 1 thread but then used the framework's 'external work' feature, TBB would still create a new thread. I don't think it does that anymore, but we could test it again.
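For reference, the constraints described above boil down to roughly the following pattern (a simplified sketch with a hard-coded thread count, not the actual cmsRun code):

    #include <tbb/global_control.h>
    #include <tbb/parallel_for.h>
    #include <tbb/task_arena.h>

    int main() {
      constexpr int nThreads = 4;  // stand-in for process.options.numberOfThreads

      // Hard upper limit on the number of threads TBB may use anywhere in the process.
      tbb::global_control limit(tbb::global_control::max_allowed_parallelism, nThreads);

      // Explicit arena with the same concurrency; all processing runs inside it.
      tbb::task_arena mainArena(nThreads);
      mainArena.execute([] {
        tbb::parallel_for(0, 100, [](int) { /* event processing would go here */ });
      });
      return 0;
    }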

@fwyzard (Contributor) commented Mar 17, 2021

> Implementation-wise, TBB could still delete a thread and bring up another one to replace it and still honor the constraints, but that seems pretty unlikely since it would be unnecessarily inefficient.

How is this different from the old implementation, when we configured a job to run with 4 or 8 threads?
In that scenario, TBB rarely did delete a thread and bring in a new one.

@Dr15Jones (Contributor)

> How is this different from the old implementation, when we configured a job to run with 4 or 8 threads?
> In that scenario, TBB rarely did delete a thread and bring in a new one.

The addition of tbb::global_control seems to be much more constraining than the old tbb::task_scheduler_init mechanism. The latter seemed to be more of a suggestion for the number of threads, while global_control sets a hard limit. The explicit use of task_arena also seems to be more constraining than the old implicit usage.

I would still do more testing to be sure TBB is actually behaving in such a way. Such testing would be a combination of:

  1. 1 thread and >1 thread
  2. with and without 'external work' (since this calls tbb::task::enqueue) using a non-TBB thread
  3. with and without ES module being run (since they have their own tbb::task_arena)
  4. with and without ROOT IMT (since it uses its own tbb::task_arena)

@makortel (Contributor) commented Mar 17, 2021

> > Implementation-wise, TBB could still delete a thread and bring up another one to replace it and still honor the constraints, but that seems pretty unlikely since it would be unnecessarily inefficient.
>
> How is this different from the old implementation, when we configured a job to run with 4 or 8 threads?
> In that scenario, TBB rarely did delete a thread and bring in a new one.

(now that I finally found it) #31483 shows a case where a job configured to use 256 threads uses (at least) 323 distinct TBB worker threads (that end up running OscarMTProducer) during the job.

@fwyzard (Contributor) commented Mar 18, 2021

@dan131riley could you point me to some documentation of how "moving" the threads from one arena to another works, and how we are (going to be) using different arenas in CMSSW (and ROOT)?

Maybe we could use the different arenas as a kind of macro-category for accounting the resource usage?

@dan131riley (Author)

@fwyzard I'm not the expert on this, but there are a few places we create transitory task arenas in order to manage waiting for asynchronous events. One example is the WaitingTaskWithArenaHolder used for external work, another more recent example is in the secondary event provider:

    // we need the arena to guarantee that the syncWait will return to this thread
    // and not cause this callstack to possibly be moved to a new thread
    tbb::task_arena localArena{tbb::this_task_arena::max_concurrency()};
    std::exception_ptr exceptPtr = localArena.execute([&]() {
      return edm::syncWait([&](edm::WaitingTaskHolder&& iHolder) {
        manager.processOneOccurrenceAsync<T, U>(std::move(iHolder), info, token, streamID, topContext, context);
      });
    });

I doubt it would be very informative to create a task_scheduler_observer for every transitory task arena.

Event setup also currently uses an arena for asynchronous operations, but that may be going away.

@Dr15Jones (Contributor)

To clarify, the framework internally uses 2 tbb::task_arenas.

  1. the main arena which is started in main() and surrounds all our data processing. It is explicitly told to use as many threads as were set in the cmsRun configuration.
  2. the EventSetup arena which is used whenever an ED module requests an EventSetup module to be run. This is done to avoid having a synchronous EventSetup request from an ED module causing another ED module to run on that same thread. This arena will go away once we have finished the esConsumes migration so we can always prefetch the EventSetup data.

WaitingTaskWithArenaHolder doesn't actually use a new tbb::task_arena; instead, it attaches itself to the arena that is active in the thread where it is created. Since WaitingTaskWithArenaHolders are only used by ED modules, they only attach to the main arena.

As for ROOT, it looks like their use of a tbb::task_arena is a hidden internal detail of their implementation.
