Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Switch FastTimerService to using a local thread observer #33261

Merged
merged 4 commits into from Mar 26, 2021

Conversation

dan131riley
Copy link

PR description:

With the new task_group interfaces, global thread observers cause crashes at the end of the job, and the interface for global thread observers is being removed in the latest versions of TBB. This PR modifies the FastTimerService to use a local thread observer on the primary framework arena to register threads, and then tracks individual threads using lower level pthreads interfaces instead of the TBB enumerable_thread_specific (which, under the hood, uses the same pthread mechanisms). In the normal case, this should reproduce the results of the FastTimerService with a global observer and account for any threads that are added to the arena.

PR validation:

Tested on one workflow (136.885501) that it produces results consistent with the results with the global thread observer, and doesn't crash on exit.

@cmsbuild
Copy link
Contributor

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-33261/21747

  • This PR adds an extra 28KB to repository

@cmsbuild
Copy link
Contributor

A new Pull Request was created by @dan131riley (Dan Riley) for master.

It involves the following packages:

HLTrigger/Timer

@Martin-Grunewald, @cmsbuild, @fwyzard can you please review it and eventually sign? Thanks.
@Martin-Grunewald this is something you requested to watch as well.
@silviodonato, @dpiparo, @qliphy you are the release manager for this.

cms-bot commands are listed here

@makortel
Copy link
Contributor

@cmsbuild, please test

@cmsbuild
Copy link
Contributor

+1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-7d1c5b/13747/summary.html
COMMIT: 574677e
CMSSW: CMSSW_11_3_X_2021-03-23-2300/slc7_amd64_gcc900
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmssw/33261/13747/install.sh to create a dev area with all the needed externals and cmssw changes.

Comparison Summary

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 3 differences found in the comparisons
  • DQMHistoTests: Total files compared: 37
  • DQMHistoTests: Total histograms compared: 2639935
  • DQMHistoTests: Total failures: 6
  • DQMHistoTests: Total nulls: 1
  • DQMHistoTests: Total successes: 2639906
  • DQMHistoTests: Total skipped: 22
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.004 KiB( 36 files compared)
  • DQMHistoSizes: changed ( 312.0 ): 0.004 KiB MessageLogger/Warnings
  • Checked 155 log files, 37 edm output root files, 37 DQM output files
  • TriggerResults: no differences found

// initialise the measurement point for a thread that has newly joining the TBB pool
thread().measure();
FastTimerService::ThreadGuard::ThreadGuard() {
auto err = ::pthread_key_create(&key_, retire_thread);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I understand correctly, retire_thread is called when the worker thread exits.
Why not call it from on_scheduler_exit instead ?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Monitoring the primary arena, on_scheduler_entry and on_scheduler_exit get called many times as threads get moved between different arenas, and there's no way to tell when the call to on_scheduler_exit is the final exit. This is the only reliable way I could think of to catch if a thread gets deleted before the end job sequence.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, so with this approach, the time (and resources) spent by a thread outside the main arena would be accounted as "overhead", right ?
Which is not a bad thing.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe that with either the global observer or this PR, what gets measured as overhead is the time outside of any defined CMS module, irrespective of the arena. With the global observer you were getting one call to on_scheduler_enter and on_scheduler_exit per thread, and I'm trying to replicate that with the observer on the primary arena.

Measurement& thread();
void finalize();

tbb::concurrent_vector<std::unique_ptr<specific_t>> thread_resources_;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why a tbb::concurrent_vector<std::unique_ptr<specific_t>> ?
wouldn't tbb::concurrent_vector<specific_t> also work, and avoid one indirection ?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually it needs to be std::shared_ptr. Otherwise there's a potential use-after-free if the threads exit after the FastTimerService gets destructed. We've observed that TBB often doesn't destroy threads until global destructors are called (I believe using the global observer you may actually have been missing some overhead from on_scheduler_exit getting called after postEndJob).

@cmsbuild
Copy link
Contributor

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-33261/21760

  • This PR adds an extra 28KB to repository

@cmsbuild
Copy link
Contributor

Pull request #33261 was updated. @Martin-Grunewald, @cmsbuild, @fwyzard can you please check and sign again.

@Martin-Grunewald
Copy link
Contributor

please test

@cmsbuild
Copy link
Contributor

+1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-7d1c5b/13769/summary.html
COMMIT: f646cb6
CMSSW: CMSSW_11_3_X_2021-03-25-1100/slc7_amd64_gcc900
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmssw/33261/13769/install.sh to create a dev area with all the needed externals and cmssw changes.

Comparison Summary

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 3 differences found in the comparisons
  • DQMHistoTests: Total files compared: 37
  • DQMHistoTests: Total histograms compared: 2639935
  • DQMHistoTests: Total failures: 6
  • DQMHistoTests: Total nulls: 1
  • DQMHistoTests: Total successes: 2639906
  • DQMHistoTests: Total skipped: 22
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.004 KiB( 36 files compared)
  • DQMHistoSizes: changed ( 312.0 ): 0.004 KiB MessageLogger/Warnings
  • Checked 155 log files, 37 edm output root files, 37 DQM output files
  • TriggerResults: no differences found

@Martin-Grunewald
Copy link
Contributor

With this PR I no longer see those crashes, thanks!

@Martin-Grunewald
Copy link
Contributor

+1

@cmsbuild
Copy link
Contributor

This pull request is fully signed and it will be integrated in one of the next master IBs (tests are also fine). This pull request will now be reviewed by the release team before it's merged. @silviodonato, @dpiparo, @qliphy (and backports should be raised in the release meeting by the corresponding L2)

@qliphy
Copy link
Contributor

qliphy commented Mar 26, 2021

+1

@cmsbuild cmsbuild merged commit 64658ee into cms-sw:master Mar 26, 2021
@fwyzard
Copy link
Contributor

fwyzard commented Mar 26, 2021

what the ...

@fwyzard
Copy link
Contributor

fwyzard commented Mar 26, 2021

did anyone actually check the performance of these changes, before merging ?

@qliphy
Copy link
Contributor

qliphy commented Mar 26, 2021

@fwyzard Sorry if there is any misunderstanding. It was based on above PR validation and test results, also on comments/signature from hlt ( @Martin-Grunewald ).

With this PR I no longer see those crashes, thanks!

Moreover, I don't see you have further comment...
We can see what IB test will show.

@dan131riley
Copy link
Author

@fwyzard I did run some timing tests, and saw no significant change in the overall timing. There was a small increase in the "other" overhead category, but prior to this PR I believe "other" may have been undercounted. There's more investigation to do, particularly to see if a less defensive approach would be feasible, so I'd consider this an interim step, not necessarily final, but it's important for development to see that it fixes the crashes.

@fwyzard
Copy link
Contributor

fwyzard commented Mar 26, 2021

@qliphy well, sorry for not being able to reply to every PR all the time.

@dan131riley thanks, if you tell me that this will not necessarily be the final version, I'm OK with that.

@qliphy
Copy link
Contributor

qliphy commented Mar 27, 2021

IB test CMSSW_11_3_X_2021-03-26-2300 (slc7_amd64_gcc900) shows no crash anymore.
Keep watching for next IB.

@dan131riley dan131riley deleted the fasttimer-tbb-task-group-2 branch April 7, 2021 12:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants