Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement skipBadFiles in RootEmbeddedFileSequence::readOneRandom #32821

Merged
merged 4 commits into from Feb 11, 2021

Conversation

dan131riley
Copy link

PR description:

Under heavy loads, we can see high failure rates in premixing jobs due to file open failures for the secondary input files. The skipBadFiles parameter is supposed to allow skipping over file open failures; however, it is only partially implemented in the EmbeddedRootSource used for the secondary input files. This PR implements skipBadFiles for the RootEmbeddedFileSequence::readOneRandom() routine and adds a fileOpenAttempts parameter to control how many file open attempts are made in case of failures.

skipBadFiles is still ignored for the (less often used) sequential read cases.

This PR resolves issue #15653 (from 4.5 years ago), which has links to the HN thread that motivated it.

PR validation:

A test config was created in SecondaryInput/test/SecondaryInputTestSkip_cfg.py (and it found some bugs). However, it has a small probability of causing false positives so it has not been added to the standard unit test run by 'runtests'. Standard tests were also run.

@cmsbuild
Copy link
Contributor

cmsbuild commented Feb 4, 2021

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-32821/21012

  • This PR adds an extra 24KB to repository

@cmsbuild
Copy link
Contributor

cmsbuild commented Feb 4, 2021

A new Pull Request was created by @dan131riley (Dan Riley) for master.

It involves the following packages:

IOPool/Input
IOPool/SecondaryInput

@makortel, @smuzaffar, @cmsbuild, @Dr15Jones can you please review it and eventually sign? Thanks.
@makortel, @wddgit this is something you requested to watch as well.
@silviodonato, @dpiparo, @qliphy you are the release manager for this.

cms-bot commands are listed here

@makortel
Copy link
Contributor

makortel commented Feb 4, 2021

@cmsbuild, please test

@cmsbuild
Copy link
Contributor

cmsbuild commented Feb 5, 2021

+1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-98d480/12724/summary.html
COMMIT: dee8c9d
CMSSW: CMSSW_11_3_X_2021-02-04-1100/slc7_amd64_gcc900
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmssw/32821/12724/install.sh to create a dev area with all the needed externals and cmssw changes.

Comparison Summary

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 8 differences found in the comparisons
  • DQMHistoTests: Total files compared: 37
  • DQMHistoTests: Total histograms compared: 2751765
  • DQMHistoTests: Total failures: 13
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 2751730
  • DQMHistoTests: Total skipped: 22
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 36 files compared)
  • Checked 156 log files, 37 edm output root files, 37 DQM output files

@@ -43,7 +44,8 @@ namespace edm {
initialNumberOfEventsToSkip_(pset.getUntrackedParameter<unsigned int>("skipEvents", 0U)),
treeCacheSize_(pset.getUntrackedParameter<unsigned int>("cacheSize", roottree::defaultCacheSize)),
enablePrefetching_(false),
enforceGUIDInFileName_(pset.getUntrackedParameter<bool>("enforceGUIDInFileName", false)) {
enforceGUIDInFileName_(pset.getUntrackedParameter<bool>("enforceGUIDInFileName", false)),
fileOpenAttempts_(pset.getUntrackedParameter<unsigned int>("fileOpenAttempts", numberOfFiles())) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In #15653 you wrote

In addition, using the number of files in the list as the retry count doesn't make much sense for a random selection, and also doesn't make sense if retries are being cause by a systemic problem like a site issue.

Should the default be something else then? The fillDescriptions() (which, IIUC, is not used by any of the mixing modules) uses 1. Should we use the same here (also to not change the default behavior)?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@makortel fileOpenAttempts is new in this PR. The old behavior was to try up to the number of files on the initial open, and then not retry at all for subsequent files. Rereading the HN thread, numberOfFiles() is probably too large, but 1 is too small--1 would be equivalent to turning off skipBadFiles. I also realized rereading the HN thread that what @bbockelm was suggesting was a per-job retry limit, so that systemic problems don't result in the servers getting hammered by retries from every thread--so I've revised the PR to set the retry count to 3 and make it a static atomic.

@cmsbuild
Copy link
Contributor

cmsbuild commented Feb 7, 2021

-code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-32821/21048

  • This PR adds an extra 24KB to repository

Code check has found code style and quality issues which could be resolved by applying following patch(s)

@cmsbuild
Copy link
Contributor

cmsbuild commented Feb 7, 2021

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-32821/21049

  • This PR adds an extra 20KB to repository

@cmsbuild
Copy link
Contributor

cmsbuild commented Feb 7, 2021

Pull request #32821 was updated. @makortel, @smuzaffar, @cmsbuild, @Dr15Jones can you please check and sign again.

@makortel
Copy link
Contributor

makortel commented Feb 8, 2021

@cmsbuild, please test

@cmsbuild
Copy link
Contributor

cmsbuild commented Feb 8, 2021

+1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-98d480/12765/summary.html
COMMIT: 36d83a0
CMSSW: CMSSW_11_3_X_2021-02-08-1100/slc7_amd64_gcc900
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmssw/32821/12765/install.sh to create a dev area with all the needed externals and cmssw changes.

Comparison Summary

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 6 differences found in the comparisons
  • DQMHistoTests: Total files compared: 37
  • DQMHistoTests: Total histograms compared: 2751765
  • DQMHistoTests: Total failures: 12
  • DQMHistoTests: Total nulls: 1
  • DQMHistoTests: Total successes: 2751730
  • DQMHistoTests: Total skipped: 22
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: -0.004 KiB( 36 files compared)
  • DQMHistoSizes: changed ( 312.0 ): -0.004 KiB MessageLogger/Warnings
  • Checked 156 log files, 37 edm output root files, 37 DQM output files

@makortel
Copy link
Contributor

+1

@cmsbuild
Copy link
Contributor

This pull request is fully signed and it will be integrated in one of the next master IBs (tests are also fine). This pull request will now be reviewed by the release team before it's merged. @silviodonato, @dpiparo, @qliphy (and backports should be raised in the release meeting by the corresponding L2)

@makortel
Copy link
Contributor

We should probably inform computing that there will be a new knob, right?

@silviodonato
Copy link
Contributor

+1

@cmsbuild cmsbuild merged commit 597ebba into cms-sw:master Feb 11, 2021
@dan131riley dan131riley deleted the RootEmbeddedFileSequence-filehash branch February 11, 2021 15:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants