Implement skipBadFiles in RootEmbeddedFileSequence::readOneRandom #32821
+code-checks Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-32821/21012
A new Pull Request was created by @dan131riley (Dan Riley) for master. It involves the following packages: IOPool/Input. @makortel, @smuzaffar, @cmsbuild, @Dr15Jones can you please review it and eventually sign? Thanks. cms-bot commands are listed here.
@cmsbuild, please test
+1 Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-98d480/12724/summary.html
@@ -43,7 +44,8 @@ namespace edm {
       initialNumberOfEventsToSkip_(pset.getUntrackedParameter<unsigned int>("skipEvents", 0U)),
       treeCacheSize_(pset.getUntrackedParameter<unsigned int>("cacheSize", roottree::defaultCacheSize)),
       enablePrefetching_(false),
-      enforceGUIDInFileName_(pset.getUntrackedParameter<bool>("enforceGUIDInFileName", false)) {
+      enforceGUIDInFileName_(pset.getUntrackedParameter<bool>("enforceGUIDInFileName", false)),
+      fileOpenAttempts_(pset.getUntrackedParameter<unsigned int>("fileOpenAttempts", numberOfFiles())) {
In #15653 you wrote:
"In addition, using the number of files in the list as the retry count doesn't make much sense for a random selection, and also doesn't make sense if retries are being caused by a systemic problem like a site issue."
Should the default be something else then? The fillDescriptions() (which, IIUC, is not used by any of the mixing modules) uses 1. Should we use the same here (also to not change the default behavior)?
@makortel fileOpenAttempts is new in this PR. The old behavior was to try up to the number of files on the initial open, and then not retry at all for subsequent files. Rereading the HN thread, numberOfFiles() is probably too large, but 1 is too small: 1 would be equivalent to turning off skipBadFiles. I also realized, rereading the HN thread, that what @bbockelm was suggesting was a per-job retry limit, so that systemic problems don't result in the servers getting hammered by retries from every thread. So I've revised the PR to set the retry count to 3 and make it a static atomic.
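To illustrate the per-job retry limit described in the comment above, here is a minimal sketch of a retry budget held in a static atomic, so that every thread in the job draws from the same counter. The class name `RetryBudget` and its methods are hypothetical and are not taken from the CMSSW code; only the initial value of 3 and the "static atomic" idea come from the comment.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical illustration of a job-wide retry budget (not the CMSSW code).
// The counter is a static atomic, so all threads share one budget.
class RetryBudget {
public:
  // Consume one retry if any are left. Returns false once the budget is
  // spent, which is the point at which callers should stop retrying.
  static bool tryConsume() {
    int remaining = remaining_.load(std::memory_order_relaxed);
    while (remaining > 0) {
      // On failure, compare_exchange_weak reloads 'remaining' with the
      // current value, so the loop re-checks the budget before retrying.
      if (remaining_.compare_exchange_weak(remaining, remaining - 1)) {
        return true;
      }
    }
    return false;
  }

  // Reset the budget, e.g. at job start (assumed usage).
  static void reset(int n) {
    assert(n >= 0);
    remaining_.store(n);
  }

  static int remaining() { return remaining_.load(); }

private:
  static std::atomic<int> remaining_;
};

// Initial budget of 3, following the value mentioned in the comment above.
std::atomic<int> RetryBudget::remaining_{3};
```

With this shape, a systemic failure such as a site outage exhausts the budget after three retries in total for the job, rather than three per thread or per file sequence.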
-code-checks Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-32821/21048
Code check has found code style and quality issues which could be resolved by applying following patch(s)
+code-checks Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-32821/21049
Pull request #32821 was updated. @makortel, @smuzaffar, @cmsbuild, @Dr15Jones can you please check and sign again.
@cmsbuild, please test
+1 Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-98d480/12765/summary.html
+1
This pull request is fully signed and it will be integrated in one of the next master IBs (tests are also fine). This pull request will now be reviewed by the release team before it's merged. @silviodonato, @dpiparo, @qliphy (and backports should be raised in the release meeting by the corresponding L2)
We should probably inform computing that there will be a new knob, right?
+1 |
PR description:
Under heavy loads, we can see high failure rates in premixing jobs due to file open failures for the secondary input files. The skipBadFiles parameter is supposed to allow skipping over file open failures; however, it is only partially implemented in the EmbeddedRootSource used for the secondary input files. This PR implements skipBadFiles for the RootEmbeddedFileSequence::readOneRandom() routine and adds a fileOpenAttempts parameter to control how many file open attempts are made in case of failures. skipBadFiles is still ignored for the (less often used) sequential read cases.
This PR resolves issue #15653 (from 4.5 years ago), which has links to the HN thread that motivated it.
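As a rough illustration of the behavior described above, the following sketch picks files at random and, when skipBadFiles is set, moves on to another random pick after a failed open until fileOpenAttempts is exhausted. It is a simplified stand-in and not the code in RootEmbeddedFileSequence: the function name `openRandomFile`, the `tryOpen` callback, and the per-call attempt count are all assumptions made for the example.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <optional>
#include <random>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical sketch: choose a random file and try to open it, skipping
// over failures when skipBadFiles is true. Returns the index of the file
// that opened, or std::nullopt if every attempt failed.
std::optional<std::size_t> openRandomFile(
    const std::vector<std::string>& files,
    const std::function<bool(const std::string&)>& tryOpen,  // assumed opener
    bool skipBadFiles,
    unsigned int fileOpenAttempts,
    std::mt19937& rng) {
  if (files.empty()) {
    throw std::invalid_argument("file list is empty");
  }
  std::uniform_int_distribution<std::size_t> pick(0, files.size() - 1);

  // Without skipBadFiles the first failure is final; with it, allow up to
  // fileOpenAttempts tries (always at least one).
  const unsigned int maxAttempts =
      skipBadFiles ? std::max(1U, fileOpenAttempts) : 1U;

  for (unsigned int attempt = 0; attempt < maxAttempts; ++attempt) {
    const std::size_t index = pick(rng);
    if (tryOpen(files[index])) {
      return index;
    }
    // Open failed: fall through and draw another random file.
  }
  return std::nullopt;
}
```

A caller would treat an empty result as a hard failure once the attempts are used up. Note that each retry draws independently, so the same bad file can be picked again, which is one reason a retry count tied to the number of files is not very meaningful for random selection.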
PR validation:
A test config was created in SecondaryInput/test/SecondaryInputTestSkip_cfg.py (and it found some bugs). However, it has a small probability of causing false positives, so it has not been added to the standard unit tests run by 'runtests'. Standard tests were also run.