[11_0_X] Protect storage accounting UDP messages from NaN, and Use StatisticsSenderService for all framework files #36358

Merged

@makortel makortel commented Dec 3, 2021

PR description:

This PR is a combined backport of #35362 and #35505, following requests in #29412 and #36349. It also includes #36403 as further cleanup.

PR validation:

Unit tests pass.

NaNs were being reported in the values computed using sqrt, most likely because the variables involved were not being updated atomically together.
Previously, each attempt to open the file using a different PFN would report an open attempt for the same LFN. This meant we could have multiple opens but only one close for a given LFN.
When sending information to the StatisticsSenderService, the file LFN or URL must be supplied.
Send statistics for primary, secondary, and embedded files.
The aggregate file statistics are reset only on primary file close boundaries, to keep the behavior the same as before.
Changed all calls to closeFile_() to use the new closeFile().
Now broadcasts how the file is used.
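
To illustrate the NaN item above: a minimal sketch, assuming the reported value is a standard deviation derived from a count, a sum, and a sum of squares that are read while other threads may be updating them. The names CounterSnapshot and safeSigma are hypothetical and are not taken from the PR; the actual protection added in #35362 may look different.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

// Hypothetical snapshot of counters that may have been read while only
// some of them had been updated (a "torn" read).
struct CounterSnapshot {
  std::uint64_t attempts;  // number of recorded operations
  double sum;              // sum of the measured values
  double sumSquares;       // sum of the squared measured values
};

// Returns the standard deviation, or 0 when the snapshot is inconsistent.
inline double safeSigma(CounterSnapshot const& s) {
  if (s.attempts == 0) {
    return 0.0;
  }
  double const n = static_cast<double>(s.attempts);
  double const mean = s.sum / n;
  // With a torn snapshot this difference can become negative, and
  // std::sqrt of a negative number yields NaN.
  double const variance = s.sumSquares / n - mean * mean;
  double const sigma = std::sqrt(std::max(variance, 0.0));
  // Catch anything else that is not a finite number (e.g. inf - inf above).
  return std::isfinite(sigma) ? sigma : 0.0;
}
```

Clamping the variance at zero keeps the message well formed even when the inputs are mutually inconsistent; it does not make the counters themselves consistent.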
@cmsbuild

cmsbuild commented Dec 3, 2021

A new Pull Request was created by @makortel (Matti Kortelainen) for CMSSW_11_0_X.

It involves the following packages:

  • IOPool/Input (core)
  • IOPool/SecondaryInput (core)
  • IOPool/TFileAdaptor (core)
  • Utilities/StorageFactory (core)
  • Utilities/XrdAdaptor (core)

@cmsbuild, @smuzaffar, @Dr15Jones, @makortel can you please review it and eventually sign? Thanks.
@wddgit this is something you requested to watch as well.
@perrotta, @dpiparo, @qliphy you are the release manager for this.

cms-bot commands are listed here

@makortel

makortel commented Dec 3, 2021

@cmsbuild, please test

@makortel

makortel commented Dec 3, 2021

backport

@cmsbuild

cmsbuild commented Dec 3, 2021

+1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-0c72ac/20977/summary.html
COMMIT: e96e622
CMSSW: CMSSW_11_0_X_2021-11-28-0000/slc7_amd64_gcc820
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmssw/36358/20977/install.sh to create a dev area with all the needed externals and cmssw changes.

Comparison Summary

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 4 differences found in the comparisons
  • DQMHistoTests: Total files compared: 34
  • DQMHistoTests: Total histograms compared: 2793840
  • DQMHistoTests: Total failures: 1
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 2793498
  • DQMHistoTests: Total skipped: 341
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 33 files compared)
  • Checked 147 log files, 30 edm output root files, 34 DQM output files
  • TriggerResults: no differences found

@cmsbuild

cmsbuild commented Dec 4, 2021

This pull request is fully signed and it will be integrated in one of the next CMSSW_11_0_X IBs (tests are also fine) and once validation in the development release cycle CMSSW_12_2_X is complete. This pull request will now be reviewed by the release team before it's merged. @perrotta, @dpiparo, @qliphy (and backports should be raised in the release meeting by the corresponding L2)

@perrotta

perrotta commented Dec 4, 2021

IOPool/Input/src/RootInputFileSequence.cc also includes the updates originally in #28911 (“New developments using multiple data catalogs provided in site-local-config.xml“): they were merged in CMSSW_11_1_0_patch1

@makortel please confirm that importing the updates only from one file in #28911 (instead of the whole PR) doesn't cause possible issues somewhere

@makortel

makortel commented Dec 6, 2021

@perrotta Thanks for the detailed check, but could you clarify? My intention was to not include any changes from #28911 (and if I wasn't careful enough and something sneaked in, I want to understand it well). The changes in IOPool/Input/src/RootInputFileSequence.cc should be

  • Create one std::unique_ptr<InputSource::FileOpenSentry> sentry instead of separate objects for primary and fallback open attempts.
  • Call edm::storage::StatisticsSenderService::openingFile()
  • Call edm::storage::StatisticsSenderService::closedFile() via new member function closeFile().

All these are from #35505. Note that the first of these changes caused indentation changes, and the diff becomes easier to read with "Hide whitespace".
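
The first change in the list above can be sketched as follows. This is only an illustration under stated assumptions: OpenSentry is a hypothetical stand-in for InputSource::FileOpenSentry that records a pre/post pair in a log, and openWithFallback with its tryOpen callback is invented for the example; neither is code from #35505.

```cpp
#include <functional>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for InputSource::FileOpenSentry: records one entry
// on construction and one on destruction.
class OpenSentry {
public:
  OpenSentry(std::vector<std::string>& log, std::string lfn) : log_(log), lfn_(std::move(lfn)) {
    log_.push_back("preOpen " + lfn_);
  }
  ~OpenSentry() { log_.push_back("postOpen " + lfn_); }
  OpenSentry(OpenSentry const&) = delete;
  OpenSentry& operator=(OpenSentry const&) = delete;

private:
  std::vector<std::string>& log_;
  std::string lfn_;
};

// Tries each PFN in turn. tryOpen is a hypothetical callback that returns
// true when the open succeeds.
inline bool openWithFallback(std::vector<std::string>& log,
                             std::string const& lfn,
                             std::vector<std::string> const& pfns,
                             std::function<bool(std::string const&)> const& tryOpen) {
  // A single sentry covers the primary and all fallback attempts, so exactly
  // one pre/post pair is recorded per LFN regardless of how many PFNs are tried.
  auto sentry = std::make_unique<OpenSentry>(log, lfn);
  for (auto const& pfn : pfns) {
    if (tryOpen(pfn)) {
      return true;
    }
  }
  return false;
}
```

With two PFNs of which only the second opens, the log ends up with one "preOpen" and one "postOpen" entry for the LFN, rather than one pair per attempt.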

@perrotta

perrotta commented Dec 6, 2021

Hi Matti.
For example, the following line is from #28911:
https://github.com/cms-sw/cmssw/pull/36358/files?w=1;#diff-4f6e0887868f2e273e21bf5c6449848b28466b34e0a88961d3db5a3801d27c21R231
and in general, if I am not wrong, all the usages of the bool usedFallback_ were removed with #28911 and the same happens in this PR (where, by the way, usedFallback_ is only defined and updated, but never used in the code)

@makortel

makortel commented Dec 6, 2021

Thanks @perrotta, I see your point now. Indeed the equivalent of usedFallback_=false was done in #28911. Via FileOpenSentry that boolean gets used only for preOpenFile, postOpenFile, preCloseFile, and postCloseFile signals. I went through all users of those signals (in 11_0_0)

  • FWCore/MessageService/src/MessageLogger.cc (clean)
  • FWCore/Services/plugins/CheckTransitions.cc (clean)
  • FWCore/Services/plugins/CondorStatusUpdater.cc (clean)
  • FWCore/Services/plugins/Timing.cc (clean)
  • FWCore/Services/plugins/Tracer.cc (all print)
  • HeterogeneousCore/CUDAServices/plugins/NVProfilerService.cc (clean)
  • IgTools/IgProf/plugins/IgProfService.cc (clean)
  • Utilities/StorageFactory/src/StatisticsSenderService.cc

Of those, Tracer makes use of the boolean by showing it in its printouts

void Tracer::preOpenFile(std::string const& lfn, bool b) {
  LogAbsolute out("Tracer");
  out << TimeStamper(printTimestamps_);
  out << indention_ << indention_ << " starting: open input file: lfn = " << lfn;
  if (dumpNonModuleContext_)
    out << " usedFallBack = " << b;
}

and StatisticsSenderService propagates it in the UDP packet
if (usedFallback) {
  os << "\"fallback\": true, ";
}

(This PR changes the information delivery mechanism from the ActivityRegistry callbacks to a direct call of StatisticsSenderService::closedFile() from RootInputFileSequence::closeFile(), but the source of the information is the same.)

I think this points to a bug in #28911: the information on whether a fallback file was used no longer propagates to the UDP packets.
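
For illustration, a minimal sketch of how the flag ends up in (or is missing from) the message text. Only the "fallback" fragment comes from the snippet quoted above; the function name fileStatsFragment and the "file_lfn" key are assumptions made for this example and are not claimed to match the real packet layout.

```cpp
#include <sstream>
#include <string>

// Builds a small JSON-like fragment. The "fallback" key is emitted only when
// the flag is true, so a flag that is always false silently drops the field.
// Note: the LFN is not escaped here; a real implementation would need to
// handle quotes and backslashes.
inline std::string fileStatsFragment(std::string const& lfn, bool usedFallback) {
  std::ostringstream os;
  os << "{";
  if (usedFallback) {
    os << "\"fallback\": true, ";
  }
  os << "\"file_lfn\": \"" << lfn << "\"}";
  return os.str();
}
```

Because the field is omitted rather than set to false, a consumer of the packets cannot tell "no fallback was used" apart from "the flag was never propagated", which is how a regression of this kind can go unnoticed.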

@makortel

makortel commented Dec 6, 2021

I opened an issue #36375 and am thinking of fixing it, and including the fix as part of these backports. Thanks @perrotta for pointing it out!

}
if (!filePtr && (hasFallbackUrl)) {
  try {
    usedFallback_ = true;

Actually, since usedFallback_ = true is set here, 11_0_X (and earlier) do not need the backport of #36379. The FileOpenSentry will still always signal that none of the files are fallbacks, but that information is not used anywhere (except in the Tracer service, and its printouts being "wrong" is not a big deal). The StatisticsSenderService gets the value of this boolean via a direct call anyway (instead of via the ActivityRegistry callbacks).
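
A rough sketch of the direct-call path described in this comment, under stated assumptions: StatsRecorder stands in for StatisticsSenderService, FileSequence for RootInputFileSequence, and the boolean parameters of open() merely simulate whether each open attempt succeeds. None of these signatures are taken from CMSSW.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for StatisticsSenderService.
struct StatsRecorder {
  struct Record {
    std::string lfn;
    bool usedFallback;
  };
  std::vector<Record> closed;
  void closedFile(std::string const& lfn, bool usedFallback) { closed.push_back({lfn, usedFallback}); }
};

// Hypothetical stand-in for the file sequence that owns usedFallback_.
class FileSequence {
public:
  explicit FileSequence(StatsRecorder& stats) : stats_(stats) {}

  // primaryOk and fallbackOk simulate the outcome of each open attempt.
  bool open(std::string const& lfn, bool primaryOk, bool hasFallbackUrl, bool fallbackOk) {
    lfn_ = lfn;
    usedFallback_ = false;
    bool opened = primaryOk;
    if (!opened && hasFallbackUrl) {
      usedFallback_ = true;  // set before trying the fallback, as in the diff above
      opened = fallbackOk;
    }
    isOpen_ = opened;
    return opened;
  }

  // The flag is handed to the recorder directly, not via a signal callback.
  void closeFile() {
    if (!isOpen_) {
      return;
    }
    stats_.closedFile(lfn_, usedFallback_);
    isOpen_ = false;
  }

private:
  StatsRecorder& stats_;
  std::string lfn_;
  bool usedFallback_ = false;
  bool isOpen_ = false;
};
```

In this arrangement the value reported on close depends only on the member set during the open attempts, so whatever a sentry signals to other listeners does not affect the statistics.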

@cmsbuild

cmsbuild commented Dec 9, 2021

Pull request #36358 was updated. @cmsbuild, @smuzaffar, @Dr15Jones, @makortel can you please check and sign again.

@makortel

makortel commented Dec 9, 2021

Now including #36403 too.

@makortel

makortel commented Dec 9, 2021

unhold

@cmsbuild cmsbuild removed the hold label Dec 9, 2021
@makortel

makortel commented Dec 9, 2021

@cmsbuild, please test

@cmsbuild

+1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-0c72ac/21145/summary.html
COMMIT: ed0609f
CMSSW: CMSSW_11_0_X_2021-12-05-0000/slc7_amd64_gcc820
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmssw/36358/21145/install.sh to create a dev area with all the needed externals and cmssw changes.

Comparison Summary

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 4 differences found in the comparisons
  • DQMHistoTests: Total files compared: 34
  • DQMHistoTests: Total histograms compared: 2793840
  • DQMHistoTests: Total failures: 2
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 2793497
  • DQMHistoTests: Total skipped: 341
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 33 files compared)
  • Checked 147 log files, 30 edm output root files, 34 DQM output files
  • TriggerResults: no differences found

@makortel

+1

@cmsbuild

This pull request is fully signed and it will be integrated in one of the next CMSSW_11_0_X IBs (tests are also fine) and once validation in the development release cycle CMSSW_12_3_X is complete. This pull request will now be reviewed by the release team before it's merged. @perrotta, @dpiparo, @qliphy (and backports should be raised in the release meeting by the corresponding L2)

@qliphy

qliphy commented Dec 10, 2021

+1

@cmsbuild cmsbuild merged commit 085def3 into cms-sw:CMSSW_11_0_X Dec 10, 2021
@makortel makortel deleted the backportStatisticsSenderService_110x branch December 10, 2021 15:32

5 participants