Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DQM pages for bin-by-bin comparisons in PR tests come up empty #38980

Closed
missirol opened this issue Aug 5, 2022 · 29 comments
Closed

DQM pages for bin-by-bin comparisons in PR tests come up empty #38980

missirol opened this issue Aug 5, 2022 · 29 comments

Comments

@missirol
Copy link
Contributor

missirol commented Aug 5, 2022

Since earlier today, I cannot see plots in the DQM webpage (https://cmsweb.cern.ch/dqm/dev) used to display bin-by-bin comparisons for CMSSW PR tests, at least for PRs tested recently.

Example:

I see the same for several recent PRs (I checked only a few). I did find one where the plots show up
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-49175f/26445/summary.html
https://tinyurl.com/24nx32n7

Initially discussed in #38971 (comment).

@cmsbuild
Copy link
Contributor

cmsbuild commented Aug 5, 2022

A new Issue was created by @missirol Marino Missiroli.

@Dr15Jones, @perrotta, @dpiparo, @rappoccio, @makortel, @smuzaffar, @qliphy can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@makortel
Copy link
Contributor

makortel commented Aug 5, 2022

assign core,dqm

@cmsbuild
Copy link
Contributor

cmsbuild commented Aug 5, 2022

New categories assigned: core,dqm

@jfernan2,@ahmad3213,@micsucmed,@rvenditti,@Dr15Jones,@smuzaffar,@emanueleusai,@makortel,@pmandrik you have been requested to review this Pull request/Issue and eventually sign? Thanks

@makortel
Copy link
Contributor

makortel commented Aug 5, 2022

@cms-sw/externals-l2 @cms-sw/dqm-l2 Would anyone of you have an idea what could be going wrong?

@emanueleusai
Copy link
Member

yes I saw this too. I'm going to restart the GUI and see if it fixes it

@missirol
Copy link
Contributor Author

missirol commented Aug 6, 2022

@emanueleusai , any news on this? Looks like the issue is still there.

@tvami
Copy link
Contributor

tvami commented Aug 6, 2022

indeed a recently came back PR test
#38916
have this stil missing too

@missirol
Copy link
Contributor Author

missirol commented Aug 8, 2022

@cms-sw/dqm-l2

This issue persists, and it's delaying the review of a few PRs.

Should we expect it to be fixed anytime soon?

@emanueleusai
Copy link
Member

I couldn't look into it any further over the weekend, but we are working on it.

@emanueleusai
Copy link
Member

So @ahmad3213 found this error message in the logs of the bin by bin GUI:

visDQMIndex: ./DQM/StringAtom.h:124: size_t StringAtomTree::insert(const string&): Assertion `tree_.size() < tree_.capacity()' failed.
*** DQMStore: WARNING: no monitor element 'total_errors' in directory 'Run 1/MessageLogger/Run summary/Errors' for quality test 'total_errors.XrangeWithin:Error'

visDQMIndex (pid=28588 ppid=5099) received fatal signal 6 (Aborted)

First appearance around July 27th. We keep investigating.

@emanueleusai
Copy link
Member

It looks like we might be hitting the size limit of the index trees.
https://github.com/cms-DQM/dqmgui_prod/blob/index128/src/cpp/DQM/StringAtom.h#L64

StringAtomTree(size_t capacity = 1024*1024)

emanueleusai added a commit to emanueleusai/cmsdist that referenced this issue Aug 11, 2022
muhammadimranfarooqi pushed a commit to cms-sw/cmsdist that referenced this issue Aug 11, 2022
@qliphy
Copy link
Contributor

qliphy commented Aug 15, 2022

@emanueleusai This issue is fixed and can be closed, right?

@missirol
Copy link
Contributor Author

Looking at recent PRs, it looks like the issue is still present (I don't see plots).

@cms-sw/dqm-l2 , what is the ETA to deploy a fix? (if not done already)

@emanueleusai
Copy link
Member

Hi it's not fixed yet, it looks like the fix we prepared increasing the tree size is not working. One of our SW engineers is now investigating it.

@missirol
Copy link
Contributor Author

missirol commented Sep 5, 2022

1 month without proper bin-by-bin comparisons feels like a long time.

I don't know if this should be marked as urgent, but it sure needs to be fixed.

@cms-sw/dqm-l2 , is there an ETA?

@emanueleusai
Copy link
Member

So our SW engineer was able to reproduce the issue on a local installation of the GUI using a recent file (where it doesn't work) and an older file (where it works). Inspecting the logs it looks like there might be a problem with the indexer when loading a new file.

We asked the cmsweb admins to do a backup and reset of the index on the DQM GUI dev flavour on server vocms0731 and seemingly it didn't fix the issue.
We are also trying to get in touch with one of the original authors of the old GUI code.

We don't know what changed exactly between the root files before ~July and after. But it is unlikely that all of a sudden the size of the index trees has quadrupled (the original attempt at fixing the issue increased the limit four-fold). So it is not a trivial fix as we originally thought.

The SW engineer working on it is now on holiday for the next few weeks so we are moving the task to our other SW engineer.
We are working on it with priority and we hope to have it solved within the next 1 or 2 weeks. We understand the impact is has on PR reviews (it does impact my work too). Thank you for your patience.

@pmandrik
Copy link
Contributor

pmandrik commented Sep 6, 2022

At local setup of DQM GUI version 9.7.7 when the DB is empty / new no problems in indexer:

exec visDQMIndex add /afs/cern.ch/work/p/pmandrik/swork/4_dqm/idir /afs/cern.ch/work/p/pmandrik/swork/4_dqm/03_binbybin/RelVal/CMSSW_12_6_x/DQM_V0001_R000000001__RelVal_wf11834_0_base__CMSSW_12_6_X-PRcmssw_39302-52583__DQMIO.root
2022-09-06 14:53:54.086111 [visDQMImportDaemon/3674] imported /afs/cern.ch/work/p/pmandrik/swork/4_dqm/03_binbybin/RelVal/CMSSW_12_6_x/DQM_V0001_R000000001__RelVal_wf11834_0_base__CMSSW_12_6_X-PRcmssw_39302-52583__DQMIO.root with status 0 in 0.844s

and when DB is old (copied from vocms0731) the same crash:

exec visDQMIndex add /afs/cern.ch/work/p/pmandrik/swork/4_dqm/idir /afs/cern.ch/work/p/pmandrik/swork/4_dqm/03_binbybin/RelVal/CMSSW_12_6_x/DQM_V0001_R000000001__RelVal_wf11834_0_base__CMSSW_12_6_X-PRcmssw_39302-52583__DQMIO.root
visDQMIndex: ./DQM/StringAtom.h:124: size_t StringAtomTree::insert(const string&): Assertion `tree_.size() < tree_.capacity()' failed.

looks like cmsweb admins had not reset the index DB yet or had not done it properly, maybe we can rather delete the whole DB and start a new one.

@pmandrik
Copy link
Contributor

pmandrik commented Sep 8, 2022

The Index DB was recreated. Bin-by-bin comparison in GUI is started to be available for some of the latest PRs, e.g.:
https://cmssdt.cern.ch/SDT/jenkins-artifacts/baseLineComparisons/CMSSW_12_6_X_2022-09-07-2300+8aa7bf/52663/dqm-histo-comparison-summary.html
https://cmssdt.cern.ch/SDT/jenkins-artifacts/baseLineComparisons/CMSSW_12_6_X_2022-09-07-2300+66b1df/52662/dqm-histo-comparison-summary.html

@emanueleusai
Copy link
Member

Confirmed working on recent PRs.

NB: For PRs where the tests were triggered before the index was recreated the GUI will still appear empty, so I recommend re-triggering the tests if you intend to use the DQM bin-by-bin tool.

@emanueleusai
Copy link
Member

+dqm

@makortel
Copy link
Contributor

makortel commented Sep 8, 2022

unassign core

@cmsbuild
Copy link
Contributor

cmsbuild commented Sep 8, 2022

This issue is fully signed and ready to be closed.

@makortel
Copy link
Contributor

makortel commented Sep 8, 2022

@cmsbuild, please close

@cmsbuild cmsbuild closed this as completed Sep 8, 2022
@rovere
Copy link
Contributor

rovere commented Sep 9, 2022

The Index DB was recreated. Bin-by-bin comparison in GUI is started to be available for some of the latest PRs, e.g.: https://cmssdt.cern.ch/SDT/jenkins-artifacts/baseLineComparisons/CMSSW_12_6_X_2022-09-07-2300+8aa7bf/52663/dqm-histo-comparison-summary.html https://cmssdt.cern.ch/SDT/jenkins-artifacts/baseLineComparisons/CMSSW_12_6_X_2022-09-07-2300+66b1df/52662/dqm-histo-comparison-summary.html

You pushed into production the wrong fix that, in fact, is not a fix at all, as I commented directly on the PR.
Regenerating the index is fine for the dev instance. It would be bad for Offline and RelVal servers.
Please check on those servers what the status is of the occupancy of all the Patricia trees used internally and, in case, increase their memory allocation accordingly.

@pmandrik
Copy link
Contributor

pmandrik commented Sep 9, 2022

I commented directly on the PR

Hi, could you point me to this discussion? thanks!

@rovere
Copy link
Contributor

rovere commented Sep 9, 2022

cms-DQM/dqmgui_prod#11 (review)

@emanueleusai
Copy link
Member

Ciao Marco @rovere, thank you for your feedback, it is much appreciated.

So concerning the Patricia trees I found only one more instance where the value is set manually and indeed that wasn't changed: https://github.com/cms-DQM/dqmgui_prod/search?q=StringAtomTree%28
https://github.com/cms-DQM/dqmgui_prod/blob/b282784006ee15b68e841ef66b2875a8fccd4536/src/cpp/DQM/serverext.cc
In any case given we found a solution by other means, we can certainly revert the PR and set the default value to what it was before if you are concerned about it.

Concerning regenerating the index: we observed the issue only in dev where the history of the index is not needed. Since we never saw the issue on offline and relval we did not check or investigate what is going on there. Do you have any particular suspicion that we might be about to hit the limit in those instances?

@rovere
Copy link
Contributor

rovere commented Sep 12, 2022

Ciao @emanueleusai
this code region could be helpful to look at link

I'm not particularly concerned about the default value since it is practically never used.

I haven't been playing around with DQMGUI for ages and I have no idea what the current occupancy of the many patricia trees is for Offline and RelVal servers. Maybe having a look at that would be something useful, at least to understand where you are in the process of filling all the allocated slots.

@emanueleusai
Copy link
Member

Thank you, Marco. We will follow up with your suggestion.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

8 participants