DQM pages for bin-by-bin comparisons in PR tests come up empty #38980
Comments
A new Issue was created by @missirol Marino Missiroli. @Dr15Jones, @perrotta, @dpiparo, @rappoccio, @makortel, @smuzaffar, @qliphy can you please review it and eventually sign/assign? Thanks. cms-bot commands are listed here
assign core,dqm
New categories assigned: core,dqm @jfernan2,@ahmad3213,@micsucmed,@rvenditti,@Dr15Jones,@smuzaffar,@emanueleusai,@makortel,@pmandrik you have been requested to review this Pull request/Issue and eventually sign? Thanks
@cms-sw/externals-l2 @cms-sw/dqm-l2 Would any of you have an idea what could be going wrong?
Yes, I saw this too. I'm going to restart the GUI and see if that fixes it.
@emanueleusai, any news on this? Looks like the issue is still there.
Indeed, a PR test that came back recently still shows it.
@cms-sw/dqm-l2 This issue persists, and it's delaying the review of a few PRs. Should we expect it to be fixed anytime soon?
I couldn't look into it any further over the weekend, but we are working on it.
So @ahmad3213 found this error message in the logs of the bin-by-bin GUI:
visDQMIndex: ./DQM/StringAtom.h:124: size_t StringAtomTree::insert(const string&): Assertion `tree_.size() < tree_.capacity()' failed.
visDQMIndex (pid=28588 ppid=5099) received fatal signal 6 (Aborted)
It first appeared around July 27th. We keep investigating.
It looks like we might be hitting the size limit of the index trees: StringAtomTree(size_t capacity = 1024*1024)
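For readers unfamiliar with this failure mode, here is a minimal sketch of a fixed-capacity string table whose insert checks the same condition as the assertion in the log. It is not the DQM GUI `StringAtomTree`: the class name `FixedCapacityAtomTable` and its layout are hypothetical, and it returns an empty optional where the real code aborts.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for a fixed-capacity string-to-index table.
// NOT the DQM GUI StringAtomTree; only the capacity check is modelled.
class FixedCapacityAtomTable {
public:
  // The default mirrors the 1024*1024 default quoted in the thread.
  explicit FixedCapacityAtomTable(std::size_t capacity = 1024 * 1024)
      : capacity_(capacity) {
    assert(capacity_ > 0);
  }

  // Returns the index of `s`, inserting it if it is not present yet.
  // When the table is full, returns std::nullopt instead of aborting;
  // this is the condition `size() < capacity()` from the assertion.
  std::optional<std::size_t> insert(const std::string& s) {
    auto it = index_.find(s);
    if (it != index_.end()) {
      return it->second;  // already known: no new slot is consumed
    }
    if (atoms_.size() >= capacity_) {
      return std::nullopt;  // no free slot left
    }
    atoms_.push_back(s);
    index_.emplace(s, atoms_.size() - 1);
    return atoms_.size() - 1;
  }

  std::size_t size() const { return atoms_.size(); }
  std::size_t capacity() const { return capacity_; }

private:
  std::size_t capacity_;
  std::vector<std::string> atoms_;
  std::unordered_map<std::string, std::size_t> index_;
};
```

With a capacity of 2, inserting "a" and "b" succeeds, re-inserting "a" returns its existing index, and inserting a third distinct string fails. Only new distinct strings consume slots, so a long-lived index can fill up gradually even if individual files are small.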
DQM GUI PR tag update fixing cms-sw/cmssw#38980
@emanueleusai This issue is fixed and can be closed, right?
Looking at recent PRs, it looks like the issue is still present (I don't see plots). @cms-sw/dqm-l2, what is the ETA to deploy a fix? (if not done already)
Hi, it's not fixed yet. It looks like the fix we prepared, which increases the tree size, is not working. One of our SW engineers is now investigating it.
One month without proper bin-by-bin comparisons feels like a long time. I don't know if this should be marked as urgent, but it sure needs to be fixed. @cms-sw/dqm-l2, is there an ETA?
So our SW engineer was able to reproduce the issue on a local installation of the GUI using a recent file (where it doesn't work) and an older file (where it works). Inspecting the logs, it looks like there might be a problem with the indexer when loading a new file. We asked the cmsweb admins to do a backup and reset of the index on the DQM GUI dev flavour on server vocms0731, but seemingly it didn't fix the issue. We don't know what exactly changed between the ROOT files before ~July and after, but it is unlikely that the size of the index trees has suddenly quadrupled (the original attempt at fixing the issue increased the limit four-fold). So it is not the trivial fix we originally thought. The SW engineer working on it is on holiday for the next few weeks, so we are moving the task to our other SW engineer.
On a local setup of DQM GUI version 9.7.7, when the DB is empty/new, there are no problems in the indexer:
and when the DB is old (copied from vocms0731), the same crash occurs:
It looks like the cmsweb admins had not reset the index DB yet, or had not done it properly. Maybe we should instead delete the whole DB and start a new one.
The index DB was recreated. Bin-by-bin comparisons in the GUI are starting to become available for some of the latest PRs, e.g.:
Confirmed working on recent PRs. NB: For PRs where the tests were triggered before the index was recreated, the GUI will still appear empty, so I recommend re-triggering the tests if you intend to use the DQM bin-by-bin tool.
+dqm
unassign core
This issue is fully signed and ready to be closed. |
@cmsbuild, please close |
You pushed into production the wrong fix that, in fact, is not a fix at all, as I commented directly on the PR.
Hi, could you point me to this discussion? Thanks!
Ciao Marco @rovere, thank you for your feedback, it is much appreciated. So concerning the Patricia trees I found only one more instance where the value is set manually and indeed that wasn't changed: https://github.com/cms-DQM/dqmgui_prod/search?q=StringAtomTree%28 Concerning regenerating the index: we observed the issue only in dev, where the history of the index is not needed. Since we never saw the issue on offline and relval, we did not check or investigate what is going on there. Do you have any particular suspicion that we might be about to hit the limit in those instances?
Ciao @emanueleusai I'm not particularly concerned about the default value since it is practically never used. I haven't been playing around with DQMGUI for ages and I have no idea what the current occupancy of the many Patricia trees is for Offline and RelVal servers. Maybe having a look at that would be something useful, at least to understand where you are in the process of filling all the allocated slots.
Thank you, Marco. We will follow up with your suggestion.
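As a rough illustration of the occupancy check suggested above, the sketch below computes how full a fixed-capacity tree is and flags it when it nears the limit. All names (`Occupancy`, `fill_fraction`, `near_limit`) and the 0.9 threshold are hypothetical; how the used-slot count would be read out of a real DQM GUI index is not covered here.

```cpp
#include <cstddef>

// Hypothetical record of how many slots of a fixed-capacity tree are in use.
// None of these names come from the DQM GUI code base.
struct Occupancy {
  std::size_t used;
  std::size_t capacity;
};

// Fraction of allocated slots already filled; a zero-capacity tree counts as full.
double fill_fraction(const Occupancy& o) {
  if (o.capacity == 0) {
    return 1.0;
  }
  return static_cast<double>(o.used) / static_cast<double>(o.capacity);
}

// True when the tree is at or above the given fill threshold (assumed default: 90%).
bool near_limit(const Occupancy& o, double threshold = 0.9) {
  return fill_fraction(o) >= threshold;
}
```

Running such a check periodically on the Offline and RelVal instances would show how far each tree is from its limit before an insert fails.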
Since earlier today, I cannot see plots in the DQM webpage (https://cmsweb.cern.ch/dqm/dev) used to display bin-by-bin comparisons for CMSSW PR tests, at least for PRs tested recently. Example:
I see the same for several recent PRs (I checked only a few). I did find one where the plots show up
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-49175f/26445/summary.html
https://tinyurl.com/24nx32n7
Initially discussed in #38971 (comment).