Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Crash in bParking processing #26154

Closed
fabiocos opened this issue Mar 12, 2019 · 75 comments
Closed

Crash in bParking processing #26154

fabiocos opened this issue Mar 12, 2019 · 75 comments

Comments

@fabiocos
Copy link
Contributor

A problem with bParking code has been reported by PdmV. This issue has been initially observed during CRAB-based private tests, and were reproducible when running in CRAB but not when running locally. The problem is:

----- Begin Fatal Exception 11-Mar-2019 20:27:43 UTC-----------------------
An exception of category 'FatalRootError' occurred while
[0] Constructing the EventProcessor
[1] Constructing module: class=LowPtGsfElectronIDProducer label='lowPtGsfElectronID'
Additional Info:
[a] Fatal Root Error: @SUB=TXMLEngine::ParseFile
Unexpected end of xml file
----- End Fatal Exception -------------------------------------------------

initially reported in https://hypernews.cern.ch/HyperNews/CMS/get/computing-tools/4440.html and https://indico.cern.ch/event/790963/contributions/3306140 .
It looks present in 10_2_X, not in cycles after 10_3_X, starting with merges on top of 10_2_9.

@cmsbuild
Copy link
Contributor

A new Issue was created by @fabiocos Fabio Cossutti.

@davidlange6, @Dr15Jones, @smuzaffar, @fabiocos, @kpedro88 can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@fabiocos
Copy link
Contributor Author

@prebello could you please point us to a failing workflow/initial settings?

@fabiocos
Copy link
Contributor Author

@mverzett @bainbrid FYI

@prebello
Copy link
Contributor

prebello commented Mar 12, 2019

@prebello could you please point us to a failing workflow/initial settings?

sure for instance
full set of injected relvals (all failed) are at
https://dmytro.web.cern.ch/dmytro/cmsprodmon/workflows.php?campaign=CMSSW_10_2_12__fullsim_noPU_2018_bParking-1552246543

details for each step can be found at
https://cmsweb.cern.ch/reqmgr2/fetch?rid=prebello_RVCMSSW_10_2_12SingleElectronPt1000_190310_203726_6599
clicking on the respective step and then "Raw Data" tab

does it help? any further detail?

@davidlange6
Copy link
Contributor

previously reported errors were due to special characters in the file (it seems)

The file read here is
RunII_Autumn18_LowPtElectrons_mva_id.xml.gz

is that correct?

@mverzett
Copy link
Contributor

@davidlange6 yes, the file is the right one.

previously reported errors were due to special characters in the file (it seems)

  • Out of curiosity, how can you tell?
  • Why it works locally but not remotely?
  • How do we clean the xml from these characters?

@davidlange6
Copy link
Contributor

davidlange6 commented Mar 12, 2019 via email

@bainbrid
Copy link
Contributor

bainbrid commented Mar 12, 2019 via email

@davidlange6
Copy link
Contributor

davidlange6 commented Mar 12, 2019 via email

@guitargeek
Copy link
Contributor

This seems to be the reoccurrence of #24395, which got fixed by #24432 in 10_3_X. Since you are using 10_2_X here, the weight file is just too large I think.

Indeed, when I faced this problem half a year ago it also magically worked locally but not on CRAB...

@mverzett
Copy link
Contributor

@guitargeek thanks for the explanation! Indeed it makes sense as the weight files are quite large.
@perrotta @slava77 @fabiocos any chance we can backport #24432 or violates some rules? Any other ideas for a workaround?

@guitargeek
Copy link
Contributor

guitargeek commented Mar 13, 2019

That would of course be quite a big backport, other ideas would be really convenient! In case we want to keep the backport option open however, there is also a cmsdist backport which would have to be made for tinyxml2, analogue to cms-sw/cmsdist#4312. To make this list of references related to #24432 complete, I should also mention the external PR that it depended on, translating some old txt weight files to the XML format: cms-sw/cmsdist#4332

@perrotta
Copy link
Contributor

Given the review comments posted for #24432 (see #24432 (comment)) I see no counterindication in principle in backporting that PR. It must be noted however, as also pointed out by @guitargeek right above. that it would be quite a big backport, it would act on parts of the code which have been touched and modified independently in the meanwhile, and therefore both the backport into a production release and its review should be done and validated carefully. As such, I may expect some non negligible further delay in the start of the bParking reco production.

Moreover (just to keep here some record of it) one has to backport and integrate also the later fix #25085.

If I read correctly the issue #24395, there was a suggestion there of acting on the xmlenginebuffersize limit (see #24395 (comment)): was that possibility ever explored, as a possible faster alternative?

@guitargeek
Copy link
Contributor

I personally never explored that option. But I think it must be somehow more complicated than just changing the limit, because there is also this other strange effect of it working locally but not on CRAB.

What I can do is to explore how easy a backport of #24432 + #25085 would be (i.e. how many conflicts would arise).

@bainbrid
Copy link
Contributor

bainbrid commented Mar 13, 2019 via email

@guitargeek
Copy link
Contributor

This is the backport for cmsdist, cmssw will follow.

@mverzett
Copy link
Contributor

@bainbrid at the same time we could test the increase of xmlenginebuffersize with the same assumption that if it works on crab works for pdmV as well.
@guitargeek is it an environmental variable? a global variable?

@bainbrid
Copy link
Contributor

bainbrid commented Mar 13, 2019 via email

@guitargeek
Copy link
Contributor

Yes, that's good we try things in parallel. I have no idea how to change the xmlenginebuffersize though.

@mverzett
Copy link
Contributor

Apparently the buffer size was hardcoded in the TMVA code until this august. I guess that the backport is the only solution

@mverzett
Copy link
Contributor

Still, is a mystery how locally it works though

@guitargeek
Copy link
Contributor

guitargeek commented Mar 13, 2019

I'm theoretically done with the backport, but I have real trouble compiling and testing today! Both on lxplus and the machines of by lab I always get /bin/sh: line 1: 294961 Segmentation fault (core dumped) edmWriteConfigs -p blabla.so when compiling. Is that a known error?

Edit: Never mind, I realized my mistake. SL6 vs. SL7 problem, because 10_2_X defaulted to SL6.

@mverzett
Copy link
Contributor

Probably naive question: Could the problem lie here? Given we have two BDT models being instantiated sequentially in the same module and the comment on security of the line above the quoted ones?

@guitargeek
Copy link
Contributor

guitargeek commented Mar 13, 2019

What do you mean by "have two BDT models being instantiated sequentially in the same module"? There are modules in production for quite some time now which instantiate dozens of GBRForests, read from gzipped files right now. What does the low pt reconstruction do different with respect to what already exists? But honestly I don't quite understand the comment behind your link anyway :)

@Dr15Jones
Copy link
Contributor

There is a potential problem here
https://github.com/cms-sw/cmssw/blob/CMSSW_10_2_X/CommonTools/Utils/src/TMVAZipReader.cc#L81

mkstemp can fail, in which case it returns -1. That is never checked. If we had a failure on mulitple jobs running on a site, they would all try to open /tmp/tmva.XXXXXX and write into it simultaneously. Then when TMVA tries to read it, it could be changing while doing the read.

@bbockelm could that be causing the problems we see?

@bbockelm
Copy link
Contributor

Nope, that’s not the cause.

  1. All production jobs run in separate containers with unique /tmp.
  2. mkstemp only returns successfully if it has created a new unique file.
  3. The XXXXXX is replaced with randomly generated characters. That’s why the file name is declared char* and not const char*.

@bainbrid
Copy link
Contributor

bainbrid commented Mar 14, 2019

ok, @prebello , all, just to confirm, I tested line by line this #26154 (comment) and it works fine for me.
Try on more than one occasion: scram b -j8
It takes a while but it should build.

@guitargeek , to note, FYI:

  • I had to move an existing RecoParticleFlow/PFProducer/data directory (just containing a download.url file) to make way for the git clone renamed as data
  • The GBRForestTools.cc already has "tinyxml2.h" in the list of headers, I didn't need to change anything ...
  • I also see the warning ****WARNING: Invalid tool tinyxml2. Please fix src/CommonTools/MVAUtils/BuildFile.xml file. message but it doesn't seem to affect anything ...?

@guitargeek
Copy link
Contributor

guitargeek commented Mar 14, 2019

Thank you Rob, I edited my PR description accordingly. Seems I accidentally pushed the change with the <> vs. "", but nevermind. Yes, I had the warning too and it can be ignored.

@mverzett
Copy link
Contributor

I tested the recipe on crab, one file, 400 events. It works! @prebello can you confirm the same?

@bainbrid
Copy link
Contributor

bainbrid commented Mar 14, 2019 via email

@prebello
Copy link
Contributor

@bainbrid sorry. busy with lectures marathon since yesterday. I will take care of it today only after 5pm CERN.

@prebello
Copy link
Contributor

prebello commented Mar 16, 2019

@bainbrid
again the error in 10 events local test (lxplus scl6)
%MSG-e TkDetLayers: TSGFromL2Muon:hltL3NoFiltersNoVtxTrajSeedOIHit@streamBeginRun 16-Mar-2019 03:01:26 CET Run: 1 Stream: 7
ForwardDiskSectorBuilderFromDet: Trying to build Petal Wedge from Dets at different z positions !! Delta_z = -0.950417
%MSG

I have injected two relvals in the system to test.
https://dmytro.web.cern.ch/dmytro/cmsprodmon/requests.php?campaign=CMSSW_10_2_12_patch1__fullsim_noPU_2018_bParking_newtest-1552702058

@slava77
Copy link
Contributor

slava77 commented Mar 16, 2019 via email

@boudoul
Copy link
Contributor

boudoul commented Mar 16, 2019

I confirm that you can safely ignore this message

@bainbrid
Copy link
Contributor

bainbrid commented Mar 16, 2019 via email

@prebello
Copy link
Contributor

Colleagues, unfortunately
---- Begin Fatal Exception 16-Mar-2019 17:15:57 UTC-----------------------
An exception of category 'FatalRootError' occurred while
[0] Constructing the EventProcessor
[1] Constructing module: class=LowPtGsfElectronIDProducer label='lowPtGsfElectronID'
Additional Info:
[a] Fatal Root Error: @sub=TXMLEngine::ParseFile
Unexpected end of xml file

@davidlange6
Copy link
Contributor

davidlange6 commented Mar 17, 2019 via email

@bainbrid
Copy link
Contributor

@prebello did you manage to successfully build the patch detailed here #26154 (comment) ? And you still see the Exception with this recipe above?

Is anyone able to submit jobs using WMagent? Are there instructions we can follow?

We’ve had the recipe available for 5-6 days, so I’m keen to draw a firm conclusion quickly.

@mverzett
Copy link
Contributor

@prebello I am surprised that the error still appears, as the with the PR the XML parsing for the low pt electrons in overhauled and TXMLEngine is replaced. I am wondering as @bainbrid if you included the recipe he suggested.

@guitargeek
Copy link
Contributor

Hi @bainbrid, Slava gave me some feedback on the PR during the weekend, and I took the opportunity to also fix the line with <tinyxml2.h>. So now you really have to add to the recipe that it should be replaced in GBRForestToolsTinyXML.cc with "tinyxml2.h". That does not explain the error @prebello got however. If the PR was merged, there is no chance the TXMLEngine can throw because it's not used.

@davidlange6
Copy link
Contributor

davidlange6 commented Mar 18, 2019 via email

@bainbrid
Copy link
Contributor

Ah ok, I'm not aware of the technical details / limitations involved in such a test. Perhaps this explains why @prebello still sees the error.

@fabiocos, all: given Mauro's test with CRAB appears to mitigate the issue, which is very encouraging, is it at all possible to proceed with reviewing/merging #26176, create a new CMSSW_10_2 release as soon as possible, and try again?

We are on an very tight timeline for the bParking processing and I am worried we will miss our slot.

@prebello
Copy link
Contributor

its unlikely that any wmagent test is possible with a custom recipe (eg, outside of a release) - thats why I asked how this test was done..

On Mar 18, 2019, at 4:32 AM, Mauro Verzetti @.***> wrote: @prebello I am surprised that the error still appears, as the with the PR the XML parsing for the low pt electrons in overhauled and TXMLEngine is replaced. I am wondering as @bainbrid if you included the recipe he suggested. — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or mute the thread.

exaclty.. I did the test in patch1 release only changing the era accordingly. Indeed if the PR with dedicated changes was not merged, it will not work. I have understood that patch1 was the fixed release.

@prebello
Copy link
Contributor

Ah ok, I'm not aware of the technical details / limitations involved in such a test. Perhaps this explains why @prebello still sees the error.

@fabiocos, all: given Mauro's test with CRAB appears to mitigate the issue, which is very encouraging, is it at all possible to proceed with reviewing/merging #26176, create a new CMSSW_10_2 release as soon as possible, and try again?

We are on an very tight timeline for the bParking processing and I am worried we will miss our slot.

@bainbrid one needs the fixes in a dedicated release to make the test properly.

@prebello
Copy link
Contributor

Hi @bainbrid, Slava gave me some feedback on the PR during the weekend, and I took the opportunity to also fix the line with <tinyxml2.h>. So now you really have to add to the recipe that it should be replaced in GBRForestToolsTinyXML.cc with "tinyxml2.h". That does not explain the error @prebello got however. If the PR was merged, there is no chance the TXMLEngine can throw because it's not used.

@guitargeek I have done it (changed from <tinyxml2.h> to "tinyxml2.h") to fix the compilation error. Indeed it is not the reason of the issue.

@bainbrid
Copy link
Contributor

@prebello @slava77

Sorry - I'm still confused:

  • Are you saying that there exists a CMSSW_10_2_12_patch1 release that contains Jonas's fix?
  • And you used this patch release to test with WMagent?

If yes to both, then the question still remains why you see the exception from the TXMLEngine constuctor when the patch no longer uses this code ...

@bainbrid
Copy link
Contributor

bainbrid commented Mar 18, 2019 via email

@prebello
Copy link
Contributor

@prebello @slava77

Sorry - I'm still confused:

* Are you saying that there exists a CMSSW_10_2_12_patch1 release that contains Jonas's fix?

* And you used this patch release to test with WMagent?

If yes to both, then the question still remains why you see the exception from the TXMLEngine constuctor when the patch no longer uses this code ...

I am not saying that CMSSW_10_2_12_patch1 contains the fix. I have said that "I have guessed that CMSSW_10_2_12_patch1 had the fix" . Yes I have used it.

@prebello
Copy link
Contributor

@bainbrid do you think we will have problems in rereco 10-2-12? maybe, as relvals are failing too?
(both are injected using wmagent framework)

the samples are still in assignment-approved, although we have requested the block transfer, and I would like to confirm with you to avoid contacting computing for potential problematic runs.
https://dmytro.web.cern.ch/dmytro/cmsprodmon/workflows.php?campaign=Commissioning2018

FYI @davidlange6

@fabiocos
Copy link
Contributor Author

CMSSW_10_2_12_patch1 has nothing to do with this issue, see the release notes https://cmssdt.cern.ch/SDT/ReleaseNotes/CMSSW_10/CMSSW_10_2_12_patch1.html
I built it to allow GEN-SIM to move forward without interfering with the fix of the B parking issue.

Yesterday I have privately tested the recipe locally, and it works (i.e. the workflow 136.898 arrives at the end without apparent problems). Having a new release with the various additions for B parking inside looks the best way to move forward, provided the backport is approved

@bainbrid
Copy link
Contributor

bainbrid commented Mar 18, 2019 via email

@fabiocos
Copy link
Contributor Author

@prebello thank you a lot
@bainbrid @slava77 @perrotta I consider this issue as solved

@bainbrid
Copy link
Contributor

bainbrid commented Mar 21, 2019 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests