"ExcessiveTime: SiPixelClusterProducer:siPixelClustersPreSplitting" causes fake reco comparison failures #29398
Comments
A new Issue was created by @silviodonato Silvio Donato. @Dr15Jones, @silviodonato, @dpiparo, @smuzaffar, @makortel can you please review it and eventually sign/assign? Thanks. cms-bot commands are listed here |
assign core |
New categories assigned: core @Dr15Jones,@smuzaffar,@makortel you have been requested to review this Pull request/Issue and eventually sign? Thanks |
@christopheralanwest do you think this is related to #29308? |
@silviodonato I don't know. Adding @mmusich to the thread. |
I just ran workflow 9.0 on a cmsdev machine, and got (from the TimeReport summary)
Something is clearly wrong with The timings of |
assign reconstruction
The effect seems to be real and not just an infrastructure glitch. |
In my test the first event took ~18 minutes to process, the remaining 9 took 20 s (2-5 s/event). |
Is this reproducible on a repeated run? |
I ran IgProf, but it doesn't indicate anything particularly interesting. Two more retries gave 1.5 and 26 minutes for the first event. Let's see if strace finds anything interesting. |
@christopheralanwest @silviodonato pardon my naivety, but I really can't see how an update of the Run3 Global Tags as done in #29308 could possibly affect a Run-1 MC workflow such as 9.0 [*]. [*]
|
Strace hints towards conditions / frontier
i.e. the job spent 16 minutes in a long chain of @DrDaveD, would you have any ideas or suggestions what to check next? |
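For readers who want to repeat this kind of check, here is a minimal sketch of how time per system call could be totalled from a trace. It assumes the trace was recorded with `strace -T`, which appends the time spent in each call as `<seconds>`; the thread does not say which strace options were used. The function name `time_per_syscall` and the sample lines are hypothetical and are not taken from the trace discussed here.

```python
import re
from collections import defaultdict

# Matches `strace -T` lines such as:
#   poll([{fd=5, events=POLLIN}], 1, 5000) = 0 (Timeout) <5.005012>
# An optional leading PID (as printed by `strace -f`) is tolerated.
_LINE = re.compile(r"^(?:\[pid\s+\d+\]\s+|\d+\s+)?(\w+)\(.*<(\d+\.\d+)>\s*$")


def time_per_syscall(lines):
    """Sum the wall time reported by `strace -T` for each syscall name."""
    totals = defaultdict(float)
    for line in lines:
        match = _LINE.match(line)
        if match:
            totals[match.group(1)] += float(match.group(2))
    return dict(totals)


# Illustrative input only (hypothetical lines, not from the actual job).
sample = [
    'poll([{fd=5, events=POLLIN}], 1, 5000) = 0 (Timeout) <5.005012>',
    '12345 poll([{fd=5, events=POLLIN}], 1, 5000) = 1 ([{fd=5, revents=POLLIN}]) <0.250000>',
    'recvfrom(5, "..."..., 65536, 0, NULL, NULL) = 1448 <0.000020>',
]
totals = time_per_syscall(sample)
for name, seconds in sorted(totals.items(), key=lambda kv: -kv[1]):
    print(f"{name:10s} {seconds:10.6f} s")
```

A job that is stalled on a remote service would typically show most of its time in a few network-related calls, which is the kind of pattern that points away from the reconstruction code itself.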
Since you were able to run strace I assume it is quite reproducible. Please run it with the environment variables FRONTIER_LOG_LEVEL=debug and FRONTIER_LOG_FILE=frontier_client.log and point me to the log file. |
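A minimal sketch of how the two environment variables could be set for a single job from Python. Only `FRONTIER_LOG_LEVEL` and `FRONTIER_LOG_FILE` and their values come from the comment above; the helper names and the example command in the comment are hypothetical.

```python
import os
import subprocess


def frontier_debug_env(log_file="frontier_client.log", base_env=None):
    """Return a copy of the environment with Frontier client debug logging enabled."""
    env = dict(os.environ if base_env is None else base_env)
    env["FRONTIER_LOG_LEVEL"] = "debug"
    env["FRONTIER_LOG_FILE"] = log_file
    return env


def run_with_frontier_debug(command, log_file="frontier_client.log"):
    """Run `command` (a list of arguments) with the debug variables set."""
    return subprocess.run(command, env=frontier_debug_env(log_file), check=False)


# Hypothetical usage; the configuration file name is an assumption:
#   run_with_frontier_debug(["cmsRun", "step3_cfg.py"])
```

Setting the variables only in the child process keeps the debug log confined to the job under investigation rather than every Frontier client started from the same shell.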
@DrDaveD I have the log here (8.4 MB) In this job the first event took 10 minutes to process. |
The problem was with a temporary physical machine we had in service as cmsmeyproxy2. It was taking 10 minutes to transfer a 133MB query instead of 2 seconds. I took it out of service so performance should be much better. Nothing showed up wrong on total throughput so thank you for bringing this to my attention. We don't routinely monitor for transfer rates of large queries. |
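To put the two transfer times in perspective, the arithmetic can be sketched as follows. The 133 MB size and the 10 minute and 2 second durations are the figures quoted above; the function name is hypothetical.

```python
def throughput_mb_per_s(size_mb, seconds):
    """Average transfer rate in MB/s."""
    return size_mb / seconds


query_size_mb = 133.0
slow = throughput_mb_per_s(query_size_mb, 10 * 60)  # faulty proxy: about 0.22 MB/s
fast = throughput_mb_per_s(query_size_mb, 2)        # expected: 66.5 MB/s
slowdown = fast / slow                              # about a factor of 300

print(f"slow: {slow:.2f} MB/s, fast: {fast:.1f} MB/s, slowdown: {slowdown:.0f}x")
```

A slowdown of this size on a single large query can stay invisible in total throughput figures, which is consistent with the monitoring gap described above.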
Thanks Dave! I take it that the problem should be gone now, and the issue can be closed. |
+1 |
Sorry, I think I am late and this was already understood. |
Thanks to everybody, the issue is solved. |
As noted by @perrotta in #29375 (comment), we are getting many fake reco comparison failures because of the following message
For instance, looking at step3 of wf 9.0, we get this error using:
while the error disappeared in
The error likely first appeared in CMSSW_11_1_X_2020-03-30-2300.
Looking at step3 of wf 11634.0, this error appeared only in several IBs:
e.g. https://cmssdt.cern.ch/SDT/jenkins-artifacts/ib-baseline-tests/CMSSW_11_1_X_2020-03-30-2300/slc7_amd64_gcc820/-GenuineIntel/matrix-results/11634.0_TTbar_14TeV+TTbar_14TeV_TuneCP5_2021_GenSimFull+DigiFull_2021+RecoFull_2021+HARVESTFull_2021+ALCAFull_2021/step3_TTbar_14TeV+TTbar_14TeV_TuneCP5_2021_GenSimFull+DigiFull_2021+RecoFull_2021+HARVESTFull_2021+ALCAFull_2021.log