"ExcessiveTime: SiPixelClusterProducer:siPixelClustersPreSplitting" causes fake reco comparison failures #29398

Closed
silviodonato opened this issue Apr 6, 2020 · 20 comments

Comments

@silviodonato
Contributor

As noted by @perrotta in #29375 (comment), we are getting many fake reco comparison failures because of the following message

%MSG-e ExcessiveTime:  SiPixelClusterProducer:siPixelClustersPreSplitting  05-Apr-2020 15:28:43 CEST Run: 1 Event: 1
ExcessiveTime: Module used 841.448 seconds of time which exceeds the error threshold configured in the Timing Service of 600 seconds.
%MSG
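
(For context, the 600 s error threshold mentioned in the message is a job-configuration setting. A minimal sketch of where it would come from, assuming, to the best of my recollection, that the Timing service exposes an excessiveTimeThreshold parameter in seconds:)

import FWCore.ParameterSet.Config as cms

process = cms.Process("RECO")

# Hedged sketch: excessiveTimeThreshold is, as far as I recall, the per-module
# time limit (in seconds) behind the ExcessiveTime message; 600 s matches the
# threshold quoted above.
process.Timing = cms.Service("Timing",
    summaryOnly = cms.untracked.bool(False),
    excessiveTimeThreshold = cms.untracked.double(600.0),
)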

For instance, we see this error in step3 of wf 9.0.

The first time we saw this error was likely in CMSSW_11_1_X_2020-03-30-2300.
Looking at step3 of wf 11634.0, the error shows up in only some of the IBs:

  • CMSSW_11_1_X_2020-03-29-0000 no
  • CMSSW_11_1_X_2020-03-29-2300 no
  • CMSSW_11_1_X_2020-03-30-1100 no
  • CMSSW_11_1_X_2020-03-30-2300 yes
  • CMSSW_11_1_X_2020-03-31-2300 no
  • CMSSW_11_1_X_2020-04-01-2300 yes
  • CMSSW_11_1_X_2020-04-02-2300 no
  • CMSSW_11_1_X_2020-04-03-2300 yes
  • CMSSW_11_1_X_2020-04-05-0000 no
  • CMSSW_11_1_X_2020-04-05-2300 yes

e.g. https://cmssdt.cern.ch/SDT/jenkins-artifacts/ib-baseline-tests/CMSSW_11_1_X_2020-03-30-2300/slc7_amd64_gcc820/-GenuineIntel/matrix-results/11634.0_TTbar_14TeV+TTbar_14TeV_TuneCP5_2021_GenSimFull+DigiFull_2021+RecoFull_2021+HARVESTFull_2021+ALCAFull_2021/step3_TTbar_14TeV+TTbar_14TeV_TuneCP5_2021_GenSimFull+DigiFull_2021+RecoFull_2021+HARVESTFull_2021+ALCAFull_2021.log

@cmsbuild
Contributor

cmsbuild commented Apr 6, 2020

A new Issue was created by @silviodonato Silvio Donato.

@Dr15Jones, @silviodonato, @dpiparo, @smuzaffar, @makortel can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@silviodonato
Contributor Author

assign core

@cmsbuild
Contributor

cmsbuild commented Apr 6, 2020

New categories assigned: core

@Dr15Jones,@smuzaffar,@makortel you have been requested to review this Pull request/Issue and eventually sign? Thanks

@silviodonato
Contributor Author

@christopheralanwest do you think this is related to #29308?

@christopheralanwest
Contributor

@silviodonato I don't know. Adding @mmusich to the thread.

@makortel
Contributor

makortel commented Apr 6, 2020

I just ran wf 9.0 on a cmsdev machine and got the following (from the TimeReport summary):

TimeReport   7.475239     7.475239     7.475239  csc2DRecHits
...
TimeReport  13.801270    13.801270    13.801270  gedGsfElectrons
...
TimeReport   0.027153     0.027153     0.027153  initialStepTrackCandidates
...
TimeReport   0.031862     0.031862     0.031862  initialStepTrackCandidatesPreSplitting
...
TimeReport  79.645565    79.645565    79.645565  siPixelClustersPreSplitting

Something is clearly wrong with siPixelClustersPreSplitting (I added the initialStep and initialStepPreSplitting track pattern recognition module timings for reference).

The timings of csc2DRecHits and gedGsfElectrons look suspiciously large as well.
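
(For reference, the per-module numbers above come from the end-of-job framework summary; a minimal sketch of enabling it, assuming the standard wantSummary option is what switches that summary on:)

import FWCore.ParameterSet.Config as cms

process = cms.Process("RECO")

# Print the end-of-job framework summary, which includes the TimeReport
# table with per-module timings like the ones quoted above.
process.options = cms.untracked.PSet(
    wantSummary = cms.untracked.bool(True)
)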

@makortel
Contributor

makortel commented Apr 6, 2020

assign reconstruction

The effect seems to be real and not just an infrastructure glitch.

@cmsbuild
Contributor

cmsbuild commented Apr 6, 2020

New categories assigned: reconstruction

@slava77,@perrotta you have been requested to review this Pull request/Issue and eventually sign? Thanks

@makortel
Contributor

makortel commented Apr 6, 2020

In my test the first event took ~18 minutes to process, the remaining 9 took 20 s (2-5 s/event).

@slava77
Contributor

slava77 commented Apr 7, 2020

In my test the first event took ~18 minutes to process, the remaining 9 took 20 s (2-5 s/event).

Is this reproducible in a repeated run?
I wonder if there are issues with the cvmfs/fuse mounts being slow to respond on the first load.

@makortel
Contributor

makortel commented Apr 7, 2020

I ran IgProf, but it doesn't indicate anything particularly interesting:
http://mkortela.web.cern.ch/mkortela/cgi-bin/navigator/debug/siPixelClustersPreSplitting_06_perf
This time the first event took ~5 minutes.

Two more retries gave 1.5 and 26 minutes for the first event.

Let's see if strace would find anything interesting.
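
(For reproducibility, roughly the kind of strace invocation I have in mind, wrapped in a small Python driver just to keep the examples in one language; step3.py is a placeholder for the actual configuration:)

import subprocess

# -f follows child threads/processes, -tt adds wall-clock timestamps,
# -o writes the trace to a file instead of stderr.
subprocess.run(
    ["strace", "-f", "-tt", "-o", "strace.log", "cmsRun", "step3.py"],
    check=True,
)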

@mmusich
Contributor

mmusich commented Apr 7, 2020

@silviodonato I don't know. Adding @mmusich to the thread.

@christopheralanwest @silviodonato pardon my naivety, but I really can't see how an update of the Run3 Global Tags as done in #29308 could possibly affect a Run-1 MC workflow such as 9.0 [*].
Please clarify.

[*]

9.0    Higgs200ChargedTaus+DIGI+RECO+HARVEST [1]: cmsDriver.py H200ChargedTaus_Tauola_8TeV_cfi  --conditions auto:run1_mc -n 10 --eventcontent RAWSIM --relval 9000,100 -s GEN,SIM --datatier GEN-SIM --beamspot Realistic8TeVCollision 
                                           [2]: cmsDriver.py step2  --datatier GEN-SIM-DIGI-RAW-HLTDEBUG --conditions auto:run1_mc -s DIGI:pdigi_valid,L1,DIGI2RAW,HLT:@fake --eventcontent FEVTDEBUGHLT -n 100 
                                           [3]: cmsDriver.py step3  --datatier GEN-SIM-RECO,MINIAODSIM,DQMIO --runUnscheduled  --conditions auto:run1_mc -s RAW2DIGI,L1Reco,RECO,RECOSIM,EI,PAT,VALIDATION:@standardValidationNoHLT+@miniAODValidation,DQM:@standardDQMFakeHLT+@miniAODDQM --eventcontent RECOSIM,MINIAODSIM,DQM -n 100 
                                           [4]: cmsDriver.py step4  --scenario pp --filetype DQM --conditions auto:run1_mc --mc  -s HARVESTING:validationHarvestingNoHLT+dqmHarvestingFakeHLT -n 100

@makortel
Contributor

makortel commented Apr 7, 2020

Strace hints towards conditions / Frontier:

13453 04:04:53 connect(12, {sa_family=AF_INET, sin_port=htons(3128), sin_addr=inet_addr("...")}, 16) = -1 EINPROGRESS (Operation now in progress
13453 04:04:53 poll([{fd=12, events=POLLOUT}], 1, 5000) = 1 ([{fd=12, revents=POLLOUT}])
13453 04:04:53 getsockopt(12, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
13453 04:04:53 poll([{fd=12, events=POLLOUT}], 1, 5000) = 1 ([{fd=12, revents=POLLOUT}])
13453 04:04:53 sendto(12, "GET http://cmsfrontier.cern.ch:8"..., 504, 0, NULL, 0) = 504
13453 04:04:53 poll([{fd=12, events=POLLIN}], 1, 10000) = 1 ([{fd=12, revents=POLLIN}])
13453 04:04:53 recvfrom(12, "HTTP/1.1 200 OK\r\nServer: Apache-"..., 32768, 0, NULL, NULL) = 4259
13453 04:04:53 poll([{fd=12, events=POLLIN}], 1, 10000) = 1 ([{fd=12, revents=POLLIN}])
13453 04:04:53 recvfrom(12, "bKTfmI5fiRK/Lb+LqO8SLvT6cVR7hiJ1"..., 32768, 0, NULL, NULL) = 2896
...
13453 04:20:57 poll([{fd=12, events=POLLIN}], 1, 10000) = 1 ([{fd=12, revents=POLLIN}])
13453 04:20:57 recvfrom(12, "GciDP/vBzWo3GciFP/vhLWs4dxnIiT1b"..., 32768, 0, NULL, NULL) = 5792
13453 04:20:57 poll([{fd=12, events=POLLIN}], 1, 10000) = 1 ([{fd=12, revents=POLLIN}])
13453 04:20:57 recvfrom(12, "r255TvRiA1vJ7pnoxAK3CmqkRcpoRElU"..., 32768, 0, NULL, NULL) = 5776
13453 04:20:57 madvise(0x7f2fc4789000, 8425472, MADV_DONTNEED) = 0

i.e. the job spent 16 minutes in a long chain of poll()+recvfrom() calls reading from cmsfrontier.cern.ch (or maybe really from the proxy). I wonder whether this is because the reads are slow, because the job reads a lot of data, or because of something else.
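
(One rough way to distinguish "the reads are slow" from "the job reads a lot" would be to time a single large fetch through the same proxy outside of cmsRun. A minimal sketch; the Frontier URL and the proxy host below are placeholders, not values taken from this job:)

import time
import urllib.request

# Placeholders: substitute a real Frontier servlet URL (including the query)
# and the site squid/proxy that appears in the strace output (port 3128).
FRONTIER_URL = "http://cmsfrontier.cern.ch:8000/FrontierProd/Frontier"
PROXY = "http://my-site-squid.example.org:3128"

opener = urllib.request.build_opener(urllib.request.ProxyHandler({"http": PROXY}))

start = time.monotonic()
with opener.open(FRONTIER_URL, timeout=600) as response:
    payload = response.read()
elapsed = time.monotonic() - start

print(f"fetched {len(payload) / 1e6:.1f} MB in {elapsed:.1f} s "
      f"({len(payload) / 1e6 / elapsed:.2f} MB/s)")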

@DrDaveD, would you have any ideas or suggestions what to check next?

@DrDaveD

DrDaveD commented Apr 7, 2020

Since you were able to run strace I assume it is quite reproducible. Please run it with the environment variables FRONTIER_LOG_LEVEL=debug and FRONTIER_LOG_FILE=frontier_client.log and point me to the log file.
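
(For convenience, a minimal sketch of running the job with those variables set; step3.py is a placeholder for whatever configuration reproduces the slow first event:)

import os
import subprocess

# Enable Frontier client debug logging for this job only.
env = dict(os.environ)
env["FRONTIER_LOG_LEVEL"] = "debug"
env["FRONTIER_LOG_FILE"] = "frontier_client.log"

# "step3.py" is a placeholder for the actual cmsRun configuration.
subprocess.run(["cmsRun", "step3.py"], env=env, check=True)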

@makortel
Contributor

makortel commented Apr 7, 2020

Since you were able to run strace I assume it is quite reproducible. Please run it with the environment variables FRONTIER_LOG_LEVEL=debug and FRONTIER_LOG_FILE=frontier_client.log and point me to the log file.

@DrDaveD I have the log here (8.4 MB)
https://mkortela.web.cern.ch/mkortela/debug/frontier_client.log
Thanks!

In this job the first event took 10 minutes to process.

@DrDaveD

DrDaveD commented Apr 7, 2020

The problem was with a temporary physical machine we had in service as cmsmeyproxy2. It was taking 10 minutes to transfer a 133 MB query instead of 2 seconds. I took it out of service, so performance should be much better now. Nothing showed up as wrong in the total-throughput monitoring, so thank you for bringing this to my attention; we don't routinely monitor the transfer rates of large queries.

@makortel
Contributor

makortel commented Apr 7, 2020

Thanks Dave! I take it that the problem should be gone now, and the issue can be closed.

@makortel
Contributor

makortel commented Apr 7, 2020

+1

@VinInn
Contributor

VinInn commented Apr 8, 2020

Sorry, I think I am late and this was already understood.
Indeed, offline siPixelClustersPreSplitting loads by far the largest payload: the pixel gains (one per pixel channel). Typically the first event is much slower than any other, so the TimeReport should show a huge difference between the average and the maximum.

@silviodonato
Contributor Author

Thanks to everybody, the issue is solved; see for instance #29422.
Closing this issue.
