"ExcessiveTime: SiPixelClusterProducer:siPixelClustersPreSplitting" causes fake reco comparison failures #29398

Closed
silviodonato opened this issue Apr 6, 2020 · 20 comments

Comments

@silviodonato
Contributor

As noted by @perrotta in #29375 (comment), we are getting many fake reco comparison failures because of the following message

%MSG-e ExcessiveTime:  SiPixelClusterProducer:siPixelClustersPreSplitting  05-Apr-2020 15:28:43 CEST Run: 1 Event: 1
ExcessiveTime: Module used 841.448 seconds of time which exceeds the error threshold configured in the Timing Service of 600 seconds.
%MSG
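
(For context, the 600 s error threshold mentioned in the message is a job-configuration setting. A minimal sketch of where it would come from, assuming, to the best of my recollection, that the Timing service exposes an excessiveTimeThreshold parameter in seconds:)

import FWCore.ParameterSet.Config as cms

process = cms.Process("RECO")

# Hedged sketch: excessiveTimeThreshold is, as far as I recall, the per-module
# time limit (in seconds) behind the ExcessiveTime message; 600 s matches the
# threshold quoted above.
process.Timing = cms.Service("Timing",
    summaryOnly = cms.untracked.bool(False),
    excessiveTimeThreshold = cms.untracked.double(600.0),
)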

For instance, we see this error in step3 of wf 9.0.

The first time we saw this error was likely in CMSSW_11_1_X_2020-03-30-2300.
Looking at step3 of wf 11634.0, the error shows up in only some of the IBs:

  • CMSSW_11_1_X_2020-03-29-0000 no
  • CMSSW_11_1_X_2020-03-29-2300 no
  • CMSSW_11_1_X_2020-03-30-1100 no
  • CMSSW_11_1_X_2020-03-30-2300 yes
  • CMSSW_11_1_X_2020-03-31-2300 no
  • CMSSW_11_1_X_2020-04-01-2300 yes
  • CMSSW_11_1_X_2020-04-02-2300 no
  • CMSSW_11_1_X_2020-04-03-2300 yes
  • CMSSW_11_1_X_2020-04-05-0000 no
  • CMSSW_11_1_X_2020-04-05-2300 yes

e.g. https://cmssdt.cern.ch/SDT/jenkins-artifacts/ib-baseline-tests/CMSSW_11_1_X_2020-03-30-2300/slc7_amd64_gcc820/-GenuineIntel/matrix-results/11634.0_TTbar_14TeV+TTbar_14TeV_TuneCP5_2021_GenSimFull+DigiFull_2021+RecoFull_2021+HARVESTFull_2021+ALCAFull_2021/step3_TTbar_14TeV+TTbar_14TeV_TuneCP5_2021_GenSimFull+DigiFull_2021+RecoFull_2021+HARVESTFull_2021+ALCAFull_2021.log

@cmsbuild
Contributor

cmsbuild commented Apr 6, 2020

A new Issue was created by @silviodonato Silvio Donato.

@Dr15Jones, @silviodonato, @dpiparo, @smuzaffar, @makortel can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@silviodonato
Contributor Author

assign core

@cmsbuild
Contributor

cmsbuild commented Apr 6, 2020

New categories assigned: core

@Dr15Jones,@smuzaffar,@makortel you have been requested to review this Pull request/Issue and eventually sign? Thanks

@silviodonato
Contributor Author

@christopheralanwest do you think this is related to #29308?

@christopheralanwest
Contributor

@silviodonato I don't know. Adding @mmusich to the thread.

@makortel
Contributor

makortel commented Apr 6, 2020

I just ran wf 9.0 on a cmsdev machine and got the following (from the TimeReport summary):

TimeReport   7.475239     7.475239     7.475239  csc2DRecHits
...
TimeReport  13.801270    13.801270    13.801270  gedGsfElectrons
...
TimeReport   0.027153     0.027153     0.027153  initialStepTrackCandidates
...
TimeReport   0.031862     0.031862     0.031862  initialStepTrackCandidatesPreSplitting
...
TimeReport  79.645565    79.645565    79.645565  siPixelClustersPreSplitting

Something is clearly wrong with siPixelClustersPreSplitting (I added the initialStep and initialStepPreSplitting track pattern recognition module timings for reference).

The timings of csc2DRecHits and gedGsfElectrons look suspiciously large as well.
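
(For reference, the per-module numbers above come from the end-of-job framework summary; a minimal sketch of enabling it, assuming the standard wantSummary option is what switches that summary on:)

import FWCore.ParameterSet.Config as cms

process = cms.Process("RECO")

# Print the end-of-job framework summary, which includes the TimeReport
# table with per-module timings like the ones quoted above.
process.options = cms.untracked.PSet(
    wantSummary = cms.untracked.bool(True)
)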

@makortel
Contributor

makortel commented Apr 6, 2020

assign reconstruction

The effect seems to be real and not just an infrastructure glitch.

@cmsbuild
Contributor

cmsbuild commented Apr 6, 2020

New categories assigned: reconstruction

@slava77,@perrotta you have been requested to review this Pull request/Issue and eventually sign? Thanks

@makortel
Contributor

makortel commented Apr 6, 2020

In my test the first event took ~18 minutes to process, the remaining 9 took 20 s (2-5 s/event).

@slava77
Contributor

slava77 commented Apr 7, 2020

In my test the first event took ~18 minutes to process, the remaining 9 took 20 s (2-5 s/event).

Is this reproducible in a repeated run?
I wonder if there are issues with the cvmfs/fuse mounts being slow to respond on the first load.

@makortel
Contributor

makortel commented Apr 7, 2020

I ran IgProf, but it doesn't indicate anything particularly interesting:
http://mkortela.web.cern.ch/mkortela/cgi-bin/navigator/debug/siPixelClustersPreSplitting_06_perf
This time the first event took ~5 minutes.

Two more retries gave 1.5 and 26 minutes for the first event.

Let's see if strace would find anything interesting.
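
(For reproducibility, roughly the kind of strace invocation I have in mind, wrapped in a small Python driver just to keep the examples in one language; step3.py is a placeholder for the actual configuration:)

import subprocess

# -f follows child threads/processes, -tt adds wall-clock timestamps,
# -o writes the trace to a file instead of stderr.
subprocess.run(
    ["strace", "-f", "-tt", "-o", "strace.log", "cmsRun", "step3.py"],
    check=True,
)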

@mmusich
Contributor

mmusich commented Apr 7, 2020

@silviodonato I don't know. Adding @mmusich to the thread.

@christopheralanwest @silviodonato pardon my naivety, but I really can't see how an update of the Run3 Global Tags as done in #29308 could possibly affect a Run-1 MC workflow such as 9.0 [*].
Please clarify.

[*]

9.0    Higgs200ChargedTaus+DIGI+RECO+HARVEST [1]: cmsDriver.py H200ChargedTaus_Tauola_8TeV_cfi  --conditions auto:run1_mc -n 10 --eventcontent RAWSIM --relval 9000,100 -s GEN,SIM --datatier GEN-SIM --beamspot Realistic8TeVCollision 
                                           [2]: cmsDriver.py step2  --datatier GEN-SIM-DIGI-RAW-HLTDEBUG --conditions auto:run1_mc -s DIGI:pdigi_valid,L1,DIGI2RAW,HLT:@fake --eventcontent FEVTDEBUGHLT -n 100 
                                           [3]: cmsDriver.py step3  --datatier GEN-SIM-RECO,MINIAODSIM,DQMIO --runUnscheduled  --conditions auto:run1_mc -s RAW2DIGI,L1Reco,RECO,RECOSIM,EI,PAT,VALIDATION:@standardValidationNoHLT+@miniAODValidation,DQM:@standardDQMFakeHLT+@miniAODDQM --eventcontent RECOSIM,MINIAODSIM,DQM -n 100 
                                           [4]: cmsDriver.py step4  --scenario pp --filetype DQM --conditions auto:run1_mc --mc  -s HARVESTING:validationHarvestingNoHLT+dqmHarvestingFakeHLT -n 100

@makortel
Contributor

makortel commented Apr 7, 2020

Strace hints towards conditions / Frontier:

13453 04:04:53 connect(12, {sa_family=AF_INET, sin_port=htons(3128), sin_addr=inet_addr("...")}, 16) = -1 EINPROGRESS (Operation now in progress
13453 04:04:53 poll([{fd=12, events=POLLOUT}], 1, 5000) = 1 ([{fd=12, revents=POLLOUT}])
13453 04:04:53 getsockopt(12, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
13453 04:04:53 poll([{fd=12, events=POLLOUT}], 1, 5000) = 1 ([{fd=12, revents=POLLOUT}])
13453 04:04:53 sendto(12, "GET http://cmsfrontier.cern.ch:8"..., 504, 0, NULL, 0) = 504
13453 04:04:53 poll([{fd=12, events=POLLIN}], 1, 10000) = 1 ([{fd=12, revents=POLLIN}])
13453 04:04:53 recvfrom(12, "HTTP/1.1 200 OK\r\nServer: Apache-"..., 32768, 0, NULL, NULL) = 4259
13453 04:04:53 poll([{fd=12, events=POLLIN}], 1, 10000) = 1 ([{fd=12, revents=POLLIN}])
13453 04:04:53 recvfrom(12, "bKTfmI5fiRK/Lb+LqO8SLvT6cVR7hiJ1"..., 32768, 0, NULL, NULL) = 2896
...
13453 04:20:57 poll([{fd=12, events=POLLIN}], 1, 10000) = 1 ([{fd=12, revents=POLLIN}])
13453 04:20:57 recvfrom(12, "GciDP/vBzWo3GciFP/vhLWs4dxnIiT1b"..., 32768, 0, NULL, NULL) = 5792
13453 04:20:57 poll([{fd=12, events=POLLIN}], 1, 10000) = 1 ([{fd=12, revents=POLLIN}])
13453 04:20:57 recvfrom(12, "r255TvRiA1vJ7pnoxAK3CmqkRcpoRElU"..., 32768, 0, NULL, NULL) = 5776
13453 04:20:57 madvise(0x7f2fc4789000, 8425472, MADV_DONTNEED) = 0

i.e. the job spent 16 minutes in a long chain of poll()+recvfrom() calls reading from cmsfrontier.cern.ch (or maybe really from the proxy). I wonder whether this is because the reads are slow, because the job reads a lot of data, or because of something else.
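
(One rough way to distinguish "the reads are slow" from "the job reads a lot" would be to time a single large fetch through the same proxy outside of cmsRun. A minimal sketch; the Frontier URL and the proxy host below are placeholders, not values taken from this job:)

import time
import urllib.request

# Placeholders: substitute a real Frontier servlet URL (including the query)
# and the site squid/proxy that appears in the strace output (port 3128).
FRONTIER_URL = "http://cmsfrontier.cern.ch:8000/FrontierProd/Frontier"
PROXY = "http://my-site-squid.example.org:3128"

opener = urllib.request.build_opener(urllib.request.ProxyHandler({"http": PROXY}))

start = time.monotonic()
with opener.open(FRONTIER_URL, timeout=600) as response:
    payload = response.read()
elapsed = time.monotonic() - start

print(f"fetched {len(payload) / 1e6:.1f} MB in {elapsed:.1f} s "
      f"({len(payload) / 1e6 / elapsed:.2f} MB/s)")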

@DrDaveD, would you have any ideas or suggestions what to check next?

@DrDaveD

DrDaveD commented Apr 7, 2020

Since you were able to run strace I assume it is quite reproducible. Please run it with the environment variables FRONTIER_LOG_LEVEL=debug and FRONTIER_LOG_FILE=frontier_client.log and point me to the log file.
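
(For convenience, a minimal sketch of running the job with those variables set; step3.py is a placeholder for whatever configuration reproduces the slow first event:)

import os
import subprocess

# Enable Frontier client debug logging for this job only.
env = dict(os.environ)
env["FRONTIER_LOG_LEVEL"] = "debug"
env["FRONTIER_LOG_FILE"] = "frontier_client.log"

# "step3.py" is a placeholder for the actual cmsRun configuration.
subprocess.run(["cmsRun", "step3.py"], env=env, check=True)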

@makortel
Contributor

makortel commented Apr 7, 2020

Since you were able to run strace I assume it is quite reproducible. Please run it with the environment variables FRONTIER_LOG_LEVEL=debug and FRONTIER_LOG_FILE=frontier_client.log and point me to the log file.

@DrDaveD I have the log here (8.4 MB)
https://mkortela.web.cern.ch/mkortela/debug/frontier_client.log
Thanks!

In this job the first event took 10 minutes to process.

@DrDaveD

DrDaveD commented Apr 7, 2020

The problem was with a temporary physical machine we had in service as cmsmeyproxy2. It was taking 10 minutes to transfer a 133 MB query instead of 2 seconds. I took it out of service, so performance should be much better now. Nothing showed up as wrong in the total-throughput monitoring, so thank you for bringing this to my attention; we don't routinely monitor the transfer rates of large queries.

@makortel
Contributor

makortel commented Apr 7, 2020

Thanks Dave! I take it that the problem should be gone now, and the issue can be closed.

@makortel
Contributor

makortel commented Apr 7, 2020

+1

@VinInn
Contributor

VinInn commented Apr 8, 2020

Sorry, I think I am late and this was already understood.
Indeed, offline siPixelClustersPreSplitting loads by far the largest payload: the pixel gains (one per pixel channel). Typically the first event is much slower than any other, so the TimeReport should show a huge difference between the average and the maximum.

@silviodonato
Contributor Author

Thanks to everybody, the issue is solved; see for instance #29422.
Closing this issue.
