Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[81X] Fix workflow 4.76 #15304

Closed
smuzaffar opened this issue Jul 27, 2016 · 11 comments
Closed

[81X] Fix workflow 4.76 #15304

smuzaffar opened this issue Jul 27, 2016 · 11 comments

Comments

@smuzaffar
Copy link
Contributor

I confirm that by using

 --inputCommands "keep *","drop *_*_*_reRECO"

workflow 4.76 runs fine. I see that current

"drop *_*_*_RECO"

comes from

python/relval_steps.py:steps['HLTDSKIM']=merge([{'--inputCommands':'"keep *","drop *_*_*_RECO"'},steps['HLTD']])

but updating HLTDSKIM to have reRECO means all 4.xx workflows will be dropping reRECO instead of RECO and ends up failing e.g 4.37 then fails with similar message

----- Begin Fatal Exception 27-Jul-2016 12:57:22 CEST-----------------------
An exception of category 'FileReadError' occurred while
   [0] Processing run: 177790 lumi: 214 event: 280479856
   [1] Running path 'FEVTDEBUGHLToutput_step'
   [2] Calling event method for module PoolOutputModule/'FEVTDEBUGHLToutput'
   [3] Rethrowing an exception that happened on a different thread.
   [4] Reading branch EcalRecHitsSorted_ecalPreshowerRecHit_EcalRecHitsES_RECO.
   Additional Info:
      [a] Fatal Root Error: @SUB=TBufferFile::CheckByteCount
object of class vector<EcalRecHit> read too few bytes: 39324 instead of 74268

----- End Fatal Exception -------------------------------------------------

@davidlange6 , any idea how to avoid this?

@cmsbuild
Copy link
Contributor

A new Issue was created by @smuzaffar Malik Shahzad Muzaffar.

@davidlange6, @smuzaffar, @davidlt, @Dr15Jones can you please review it and eventually sign/assign? Thanks.

cms-bot commands are list here #13029

@slava77
Copy link
Contributor

slava77 commented Jul 27, 2016

On 7/27/16 4:05 AM, Malik Shahzad Muzaffar wrote:

I confirm that by using

|--inputCommands "keep ","drop *__*_reRECO" |

workflow 4.76 runs fine. I see that current

|"drop ____*_RECO" |

comes from

|python/relval_steps.py:steps['HLTDSKIM']=merge([{'--inputCommands':'"keep ","drop
*
__*_RECO"'},steps['HLTD']]) |

but updating HLTDSKIM to have reRECO means all 4.xx workflows will be
dropping reRECO instead of RECO and ends up failing e.g 4.37 then fails
with similar message

Just add both RECO and reRECO. The logic in the setup of these workflows
is to drop earlier "reco" of all kinds.

@smuzaffar
Copy link
Contributor Author

smuzaffar commented Aug 12, 2016

I dropped reRECO and although it fixed the step2 but then step3 failed with error message

----- Begin Fatal Exception 12-Aug-2016 16:30:10 CEST-----------------------
An exception of category 'Configuration' occurred while
   [0] Calling InputSource::readRun_
Exception Message:
Duplicate Process. The process name reRECO was already in the ProcessHistory.
Please modify the configuration file to use a distinct process name.
----- End Fatal Exception -------------------------------------------------

The cmsDriver commands for step2 and step3 are

cmsDriver.py step2  --datatier FEVTDEBUGHLT --conditions auto:run1_hlt_Fake -s L1REPACK,HLT:@fake --process reHLT --eventcontent FEVTDEBUGHLT --data  --inputCommands "keep *","drop *_*_*_RECO","drop *_*_*_reRECO" -n 100  --filein filelist:step1_dasquery.log --lumiToProcess step1_lumiRanges.log --fileout file:step2.root

cmsDriver.py step3  --conditions auto:run1_data_Fake -s RAW2DIGI,L1Reco,RECO,EI,PAT,ALCA:SiStripCalZeroBias+SiStripCalMinBias+TkAlMinBias,DQM:@standardDQMFakeHLT+@miniAODDQM --runUnscheduled  --process reRECO --data  --eventcontent RECO,MINIAOD,DQM --hltProcess reHLT --scenario pp --datatier RECO,MINIAOD,DQMIO -n 100  --filein file:step2.root  --fileout file:step3.root

@Dr15Jones
Copy link
Contributor

@wddgit David, in this case do we fail to drop the history?

@wddgit
Copy link
Contributor

wddgit commented Aug 16, 2016

The current behavior is this. Nothing is ever dropped from the ProcessHistory, not even if all the products are dropped. In addition, it is an error to have a process read a file which has in its ProcessHistory an earlier process with the same name. The only exceptional case to mention is that if no product is produced at all, then the restriction does not apply and the process name does not get added to the ProcessHistory. I ran the code, printed out the contents of the ProcessHistory and this is the behavior I see. From looking at the code quickly, it appears this has been the behavior since at least 2011 when I put in the set of changes that implemented the reduced ProcessHistoryID. It is hard to trace back earlier, because so much of the relevant code changed at that point. I do not remember whether the same behavior was there before or not. If it was a change, then I do not remember whether it was done for a reason or was unintentional.

If we wanted to change the current behavior, we would need to ensure that we dropped all the products from the process AND all subsequent processes before we tried to reuse the process name. And I think also we would need to spend some time thinking about all the metadata like ProductIDs and parentage and how they could be affected by this. It would take some time to think all this through. I can see two options in how to proceed.

Option 1. We could drop the process name from the ProcessHistory, then one issue would be that events which were treated as being in different runs because they had differing ProcessHistoryIDs might suddenly have the same ProcessHistoryID because we dropped the process that caused a difference. We would have to think about how to deal with that case.

Option 2. Allow the same process name to appear multiple times in the ProcessHistory. Doing this would also require a lot of thought to verify that we are not breaking something by allowing this behavior.

(This is all independent of the recent bug fix I put in related to duplicate process names. The problem above related to a check run by the InputSource when reading the first run. The recent bug fix was related to a check run when initializing the lookup tables in the ProductRegistry. That was only checking entries in the input product registry that actually conflicted with the current process name. So that check would be needed in any case.)

@slava77
Copy link
Contributor

slava77 commented Aug 16, 2016

Why not just edit the step3 of this workflow and tell it to use --process reRECO2 this has fewer conceptual implications.
This is applicable to workflows with a pattern of inputs from skims and rerecos.

@cmsbuild
Copy link
Contributor

#15498 should fix 4.76

@smuzaffar
Copy link
Contributor Author

assign core

@smuzaffar
Copy link
Contributor Author

+1
fixed via #15498 for CMSSW_8_1_X_2016-08-18-1100

@cmsbuild
Copy link
Contributor

New categories assigned: core

@Dr15Jones,@smuzaffar you have been requested to review this Pull request/Issue and eventually sign? Thanks

@cmsbuild
Copy link
Contributor

This issue is fully signed and ready to be closed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants