Memory increase not applied for resubmitted merge tasks #11953
Comments
Hi @yanfr0818, memory increase for merge tasks is not supported in WMCore. Whenever such memory-hungry jobs are spotted in production, the cause is usually a configuration and/or CMSSW problem. I would suggest reporting this to Core Software, as the merge process is supposed to be very lightweight in terms of resource requirements.
Thanks @amaltaro, we'll follow up with Core Software. Still, is there a reason why this memory increase is completely blocked? Ops should be able to increase the memory if need be. This might be necessary to finish urgent workflows while we investigate their higher memory usage. We can come up with a PR to lift this restriction if you think it is not a breaking change. @hassan11196 @lucalavezzo FYI
There are some comments from the software core team in this issue. Does this sound good to you? @amaltaro
Alan, is this just a matter of removing [1]?
[1] https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/WMSpec/WMTask.py#L607
Hi @amaltaro, we'd like to follow up on this issue. Does the solution that Hasan proposed above work? Can we make a PR for it? The underlying problem has already been fixed by the software core team, and the fix will be propagated with the next release of CMSSW. But we still need to resolve the failed WFs at hand.
Apologies for the belated reply. We would need to have resource requirements for Merge tasks as well, and properly map them by task name between workflow assignment and workflow construction. So we would likely have to change the construction of workload objects upstream (ReqMgr2). Plus, StepChains are even more complicated, given that we have a single resource requirement for all the steps. My opinion is that such a development would potentially hurt the system with unnecessary complexity and likely create other bugs. If a CMSSW release is buggy, then we should run the same workflow again once a new release is made. Additionally, we could look into allowing a CMSSW + ScramArch override in the ACDC creation, if that is really desired.
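To make the "map them by their names" point concrete, here is a rough sketch of what a per-task memory mapping could look like. This is not WMCore or ReqMgr2 code: `resolve_memory`, its arguments, and the task names are all hypothetical, and it only assumes that a request might carry either one number for every task or a dict keyed by task name.

```python
# Hypothetical sketch; none of these names come from WMCore or ReqMgr2.
def resolve_memory(task_names, memory_request, default_mb):
    """Return a {task_name: memory_mb} mapping.

    memory_request is either a single number (applied to every task)
    or a dict keyed by task name (missing tasks fall back to default_mb).
    """
    if isinstance(memory_request, dict):
        # Reject names that do not match any task, so typos fail loudly.
        unknown = set(memory_request) - set(task_names)
        if unknown:
            raise KeyError(f"Unknown task names: {sorted(unknown)}")
        return {name: float(memory_request.get(name, default_mb))
                for name in task_names}
    return {name: float(memory_request) for name in task_names}


tasks = ["Reco", "RecoMerge"]  # illustrative task names
per_task = resolve_memory(tasks, {"RecoMerge": 4000}, default_mb=2355.2)
```

The name check is the part that has to stay consistent between assignment and construction: if the two sides disagree on task names, the override is either lost silently or rejected.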
For failed WFs with the 50660-PerformanceKill error, we want to resubmit the WFs with increased memory. However, when Unified creates the resubmission for ReReco jobs, the memory increase is not applied to merge tasks, as can be seen in this code: https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/WMSpec/WMTask.py#L607.
We don't understand the rationale behind this restriction. If we want to increase the memory of merge jobs, we should be able to do so.
Example of a failed WF: https://cmsweb.cern.ch/reqmgr2/fetch?rid=cmsunified_ACDC4_r-0-Run2022F_ZeroBias_JMENano12p5_240201_014109_1580
The memory increase is reflected in the JSON config. However, the maxPSS is still set to 2355.2 in the Config tab.
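For readers who have not opened the linked file, the behaviour we are describing can be sketched roughly as follows. This is an illustration only, not the actual WMCore code: the `Task` class, `apply_memory`, the `allow_protected` flag, and the set of protected task types are all assumptions made up for this example.

```python
# Hypothetical sketch; names are illustrative and not taken from WMCore.
from dataclasses import dataclass, field

# Task types assumed to be exempt from memory overrides (an assumption).
PROTECTED_TASK_TYPES = {"Merge", "Cleanup", "LogCollect"}


@dataclass
class Task:
    name: str
    task_type: str
    memory_mb: float
    children: list = field(default_factory=list)


def apply_memory(task, memory_mb, allow_protected=False):
    """Set memory on a task tree, skipping protected types unless allowed."""
    if allow_protected or task.task_type not in PROTECTED_TASK_TYPES:
        task.memory_mb = memory_mb
    for child in task.children:
        apply_memory(child, memory_mb, allow_protected)


merge = Task("RecoMerge", "Merge", 2355.2)
reco = Task("Reco", "Processing", 4000.0, [merge])

# The parent task picks up the new value; the merge child keeps its old one.
apply_memory(reco, 8000.0)
```

With something like the hypothetical `allow_protected=True`, the same call would also update the merge task, which is the kind of opt-in that Ops is asking for.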