Memory increase not applied for resubmitted merge tasks #11953

Open
yanfr0818 opened this issue Mar 28, 2024 · 6 comments

Comments

@yanfr0818

yanfr0818 commented Mar 28, 2024

For failed WFs with the 50660-PerformanceKill error, we want to resubmit the WFs with increased memory. However, when Unified creates the resubmission for ReReco jobs, the memory increase is not applied to merge tasks, as can be seen in this code: https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/WMSpec/WMTask.py#L607.

We don't understand the rationale behind this restriction. If we would like to increase the memory of merge jobs, we should be able to do it.

Example of a failed WF: https://cmsweb.cern.ch/reqmgr2/fetch?rid=cmsunified_ACDC4_r-0-Run2022F_ZeroBias_JMENano12p5_240201_014109_1580
The memory increase is reflected in the JSON config. However, the maxPSS is still set to 2355.2 in the Config tab.
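
To make the reported behavior concrete, here is a minimal sketch of a memory override that skips certain task types. It is not the WMCore code: the `Task` class, `apply_memory_override`, and the skip list are hypothetical names for illustration, and only the presence of Merge in the list is confirmed by this thread. The value 2355.2 is the maxPSS shown in the Config tab; 5000 is an arbitrary example of a requested value.

```python
from dataclasses import dataclass

# Hypothetical skip list. Only "Merge" is confirmed by this thread;
# the other entries are assumptions added for illustration.
SKIPPED_TASK_TYPES = ("Merge", "Cleanup", "LogCollect")


@dataclass
class Task:
    """Simplified stand-in for a workload task (not the WMCore class)."""
    name: str
    task_type: str
    max_pss_mb: float


def apply_memory_override(tasks, new_max_pss_mb, skipped=SKIPPED_TASK_TYPES):
    """Set max_pss_mb on every task whose type is not in the skip list."""
    for task in tasks:
        if task.task_type in skipped:
            # Skipped task types keep whatever value they already had.
            continue
        task.max_pss_mb = new_max_pss_mb
    return tasks


example_tasks = [
    Task("DataProcessing", "Processing", 2355.2),
    Task("DataProcessingMergeOutput", "Merge", 2355.2),
]
apply_memory_override(example_tasks, 5000.0)
for task in example_tasks:
    print(task.name, task.task_type, task.max_pss_mb)
```

In this sketch the processing task picks up the new value while the merge task keeps its original one, which matches the mismatch between the JSON config and the Config tab described above.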

@amaltaro
Contributor

Hi @yanfr0818, memory increase for merge tasks is not supported in WMCore. Whenever such memory-hungry jobs are spotted in production, the cause is usually a configuration and/or CMSSW problem.

I would suggest reporting this to Core Software, as the merge process is supposed to be very lightweight in terms of resource requirements.

@haozturk

haozturk commented Apr 3, 2024

Thanks @amaltaro, we'll follow up with Core Software. Still, is there a reason why this memory increase is completely blocked? Ops should be able to increase the memory if need be. This might be necessary to finish urgent workflows while we investigate their higher memory usage. We can come up with a PR to lift this restriction if you think this is not a breaking change. @hassan11196 @lucalavezzo FYI

@yanfr0818
Author

There are some comments from the software core team in this issue.
In short, the high memory usage is caused by the serialization of ParameterSets. While they are trying to find a long-term solution for this, we can resubmit these jobs with a higher memory requirement (say 5 GB) as a quick fix.

Does this sound good to you? @amaltaro

@haozturk

Alan, is this a matter of removing Merge from this list [1]? We can make a PR for this, but we're not sure how to test it and see whether it would break something.

[1] https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/WMSpec/WMTask.py#L607

@yanfr0818
Author

Hi @amaltaro, we'd like to follow up on this issue. Does the solution that Hasan proposed above work? Can we make a PR for this?

This issue has already been fixed by the software core team and will be propagated with the next release of CMSSW. But we still need to resolve the failed WFs at hand.

@amaltaro
Contributor

Apologies for the belated reply.
The last time I looked into this, the process was much more convoluted than it looks.

We would need to have resource requirements for Merge tasks as well and properly map them by their names between workflow assignment and workflow construction. So we would likely have to change the construction of workload objects upstream (ReqMgr2).
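
As a rough illustration of the name-based mapping described above, the sketch below resolves a per-task memory request against the task names of a constructed workflow. The function name, the dictionary layout, and the task names are all hypothetical; this is not how ReqMgr2 or WMCore currently represents resource requirements.

```python
def resolve_memory(task_names, memory_by_task, default_mb):
    """Return {task name: memory in MB} for every task in the workflow.

    task_names:     names of the tasks produced at workflow construction
    memory_by_task: per-task memory requested at assignment (hypothetical format)
    default_mb:     value used for tasks that were not named in the request
    """
    # A requested name that matches no constructed task would otherwise be
    # silently ignored, so reject it explicitly.
    unknown = sorted(set(memory_by_task) - set(task_names))
    if unknown:
        raise KeyError(f"Memory requested for unknown tasks: {unknown}")
    return {name: memory_by_task.get(name, default_mb) for name in task_names}


workflow_tasks = ["DataProcessing", "DataProcessingMergeOutput"]
requested = {"DataProcessingMergeOutput": 5000.0}
print(resolve_memory(workflow_tasks, requested, default_mb=2355.2))
```

The explicit check for unknown names reflects the concern raised here: if the names used at assignment and at construction drift apart, a requested increase could be dropped without anyone noticing.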

Plus, StepChains are even more complicated, given that we have a single resource requirement for all the steps.

My opinion on such a development is that we would potentially hurt the system with unnecessary complexity and likely create other bugs.

If the CMSSW release is buggy, then we should run the same workflow once a new release is made. Additionally, we could look into allowing a CMSSW + ScramArch override in the ACDC creation, if that is really desired.

3 participants