
Improve resource requirements for utilitarian jobs #8331

Open
amaltaro opened this issue Nov 15, 2017 · 12 comments

@amaltaro
Contributor

For cleanup, logcollect and merge jobs.
By default, they use 1 core, request 1GB of RAM and have a MaxRSS watchdog set to ~2.3GB.

We should check the ES data and maybe lower these requirements for better usage of resources.
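
To get an idea of the actual footprint, something like the ES aggregation sketched below could answer this. The index name, field names and endpoint are assumptions for illustration only, not the real monitoring schema:

    # Hedged sketch: aggregate peak RSS per job type from a job-monitoring
    # Elasticsearch index. "wmarchive-jobs", "jobtype" and "PeakValueRss"
    # are made-up names standing in for whatever the real schema uses.
    import json
    import requests

    ES_URL = "http://localhost:9200/wmarchive-jobs/_search"  # assumed endpoint

    query = {
        "size": 0,
        "query": {"terms": {"jobtype": ["Merge", "Cleanup", "LogCollect"]}},
        "aggs": {
            "by_jobtype": {
                "terms": {"field": "jobtype"},
                "aggs": {
                    "peak_rss": {
                        "percentiles": {"field": "PeakValueRss",
                                        "percents": [50, 95, 99]}
                    }
                },
            }
        },
    }

    resp = requests.post(ES_URL, data=json.dumps(query),
                         headers={"Content-Type": "application/json"})
    for bucket in resp.json()["aggregations"]["by_jobtype"]["buckets"]:
        print("%s: %s" % (bucket["key"], bucket["peak_rss"]["values"]))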

@amaltaro amaltaro self-assigned this Nov 15, 2017
@amaltaro amaltaro added this to the WMAgent1801 milestone Nov 15, 2017
@amaltaro
Contributor Author

Might not be a very good idea... I've just found a merge job for the TaskChain_Relval_Multicore template that had a performance failure:

    PerformanceError
        PerformanceKill (Exit Code: 50660)

            Error in CMSSW step cmsRun1
            Number of Cores: None
            Job has exceeded maxRSS: 2355.2
            Job has RSS: 2425

@hufnagel
Member

Weird. Merge jobs should all use fast-copy of baskets, which is fast and should use little memory. Might be worthwhile to get a log of that job and figure out what went wrong...
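
For reference, the merge itself should amount to little more than the PyROOT sketch below (file names are placeholders); if I understand the fast cloning correctly, baskets get copied as-is rather than being unpacked event by event, which is why memory should stay low:

    # Rough sketch of a plain ROOT merge via TFileMerger; file names are
    # placeholders. When fast cloning applies, compressed baskets are copied
    # directly into the output file.
    import ROOT

    merger = ROOT.TFileMerger(False)          # no local staging of inputs
    merger.OutputFile("Merged.root")
    for name in ["input_1.root", "input_2.root", "input_3.root"]:
        merger.AddFile(name)
    ok = merger.Merge()
    print("merge succeeded" if ok else "merge failed")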

@amaltaro
Contributor Author

Are you volunteering yourself to look at it? :)

@ticoann ticoann modified the milestones: WMAgent1801, WMAgent1802 Feb 12, 2018
@vlimant
Contributor

vlimant commented Feb 13, 2018

#8451 may be a duplicate

@amaltaro
Contributor Author

You mean the other way around :)

@vlimant
Contributor

vlimant commented Feb 13, 2018

From #8451: make sure you also update what goes into HTCondor when you rework this.

@thongonary

So... how straightforward is it to increase the threshold to some higher value, say, 4GB?

@hufnagel
Member

hufnagel commented Mar 2, 2018

You don't want to do this for all such jobs. Requesting 4GB for standard merge, cleanup, logcollect etc. jobs means you have fewer resources that can run them (you wait longer to run them and can run fewer of them), and you leave fewer resources available for other jobs.

If special types of utility jobs (i.e. NANOAOD merges that aren't really standard merges) need more memory, we should request more memory just for these special types of jobs.

Cleanup and LogCollect could probably be reduced though.
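
Just to put rough numbers on the tradeoff, a back-of-the-envelope sketch (the 64 GB / 32-core node is hypothetical, and it assumes memory is the binding constraint):

    # Rough illustration of the scheduling cost of over-requesting memory:
    # how many 1-core utility jobs fit on a hypothetical 64 GB / 32-core node.
    NODE_MEMORY_GB = 64
    NODE_CORES = 32

    for request_gb in (1, 2, 4):
        slots = min(NODE_CORES, NODE_MEMORY_GB // request_gb)
        print("request %d GB -> at most %d such jobs per node" % (request_gb, slots))

    # request 1 GB -> at most 32 such jobs per node
    # request 2 GB -> at most 32 such jobs per node
    # request 4 GB -> at most 16 such jobs per node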

@thongonary

If special types of utility jobs (i.e. NANOAOD merges that aren't really standard merges) need more memory, we should request more memory just for these special types of jobs.

Thanks! That's what we want.

@amaltaro
Contributor Author

amaltaro commented Mar 3, 2018

For the record, several of the Task getter/setter methods don't touch "utilitarian" jobs. Right now we cannot change resource requirements for such jobs, and if we want to support updates to those tasks too, that's going to be tricky and likely ugly for the assigner/unified side (the only way I see memory updates working without causing issues in other tasks would be specifying every single task and its Memory requirement).
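
To illustrate what I mean, the only shape I can see working is something like the sketch below, where every single task carries its own Memory value at assignment time. Task names and the dict-valued "Memory" convention are made up for illustration; this is not the current assignment schema:

    # Hypothetical per-task memory specification at assignment time.
    # Task names and the "Memory"-as-a-dict convention are illustrative only.
    assign_args = {
        "Memory": {
            "Task1": 4000,                    # main processing task
            "Task1MergeNANOAODSIM": 3000,     # special (heavier) merge
            "Task1CleanupUnmerged": 500,      # utilitarian tasks, lowered
            "Task1LogCollect": 500,
        }
    }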

@bbockelm
Contributor

bbockelm commented Mar 4, 2018

Hi,

Note that the NanoAOD merge issues are really a ROOT bug -- and affect how effectively these files can be read by users. See:

cms-sw/cmssw#22445

For the other merge jobs - are we really seeing memory limits, or are we simply snapshotting cmsRun when it forks? The watchdog should be using PSS, not RSS, in the end.
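
To make the RSS vs PSS distinction concrete, here is a small sketch that reads both from /proc/<pid>/smaps (plain Linux accounting, nothing WMCore-specific). Summing PSS over a forked cmsRun tree splits shared pages among the processes instead of counting them once per fork:

    # Compare RSS and PSS for one process by parsing /proc/<pid>/smaps.
    # PSS divides each shared page by the number of processes mapping it,
    # so it avoids double-counting memory shared across a forked process tree.
    import os

    def smaps_totals(pid):
        """Return (rss_kb, pss_kb) summed over all mappings of `pid`."""
        rss = pss = 0
        with open("/proc/%d/smaps" % pid) as fh:
            for line in fh:
                if line.startswith("Rss:"):
                    rss += int(line.split()[1])
                elif line.startswith("Pss:"):
                    pss += int(line.split()[1])
        return rss, pss

    rss_kb, pss_kb = smaps_totals(os.getpid())
    print("RSS=%d kB  PSS=%d kB" % (rss_kb, pss_kb))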

Brian

@ticoann ticoann modified the milestones: WMAgent1804, WMAgent1805 Apr 10, 2018
@ticoann ticoann modified the milestones: WMAgent1805, WMAgent1809 Aug 28, 2018
@ticoann ticoann modified the milestones: WMAgent1809, WMAgent1904 Dec 26, 2018
@amaltaro
Contributor Author

I suggest we first update the watchdog to use PSS instead of RSS. Then we collect data for a couple of months and give those utilitarian jobs reasonable resource requirements in order to minimize resource wastage.
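
Roughly, the watchdog side of that change would look like the sketch below; the threshold, the poll interval and the signal used to stop the payload are placeholders, not the actual PerformanceMonitor code or defaults:

    # Hypothetical PSS-based watchdog loop. MAX_PSS_MB, POLL_SECONDS and the
    # signal choice are placeholders for illustration only.
    import os
    import signal
    import time

    MAX_PSS_MB = 2355.2      # placeholder, mirroring today's maxRSS threshold
    POLL_SECONDS = 300

    def pss_mb(pid):
        """Sum the Pss: lines of /proc/<pid>/smaps, in MB."""
        kb = 0
        with open("/proc/%d/smaps" % pid) as fh:
            for line in fh:
                if line.startswith("Pss:"):
                    kb += int(line.split()[1])
        return kb / 1024.0

    def watch(pid):
        while True:
            try:
                used = pss_mb(pid)
            except (IOError, OSError):
                return                        # payload already finished
            if used > MAX_PSS_MB:
                os.kill(pid, signal.SIGUSR2)  # placeholder "stop now" signal
                return
            time.sleep(POLL_SECONDS)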
