
FAQ frequently asked questions

Alan Malta Rodrigues edited this page Apr 9, 2020 · 3 revisions

This wiki is meant to contain the most common questions and answers related to the Workload Management operations.

Just a reminder about the usual monitoring tools though:

Why are there so many workflows stuck in acquired state?

There is no single answer to this question, but the monitoring tools usually provide enough information to reach a conclusion.

From the monitoring links above, open the Condor pool summary link, go to the Site Table and check the last row of the IdleCpus column. At the time of writing the value is 3723, i.e. there are 3723 CPUs free in the system. The likely reason they are not being used is that (some) workflows are not properly dimensioned, sometimes needing more memory than the usual 2.5GB/core.
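As a rough illustration of the 2.5GB/core rule of thumb, the sketch below checks whether a request exceeds that ratio. The function names and the MB-to-GB conversion (1 GB taken as 1000 MB) are assumptions made for this example; they are not part of WMCore or HTCondor.

```python
# Hypothetical helper, not part of any WM tool: compares a memory request
# against the usual 2.5GB/core ratio mentioned above.
DEFAULT_GB_PER_CORE = 2.5


def memory_per_core_gb(memory_mb, cores):
    """Return the requested memory per core in GB (assuming 1 GB = 1000 MB)."""
    if cores <= 0:
        raise ValueError("cores must be positive")
    return memory_mb / 1000.0 / cores


def is_over_dimensioned(memory_mb, cores, limit_gb=DEFAULT_GB_PER_CORE):
    """True if the request needs more memory per core than the given limit."""
    return memory_per_core_gb(memory_mb, cores) > limit_gb


if __name__ == "__main__":
    # Made-up requests: 16000 MB on 4 cores is 4.0 GB/core, 8000 MB is 2.0 GB/core.
    print(is_over_dimensioned(16000, 4))
    print(is_over_dimensioned(8000, 4))
```

Jobs above the ratio can leave cores idle on a slot, because the slot runs out of memory before it runs out of CPUs.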

The WMAgent monitoring also has some useful plots in this respect, especially those for "GQ elements by priority", for instance this one, which shows thousands of GQEs in Available above the 80k priority. This also explains why workflows at 80k priority are not going through.

Final note: if you still think there might be a problem in the system, you can always pick one workflow and bump its priority to the highest in the system. If it does not get jobs running within a couple of hours, then there is a high chance of a problem in the WM system (provided the site is up and running).

Why is the workflow configured to request X GB of memory while jobs in Condor request something different?

One way to answer this is to look at the job classads and check whether the job has been tuned. Another possibility is that you are not looking in the right place, because the memory requirements can be overwritten during workflow assignment (or at any time before the workflow gets assigned).

The most reliable way to check the workflow/task memory requirements is through the following link (replace the workflow name with the one you want to look at): https://cmsweb.cern.ch/reqmgr2/config?name=cmsunified_ACDC0_task_SMP-RunIISummer15GS-00286__v1_T_200408_083701_9533 then look for the keyword memoryRequirement. It gives the memory requirement for every single task in the workflow (in this case 4GB for all of them).
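If the config page has been saved to a local text file, the keyword search can be scripted. The sketch below assumes the dump consists of `dotted.path = value` lines; that layout, the sample paths and the sample values are assumptions made for illustration, so adapt the pattern to what the page actually returns.

```python
import re

# Assumed line layout: "<dotted.path>memoryRequirement = <value>".
PATTERN = re.compile(r"^(?P<path>\S*memoryRequirement)\s*=\s*(?P<value>\S+)$")


def find_memory_requirements(config_text):
    """Return (path, value) pairs for every memoryRequirement line found."""
    results = []
    for line in config_text.splitlines():
        match = PATTERN.match(line.strip())
        if match:
            results.append((match.group("path"), match.group("value")))
    return results


# Made-up sample text; real task names and values will differ.
SAMPLE = """
workload.tasks.TaskA.memoryRequirement = 4000
workload.tasks.TaskA.someOtherSetting = 1
workload.tasks.TaskB.memoryRequirement = 4000
"""

if __name__ == "__main__":
    for path, value in find_memory_requirements(SAMPLE):
        print(path, value)
```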

In most cases the JSON tab/view can also be used, e.g.: https://cmsweb.cern.ch/reqmgr2/fetch?rid=cmsunified_ACDC0_task_SMP-RunIISummer15GS-00286__v1_T_200408_083701_9533 However, it can be tricky because the parameter appears multiple times and one occurrence takes precedence over another.
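To see every place where the parameter shows up in the JSON view, a small recursive search can list each occurrence with its path. This is only a sketch: the key name `Memory` and the sample document are assumptions, and the code lists occurrences without deciding which one takes precedence.

```python
import json


def find_key(obj, key, path=""):
    """Return (path, value) pairs for every occurrence of key in nested JSON data."""
    hits = []
    if isinstance(obj, dict):
        for name, value in obj.items():
            child = f"{path}.{name}" if path else name
            if name == key:
                hits.append((child, value))
            hits.extend(find_key(value, key, child))
    elif isinstance(obj, list):
        for index, value in enumerate(obj):
            hits.extend(find_key(value, key, f"{path}[{index}]"))
    return hits


# Made-up document; a real request has many more fields and other task names.
SAMPLE_JSON = '{"Memory": 2300, "Task1": {"Memory": 4000}, "Task2": {"Memory": 4000}}'

if __name__ == "__main__":
    for path, value in find_key(json.loads(SAMPLE_JSON), "Memory"):
        print(path, value)
```

Listing the paths side by side makes it easier to spot a top-level value that differs from the per-task ones before comparing with the config view.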
