
obsolete_NgamaiQueueLimits

trac edited this page Oct 9, 2014 · 1 revision


Enforcement of Time and Memory limits and Project names in OGE jobs on Ngamai:

Enforcement of time, memory and project specification is now active on Ngamai for all OGE jobs.

This applies to all jobs run on Ngamai using qsub or qrsh, whether batch or interactive.

You must now specify time and memory limits and project name in your qsub and qrsh batch jobs and interactive qrsh sessions.

For missing values the OGE JSV script sets defaults and produces output describing the actions taken.

For a dry run without job submission, use the OGE qsub option "-w p", which prints a validation report.

Project Specification:

-P project_name specifies the project name for the OGE job.

The list of project names, project managers and allowed users for each project can be found at

http://wiki.bom.gov.au/foswiki/pub/Main/HPCCC/Project_Group_Accounting_Ngamai.xls

Project names not in the spreadsheet will cause your job to be rejected with a message that the project does not exist.

Additions and changes to the spreadsheet can only be made by the project manager, by sending an email to ngamaihelp@bom.gov.au.

The default project name is “general”.

However, obtaining project general by default should be avoided, as your job will then be given very low default resource limits: 2GB of memory for all processes, and 600 seconds of runtime.

You can explicitly set the project to general and set suitable resource limits; however, this is not recommended and will eventually be disabled.

Memory Specification:

-l s_vmem and h_vmem specify soft and hard virtual memory per process.

No process can use more than this amount of memory. If a process exceeds it, a signal may be sent to the job; see below.

-l mem_free specifies the memory per process required for the job.

OGE must allocate nodes to the job with at least this much free memory.

Important: the memory requirements specified in your job are per process!

You are advised to supply only the soft limit for memory and allow the OGE JSV to set the other limits: h_vmem = s_vmem + 1GB, mem_free = h_vmem.

If only the hard limit is set then the other limits are: s_vmem = mem_free = h_vmem.

If no memory limit is set then the defaults are: s_vmem = 2GB/#processes, h_vmem = 3GB/#processes, mem_free = h_vmem.

The upper limit for virtual memory on Ngamai is: s_vmem = 59GB/#processes, h_vmem = 60GB/#processes, mem_free = h_vmem.
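The memory rules above can be sketched as a small shell function. This is only an illustration of the stated rules, not the actual JSV script: the function name jsv_mem_limits is hypothetical, 1GB is taken as 2^30 bytes, and the extra 1GB for the hard limit is divided by the number of processes, which is inferred from the worked examples further down this page.

```shell
GIB=1073741824   # assumption: 1GB counted as 2^30 bytes

# jsv_mem_limits NPROCS [S_VMEM_BYTES]   (hypothetical helper)
# Prints the per-process limits the rules above would give.
jsv_mem_limits() (
  nprocs=$1
  # No soft limit supplied: default is 2GB shared across all processes.
  s_vmem=${2:-$(( 2 * GIB / nprocs ))}
  # Hard limit: soft limit plus 1GB overall, i.e. 1GB/nprocs per process.
  h_vmem=$(( s_vmem + GIB / nprocs ))
  # mem_free follows the hard limit.
  echo "s_vmem=${s_vmem},h_vmem=${h_vmem},mem_free=${h_vmem}"
)

jsv_mem_limits 12              # defaults for a 12-process job
jsv_mem_limits 3 6442450944    # 6GB soft limit, 3 processes
```

With these assumptions the s_vmem and h_vmem values agree with the byte counts shown in the general examples below.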

The mem_free parameter ensures memory on a node will not be oversubscribed (e.g. by running two jobs on the same node, each one using 35 GB of memory).

If a job requests more memory than is available on a node, the job will be rejected with a message that too much memory was requested.

In normal use a memory exceeded signal will not be generated by the OS.

If a program does request too much memory then it will crash with an error, e.g. a segmentation violation.

If a memory exceeded signal is wanted for error handling then please email ngamaihelp@bom.gov.au for assistance in programming this.

Time Specification:

-l s_cpu, h_cpu specify soft and hard cpu time (usually) in seconds.

CPU time is real (elapsed) time multiplied by the number of cores (PEs).

-l s_rt, h_rt specify soft and hard real (elapsed) time in seconds.

Rt is the wallclock time for the job. *_rt is not used if CPU time is supplied; otherwise it is used to calculate CPU time and *_rt is then removed. The conversion of real time to CPU time is required in order to support Suspend/Resume.

You are advised to supply only the soft limit for CPU time and allow the OGE JSV to set the hard limit: h_cpu = s_cpu + 60secs.

If only the hard limit is set then the soft limit is: s_cpu = h_cpu.

If no time limit is set then the defaults are: s_cpu = 600secs, h_cpu = 660secs.

If only real time is supplied then the script will calculate CPU time as: s_cpu = number_of_cores x s_rt.

Upper limit for elapsed time is 24 hours for oprtnl/normal queues and 7 days for dmop/dm queues.
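As a rough illustration of the real-time to CPU-time conversion described above, the following sketch applies the two stated rules (s_cpu = number_of_cores x s_rt, then h_cpu = s_cpu + 60secs). The function name cpu_limits_from_rt is hypothetical and this is not the JSV script itself.

```shell
# cpu_limits_from_rt CORES S_RT_SECONDS   (hypothetical helper)
cpu_limits_from_rt() (
  cores=$1
  s_rt=$2
  s_cpu=$(( cores * s_rt ))   # s_cpu = number_of_cores x s_rt
  h_cpu=$(( s_cpu + 60 ))     # hard limit is the soft limit plus 60 seconds
  echo "s_cpu=${s_cpu},h_cpu=${h_cpu}"
)

cpu_limits_from_rt 4 600    # 4 cores, 10 minutes of elapsed time each
```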

When an SGE job reaches a soft limit, it is sent a signal that the job script can catch in order to take appropriate action: SIGUSR1 for s_rt and SIGXCPU for s_cpu. The hard limit signal is a kill, which cannot be caught.
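A job script could catch the two soft-limit signals along the following lines. This is a minimal sketch: the handler name on_soft_limit and the variable soft_limit_hit are illustrative only, and a real job would replace the message with its own checkpoint or clean-up steps.

```shell
soft_limit_hit=""

# Record which soft limit fired; a real job would save its state here.
on_soft_limit() {
  soft_limit_hit=$1
  echo "soft limit $1 reached, finishing up before the hard limit" >&2
}

trap 'on_soft_limit s_rt' USR1    # SIGUSR1: s_rt reached
trap 'on_soft_limit s_cpu' XCPU   # SIGXCPU: s_cpu reached
```

Note that the shell runs a trap handler only after the command it is currently waiting for has returned, so a long-running foreground command can delay the handler.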

Note: there are a couple of ways of specifying time, from man sge_types:

time_specifier

A time specifier either consists of a positive decimal, hexadecimal or octal integer constant, in which case the value is interpreted to be in seconds, or is built by 3 decimal integer numbers separated by colon signs where the first number counts the hours, the second the minutes and the third the seconds. If a number would be zero it can be left out but the separating colon must remain (e.g. 1:0:1 = 1::1 means 1 hours and 1 second).
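To check what a given time specifier means in seconds, a small helper such as the one below can be used. It is a sketch only: the name time_spec_to_seconds is hypothetical, and it handles plain decimal seconds and the h:m:s form but not the hexadecimal or octal constants mentioned in the man page.

```shell
# time_spec_to_seconds SPEC   (hypothetical helper)
# Converts "seconds" or "h:m:s" (empty fields count as zero) to seconds.
time_spec_to_seconds() {
  printf '%s\n' "$1" | awk -F: '
    NF == 1 { print $1 + 0; next }                     # plain decimal seconds
    NF == 3 { print $1 * 3600 + $2 * 60 + $3; next }   # hours:minutes:seconds
    { exit 1 }                                         # anything else is rejected
  '
}

time_spec_to_seconds 1::1        # same as 1:0:1
time_spec_to_seconds 00:10:00
```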

Examples:

Two ways of setting these parameters are:

Within the job deck:

#$ -l s_cpu=1000 # cpu time in seconds

#$ -l s_vmem=1G # memory in GBytes

#$ -P Project_Name

As part of the qsub line:

qsub -l s_cpu=1:00:00 -l s_vmem=10G -P Project_Name yourjob

Here is an example of issues around requesting memory per process:

Requesting 5 GB for 12 processes will request 60 GB overall!

If you want to use partial nodes, e.g. using 2+2 processes on two nodes in order to have 30 GB available for each process (4 processes * 30 GB = 120 GB memory on two nodes), you have to request 24 cores, and 5 GB (per process).

Then use appropriate options for mpirun (--npersocket or --npernode) to distribute the executables appropriately.

While technically OGE can take memory requirements into account automatically (e.g. requesting 30 GB per process and 4 cores will give you two nodes), the resulting potentially inefficient node usage is difficult to detect. For example, accidentally requesting 10 GB per process for a 120 core job will not give you 10 nodes as you might expect, but 20 nodes. This is not obvious in qstat output, which only lists the slots, not the nodes used.
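The node count implied by a per-process memory request can be estimated before submitting. The sketch below assumes 12 cores and 60 GB of usable memory per node; both figures are inferred from the examples on this page rather than taken from the scheduler configuration. The function name nodes_needed is hypothetical.

```shell
# nodes_needed SLOTS GB_PER_PROCESS [CORES_PER_NODE] [GB_PER_NODE]   (hypothetical helper)
nodes_needed() (
  slots=$1
  gb_per_proc=$2
  cores_per_node=${3:-12}   # assumption: 12 cores per node
  gb_per_node=${4:-60}      # assumption: 60 GB usable per node
  # Processes that fit on one node by memory, capped by the core count.
  per_node=$(( gb_per_node / gb_per_proc ))
  if [ "$per_node" -gt "$cores_per_node" ]; then
    per_node=$cores_per_node
  fi
  if [ "$per_node" -lt 1 ]; then
    echo "a single process does not fit on one node" >&2
    exit 1
  fi
  # Round up to whole nodes.
  echo $(( (slots + per_node - 1) / per_node ))
)

nodes_needed 120 10   # 10 GB per process, 120 slots
nodes_needed 4 30     # 30 GB per process, 4 slots
```

Under these assumptions the two calls print 20 and 2, matching the node counts quoted above.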

Some general examples (assuming that no OGE parameters are specified in the job script):

  1. qsub -P pr_access_a -l s_cpu=00:10:00 -l h_vmem=1GB yourjob

--> s_cpu=600,h_cpu=660,h_vmem=1024M, s_vmem=1073741824,mem_free=1073741824

A hard limit giving you 60 seconds more CPU time was added, and s_vmem and mem_free were set for you.

  2. qsub -P pr_access_a -pe mpi 3 -l h_rt=00:10:00 -l mem_free=6GB yourjob

--> h_cpu=1800,s_cpu=1800, s_vmem=6442450944,h_vmem=6800364885,mem_free=6144M

The real time limit is converted into 30 minutes of CPU time (since you requested 3 processes). The mem_free value was converted into a 6GB soft limit, and a 6.333 GB hard limit (giving you one additional GB overall for your 3 processes).

  3. qsub -pe mpi 12 -l s_cpu=12:0:0 -l h_vmem=5GB yourjob

--> s_cpu=7200,h_cpu=7920, h_vmem=268435455,s_vmem=178956970,mem_free=268435455

Since the user did not specify a project, the runtime is limited to 10 minutes, or 12*600 = 7200 seconds CPU time. The memory limit is set to 2 GB overall, or 2048M/12 = 170.6 MB per process; the hard limit allows one additional GB overall, which is 1GB/12 = 85 MB per process.

Resource and Project specification for qrsh commands within OGE batch jobs:

Qrsh commands run inside OGE batch jobs get a default project name of general and minimum default resource allocation, see above.

You can either set the project name and resource allocation in the qrsh command, or use a special command, /opt/bom/bin/get_job_resources, to get the calling OGE job specification. This command allows qrsh commands to inherit the calling OGE job's specification, i.e. its project and its time and memory limits. It returns, for example: -l h_cpu=10,s_cpu=10,h_rt=10,s_rt=20 -P project_name

The usage is: qrsh /opt/bom/bin/get_job_resources -q [<script>|]

You should use this command to ensure the qrsh command inherits the calling job resources and project name rather than using the default.

This applies particularly to multi-user scripts such as mars.sh.

The get_job_resources command is ignored for interactive use of qrsh and you need to specify the resources.

If you specify a resource in qrsh along with get_job_resources then standard OGE precedence applies, i.e. to override the get_job_resources output you add resources afterwards in the command, e.g. qrsh /opt/bom/bin/get_job_resources -l s_rt=00:10:00


[azs, 12/9/2013] Copy from Ngamai_-JSV_and_Project_Names_Advice-_27Aug13.pdf
