This issue was discovered after CTSM updated ccs_config_cesm to version ccs_config_cesm0.0.92 (ESCOMP/CTSM#2416). Since then, CTSM test cases using the 5x5_amazon resolution have been failing to run with the following error:
cesm.log
dec0417.hsn.de.hpc.ucar.edu 4: <65-65> is invalid
dec0417.hsn.de.hpc.ucar.edu 4: libnuma: Warning: cpu argument 65-65 is out of range
dec0417.hsn.de.hpc.ucar.edu 4:
dec0417.hsn.de.hpc.ucar.edu 4: usage: numactl [--all | -a] [--balancing | -b] [--interleave= | -i <nodes>]
dec0417.hsn.de.hpc.ucar.edu 4: [--preferred= | -p <node>] [--physcpubind= | -C <cpus>]
dec0417.hsn.de.hpc.ucar.edu 4: [--cpunodebind= | -N <nodes>] [--membind= | -m <nodes>]
dec0417.hsn.de.hpc.ucar.edu 4: [--localalloc | -l] command args ...
dec0417.hsn.de.hpc.ucar.edu 4: numactl [--show | -s]
dec0417.hsn.de.hpc.ucar.edu 4: numactl [--hardware | -H]
dec0417.hsn.de.hpc.ucar.edu 4: numactl [--length | -L <length>] [--offset | -o <offset>] [--shmmode | -M <shmmode>]
dec0417.hsn.de.hpc.ucar.edu 4: [--strict | -t]
dec0417.hsn.de.hpc.ucar.edu 4: [--shmid | -I <id>] --shm | -S <shmkeyfile>
dec0417.hsn.de.hpc.ucar.edu 4: [--shmid | -I <id>] --file | -f <tmpfsfile>
dec0417.hsn.de.hpc.ucar.edu 4: [--huge | -u] [--touch | -T]
dec0417.hsn.de.hpc.ucar.edu 4: memory policy [--dump | -d] [--dump-nodes | -D]
dec0417.hsn.de.hpc.ucar.edu 4:
dec0417.hsn.de.hpc.ucar.edu 4: memory policy is --interleave | -i, --preferred | -p, --membind | -m, --localalloc | -l
dec0417.hsn.de.hpc.ucar.edu 4: <nodes> is a comma delimited list of node numbers or A-B ranges or all.
dec0417.hsn.de.hpc.ucar.edu 4: Instead of a number a node can also be:
dec0417.hsn.de.hpc.ucar.edu 4: netdev:DEV the node connected to network device DEV
dec0417.hsn.de.hpc.ucar.edu 4: file:PATH the node the block device of path is connected to
dec0417.hsn.de.hpc.ucar.edu 4: ip:HOST the node of the network device host routes through
dec0417.hsn.de.hpc.ucar.edu 4: block:PATH the node of block device path
dec0417.hsn.de.hpc.ucar.edu 4: pci:[seg:]bus:dev[:func] The node of a PCI device
dec0417.hsn.de.hpc.ucar.edu 4: <cpus> is a comma delimited list of cpu numbers or A-B ranges or all
dec0417.hsn.de.hpc.ucar.edu 4: all ranges can be inverted with !
dec0417.hsn.de.hpc.ucar.edu 4: all numbers and ranges can be made cpuset-relative with +
dec0417.hsn.de.hpc.ucar.edu 4: the old --cpubind argument is deprecated.
dec0417.hsn.de.hpc.ucar.edu 4: use --cpunodebind or --physcpubind instead
dec0417.hsn.de.hpc.ucar.edu 4: use --balancing | -b to enable Linux kernel NUMA balancing
dec0417.hsn.de.hpc.ucar.edu 4: for the process if it is supported by kernel
dec0417.hsn.de.hpc.ucar.edu 4: <length> can have g (GB), m (MB) or k (KB) suffixes
dec0417.hsn.de.hpc.ucar.edu 3: <64-64> is invalid
dec0417.hsn.de.hpc.ucar.edu 3: libnuma: Warning: cpu argument 64-64 is out of range
dec0417.hsn.de.hpc.ucar.edu 3:
dec0417.hsn.de.hpc.ucar.edu 3: usage: numactl [--all | -a] [--balancing | -b] [--interleave= | -i <nodes>]
dec0417.hsn.de.hpc.ucar.edu 3: [--preferred= | -p <node>] [--physcpubind= | -C <cpus>]
dec0417.hsn.de.hpc.ucar.edu 3: [--cpunodebind= | -N <nodes>] [--membind= | -m <nodes>]
dec0417.hsn.de.hpc.ucar.edu 3: [--localalloc | -l] command args ...
dec0417.hsn.de.hpc.ucar.edu 3: numactl [--show | -s]
dec0417.hsn.de.hpc.ucar.edu 3: numactl [--hardware | -H]
dec0417.hsn.de.hpc.ucar.edu 3: numactl [--length | -L <length>] [--offset | -o <offset>] [--shmmode | -M <shmmode>]
dec0417.hsn.de.hpc.ucar.edu 3: [--strict | -t]
dec0417.hsn.de.hpc.ucar.edu 3: [--shmid | -I <id>] --shm | -S <shmkeyfile>
dec0417.hsn.de.hpc.ucar.edu 3: [--shmid | -I <id>] --file | -f <tmpfsfile>
dec0417.hsn.de.hpc.ucar.edu 3: [--huge | -u] [--touch | -T]
dec0417.hsn.de.hpc.ucar.edu 3: memory policy [--dump | -d] [--dump-nodes | -D]
dec0417.hsn.de.hpc.ucar.edu 3:
dec0417.hsn.de.hpc.ucar.edu 3: memory policy is --interleave | -i, --preferred | -p, --membind | -m, --localalloc | -l
dec0417.hsn.de.hpc.ucar.edu 3: <nodes> is a comma delimited list of node numbers or A-B ranges or all.
dec0417.hsn.de.hpc.ucar.edu 3: Instead of a number a node can also be:
dec0417.hsn.de.hpc.ucar.edu 3: netdev:DEV the node connected to network device DEV
dec0417.hsn.de.hpc.ucar.edu 3: file:PATH the node the block device of path is connected to
dec0417.hsn.de.hpc.ucar.edu 3: ip:HOST the node of the network device host routes through
dec0417.hsn.de.hpc.ucar.edu 3: block:PATH the node of block device path
dec0417.hsn.de.hpc.ucar.edu 3: pci:[seg:]bus:dev[:func] The node of a PCI device
dec0417.hsn.de.hpc.ucar.edu 3: <cpus> is a comma delimited list of cpu numbers or A-B ranges or all
dec0417.hsn.de.hpc.ucar.edu 3: all ranges can be inverted with !
dec0417.hsn.de.hpc.ucar.edu 3: all numbers and ranges can be made cpuset-relative with +
dec0417.hsn.de.hpc.ucar.edu 3: the old --cpubind argument is deprecated.
dec0417.hsn.de.hpc.ucar.edu 3: use --cpunodebind or --physcpubind instead
dec0417.hsn.de.hpc.ucar.edu 3: use --balancing | -b to enable Linux kernel NUMA balancing
dec0417.hsn.de.hpc.ucar.edu 3: for the process if it is supported by kernel
dec0417.hsn.de.hpc.ucar.edu 3: <length> can have g (GB), m (MB) or k (KB) suffixes
dec0417.hsn.de.hpc.ucar.edu: rank 3 exited with code 1
dec0417.hsn.de.hpc.ucar.edu: rank 0 died from signal 15
Thanks for reporting this. This is actually due to the PBS select line: you are only requesting 5 CPUs. Under these circumstances, PBS creates a Linux cgroup with only 5 CPUs, all on the first socket. The mpibind script tries to bind processes across both sockets to give your job full memory bandwidth, but core numbers greater than 4 don't exist in the PBS cgroup, hence this failure. So, to get your case to run immediately, try rerunning with 128 CPUs and 5 MPI ranks in the select line.
In general, regardless of how many CPUs you intend to use, you should always request 128 on a derecho node so that you have access to full memory performance.
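For illustration, a request for 128 CPUs and 5 MPI ranks on one node might look like the directive this sketch prints. It assumes standard PBS Professional chunk syntax (`select`, `ncpus`, `mpiprocs`); the helper name `pbs_select` is hypothetical, and in a CESM case the select line is normally generated by the case scripts rather than written by hand.

```python
def pbs_select(nodes: int = 1, ncpus: int = 128, mpiprocs: int = 5) -> str:
    """Build a PBS Pro select statement (hypothetical helper, for illustration).

    ncpus is the number of CPUs requested per node; mpiprocs is the number
    of MPI ranks started per node.
    """
    return f"select={nodes}:ncpus={ncpus}:mpiprocs={mpiprocs}"


# Request the whole node (128 CPUs) while still running only 5 MPI ranks.
directive = "#PBS -l " + pbs_select()
print(directive)
```

Requesting all 128 CPUs keeps both sockets inside the job's cgroup, so a binding that spreads ranks over both sockets can succeed.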
On the mpibind side, I'll add some code to catch this type of request, and exit gracefully with a more meaningful error message.
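A check of that kind could be sketched roughly as below. This is not the actual mpibind code: the function names and the block-wise spreading rule are assumptions, chosen only because they reproduce the cores seen in the log (rank 3 on core 64, rank 4 on core 65) for 5 ranks on a node with 2 sockets of 64 cores each.

```python
import math


def spread_cores(n_ranks, n_sockets=2, cores_per_socket=64):
    """Pick one core per rank, splitting ranks into equal blocks per socket.

    Assumed layout (not taken from mpibind): socket s owns cores
    s*cores_per_socket .. (s+1)*cores_per_socket - 1.
    """
    per_socket = math.ceil(n_ranks / n_sockets)
    cores = []
    for rank in range(n_ranks):
        socket, slot = divmod(rank, per_socket)
        cores.append(socket * cores_per_socket + slot)
    return cores


def binding_error(cores, allowed):
    """Return a readable error message if any core is outside `allowed`, else None."""
    missing = sorted(set(cores) - set(allowed))
    if not missing:
        return None
    return (
        f"cannot bind to cores {missing}: the job's cpuset only contains "
        f"{len(set(allowed))} CPUs; request the full node in the select line"
    )


# 5 ranks spread over both sockets, but the cgroup only holds CPUs 0-4.
cores = spread_cores(5)
message = binding_error(cores, allowed=range(5))
print(cores)
print(message)
```

On Linux, the set of CPUs actually available to a process can be read with `os.sched_getaffinity(0)`, which could supply the `allowed` argument in place of the hard-coded range used here.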