likwid-mpirun: enable simple pinning for MPI and hybrid MPI/threaded applications
Pinning to dedicated compute resources is important for pure MPI applications and even more so for hybrid MPI/threaded applications. While all major MPI implementations include their own pinning mechanisms, likwid-mpirun provides a simple and portable solution based on the powerful capabilities of likwid-pin. It is still experimental at the moment, but it can be adapted to any MPI and OpenMP combination with the help of a tuning application in the test directory of LIKWID. likwid-mpirun works in conjunction with PBS, LoadLeveler and SLURM. The tested compilers and MPI implementations are the Intel C/C++ compiler, GCC, Intel MPI and OpenMPI. Support for mvapich is untested.
As usual you can get a help message with
$ likwid-mpirun -h
You always have to specify the total number of MPI processes with the -np option.
Two cases are distinguished: Pure MPI and hybrid applications.
$ likwid-mpirun -np 16 ./a.out
This will start 16 processes. The number of processes per compute node is calculated from the PBS/LoadLeveler/SLURM node file: if two hosts are given, eight processes per node are pinned to cores/SMT threads. The pinning is implemented with the likwid-pin node domain.
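The per-node count described above is a division of the total process count by the number of distinct hosts. The following is a minimal sketch of that arithmetic only, not likwid-mpirun's actual code. The function name procs_per_node is hypothetical, and the node file is assumed to contain one hostname per line (hosts may repeat).

```shell
#!/bin/sh
# Illustration only: reproduces the arithmetic, not likwid-mpirun's implementation.
# procs_per_node NP NODEFILE
#   NP       total number of MPI processes (the value given to -np)
#   NODEFILE file with one hostname per line (assumed format, hosts may repeat)
procs_per_node() {
    np=$1
    # Count the distinct hostnames in the node file.
    hosts=$(sort -u "$2" | wc -l | tr -d ' ')
    # Integer division: processes are spread evenly over the hosts.
    echo $(( np / hosts ))
}

# Hypothetical node file listing two hosts.
nodefile=$(mktemp)
printf 'host1\nhost2\n' > "$nodefile"
procs_per_node 16 "$nodefile"
rm -f "$nodefile"
```

With 16 processes and two hosts in the node file, the sketch prints 8, matching the example above.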
Pure MPI with explicit pinning:
$ likwid-mpirun -np 16 -nperdomain S:2 ./a.out
For this case a single option, -nperdomain, covers all cases. The argument contains a domain character, as already known from the other LIKWID applications, and the number of processes per domain, separated by a colon. The above example will start two processes per socket, up to 16 processes in total, and will pin the processes with likwid-pin.
Domains can be:
- N - for node
- S - for socket
- C - for last level shared cache
- M - for NUMA domain (interesting e.g. for AMD Magny Cours)
For pinning on Magny Cours the following can be useful:
$ likwid-mpirun -np 16 -nperdomain M:2 ./a.out
This will start 2 processes per NUMA domain. On a two-socket AMD Magny Cours system (four NUMA domains per node) this results in 8 processes per node, with two nodes in total for this run.
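The node count in this example follows from simple arithmetic: processes per node is the number of domains per node times the count given to -nperdomain, and the number of nodes is the total process count divided by that, rounded up. The sketch below only illustrates this calculation. The function name nodes_needed is hypothetical, and the value of four NUMA domains per node is the one implied by the 8 processes per node stated above.

```shell
#!/bin/sh
# Illustration only: the arithmetic behind -nperdomain, not likwid-mpirun's code.
# nodes_needed NP DOMAINS_PER_NODE PER_DOMAIN
nodes_needed() {
    np=$1
    per_node=$(( $2 * $3 ))          # processes started on each node
    # Round up so that a partially filled last node is still counted.
    echo $(( (np + per_node - 1) / per_node ))
}

# 16 processes, 4 NUMA domains per node (assumed), 2 processes per domain.
nodes_needed 16 4 2
```

For the values of the example (16 processes, M:2, four NUMA domains per node) the sketch prints 2.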
For debugging use the debug option:
$ likwid-mpirun -debug -np 16 -nperdomain M:2 ./a.out
This will output all commands that would be executed.
Pinning of hybrid applications:
$ likwid-mpirun -np 16 -pin S0:0,1_S1:0,1 ./a.out
Hybrid pinning has only one option covering all possibilities: -pin. The argument string to -pin consists of valid likwid-pin expressions separated by underscores. The number of expressions denotes the number of processes started per node. The above example will start two processes per node. The first process and its two threads will be pinned to the first socket (S0), cores 0 and 1. The second process and its threads will be pinned to the second socket (S1), cores 0 and 1. Consequently, with 16 processes and two processes per node, the above statement requires 8 hosts to run.
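To make the structure of the -pin argument concrete, the following sketch splits such a string at the underscores and counts the resulting expressions, which equals the number of processes per node. It is an illustration of the syntax described above, not the parser used by likwid-mpirun. The function names list_pin_exprs and count_procs_per_node are hypothetical.

```shell
#!/bin/sh
# Illustration only: splits a -pin argument as described in the text.
# list_pin_exprs PINSTRING -> prints one likwid-pin expression per line
list_pin_exprs() {
    printf '%s\n' "$1" | tr '_' '\n'
}

# count_procs_per_node PINSTRING -> number of underscore-separated expressions
count_procs_per_node() {
    list_pin_exprs "$1" | wc -l | tr -d ' '
}

pin='S0:0,1_S1:0,1'
list_pin_exprs "$pin"
echo "processes per node: $(count_procs_per_node "$pin")"
```

For the pin string of the example this lists the two expressions S0:0,1 and S1:0,1 and reports two processes per node.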
The main pinning complexity is that the OpenMP implementation as well as the MPI implementation may start their own threads for management purposes. These threads need to be skipped, and their position among the started threads has to be determined in advance. For the tested MPI and compiler combinations, the skip masks are integrated into likwid-mpirun.
At the moment all pinning uses block distribution. Round-robin variants for node and global distribution are planned.
-h, --help              Help message
-v, --version           Version information
-d, --debug             Debugging output
-n/-np <count>          Set the number of processes
-nperdomain <domain>    Set the number of processes per node by giving an affinity domain and count
-pin <list>             Specify pinning of threads. CPU expressions like likwid-pin separated with '_'
-s, --skip <hex>        Bitmask with threads to skip
-mpi <id>               Specify which MPI should be used. Possible values: openmpi, intelmpi and mvapich2
                        If not set, module system is checked
-omp <id>               Specify which OpenMP should be used. Possible values: gnu and intel
                        Only required for statically linked executables.
-hostfile               Use custom hostfile instead of searching the environment
-g/-group <perf>        Set a likwid-perfctr conform event set for measuring on nodes
-m/-marker              Activate marker API mode
MPI not recognized
likwid-mpirun checks for some known MPI implementations (OpenMPI, IntelMPI and Mvapich2) in the file system and the module system. It searches for executables such as mpiexec in a path that can be given in the environment variable MPI_BASE. If it does not find an MPI implementation, set it on the command line with -mpi [openmpi, intelmpi, mvapich2 or slurm].
If you are running in a batch job environment that is supported by likwid-mpirun the hosts are read from the batch system. In cases where you run it interactively or in an unsupported batch job environment, you have to generate a valid hostfile for likwid-mpirun. The syntax is very simple: List a hostname as many times as the host has slots.
localhost
localhost
localhost
host1
host2
host2
There are three slots on localhost, one slot on host1 and two slots on host2.
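Since the hostfile format is just one hostname per slot, it is easy to generate and check with standard tools. The sketch below writes a hostfile from HOST:SLOTS pairs and then counts the slots per host. The function names write_hostfile and count_slots and the HOST:SLOTS notation are hypothetical helpers for this illustration and are not part of LIKWID.

```shell
#!/bin/sh
# write_hostfile FILE HOST:SLOTS... -> writes each HOST SLOTS times, one per line
write_hostfile() {
    out=$1; shift
    : > "$out"                       # start with an empty file
    for spec in "$@"; do
        host=${spec%%:*}             # part before the colon
        slots=${spec##*:}            # part after the colon
        i=0
        while [ "$i" -lt "$slots" ]; do
            echo "$host" >> "$out"
            i=$(( i + 1 ))
        done
    done
}

# count_slots FILE -> prints "<slots> <host>" for every host in the file
count_slots() {
    sort "$1" | uniq -c | awk '{ print $1, $2 }'
}

hostfile=$(mktemp)
write_hostfile "$hostfile" localhost:3 host1:1 host2:2
count_slots "$hostfile"
rm -f "$hostfile"
```

The generated file has the same six entries as the example hostfile, and it can be passed to likwid-mpirun with the -hostfile option listed above.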
Performance measurements of MPI and hybrid applications
Besides the correct pinning of MPI processes and their threads, the application execution can be measured using likwid-perfctr. When a performance group or custom event set is given on the command line, the call of likwid-pin is replaced by a call of likwid-perfctr. Currently, you can perform end-to-end measurements and measure instrumented code using the LIKWID Marker API.
Measure the double-precision floating-point operations used by all participating systems running a hybrid application with one MPI process per socket and 10 threads per MPI process:
$ likwid-mpirun -pin S0:0-9_S1:0-9 -g FLOPS_DP ./a.out
Measure the energy used by all participating systems running one process per socket:
$ likwid-mpirun -nperdomain S:1 -g ENERGY ./a.out
likwid-mpirun is intelligent enough to measure socket-wide performance counters on only one CPU per socket. The other CPUs skip reading these hardware registers and read only the core-local performance counters.
When measuring is activated, no overloading of the hosts is allowed. Multiple processes would read the hardware performance counters so that the final results wouldn't be valid anymore. There are plans to substitute likwid-perfctr with likwid-pin for the overloaded processes.
Using likwid-mpirun with SLURM job scheduler
likwid-mpirun is able to run applications through SLURM.
$ salloc -N X
$ likwid-mpirun -np 2 ./a.out
likwid-mpirun recognizes the SLURM environment and calls srun instead of mpirun. You can see the srun command when using the -d command line switch. Some MPI implementations require special parameters, and there is currently no way to add custom options to srun. One common switch is --mpi=pmi2 (at least on our cluster). You can either change the Lua code (likwid-4.3.3: cp $(which likwid-mpirun) .; vi -n 592 likwid-mpirun; ./likwid-mpirun ...) or set the environment variable SLURM_MPI_TYPE=pmi2 before running likwid-mpirun.
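The environment-variable route can be sketched as follows. SLURM_MPI_TYPE is read by srun and has the same effect as passing --mpi on its command line. The sketch only prints the likwid-mpirun call instead of executing it, so it can be tried outside of an allocation. The process count and the binary name are taken from the example above.

```shell
#!/bin/sh
# Select the PMI2 plugin for every srun started from this shell.
# srun reads SLURM_MPI_TYPE as the default for its --mpi option.
export SLURM_MPI_TYPE=pmi2

# Print the command instead of running it, so the sketch works without an allocation.
echo "SLURM_MPI_TYPE=${SLURM_MPI_TYPE} likwid-mpirun -np 2 ./a.out"
```

Inside a real allocation, replace the echo line with the likwid-mpirun call itself.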
In some rare cases it might be required to use the MPI implementation specific way of starting applications (mpirun, ...). You can force this by using the --mpi command line switch.