Choose thread affinities more cleverly #887
Comments
That's not always true. There is no generic affinity definition which is optimal for all applications.
Yes, this is correct and an acknowledged problem (see #421). Solving it is difficult because there is no general way to figure out the sequence number of a process on a particular node (I think only SLURM allows extracting this information from the environment). If an application instance wants to determine which cores it can use, it needs to know how many other instances of this application are running on the same node and in what order those instances are supposed to be assigned to the PUs.
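As a hedged illustration of the SLURM case mentioned above: SLURM publishes the node-local task index in the `SLURM_LOCALID` environment variable, so a runtime could read it roughly like this (the function name is hypothetical, not an HPX API):

```cpp
#include <cstdlib>

// Hypothetical helper: read the node-local task index that SLURM
// exports as SLURM_LOCALID. Returns -1 when not launched via SLURM,
// in which case no generic fallback exists (the point of this issue).
int node_local_sequence_number() {
    const char* id = std::getenv("SLURM_LOCALID");
    return id != nullptr ? std::atoi(id) : -1;
}
```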
Again, I wouldn't know how to detect this.
Could you elaborate on how this could be done, please? |
Regarding affinity: yes, there is no layout that is ideal for all applications. However, consider e.g. running on an Intel MIC (60 cores) with 120 threads. HPX currently places 4 threads on each of the first 30 cores, and no threads on the other 30 cores. In none of the HPC talks I've seen was this configuration considered. Instead, for maximum performance, people use all cores and then vary the number of threads running per core (1, 2, 3, or 4). I suggest a command-line option to switch to this behaviour without having to specify the PU stepping manually; specifying the PU stepping by hand requires taking the number of threads and number of cores into account for each run. One would calculate #pus-per-core as #threads/#cores (rounding up). Alternatively, one would distribute threads over cores round-robin, possibly renumbering the threads afterwards to ensure the threads on the same core are numbered consecutively.

Regarding MPI: one can use MPI_Get_processor_name to find out which MPI ranks share a node. Within the node, one can order them by rank (MPI_Comm_rank). (Without MPI, the hostname could be used, or the IP address obtained from ifconfig.) |
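The round-robin distribution with consecutive renumbering proposed above can be sketched as follows (a minimal illustration under the assumptions stated in the comment, not HPX's actual implementation; the function name is made up):

```cpp
#include <vector>

// Sketch of the proposed "scatter" layout: distribute threads over
// cores round-robin, then renumber so that threads sharing a core get
// consecutive indices. Returns the core index for each thread.
std::vector<int> scatter_cores(int num_threads, int num_cores) {
    // First pass: round-robin places ceil(num_threads/num_cores)
    // threads on the leading cores and floor(...) on the rest.
    std::vector<int> per_core(num_cores, 0);
    for (int t = 0; t < num_threads; ++t)
        ++per_core[t % num_cores];

    // Second pass: renumber so threads on one core are consecutive.
    std::vector<int> core_of;
    core_of.reserve(num_threads);
    for (int c = 0; c < num_cores; ++c)
        for (int i = 0; i < per_core[c]; ++i)
            core_of.push_back(c);
    return core_of;
}
```

For 120 threads on 60 cores this yields two threads per core; for 16 threads it yields one thread on each of the first 16 cores instead of four threads on each of the first four.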
With regard to multiple localities per node, please join the discussion at #421. For the other topic discussed here I suggest extending the --hpx:bind syntax to include the following:
This naming scheme and distribution come from Intel's KMP_AFFINITY settings. I think this is a good way to express the user's intent concisely. |
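In that scheme (names borrowed from KMP_AFFINITY; this is an illustrative sketch of the two placement policies, not HPX code), the two layouts differ only in the index arithmetic:

```cpp
// compact: fill all PUs of one core before moving to the next core.
int compact_core(int thread, int pus_per_core) {
    return thread / pus_per_core;
}

// scatter: one thread per core first, wrapping around afterwards.
int scatter_core(int thread, int num_cores) {
    return thread % num_cores;
}
```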
- Implemented --hpx:bind=compact and --hpx:bind=scatter
- Beautified the output of --hpx:print-bind to be human-parsable
I measured the usage of the threads on the Xeon Phi on Stampede using mpstat while running with --hpx:bind=compact and scatter. Both behave as described for scatter. So can you explain a little better how to use this? I thought that all I needed to do was specify the number of OS threads to use (-t 120, for example) and then add --hpx:bind-scatter. What should that give me? What I see is threads 1-120 being used, which means it uses four threads on each of the first thirty cores. |
Here is a short example using -t16 --hpx:bind-scatter. Output from mpstat -P ALL (using a 1-second interval); notice the idle %, which is the last column: four threads per core are used for the first 4 cores.
Linux 2.6.38.8-g5f2543d (c557-701-mic0.stampede.tacc.utexas.edu) 09/27/13 k1om (244 CPU)
18:47:10 CPU %usr %nice %sys %iowait %irq %soft %steal %guest %idle |
Pat, could you please post the output you get from adding these two command-line arguments to your application:
and
|
/work/02466/pagrubel/build/hpx_build0927mic/bin/hpx_homogeneous_timed_task_spawn -t16 --hpx:queuing=static hpx:bind=scatter
/work/02466/pagrubel/build/hpx_build0927mic/bin/hpx_homogeneous_timed_task_spawn -t16 --hpx:queuing=static hpx:bind=compact |
Looks like you missed the dashes before hpx:bind
|
Ugh! Mea culpa! That worked! I wonder if there is a possibility to limit the number of cores and use scatter such that you can run with a certain number of threads per core. |
Now scatter does not work. Example: /work/02466/pagrubel/build/hpx_build0929mic/bin/hpx_homogeneous_timed_task_spawn -t16 --hpx:queuing=static --hpx:bind=scatter |
Sorry, but now balanced is messed up. |
Everything works as expected now. |
--hpx:bind=balanced hangs on my machine now. What was wrong with the previous implementation? |
When there are fewer threads than PUs, HPX assigns OS threads to PUs in order. If there are several PUs per core and the number of threads equals the number of cores, it is likely more efficient to use only one PU per core instead of leaving some cores idle.
With the MPI parcelport, when several localities run on the same node, HPX assigns threads of different localities to the same PUs.
This should be changed to make efficient job startup simpler. I also want to argue that, in the MPI case, HPX should loudly warn when this happens so that the user is not surprised by the bad performance. If the default behaviour is not to be changed (I hear this was a design decision), then I suggest adding a new command-line option to calculate pu-offset and pu-step automatically.
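The automatic calculation suggested above could build on the MPI approach mentioned earlier in the thread: gather every rank's MPI_Get_processor_name result (e.g. via MPI_Allgather) and count how many lower ranks report the same host. A hedged sketch of just the counting step, with the MPI plumbing left out and a hypothetical function name:

```cpp
#include <string>
#include <vector>

// Given the hostname of every MPI rank (index == rank, e.g. as
// gathered with MPI_Get_processor_name + MPI_Allgather), return the
// node-local sequence number of 'rank': the count of lower ranks on
// the same host. This number could feed an automatic pu-offset.
int node_local_rank(const std::vector<std::string>& host_of_rank, int rank) {
    int local = 0;
    for (int r = 0; r < rank; ++r)
        if (host_of_rank[r] == host_of_rank[rank])
            ++local;
    return local;
}
```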