Choose thread affinities more cleverly #887

Closed
eschnett opened this issue Sep 26, 2013 · 13 comments
Labels
category: init, difficulty: easy (Good issues for starting out with HPX development), type: enhancement

Comments

@eschnett
Contributor

When there are fewer threads than PUs, HPX assigns OS threads to PUs in order. If there are several PUs per core and the number of threads equals the number of cores, it is likely to be more efficient to use only one PU per core instead of leaving some cores idle.

With the MPI parcelport, when several localities run on the same node, HPX assigns threads of different localities to the same PUs.

This should be changed to make efficient job startup simpler. I also want to argue that, in the MPI case, HPX should loudly warn when this happens so that the user is not surprised by the bad performance. If the default behaviour should not be changed (I hear this was a design decision), then I suggest adding a new command line option to calculate pu-offset and pu-step automatically.

@ghost ghost assigned hkaiser Sep 26, 2013
@hkaiser
Member

hkaiser commented Sep 27, 2013

When there are fewer threads than PUs, HPX assigns OS threads to PUs in order. If there are several PUs per core and the number of threads equals the number of cores, it is likely to be more efficient to use only one PU per core instead of leaving some cores idle.

That's not always true. There is no generic affinity definition which is optimal for all applications.

With the MPI parcelport, when several localities run on the same node, HPX assigns threads of different localities to the same PUs.

Yes, this is correct and an acknowledged problem (see #421). The difficulty in solving it is that there is no general way to figure out the sequence number of a process on a particular node (I think only SLURM allows extracting this information from the environment). If an application instance wants to determine what cores it can use, it needs to know how many other instances of the application are running on the same node and in what order those instances are supposed to be assigned to the PUs.
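
(For illustration only, a minimal sketch of reading that piece of information under SLURM, assuming the processes were launched with srun, which exports the node-local task number as SLURM_LOCALID:)

    // Sketch: obtain the node-local sequence number of this process when
    // running under SLURM's srun, which sets SLURM_LOCALID. Returns -1 when
    // the variable is not present (i.e. the process was not started via srun).
    #include <cstdlib>
    #include <iostream>

    int node_local_rank_from_slurm()
    {
        char const* localid = std::getenv("SLURM_LOCALID");
        return localid != nullptr ? std::atoi(localid) : -1;
    }

    int main()
    {
        std::cout << "node-local rank: " << node_local_rank_from_slurm() << "\n";
    }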

This should be changed to make efficient job startup simpler. I also want to argue that,
in the MPI case, HPX should loudly warn when this happens so that the user is not surprised
by the bad performance.

Again, I wouldn't know how to detect this.

If the default behaviour should not be changed (I hear this was a design decision), then I suggest adding a new command line option to calculate pu-offset and pu-step automatically.

Could you elaborate on how this could be done, please?

@eschnett
Contributor Author

Regarding affinity: Yes, there is no layout that is ideal for all applications.

However, consider e.g. running on an Intel MIC (60 cores) with 120 threads. HPX currently places 4 threads on each of the first 30 cores and no threads on the other 30 cores. In all the HPC talks I've seen, this configuration was not considered. Instead, for maximum performance, people use all cores and then vary the number of threads running per core (1, 2, 3, or 4).

I suggest a command-line option to switch to this behaviour, without having to manually specify the PU stepping; specifying the PU stepping manually requires taking the number of threads and number of cores into account for each run. One would calculate #pus-per-core as #threads/#cores (rounding up). Alternatively, one would distribute threads over cores round-robin, possibly renumbering the threads afterwards to ensure the threads on the same core are numbered consecutively.
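
For illustration, a minimal sketch of the two placements just described (this is not HPX code; the thread and core counts are placeholders chosen for the MIC example):

    // Sketch: place num_threads worker threads on num_cores cores. Shown are
    // (a) the required PUs per core, rounded up, (b) a round-robin placement
    // over cores, and (c) the same placement renumbered so that threads
    // sharing a core have consecutive numbers.
    #include <cstddef>
    #include <iostream>

    int main()
    {
        std::size_t const num_threads = 120, num_cores = 60;   // e.g. Intel MIC

        // (a) #pus-per-core = #threads / #cores, rounded up
        std::size_t const pus_per_core =
            (num_threads + num_cores - 1) / num_cores;
        std::cout << "PUs per core: " << pus_per_core << "\n";

        // (b) round-robin: thread t goes to core t % num_cores,
        //     using PU t / num_cores on that core
        for (std::size_t t = 0; t != num_threads; ++t)
            std::cout << "thread " << t << " -> core " << (t % num_cores)
                      << ", pu " << (t / num_cores) << "\n";

        // (c) renumbered: core c hosts q threads (q+1 for the first r cores)
        //     with consecutive numbers, where q = threads/cores, r = threads%cores
        std::size_t const q = num_threads / num_cores, r = num_threads % num_cores;
        for (std::size_t c = 0, t = 0; c != num_cores; ++c)
            for (std::size_t pu = 0, n = q + (c < r ? 1 : 0); pu != n; ++pu, ++t)
                std::cout << "thread " << t << " -> core " << c
                          << ", pu " << pu << "\n";
    }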

Regarding MPI: One can use MPI_Get_processor_name to find out which MPI ranks share a node. Within a node, one can order them by rank (MPI_Comm_rank). (Without MPI, the hostname could be used, or the IP address obtained from ifconfig.)
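
For illustration, a minimal sketch of this approach (again not HPX code): every rank gathers all processor names and counts how many lower-numbered ranks report the same name, which yields its node-local sequence number.

    // Sketch: derive a node-local sequence number from MPI alone.
    #include <mpi.h>
    #include <cstring>
    #include <iostream>
    #include <vector>

    int main(int argc, char** argv)
    {
        MPI_Init(&argc, &argv);

        int rank = 0, size = 0;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        char name[MPI_MAX_PROCESSOR_NAME] = {0};
        int len = 0;
        MPI_Get_processor_name(name, &len);

        // gather the processor names of all ranks
        std::vector<char> all(std::size_t(size) * MPI_MAX_PROCESSOR_NAME);
        MPI_Allgather(name, MPI_MAX_PROCESSOR_NAME, MPI_CHAR,
                      all.data(), MPI_MAX_PROCESSOR_NAME, MPI_CHAR,
                      MPI_COMM_WORLD);

        // node-local rank: number of lower ranks reporting the same name
        int node_local_rank = 0;
        for (int r = 0; r < rank; ++r)
            if (std::strcmp(&all[std::size_t(r) * MPI_MAX_PROCESSOR_NAME], name) == 0)
                ++node_local_rank;

        std::cout << "rank " << rank << " is locality " << node_local_rank
                  << " on " << name << "\n";

        MPI_Finalize();
        return 0;
    }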

@sithhell
Member

With regard to the multiple localities per node, please join the discussion at #421.

For the other topic discussed here I suggest extending the --hpx:bind syntax to include the following:

  • --hpx:bind=compact: This will be the default (It's what we currently have already). All worker threads fill up the physical cores first.
  • --hpx:bind=scatter: This option will assign the worker threads to cores round-robin, meaning that we use all physical cores first and then (if more threads are requested than there are physical cores) use the hyperthreads.
  • --hpx:bind=balanced: Same as scatter, but keeps the worker thread numbers consecutive.

This naming scheme and distribution come from Intel's KMP_AFFINITY settings. I think this is a good, concise way to express the user's intent.
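
For example, with this syntax in place, one could run on a 60-core Xeon Phi (the application name here is only a placeholder):

    my_hpx_app -t 120 --hpx:bind=scatter --hpx:print-bind

This should place two worker threads on each of the 60 cores and print the resulting bindings.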

sithhell added a commit that referenced this issue Sep 27, 2013
    - Implemented --hpx:bind=compact and --hpx:bind=scatter
    - Beautified the output of --hpx:print-bind to be human parsable
@pagrubel
Member

I measured the usage of the threads on the Xeon Phi on Stampede using mpstat while running with --hpx:bind=compact and scatter. They both behave as described for scatter, so can you explain a little better how to use this? I thought that all I needed to do was specify the number of OS threads (-t 120, for example) and then add --hpx:bind-scatter. What should that give me? What I see is threads 1-120 being used, which means four threads on each of the first thirty cores.

@pagrubel
Member

Here is a short example using -t16 --hpx:bind-scatter. The output is from mpstat -P ALL (using a 1-second interval); notice the idle % in the last column. Four threads per core are used for the first 4 cores.

Linux 2.6.38.8-g5f2543d (c557-701-mic0.stampede.tacc.utexas.edu) 09/27/13 k1om (244 CPU)

18:47:10 CPU %usr %nice %sys %iowait %irq %soft %steal %guest %idle
18:47:11 all 0.00 6.53 0.14 0.00 0.00 0.01 0.00 0.00 93.32
18:47:11 0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00
18:47:11 1 0.00 98.59 1.41 0.00 0.00 0.00 0.00 0.00 0.00
18:47:11 2 0.00 98.59 1.41 0.00 0.00 0.00 0.00 0.00 0.00
18:47:11 3 0.00 100.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
18:47:11 4 0.00 100.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
18:47:11 5 0.00 100.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
18:47:11 6 0.00 99.30 0.70 0.00 0.00 0.00 0.00 0.00 0.00
18:47:11 7 0.00 100.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
18:47:11 8 0.00 100.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
18:47:11 9 0.00 99.30 0.70 0.00 0.00 0.00 0.00 0.00 0.00
18:47:11 10 0.00 99.30 0.70 0.00 0.00 0.00 0.00 0.00 0.00
18:47:11 11 0.00 100.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
18:47:11 12 0.00 100.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
18:47:11 13 0.00 100.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
18:47:11 14 0.00 100.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
18:47:11 15 0.00 100.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
18:47:11 16 0.00 100.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
18:47:11 17 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00
18:47:11 18 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00
18:47:11 19 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00
18:47:11 20 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00
18:47:11 21 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00
CPUs 22 and above were the same as 21.

@hkaiser
Member

hkaiser commented Sep 28, 2013

Pat, could you please post the output you get from adding these 2 command line arguments to your application:

--hpx:bind=compact --hpx:print-bind 

and

--hpx:bind=scatter --hpx:print-bind

@pagrubel
Member

/work/02466/pagrubel/build/hpx_build0927mic/bin/hpx_homogeneous_timed_task_spawn -t16 --hpx:queuing=static hpx:bind=scatter
--hpx:print-bind --delay=10 --tasks=1000000
0: PU L#4(P#1) Core#0 L1d(32KB) L2d(512KB) Socket#0 Machine#0(7697MB)
1: PU L#5(P#2) Core#0 L1d(32KB) L2d(512KB) Socket#0 Machine#0(7697MB)
2: PU L#6(P#3) Core#0 L1d(32KB) L2d(512KB) Socket#0 Machine#0(7697MB)
3: PU L#7(P#4) Core#0 L1d(32KB) L2d(512KB) Socket#0 Machine#0(7697MB)
4: PU L#8(P#5) Core#1 L1d(32KB) L2d(512KB) Socket#0 Machine#0(7697MB)
5: PU L#9(P#6) Core#1 L1d(32KB) L2d(512KB) Socket#0 Machine#0(7697MB)
6: PU L#10(P#7) Core#1 L1d(32KB) L2d(512KB) Socket#0 Machine#0(7697MB)
7: PU L#11(P#8) Core#1 L1d(32KB) L2d(512KB) Socket#0 Machine#0(7697MB)
8: PU L#12(P#9) Core#2 L1d(32KB) L2d(512KB) Socket#0 Machine#0(7697MB)
9: PU L#13(P#10) Core#2 L1d(32KB) L2d(512KB) Socket#0 Machine#0(7697MB)
10: PU L#14(P#11) Core#2 L1d(32KB) L2d(512KB) Socket#0 Machine#0(7697MB)
11: PU L#15(P#12) Core#2 L1d(32KB) L2d(512KB) Socket#0 Machine#0(7697MB)
12: PU L#16(P#13) Core#3 L1d(32KB) L2d(512KB) Socket#0 Machine#0(7697MB)
13: PU L#17(P#14) Core#3 L1d(32KB) L2d(512KB) Socket#0 Machine#0(7697MB)
14: PU L#18(P#15) Core#3 L1d(32KB) L2d(512KB) Socket#0 Machine#0(7697MB)
15: PU L#19(P#16) Core#3 L1d(32KB) L2d(512KB) Socket#0 Machine#0(7697MB)
OS-threads,Tasks,Delay (micro-seconds),Total Walltime (seconds),Walltime per Task (seconds)
16, 1000000, 10, 11.2566, 1.12566e-05

/work/02466/pagrubel/build/hpx_build0927mic/bin/hpx_homogeneous_timed_task_spawn -t16 --hpx:queuing=static hpx:bind=compact
--hpx:print-bind --delay=10 --tasks=1000000
0: PU L#4(P#1) Core#0 L1d(32KB) L2d(512KB) Socket#0 Machine#0(7697MB)
1: PU L#5(P#2) Core#0 L1d(32KB) L2d(512KB) Socket#0 Machine#0(7697MB)
2: PU L#6(P#3) Core#0 L1d(32KB) L2d(512KB) Socket#0 Machine#0(7697MB)
3: PU L#7(P#4) Core#0 L1d(32KB) L2d(512KB) Socket#0 Machine#0(7697MB)
4: PU L#8(P#5) Core#1 L1d(32KB) L2d(512KB) Socket#0 Machine#0(7697MB)
5: PU L#9(P#6) Core#1 L1d(32KB) L2d(512KB) Socket#0 Machine#0(7697MB)
6: PU L#10(P#7) Core#1 L1d(32KB) L2d(512KB) Socket#0 Machine#0(7697MB)
7: PU L#11(P#8) Core#1 L1d(32KB) L2d(512KB) Socket#0 Machine#0(7697MB)
8: PU L#12(P#9) Core#2 L1d(32KB) L2d(512KB) Socket#0 Machine#0(7697MB)
9: PU L#13(P#10) Core#2 L1d(32KB) L2d(512KB) Socket#0 Machine#0(7697MB)
10: PU L#14(P#11) Core#2 L1d(32KB) L2d(512KB) Socket#0 Machine#0(7697MB)
11: PU L#15(P#12) Core#2 L1d(32KB) L2d(512KB) Socket#0 Machine#0(7697MB)
12: PU L#16(P#13) Core#3 L1d(32KB) L2d(512KB) Socket#0 Machine#0(7697MB)
13: PU L#17(P#14) Core#3 L1d(32KB) L2d(512KB) Socket#0 Machine#0(7697MB)
14: PU L#18(P#15) Core#3 L1d(32KB) L2d(512KB) Socket#0 Machine#0(7697MB)
15: PU L#19(P#16) Core#3 L1d(32KB) L2d(512KB) Socket#0 Machine#0(7697MB)
OS-threads,Tasks,Delay (micro-seconds),Total Walltime (seconds),Walltime per Task (seconds)
16, 1000000, 10, 11.2531, 1.12531e-0

@sithhell
Member

Looks like you missed the dashes before hpx:bind

@pagrubel
Member

Ugh! Mea culpa! That worked! I wonder if there is a possibility to limit the number of cores and use scatter such that you can run with a certain number of threads per core.

@pagrubel
Member

Now scatter does not work. Example:

/work/02466/pagrubel/build/hpx_build0929mic/bin/hpx_homogeneous_timed_task_spawn -t16 --hpx:queuing=static --hpx:bind=scatter
--hpx:print-bind --delay=10 --tasks=1000000
0: PU L#1(P#241), Core L#0(P#60), Socket L#0(P#0)
1: PU L#2(P#242), Core L#0(P#60), Socket L#0(P#0)
2: PU L#3(P#243), Core L#0(P#60), Socket L#0(P#0)
3: PU L#4(P#1), Core L#1(P#0), Socket L#0(P#0)
4: PU L#1(P#241), Core L#0(P#60), Socket L#0(P#0)
5: PU L#2(P#242), Core L#0(P#60), Socket L#0(P#0)
6: PU L#3(P#243), Core L#0(P#60), Socket L#0(P#0)
7: PU L#4(P#1), Core L#1(P#0), Socket L#0(P#0)
8: PU L#1(P#241), Core L#0(P#60), Socket L#0(P#0)
9: PU L#2(P#242), Core L#0(P#60), Socket L#0(P#0)
10: PU L#3(P#243), Core L#0(P#60), Socket L#0(P#0)
11: PU L#4(P#1), Core L#1(P#0), Socket L#0(P#0)
12: PU L#1(P#241), Core L#0(P#60), Socket L#0(P#0)
13: PU L#2(P#242), Core L#0(P#60), Socket L#0(P#0)
14: PU L#3(P#243), Core L#0(P#60), Socket L#0(P#0)
15: PU L#4(P#1), Core L#1(P#0), Socket L#0(P#0)

@pagrubel pagrubel reopened this Sep 30, 2013
@pagrubel
Member

Sorry, but now balanced is messed up. Run on the Xeon Phi:
/work/02466/pagrubel/build/hpx_build0929mic/bin/hpx_homogeneous_timed_task_spawn -t16 --hpx:queuing=static --hpx:bind=balanced
--hpx:print-bind --delay=10 --tasks=1000000
0: PU L#1(P#241), Core L#0(P#60), Socket L#0(P#0)
1: PU L#2(P#242), Core L#0(P#60), Socket L#0(P#0)
2: PU L#3(P#243), Core L#0(P#60), Socket L#0(P#0)
3: PU L#5(P#2), Core L#1(P#0), Socket L#0(P#0)
4: PU L#6(P#3), Core L#1(P#0), Socket L#0(P#0)
5: PU L#7(P#4), Core L#1(P#0), Socket L#0(P#0)
6: PU L#9(P#6), Core L#2(P#1), Socket L#0(P#0)
7: PU L#10(P#7), Core L#2(P#1), Socket L#0(P#0)
8: PU L#11(P#8), Core L#2(P#1), Socket L#0(P#0)
9: PU L#13(P#10), Core L#3(P#2), Socket L#0(P#0)
10: PU L#14(P#11), Core L#3(P#2), Socket L#0(P#0)
11: PU L#15(P#12), Core L#3(P#2), Socket L#0(P#0)
12: PU L#17(P#14), Core L#4(P#3), Socket L#0(P#0)
13: PU L#18(P#15), Core L#4(P#3), Socket L#0(P#0)
14: PU L#19(P#16), Core L#4(P#3), Socket L#0(P#0)
15: PU L#21(P#18), Core L#5(P#4), Socket L#0(P#0)
...

@sithhell
Member

Everything works as expected now.

@hkaiser
Member

hkaiser commented Sep 30, 2013

--hpx:bind=balanced hangs on my machine now. What was wrong with the previous implementation?

@hkaiser hkaiser reopened this Sep 30, 2013