Choose thread affinities more cleverly #887
Comments
That's not always true. There is no generic affinity definition which is optimal for all applications.
Yes, this is correct and an acknowledged problem (see #421). Solving it is difficult because there is no general way to figure out the sequence number of a process on a particular node (I think only SLURM allows extracting this information from the environment). If an application instance wants to determine which cores it can use, it needs to know how many other instances of this application are running on the same node and in what order those instances are supposed to be assigned to the PUs.
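As a hedged illustration of the SLURM case mentioned above: SLURM publishes the node-local task index in the `SLURM_LOCALID` environment variable, so a runtime could read it roughly like this (the function name is hypothetical, not an HPX API):

```cpp
#include <cstdlib>

// Hypothetical helper: read the node-local task index that SLURM
// exports as SLURM_LOCALID. Returns -1 when not launched via SLURM,
// in which case no generic fallback exists (the point of this issue).
int node_local_sequence_number() {
    const char* id = std::getenv("SLURM_LOCALID");
    return id != nullptr ? std::atoi(id) : -1;
}
```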
Again, I wouldn't know how to detect this.
Could you elaborate on how this could be done, please? |
Regarding affinity: yes, there is no layout that is ideal for all applications. However, consider e.g. running on an Intel MIC (60 cores) with 120 threads. HPX currently places 4 threads on each of the first 30 cores, and no threads on the other 30 cores. In none of the HPC talks I've seen was this configuration considered. Instead, for maximum performance, people use all cores and then vary the number of threads running per core (1, 2, 3, or 4). I suggest a command-line option to switch to this behaviour without having to specify the PU stepping manually; specifying the PU stepping by hand requires taking the number of threads and number of cores into account for each run. One would calculate #pus-per-core as #threads/#cores (rounding up). Alternatively, one would distribute threads over cores round-robin, possibly renumbering the threads afterwards to ensure the threads on the same core are numbered consecutively.

Regarding MPI: one can use MPI_Get_processor_name to find out which MPI ranks share a node. Within the node, one can order them by rank (MPI_Comm_rank). (Without MPI, the hostname could be used, or the IP address obtained from ifconfig.) |
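The round-robin distribution with consecutive renumbering proposed above can be sketched as follows (a minimal illustration under the assumptions stated in the comment, not HPX's actual implementation; the function name is made up):

```cpp
#include <vector>

// Sketch of the proposed "scatter" layout: distribute threads over
// cores round-robin, then renumber so that threads sharing a core get
// consecutive indices. Returns the core index for each thread.
std::vector<int> scatter_cores(int num_threads, int num_cores) {
    // First pass: round-robin places ceil(num_threads/num_cores)
    // threads on the leading cores and floor(...) on the rest.
    std::vector<int> per_core(num_cores, 0);
    for (int t = 0; t < num_threads; ++t)
        ++per_core[t % num_cores];

    // Second pass: renumber so threads on one core are consecutive.
    std::vector<int> core_of;
    core_of.reserve(num_threads);
    for (int c = 0; c < num_cores; ++c)
        for (int i = 0; i < per_core[c]; ++i)
            core_of.push_back(c);
    return core_of;
}
```

For 120 threads on 60 cores this yields two threads per core; for 16 threads it yields one thread on each of the first 16 cores instead of four threads on each of the first four.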
With regard to multiple localities per node, please join the discussion at #421. For the other topic discussed here I suggest extending the --hpx:bind syntax to include the following:
This naming scheme and distribution come from Intel's KMP_AFFINITY settings. I think this is a good way to express the user's intent concisely. |
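In that scheme (names borrowed from KMP_AFFINITY; this is an illustrative sketch of the two placement policies, not HPX code), the two layouts differ only in the index arithmetic:

```cpp
// compact: fill all PUs of one core before moving to the next core.
int compact_core(int thread, int pus_per_core) {
    return thread / pus_per_core;
}

// scatter: one thread per core first, wrapping around afterwards.
int scatter_core(int thread, int num_cores) {
    return thread % num_cores;
}
```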
- Implemented --hpx:bind=compact and --hpx:bind=scatter
- Beautified the output of --hpx:print-bind to be human-parsable
I measured the usage of the threads on the Xeon Phi on Stampede using mpstat while running with --hpx:bind=compact and scatter. Both behave as described for scatter. So can you explain a little better how to use this? I thought that all I needed to do was specify the number of OS threads to use (-t 120, for example) and then add --hpx:bind-scatter. What should that give me? What I see is threads 1-120 being used, which means it uses four threads on each of the first thirty cores. |
Here is a short example using -t16 --hpx:bind-scatter. Output from mpstat -P ALL (using a 1-second interval); notice the idle %, which is the last column: four threads per core are used for the first 4 cores.
Linux 2.6.38.8-g5f2543d (c557-701-mic0.stampede.tacc.utexas.edu) 09/27/13 k1om (244 CPU)
18:47:10 CPU %usr %nice %sys %iowait %irq %soft %steal %guest %idle |
Pat, could you please post the output you get from adding these two command-line arguments to your application:
and
|
/work/02466/pagrubel/build/hpx_build0927mic/bin/hpx_homogeneous_timed_task_spawn -t16 --hpx:queuing=static hpx:bind=scatter
/work/02466/pagrubel/build/hpx_build0927mic/bin/hpx_homogeneous_timed_task_spawn -t16 --hpx:queuing=static hpx:bind=compact |
Looks like you missed the dashes before hpx:bind
|
Ugh! Mea culpa! That worked! I wonder if there is a possibility to limit the number of cores and use scatter such that you can run with a certain number of threads per core. |
Now scatter does not work. Example: /work/02466/pagrubel/build/hpx_build0929mic/bin/hpx_homogeneous_timed_task_spawn -t16 --hpx:queuing=static --hpx:bind=scatter |
Sorry, but now balanced is messed up. |
Everything works as expected now. |
--hpx:bind=balanced hangs on my machine now. What was wrong with the previous implementation? |
When there are fewer threads than PUs, HPX assigns OS threads to PUs in order. If there are several PUs per core and the number of threads equals the number of cores, it is likely more efficient to use only one PU per core instead of leaving some cores idle.
With the MPI parcelport, when several localities run on the same node, HPX assigns threads of different localities to the same PUs.
This should be changed to make efficient job startup simpler. I also want to argue that, in the MPI case, HPX should loudly warn when this happens so that the user is not surprised by the bad performance. If the default behaviour is not to be changed (I hear this was a design decision), then I suggest adding a new command-line option to calculate pu-offset and pu-step automatically.
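The automatic calculation suggested above could build on the MPI approach mentioned earlier in the thread: gather every rank's MPI_Get_processor_name result (e.g. via MPI_Allgather) and count how many lower ranks report the same host. A hedged sketch of just the counting step, with the MPI plumbing left out and a hypothetical function name:

```cpp
#include <string>
#include <vector>

// Given the hostname of every MPI rank (index == rank, e.g. as
// gathered with MPI_Get_processor_name + MPI_Allgather), return the
// node-local sequence number of 'rank': the count of lower ranks on
// the same host. This number could feed an automatic pu-offset.
int node_local_rank(const std::vector<std::string>& host_of_rank, int rank) {
    int local = 0;
    for (int r = 0; r < rank; ++r)
        if (host_of_rank[r] == host_of_rank[rank])
            ++local;
    return local;
}
```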