hpx:threads=all allocates too many os threads #1422

Closed
Finomnis opened this issue Mar 24, 2015 · 4 comments

@Finomnis
Contributor

commented Mar 24, 2015

Reproducible with this test:

#include <hpx/hpx.hpp>
#include <hpx/hpx_init.hpp>
#include <hpx/util/lightweight_test.hpp>

#include <cstddef>
#include <iostream>
#include <string>
#include <vector>

unsigned long num_cores = 0;

int hpx_main(int argc, char ** argv)
{
    // Number of OS threads the HPX runtime was started with
    std::size_t const os_threads = hpx::get_os_thread_count();

    std::cout << "Cores: " << num_cores << std::endl;
    std::cout << "OS Threads: " << os_threads << std::endl;

    // With hpx.os_threads=all there should be one OS thread per core
    HPX_TEST_EQ(num_cores, os_threads);

    return hpx::finalize();
}

int main(int argc, char **argv)
{
    // Get number of cores from OS
    num_cores = hpx::threads::hardware_concurrency();

    // By default this test should run on all available cores
    std::vector<std::string> cfg;
    cfg.push_back("hpx.os_threads=all");

    // Initialize and run HPX
    HPX_TEST_EQ_MSG(hpx::init(argc, argv, cfg), 0,
        "HPX main exited with non-zero status");
    return hpx::util::report_errors();
}

Steps:

  1. salloc -t 02:00:00 -n1 -N1 --cpus-per-task=32
  2. ./threads_all_test:
Cores: 32
OS Threads: 32
  3. srun ./threads_all_test (long startup time):
Cores: 32
OS Threads: 544
/home/yn87erow/hpx_tmp/repo/tests/regressions/threads_all.cpp(21): test 'num_cores == os_threads' failed in function 'int hpx_main(int, char **)': '32' != '544'
0 sanity checks and 1 test failed.
srun: error: rusty: task 0: Exited with exit code 1

I suspect that environment variables cause this behaviour. These are the variables that were present in the environment after the salloc:

SLURM_NODELIST=rusty
MKLROOT=/opt/intel/mkl/10.0.5.025
LDFLAGS=-L/home/yn87erow/.usr_bin/lib
MANPATH=/usr/local/share/man:/usr/share/man:/usr/share/gcc-data/x86_64-pc-linux-gnu/4.7.4/man:/usr/share/binutils-data/x86_64-pc-linux-gnu/2.24/man:/opt/intel/mkl/10.0.5.025/man
STAMPEDE=finomnis@stampede.tacc.utexas.edu
SLURM_NODE_ALIASES=(null)
INTEL_LICENSE_FILE=/opt/intel/composerxe-2013.3.174/licenses:/opt/intel/license
TERM=rxvt-unicode
SHELL=/bin/bash
SSH_CLIENT=131.188.33.180 44739 22
SSH_TTY=/dev/pts/4
SLURM_NNODES=1
GMTHOME=/usr/share/gmt-4.5.8
USER=yn87erow
LD_LIBRARY_PATH=/home/yn87erow/bachelor/hpx/build/lib:/home/yn87erow/.usr_bin/lib/
LS_COLORS=rs=0:di=01;34:ln=01;36:mh=00:pi=40;33:so=01;35:do=01;35:bd=40;33;01:cd=40;33;01:or=01;05;37;41:mi=01;05;37;41:su=37;41:sg=30;43:ca=30;41:tw=30;42:ow=34;42:st=37;44:ex=01;32:*.tar=01;31:*.tgz=01;31:*.arc=01;31:*.arj=01;31:*.taz=01;31:*.lha=01;31:*.lz4=01;31:*.lzh=01;31:*.lzma=01;31:*.tlz=01;31:*.txz=01;31:*.tzo=01;31:*.t7z=01;31:*.zip=01;31:*.z=01;31:*.Z=01;31:*.dz=01;31:*.gz=01;31:*.lrz=01;31:*.lz=01;31:*.lzo=01;31:*.xz=01;31:*.bz2=01;31:*.bz=01;31:*.tbz=01;31:*.tbz2=01;31:*.tz=01;31:*.deb=01;31:*.rpm=01;31:*.jar=01;31:*.war=01;31:*.ear=01;31:*.sar=01;31:*.rar=01;31:*.alz=01;31:*.ace=01;31:*.zoo=01;31:*.cpio=01;31:*.7z=01;31:*.rz=01;31:*.cab=01;31:*.jpg=01;35:*.jpeg=01;35:*.gif=01;35:*.bmp=01;35:*.pbm=01;35:*.pgm=01;35:*.ppm=01;35:*.tga=01;35:*.xbm=01;35:*.xpm=01;35:*.tif=01;35:*.tiff=01;35:*.png=01;35:*.svg=01;35:*.svgz=01;35:*.mng=01;35:*.pcx=01;35:*.mov=01;35:*.mpg=01;35:*.mpeg=01;35:*.m2v=01;35:*.mkv=01;35:*.webm=01;35:*.ogm=01;35:*.mp4=01;35:*.m4v=01;35:*.mp4v=01;35:*.vob=01;35:*.qt=01;35:*.nuv=01;35:*.wmv=01;35:*.asf=01;35:*.rm=01;35:*.rmvb=01;35:*.flc=01;35:*.avi=01;35:*.fli=01;35:*.flv=01;35:*.gl=01;35:*.dl=01;35:*.xcf=01;35:*.xwd=01;35:*.yuv=01;35:*.cgm=01;35:*.emf=01;35:*.axv=01;35:*.anx=01;35:*.ogv=01;35:*.ogx=01;35:*.pdf=00;32:*.ps=00;32:*.txt=00;32:*.patch=00;32:*.diff=00;32:*.log=00;32:*.tex=00;32:*.doc=00;32:*.aac=00;36:*.au=00;36:*.flac=00;36:*.m4a=00;36:*.mid=00;36:*.midi=00;36:*.mka=00;36:*.mp3=00;36:*.mpc=00;36:*.ogg=00;36:*.ra=00;36:*.wav=00;36:*.axa=00;36:*.oga=00;36:*.spx=00;36:*.xspf=00;36:
GUILE_LOAD_PATH=/usr/share/guile/1.8
SLURM_JOBID=600
SLURM_NTASKS=1
PAGER=/usr/bin/less
CONFIG_PROTECT_MASK=/etc/gentoo-release /etc/sandbox.d /etc/fonts/fonts.conf /etc/terminfo /etc/ca-certificates.conf /etc/revdep-rebuild
XDG_CONFIG_DIRS=/etc/xdg
NLSPATH=/opt/intel/composerxe-2013.3.174/lib/locale/en_US/%N
SLURM_TASKS_PER_NODE=1
SHELOB=finomnis@shelob.hpc.lsu.edu
MAIL=/var/mail/yn87erow
PATH=/home/yn87erow/.usr_bin/bin:/home/yn87erow/.usr_bin/bin:/usr/local/bin:/usr/bin:/bin:/opt/bin:/usr/x86_64-pc-linux-gnu/gcc-bin/4.7.4:/opt/intel/composerxe-2013.3.174/bin/intel64:/opt/intel/composerxe-2013.3.174/mpirt/bin/intel64:/opt/intel/composerxe-2013.3.174/bin/ia32:/opt/intel/composerxe-2013.3.174/mpirt/bin/ia32:/opt/cuda/bin:/opt/MATLAB/R2014a/bin/
SLURM_CPUS_PER_TASK=32
SLURM_JOB_ID=600
HG=/usr/bin/hg
AMDAPPSDKROOT=/home/yn87erow/.usr_bin
PWD=/home/yn87erow/hpx_tmp/build/bin
EDITOR=/bin/nano
LANG=en_US.UTF-8
QT_GRAPHICSSYSTEM=raster
SLURM_SUBMIT_DIR=/home/yn87erow
SLURM_NPROCS=1
SLURM_JOB_NODELIST=rusty
SHLVL=3
HOME=/home/yn87erow
SLURM_JOB_CPUS_PER_NODE=32
SLURM_SUBMIT_HOST=luna
SYSTEMD_LESS=FRSM --shift 5
LESS=-R -M --shift 5
LOGNAME=yn87erow
IBPATH=/usr/sbin
GCC_SPECS=
XDG_DATA_DIRS=/usr/local/share:/usr/share
SSH_CONNECTION=131.188.33.180 44739 131.188.33.181 22
SLURM_JOB_NUM_NODES=1
LESSOPEN=|lesspipe %s
INFOPATH=/usr/share/info:/usr/share/gcc-data/x86_64-pc-linux-gnu/4.7.4/info:/usr/share/binutils-data/x86_64-pc-linux-gnu/2.24/info:/usr/share/info/emacs-24
GMT_SHAREDIR=/usr/share/gmt-4.5.8
RUBYOPT=-rauto_gem
OPENGL_PROFILE=xorg-x11
OPENCL_VENDOR_PATH=/home/yn87erow/.usr_bin/etc/OpenCL/vendors/
CONFIG_PROTECT=/usr/share/gnupg/qualified.txt
OPENCL_PROFILE=nvidia
OLDPWD=/home/yn87erow/hpx_tmp
_=/usr/bin/printenv

And these were added by srun:

> SLURM_PRIO_PROCESS=0
> SRUN_DEBUG=3
> SLURM_DISTRIBUTION=cyclic
> SLURM_STEP_ID=20
> SLURM_STEPID=20
> SLURM_SRUN_COMM_PORT=53281
> SLURM_STEP_NODELIST=rusty
> SLURM_STEP_NUM_NODES=1
> SLURM_STEP_NUM_TASKS=1
> SLURM_STEP_TASKS_PER_NODE=1
> SLURM_STEP_LAUNCHER_PORT=53281
> SLURM_STEP_RESV_PORTS=12123-12124
> SLURM_SRUN_COMM_HOST=131.188.33.181
> SLURM_TOPOLOGY_ADDR=rusty
> SLURM_TOPOLOGY_ADDR_PATTERN=node
> TMPDIR=/tmp
> SLURM_CPUS_ON_NODE=32
> SLURM_TASK_PID=2959
> SLURM_NODEID=0
> SLURM_PROCID=0
> SLURM_LOCALID=0
> SLURM_LAUNCH_NODE_IPADDR=131.188.33.181
> SLURM_GTIDS=0
> SLURM_CHECKPOINT_IMAGE_DIR=/home/yn87erow/hpx_tmp/build/bin
> SLURMD_NODENAME=rusty
@hkaiser

Member

commented Mar 24, 2015

This happens only if the executable is launched by SLURM. It works from the command line.

@hkaiser

Member

commented Mar 27, 2015

@Finomnis: could you verify whether my fixes (see above) make the problem go away, please?

@sithhell

Member

commented Mar 27, 2015

The problem has been fixed by the above commits.

@sithhell closed this Mar 27, 2015

@Finomnis

Contributor Author

commented Mar 27, 2015

Confirmed.
