Error parsing host file #643
Sameer, when I simply execute
That is an indication that I'm missing something (or the system configuration is messed up, which I don't think is the case).
Hi Hartmut,
[sameer@hn1 ~]$ module list
source /etc/profile.d/modules.csh
On Dec 18, 2012, at 5:23 PM, Hartmut Kaiser notifications@github.com wrote:
Hartmut - The issue is simpler than that... ACISS has two networks. The $PBS_NODEFILE contains the hostnames on the torque network. HPX is trying to get information for the hostnames on the GigE network. We were able to work around the problem by doing this in our submission script:
That way, we pass the GigE network names to HPX, rather than the torque names. It would be nice if HPX could use the torque names, because they are on the faster 10GigE network. Thanks -
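For readers who hit the same mismatch, here is a minimal sketch of one way such a translation could be done in a job script. It is not the script used above. It assumes that the torque names and the GigE names differ only in their leading prefix (`fn` versus `un`), which this thread shows only for a couple of fat nodes; `map_nodes` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: translate torque node names (fn*) to GigE names (un*).
# ASSUMPTION: the two names differ only in the leading "fn" vs "un" prefix.
map_nodes() {
    # $1: path to a node file with one hostname per line (duplicates allowed)
    # sort -u collapses the per-core duplicates; sed rewrites the prefix.
    sort -u "$1" | sed 's/^fn/un/'
}

# Print the translated names, but only when running inside a PBS job.
if [ -n "${PBS_NODEFILE:-}" ] && [ -r "$PBS_NODEFILE" ]; then
    map_nodes "$PBS_NODEFILE"
fi
```

The output is one translated hostname per line, which can then be passed to the application in whatever form it expects.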
If it is possible to deduce the 10GigE hostnames from the 1GigE ones you can use
This is resolved, I'm closing it. Please reopen if you need more information/help.
Hi,
The ACISS system has two different types of nodes. When I use:
qsub -I -V -q generic -l nodes=2:ppn=12 -d /home3/sameer
to allocate the generic nodes, I can run an application on the two nodes properly using:
[sameer@cn169 bin]$ pwd
/ibrix/packages/HPX/apps/hpx.un/bin
[sameer@cn169 bin]$ pbsdsh -v -u `pwd`/hello_world --hpx:nodes=`cat $PBS_NODEFILE` -t 2...
pbsdsh: rescinfo from 10: Linux cn169 2.6.32-279.9.1.el6.x86_64 #1 SMP Fri Aug 31 09:04:24 EDT 2012 x86_64:nodes=2:ppn=12,walltime=24:00:00
pbsdsh: rescinfo from 11: Linux cn169 2.6.32-279.9.1.el6.x86_64 #1 SMP Fri Aug 31 09:04:24 EDT 2012 x86_64:nodes=2:ppn=12,walltime=24:00:00
pbsdsh: spawned task 0
pbsdsh: spawned task 1
pbsdsh: spawn event returned: 0 (2 spawns and 0 obits outstanding)
pbsdsh: sending obit for task 8
pbsdsh: spawn event returned: 1 (1 spawns and 1 obits outstanding)
pbsdsh: sending obit for task 9
hello world from OS-thread 1 on locality 0
hello world from OS-thread 0 on locality 0
hello world from OS-thread 0 on locality 1
hello world from OS-thread 1 on locality 1
pbsdsh: obit event returned: 0 (0 spawns and 2 obits outstanding)
pbsdsh: task 0 exit status 0
pbsdsh: obit event returned: 1 (0 spawns and 1 obits outstanding)
pbsdsh: task 1 exit status 0
When I try to use the fatnodes, I get an error.
[sameer@hn1 ~]$ qsub -I -V -q fatnodes -l nodes=2:ppn=32 -d /home3/sameer
[sameer@un12 bin]$ pbsdsh -v -u `pwd`/hello_world --hpx:nodes=`cat $PBS_NODEFILE` -t 2
pbsdsh: rescinfo from 29: Linux un12 2.6.32-279.9.1.el6.x86_64 #1 SMP Fri Aug 31 09:04:24 EDT 2012 x86_64:nodes=2:ppn=32,walltime=24:00:00
pbsdsh: rescinfo from 30: Linux un12 2.6.32-279.9.1.el6.x86_64 #1 SMP Fri Aug 31 09:04:24 EDT 2012 x86_64:nodes=2:ppn=32,walltime=24:00:00
pbsdsh: rescinfo from 31: Linux un12 2.6.32-279.9.1.el6.x86_64 #1 SMP Fri Aug 31 09:04:24 EDT 2012 x86_64:nodes=2:ppn=32,walltime=24:00:00
pbsdsh: spawned task 0
pbsdsh: spawned task 1
pbsdsh: spawn event returned: 0 (2 spawns and 0 obits outstanding)
pbsdsh: sending obit for task 2
hpx::init: std::exception caught: Cannot retrieve number of OS threads for host_name: un12
pbsdsh: obit event returned: 0 (1 spawns and 1 obits outstanding)
pbsdsh: task 0 exit status 255
pbsdsh: spawn event returned: 1 (1 spawns and 0 obits outstanding)
pbsdsh: sending obit for task 3
hpx::init: std::exception caught: Cannot retrieve number of OS threads for host_name: un10
pbsdsh: obit event returned: 1 (0 spawns and 1 obits outstanding)
pbsdsh: task 1 exit status 255
cat $PBS_NODEFILE | more
fn12
fn12
fn12
fn12
fn12
fn12
fn12
fn12
fn12
fn12
fn12
fn12
...
I am not sure how it gets un12 from fn12:
hpx::init: std::exception caught: Cannot retrieve number of OS threads for host_name: un12
The executables are in /ibrix/packages/HPX/apps/hpx.un/bin
Thanks,
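A quick way to confirm this kind of mismatch is to check whether the name a node reports for itself appears verbatim in the node file. The sketch below is only a diagnostic under that assumption. `host_in_nodefile` is a hypothetical helper name, and the check says nothing about how HPX itself resolves host names.

```shell
#!/bin/sh
# Hypothetical check: is a hostname listed verbatim in a node file?
host_in_nodefile() {
    # $1: hostname to look for, $2: node file (one name per line)
    # -x matches whole lines only, -F treats the name as a fixed string.
    grep -qxF -- "$1" "$2"
}

# Report the result for the local node, but only inside a PBS job.
if [ -n "${PBS_NODEFILE:-}" ] && [ -r "$PBS_NODEFILE" ]; then
    if host_in_nodefile "$(hostname)" "$PBS_NODEFILE"; then
        echo "$(hostname) is listed in the node file"
    else
        echo "$(hostname) is NOT listed in the node file"
    fi
fi
```

On the fat nodes above, the prompt shows `un12` while the node file lists `fn12`, so a check like this would be expected to report the name as not listed.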