heartbeat example fails on separate nodes #1632

Closed
biddisco opened this issue Jun 26, 2015 · 17 comments

@biddisco
Contributor

on monchhm05 which is ip 148.187.68.78

ip route get 8.8.8.8 | awk 'NR==1 {print $NF}'
148.187.68.78
bin/heartbeat_console -Ihpx.parcel.port=7910 (or port 7909)

on monchhm06 which is ip 148.187.68.79

ping 148.187.68.78 is ok
PING 148.187.68.78 (148.187.68.78) 56(84) bytes of data.
64 bytes from 148.187.68.78: icmp_seq=1 ttl=64 time=0.168 ms

bin/heartbeat -Ihpx.parcel.port=7909 --hpx:hpx=148.187.68.78:7909
heartbeat: /mnt/lnec/biddisco/src/spinmaster/hpx/hpx/exception.hpp:465: \
    hpx::exception::exception(hpx::error, const char*, hpx::throwmode): \
    Assertion `e >= success && e < last_error' failed.

I tried various combinations of ports on the command line.

The stack trace shows tcp::connection_handler::do_run is failing:

#0  0x00002aaaaed77625 in raise () from /lib64/libc.so.6
#1  0x00002aaaaed78e05 in abort () from /lib64/libc.so.6
#2  0x00002aaaaed7074e in __assert_fail_base () from /lib64/libc.so.6
#3  0x00002aaaaed70810 in __assert_fail () from /lib64/libc.so.6
#4  0x000000000067158d in hpx::exception::exception(hpx::error, char const*, hpx::throwmode) ()
    at /mnt/lnec/biddisco/src/spinmaster/hpx/hpx/exception.hpp:465
#5  0x00002aaaad2273ec in hpx::error_code::error_code(hpx::error, hpx::throwmode) () at /mnt/lnec/biddisco/src/spinmaster/hpx/hpx/exception.hpp:1473
#6  0x00002aaaad2270d3 in hpx::make_error_code(hpx::error, hpx::throwmode) () at /mnt/lnec/biddisco/src/spinmaster/hpx/hpx/exception.hpp:373
#7  0x00002aaaad22710f in hpx::exception::exception(hpx::error) () at /mnt/lnec/biddisco/src/spinmaster/hpx/hpx/exception.hpp:438
#8  0x00002aaaad226665 in hpx::exception_list::add(boost::exception_ptr const&) () at /mnt/lnec/biddisco/src/spinmaster/hpx/src/exception_list.cpp:132
#9  0x00002aaaad98bce1 in hpx::parcelset::policies::tcp::connection_handler::do_run() ()
    at /mnt/lnec/biddisco/src/spinmaster/hpx/plugins/parcelport/tcp/connection_handler_tcp.cpp:98

@hkaiser
Member

hkaiser commented Jun 26, 2015

I think this should work:

node0:
    bin/heartbeat_console -Ihpx.parcel.port=7910

node1:
    bin/heartbeat --hpx:agas=node0:7910

@hkaiser
Member

hkaiser commented Jun 26, 2015

Also, please post the output of --hpx:dump-config for both executables here.

@biddisco
Contributor Author

Here's the output of the console. The heartbeat worker just hangs when I run

spinmaster/bin/heartbeat --hpx:agas=148.187.68.111:7910 --hpx:dump-config

environment is identical except I'm on node 39 instead of 38

spinmaster/bin/heartbeat_console  -Ihpx.parcel.port=7910 --hpx:dump-config

Configuration after runtime start:

============================
  [application]
  [hpx]
    'cmd_line' : 'spinmaster/bin/heartbeat_console -Ihpx.parcel.port=7910 --hpx:dump-config'
    'component_path' : '$[hpx.location]:$[system.executable_prefix]' -> '/mnt/lnec/biddisco/build/spinmaster:/mnt/lnec/biddisco/build/spinmaster'
    'component_path_suffixes' : '/lib/hpx:/bin/hpx'
    'cores' : '20'
    'finalize_wait_time' : '-1.0'
    'first_pu' : '0'
    'first_used_core' : '0'
    'localities' : '1'
    'locality' : '0'
    'location' : '$[system.prefix]' -> '/mnt/lnec/biddisco/build/spinmaster'
    'lock_detection' : '1'
    'master_ini_path' : '$[hpx.location]:$[system.executable_prefix]/' -> '/mnt/lnec/biddisco/build/spinmaster:/mnt/lnec/biddisco/build/spinmaster/'
    'master_ini_path_suffixes' : '/share/hpx-0.9.11:/../share/hpx-0.9.11'
    'minimal_deadlock_detection' : '1'
    'os_threads' : '1'
    'program_name' : 'spinmaster/bin/heartbeat_console'
    'reconstructed_cmd_line' : 'spinmaster/bin/heartbeat_console --hpx:config --hpx:cores=all --hpx:dump-config --hpx:ini=hpx.parcel.port=7910 --runfor=600'
    'runtime_mode' : 'console'
    'scheduler' : 'local-priority'
    'shutdown_timeout' : '-1.0'
    'throw_on_held_lock' : '1'
    [agas]
      'address' : '127.0.0.1'
      'dedicated_server' : '0'
      'local_cache_size' : '256'
      'local_cache_size_per_thread' : '32'
      'max_pending_refcnt_requests' : '4096'
      'port' : '7910'
      'service_mode' : 'bootstrap'
      'use_caching' : '1'
      'use_range_caching' : '1'
    [commandline]
      'aliasing' : '1'
      'allow_unknown' : '0'
      [aliases]
        '-0' : '--hpx:node=0'
        '-1' : '--hpx:node=1'
        '-2' : '--hpx:node=2'
        '-3' : '--hpx:node=3'
        '-4' : '--hpx:node=4'
        '-5' : '--hpx:node=5'
        '-6' : '--hpx:node=6'
        '-7' : '--hpx:node=7'
        '-8' : '--hpx:node=8'
        '-9' : '--hpx:node=9'
        '-I' : '--hpx:ini'
        '-a' : '--hpx:agas'
        '-c' : '--hpx:console'
        '-h' : '--hpx:help'
        '-l' : '--hpx:localities'
        '-p' : '--hpx:app-config'
        '-q' : '--hpx:queuing'
        '-r' : '--hpx:run-agas-server'
        '-t' : '--hpx:threads'
        '-v' : '--hpx:version'
        '-w' : '--hpx:worker'
        '-x' : '--hpx:hpx'
    [components]
      'load_external' : '1'
      [adding_counter]
        'enabled' : '1'
        'name' : 'hpx'
        'static' : '1'
      [average_count_counter]
        'enabled' : '1'
        'name' : 'hpx'
        'path' : '$[hpx.location]/bin/libhpxd.so' -> '/mnt/lnec/biddisco/build/spinmaster/bin/libhpxd.so'
        'static' : '1'
      [barrier]
        'enabled' : '1'
        'name' : 'hpx'
        'path' : '$[hpx.location]/bin/libhpxd.so' -> '/mnt/lnec/biddisco/build/spinmaster/bin/libhpxd.so'
        'static' : '1'
      [component_namespace]
        'enabled' : '1'
        'name' : 'hpx'
        'static' : '1'
      [dividing_counter]
        'enabled' : '1'
        'name' : 'hpx'
        'static' : '1'
      [elapsed_time_counter]
        'enabled' : '1'
        'name' : 'hpx'
        'path' : '$[hpx.location]/bin/libhpxd.so' -> '/mnt/lnec/biddisco/build/spinmaster/bin/libhpxd.so'
        'static' : '1'
      [hpx_lcos_server_latch]
        'enabled' : '1'
        'name' : 'hpx'
        'path' : '$[hpx.location]/bin/libhpxd.so' -> '/mnt/lnec/biddisco/build/spinmaster/bin/libhpxd.so'
        'static' : '1'
      [locality_namespace]
        'enabled' : '1'
        'name' : 'hpx'
        'static' : '1'
      [max_count_counter]
        'enabled' : '1'
        'name' : 'hpx'
        'static' : '1'
      [median_count_counter]
        'enabled' : '1'
        'name' : 'hpx'
        'static' : '1'
      [min_count_counter]
        'enabled' : '1'
        'name' : 'hpx'
        'static' : '1'
      [multiplying_counter]
        'enabled' : '1'
        'name' : 'hpx'
        'static' : '1'
      [output_stream_factory]
        'enabled' : '1'
        'name' : 'hpx_iostreams'
        'static' : '1'
      [primary_namespace]
        'enabled' : '1'
        'name' : 'hpx'
        'static' : '1'
      [raw_counter]
        'enabled' : '1'
        'name' : 'hpx'
        'path' : '$[hpx.location]/bin/libhpxd.so' -> '/mnt/lnec/biddisco/build/spinmaster/bin/libhpxd.so'
        'static' : '1'
      [rolling_mean_count_counter]
        'enabled' : '1'
        'name' : 'hpx'
        'static' : '1'
      [subtracting_counter]
        'enabled' : '1'
        'name' : 'hpx'
        'static' : '1'
      [symbol_namespace]
        'enabled' : '1'
        'name' : 'hpx'
        'static' : '1'
      [variance_count_counter]
        'enabled' : '1'
        'name' : 'hpx'
        'static' : '1'
    [logging]
      'destination' : 'console'
      'format' : '(T%locality%/%hpxthread%.%hpxphase%/%hpxcomponent%) P%parentloc%/%hpxparent%.%hpxparentphase% %time%($hh:$mm.$ss.$mili) [%idx%]|\n'
      'level' : '0'
      [agas]
        'destination' : 'file(hpx.agas.$[system.pid].log)' -> 'file(hpx.agas.18550.log)'
        'format' : '(T%locality%/%hpxthread%.%hpxphase%/%hpxcomponent%) P%parentloc%/%hpxparent%.%hpxparentphase% %time%($hh:$mm.$ss.$mili) [%idx%][AGAS] |\n'
        'level' : '-1'
      [application]
        'destination' : 'console'
        'format' : '(T%locality%/%hpxthread%.%hpxphase%/%hpxcomponent%) P%parentloc%/%hpxparent%.%hpxparentphase% %time%($hh:$mm.$ss.$mili) [%idx%] [APP] |\n'
        'level' : '-1'
      [console]
        'destination' : 'file(hpx.$[system.pid].log)' -> 'file(hpx.18550.log)'
        'format' : '|'
        'level' : '$[hpx.logging.level]' -> '0'
        [agas]
          'destination' : 'file(hpx.agas.$[system.pid].log)' -> 'file(hpx.agas.18550.log)'
          'format' : '|'
          'level' : '$[hpx.logging.agas.level]' -> '-1'
        [application]
          'destination' : 'file(hpx.application.$[system.pid].log)' -> 'file(hpx.application.18550.log)'
          'format' : '|'
          'level' : '$[hpx.logging.application.level]' -> '-1'
        [debuglog]
          'destination' : 'file(hpx.debuglog.$[system.pid].log)' -> 'file(hpx.debuglog.18550.log)'
          'format' : '|'
          'level' : '$[hpx.logging.debuglog.level]' -> '-1'
        [parcel]
          'destination' : 'file(hpx.parcel.$[system.pid].log)' -> 'file(hpx.parcel.18550.log)'
          'format' : '|'
          'level' : '$[hpx.logging.parcel.level]' -> '-1'
        [timing]
          'destination' : 'file(hpx.timing.$[system.pid].log)' -> 'file(hpx.timing.18550.log)'
          'format' : '|'
          'level' : '$[hpx.logging.timing.level]' -> '-1'
      [debuglog]
        'destination' : 'console'
        'format' : '(T%locality%/%hpxthread%.%hpxphase%/%hpxcomponent%) P%parentloc%/%hpxparent%.%hpxparentphase% %time%($hh:$mm.$ss.$mili) [%idx%] [DEB] |\n'
        'level' : '-1'
      [parcel]
        'destination' : 'file(hpx.parcel.$[system.pid].log)' -> 'file(hpx.parcel.18550.log)'
        'format' : '(T%locality%/%hpxthread%.%hpxphase%/%hpxcomponent%) P%parentloc%/%hpxparent%.%hpxparentphase% %time%($hh:$mm.$ss.$mili) [%idx%][  PT] |\n'
        'level' : '-1'
      [timing]
        'destination' : 'console'
        'format' : '(T%locality%/%hpxthread%.%hpxphase%/%hpxcomponent%) P%parentloc%/%hpxparent%.%hpxparentphase% %time%($hh:$mm.$ss.$mili) [%idx%] [TIM] |\n'
        'level' : '-1'
    [parcel]
      'address' : '127.0.0.1'
      'array_optimization' : '1'
      'async_serialization' : '1'
      'bootstrap' : 'tcp'
      'enable_security' : '0'
      'endian_out' : 'little'
      'max_connections' : '512'
      'max_connections_per_locality' : '4'
      'max_message_size' : '1000000000'
      'max_outbound_message_size' : '1000000'
      'message_handlers' : '0'
      'port' : '7910'
      'zero_copy_optimization' : '$[hpx.parcel.array_optimization]' -> '1'
      [tcp]
        'array_optimization' : '$[hpx.parcel.array_optimization]' -> '1'
        'async_serialization' : '$[hpx.parcel.async_serialization]' -> '1'
        'enable' : '1'
        'enable_security' : '$[hpx.parcel.enable_security]' -> '0'
        'max_connections' : '$[hpx.parcel.max_connections]' -> '512'
        'max_connections_per_locality' : '$[hpx.parcel.max_connections_per_locality]' -> '4'
        'max_message_size' : '$[hpx.parcel.max_message_size]' -> '1000000000'
        'max_outbound_message_size' : '$[hpx.parcel.max_outbound_message_size]' -> '1000000'
        'name' : 'hpx'
        'parcel_pool_size' : '$[hpx.threadpools.parcel_pool_size]' -> '2'
        'path' : '/mnt/lnec/biddisco/build/spinmaster/lib/hpx:'
        'priority' : '1'
        'zero_copy_optimization' : '$[hpx.parcel.zero_copy_optimization]' -> '1'
    [stacks]
      'huge_size' : '0x2000000'
      'large_size' : '0x0200000'
      'medium_size' : '0x0020000'
      'small_size' : '0x10000'
      'use_guard_pages' : '1'
    [threadpools]
      'io_pool_size' : '2'
      'parcel_pool_size' : '2'
      'timer_pool_size' : '2'
  [system]
    'executable_prefix' : '/mnt/lnec/biddisco/build/spinmaster'
    'pid' : '18550'
    'prefix' : '/mnt/lnec/biddisco/build/spinmaster'
============================
Heartbeat Console, waiting for 600[s].
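
The settings that matter for this issue are easy to miss in a dump of this size. A small filter along the following lines could pull them out. This is a sketch only: hpx_dump_addrs is a hypothetical helper, not an HPX tool, and it assumes the dump layout shown above (bracketed section names, quoted keys and values).

```shell
#!/bin/sh
# Hypothetical helper: read --hpx:dump-config output on stdin and print the
# 'address' and 'port' entries together with the section they appear in.
hpx_dump_addrs() {
    awk -v q="'" '
        # A line whose first field starts with "[" names a section.
        $1 ~ /^\[/ { section = $1; next }
        # Key lines look like:  <quote>address<quote> : <quote>127.0.0.1<quote>
        ($1 == (q "address" q)) || ($1 == (q "port" q)) {
            key = $1; val = $3
            gsub(q, "", key); gsub(q, "", val)   # strip the single quotes
            print section, key, val
        }
    '
}

# Hypothetical usage, with dump.txt holding a saved copy of the dump:
#   hpx_dump_addrs < dump.txt
```

For the console dump above, this lists address 127.0.0.1 and port 7910 for both [agas] and [parcel].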

@hkaiser
Member

hkaiser commented Jun 26, 2015

Let's make everything explicit:

node0:
    bin/heartbeat_console --hpx:agas=node0:7910 --hpx:hpx=node0:7910

node1:
    bin/heartbeat --hpx:agas=node0:7910 --hpx:hpx=node1:7910

This works for me.

@sithhell
Member

Looking at your command line on monchhm06 (IP 148.187.68.79), you specified the --hpx:hpx=148.187.68.78:7909 command line option. --hpx:hpx is used to bind the local parcelport to a certain IP address. You specified the remote address, which should go to --hpx:agas.

@biddisco
Contributor Author

sorry, only had a couple of minutes on airport wifi and didn't test well. Will rerun when I get nodes back.

@biddisco
Contributor Author

only one node available so node 0 is login node ip addr 148.187.68.37

spinmaster/bin/heartbeat_console --hpx:agas=148.187.68.37:7910 --hpx:hpx=148.187.68.37:7910
Heartbeat Console, waiting for 600[s].

and node 1 is monchlm35

spinmaster/bin/heartbeat --hpx:agas=148.187.68.37:7910 --hpx:hpx=148.187.68.37:7910
heartbeat: /mnt/lnec/biddisco/src/spinmaster/hpx/hpx/exception.hpp:465: \
    hpx::exception::exception(hpx::error, const char*, hpx::throwmode): \
    Assertion `e >= success && e < last_error' failed.
Aborted

@hkaiser
Member

hkaiser commented Jun 27, 2015

@biddisco locality zero uses the same IP address/port for both the parcelport and AGAS (the agas option by default uses the same address as the parcelport, so you don't really need to specify it). The connecting locality uses the same agas address as locality 0 but its own parcelport address: an address on the node where the connecting locality lives. Please look carefully:

node0:
    bin/heartbeat_console --hpx:agas=node0:7910 --hpx:hpx=node0:7910

node1:
    bin/heartbeat --hpx:agas=node0:7910 --hpx:hpx=node1:7910
                                                  ^^^^^

You were using the same ip-address/port for all of the parameters, which can't work.

IOW, the parcelport address is where a locality is listening for incoming parcels, while the agas option tells the locality where to send the initial (connecting) parcel.
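
To make the pairing concrete, here is a minimal sketch of how the two options line up for each locality. The helper name hpx_addr_opts is made up for illustration and is not part of HPX. It only prints option strings, using the host names and port from the commands above.

```shell
#!/bin/sh
# Hypothetical helper: print the HPX address options for one locality.
#   $1 = host of locality 0 (the console, which also serves AGAS)
#   $2 = host the locality being started runs on (its own parcelport)
#   $3 = port (7910 in this thread)
hpx_addr_opts() {
    printf '%s\n' "--hpx:agas=$1:$3 --hpx:hpx=$2:$3"
}

# Console on node0: both options point at node0.
hpx_addr_opts node0 node0 7910   # prints: --hpx:agas=node0:7910 --hpx:hpx=node0:7910

# Worker on node1: agas still points at node0, hpx at the worker's own node.
hpx_addr_opts node0 node1 7910   # prints: --hpx:agas=node0:7910 --hpx:hpx=node1:7910
```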

@biddisco
Contributor Author

node0
spinmaster/bin/heartbeat_console --hpx:agas=148.187.68.74:7910 --hpx:hpx=148.187.68.74:7910
node1
spinmaster/bin/heartbeat --hpx:agas=148.187.68.74:7910 --hpx:hpx=148.187.68.75:7910

it works.

Please accept my apologies for misunderstanding and being useless.

@hkaiser
Member

hkaiser commented Jun 27, 2015

@biddisco I think we can do better than having to fully spell out things on the command line. Let's keep our eyes open for possible default settings that would allow the command line to be less verbose.

@hkaiser
Member

hkaiser commented Jun 27, 2015

Also, we need to improve the error message you were receiving. It should spell things out in a comprehensible form. That would have told you that you're trying to bind a socket to a non-existing interface.
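
Until such a message exists, a check of this kind can be done by hand before launching. The following is a rough sketch, not anything HPX provides: is_local_addr is a hypothetical helper that tests whether an address appears in a list of local addresses supplied on stdin.

```shell
#!/bin/sh
# Hypothetical pre-flight check: succeed (exit status 0) if address $1 is
# among the local addresses listed one per line on stdin, fail otherwise.
# -x matches whole lines only, -F treats the address as a fixed string so
# the dots are not interpreted as regex wildcards.
is_local_addr() {
    grep -qxF -- "$1"
}

# Hypothetical usage (assumes iproute2's `ip` is installed; not run here).
# Field 4 of `ip -o -4 addr show` is "address/prefix", so the prefix is cut off:
#   ip -o -4 addr show | awk '{ split($4, a, "/"); print a[1] }' \
#       | is_local_addr 148.187.68.78 \
#       || echo "148.187.68.78 is not a local address, --hpx:hpx cannot bind to it" >&2
```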

@biddisco biddisco reopened this Jul 3, 2015
@biddisco
Contributor Author

biddisco commented Jul 3, 2015

I took a look at the code in command-line handling and have implemented a few checks on start so that if we are using runtime_connect mode and the hpx ip address is 127.0.0.1, then it will reset the default IP to the correct public IP address.

This solves problems for me because when I use

    srun -n 4 executable --hpx:agas=ip:port --hpx:hpx=ip:port

I need to supply the --hpx:hpx ip address on the command line which is

  • different for each node started, and
  • not known before running srun without also querying the slurm environment, but this would mean calling a script to get the host, then running the executable from the script which is tedious.

If you allow hpx to use the hostname of 127.0.0.1 (which it currently does) then it works if two localities are on the same node, but if one or more are separate, then they send 127.0.0.1 to the root node and then deadlocks take place as the root node cannot find the real remote host. The patch I've implemented will not always work if multiple ip addresses exist for the node and they do not support tcp, but it will at least choose a default one which works most of the time. This allows one to run

    srun -n 4 executable --hpx:agas=rootnode:port 

(omit hpx:hpx as the default public IP is now discovered)

I will submit a patch via PR soon. (more testing required).
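
For comparison, the script-based workaround mentioned above might look roughly like this. It is only an illustration of that workaround, not the patch. The function name local_src_addr and the wrapper file name are made up, and it assumes the `ip route get` output contains a "src" field. It searches for that keyword rather than taking the last field, because newer iproute2 releases append further fields (such as "uid N") after the source address.

```shell
#!/bin/sh
# Hypothetical helper: extract the source address from `ip route get`
# output read on stdin by locating the word that follows "src".
local_src_addr() {
    awk '{ for (i = 1; i < NF; i++) if ($i == "src") { print $(i + 1); exit } }'
}

# Hypothetical wrapper body, started once per task via
#   srun -n 4 ./wrapper.sh
# (not executed here; rootnode and port 7910 are placeholders):
#   addr=$(ip route get 8.8.8.8 | local_src_addr)
#   exec ./executable --hpx:agas=rootnode:7910 --hpx:hpx="$addr":7910
```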

@hkaiser
Member

hkaiser commented Jul 29, 2015

@biddisco Any news on this?

@hkaiser
Member

hkaiser commented Sep 27, 2015

@biddisco ping? Could you show me what you have, please? I'd like to wrap this up asap.

@biddisco
Contributor Author

I will need to get back to you on this at the end of this week or possibly next week as I'm on travel and not able to help right now.

@hkaiser hkaiser modified the milestones: 0.9.11, 0.9.12 Nov 12, 2015
@hkaiser
Member

hkaiser commented Mar 14, 2016

@biddisco #2028 adds a test case demonstrating how to conveniently launch a new HPX locality which then automatically connects back to the launching locality during startup.

@hkaiser
Member

hkaiser commented May 1, 2016

This has been fixed recently, so I'm going to close it. Please reopen if necessary.

@hkaiser hkaiser closed this as completed May 1, 2016