heartbeat example fails on separate nodes #1632

biddisco · 2015-06-26T12:17:39Z

on monchhm05 which is ip 148.187.68.78

ip route get 8.8.8.8 | awk 'NR==1 {print $NF}'
148.187.68.78
bin/heartbeat_console -Ihpx.parcel.port=7910 (or port 7909)

on monchhm06 which is ip 148.187.68.79

ping 148.187.68.78 is ok
PING 148.187.68.78 (148.187.68.78) 56(84) bytes of data.
64 bytes from 148.187.68.78: icmp_seq=1 ttl=64 time=0.168 ms

bin/heartbeat -Ihpx.parcel.port=7909 --hpx:hpx=148.187.68.78:7909
heartbeat: /mnt/lnec/biddisco/src/spinmaster/hpx/hpx/exception.hpp:465: \
    hpx::exception::exception(hpx::error, const char*, hpx::throwmode): \
    Assertion `e >= success && e < last_error' failed.

Tried various combinations of port in commandline

The stack trace shows that hpx::parcelset::policies::tcp::connection_handler::do_run() is failing:

#0  0x00002aaaaed77625 in raise () from /lib64/libc.so.6
#1  0x00002aaaaed78e05 in abort () from /lib64/libc.so.6
#2  0x00002aaaaed7074e in __assert_fail_base () from /lib64/libc.so.6
#3  0x00002aaaaed70810 in __assert_fail () from /lib64/libc.so.6
#4  0x000000000067158d in hpx::exception::exception(hpx::error, char const*, hpx::throwmode) ()
    at /mnt/lnec/biddisco/src/spinmaster/hpx/hpx/exception.hpp:465
#5  0x00002aaaad2273ec in hpx::error_code::error_code(hpx::error, hpx::throwmode) () at /mnt/lnec/biddisco/src/spinmaster/hpx/hpx/exception.hpp:1473
#6  0x00002aaaad2270d3 in hpx::make_error_code(hpx::error, hpx::throwmode) () at /mnt/lnec/biddisco/src/spinmaster/hpx/hpx/exception.hpp:373
#7  0x00002aaaad22710f in hpx::exception::exception(hpx::error) () at /mnt/lnec/biddisco/src/spinmaster/hpx/hpx/exception.hpp:438
#8  0x00002aaaad226665 in hpx::exception_list::add(boost::exception_ptr const&) () at /mnt/lnec/biddisco/src/spinmaster/hpx/src/exception_list.cpp:132
#9  0x00002aaaad98bce1 in hpx::parcelset::policies::tcp::connection_handler::do_run() ()
    at /mnt/lnec/biddisco/src/spinmaster/hpx/plugins/parcelport/tcp/connection_handler_tcp.cpp:98


hkaiser · 2015-06-26T16:35:31Z

I think this should work:

node0:
    bin/heartbeat_console -Ihpx.parcel.port=7910

node1:
    bin/heartbeat --hpx:agas=node0:7910

hkaiser · 2015-06-26T16:38:07Z

Also, please post the output of --hpx:dump-config for both executables here.

biddisco · 2015-06-26T18:12:28Z

Here's the output of the console. The heartbeat worker just hangs when I run

spinmaster/bin/heartbeat --hpx:agas=148.187.68.111:7910 --hpx:dump-config

environment is identical except I'm on node 39 instead of 38

spinmaster/bin/heartbeat_console  -Ihpx.parcel.port=7910 --hpx:dump-config

Configuration after runtime start:

============================
  [application]
  [hpx]
    'cmd_line' : 'spinmaster/bin/heartbeat_console -Ihpx.parcel.port=7910 --hpx:dump-config'
    'component_path' : '$[hpx.location]:$[system.executable_prefix]' -> '/mnt/lnec/biddisco/build/spinmaster:/mnt/lnec/biddisco/build/spinmaster'
    'component_path_suffixes' : '/lib/hpx:/bin/hpx'
    'cores' : '20'
    'finalize_wait_time' : '-1.0'
    'first_pu' : '0'
    'first_used_core' : '0'
    'localities' : '1'
    'locality' : '0'
    'location' : '$[system.prefix]' -> '/mnt/lnec/biddisco/build/spinmaster'
    'lock_detection' : '1'
    'master_ini_path' : '$[hpx.location]:$[system.executable_prefix]/' -> '/mnt/lnec/biddisco/build/spinmaster:/mnt/lnec/biddisco/build/spinmaster/'
    'master_ini_path_suffixes' : '/share/hpx-0.9.11:/../share/hpx-0.9.11'
    'minimal_deadlock_detection' : '1'
    'os_threads' : '1'
    'program_name' : 'spinmaster/bin/heartbeat_console'
    'reconstructed_cmd_line' : 'spinmaster/bin/heartbeat_console --hpx:config --hpx:cores=all --hpx:dump-config --hpx:ini=hpx.parcel.port=7910 --runfor=600'
    'runtime_mode' : 'console'
    'scheduler' : 'local-priority'
    'shutdown_timeout' : '-1.0'
    'throw_on_held_lock' : '1'
    [agas]
      'address' : '127.0.0.1'
      'dedicated_server' : '0'
      'local_cache_size' : '256'
      'local_cache_size_per_thread' : '32'
      'max_pending_refcnt_requests' : '4096'
      'port' : '7910'
      'service_mode' : 'bootstrap'
      'use_caching' : '1'
      'use_range_caching' : '1'
    [commandline]
      'aliasing' : '1'
      'allow_unknown' : '0'
      [aliases]
        '-0' : '--hpx:node=0'
        '-1' : '--hpx:node=1'
        '-2' : '--hpx:node=2'
        '-3' : '--hpx:node=3'
        '-4' : '--hpx:node=4'
        '-5' : '--hpx:node=5'
        '-6' : '--hpx:node=6'
        '-7' : '--hpx:node=7'
        '-8' : '--hpx:node=8'
        '-9' : '--hpx:node=9'
        '-I' : '--hpx:ini'
        '-a' : '--hpx:agas'
        '-c' : '--hpx:console'
        '-h' : '--hpx:help'
        '-l' : '--hpx:localities'
        '-p' : '--hpx:app-config'
        '-q' : '--hpx:queuing'
        '-r' : '--hpx:run-agas-server'
        '-t' : '--hpx:threads'
        '-v' : '--hpx:version'
        '-w' : '--hpx:worker'
        '-x' : '--hpx:hpx'
    [components]
      'load_external' : '1'
      [adding_counter]
        'enabled' : '1'
        'name' : 'hpx'
        'static' : '1'
      [average_count_counter]
        'enabled' : '1'
        'name' : 'hpx'
        'path' : '$[hpx.location]/bin/libhpxd.so' -> '/mnt/lnec/biddisco/build/spinmaster/bin/libhpxd.so'
        'static' : '1'
      [barrier]
        'enabled' : '1'
        'name' : 'hpx'
        'path' : '$[hpx.location]/bin/libhpxd.so' -> '/mnt/lnec/biddisco/build/spinmaster/bin/libhpxd.so'
        'static' : '1'
      [component_namespace]
        'enabled' : '1'
        'name' : 'hpx'
        'static' : '1'
      [dividing_counter]
        'enabled' : '1'
        'name' : 'hpx'
        'static' : '1'
      [elapsed_time_counter]
        'enabled' : '1'
        'name' : 'hpx'
        'path' : '$[hpx.location]/bin/libhpxd.so' -> '/mnt/lnec/biddisco/build/spinmaster/bin/libhpxd.so'
        'static' : '1'
      [hpx_lcos_server_latch]
        'enabled' : '1'
        'name' : 'hpx'
        'path' : '$[hpx.location]/bin/libhpxd.so' -> '/mnt/lnec/biddisco/build/spinmaster/bin/libhpxd.so'
        'static' : '1'
      [locality_namespace]
        'enabled' : '1'
        'name' : 'hpx'
        'static' : '1'
      [max_count_counter]
        'enabled' : '1'
        'name' : 'hpx'
        'static' : '1'
      [median_count_counter]
        'enabled' : '1'
        'name' : 'hpx'
        'static' : '1'
      [min_count_counter]
        'enabled' : '1'
        'name' : 'hpx'
        'static' : '1'
      [multiplying_counter]
        'enabled' : '1'
        'name' : 'hpx'
        'static' : '1'
      [output_stream_factory]
        'enabled' : '1'
        'name' : 'hpx_iostreams'
        'static' : '1'
      [primary_namespace]
        'enabled' : '1'
        'name' : 'hpx'
        'static' : '1'
      [raw_counter]
        'enabled' : '1'
        'name' : 'hpx'
        'path' : '$[hpx.location]/bin/libhpxd.so' -> '/mnt/lnec/biddisco/build/spinmaster/bin/libhpxd.so'
        'static' : '1'
      [rolling_mean_count_counter]
        'enabled' : '1'
        'name' : 'hpx'
        'static' : '1'
      [subtracting_counter]
        'enabled' : '1'
        'name' : 'hpx'
        'static' : '1'
      [symbol_namespace]
        'enabled' : '1'
        'name' : 'hpx'
        'static' : '1'
      [variance_count_counter]
        'enabled' : '1'
        'name' : 'hpx'
        'static' : '1'
    [logging]
      'destination' : 'console'
      'format' : '(T%locality%/%hpxthread%.%hpxphase%/%hpxcomponent%) P%parentloc%/%hpxparent%.%hpxparentphase% %time%($hh:$mm.$ss.$mili) [%idx%]|\n'
      'level' : '0'
      [agas]
        'destination' : 'file(hpx.agas.$[system.pid].log)' -> 'file(hpx.agas.18550.log)'
        'format' : '(T%locality%/%hpxthread%.%hpxphase%/%hpxcomponent%) P%parentloc%/%hpxparent%.%hpxparentphase% %time%($hh:$mm.$ss.$mili) [%idx%][AGAS] |\n'
        'level' : '-1'
      [application]
        'destination' : 'console'
        'format' : '(T%locality%/%hpxthread%.%hpxphase%/%hpxcomponent%) P%parentloc%/%hpxparent%.%hpxparentphase% %time%($hh:$mm.$ss.$mili) [%idx%] [APP] |\n'
        'level' : '-1'
      [console]
        'destination' : 'file(hpx.$[system.pid].log)' -> 'file(hpx.18550.log)'
        'format' : '|'
        'level' : '$[hpx.logging.level]' -> '0'
        [agas]
          'destination' : 'file(hpx.agas.$[system.pid].log)' -> 'file(hpx.agas.18550.log)'
          'format' : '|'
          'level' : '$[hpx.logging.agas.level]' -> '-1'
        [application]
          'destination' : 'file(hpx.application.$[system.pid].log)' -> 'file(hpx.application.18550.log)'
          'format' : '|'
          'level' : '$[hpx.logging.application.level]' -> '-1'
        [debuglog]
          'destination' : 'file(hpx.debuglog.$[system.pid].log)' -> 'file(hpx.debuglog.18550.log)'
          'format' : '|'
          'level' : '$[hpx.logging.debuglog.level]' -> '-1'
        [parcel]
          'destination' : 'file(hpx.parcel.$[system.pid].log)' -> 'file(hpx.parcel.18550.log)'
          'format' : '|'
          'level' : '$[hpx.logging.parcel.level]' -> '-1'
        [timing]
          'destination' : 'file(hpx.timing.$[system.pid].log)' -> 'file(hpx.timing.18550.log)'
          'format' : '|'
          'level' : '$[hpx.logging.timing.level]' -> '-1'
      [debuglog]
        'destination' : 'console'
        'format' : '(T%locality%/%hpxthread%.%hpxphase%/%hpxcomponent%) P%parentloc%/%hpxparent%.%hpxparentphase% %time%($hh:$mm.$ss.$mili) [%idx%] [DEB] |\n'
        'level' : '-1'
      [parcel]
        'destination' : 'file(hpx.parcel.$[system.pid].log)' -> 'file(hpx.parcel.18550.log)'
        'format' : '(T%locality%/%hpxthread%.%hpxphase%/%hpxcomponent%) P%parentloc%/%hpxparent%.%hpxparentphase% %time%($hh:$mm.$ss.$mili) [%idx%][  PT] |\n'
        'level' : '-1'
      [timing]
        'destination' : 'console'
        'format' : '(T%locality%/%hpxthread%.%hpxphase%/%hpxcomponent%) P%parentloc%/%hpxparent%.%hpxparentphase% %time%($hh:$mm.$ss.$mili) [%idx%] [TIM] |\n'
        'level' : '-1'
    [parcel]
      'address' : '127.0.0.1'
      'array_optimization' : '1'
      'async_serialization' : '1'
      'bootstrap' : 'tcp'
      'enable_security' : '0'
      'endian_out' : 'little'
      'max_connections' : '512'
      'max_connections_per_locality' : '4'
      'max_message_size' : '1000000000'
      'max_outbound_message_size' : '1000000'
      'message_handlers' : '0'
      'port' : '7910'
      'zero_copy_optimization' : '$[hpx.parcel.array_optimization]' -> '1'
      [tcp]
        'array_optimization' : '$[hpx.parcel.array_optimization]' -> '1'
        'async_serialization' : '$[hpx.parcel.async_serialization]' -> '1'
        'enable' : '1'
        'enable_security' : '$[hpx.parcel.enable_security]' -> '0'
        'max_connections' : '$[hpx.parcel.max_connections]' -> '512'
        'max_connections_per_locality' : '$[hpx.parcel.max_connections_per_locality]' -> '4'
        'max_message_size' : '$[hpx.parcel.max_message_size]' -> '1000000000'
        'max_outbound_message_size' : '$[hpx.parcel.max_outbound_message_size]' -> '1000000'
        'name' : 'hpx'
        'parcel_pool_size' : '$[hpx.threadpools.parcel_pool_size]' -> '2'
        'path' : '/mnt/lnec/biddisco/build/spinmaster/lib/hpx:'
        'priority' : '1'
        'zero_copy_optimization' : '$[hpx.parcel.zero_copy_optimization]' -> '1'
    [stacks]
      'huge_size' : '0x2000000'
      'large_size' : '0x0200000'
      'medium_size' : '0x0020000'
      'small_size' : '0x10000'
      'use_guard_pages' : '1'
    [threadpools]
      'io_pool_size' : '2'
      'parcel_pool_size' : '2'
      'timer_pool_size' : '2'
  [system]
    'executable_prefix' : '/mnt/lnec/biddisco/build/spinmaster'
    'pid' : '18550'
    'prefix' : '/mnt/lnec/biddisco/build/spinmaster'
============================
Heartbeat Console, waiting for 600[s].

hkaiser · 2015-06-26T18:56:58Z

Let's make everything explicit:

node0:
    bin/heartbeat_console --hpx:agas=node0:7910 --hpx:hpx=node0:7910

node1:
    bin/heartbeat --hpx:agas=node0:7910 --hpx:hpx=node1:7910

This works for me.

sithhell · 2015-06-26T19:29:21Z

Looking at your command line on monchhm06 (which is ip 148.187.68.79), you specified the --hpx:hpx=148.187.68.78:7909 command line option. --hpx:hpx is used to bind the local parcelport to a certain ip. You specified the remote address, which should go to --hpx:agas.

biddisco · 2015-06-27T00:03:00Z

sorry, only had a couple of minutes on airport wifi and didn't test well. Will rerun when I get nodes back.

biddisco · 2015-06-27T00:08:37Z

only one node available so node 0 is login node ip addr 148.187.68.37

spinmaster/bin/heartbeat_console --hpx:agas=148.187.68.37:7910 --hpx:hpx=148.187.68.37:7910
Heartbeat Console, waiting for 600[s].

and node 1 is monchlm35

spinmaster/bin/heartbeat --hpx:agas=148.187.68.37:7910 --hpx:hpx=148.187.68.37:7910
heartbeat: /mnt/lnec/biddisco/src/spinmaster/hpx/hpx/exception.hpp:465: \
    hpx::exception::exception(hpx::error, const char*, hpx::throwmode): \
    Assertion `e >= success && e < last_error' failed.
Aborted

hkaiser · 2015-06-27T01:02:40Z

@biddisco locality zero uses the same ip-address/port for both the parcelport and the agas (the agas option by default uses the same as the parcelport, so you don't really need to specify it). The connecting locality uses the same agas address as locality 0 but its own parcelport address, i.e. an address on the node where the connecting locality lives. Please look carefully:

node0:
    bin/heartbeat_console --hpx:agas=node0:7910 --hpx:hpx=node0:7910

node1:
    bin/heartbeat --hpx:agas=node0:7910 --hpx:hpx=node1:7910
                                                  ^^^^^

You were using the same ip-address/port for all of the parameters, which can't work.

IOW, the parcelport address is where a locality is listening for incoming parcels, while the agas option tells the locality where to send the initial (connecting) parcel to.
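
To make the distinction concrete, here is a minimal Python sketch (not part of HPX) that assembles the two command lines from the rule above. The helper name `launch_args` and the host names are hypothetical; only the `--hpx:agas` and `--hpx:hpx` options and the port 7910 come from this thread.

```python
def launch_args(executable, agas_host, local_host, port=7910):
    """Build an HPX command line for one locality (hypothetical helper).

    agas_host  -- node where locality 0 (the console) listens
    local_host -- node this locality runs on; its parcelport binds here
    """
    return [
        executable,
        f"--hpx:agas={agas_host}:{port}",  # where to send the initial parcel
        f"--hpx:hpx={local_host}:{port}",  # where this locality listens
    ]


# Locality 0: both options point at its own node.
console = launch_args("bin/heartbeat_console", "node0", "node0")
# Connecting locality: same AGAS address, but its own parcelport address.
worker = launch_args("bin/heartbeat", "node0", "node1")

print(" ".join(console))
print(" ".join(worker))
```

The only argument that differs between localities is `local_host`, which is exactly the value that was wrong in the failing runs.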

biddisco · 2015-06-27T07:56:21Z

node0
spinmaster/bin/heartbeat_console --hpx:agas=148.187.68.74:7910 --hpx:hpx=148.187.68.74:7910
node1
spinmaster/bin/heartbeat --hpx:agas=148.187.68.74:7910 --hpx:hpx=148.187.68.75:7910

it works.

Please accept my apologies for misunderstanding and being useless.

hkaiser · 2015-06-27T13:18:07Z

@biddisco I think we can do better than having to fully spell out things on the command line. Let's keep our eyes open for possible default settings which would allow us to be less verbose.

hkaiser · 2015-06-27T13:22:15Z

Also, we need to improve the error message you were receiving. It should spell things out in a comprehensible form. That would have told you that you're trying to bind a socket to a non-existing interface.
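
For illustration, the underlying failure can be reproduced outside HPX. The Python sketch below tries to bind a TCP socket to a given IP; binding to an address that no local interface owns typically fails with EADDRNOTAVAIL ("Cannot assign requested address"). The function name `can_bind` is made up for this example, and 192.0.2.1 is a documentation-only address that is assumed not to be configured on the machine.

```python
import errno
import socket


def can_bind(ip, port=0):
    """Return True if a TCP socket can be bound to ip on this machine.

    port=0 lets the OS pick a free port, so only the address is tested.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((ip, port))
            return True
        except OSError as exc:
            if exc.errno == errno.EADDRNOTAVAIL:
                # The address does not belong to any local interface.
                print(f"cannot bind to {ip}: {exc.strerror}")
                return False
            raise  # some other problem (permissions, port in use, ...)


print(can_bind("127.0.0.1"))  # loopback is normally available
print(can_bind("192.0.2.1"))  # assumed not to be a local address
```

A check of this kind at startup is one possible way to turn the failed assertion into a readable message; whether HPX does it this way is not stated here.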

biddisco · 2015-07-03T17:41:10Z

I took a look at the code in command-line handling and have implemented a few checks on start so that if we are using runtime_connect mode and the hpx ip address is 127.0.0.1, then it will reset the default IP to the correct public IP address.

This solves problems for me because when I use

    srun -n 4 executable --hpx:agas=ip:port --hpx:hpx=ip:port

I need to supply the --hpx:hpx ip address on the command line, which is different for each node started and not known before running srun without also querying the slurm environment. That would mean calling a script to get the host and then running the executable from the script, which is tedious.

If you allow hpx to use the hostname of 127.0.0.1 (which it currently does) then it works if two localities are on the same node, but if one or more are separate, then they send 127.0.0.1 to the root node and then deadlocks take place as the root node cannot find the real remote host. The patch I've implemented will not always work if multiple ip addresses exist for the node and they do not support tcp, but it will at least choose a default one which works most of the time. This allows one to run

    srun -n 4 executable --hpx:agas=rootnode:port

(omit hpx:hpx as the default public IP is now discovered)
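
One common way to discover such a default address, sketched here in Python, is to ask the OS which local address it would use to reach an outside host, much like the `ip route get 8.8.8.8` command at the top of this issue. This is only an illustration and not necessarily what the patch does; the function name `default_public_ip` and the choice of 8.8.8.8 as the probe target are assumptions.

```python
import socket


def default_public_ip(probe_host="8.8.8.8", probe_port=53):
    """Guess the local IP used for outbound traffic (hypothetical helper).

    Connecting a UDP socket sends no packets; it only makes the OS pick
    a route, and therefore a local source address, for the probe host.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        try:
            sock.connect((probe_host, probe_port))
            return sock.getsockname()[0]
        except OSError:
            # No usable route (e.g. an isolated node): fall back to loopback.
            return "127.0.0.1"


ip = default_public_ip()
print(f"--hpx:hpx={ip}:7910")
```

On a node with several interfaces this picks only the one on the default route, which may not be the one the other localities can reach, so it shares the limitation described above.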

I will submit a patch via PR soon. (more testing required).

hkaiser · 2015-07-29T18:25:21Z

@biddisco Any news on this?

hkaiser · 2015-09-27T17:07:13Z

@biddisco ping? Could you show me what you have, please? I'd like to wrap this up asap.

biddisco · 2015-09-27T21:11:16Z

I will need to get back to you on this at the end of this week or possibly next week as I'm on travel and not able to help right now.

hkaiser · 2016-03-14T22:04:02Z

@biddisco #2028 adds a test case demonstrating how to conveniently launch a new HPX locality which then automatically connects back to the launching locality during startup.

hkaiser · 2016-05-01T12:14:34Z

This has been fixed recently, so I'm going to close it. Please reopen if necessary.

hkaiser added type: defect category: parcel transport affecting: CSCS labels Jun 26, 2015

hkaiser added this to the 0.9.11 milestone Jun 26, 2015

hkaiser self-assigned this Jun 26, 2015

biddisco closed this as completed Jun 27, 2015

biddisco reopened this Jul 3, 2015

hkaiser modified the milestones: 0.9.11, 0.9.12 Nov 12, 2015

hkaiser closed this as completed May 1, 2016
