heartbeat example fails on separate nodes #1632
I think this should work:
|
Also, please post the output of |
Here's the output of the console; the heartbeat worker just hangs when I run
environment is identical except I'm on node 39 instead of 38
Configuration after runtime start:
|
Let's make everything explicit:
This works for me. |
From looking at your command line on monchhm06 (which is ip 148.187.68.79) you specified the |
sorry, only had a couple of minutes on airport wifi and didn't test well. Will rerun when I get nodes back. |
only one node available so node 0 is login node ip addr 148.187.68.37
and node 1 is monchlm35
|
@biddisco locality zero uses the same ip-address/port for both the parcelport and the agas (the agas option defaults to the parcelport address, so you don't really need to specify it). The connecting locality uses the same agas address as locality 0 but its own parcelport address, which must be an address on the node where the connecting locality lives. Please look carefully:
You were using the same ip-address/port for all of the parameters, which can't work. IOW, the parcelport address is where a locality listens for incoming parcels, while the agas option tells the locality where to send the initial (connecting) parcel. |
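To make the rule above concrete, here is a minimal Python sketch that assembles the address options for locality 0 and for a connecting locality. It only models the rule; it is not part of HPX. The helper `locality_args`, the `Endpoint` type, the port number 7910, and the assignment of the two IP addresses to roles are assumptions chosen for illustration. The option names are `--hpx:hpx` (mentioned in this thread) and `--hpx:agas` (the agas option referred to above).

```python
from typing import List, NamedTuple, Optional


class Endpoint(NamedTuple):
    """An ip-address/port pair (hypothetical helper type)."""
    host: str
    port: int

    def __str__(self) -> str:
        return f"{self.host}:{self.port}"


def locality_args(agas: Endpoint, parcelport: Optional[Endpoint] = None) -> List[str]:
    """Build the address options for one locality.

    parcelport=None models locality 0, which listens on the AGAS
    address itself, so only --hpx:hpx is needed.
    Otherwise the locality is a connecting one: it listens on its own
    parcelport address and sends its first parcel to the AGAS address.
    """
    if parcelport is None:
        return [f"--hpx:hpx={agas}"]
    if parcelport == agas:
        # This is the mistake described above: every parameter
        # pointing at the same ip-address/port.
        raise ValueError(
            "a connecting locality needs its own parcelport address, "
            f"not the AGAS address {agas}"
        )
    return [f"--hpx:hpx={parcelport}", f"--hpx:agas={agas}"]


if __name__ == "__main__":
    # Addresses taken from the thread; which node is locality 0 and the
    # port number are assumptions made for this example.
    root = Endpoint("148.187.68.78", 7910)
    worker = Endpoint("148.187.68.79", 7910)
    print("locality 0:", " ".join(locality_args(root)))
    print("connecting:", " ".join(locality_args(root, worker)))
```

The check compares full endpoints, so two localities on the same node remain valid as long as they use different ports.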
it works. Please accept my apologies for misunderstanding and being useless. |
@biddisco I think we can do better than having to fully spell out things on the command line. Let's keep our eyes open for possible default settings that would allow us to be less verbose. |
Also, we need to improve the error message you were receiving. It should spell things out in a comprehensible form. That would have told you that you're trying to bind a socket to a non-existing interface. |
I took a look at the command-line handling code and have implemented a few checks on start, so that if we are using runtime_connect mode and the hpx ip address is 127.0.0.1, it will reset the default IP to the correct public IP address. This solves problems for me because when I use
I need to supply the --hpx:hpx ip address on the command line which is
If you allow hpx to use the hostname of 127.0.0.1 (which it currently does) then it works if two localities are on the same node, but if one or more are separate, then they send 127.0.0.1 to the root node and then deadlocks take place as the root node cannot find the real remote host. The patch I've implemented will not always work if multiple ip addresses exist for the node and they do not support tcp, but it will at least choose a default one which works most of the time. This allows one to run
(omit hpx:hpx as the default public IP is now discovered) I will submit a patch via PR soon. (more testing required). |
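The check described in this comment can be sketched roughly as follows. This is not the actual patch, which had not been posted at this point; it is a small Python model of the stated behaviour. The function name `resolve_listen_address`, its parameters, and the idea of passing in a list of candidate interface addresses are all assumptions made for illustration. How the candidates are enumerated on a real node is left out.

```python
import ipaddress
from typing import Iterable


def resolve_listen_address(configured: str,
                           candidates: Iterable[str],
                           connect_mode: bool) -> str:
    """Pick the address a locality should advertise.

    If the locality is connecting to a running application and its
    configured address is a loopback address, fall back to the first
    non-loopback candidate. In every other case keep the configured
    address unchanged.
    """
    if not connect_mode:
        return configured
    if not ipaddress.ip_address(configured).is_loopback:
        # An explicit, routable address was given: respect it.
        return configured
    for addr in candidates:
        if not ipaddress.ip_address(addr).is_loopback:
            # First usable address wins. As noted above, this may be
            # the wrong choice on nodes with several interfaces.
            return addr
    # Nothing better is available; keep the loopback address.
    return configured


if __name__ == "__main__":
    interfaces = ["127.0.0.1", "148.187.68.79"]  # example values
    print(resolve_listen_address("127.0.0.1", interfaces, connect_mode=True))
```

Taking the first non-loopback address mirrors the caveat in the comment: it is a default that works most of the time, not a guarantee that the chosen interface supports tcp.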
@biddisco Any news on this? |
@biddisco ping? Could you show me what you have, please? I'd like to wrap this up asap. |
I will need to get back to you on this at the end of this week or possibly next week as I'm on travel and not able to help right now. |
This has been fixed recently, so I'm going to close it. Please reopen if necessary. |
on monchhm05 which is ip 148.187.68.78
on monchhm06 which is ip 148.187.68.79
Tried various combinations of ports on the command line.
The stack trace shows that tcp::parcelport_handler::do_run is failing.