Generating access files in advance #595
Conversation
Generally looks ok, but removing PID from the server info is an unwelcome regression :(
In the current implementation, it is not a problem to put the PID back. I just want to keep the access file limited to the connection information, without details about the running instance.
Ok, adding the PID to the server info so that it can be printed in the CLI would be nice.
The PID and start time have been returned to "server info" (not into access.json).
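A quick way to check this is the server info command; the command itself exists in the HyperQueue CLI, but the exact output fields are not shown in this thread:

```console
# Ask the running server for its info; with this change the output should
# again include the PID and the start time.
$ hq --server-dir=./hq server info
```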
Awesome! I am going to try this out, and can report back.
okay, so it looks like the generate command needs to run on a server with the same hostname that it will be deployed on?
So if I want a fully qualified name, I'll need the server itself to report FQDNs as well?
oh just kidding, I see this!
Will try that!
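For context, a minimal sketch of generating the access file in advance with a fully qualified hostname; the subcommand and flag names below are assumptions based on this PR's description, so check cloud.md for the actual syntax:

```console
# Pre-generate the access file with the FQDN that clients and workers will
# later connect to (subcommand and flag names are assumed, not confirmed).
$ hq server generate-access ./hq/access.json \
    --host=hyperqueue-sample-server-0-0.hq-service.hyperqueue-operator.svc.cluster.local \
    --client-port=6789 \
    --worker-port=1234
```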
okay one tiny tweak and (I think?) it might work - it looks like I can only define one host:
However, the host and worker(s) have different addresses. Can we specify a separate worker host (akin to the port)? Unless I'm setting this up incorrectly? Basically, we have entirely different nodes that will act as workers, and then a central server that starts everything up (and that the workers connect to!)
And assuming the model is different, and the server/worker ports are supposed to be running on the main node (for the workers to connect to), when I start it there it looks like this:
but then when starting the worker node (different hostname, the same but with -worker- instead of -server-) I get:

# hq --server-dir=./hq worker start
2023-06-17T01:43:09Z INFO Detected 16448925696B of memory (15.32 GiB)
2023-06-17T01:43:09Z INFO Starting hyperqueue worker 0.15.0-dev
2023-06-17T01:43:09Z INFO Connecting to: hyperqueue-sample-server-0-0.hq-service.hyperqueue-operator.svc.cluster.local:6789
2023-06-17T01:43:09Z INFO Listening on port 35213
2023-06-17T01:43:09Z INFO Connecting to server (candidate addresses = [10.244.0.61:6789])
Error: Authentication failed: Expected peer role server, got hq-server

It seems to want a peer role server? The worker is definitely hitting the main server, because I see:

2023-06-17T01:43:09Z ERROR Client error: Tako error: Error: Authentication failed: Expected peer role hq-client, got worker
2023-06-17T01:44:20Z ERROR Client error: Tako error: Error: Authentication failed: Expected peer role hq-client, got worker
2023-06-17T01:44:31Z ERROR Client error: Tako error: Error: Authentication failed: Expected peer role hq-client, got worker

I could try random stuff but I will wait for you to advise! Thank you!
There are two ports, one for connecting clients and one for connecting workers. The addresses to which workers and clients connect may differ, but they have to point to the same physical machine. I will add configuration options (--worker-host, --client-host). If you want to try it now, you can just manually edit the hostnames in the generated access files. Use case: this allows client connections from an outer network (the server has a public name) and worker connections only within the inner network.
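Until those options land, a hand-edited access file could look roughly like this; the path and field names are assumptions, and only the idea of separate client/worker hostnames comes from the comment above:

```console
$ cat ./hq/access.json
{
  "client": { "host": "hq-public.example.com", "port": 6789, "secret_key": "..." },
  "worker": { "host": "hq-service.hyperqueue-operator.svc.cluster.local", "port": 1234, "secret_key": "..." }
}
```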
I can try again, but I'm pretty sure I got the above error about wanting "peer role hq-client, got worker" when I changed the worker hostname manually for both.
oh, I think it works! I must not have done the right combination of things yesterday! Okay, so here is my main server, and I think this should say that clients connect to 6789 and workers to 1234?
Start it up like:
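(The exact invocation is not captured in this thread; a minimal sketch, assuming the pre-generated access file lives in ./hq, would be:)

```console
# Start the server against the directory that already holds the
# pre-generated access file.
$ hq --server-dir=./hq server start
```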
And here is from my worker:
Works!
and I did (from the server):
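(The exact command is not captured here; a minimal sketch of such a submission, with a placeholder payload, would be:)

```console
# Submit a trivial test job from the node that has the access file.
$ hq --server-dir=./hq submit -- echo "hello from hyperqueue"
```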
and I think it ran?
AH and I just found the output on the worker node!
This is great! I did this run manually but next I'll have these steps be fully automated...
okay we are in business! I added a retry loop to the worker, because often it can come up before the main server (and not be ready, and then it doesn't retry):
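A minimal sketch of such a retry loop as a shell wrapper around the worker start; the interval and message are arbitrary:

```sh
# Retry until the worker manages to connect, in case it comes up
# before the main server is ready.
until hq --server-dir=./hq worker start; do
    echo "server not ready yet, retrying in 5s..."
    sleep 5
done
```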
But then we have them both running! I built this branch into a custom container base since I couldn't wget the binary to just use, but next (a little later today after a bit more sleep) I will try running LAMMPS (this will work with MPI too?). If that is all good - then I say ship it! 🚀 This is really exciting!
Thank you for implementing this!
Solves #592.
Read cloud.md for usage.