
Generating access files in advance #595

Merged: 11 commits merged into main from access on Jun 19, 2023

Conversation

@spirali (Collaborator) commented Jun 14, 2023

Solves #592.
Read cloud.md for usage.
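
For reference, a minimal sketch of the intended flow (the generate subcommand name and its flags are assumptions pieced together from this thread; cloud.md is the authoritative reference):

# 1. Generate the access file ahead of time, before any server runs
#    (subcommand and flag names assumed):
hq server generate-access ./hq/access.json --client-port=6789 --worker-port=1234

# 2. Copy ./hq/access.json to the server and worker nodes.

# 3. Start the server with the pre-generated file (used later in this thread):
hq server start --access-file=./hq/access.json

# 4. Point workers at a directory containing the same file:
hq --server-dir=./hq worker start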

@spirali requested a review from Kobzol on Jun 14, 2023 17:46
@Kobzol (Collaborator) left a review comment

Generally looks ok, but removing PID from the server info is an unwelcome regression :(

Review comments (outdated, resolved) on:
- crates/hyperqueue/src/client/commands/server.rs (2 comments)
- tests/test_server.py
- crates/hyperqueue/src/client/output/cli.rs
- crates/hyperqueue/src/client/output/json.rs
@spirali (Collaborator, Author) commented Jun 16, 2023

In the current implementation, it is not a problem to put the PID back. I just wanted to leave the access file with only the connection information, without details about the running instance.

@Kobzol (Collaborator) commented Jun 16, 2023

OK, adding the PID to the server info so that it can be printed in the CLI would be nice.

@spirali requested a review from Kobzol on Jun 16, 2023 09:34
@spirali (Collaborator, Author) commented Jun 16, 2023

The PID and start time are back in "server info" (but not in access.json).

@vsoch (Contributor) commented Jun 16, 2023

Awesome! I am going to try this out, and can report back.

@vsoch (Contributor) commented Jun 17, 2023

okay, so it looks like the generate command needs to run on a server with the same hostname that it will be deployed on?

Access token found but HQ server hyperqueue-sample-access:6789 is unreachable.
Try to (re)start the server using `hq server start`

So if I want a fully qualified name I'll need the server itself to report FQDNs as well?

@vsoch (Contributor) commented Jun 17, 2023

oh just kidding, I see this!

 --host <HOST>
          Override target host name, otherwise local hostname is used

Will try that!
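
(A sketch of what that might look like, assuming --host belongs to the same generate command and using the FQDN that appears later in this thread:)

# Override the hostname baked into the access file with an FQDN
# (generate subcommand name assumed; --host quoted from the help text above):
hq server generate-access ./hq/access.json \
    --host=hyperqueue-sample-server-0-0.hq-service.hyperqueue-operator.svc.cluster.local \
    --client-port=6789 \
    --worker-port=1234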

@vsoch (Contributor) commented Jun 17, 2023

okay, one tiny tweak and (I think?) it might work - it looks like I can only define one host:

 --host <HOST>
          Override target host name, otherwise local hostname is used

However, the host and the worker(s) have different addresses. Can we specify a worker host and a client host separately (akin to the ports)? Unless I'm setting this up incorrectly? Basically, we have entirely different nodes that will act as workers, and then a central server that starts everything up (and that the workers connect to!)

@vsoch (Contributor) commented Jun 17, 2023

And assuming that the model is different and the server/worker ports are supposed to be running on the main node (for workers to connect to): when I start there, it looks like this:

root@hyperqueue-sample-server-0-0:/app#     hq server start --access-file=./hq/access.json
2023-06-17T01:41:28Z INFO No online server found, starting a new server
2023-06-17T01:41:28Z INFO Storing access file as '/root/.hq-server/001/access.json'
+------------------+-------------------------------------------------------------------------------+
| Server directory | /root/.hq-server                                                              |
| Server UID       | Lqacwy                                                                        |
| Client host      | hyperqueue-sample-server-0-0.hq-service.hyperqueue-operator.svc.cluster.local |
| Client port      | 6789                                                                          |
| Worker host      | hyperqueue-sample-server-0-0.hq-service.hyperqueue-operator.svc.cluster.local |
| Worker port      | 1234                                                                          |
| Version          | 0.15.0-dev                                                                    |
| Pid              | 425                                                                           |
| Start date       | 2023-06-17 01:41:28 UTC                                                       |
+------------------+-------------------------------------------------------------------------------+

but then, starting the worker node (different hostname, the same but with -worker- instead of -server-), I get:

# hq --server-dir=./hq worker start
2023-06-17T01:43:09Z INFO Detected 16448925696B of memory (15.32 GiB)
2023-06-17T01:43:09Z INFO Starting hyperqueue worker 0.15.0-dev
2023-06-17T01:43:09Z INFO Connecting to: hyperqueue-sample-server-0-0.hq-service.hyperqueue-operator.svc.cluster.local:6789
2023-06-17T01:43:09Z INFO Listening on port 35213
2023-06-17T01:43:09Z INFO Connecting to server (candidate addresses = [10.244.0.61:6789])
Error: Authentication failed: Expected peer role server, got hq-server

It seems to want a peer role server? The worker is definitely hitting the main server, because I see:

2023-06-17T01:43:09Z ERROR Client error: Tako error: Error: Authentication failed: Expected peer role hq-client, got worker
2023-06-17T01:44:20Z ERROR Client error: Tako error: Error: Authentication failed: Expected peer role hq-client, got worker
2023-06-17T01:44:31Z ERROR Client error: Tako error: Error: Authentication failed: Expected peer role hq-client, got worker

I could try random stuff, but I will wait for you to advise! Thank you!

@spirali (Collaborator, Author) commented Jun 17, 2023

There are two ports: one for connecting clients and one for connecting workers.

The addresses to which workers and clients connect may differ, but they have to point to the same physical machine. I will add configuration options (--worker-host, --client-host). If you want to try it now, you can manually edit the hostnames in the generated access files. Use case: this option allows client connections from an outer network (the server has a public name) while worker connections stay within the inner network.
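
(For reference, a hypothetical invocation once those options land - the flag names follow the proposal above, the generate subcommand name is assumed from earlier in this thread, and hq.example.com is a made-up public name:)

# Hypothetical: clients connect via a public name, workers via the
# cluster-internal name; both must resolve to the same physical machine.
hq server generate-access ./hq/access.json \
    --client-host=hq.example.com \
    --worker-host=hyperqueue-sample-server-0-0.hq-service.hyperqueue-operator.svc.cluster.local \
    --client-port=6789 \
    --worker-port=1234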

@vsoch (Contributor) commented Jun 17, 2023

I can try again, but I'm pretty sure I got the above error about wanting "peer role hq-client, got worker" when I changed the worker hostname manually for both.

@vsoch (Contributor) commented Jun 17, 2023

oh, I think it works! I must not have done the right combination of things yesterday! okay, so here is my main server - I think this says that clients connect to 6789 and workers to 1234?

# cat hq/access.json 

{
  "version": "0.15.0-dev",
  "server_uid": "Lqacwy",
  "client": {
    "host": "hyperqueue-sample-server-0-0.hq-service.hyperqueue-operator.svc.cluster.local",
    "port": 6789,
    "secret_key": "92abb20c99eca5085f5b3dcbdc4e5caa00074d31f33a23bc9edd53d1254ea8e8"
  },
  "worker": {
    "host": "hyperqueue-sample-server-0-0.hq-service.hyperqueue-operator.svc.cluster.local",
    "port": 1234,
    "secret_key": "2bcc9b0a847d901a35ee23f8e1acbad98faef93671a2b99cb4b600681d98bf1b"
  }
}

Start it up like:

#     hq server start --access-file=./hq/access.json
2023-06-17T11:48:57Z INFO No online server found, starting a new server
2023-06-17T11:48:57Z INFO Storing access file as '/root/.hq-server/002/access.json'
+------------------+-------------------------------------------------------------------------------+
| Server directory | /root/.hq-server                                                              |
| Server UID       | Lqacwy                                                                        |
| Client host      | hyperqueue-sample-server-0-0.hq-service.hyperqueue-operator.svc.cluster.local |
| Client port      | 6789                                                                          |
| Worker host      | hyperqueue-sample-server-0-0.hq-service.hyperqueue-operator.svc.cluster.local |
| Worker port      | 1234                                                                          |
| Version          | 0.15.0-dev                                                                    |
| Pid              | 428                                                                           |
| Start date       | 2023-06-17 11:48:57 UTC                                                       |
+------------------+-------------------------------------------------------------------------------+

And here is from my worker:

# cat hq/access.json 

{
  "version": "0.15.0-dev",
  "server_uid": "Lqacwy",
  "client": {
    "host": "hyperqueue-sample-server-0-0.hq-service.hyperqueue-operator.svc.cluster.local",
    "port": 6789,
    "secret_key": "92abb20c99eca5085f5b3dcbdc4e5caa00074d31f33a23bc9edd53d1254ea8e8"
  },
  "worker": {
    "host": "hyperqueue-sample-server-0-0.hq-service.hyperqueue-operator.svc.cluster.local",
    "port": 1234,
    "secret_key": "2bcc9b0a847d901a35ee23f8e1acbad98faef93671a2b99cb4b600681d98bf1b"
  }
}

Works!

root@hyperqueue-sample-worker-0-0:/app# hq worker start --server-dir=./hq
2023-06-17T11:53:47Z INFO Detected 16448925696B of memory (15.32 GiB)
2023-06-17T11:53:47Z INFO Starting hyperqueue worker 0.15.0-dev
2023-06-17T11:53:47Z INFO Connecting to: hyperqueue-sample-server-0-0.hq-service.hyperqueue-operator.svc.cluster.local:1234
2023-06-17T11:53:47Z INFO Listening on port 42573
2023-06-17T11:53:47Z INFO Connecting to server (candidate addresses = [10.244.0.61:1234])
+-------------------+------------------------------------+
| Worker ID         | 1                                  |
| Hostname          | hyperqueue-sample-worker-0-0       |
| Started           | "2023-06-17T11:53:47.659056770Z"   |
| Data provider     | hyperqueue-sample-worker-0-0:42573 |
| Working directory | /tmp/hq-worker.H6L7KkcvvKAu/work   |
| Logging directory | /tmp/hq-worker.H6L7KkcvvKAu/logs   |
| Heartbeat         | 8s                                 |
| Idle timeout      | None                               |
| Resources         | cpus: 8                            |
|                   | mem: 15.32 GiB                     |
| Time Limit        | None                               |
| Process pid       | 463                                |
| Group             | default                            |
| Manager           | None                               |
| Manager Job ID    | N/A                                |
+-------------------+------------------------------------+

and I did (from the server):

hq submit echo hello world

and I think it ran?

# hq job list --all
+----+------+----------+-------+
| ID | Name | State    | Tasks |
+----+------+----------+-------+
|  1 | echo | FINISHED | 1     |
+----+------+----------+-------+

@vsoch (Contributor) commented Jun 17, 2023

AH and I just found the output on the worker node!

# cat job-1/0.stdout 
hello world
root@hyperqueue-sample-worker-0-0:/app# 

This is great! I did this run manually but next I'll have these steps be fully automated...

@vsoch (Contributor) commented Jun 17, 2023

okay, we are in business! I added a retry loop to the worker, because it can often come up before the main server (which isn't ready yet, and the worker doesn't retry on its own):

# Keep trying until we connect
until hq --server-dir=./hq worker start
do
    echo "Trying again to connect to main server..."
    sleep 2
done

But then we have them both running!

[image: screenshot of the server and worker both running]

I built this branch into a custom container base since I couldn't wget a binary to just use, but next (a little later today, after a bit more sleep) I will try running LAMMPS (this will work with MPI too?). If that is all good - then I say ship it! 🚀

This is really exciting!

@vsoch (Contributor) commented Jun 17, 2023

okay, it's all good! I was able to submit a job with --wait and then specify a --log file so I can cat it at the end, and we are in business!
[image: screenshot of the job output]
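
(Roughly what that flow looks like - --wait and --log are as described above; the LAMMPS invocation and log path are just illustrative:)

# Submit, wait for completion, and write a log file we can cat afterwards:
hq submit --wait --log=/tmp/job.log lmp -in in.lammps
cat /tmp/job.log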

I say ship it - right now I'm building from a custom container with this branch, but after that we should be able to use the release here. Thank you so much for doing this! We are planning experiments that look at job managers in Kubernetes and (with this update) there is a very good chance we can include hq!

@vsoch mentioned this pull request on Jun 19, 2023
@spirali merged commit db011ed into main on Jun 19, 2023 (6 checks passed)
@spirali deleted the access branch on Jun 19, 2023 12:57
@vsoch (Contributor) commented Jun 19, 2023

Thank you for implementing this!
