
Generating access files in advance #595

Merged: 11 commits merged into main from access on Jun 19, 2023

Conversation

@spirali (Collaborator) commented Jun 14, 2023

Solves #592.
Read cloud.md for usage.
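
For reference, a minimal sketch of the intended flow (the generate subcommand name and its flags are assumptions pieced together from this thread; cloud.md is the authoritative reference):

# 1. Generate the access file ahead of time, before any server runs
#    (subcommand and flag names assumed):
hq server generate-access ./hq/access.json --client-port=6789 --worker-port=1234

# 2. Copy ./hq/access.json to the server and worker nodes.

# 3. Start the server with the pre-generated file (used later in this thread):
hq server start --access-file=./hq/access.json

# 4. Point workers at a directory containing the same file:
hq --server-dir=./hq worker start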

@spirali requested a review from Kobzol on Jun 14, 2023 17:46
@Kobzol (Collaborator) left a review comment

Generally looks ok, but removing PID from the server info is an unwelcome regression :(

Review comments (outdated, resolved) on:
- crates/hyperqueue/src/client/commands/server.rs (2 comments)
- tests/test_server.py
- crates/hyperqueue/src/client/output/cli.rs
- crates/hyperqueue/src/client/output/json.rs
@spirali (Collaborator, Author) commented Jun 16, 2023

In the current implementation, it is not a problem to put the PID back. I just wanted to leave the access file with only the connection information, without details about the running instance.

@Kobzol (Collaborator) commented Jun 16, 2023

OK, adding the PID to the server info so that it can be printed in the CLI would be nice.

@spirali requested a review from Kobzol on Jun 16, 2023 09:34
@spirali (Collaborator, Author) commented Jun 16, 2023

The PID and start time are back in "server info" (but not in access.json).

@vsoch (Contributor) commented Jun 16, 2023

Awesome! I am going to try this out, and can report back.

@vsoch (Contributor) commented Jun 17, 2023

okay, so it looks like the generate command needs to run on a server with the same hostname that it will be deployed on?

Access token found but HQ server hyperqueue-sample-access:6789 is unreachable.
Try to (re)start the server using `hq server start`

So if I want a fully qualified name I'll need the server itself to report FQDNs as well?

@vsoch (Contributor) commented Jun 17, 2023

oh just kidding, I see this!

 --host <HOST>
          Override target host name, otherwise local hostname is used

Will try that!
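
(A sketch of what that might look like, assuming --host belongs to the same generate command and using the FQDN that appears later in this thread:)

# Override the hostname baked into the access file with an FQDN
# (generate subcommand name assumed; --host quoted from the help text above):
hq server generate-access ./hq/access.json \
    --host=hyperqueue-sample-server-0-0.hq-service.hyperqueue-operator.svc.cluster.local \
    --client-port=6789 \
    --worker-port=1234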

@vsoch (Contributor) commented Jun 17, 2023

okay, one tiny tweak and (I think?) it might work - it looks like I can only define one host:

 --host <HOST>
          Override target host name, otherwise local hostname is used

However, the host and the worker(s) have different addresses. Can we specify a worker host and a client host separately (akin to the ports)? Unless I'm setting this up incorrectly? Basically, we have entirely different nodes that will act as workers, and then a central server that starts everything up (and that the workers connect to!)

@vsoch (Contributor) commented Jun 17, 2023

And assuming that the model is different and the server/worker ports are supposed to be running on the main node (for workers to connect to): when I start there, it looks like this:

root@hyperqueue-sample-server-0-0:/app#     hq server start --access-file=./hq/access.json
2023-06-17T01:41:28Z INFO No online server found, starting a new server
2023-06-17T01:41:28Z INFO Storing access file as '/root/.hq-server/001/access.json'
+------------------+-------------------------------------------------------------------------------+
| Server directory | /root/.hq-server                                                              |
| Server UID       | Lqacwy                                                                        |
| Client host      | hyperqueue-sample-server-0-0.hq-service.hyperqueue-operator.svc.cluster.local |
| Client port      | 6789                                                                          |
| Worker host      | hyperqueue-sample-server-0-0.hq-service.hyperqueue-operator.svc.cluster.local |
| Worker port      | 1234                                                                          |
| Version          | 0.15.0-dev                                                                    |
| Pid              | 425                                                                           |
| Start date       | 2023-06-17 01:41:28 UTC                                                       |
+------------------+-------------------------------------------------------------------------------+

but then, starting the worker node (different hostname, the same but with -worker- instead of -server-), I get:

# hq --server-dir=./hq worker start
2023-06-17T01:43:09Z INFO Detected 16448925696B of memory (15.32 GiB)
2023-06-17T01:43:09Z INFO Starting hyperqueue worker 0.15.0-dev
2023-06-17T01:43:09Z INFO Connecting to: hyperqueue-sample-server-0-0.hq-service.hyperqueue-operator.svc.cluster.local:6789
2023-06-17T01:43:09Z INFO Listening on port 35213
2023-06-17T01:43:09Z INFO Connecting to server (candidate addresses = [10.244.0.61:6789])
Error: Authentication failed: Expected peer role server, got hq-server

It seems to want a peer role server? The worker is definitely hitting the main server, because I see:

2023-06-17T01:43:09Z ERROR Client error: Tako error: Error: Authentication failed: Expected peer role hq-client, got worker
2023-06-17T01:44:20Z ERROR Client error: Tako error: Error: Authentication failed: Expected peer role hq-client, got worker
2023-06-17T01:44:31Z ERROR Client error: Tako error: Error: Authentication failed: Expected peer role hq-client, got worker

I could try random stuff, but I will wait for you to advise! Thank you!

@spirali (Collaborator, Author) commented Jun 17, 2023

There are two ports: one for connecting clients and one for connecting workers.

The addresses to which workers and clients connect may differ, but they have to point to the same physical machine. I will add configuration options (--worker-host, --client-host). If you want to try it now, you can manually edit the hostnames in the generated access files. Use case: this option allows client connections from an outer network (the server has a public name) while worker connections stay within the inner network.
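
(For reference, a hypothetical invocation once those options land - the flag names follow the proposal above, the generate subcommand name is assumed from earlier in this thread, and hq.example.com is a made-up public name:)

# Hypothetical: clients connect via a public name, workers via the
# cluster-internal name; both must resolve to the same physical machine.
hq server generate-access ./hq/access.json \
    --client-host=hq.example.com \
    --worker-host=hyperqueue-sample-server-0-0.hq-service.hyperqueue-operator.svc.cluster.local \
    --client-port=6789 \
    --worker-port=1234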

@vsoch (Contributor) commented Jun 17, 2023

I can try again, but I'm pretty sure I got the above error about wanting "peer role hq-client, got worker" when I changed the worker hostname manually for both.

@vsoch (Contributor) commented Jun 17, 2023

oh, I think it works! I must not have done the right combination of things yesterday! okay, so here is my main server - I think this says that clients connect to 6789 and workers to 1234?

# cat hq/access.json 

{
  "version": "0.15.0-dev",
  "server_uid": "Lqacwy",
  "client": {
    "host": "hyperqueue-sample-server-0-0.hq-service.hyperqueue-operator.svc.cluster.local",
    "port": 6789,
    "secret_key": "92abb20c99eca5085f5b3dcbdc4e5caa00074d31f33a23bc9edd53d1254ea8e8"
  },
  "worker": {
    "host": "hyperqueue-sample-server-0-0.hq-service.hyperqueue-operator.svc.cluster.local",
    "port": 1234,
    "secret_key": "2bcc9b0a847d901a35ee23f8e1acbad98faef93671a2b99cb4b600681d98bf1b"
  }
}

Start it up like:

#     hq server start --access-file=./hq/access.json
2023-06-17T11:48:57Z INFO No online server found, starting a new server
2023-06-17T11:48:57Z INFO Storing access file as '/root/.hq-server/002/access.json'
+------------------+-------------------------------------------------------------------------------+
| Server directory | /root/.hq-server                                                              |
| Server UID       | Lqacwy                                                                        |
| Client host      | hyperqueue-sample-server-0-0.hq-service.hyperqueue-operator.svc.cluster.local |
| Client port      | 6789                                                                          |
| Worker host      | hyperqueue-sample-server-0-0.hq-service.hyperqueue-operator.svc.cluster.local |
| Worker port      | 1234                                                                          |
| Version          | 0.15.0-dev                                                                    |
| Pid              | 428                                                                           |
| Start date       | 2023-06-17 11:48:57 UTC                                                       |
+------------------+-------------------------------------------------------------------------------+

And here is from my worker:

# cat hq/access.json 

{
  "version": "0.15.0-dev",
  "server_uid": "Lqacwy",
  "client": {
    "host": "hyperqueue-sample-server-0-0.hq-service.hyperqueue-operator.svc.cluster.local",
    "port": 6789,
    "secret_key": "92abb20c99eca5085f5b3dcbdc4e5caa00074d31f33a23bc9edd53d1254ea8e8"
  },
  "worker": {
    "host": "hyperqueue-sample-server-0-0.hq-service.hyperqueue-operator.svc.cluster.local",
    "port": 1234,
    "secret_key": "2bcc9b0a847d901a35ee23f8e1acbad98faef93671a2b99cb4b600681d98bf1b"
  }
}

Works!

root@hyperqueue-sample-worker-0-0:/app# hq worker start --server-dir=./hq
2023-06-17T11:53:47Z INFO Detected 16448925696B of memory (15.32 GiB)
2023-06-17T11:53:47Z INFO Starting hyperqueue worker 0.15.0-dev
2023-06-17T11:53:47Z INFO Connecting to: hyperqueue-sample-server-0-0.hq-service.hyperqueue-operator.svc.cluster.local:1234
2023-06-17T11:53:47Z INFO Listening on port 42573
2023-06-17T11:53:47Z INFO Connecting to server (candidate addresses = [10.244.0.61:1234])
+-------------------+------------------------------------+
| Worker ID         | 1                                  |
| Hostname          | hyperqueue-sample-worker-0-0       |
| Started           | "2023-06-17T11:53:47.659056770Z"   |
| Data provider     | hyperqueue-sample-worker-0-0:42573 |
| Working directory | /tmp/hq-worker.H6L7KkcvvKAu/work   |
| Logging directory | /tmp/hq-worker.H6L7KkcvvKAu/logs   |
| Heartbeat         | 8s                                 |
| Idle timeout      | None                               |
| Resources         | cpus: 8                            |
|                   | mem: 15.32 GiB                     |
| Time Limit        | None                               |
| Process pid       | 463                                |
| Group             | default                            |
| Manager           | None                               |
| Manager Job ID    | N/A                                |
+-------------------+------------------------------------+

and I did (from the server):

hq submit echo hello world

and I think it ran?

# hq job list --all
+----+------+----------+-------+
| ID | Name | State    | Tasks |
+----+------+----------+-------+
|  1 | echo | FINISHED | 1     |
+----+------+----------+-------+

@vsoch (Contributor) commented Jun 17, 2023

AH and I just found the output on the worker node!

# cat job-1/0.stdout 
hello world
root@hyperqueue-sample-worker-0-0:/app# 

This is great! I did this run manually but next I'll have these steps be fully automated...

@vsoch (Contributor) commented Jun 17, 2023

okay, we are in business! I added a retry loop to the worker, because it can often come up before the main server (which isn't ready yet, and the worker doesn't retry on its own):

# Keep trying until we connect
until hq --server-dir=./hq worker start
do
    echo "Trying again to connect to main server..."
    sleep 2
done

But then we have them both running!

[image: screenshot of the server and worker both running]

I built this branch into a custom container base since I couldn't wget a binary to just use, but next (a little later today, after a bit more sleep) I will try running LAMMPS (this will work with MPI too?). If that is all good - then I say ship it! 🚀

This is really exciting!

@vsoch (Contributor) commented Jun 17, 2023

okay, it's all good! I was able to submit a job with --wait and then specify a --log file so I can cat it at the end, and we are in business!
[image: screenshot of the job output]
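
(Roughly what that flow looks like - --wait and --log are as described above; the LAMMPS invocation and log path are just illustrative:)

# Submit, wait for completion, and write a log file we can cat afterwards:
hq submit --wait --log=/tmp/job.log lmp -in in.lammps
cat /tmp/job.log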

I say ship it - right now I'm building from a custom container with this branch, but after that we should be able to use the release here. Thank you so much for doing this! We are planning experiments that look at job managers in Kubernetes and (with this update) there is a very good chance we can include hq!

@vsoch mentioned this pull request on Jun 19, 2023
@spirali merged commit db011ed into main on Jun 19, 2023 (6 checks passed)
@spirali deleted the access branch on Jun 19, 2023 12:57
@vsoch (Contributor) commented Jun 19, 2023

Thank you for implementing this!
