
Start without shared filesystem? #592

Closed
vsoch opened this issue Jun 4, 2023 · 13 comments

@vsoch
Contributor

vsoch commented Jun 4, 2023

Hi!

I'm creating an operator for hyperqueue, and it looks like I can target a --host for server start, but not for worker start. For the worker start, the most I can do is specify a directory (which assumes a shared filesystem). I have a cluster network, but the filesystem is read only, so I wanted to ask if there are any reasonable options? Ideally we could have:

  • A way to generate an access config in advance (e.g., the access.json I saw generated in the --server-dir)
  • An ability to give the worker the network hostname of the main server to register with instead.

Really looking forward to getting this working - thanks!

@Kobzol
Collaborator

Kobzol commented Jun 4, 2023

Hi :) HQ creates a file on a shared filesystem for two reasons:

  1. Convenience (most HPC clusters use a shared filesystem) - obviously this is not a benefit here for you.
  2. Configuration - there are more parameters needed to connect to the server than just the host. There are also encryption keys (HQ encrypts data), the HQ version, worker/server ports, etc. It would be a bit unwieldy to pass all of these parameters on the command line. And even if we added CLI parameters, you would need to somehow read the parameters produced by hq server start in order to connect to the server programmatically in an automated way.
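To make the second point concrete, here is a minimal Python sketch of the kind of information such an access file has to bundle. The field names and values are illustrative assumptions, not HQ's actual access.json schema:

```python
import json
import secrets

# Illustrative only: the real access.json schema is defined by HyperQueue
# itself; these field names are assumptions for the sake of the example.
access = {
    "version": "0.15.0",  # HQ version, checked on connect
    "client": {
        "host": "server-node",
        "port": 1234,
        "secret_key": secrets.token_hex(32),  # encryption key for clients
    },
    "worker": {
        "host": "server-node",
        "port": 4321,
        "secret_key": secrets.token_hex(32),  # encryption key for workers
    },
}

# Bundling all of this in one file is why a single access.json is more
# convenient than half a dozen CLI flags on every worker.
print(json.dumps(access, indent=2))
```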

We already had a use-case without a shared filesystem, and we solved this issue by simply copying the HQ server access file (access.json) to the node with the worker, and then pointing the worker to it. Something like this:

server $ hq server start --output-mode=json
# (read the server_dir by a script from the JSON output)

server $ scp <path/to/access.json> worker-node:/tmp/hq

worker-node $ hq worker start --server-dir /tmp/hq 

This works for me locally, let me know if it works for you. I can already see one potential problem - the JSON data is printed with formatting, not one message per line. This is fine for synchronous commands (e.g. client submit), but when the command is long running (server start), then it might be problematic for you to read the server directory programmatically. Let me know if that is the case and we will fix it.
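A workaround for the pretty-printed output could look like the following Python sketch, which pulls complete JSON objects out of a stream buffer as they arrive instead of assuming one message per line. The message fields shown are assumptions, not HQ's actual output schema:

```python
import json

decoder = json.JSONDecoder()

def extract_messages(buffer: str):
    """Pull complete JSON objects out of a stream buffer, returning the
    parsed messages and the unconsumed remainder. This is the kind of
    workaround needed when a long-running process prints pretty-printed
    JSON rather than one message per line."""
    messages = []
    idx = 0
    while True:
        # Skip whitespace between objects.
        while idx < len(buffer) and buffer[idx].isspace():
            idx += 1
        try:
            msg, end = decoder.raw_decode(buffer, idx)
        except ValueError:
            # Incomplete object at the end of the buffer; wait for more data.
            break
        messages.append(msg)
        idx = end
    return messages, buffer[idx:]

# Hypothetical stream fragment; the field names are assumptions.
chunk = '{\n  "event": "server-start",\n  "server_dir": "/tmp/hq/001"\n}\n{"event": "rea'
msgs, rest = extract_messages(chunk)
print(msgs[0]["server_dir"])
```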

@vsoch
Contributor Author

vsoch commented Jun 4, 2023

I could try doing a copy, or bringing up a one-off pod (this is in Kubernetes) to run server start first and then saving the directory, but that's not ideal (though if it's the only way, I will have to try it). With one worker and one server it's not a big deal, but it would get messy to coordinate one server with tens to hundreds of workers (or more)!

Is there a way (or instructions) to generate that access.json in advance (or each field separately, to assemble manually), and then write it to a read-only ConfigMap that is shared by the worker and server? That's what we do for Flux Framework configs in the same setup! Minimally, I could run the server start command in the one-off pod, save the access.json output, turn it into a ConfigMap, and then mount it to be found by all subsequent nodes. I can tell you we did something similar for Flux, and the feedback from the batch working group was that having this extra pod wasn't an ideal design; ultimately we compiled the needed libraries into the operator itself to generate the config there.
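As a sketch of the ConfigMap idea, assuming the access.json contents have already been captured from a one-off server pod, the wrapping step might look like this in Python (the manifest schema is standard Kubernetes; nothing here is an official HQ feature, and the payload fields are made up):

```python
import json

def access_configmap(name: str, access_json: str) -> dict:
    """Wrap an access.json payload in a Kubernetes ConfigMap manifest so it
    can be mounted read-only by server and worker pods. The ConfigMap schema
    is standard Kubernetes (v1 core)."""
    return {
        "apiVersion": "v1",
        "kind": "ConfigMap",
        "metadata": {"name": name},
        "data": {"access.json": access_json},
    }

# Hypothetical access file contents captured from a one-off server pod.
payload = json.dumps({"client": {"port": 1234}, "worker": {"port": 4321}})
manifest = access_configmap("hq-access", payload)
print(json.dumps(manifest, indent=2))
```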

Thanks for your help! I'm off to bed soon but will be working on this again tomorrow.

@Kobzol
Collaborator

Kobzol commented Jun 4, 2023

Depends on what you mean by "in advance". Do you need to create the config without starting the server at all, or create the config, then kill the server and later start it again with the same config, or something like that? Or can you start the server to generate the config and then keep it running?

@vsoch
Contributor Author

vsoch commented Jun 4, 2023

Ideally we could create it without starting the server (we know the hostname in advance, for example), but if we must start it, I could start it in an ephemeral pod, save the access.json, and then make a read-only ConfigMap for new pods to use (assuming there are no other files and read-only is OK!)

I don’t think we could keep it running because we’d have to recreate pods to add the config map.

@vsoch
Contributor Author

vsoch commented Jun 4, 2023

Okay, I made a design that can spin up a pod to generate that access.json, and then retrieve it via a log. I ran into some trouble: when I mount it as a read-only ConfigMap, I get an error that it wants write access. It also seems to want the same structure with <dirname>/001/access.json - I'm worried this means incrementally numbered directories are going to be written that are assumed to be on a shared filesystem (but are not). Then when I do all that, it tells me it doesn't like the access.json (and makes a new directory 😢 )

2023-06-04T17:13:19Z INFO No online server found, starting a new server
Invalid server directory

Caused by:
    Error: "/root/.hq-server" is not a directory
2023-06-04T17:13:19Z INFO Saving access file as '/./hq/002/access.json'

Could we chat about ways to make it work? I'm hoping there is a way to set up the config files and use hq without requiring a shared filesystem. I understand most HPC systems can assume that (although not all, actually), but for cloud native stuff the isolation is a lot more common. If we can predictably know the required attributes in advance, and any cert or token generators can be reproduced in Go, then I can write the access.json to a read-only filesystem, which would be fantastic. Otherwise we would always require a shared RWX volume, which hugely hurts the design.
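To illustrate the numbered-directory layout from the log above, here is a small Python sketch that locates access.json under the most recent numbered subdirectory. This only mirrors what the log output shows (001, 002, ...); it is not based on any documented HQ behavior:

```python
import os
import tempfile

def latest_access_file(server_dir: str) -> str:
    """Find access.json under the highest-numbered run subdirectory
    (001, 002, ...) of a server directory, mimicking the layout seen
    in the log output."""
    runs = sorted(d for d in os.listdir(server_dir) if d.isdigit())
    if not runs:
        raise FileNotFoundError(f"no numbered run directories in {server_dir}")
    return os.path.join(server_dir, runs[-1], "access.json")

# Demo with a throwaway directory mimicking the layout.
root = tempfile.mkdtemp()
for run in ("001", "002"):
    os.makedirs(os.path.join(root, run))
    open(os.path.join(root, run, "access.json"), "w").close()
print(latest_access_file(root))  # ends with 002/access.json
```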

Here is what I put together this morning: converged-computing/hyperqueue-operator#3 looking forward to discussing more! I'm not hugely experienced with Rust (but I'm eager to learn and generally learn languages quickly) so if there is an improvement we can make here (that I can help with) please recruit me! I'll just need some pointers. Happy Sunday!

@Kobzol
Collaborator

Kobzol commented Jun 5, 2023

If you want to discuss this on a chat, we have a Zulip instance. You can log in there with your GitHub account.

@spirali
Collaborator

spirali commented Jun 5, 2023

I think it would not be difficult to create a command like:

$ hq server create-access-file <myaccessfile.json> --client-port=1234 --worker-port=4321

And then have an option --access-file in server start:

$ hq server start --access-file=<myaccessfile.json> 

It would still use the same machinery and create an access file in .hq-server, etc., but instead of randomly initializing keys it would reuse them from the given file, and the same for the ports.
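The proposed reuse behavior could be sketched roughly like this in Python, where an existing access file wins over freshly generated values. The field names are illustrative assumptions, not HQ's actual schema:

```python
import json
import os
import secrets
import tempfile

def load_or_create_access(path: str, client_port: int, worker_port: int) -> dict:
    """Sketch of the proposed behavior: if an access file already exists,
    reuse its keys and ports instead of generating fresh ones; otherwise
    create it with a new random key."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    access = {
        "client_port": client_port,
        "worker_port": worker_port,
        "secret_key": secrets.token_hex(32),
    }
    with open(path, "w") as f:
        json.dump(access, f)
    return access

path = os.path.join(tempfile.mkdtemp(), "myaccessfile.json")
first = load_or_create_access(path, client_port=1234, worker_port=4321)
second = load_or_create_access(path, client_port=9999, worker_port=9999)
assert first["secret_key"] == second["secret_key"]  # keys reused, not regenerated
```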

@vsoch
Contributor Author

vsoch commented Jun 5, 2023

Yes that would be perfect to start!

@pditommaso

Indeed, even though a shared filesystem is pretty common in HPC and on-prem clusters, it would be great if HyperQueue did not hard-require it. This would allow deployment on ephemeral clusters such as Kubernetes.

@spirali
Collaborator

spirali commented Jun 7, 2023

It would need some refactoring, but I will look at it during this week and prepare a first version.

@spirali
Collaborator

spirali commented Jun 9, 2023

Just to let you know: I have started to work on it and found some older technical debt, so it is taking me more time. But I hope that I can finish it next week.

@vsoch
Contributor Author

vsoch commented Jun 19, 2023

Solved by #595

@vsoch vsoch closed this as completed Jun 19, 2023
@vsoch
Contributor Author

vsoch commented Jul 20, 2023

Hey, thanks again for this! We had to wait for the releases of JobSet and HyperQueue here, but the operator is working as I'd like (and I'm going to test it soon!) https://github.com/converged-computing/hyperqueue-operator#hello-world-example
