Start without shared filesystem? #592
Comments
Hi :) HQ creates a file on a shared filesystem for two reasons:

We already had a use-case without a shared filesystem, and we solved this issue by simply copying the HQ server access file (`access.json`):

```console
server $ hq server start --output-mode=json
# (read the server_dir by a script from the JSON output)
server $ scp <path/to/access.json> worker-node:/tmp/hq
worker-node $ hq worker start --server-dir /tmp/hq
```

This works for me locally, let me know if it works for you. I can already see one potential problem: the JSON data is printed with formatting, not one message per line. This is fine for synchronous commands (e.g. client submit), but it may be an issue when the command is long running.
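The "read the server_dir by a script" step above could be sketched in shell. Note that the exact shape of the JSON emitted by `hq server start --output-mode=json`, and the `server_dir` field name, are assumptions here for illustration:

```shell
# Hypothetical sample of the server's JSON output; the real field
# names may differ -- adjust to what your HQ version actually prints.
json='{"server_dir": "/home/user/.hq-server/001"}'

# Extract the directory with python3 so the script does not depend on jq.
server_dir=$(printf '%s' "$json" \
  | python3 -c 'import json, sys; print(json.load(sys.stdin)["server_dir"])')

echo "$server_dir"   # the path whose access.json gets copied to workers
```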
I could try doing a copy, or bringing up a one-off pod (this is in Kubernetes) to run `start` first (and then saving the directory), but that's not ideal (although if it's the only way, I will likely have to try it). With one worker and one server it's not a big deal, but it would get messy to coordinate one server with tens to hundreds of workers (or more)!

Is there a way, or instructions, to generate that `access.json` in advance (or each field separately, to assemble manually), and then write it to a read-only ConfigMap that is shared by the worker and server? That's what we do for configs for Flux Framework in the same setup!

Minimally, I could probably run the server start command on the one-off pod, save the `access.json` output, turn it into a ConfigMap, and then mount it to be found by all subsequent nodes. I can tell you we did something similar for Flux, and the feedback from the batch working group was that having this extra pod wasn't an ideal design; ultimately we compiled the needed libraries into the operator itself to generate it there.

Thanks for your help! I'm off to bed soon but will be working on this again tomorrow.
Depends on what you mean by "in advance". Do you need to create the config without starting the server, or create the config, then kill the server and start it again with the same config, or something like that? Or can you start the server to generate the config and then keep it running?
Ideally we could create it without starting the server (we know the hostname in advance, for example), but if we must start it, I could start it in an ephemeral pod, save the `access.json`, and then make a read-only ConfigMap for new pods to use (assuming there are no other files and read-only is OK!). I don't think we could keep it running, because we'd have to recreate the pods to add the ConfigMap.
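The ConfigMap approach described above might look roughly like the fragment below. All names and paths here are assumptions for illustration, not something HQ or the operator prescribes:

```yaml
# Hypothetical ConfigMap holding a previously generated access.json,
# mounted read-only into each worker pod.
apiVersion: v1
kind: ConfigMap
metadata:
  name: hq-access
data:
  access.json: |
    {"...": "contents of the saved access.json go here"}
---
# Corresponding (sketched) pieces of the worker pod spec:
#   volumes:
#     - name: hq-server-dir
#       configMap:
#         name: hq-access
#   containers:
#     - volumeMounts:
#         - name: hq-server-dir
#           mountPath: /tmp/hq
#           readOnly: true
```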
Okay, I made a design that can spin up a pod to generate that `access.json` and then retrieve it via a log. I ran into some trouble: when I mount it as a read-only ConfigMap, I get an error that it wants write access. It also seems to expect the same directory structure.
Could we chat about ways to make it work? I'm hoping there is a way to set up the config files and use HQ without requiring a shared filesystem. I understand most HPC can assume that (although not all, actually), but for cloud-native stuff the isolation is a lot more common. If we can predictably know the required attributes in advance, and any cert or token generators can be reproduced in Go, then I can write the `access.json` to a read-only filesystem, which would be fantastic. Otherwise we would always require a shared RWX volume, which hugely hurts the design.

Here is what I put together this morning: converged-computing/hyperqueue-operator#3 - looking forward to discussing more! I'm not hugely experienced with Rust (but I'm eager to learn, and generally learn languages quickly), so if there is an improvement we can make here (that I can help with), please recruit me! I'll just need some pointers. Happy Sunday!
If you want to discuss this on a chat, we have a Zulip instance. You can log in there with your GitHub account.
I think it would not be difficult to create a command like:

And then have an option

It would still use the same machinery and create an access file in `.hq-server`, etc. But it would not randomly initialize the keys; it would reuse them from the given file, and the same for the ports.
Yes, that would be perfect to start!
Indeed, even though a shared filesystem is pretty common in HPC and on-prem clusters, it would be great if HyperQueue did not hard-require it. This would allow deployment on ephemeral clusters such as Kubernetes.
It would need some refactoring, but I will look at it during this week and prepare a first version.
Just to let you know: I have started to work on it and found some older technical debt, so it is taking me more time. But I hope I can finish it during next week.
Solved by #595
Hey, thanks again for this! We had to wait for the release of JobSet and HyperQueue here, but the operator is working as I'd like (and I'm going to test it soon!): https://github.com/converged-computing/hyperqueue-operator#hello-world-example
Hi!

I'm creating an operator for HyperQueue, and it looks like I can target a `--host` for `server start`, but not for `worker start`. For the worker start, the most I can do is specify a directory with `--server-dir` (which assumes a shared filesystem). I have a cluster network but the filesystem is read only, so I wanted to ask if there are any reasonable options? Ideally we could:

Really looking forward to getting this working - thanks!