Resource requests for multiple jobs limited by first one submitted #1237

Closed
psschwei opened this issue Mar 1, 2024 · 2 comments
Labels: bug (Something isn't working)

psschwei (Collaborator) commented Mar 1, 2024

Steps to reproduce the problem

Run two serverless jobs concurrently, the first one using 1 worker and the second one using 3.

Using the basic getting started running_program.ipynb notebook, make the following updates:

To ensure jobs run concurrently, add a pause in source_files/pattern.py around L19:

import time
time.sleep(120)  # keep the pattern alive long enough for the two jobs to overlap

Then, update the "Running the pattern" section of the notebook to launch two jobs with different resource configurations:

from quantum_serverless import Configuration

job = serverless.run("my-first-pattern")  # first job, using 1 worker
job2 = serverless.run("my-first-pattern", config=Configuration(workers=3))  # second job requests 3 workers

(Note: I've also tried setting auto-scaling to true on both jobs, with no change in behavior.)
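For reference, the auto-scaling attempt looked roughly like this (the auto_scaling field name is from memory, so treat this as a sketch rather than the exact call):

from quantum_serverless import Configuration

job = serverless.run("my-first-pattern", config=Configuration(workers=1, auto_scaling=True))
job2 = serverless.run("my-first-pattern", config=Configuration(workers=3, auto_scaling=True))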

What is the current behavior?

A Ray cluster with two pods (one head, one worker) is launched, and both workloads run on that cluster:

$ k get po
NAME                                 READY   STATUS    RESTARTS   AGE
c-mockuser-a1a35d28-head-4pf5d       2/2     Running   0          95s
c-mockuser-a1a35d28-worker-g-m7qgm   1/1     Running   0          95s
gateway-796cfb4d5b-cbslp             1/1     Running   0          6m50s
kuberay-operator-654bf75dcb-4tvcw    1/1     Running   0          6m50s
postgresql-0                         1/1     Running   0          6m50s
prepuller-5v2xg                      1/1     Running   0          6m50s
scheduler-fbb99cb54-mlxrt            1/1     Running   0          6m50s

What is the expected behavior?

At a minimum, I would expect the cluster to be resized to add the additional requested workers.

I'm not sure whether the better behavior would be to start a new Ray cluster for the additional job, given the differing resource requests; I could see arguments in favor of both approaches.

How to Fix

When a job is submitted, we check to see if there's an existing compute resource:

https://github.com/Qiskit-Extensions/quantum-serverless/blob/7dfe5bbe644b15cca75691639fb898ce0f2da53e/gateway/api/schedule.py#L43-L45

and if so, we reuse it:

https://github.com/Qiskit-Extensions/quantum-serverless/blob/7dfe5bbe644b15cca75691639fb898ce0f2da53e/gateway/api/schedule.py#L49-L53
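Roughly, the logic looks like this (a minimal sketch with simplified names, assuming the usual Django ORM patterns used in the gateway; see the linked lines for the real implementation):

def execute_job(job):
    # Look for a compute resource already assigned to this user
    # (simplified; the real query lives in gateway/api/schedule.py).
    compute_resource = ComputeResource.objects.filter(owner=job.author).first()
    if compute_resource is None:
        # No existing cluster: create one sized by this job's configuration.
        compute_resource = create_compute_resource(job)
    # Otherwise the existing cluster is reused as-is, and the new job's
    # resource request (e.g. workers=3) is never applied -- hence this bug.
    job.compute_resource = compute_resource
    job.save()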

In some use cases, this may not be the ideal behavior, so we may want to revisit this decision...

psschwei added the bug label on Mar 1, 2024
Tansito (Member) commented Mar 6, 2024

Probably @IceKhan13 can give us better insight, but if I remember correctly, something that surprised me in the past is that Ray tries to allocate work wherever it can. So if you have two jobs that will consume 4 CPUs and a cluster with 4 CPUs available, it will try to run both jobs on that cluster instead of creating two clusters or adding more workers. Maybe something similar is happening in this case.
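As a minimal illustration of that packing behavior with plain Ray (cluster size and task shapes here are made up for the example):

import time
import ray

ray.init(num_cpus=4)  # a single small "cluster" with 4 CPUs

@ray.remote(num_cpus=2)
def job(name):
    time.sleep(5)
    return name

# Both tasks fit within the 4 available CPUs, so Ray packs them onto
# the existing resources instead of asking for more capacity.
print(ray.get([job.remote("first"), job.remote("second")]))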

That said, I've never run a workload with that configuration. I think it's a good idea to start testing different configurations for the workloads and monitoring the behavior.

psschwei (Collaborator, Author) commented

This will be fixed by #1337
