(note: I've also tried this when setting auto-scaling to true on both jobs with no change in behavior)
What is the current behavior?
A Ray cluster with two pods (one head, one worker) is launched, and both workloads run on that cluster.
$ k get po
NAME                                 READY   STATUS    RESTARTS   AGE
c-mockuser-a1a35d28-head-4pf5d       2/2     Running   0          95s
c-mockuser-a1a35d28-worker-g-m7qgm   1/1     Running   0          95s
gateway-796cfb4d5b-cbslp             1/1     Running   0          6m50s
kuberay-operator-654bf75dcb-4tvcw    1/1     Running   0          6m50s
postgresql-0                         1/1     Running   0          6m50s
prepuller-5v2xg                      1/1     Running   0          6m50s
scheduler-fbb99cb54-mlxrt            1/1     Running   0          6m50s
What is the expected behavior?
At a minimum, I would expect the cluster to be resized to add the additional requested workers.
I'm not sure whether the better behavior would instead be to start a new Ray cluster for the additional job, given the differing resource requests; I could see arguments for both approaches.
@IceKhan13 can probably give better insight here, but if I remember correctly, something that surprised me in the past is that Ray tries to allocate workloads wherever they fit. So if you have two jobs that each consume 4 CPUs and a cluster with 4 CPUs available, Ray will try to run both jobs on that cluster rather than create a second cluster or add workers. Maybe something similar is happening in this case.
That said, I've never run a workload with that configuration. I think it's a good idea to start testing different workload configurations and monitoring the behavior.
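To illustrate that hypothesis, here is a toy model of first-fit placement (purely illustrative Python, not Ray's actual scheduler): a job lands on any existing cluster with enough free CPUs, and new capacity is only created when nothing fits.

```python
from typing import List


class Cluster:
    """Toy cluster with a fixed logical CPU capacity."""

    def __init__(self, cpus: int) -> None:
        self.cpus = cpus
        self.used = 0

    def fits(self, job_cpus: int) -> bool:
        return self.cpus - self.used >= job_cpus


def place(clusters: List[Cluster], job_cpus: int) -> int:
    """Place a job on the first cluster with enough free CPUs;
    create a new cluster only when nothing fits."""
    for i, cluster in enumerate(clusters):
        if cluster.fits(job_cpus):
            cluster.used += job_cpus
            return i
    new_cluster = Cluster(cpus=job_cpus)
    new_cluster.used = job_cpus
    clusters.append(new_cluster)
    return len(clusters) - 1


clusters = [Cluster(cpus=4)]
first = place(clusters, 2)   # fits on the existing cluster
second = place(clusters, 2)  # also fits, so it shares the same cluster
```

Under this model both 2-CPU jobs land on the single 4-CPU cluster and no new capacity is created, which matches the observed behavior of one shared cluster.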
Steps to reproduce the problem
Run two serverless jobs concurrently, the first one using 1 worker and the second one using 3.
Using the basic getting-started `running_program.ipynb` notebook, make the following updates. To ensure the jobs run concurrently, add a pause in `source_files/pattern.py` around L19. Then, update the "running the pattern" section of the notebook to launch the jobs with different resource configurations.
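The snippets referenced above weren't captured in this issue text. As a minimal sketch of the pause (assuming a simple `time.sleep` is enough to keep the first job alive while the second is submitted; the surrounding pattern code is not reproduced):

```python
import time

# Hypothetical pause added around L19 of source_files/pattern.py so the
# first job is still running when the second job is submitted.
PAUSE_SECONDS = 60


def hold_job_open(seconds: float = PAUSE_SECONDS) -> None:
    """Block long enough for a second job to be submitted concurrently."""
    time.sleep(seconds)
```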
How to Fix
When a job is submitted, we check to see if there's an existing compute resource:
https://github.com/Qiskit-Extensions/quantum-serverless/blob/7dfe5bbe644b15cca75691639fb898ce0f2da53e/gateway/api/schedule.py#L43-L45
and if so, we reuse it:
https://github.com/Qiskit-Extensions/quantum-serverless/blob/7dfe5bbe644b15cca75691639fb898ce0f2da53e/gateway/api/schedule.py#L49-L53
In some use cases, this may not be the ideal behavior, so we may want to revisit this decision...
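A simplified sketch of that decision (illustrative names only, not the actual `schedule.py` code) shows why the second job's larger request is silently dropped:

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class ComputeResource:
    owner: str
    workers: int


class ToyScheduler:
    """Toy model of the gateway's job-placement decision."""

    def __init__(self) -> None:
        self._by_owner: Dict[str, ComputeResource] = {}

    def schedule(self, owner: str, requested_workers: int) -> ComputeResource:
        existing: Optional[ComputeResource] = self._by_owner.get(owner)
        if existing is not None:
            # Current behavior: reuse the existing cluster unchanged,
            # even when this job requested more workers than it has.
            return existing
        resource = ComputeResource(owner=owner, workers=requested_workers)
        self._by_owner[owner] = resource
        return resource


scheduler = ToyScheduler()
first = scheduler.schedule("mockuser", requested_workers=1)
second = scheduler.schedule("mockuser", requested_workers=3)
# second is the same 1-worker resource; the request for 3 workers is ignored
```

In this model, a fix would either resize the reused resource to the larger request or create a separate resource for the second job, mirroring the two options discussed above.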