Resource requests for multiple jobs limited by first one submitted #1237

Closed
psschwei opened this issue Mar 1, 2024 · 2 comments
Labels: bug (Something isn't working)

psschwei (Collaborator) commented Mar 1, 2024

Steps to reproduce the problem

Run two serverless jobs concurrently, the first one using 1 worker and the second one using 3.

Using the basic getting started running_program.ipynb notebook, make the following updates:

To ensure jobs run concurrently, add a pause in source_files/pattern.py around L19:

import time
time.sleep(120)  # keep the pattern alive long enough for the two jobs to overlap

Then, update the "Running the pattern" section of the notebook to launch two jobs with different resource configurations:

from quantum_serverless import Configuration

job = serverless.run("my-first-pattern")  # first job, using 1 worker
job2 = serverless.run("my-first-pattern", config=Configuration(workers=3))  # second job requests 3 workers

(Note: I've also tried setting auto-scaling to true on both jobs, with no change in behavior.)
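For reference, the auto-scaling attempt looked roughly like this (the auto_scaling field name is from memory, so treat this as a sketch rather than the exact call):

from quantum_serverless import Configuration

job = serverless.run("my-first-pattern", config=Configuration(workers=1, auto_scaling=True))
job2 = serverless.run("my-first-pattern", config=Configuration(workers=3, auto_scaling=True))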

What is the current behavior?

A Ray cluster with two pods (one head, one worker) is launched, and both workloads run on that cluster:

$ k get po
NAME                                 READY   STATUS    RESTARTS   AGE
c-mockuser-a1a35d28-head-4pf5d       2/2     Running   0          95s
c-mockuser-a1a35d28-worker-g-m7qgm   1/1     Running   0          95s
gateway-796cfb4d5b-cbslp             1/1     Running   0          6m50s
kuberay-operator-654bf75dcb-4tvcw    1/1     Running   0          6m50s
postgresql-0                         1/1     Running   0          6m50s
prepuller-5v2xg                      1/1     Running   0          6m50s
scheduler-fbb99cb54-mlxrt            1/1     Running   0          6m50s

What is the expected behavior?

At a minimum, I would expect the cluster to be resized to add the additional requested workers.

I'm not sure whether the better behavior would be to start a new Ray cluster for the additional job, given the differing resource requests; I could see arguments in favor of both approaches.

How to Fix

When a job is submitted, we check to see if there's an existing compute resource:

https://github.com/Qiskit-Extensions/quantum-serverless/blob/7dfe5bbe644b15cca75691639fb898ce0f2da53e/gateway/api/schedule.py#L43-L45

and if so, we reuse it:

https://github.com/Qiskit-Extensions/quantum-serverless/blob/7dfe5bbe644b15cca75691639fb898ce0f2da53e/gateway/api/schedule.py#L49-L53
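Roughly, the logic looks like this (a minimal sketch with simplified names, assuming the usual Django ORM patterns used in the gateway; see the linked lines for the real implementation):

def execute_job(job):
    # Look for a compute resource already assigned to this user
    # (simplified; the real query lives in gateway/api/schedule.py).
    compute_resource = ComputeResource.objects.filter(owner=job.author).first()
    if compute_resource is None:
        # No existing cluster: create one sized by this job's configuration.
        compute_resource = create_compute_resource(job)
    # Otherwise the existing cluster is reused as-is, and the new job's
    # resource request (e.g. workers=3) is never applied -- hence this bug.
    job.compute_resource = compute_resource
    job.save()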

In some use cases, this may not be the ideal behavior, so we may want to revisit this decision...

psschwei added the bug label on Mar 1, 2024
Tansito (Member) commented Mar 6, 2024

Probably @IceKhan13 can give us better insight, but if I remember correctly, something that surprised me in the past is that Ray tries to allocate work wherever it can. So if you have two jobs that will consume 4 CPUs and a cluster with 4 CPUs available, it will try to run both jobs on that cluster instead of creating two clusters or adding more workers. Maybe something similar is happening in this case.
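As a minimal illustration of that packing behavior with plain Ray (cluster size and task shapes here are made up for the example):

import time
import ray

ray.init(num_cpus=4)  # a single small "cluster" with 4 CPUs

@ray.remote(num_cpus=2)
def job(name):
    time.sleep(5)
    return name

# Both tasks fit within the 4 available CPUs, so Ray packs them onto
# the existing resources instead of asking for more capacity.
print(ray.get([job.remote("first"), job.remote("second")]))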

That said, I've never run a workload with that configuration. I think it's a good idea to start testing different configurations for the workloads and monitoring the behavior.

psschwei (Collaborator, Author) commented

This will be fixed by #1337
