dstack-cloud: add gcp_config.provisioning_model for SPOT instances #15

Merged: kvinwang merged 1 commit into main from kvin/dstack-cloud-spot-provisioning on May 27, 2026
@kvinwang
Collaborator

Summary

Add a provisioning_model field to gcp_config (STANDARD / SPOT), and pass the corresponding flags through to gcloud compute instances create.

Why

Many GCP projects only ship preemptible (SPOT) quota for newer GPUs. For example, on a typical project PREEMPTIBLE-NVIDIA-H100-GPUS-per-project-{region,zone} is granted while NVIDIA-H100-GPUS-per-project-region is zero. Without on-demand quota, the only way to launch a3-highgpu-1g (H100) in a Confidential TDX VM is to ask for --provisioning-model=SPOT. Currently dstack-cloud deploy hard-codes STANDARD and the launch fails with QUOTA_EXCEEDED.

Behavior

  • provisioning_model defaults to STANDARD — fully backwards-compatible.
  • When set to SPOT, the deploy adds:
    • --provisioning-model=SPOT
    • --instance-termination-action=STOP (so the LUKS-encrypted data disk survives preemption and dstack-cloud start can resume the instance — gcloud's default is DELETE)
  • Any other value raises RuntimeError early instead of silently dropping through.
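The behavior listed above can be sketched in a few lines of Python. This is a minimal illustration, not the code from this PR: the helper name `provisioning_flags` and the exact error wording are assumptions. Only the two gcloud flags, the `STANDARD` default, and the `RuntimeError` on unknown values come from the description.

```python
def provisioning_flags(provisioning_model: str = "STANDARD") -> list[str]:
    """Return extra flags for `gcloud compute instances create`.

    Hypothetical helper; the real dstack-cloud code may be structured differently.
    """
    if provisioning_model == "STANDARD":
        # Default: the gcloud invocation stays unchanged.
        return []
    if provisioning_model == "SPOT":
        return [
            "--provisioning-model=SPOT",
            # STOP keeps the disks so the instance can be resumed later.
            "--instance-termination-action=STOP",
        ]
    # Fail early on anything else instead of silently ignoring it.
    raise RuntimeError(f"Unsupported provisioning_model: {provisioning_model!r}")
```

Returning an empty list for `STANDARD` is what would keep existing deployments on the same command line as before.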

Example app.json:

```json
"gcp_config": {
  "machine_type": "a3-highgpu-1g",
  "zone": "us-central1-a",
  "provisioning_model": "SPOT"
}
```
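As a rough sketch of how a config like this might be read, the Python below parses the fragment and falls back to `STANDARD` when the field is absent. The function name `read_provisioning_model` is hypothetical, and the fragment is wrapped in braces here only so it parses as standalone JSON; a real `app.json` contains other fields not shown in this PR.

```python
import json

# The example fragment, wrapped in an object so it is valid JSON on its own.
APP_JSON = """
{
  "gcp_config": {
    "machine_type": "a3-highgpu-1g",
    "zone": "us-central1-a",
    "provisioning_model": "SPOT"
  }
}
"""


def read_provisioning_model(app_config: dict) -> str:
    """Return gcp_config.provisioning_model, defaulting to STANDARD (hypothetical helper)."""
    gcp_config = app_config.get("gcp_config", {})
    return gcp_config.get("provisioning_model", "STANDARD")


model = read_provisioning_model(json.loads(APP_JSON))
legacy = read_provisioning_model({"gcp_config": {"zone": "us-central1-a"}})
print(model, legacy)
```

The second call models an older config without the field, which is the backwards-compatible case.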

Test plan

  • dstack-cloud new: template now includes "provisioning_model": "STANDARD"
  • dstack-cloud deploy with default (STANDARD) — unchanged gcloud invocation
  • dstack-cloud deploy with SPOT — emits --provisioning-model=SPOT --instance-termination-action=STOP; verified on GCP a3-highgpu-1g
  • dstack-cloud deploy with bogus value — raises Unsupported provisioning_model

Many GCP projects only ship preemptible (SPOT) quota for newer GPUs —
in particular `PREEMPTIBLE-NVIDIA-H100-GPUS-per-project-{region,zone}`
is granted by default while `NVIDIA-H100-GPUS-per-project-region` is
zero. Without on-demand quota, the only way to launch H100 in a
Confidential TDX VM is to request `--provisioning-model=SPOT`.

Expose a `provisioning_model` field in `gcp_config` (default
`STANDARD`, backwards-compatible). When set to `SPOT`, also emit
`--instance-termination-action=STOP` so the boot/data disks survive
preemption and the instance can be resumed via `dstack-cloud start`
(important for the LUKS-encrypted data disk, which is keyed by the
KMS-provisioned per-instance secret).

Anything other than `STANDARD`/`SPOT` raises an early error rather
than silently dropping through.

Example `app.json` snippet for an H100 deploy:

    "gcp_config": {
      "machine_type": "a3-highgpu-1g",
      "zone": "us-central1-a",
      "provisioning_model": "SPOT"
    }
Copilot AI review requested due to automatic review settings May 27, 2026 04:19

Copilot AI left a comment

Copilot wasn't able to review any files in this pull request.



@kvinwang kvinwang merged commit 5a1bfea into main May 27, 2026
@kvinwang kvinwang mentioned this pull request May 27, 2026
2 tasks