Summary
vmm-cli.py update <vm_id> --compose new.yaml --env-file new.env --kms-url ... silently drops the --compose update when the env-file's keys differ from the VM's current allowed_envs. The resulting VMM-stored compose_file keeps the old docker_compose_file but with the new allowed_envs. vmm-cli update exits 0 and reports success.
Reproduction
Any combined update in which --env-file adds or removes an env var changes allowed_envs and hits this path. For us it surfaced when adding LAUNCHER_CHANNEL to the env list alongside a new service in the compose YAML: the new service was silently dropped on two hosts.
Root cause
vmm/src/vmm-cli.py, update_vm() (current master, lines 1051–1124): two unrelated branches both write to upgrade_params["compose_file"], and the env-file branch runs last:
    # Branch 1: compose update (line 1051)
    if needs_compose_update:
        vm_configuration = vm_info_response["info"].get("configuration") or {}
        compose_file_content = vm_configuration.get("compose_file")
        app_compose = json.loads(compose_file_content) if compose_file_content else {}
        if docker_compose_content:
            app_compose["docker_compose_file"] = docker_compose_content  # ← inserts NEW YAML
        ...
        upgrade_params["compose_file"] = json.dumps(app_compose, ...)

    # Branch 2: env-file (line 1088)
    if env_file:
        envs = parse_env_file(env_file)
        if envs:
            ...
            if compose_file_content:
                app_compose = json.loads(compose_file_content)  # ← RE-READS ORIGINAL (no new YAML)
                ...
                if app_compose.get("allowed_envs") != allowed_envs:
                    app_compose["allowed_envs"] = allowed_envs
                    compose_changed = True
                ...
                if compose_changed:
                    upgrade_params["compose_file"] = json.dumps(app_compose, ...)  # ← OVERWRITES branch 1's result
Branch 2 reloads compose_file_content from vm_info_response (pre-update state) instead of continuing to mutate the app_compose dict already built by branch 1. When allowed_envs differs, compose_changed=True and branch 2's upgrade_params["compose_file"] = json.dumps(app_compose, ...) clobbers the new YAML.
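The overwrite can be reproduced in isolation. The following is a minimal sketch, not the actual update_vm() code: buggy_update and the sample values are hypothetical stand-ins that mirror only the two writes to upgrade_params["compose_file"] described above.

```python
import json

# Hypothetical stand-in for the compose_file JSON string stored by VMM.
stored = json.dumps({"docker_compose_file": "old-yaml", "allowed_envs": ["A"]})


def buggy_update(stored_compose, new_yaml, env_keys):
    """Mimic the two independent branches that both write compose_file."""
    upgrade_params = {}

    # Branch 1: insert the new YAML into a freshly parsed copy.
    app_compose = json.loads(stored_compose)
    app_compose["docker_compose_file"] = new_yaml
    upgrade_params["compose_file"] = json.dumps(app_compose)

    # Branch 2: parse the ORIGINAL string again, discarding branch 1's dict.
    app_compose = json.loads(stored_compose)
    if app_compose.get("allowed_envs") != env_keys:
        app_compose["allowed_envs"] = env_keys
        upgrade_params["compose_file"] = json.dumps(app_compose)

    return json.loads(upgrade_params["compose_file"])


result = buggy_update(stored, "new-yaml", ["A", "LAUNCHER_CHANNEL"])
print(result["docker_compose_file"])  # old-yaml: the new YAML was lost
print(result["allowed_envs"])         # ['A', 'LAUNCHER_CHANNEL']
```

When env_keys equals the stored allowed_envs, branch 2 skips its write and the new YAML survives, which is why the bug only appears when the env key set changes.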
Why it's hard to notice
- vmm-cli update exits 0 and prints success
- The resulting compose_file still has the new allowed_envs, so subsequent env operations look correct
- The KMS hash registered by the operator (computed from app-compose.json) matches what VMM stores; both are wrong but internally consistent
- The CVM boots fine; the missing service simply… never existed
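Since none of these signals reveal the problem, one option is to compare the stored compose_file against the YAML that was meant to be applied. The helper below is a sketch under assumptions: compose_was_applied is a hypothetical name, the caller is assumed to have already fetched the compose_file string from the VM's configuration, and it assumes the YAML is stored verbatim so that an exact string comparison is meaningful.

```python
import json


def compose_was_applied(stored_compose_file: str, expected_yaml: str) -> bool:
    """Return True if the stored app-compose JSON embeds the expected YAML."""
    app_compose = json.loads(stored_compose_file)
    return app_compose.get("docker_compose_file") == expected_yaml


# Example with made-up values; in practice stored would be the compose_file
# string from the VM's configuration and the second argument the contents
# of new.yaml.
stored = json.dumps({"docker_compose_file": "services: {}\n", "allowed_envs": ["A"]})
print(compose_was_applied(stored, "services: {}\n"))
print(compose_was_applied(stored, "services:\n  new: {}\n"))
```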
Suggested fix
Have branch 2 reuse the app_compose dict built by branch 1 instead of reloading from vm_configuration. Sketch:
    app_compose = None  # accumulated across both branches

    if needs_compose_update or env_file:
        vm_info_response = self.rpc_call("GetInfo", {"id": vm_id})
        ...

    if needs_compose_update:
        vm_configuration = vm_info_response["info"].get("configuration") or {}
        compose_file_content = vm_configuration.get("compose_file")
        try:
            app_compose = json.loads(compose_file_content) if compose_file_content else {}
        except json.JSONDecodeError:
            app_compose = {}
        if docker_compose_content:
            app_compose["docker_compose_file"] = docker_compose_content
            updates.append("docker compose")
        # ... prelaunch_script, swap_size ...
        upgrade_params["compose_file"] = json.dumps(app_compose, ...)

    if env_file:
        envs = parse_env_file(env_file)
        if envs:
            ...
            # Reuse the in-flight app_compose if branch 1 ran;
            # otherwise load from current VMM state.
            if app_compose is None:
                vm_configuration = vm_info_response["info"].get("configuration") or {}
                compose_file_content = vm_configuration.get("compose_file")
                try:
                    app_compose = json.loads(compose_file_content) if compose_file_content else {}
                except json.JSONDecodeError:
                    app_compose = {}
            compose_changed = False
            allowed_envs = list(envs.keys())
            if app_compose.get("allowed_envs") != allowed_envs:
                app_compose["allowed_envs"] = allowed_envs
                compose_changed = True
            # ... launch_token_hash ...
            if compose_changed or needs_compose_update:
                upgrade_params["compose_file"] = json.dumps(app_compose, ...)
Two key changes: (a) app_compose is shared across both branches; (b) when branch 1 ran, always re-serialize the merged result so the env updates don't drop the compose changes.
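The same idea can be expressed as a single pure function, which makes a regression test straightforward. This is an illustrative sketch rather than a proposed patch: merge_compose and its parameters are hypothetical, and it covers only docker_compose_file and allowed_envs, not prelaunch_script, swap_size or launch_token_hash.

```python
import json
from typing import List, Optional


def merge_compose(stored_compose: Optional[str],
                  new_yaml: Optional[str],
                  env_keys: Optional[List[str]]) -> Optional[str]:
    """Return the compose_file JSON to send, or None if nothing changed."""
    try:
        app_compose = json.loads(stored_compose) if stored_compose else {}
    except json.JSONDecodeError:
        app_compose = {}

    changed = False
    # Compose update and env update both mutate the SAME dict.
    if new_yaml is not None:
        app_compose["docker_compose_file"] = new_yaml
        changed = True
    if env_keys is not None and app_compose.get("allowed_envs") != env_keys:
        app_compose["allowed_envs"] = env_keys
        changed = True

    # Serialize once, after every mutation has been applied.
    return json.dumps(app_compose) if changed else None


stored = json.dumps({"docker_compose_file": "old-yaml", "allowed_envs": ["A"]})
merged = json.loads(merge_compose(stored, "new-yaml", ["A", "LAUNCHER_CHANNEL"]))
print(merged["docker_compose_file"])  # new-yaml
print(merged["allowed_envs"])         # ['A', 'LAUNCHER_CHANNEL']
```

Serializing exactly once at the end removes the ordering dependency between the two branches, so a later branch cannot overwrite an earlier one.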
Workaround (no upstream change needed)
Split the single update into two sequential vmm-cli update calls:
1. vmm-cli update <vm_id> --env-file new.env --kms-url ... (settles allowed_envs and encrypted_env)
2. vmm-cli update <vm_id> --compose new.yaml --vcpu ... --image ... --kms-url ... (applies the new compose against an already-matching allowed_envs, so branch 2 sees compose_changed=False and doesn't clobber)
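For scripted deployments, the two calls can be wrapped so that the compose step only runs if the env step succeeded. This is a sketch under assumptions: the function names are hypothetical, the executable is assumed to be reachable as vmm-cli.py, only the flags shown in the steps above are used, and any other flags a given setup needs (such as --vcpu or --image) are left out and would have to be added.

```python
import subprocess


def split_update_commands(vm_id, env_file, compose, kms_url, cli="vmm-cli.py"):
    """Build the two sequential invocations: env update first, compose second."""
    env_cmd = [cli, "update", vm_id, "--env-file", env_file, "--kms-url", kms_url]
    compose_cmd = [cli, "update", vm_id, "--compose", compose, "--kms-url", kms_url]
    return [env_cmd, compose_cmd]


def run_split_update(vm_id, env_file, compose, kms_url, cli="vmm-cli.py"):
    """Run both updates in order, stopping at the first failure."""
    for cmd in split_update_commands(vm_id, env_file, compose, kms_url, cli):
        # check=True raises CalledProcessError, so the compose step is
        # skipped if the env step exits non-zero.
        subprocess.run(cmd, check=True)
```

Note that a zero exit code from the compose step does not prove the YAML landed, so checking the stored compose_file afterwards is still worthwhile.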
Environment
Reproduced on a downstream install (/usr/bin/vmm-cli.py, md5 da37c6fecd4219363e4c43076ca4fc30); upstream master at vmm/src/vmm-cli.py has the same code path. Hosts in question were built from a dstack release using dstack-nvidia-0.5.5.