fix: regression in disable and stop sshd service #8596

Merged
djsly merged 1 commit into main from nishp/fix/ssh/regression on May 27, 2026

Conversation

@awesomenix
Contributor

There's the problem. On Ubuntu, sshd.service shows up in systemctl list-units --full --all (likely as "not-found" or "inactive"), so the guard check passes, but systemctl stop sshd fails every time with "not loaded."

It then retries 20 times with a 5s sleep between attempts, so about 100 seconds go to sleeping on stop alone, plus another 100s on disable. That is at least ~3.3 minutes for a service that simply doesn't exist on Ubuntu.

The ssh call succeeds quickly (first try), but the sshd call burns through all retries pointlessly. The fix would be to either:

  1. Check the OS and only call the appropriate service name, or
  2. Use systemctl list-unit-files instead (which won't list non-existent units), or
  3. Bail early if the error is "not loaded" rather than retrying
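
Option 3 could look roughly like the following. This is a minimal sketch, not the code in cse_helpers.sh: the function name `stop_unless_not_loaded` is hypothetical, the defaults of 20 retries and a 5s sleep are taken from the numbers in this comment, and it assumes the failure message contains the text "not loaded" as quoted above.

```shell
# Hypothetical helper: retry `systemctl stop`, but give up immediately
# when systemd reports that the unit is not loaded.
stop_unless_not_loaded() {
  unit=$1
  retries=${2:-20}     # assumed default, from the numbers in this thread
  wait_sleep=${3:-5}   # assumed default, seconds between attempts
  attempt=1
  while [ "$attempt" -le "$retries" ]; do
    # Capture stderr so the error text can be inspected.
    if err=$(LC_ALL=C systemctl stop "$unit" 2>&1); then
      return 0
    fi
    case $err in
      *"not loaded"*)
        # Retrying cannot help: the unit does not exist on this host.
        echo "$unit is not loaded, skipping"
        return 0
        ;;
    esac
    if [ "$attempt" -lt "$retries" ]; then
      sleep "$wait_sleep"
    fi
    attempt=$((attempt + 1))
  done
  echo "$unit could not be stopped"
  return 1
}
```

Matching on error text is brittle (it depends on the message wording), which is one reason to prefer checking for the unit up front.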

Including the per-attempt overhead, for the sshd call on Ubuntu (where it doesn't exist):

  • Stop: 20 retries × (daemon-reload 0.5s + stop attempt ~0.5s + sleep 5s) ≈ 120s
  • Disable: 20 retries × the same per-attempt cost ≈ 120s

Total: ~4 minutes wasted on sshd alone.

The ssh call succeeds on first try (1-2s), so the combined wall time for both lines is roughly 4 minutes.
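
The arithmetic behind those estimates, as a small sketch. The helper name `retry_cost_seconds` is made up for illustration, and the per-attempt timings (0.5s daemon-reload, ~0.5s stop attempt, 5s sleep) are the rough figures from this comment, not measurements.

```shell
# Hypothetical helper: total seconds spent when every retry fails.
# Durations are given in milliseconds so integer arithmetic is enough.
retry_cost_seconds() {
  retries=$1
  reload_ms=$2
  attempt_ms=$3
  sleep_ms=$4
  echo $(( retries * (reload_ms + attempt_ms + sleep_ms) / 1000 ))
}

stop_s=$(retry_cost_seconds 20 500 500 5000)
disable_s=$(retry_cost_seconds 20 500 500 5000)
echo "stop: ${stop_s}s, disable: ${disable_s}s, total: $(( stop_s + disable_s ))s"
# prints "stop: 120s, disable: 120s, total: 240s"
```

Counting only the sleeps (`retry_cost_seconds 20 0 0 5000`) gives 100s per operation, which is where the lower ~3.3 minute figure comes from.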

list-unit-files only shows units with actual files on disk, unlike list-units --full --all which includes "not-found" phantom entries.

systemctl cat returns non-zero if the unit file doesn't exist, so no grep is needed.
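
A guard built on that exit status might look like this. It is only a sketch of the idea: `unit_file_exists` and `disable_and_stop_if_present` are hypothetical names, the retry wrappers are left out, and the real systemctlDisableAndStop in cse_helpers.sh may be structured differently.

```shell
# Hypothetical helper: succeeds only when systemd can find a unit file for $1.
unit_file_exists() {
  systemctl cat "$1" >/dev/null 2>&1
}

# Hypothetical helper: stop and disable $1 only if its unit file exists.
# A missing unit is treated as non-fatal, matching the existing behavior.
disable_and_stop_if_present() {
  unit=$1
  if ! unit_file_exists "$unit"; then
    echo "$unit has no unit file, skipping"
    return 0
  fi
  systemctl stop "$unit" || echo "$unit could not be stopped"
  systemctl disable "$unit" || echo "$unit could not be disabled"
}
```

Called for both names, for example `disable_and_stop_if_present ssh` and `disable_and_stop_if_present sshd`, the call for whichever unit is absent returns immediately instead of entering a retry loop.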

Regression risk is very low. Here's why:

  1. systemctl cat returns non-zero only if the unit file truly doesn't exist — if a service is installed (has a unit file), it returns 0 regardless of whether it's running, stopped, enabled, or disabled. So it won't skip services that should be stopped.
  2. The current behavior already silently succeeds — even today, if the service isn't found, systemctl_stop fails all retries, then the function just echoes a warning ("$1 could not be stopped"). The function doesn't return non-zero for this case. So the end result is identical — just 4 minutes faster.
  3. The same applies to the callers for nvidia services: on non-GPU VMs these services don't exist, and they currently burn through the same pointless retries.

One minor thing to verify: systemctl cat has been available since systemd 209 (Ubuntu 16.04+, all supported AzureLinux). No concern there.

No regression risk: it's strictly a performance improvement with an identical functional outcome.

Contributor

Copilot AI left a comment


Pull request overview

This PR optimizes Linux CSE service shutdown logic by avoiding expensive retry loops when a systemd unit doesn’t actually exist (notably sshd on Ubuntu), improving provisioning time without changing the functional outcome (missing units are still treated as non-fatal).

Changes:

  • Update systemctlDisableAndStop to gate stop/disable calls on systemctl cat <unit> instead of systemctl list-units | grep.

Comment thread parts/linux/cloud-init/artifacts/cse_helpers.sh
@djsly djsly merged commit a0e407d into main May 27, 2026
32 of 34 checks passed
@djsly djsly deleted the nishp/fix/ssh/regression branch May 27, 2026 14:05

3 participants