
Clean up Vbox VMs #5424

Open

davidpanderson opened this issue Nov 14, 2023 · 8 comments

Comments

@davidpanderson
Contributor

(Win) when I open VirtualBox Manager, there are dozens of entries in the VM list
with names like boinc_091b568ebb08451b, pointing to nonexistent slot directories.
These shouldn't be there; they should be cleaned up by vboxwrapper.

@computezrmle
Contributor

Do you remember which project you ran?

At least LHC@home is using this pattern for VM names (example):
boinc_d0488831bc598c0b

Recent logs show that deregistering/removal works fine with vboxwrapper 26206 used there:

2023-12-05 10:28:50 (42792): Powering off VM.
2023-12-05 10:28:51 (42792): Successfully stopped VM.
2023-12-05 10:28:51 (42792): Deregistering VM. (boinc_d0488831bc598c0b, slot#33)
2023-12-05 10:28:51 (42792): Removing network bandwidth throttle group from VM.
2023-12-05 10:28:51 (42792): Removing VM from VirtualBox.
10:28:57 (42792): called boinc_finish(0)

Using a vboxwrapper instance to remove a VM not under its own control may cause trouble:

  1. Another vboxwrapper running concurrently may just have sent a request to register a fresh VM
  2. Another BOINC instance may also run multiple vboxwrapper instances

As long as those instances run under the same user account, their VBoxManage requests are queued by VirtualBox and eventually written to/removed from the same VirtualBox.xml file.

@AenBleidd
Member

I believe we still need some kind of clean-up that checks the following:

  1. VM was created X days ago
  2. VM was not running for X days
  3. VM points to a slot directory that is now used by another task or is empty

Because there can always be situations where vboxwrapper fails to deregister/remove the VM, and then it stays stuck in the VirtualBox Manager forever.
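For illustration, a rough and untested sketch of checks (2) and (3), driving VBoxManage from Python. The staleness threshold and the handling of the aborted/inaccessible states are assumptions; check (1) is left out because VirtualBox records no creation time (see the next comment).

```python
import subprocess
from datetime import datetime, timedelta, timezone
from pathlib import Path

STALE_AFTER = timedelta(days=14)  # the "X days" above; the value is an assumption

def vm_info(uuid):
    """Parse `VBoxManage showvminfo <uuid> --machinereadable` into a dict."""
    out = subprocess.run(
        ["VBoxManage", "showvminfo", uuid, "--machinereadable"],
        capture_output=True, text=True).stdout
    info = {}
    for line in out.splitlines():
        key, _, value = line.partition("=")
        info[key] = value.strip('"')
    return info

def looks_orphaned(uuid):
    info = vm_info(uuid)
    # A VM whose .vbox file vanished is reported as inaccessible; treating
    # that as an orphan signal is an assumption worth verifying.
    if info.get("VMState") in (None, "", "inaccessible"):
        return True
    # (2) Powered off (or aborted after a crash) for longer than STALE_AFTER.
    #     VMStateChangeTime is assumed to be UTC here.
    changed = datetime.fromisoformat(
        info["VMStateChangeTime"].split(".")[0]).replace(tzinfo=timezone.utc)
    stale = (info["VMState"] in ("poweroff", "aborted")
             and datetime.now(timezone.utc) - changed > STALE_AFTER)
    # (3) The .vbox file inside the slot directory no longer exists.
    return stale and not Path(info["CfgFile"]).exists()
```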

@computezrmle
Contributor

As for (1.)
A typical VM entry in VirtualBox.xml looks like this.
VirtualBox provides no information as to when it was created:
<MachineEntry uuid="{e663f635-e077-4dea-be4a-287b325fc0dd}" src="/home/boinc3/BOINC_LHCVB/slots/1/boinc_a2df64699a65780d/boinc_a2df64699a65780d.vbox"/>

As for (2.)
There's already a watchdog implemented in vboxwrapper which ensures a stuck VM can be identified and shut down.
It's up to each project whether or not to use it.
Nonetheless, there can be situations (mostly after a crash) where the relationship between a registered VM and BOINC/vboxwrapper can't be restored.
This leaves orphans.
In general, I'm not aware of any method in BOINC/vboxwrapper to reliably decide (from the outside) whether a VM is stuck or is intentionally waiting for something to happen.

As for (3.)
This is the most promising point.
Assuming the BOINC client keeps track of the slot numbers and tells a fresh vboxwrapper to use slot n, this vboxwrapper should be authorized to clean up slot n (if not empty) and remove any VirtualBox object related to it.
It needs to be ensured this doesn't have unwanted side effects, e.g. on running VMs.
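As an illustration of that slot-scoped authorization, a hedged sketch in Python for brevity (the real vboxwrapper is C++); everything hinges on the assumption that "related to slot n" means the VM's .vbox file sits inside the assigned slot directory:

```python
import re
import subprocess
from pathlib import Path

def clean_slot(slot_dir: Path):
    """Unregister any leftover VM whose .vbox file lives inside slot_dir."""
    out = subprocess.run(["VBoxManage", "list", "vms"],
                         capture_output=True, text=True, check=True).stdout
    for _name, uuid in re.findall(r'"([^"]*)" \{([0-9a-f-]+)\}', out):
        info = subprocess.run(
            ["VBoxManage", "showvminfo", uuid, "--machinereadable"],
            capture_output=True, text=True).stdout
        cfg = re.search(r'^CfgFile="(.*)"$', info, re.MULTILINE)
        if not cfg or not Path(cfg.group(1)).is_relative_to(slot_dir):
            continue
        # Never touch a VM that is still running in this slot; that would be
        # a scheduling problem, not an orphan.
        if re.search(r'^VMState="running"', info, re.MULTILINE):
            continue
        # Plain unregistervm removes only the registry entry, which is all
        # that is left once the client has re-used or emptied the slot.
        subprocess.run(["VBoxManage", "unregistervm", uuid], check=True)
```

Run once before the wrapper registers its own VM, this would also sidestep the concurrency concern above: the wrapper only ever touches VMs tied to the slot the client assigned to it.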

@AenBleidd
Member

I think this should be a functionality of the BOINC client, completely decoupled from the vboxwrapper, because it might happen that there were no VBox tasks for quite a long time and we still need to clean up some orphan VMs.
There is currently a mechanism in the BOINC client that does some cleanup from time to time, so it could be extended at some point.

> Assuming the BOINC client keeps track of the slot numbers and tells a fresh vboxwrapper to use slot n.

It does.

@davidpanderson
Contributor Author

I agree that VM cleanup should be done by the client.
How exactly should it do it?

Also, has anyone besides me seen this issue?
If so, what project do the orphan VMs belong to?

@computezrmle
Contributor

VM names like "boinc_4e84e6a8a719072c" are used at least by LHC@home and cosmology@home.
Hence, it can't be said which project left the orphans nor when.

As long as the orphan machine entries are only in VirtualBox.xml, they may confuse a user looking through VirtualBox Manager, but in fact they do not affect fresh BOINC work.
In most cases they are remains after a crash, typically due to a power outage.
They can safely be removed manually using the VirtualBox Manager or scripted via VBoxManage.
In the latter case it must be ensured the entry doesn't belong to a VM that is just about to be created.
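For the scripted variant, the removal itself is one call per entry; a minimal sketch, with the caveat above encoded as a precondition rather than solved:

```python
import subprocess

def remove_orphan(uuid: str):
    # Precondition: the BOINC client is stopped, so this entry cannot belong
    # to a VM that a vboxwrapper is just about to register.
    # Plain unregistervm drops only the registry entry; the files behind an
    # orphan are typically already gone.
    subprocess.run(["VBoxManage", "unregistervm", uuid], check=True)
```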

Complaints about this have been posted in the LHC@home forums in the past, but not recently.

@davidpanderson
Contributor Author

Users can remove these entries manually.
But it would be good if BOINC did it automatically.

@Toby-Broom

I see it too; I just go in and clean them up by hand, or use VBoxManage to clean them up.

I think it was worse in the past than it is now, but that's just a feeling; this supports computezrmle's comment.

My hypothesis is that it could happen on a reboot of the computer. I run a shell script on Linux/Windows to wait for all the VMs to close down before rebooting, since the OS is not patient enough to wait ~2 min for the VMs to shut down.

You could compare the entries in VirtualBox.xml to the list of VMs BOINC knows about and remove the excess?
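Something along those lines should work; a rough, untested sketch, assuming the MachineEntry layout quoted earlier. The VirtualBox.xml location and the xmlns value are assumptions to verify per platform, and "BOINC's list" is approximated here by whether the slot path still exists (querying the client's active tasks via RPC would be stricter):

```python
import xml.etree.ElementTree as ET
from pathlib import Path

# Assumed default location on Linux; on Windows it is typically
# %USERPROFILE%\.VirtualBox\VirtualBox.xml.
VBOX_XML = Path.home() / ".config" / "VirtualBox" / "VirtualBox.xml"
NS = "{http://www.virtualbox.org/}"  # xmlns seen in VirtualBox settings files

def orphan_entries(vbox_xml=VBOX_XML):
    """Yield (uuid, src) for boinc_* entries whose .vbox file is gone."""
    root = ET.parse(vbox_xml).getroot()
    for entry in root.iter(f"{NS}MachineEntry"):
        src = Path(entry.get("src"))
        if src.name.startswith("boinc_") and not src.exists():
            yield entry.get("uuid").strip("{}"), src

# Each hit would then be removed with `VBoxManage unregistervm <uuid>`,
# rather than by editing VirtualBox.xml directly while VirtualBox might
# rewrite it.
```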
