Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Don't use port open check to determine if reboot completed. Fixes #856. #857

Closed

Conversation

@nh2
Copy link
Contributor

@nh2 nh2 commented Jan 27, 2018

See #856

The old approach, waiting for the machine to not having an open
port, and then waiting for it to be open again, was insufficient,
because of the race condition that the machine rebooted so quickly
that the port was immediately open again without nixops noticing
that it went down. I experienced this on a Hetzner cloud server.

The new approach checks the last reboot on the remote side
to change, which is not racy.

@nh2
Copy link
Contributor Author

@nh2 nh2 commented Jan 27, 2018

Changed output with this patch:

nixos-cachecache-hetzner> closures copied successfully
machine1> updating GRUB 2 menu...
machine1> rebooting...
packet_write_wait: Connection to 1.2.3.4 port 22: Broken pipe
machine1> waiting for reboot to complete...done.
machine1> activation finished successfully
@nh2
Copy link
Contributor Author

@nh2 nh2 commented Jan 27, 2018

Alternative output that can be created by this:

nixos-cachecache-hetzner> closures copied successfully
machine1> updating GRUB 2 menu...
machine1> rebooting...
machine1> waiting for reboot to complete...packet_write_wait: Connection to 1.2.3.4 port 22: Broken pipe

machine1> mux_client_request_session: read from master failed: Broken pipe
machine1> done.
machine1> activation finished successfully

So the output is not as pretty as before due to the errors, but at least it works without race-condition.

@nh2 nh2 force-pushed the nh2:issue-856-too-fast-reboot-race-condition branch from a3f6b8a to 6bc3e33 Feb 3, 2018
@nh2
Copy link
Contributor Author

@nh2 nh2 commented Feb 3, 2018

I pushed a small fix (def reboot() being overridden in subclasses).

@nh2 nh2 force-pushed the nh2:issue-856-too-fast-reboot-race-condition branch from 6bc3e33 to 37bc5af Feb 3, 2018
# and show an 'x' as progress indicator in that case.
self.log_continue("x")
if last_reboot_output is not None and last_reboot_output != pre_reboot_last_reboot_output:
break

This comment has been minimized.

@domenkozar

domenkozar Feb 4, 2018
Member

Shouldn't this call ssh.reset() eventually?

This comment has been minimized.

@nh2

nh2 Feb 14, 2018
Author Contributor

Based on my current understanding, this is not needed, because the only thing reset() is call shutdown() which exits the control master process, and I think that gets cleaned up automatically when the connection dies due to the machine rebooting.

This comment has been minimized.

@aszlig

aszlig May 27, 2018
Member

Hm, in theory you're right, but if the machine didn't manage to properly close the TCP socket (for example due to a hard reboot), the control master process is still alive.

This comment has been minimized.

@nh2

nh2 Jun 28, 2018
Author Contributor

Great point, you're right. I've changed this back to always reset at reboot.

I've also re-tested that change with the EC2 and Hetzner backends (though only Hetzner can do --hard reboots, where this is relevant).

@nh2 nh2 mentioned this pull request May 8, 2018
@nh2
Copy link
Contributor Author

@nh2 nh2 commented May 8, 2018

All good? This PR is the required base of my new PR #948, so it would be cool if we could finish this one.

# manner, we compare the output of `last reboot` before and after
# the reboot. Once the output has changed, the reboot is done.
def get_last_reboot_output():
return self.run_command('last reboot --time-format iso | head -n1', capture_stdout=True).rstrip()

This comment has been minimized.

@aszlig

aszlig May 15, 2018
Member

Both the Hetzner rescue system and the actual NixOS system are using systemd, so maybe it's a better idea to use systemd-analyze because it fails whenever bootup is not finished. At least that would avoid the current vs. last string comparison.

This comment has been minimized.

@nh2

nh2 May 26, 2018
Author Contributor

I don't think that can work because when NixOS is waiting for nixops keys to be uploaded (so, right here), you'll get

# systemd-analyze
Bootup is not yet finished. Please try again later.

This comment has been minimized.

@aszlig

aszlig May 27, 2018
Member

Oh, right... you're correct and after all we don't care about a truly finished reboot but only want to make sure the machine has rebooted after all, so last reboot should work.

@nh2 nh2 force-pushed the nh2:issue-856-too-fast-reboot-race-condition branch from 37bc5af to 3cbb613 May 26, 2018
@nh2
Copy link
Contributor Author

@nh2 nh2 commented May 26, 2018

Updated the PR, I noticed I have to not only catch nixops.ssh_util.SSHCommandFailed, but also nixops.ssh_util.SSHConnectionFailed explicitly.

@nh2 nh2 force-pushed the nh2:issue-856-too-fast-reboot-race-condition branch from 3cbb613 to 2fff961 May 26, 2018
@nh2
Copy link
Contributor Author

@nh2 nh2 commented May 26, 2018

Made another small improvement so that nixops reboot --hard also works when the machine is off.

@aszlig
aszlig approved these changes May 27, 2018
@aszlig
Copy link
Member

@aszlig aszlig commented May 27, 2018

@nh2: Approved, but see my last comment.

# command invocation changes.
# We use timeout=10 so that the user gets some sense
# of progress, as reboots can take a long time.
return self.run_command('last reboot --time-format iso | head -n1', capture_stdout=True, timeout=10).rstrip()

This comment has been minimized.

@coretemp

coretemp Jun 1, 2018

This pipe could be broken.

This comment has been minimized.

@nh2

nh2 Jun 28, 2018
Author Contributor

Fixed

@nh2 nh2 force-pushed the nh2:issue-856-too-fast-reboot-race-condition branch from 2fff961 to f83e244 Jun 28, 2018
@nh2
Copy link
Contributor Author

@nh2 nh2 commented Jun 28, 2018

OK, I've updated the commits to address the remaining comments.

The old approach, waiting for the machine to not having an open
port, and then waiting for it to be open again, was insufficient,
because of the race condition that the machine rebooted so quickly
that the port was immediately open again without nixops noticing
that it went down. I experienced this on a Hetzner cloud server.

The new approach checks the `last reboot` on the remote side
to change, which is not racy.
@nh2 nh2 force-pushed the nh2:issue-856-too-fast-reboot-race-condition branch from f83e244 to 8f94a85 Jul 2, 2018
@grahamc
Copy link
Member

@grahamc grahamc commented Mar 26, 2020

Hello!

Thank you for this PR.

In the past several months, some major changes have taken place in
NixOps:

  1. Backends have been removed, preferring a plugin-based architecture.
    Here are some of them:

  2. NixOps Core has been updated to be Python 3 only, and at the
    same time, MyPy type hints have been added and are now strictly
    required during CI.

This is all accumulating in to what I hope will be a NixOps 2.0
release
. There is a tracking issue for that:
#1242 . It is possible that
more core changes will be made to NixOps for this release, with a
focus on simplifying NixOps core and making it easier to use and work
on.

My hope is that by adding types and more thorough automated testing,
it will be easier for contributors to make improvements, and for
contributions like this one to merge in the future.

However, because of the major changes, it has become likely that this
PR cannot merge right now as it is. The backlog of now-unmergable PRs
makes it hard to see which ones are being kept up to date.

If you would like to see this merge, please bring it up to date with
master and reopen it
. If the or mypy type checking fails, please
correct any issues and then reopen it. I will be looking primarily at
open PRs whose tests are all green.

Thank you again for the work you've done here, I am sorry to be
closing it now.

Graham

@grahamc grahamc closed this Mar 26, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked issues

Successfully merging this pull request may close these issues.

None yet

5 participants
You can’t perform that action at this time.