nixops deploy kills the network connection #640

Closed
basvandijk opened this issue Apr 3, 2017 · 5 comments · Fixed by NixOS/nixpkgs#30994

@basvandijk
Member

I've had the following problem for as long as I have been using nixops (~3 years), but never took the time to report it since I had an acceptable workaround. However, since this is a serious issue that other people can run into, I would like to report it now and work towards a diagnosis and a fix.

Almost every time I perform a significant nixpkgs upgrade (for example, I just upgraded my nixos-17.03-small with the following changes: NixOS/nixpkgs-channels@8183006...6018464) and deploy to one of my machines on Hetzner using nixops, I get the following:

$ nixops deploy -s state.nixops -d my-net --include mymachine
building all machine configurations...
mymachine..............................> copying closure...
stalling-net> closures copied successfully
mymachine..............................> updating GRUB 2 menu...
mymachine..............................> stopping the following units:
  alsa-store.service,
  audit.service,
  kmod-static-nodes.service,
  network-addresses-eth0.service,
  network-link-eth0.service,
  network-local-commands.service,
  network-setup.service,
  nix-daemon.service,
  nix-daemon.socket,
  nscd.service,
  rngd.service,
  sshd-keygen.service,
  systemd-journal-catalog-update.service,
  systemd-modules-load.service,
  systemd-sysctl.service,
  systemd-timesyncd.service,
  systemd-tmpfiles-clean.timer,
  systemd-tmpfiles-setup-dev.service,
  systemd-udev-trigger.service,
  systemd-udevd-control.socket,
  systemd-udevd-kernel.socket,
  systemd-udevd.service

# hang!

^Cerror: interrupted
mymachine..............................> error: unable to activate new configuration

Note that the network-related services are stopped, probably because their systemd unit files have changed. Once the network has been stopped, nixops presumably loses its SSH connection to mymachine, because it just hangs (I have waited for at least a minute). I have to interrupt nixops using Ctrl-C.

Since the network is down, production services on mymachine become unreachable from the network. Fortunately our service can tolerate a bit of downtime, but I can imagine not everybody's can.

The way I currently "solve" it is to reboot mymachine from the Hetzner web interface. This is clearly a hack and not a real solution.

Some additional information that may or may not have anything to do with it:

@edolstra any idea why this is happening?

@edolstra
Member

edolstra commented Apr 4, 2017

I've seen this as well, but in my case the network did come back up eventually. switch-to-configuration ignores HUP signals precisely to handle this situation.

I thought the obvious solution was to set stopIfChanged = false on network-addresses-* units (so that they're restarted rather than stopped and started). But that breaks the reconfiguration case where the IP address is changed, since the preStop script of network-addresses-* will then try to delete the new rather than the old IP address.

edolstra added a commit to NixOS/nixpkgs that referenced this issue Apr 4, 2017
This reduces the time window during which IP addresses are gone during
switch-to-configuration. A complication is that with stopIfChanged =
false, preStop would try to delete the *new* IP addresses rather than
the old one (since the preStop script now runs after the switch to the
new configuration). So we now record the actually configured addresses
in /run/nixos/network/addresses/<interface>. This is more robust in
any case.

Issue NixOS/nixops#640.
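
The idea in the commit message above (record what was actually configured, then delete exactly that on stop) can be sketched in a few lines of shell. This is a minimal illustration, not the script NixOS generates: the function names addresses_start and addresses_stop and the STATE_DIR and IP_CMD variables are hypothetical, introduced only so the logic can be tried without touching a real interface. Only the /run/nixos/network/addresses/<interface> location comes from the commit message.

```shell
# Hypothetical knobs (assumptions, not NixOS options):
#   STATE_DIR - where configured addresses are recorded, one file per interface
#   IP_CMD    - command used in place of "ip", so the sketch can be dry-run
STATE_DIR="${STATE_DIR:-/run/nixos/network/addresses}"
IP_CMD="${IP_CMD:-ip}"

# Configure the given addresses on an interface and record exactly
# what was configured.
addresses_start() {
  local iface="$1"; shift
  local addr
  mkdir -p "$STATE_DIR"
  : > "$STATE_DIR/$iface"
  for addr in "$@"; do
    $IP_CMD addr add "$addr" dev "$iface" || true
    echo "$addr" >> "$STATE_DIR/$iface"
  done
}

# Delete the addresses that were recorded at start time, regardless of
# what the (possibly already switched) configuration says now.
addresses_stop() {
  local iface="$1"
  local state="$STATE_DIR/$iface"
  local addr
  [ -f "$state" ] || return 0
  while IFS= read -r addr; do
    $IP_CMD addr del "$addr" dev "$iface" || true
  done < "$state"
  rm -f "$state"
}
```

Because addresses_stop reads the state file rather than the current configuration, it removes the old address even when it runs after the switch to a configuration that names a different one.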
domenkozar added the bug label Apr 12, 2017
@basvandijk
Member Author

@nh2 this might be the problem you talked about during NixCon 2017.

@basvandijk
Member Author

basvandijk commented Oct 30, 2017

@groxxda, just FYI we (@aszlig, @nh2, @domenkozar, @fpletz and me) are currently investigating this issue at the NixCon 2017 hackathon.

We suspect the problem is that the PartOf=network-setup.service dependency of network-addresses-*, which was added by you, is the wrong way around. Before your commit the dependency was the other way around.

I think the desired behavior is that network-setup gets restarted when network-addresses-* gets restarted. Currently the opposite happens: whenever you change network-setup (for example by adding a nameserver), that service gets restarted and network-addresses-* gets restarted as well. When you do this over an SSH connection (as nixops does), you lock yourself out.

Note that a simple reboot of the machine makes it boot without problems.

We're working on a fix.
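
To make the direction of PartOf concrete, here is a small model in shell. It is a sketch of the propagation rule only (systemd propagates a stop or restart of the unit named in PartOf= to the unit that declares it, and not the other way around), not of anything systemd or nixops actually runs. The function name restarted_with and the "unit target" edge format are made up for this illustration.

```shell
# Model of PartOf= restart propagation.
# $1: edges, one per line, "<unit> <target>" meaning <unit> has PartOf=<target>
# $2: the unit being restarted
# Prints every unit that ends up restarted, one per line.
restarted_with() {
  local edges="$1" queue="$2" seen=" $2 "
  local current unit target
  while [ -n "$queue" ]; do
    # Pop the first unit off the space-separated queue.
    current="${queue%% *}"
    case "$queue" in
      *" "*) queue="${queue#* }" ;;
      *)     queue="" ;;
    esac
    echo "$current"
    # Any unit that is PartOf the current one is restarted as well.
    while read -r unit target; do
      [ "$target" = "$current" ] || continue
      case "$seen" in *" $unit "*) continue ;; esac
      seen="$seen$unit "
      if [ -n "$queue" ]; then queue="$queue $unit"; else queue="$unit"; fi
    done <<EOF
$edges
EOF
  done
}

# Direction suspected to be wrong: network-addresses-eth0 is PartOf network-setup.
suspected="network-addresses-eth0.service network-setup.service"
# Reversed direction: network-setup is PartOf network-addresses-eth0.
reversed="network-setup.service network-addresses-eth0.service"
```

With the suspected edges, restarted_with "$suspected" network-setup.service lists both units, so a nameserver change also takes down the address unit. With the reversed edges, the same call lists only network-setup.service.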

@nh2
Contributor

nh2 commented Oct 30, 2017

Here's a user story to describe in detail how things can break:

  • user installs dnsmasq

  • the nameserver configuration thus changes ("nameserver 127.0.0.1" gets added); in practice this means that the ExecStart script shown by systemctl cat network-setup.service, which previously looked like this:

    #! /nix/store/lpk84rsbha199vm3k54498lqv2jswqj8-bash-4.4-p5/bin/bash -e
    # Set the static DNS configuration, if given.
    /nix/store/631h021x3w0zaicaikzyzwxhig79si00-openresolv-3.8.1/sbin/resolvconf -m 1 -a static <<EOF
    
    
    nameserver 213.133.98.98
    nameserver 213.133.99.99
    nameserver 213.133.100.100
    nameserver 1234:123:0:a12a::add:1010
    nameserver 1234:123:0:a12b::add:9999
    nameserver 1234:123:0:1234::add:9898
    
    EOF
    
    # Set the default gateway.
    # FIXME: get rid of "|| true" (necessary to make it idempotent).
    ip route add default  via "1.2.3.5"   || true
    

    gains another nameserver 127.0.0.1 line:

    [...]
    nameserver 1234:123:0:1234::add:9898
    nameserver 127.0.0.1
    
    EOF
    [...]
    
  • When these contents are changed, network-setup.service gets restarted

  • Another service, network-addresses-eth0.service, has (as shown by systemctl cat network-addresses-eth0.service) the problematic entry PartOf=network-setup.service mentioned above

  • As a result, when network-setup.service gets restarted, network-addresses-eth0.service gets restarted as well

  • network-addresses-eth0.service has an ExecStop script with these contents:

    #! /nix/store/lpk84rsbha199vm3k54498lqv2jswqj8-bash-4.4-p5/bin/bash -e
    echo -n "deleting 1.2.3.4/26..."
    ip addr del "1.2.3.4/26" dev "eth0" >/dev/null 2>&1 || echo -n " Failed"
    echo ""
    

    which removes the IP of the server.
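
One way to check whether a machine has the problematic entry from the steps above is to look for it in the unit file text. The helper below is a hedged sketch: has_part_of is a name invented here, and it only scans text on stdin for a PartOf= line naming the given unit. It assumes the entry appears as a plain PartOf=... line, as quoted in this thread.

```shell
# Reads unit file text on stdin and succeeds if some PartOf= line
# lists the unit given as $1.
has_part_of() {
  local wanted="$1" line value unit
  while IFS= read -r line; do
    case "$line" in
      PartOf=*) value="${line#PartOf=}" ;;
      *) continue ;;
    esac
    # PartOf= may list several space-separated units.
    for unit in $value; do
      if [ "$unit" = "$wanted" ]; then
        return 0
      fi
    done
  done
  return 1
}

# Possible usage on the target machine (not run here):
#   systemctl cat network-addresses-eth0.service \
#     | has_part_of network-setup.service && echo "affected"
```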

@basvandijk
Member Author

basvandijk added a commit to LumiGuide/nixpkgs that referenced this issue Oct 30, 2017
Reverse the PartOf dependency between network-setup and network-addresses-*

This was joint work of: @nh2, @domenkozar, @fpletz, @aszlig and @basvandijk
at the NixCon 2017 hackathon.
domenkozar pushed a commit to NixOS/nixpkgs that referenced this issue Oct 30, 2017
Reverse the PartOf dependency between network-setup and network-addresses-*

This was joint work of: @nh2, @domenkozar, @fpletz, @aszlig and @basvandijk
at the NixCon 2017 hackathon.
adrianpk added a commit to adrianpk/nixpkgs that referenced this issue May 31, 2024
Reverse the PartOf dependency between network-setup and network-addresses-*

This was joint work of: @nh2, @domenkozar, @fpletz, @aszlig and @basvandijk
at the NixCon 2017 hackathon.