
Socket activated SSHD shows error in journalctl early at boot #3279

Closed
CMCDragonkai opened this issue Jul 15, 2014 · 8 comments

@CMCDragonkai
Member

I've got a socket activated SSHD on my NixOS instance:

In configuration.nix

    services = {
        openssh = {
            enable = true;
            startWhenNeeded = true;
            permitRootLogin = "no";
            passwordAuthentication = false;
            extraConfig = ''
                PrintLastLog no
            '';
        };
    };

When I check `sudo systemctl status`, I always see this on every boot:

● matrix-node
    State: degraded
     Jobs: 0 queued
   Failed: 1 units
    Since: Tue 2014-07-15 02:20:03 UTC; 1h 34min ago
   CGroup: /
           ├─1 systemd --log-target=journal
           ├─system.slice
           │ ├─nix-daemon.service
           │ │ └─3233 nix-daemon --daemon
           │ ├─ntpd.service
           │ │ └─1751 ntpd -g -n -c /nix/store/02izw6mxdn7sirjy9n6m0zrb9bp3rn8j-ntp.conf -u ntp:nogroup -i /var/lib/ntp
           │ ├─nscd.service
           │ │ └─1697 nscd -f /nix/store/jyaww75i9jac8rpjs3skjx519nqg0yy8-nscd.conf
           │ ├─dbus.service
           │ │ └─1381 /nix/store/6cyp9201l5wyh209fwzvyksva5s9fz4a-dbus-tools-1.8.0/bin/dbus-daemon --system --address=systemd: --nofork --nopidfile --systemd
           │ ├─systemd-logind.service
           │ │ └─1375 /nix/store/sk3017fxx2vhj69saizk5634y3w6hqhn-systemd-212/lib/systemd/systemd-logind
           │ ├─haveged.service
           │ │ └─1336 /nix/store/c6x8mp7yvzm8agcfaanwfjcnnmjk72r3-haveged-1.9.1/sbin/haveged -w 1024 -v 1
           │ ├─virtualbox.service
           │ │ └─1322 VBoxService --foreground
           │ ├─dhcpcd.service
           │ │ └─1714 dhcpcd --quiet --config /nix/store/272zp63msk1mj2ar1fhirisfn4vf8z9d-dhcpcd.conf
           │ ├─systemd-udevd.service
           │ │ └─1255 /nix/store/sk3017fxx2vhj69saizk5634y3w6hqhn-systemd-212/lib/systemd/systemd-udevd
           │ ├─systemd-journald.service
           │ │ └─1243 /nix/store/sk3017fxx2vhj69saizk5634y3w6hqhn-systemd-212/lib/systemd/systemd-journald
           │ └─system-getty.slice
           │   └─getty@tty1.service
           │     └─1331 agetty --noclear --login-program /nix/store/gzr868gspawhic2pcamqms8xbpp06rm2-shadow-4.1.5.1/bin/login --keep-baud tty1 115200 38400 9
           └─user.slice
             └─user-1000.slice
               ├─session-3.scope
               │ ├─2713 sshd: operator [priv]
               │ ├─2726 sshd: operator@pts/0
               │ ├─2727 -bash
               │ ├─3319 sudo systemctl status
               │ ├─3320 systemctl status
               │ └─3321 less
               └─user@1000.service
                 ├─2718 /nix/store/sk3017fxx2vhj69saizk5634y3w6hqhn-systemd-212/lib/systemd/systemd --user
                  ├─2719 (sd-pam)
                 └─ssh-agent.service
                   └─2725 /nix/store/2wc50fcn54axkg2kk71jm2r5h0w5rbh6-openssh-6.6p1/bin/ssh-agent -a /run/user/1000/ssh-agent

So there's always a failed service, and the OS always shows as degraded. This is unhelpful, because it will trigger false alerts in a centralised monitoring/logging system.

The specific failed service is an sshd service:

$ sudo systemctl status sshd@0-10.0.2.15:22-10.0.2.2:51317.service
● sshd@0-10.0.2.15:22-10.0.2.2:51317.service - SSH Daemon (10.0.2.2:51317)
   Loaded: loaded (/nix/store/wr8r8jrj204q3i0v4vfav8m63ssnv8w1-unit/sshd@.service)
   Active: failed (Result: exit-code) since Tue 2014-07-15 02:22:12 UTC; 1h 35min ago
  Process: 1650 ExecStart=/nix/store/2wc50fcn54axkg2kk71jm2r5h0w5rbh6-openssh-6.6p1/sbin/sshd -i -f /nix/store/ai2a554az21b5zhd1kamcznbim4gd924-sshd_config (code=exited, status=1/FAILURE)
  Process: 1648 ExecStartPre=/nix/store/i5wnidc4707k3pgcbhyjq3qb4ajgyx5n-unit-script/bin/sshd@-pre-start (code=exited, status=0/SUCCESS)
 Main PID: 1650 (code=exited, status=1/FAILURE)

Jul 15 02:20:12 matrix-node systemd[1]: Starting SSH Daemon (10.0.2.2:51317)...
Jul 15 02:20:12 matrix-node systemd[1]: Started SSH Daemon (10.0.2.2:51317).
Jul 15 02:22:12 matrix-node systemd[1]: sshd@0-10.0.2.15:22-10.0.2.2:51317.service: main process exited, code=exited, status=1/FAILURE
Jul 15 02:22:12 matrix-node systemd[1]: Unit sshd@0-10.0.2.15:22-10.0.2.2:51317.service entered failed state.

However, three instances of the sshd service were launched. The `systemctl status sshd.socket` output confirms this: three connections were accepted.

$ sudo systemctl status sshd.socket
● sshd.socket - SSH Socket
   Loaded: loaded (/nix/store/g8s84aaj5p3dcak3c2xinlwm7n7vxn52-unit/sshd.socket)
   Active: active (listening) since Tue 2014-07-15 02:20:06 UTC; 1h 36min ago
   Listen: [::]:22 (Stream)
 Accepted: 3; Connected: 1

Jul 15 02:20:06 matrix-node systemd[1]: Listening on SSH Socket.

I'm launching this through Vagrant, so there should be at least two sshd instances: the first when Vagrant boots the VM and SSHes in to set up the network and shared folders, and the second when I run `vagrant ssh` to log into the machine.

However, this shows three instances. Some more digging found that these three instances were:

sshd@0-10.0.2.15:22-10.0.2.2:51317.service  -> Failed - Unknown ???
sshd@0-10.0.2.15:22-10.0.2.2:51331.service -> Succeeded and Exited - Vagrant Startup Session
sshd@0-10.0.2.15:22-10.0.2.2:51446.service -> Still running (current session)

51317 happened first, then 51331, then 51446. Now I check `sudo journalctl -b -r` (reversed: most recent at the top, oldest at the bottom), and this is what I see:

...LOTS OF LOGS HERE...
**Jul 15 02:22:12 matrix-node systemd[1]: Unit sshd@0-10.0.2.15:22-10.0.2.2:51317.service entered failed state.**
Jul 15 02:22:12 matrix-node systemd[1]: sshd@0-10.0.2.15:22-10.0.2.2:51317.service: main process exited, code=exited, status=1/FAILURE
Jul 15 02:20:26 matrix-node systemd[1]: Started User Manager for UID 1000.
Jul 15 02:20:26 matrix-node systemd[2718]: Startup finished in 20ms.
...LOTS OF LOGS HERE...
Jul 15 02:20:26 matrix-node sshd[2713]: Accepted publickey for operator from 10.0.2.2 port 51336 ssh2: RSA dd:3b:b8:2e:85:04:06:e9:ab:ff:a8:0a:c0:04:6e:d6
**Jul 15 02:20:26 matrix-node systemd[1]: Started SSH Daemon (10.0.2.2:51336).**
Jul 15 02:20:26 matrix-node systemd[1]: Starting SSH Daemon (10.0.2.2:51336)...
...LOTS OF LOGS HERE...
Jul 15 02:20:15 matrix-node sshd[1757]: Accepted publickey for operator from 10.0.2.2 port 51331 ssh2: RSA dd:3b:b8:2e:85:04:06:e9:ab:ff:a8:0a:c0:04:6e:d6
**Jul 15 02:20:15 matrix-node systemd[1]: Started SSH Daemon (10.0.2.2:51331).**
Jul 15 02:20:15 matrix-node systemd[1]: Starting SSH Daemon (10.0.2.2:51331)...
...LOTS OF LOGS HERE...
**Jul 15 02:20:12 matrix-node systemd[1]: Started SSH Daemon (10.0.2.2:51317).**
Jul 15 02:20:12 matrix-node systemd[1]: Starting SSH Daemon (10.0.2.2:51317)...
Jul 15 02:20:12 matrix-node systemd[1]: Created slice system-sshd.slice.
Jul 15 02:20:12 matrix-node systemd[1]: Starting system-sshd.slice.

Take a look at the lines I marked with ** **. They show that 51317 started first and then failed two minutes later. This is the unknown SSH instance; I don't know why it started, since nothing ever tried to log in. The other two sshd service instances started and succeeded: one was Vagrant's, the other my current session.

So what is 51317? Why does it start when nothing is trying to log in, then subsequently fail and leave my VM in a degraded state? Does this have something to do with the socket activation?

@wmertens
Contributor

What happens when you just open and close the port with telnet?

@CMCDragonkai
Member Author

Interesting: when I telnet in and then run `systemctl status`, it now shows two failed ssh services.

$ telnet 127.0.0.1 2222
Trying 127.0.0.1...
Connected to 127.0.0.1.
Escape character is '^]'.
SSH-2.0-OpenSSH_6.6
dfgfgdfg
Protocol mismatch.
Connection closed by foreign host.
$ sudo systemctl status
● matrix-node
    State: degraded
     Jobs: 0 queued
   Failed: 2 units
    Since: Thu 2014-07-17 01:59:26 UTC; 24min ago
$ sudo systemctl
● sshd@0-10.0.2.15:22-10.0.2.2:50367.service                        loaded failed failed    SSH Daemon (10.0.2.2:50367)
  sshd@2-10.0.2.15:22-10.0.2.2:50423.service                        loaded active running   SSH Daemon (10.0.2.2:50423)
● sshd@3-10.0.2.15:22-10.0.2.2:51014.service                        loaded failed failed    SSH Daemon (10.0.2.2:51014)

I'm thinking this means:

  1. Something is connecting to the VM's SSH port as it boots.
  2. Why would sshd show up as a service failure if it's the client's fault for connecting with a mismatched protocol?

So is there a way to change this so that a client's failed connection doesn't count as an sshd failure?

@CMCDragonkai
Member Author

I spoke to someone on the systemd mailing list, they said:

sshd chooses to exit with a failure code in this case. What we should really
do is add SuccessExitStatus= setting to the sshd@.service and ignore that
code. Is it some specific value?

Then:

That looks like -1, but whatever. Does it work if 
you add SuccessExitStatus=255?

How would I change the SuccessExitStatus for the sshd@.service in Nix? I can't change it manually; is there some declarative Nix expression for this?
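For reference, a possible sketch (untested on this setup): NixOS generally lets you override generated units declaratively through the `systemd.services` option tree. Assuming the module names the template unit `sshd@`, something like this in configuration.nix should apply the setting the mailing list suggested:

```nix
# Sketch, assuming the generated template unit is called "sshd@":
# treat exit status 255 (what sshd -i returns when a client connection
# fails) as success, so the instance unit doesn't end up "failed".
{
  systemd.services."sshd@".serviceConfig.SuccessExitStatus = "255";
}
```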

@Profpatsch
Member

(triage) @CMCDragonkai can you re-check with a recent system?

@Profpatsch
Member

No response, so close @zimbatm

peterhoeg pushed a commit to peterhoeg/nixpkgs that referenced this issue Jun 25, 2016
sshd will at times fail when exiting. When socket activated, this will
leave a number of sshd@ service instances in the failed state, so we
simply ignore the error code if we are running socket activated.

Recommended by upstream:
http://systemd-devel.freedesktop.narkive.com/d0eapMCG/socket-activated-sshd-service-showing-up-as-a-failure-when-the-client-connection-fails

Fixes: NixOS#3279
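A module-side version of that idea (a hypothetical sketch, not the actual nixpkgs code; `cfg` and `lib` are the usual module arguments) would mark 255 as a success status only when socket activation is in use:

```nix
# Hypothetical sketch of the change the commit message describes:
# only ignore sshd's exit code when running socket activated.
serviceConfig = lib.optionalAttrs cfg.startWhenNeeded {
  SuccessExitStatus = "255";
};
```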
@CMCDragonkai
Member Author

@Profpatsch Apologies for the late reply, but it appears this has not been fixed yet. However, @peterhoeg has been working on it; the latest commit I can see is peterhoeg@83b9df2

@CMCDragonkai
Member Author

@peterhoeg Can you send a PR for your fix, so this issue can be truly closed?

peterhoeg pushed a commit to peterhoeg/nixpkgs that referenced this issue Aug 4, 2016
@peterhoeg
Member

By the way, I am REALLY sorry about spamming this issue. Note to self - only reference issues in the actual PR and not the commit message.

@globin globin reopened this Aug 5, 2016
peterhoeg pushed a commit to peterhoeg/nixpkgs that referenced this issue Aug 14, 2016