
SSH login delayed by 25 seconds with pam_systemd(sshd:session): Failed to create session: Connection timed out #30348

Open
nh2 opened this issue Oct 12, 2017 · 6 comments

@nh2 (Contributor) commented Oct 12, 2017

Issue description

On NixOS 17.09 machines, initial SSH connections after boot are delayed for exactly 25 seconds.

I think this is a regression from 17.03; I didn't observe it there.

journalctl --boot --unit sshd

Oct 12 12:20:22.652948 gluster-cluster-1 sshd[726]: Server listening on 0.0.0.0 port 22.
Oct 12 12:20:22.653218 gluster-cluster-1 sshd[726]: Server listening on :: port 22.
Oct 12 12:20:22.995019 gluster-cluster-1 sshd[766]: Accepted publickey for root from 1.2.3.4 port 52664 ssh2: RSA SHA256:aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
Oct 12 12:20:22.997136 gluster-cluster-1 sshd[766]: pam_unix(sshd:session): session opened for user root by (uid=0)
# 25 seconds pass
Oct 12 12:20:48.022631 gluster-cluster-1 sshd[766]: pam_systemd(sshd:session): Failed to create session: Connection timed out
Oct 12 12:20:51.050106 gluster-cluster-1 sshd[1238]: Accepted publickey for root from 1.2.3.4 port 52266 ssh2: RSA SHA256:aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa

The error here is `pam_systemd(sshd:session): Failed to create session: Connection timed out`.

This issue has been observed in Debian in two places, but the results were inconclusive:

There is also a systemd bug report that doesn't really suggest what to do:

There is some talk about dbus in it though, and I noticed the following on my machines:

# systemctl list-unit-files --user
UNIT FILE                    STATE   
dbus.service                 indirect
systemd-exit.service         static  
dbus.socket                  disabled
basic.target                 static  
bluetooth.target             static  
busnames.target              static  
default.target               static  
exit.target                  static  
graphical-session-pre.target static  
graphical-session.target     static  
paths.target                 static  
printer.target               static  
shutdown.target              static  
smartcard.target             static  
sockets.target               static  
sound.target                 static  
timers.target                static  

17 unit files listed.

Note that dbus.socket is disabled (and shown in red in my terminal). However, it is the same on my 17.03 machine, where I don't have this problem.

I've already tried `UseDNS no` for sshd, but that doesn't change anything.
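
In NixOS configuration this amounts to something like the following sketch (using the standard `services.openssh.extraConfig` option to append verbatim lines to `sshd_config`):

```nix
# Sketch: disable sshd's reverse DNS lookups in NixOS configuration.
# services.openssh.extraConfig appends verbatim lines to sshd_config.
{
  services.openssh = {
    enable = true;
    extraConfig = ''
      UseDNS no
    '';
  };
}
```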

Steps to reproduce

Technical details

  • System: NixOS deployed with NixOps (nixpkgs commit below)
  • Nixpkgs version: release-17.09 (commit dea58ce228e6f9686773745024eb844d8dc39fc3)
@nh2 (Contributor, Author) commented Oct 12, 2017

Possibly related?

Edit: Using dbus 1.11.20 didn't change anything for me.
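
For reference, a sketch of how a different dbus can be swapped in via an override; the version and hash below are placeholders, not necessarily the exact values I used:

```nix
# Sketch: swap in a different dbus version via packageOverrides.
# The version and sha256 below are placeholders.
{
  nixpkgs.config.packageOverrides = pkgs: {
    dbus = pkgs.dbus.overrideAttrs (oldAttrs: rec {
      version = "1.11.20";
      name = "dbus-${version}";
      src = pkgs.fetchurl {
        url = "https://dbus.freedesktop.org/releases/dbus/dbus-${version}.tar.gz";
        sha256 = "0000000000000000000000000000000000000000000000000000"; # placeholder
      };
    });
  };
}
```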

nh2 added a commit to nh2/nixops-gluster-example that referenced this issue Oct 12, 2017

WIP: Reproduction for NixOS/nixpkgs#30348
After deploying, use `nixops reboot`, and observe
how gluster-cluster-1 has the annoying

  pam_systemd(sshd:session): Failed to create session: Connection timed out

hang at the first SSH connection.

nh2 added a commit to nh2/nixops-gluster-example that referenced this issue Oct 16, 2017

Fix SSH hang by using .mount unit instead of fstab converter.
This apparently works around NixOS/nixpkgs#30348
for me.

No idea why the fstab converter would have this problem.
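
In NixOS terms, the workaround looks roughly like this (a sketch with placeholder server and volume names, not the exact config from the repo):

```nix
# Sketch: declare the GlusterFS mount as an explicit .mount unit
# (systemd.mounts) instead of going through fileSystems / the fstab generator.
# Server and volume names are placeholders.
{
  systemd.mounts = [
    {
      what = "glusterserver.example:/myvolume";
      where = "/glustermount";
      type = "glusterfs";
      wants = [ "network-online.target" ];
      after = [ "network-online.target" ];
      wantedBy = [ "multi-user.target" ];
    }
  ];
}
```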

nh2 added a commit to nh2/nixops-gluster-example that referenced this issue Oct 16, 2017

Improve robustness when rebooting.
So far, after a full reboot of all machines, some would sometimes
have failed systemd units.

Key changes:

* A mount-only machine is added to test that this use case works.
  This is what led me to find all the problems described below.

* Fix SSH hang by using .mount unit instead of fstab converter.
  This apparently works around NixOS/nixpkgs#30348 for me.
  No idea why the fstab converter would have this problem.
  The nasty
    pam_systemd(sshd:session): Failed to create session: Connection timed out
  error would slow down SSH logins by 25 seconds, also making reboots slower
  (because nixops keys upload uses SSH).
  It would also show things like `session-1.scope` as failed in systemctl.

* More robustly track (via Consul) whether the Gluster volume is already
  mountable from the client (that is, up and running on the servers).
  This took a few iterations; to implement it, I tried
  * manual sessions, but those have a 10-second minimum TTL which gets auto-extended
    even further when rebooting, so I tried
  * script checks, which don't kill the subprocess even when you give a
    `timeout` and don't allow setting a TTL, so I tried
  * TTL checks + a manual update script, but not even those set the check to
    failed when the TTL expires.
  See my filed Consul bugs:
  * hashicorp/consul#3569
  * hashicorp/consul#3563
  * hashicorp/consul#3565
  So I am using a more specific workaround now:
  a TTL check + manual update script, AND a script
  (`consul-scripting-helper.py waitUntilService --wait-for-index-change`)
  run by a service (`glusterReadyForClientMount.service`)
  that waits until the TTL of a check for the service is observed
  to be bumped at least once during the lifetime of the script.
  When the script observes a TTL bump, we can be sure that at least
  one of the Gluster servers has its volume up.

* `gluster volume status VOLUME_NAME detail | grep "^Online.*Y"` is used
  to check whether the volume is actually up.

* Consul's DNS feature is used to automatically pick an available server
  for the mount.
  dnsmasq is used to forward DNS queries for the *.consul domain
  to the Consul agent (see the sketch after this commit message).
  `allow_stale = false` is used to ensure that the DNS answers
  are not outdated.

* Create `/etc/ssl/dhparam.pem` to avoid spurious Gluster warnings
  (see https://bugzilla.redhat.com/show_bug.cgi?id=1398237).

* `consul-scripting-helper.py` received some fixes and extra loops
  to retry when Consul is down.
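
The dnsmasq forwarding mentioned above can be expressed roughly like this in NixOS config (a sketch; 8600 is Consul's default DNS port):

```nix
# Sketch: forward queries for the *.consul domain to the local Consul agent.
# Port 8600 is Consul's default DNS port.
{
  services.dnsmasq = {
    enable = true;
    extraConfig = ''
      server=/consul/127.0.0.1#8600
    '';
  };
}
```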

nh2 added a commit to nh2/nixops-gluster-example that referenced this issue Oct 16, 2017

Improve robustness when rebooting.
(Commit message identical to the one above.)

nh2 added a commit to nh2/nixops-gluster-example that referenced this issue Oct 16, 2017

Improve robustness when rebooting.
(Commit message identical to the one above, with this addition:)

This commit also switches to using `services.glusterfs.tlsSettings`
as implemented in NixOS/nixpkgs#27340, which revealed a lot of the above issues.

nh2 added a commit to nh2/nixops-gluster-example that referenced this issue Oct 22, 2017

Improve robustness when rebooting.
(Commit message identical to the one above.)

nh2 added a commit to nh2/nixops-gluster-example that referenced this issue Oct 22, 2017

Fix SSH hang by using .mount unit instead of fstab converter.
(Commit message identical to the earlier commit of the same name, with this addition:)

Only done this on the client so far, not yet on the servers.

nh2 added a commit to nh2/nixops-gluster-example that referenced this issue Oct 22, 2017

Improve robustness when rebooting.
(Commit message identical to the one above.)

@devurandom commented Mar 5, 2018

systemd/systemd#2863 was fixed by systemd/systemd#7102. Does that also fix this issue?

@nh2 (Contributor, Author) commented Mar 5, 2018

@devurandom I had a suspicion (see here) that systemd/systemd#7102 might fix the symptom but not the cause.

The train of thought is the following:

  • There is a difference between the behaviour of the fstab generator and mount units that we're not aware of, which is possibly the key bug; this difference did not exist in an older systemd version.
  • The appearance of `pam_systemd(login:session): Failed to create session: Connection timed out` was a nice repro for observing that difference.
  • systemd/systemd#7102 might make that error go away (I haven't had a chance to test that yet), thus possibly removing my way to repro it, but the underlying problematic difference between fstab processing and mount unit behaviour might still exist.

But I'm not very sure about it.
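
For concreteness, the fstab-generator path corresponds to the usual `fileSystems` declaration (a sketch with placeholder names), as opposed to the explicit `systemd.mounts` unit shown further up:

```nix
# Sketch: the fstab-generator variant. This produces an /etc/fstab entry
# which systemd-fstab-generator turns into a mount unit at boot, as opposed
# to declaring the .mount unit directly via systemd.mounts.
{
  fileSystems."/glustermount" = {
    device = "glusterserver.example:/myvolume"; # placeholder server/volume
    fsType = "glusterfs";
  };
}
```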

@volth (Contributor) commented Apr 5, 2018

Perhaps a different, but very similar bug:

A 25-second delay and `pam_systemd(slim:session): Failed to create session: Connection timed out` in the system log occur on each startup on virtual machines which have no network interfaces except lo.

Reproducible on current (2018-04-05) master and probably on release-18.03.

@r-darwish commented Apr 5, 2018

I'm not using NixOS or Nix, but I had the same issue on Arch Linux when systemd 238 arrived. Downgrading back to systemd 237 solved it.

@volth (Contributor) commented Apr 5, 2018

It happens with systemd 237 here.
