PAM responder delays with too many requests #6035
Hi, thanks for your test. I wouldn't object to a backlog size increase; I'd suggest opening a PR. I'm not sure it's worth making the size configurable...
The previous value (10) could introduce delays in responder answer in some highly used environment. See SSSD#6035 for test and details.
Reviewed-by: Alexey Tikhonov <atikhono@redhat.com>
Reviewed-by: Sumit Bose <sbose@redhat.com>
Hi @sumit-bose, what would you say about deducing …
Hi, this would work as well. But it looks like PR #6036 already increased the size to 128, a value which causes no issues for the reporter, and we just forgot to close this ticket. So I would wait until we come across a use case where a larger or more flexible backlog size is needed. bye,
Ok, thank you. So: fixed via 91e8c4f
Hello everybody,
tl;dr: it seems that the hardcoded backlog value (10) used in the listen() call in src/responder/common/responder_common.c can introduce delays.
https://github.com/SSSD/sssd/blob/2.6.3/src/responder/common/responder_common.c#L887
We've got servers with 2x48 cores in them running sssd, with an ldap provider. We are using the slurm job scheduler to let users run their jobs on the cluster.
From time to time, we get timeout issues from the Slurm daemon on the compute node, which tries to retrieve the user environment by (literally) running
/bin/su - <username> -c /usr/bin/env
https://github.com/SchedMD/slurm/blob/slurm-21-08-6-1/src/common/env.c#L1978
This can take time for multiple reasons, but we were still able to reproduce the timeout problem without executing login scripts (and with pam_lastlog deactivated, which can also introduce a one-second delay as it tries to lock /var/log/lastlog).
For example, here is a reproducer (without the su --login option, pam_lastlog deactivated):
You get two timeout groups: d < 1s and 1s < d < 2s. This is a synthetic reproducer, but that's exactly what can occur on a compute node when a lot of jobs start at the same time.
We "patched" the listen call from src/responder/common/responder_common.c using a listen() wrapper in an .so file (it was simpler in our context), with 128 as the backlog parameter, and checked that after the change 99% of answers were < 1s. (We used this recipe, with patchelf --add-needed instead of LD_PRELOAD: https://access.redhat.com/solutions/3314151.) What do you think about it?
Thank you for your help.
Jean-Baptiste