
PAM responder delays with too many requests #6035

Closed
jbd opened this issue Mar 7, 2022 · 4 comments

Comments


jbd commented Mar 7, 2022

Hello everybody,

tl;dr: it seems that the hardcoded backlog value (10) used in the listen() call in src/responder/common/responder_common.c can introduce delays.

https://github.com/SSSD/sssd/blob/2.6.3/src/responder/common/responder_common.c#L887
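For context, the pattern in question looks roughly like the sketch below (hypothetical function name and simplified error handling; not the actual SSSD source). With a small backlog, connection attempts that arrive while the accept queue is already full can be held off by the kernel until the responder accepts pending connections, which would be consistent with the two timing groups shown further down.

#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

/* Sketch only: create a UNIX stream socket with a small hardcoded backlog. */
static int open_responder_socket(const char *sock_path)
{
    struct sockaddr_un addr = { .sun_family = AF_UNIX };
    int fd;

    fd = socket(AF_UNIX, SOCK_STREAM, 0);
    if (fd < 0) {
        return -1;
    }

    strncpy(addr.sun_path, sock_path, sizeof(addr.sun_path) - 1);

    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) != 0 ||
        listen(fd, 10) != 0) {   /* backlog hardcoded to 10 */
        close(fd);
        return -1;
    }

    return fd;
}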

We've got servers with 2x48 cores running sssd, with an LDAP provider. We are using the Slurm job scheduler to let users run their jobs on the cluster.

From time to time, we get timeout issues from the Slurm daemon on the compute node, which tries to retrieve the user environment by (literally) running /bin/su - <username> -c /usr/bin/env.

https://github.com/SchedMD/slurm/blob/slurm-21-08-6-1/src/common/env.c#L1978

This can take time for multiple reasons, but we were still able to reproduce the timeout problem without executing login scripts (and with pam_lastlog deactivated, since it can also introduce a one-second delay when it tries to lock /var/log/lastlog).

For example, here is a reproducer (without the su --login option, pam_lastlog deactivated):

# for i in {1..64}; do time /bin/su jbdenis -c /usr/bin/env & done |& grep real; wait
real	0m0.178s
real	0m0.184s
[...]   # a few dozen more like this
real	0m0.178s
real	0m0.174s
real	0m0.185s
real	0m0.178s
real	0m1.154s
real	0m1.154s
real	0m1.148s
real	0m1.156s
real	0m1.147s
real	0m1.146s
real	0m1.157s
real	0m1.157s
real	0m1.146s
real	0m1.150s
real	0m1.150s
real	0m1.149s
real	0m1.154s
real	0m1.152s
real	0m1.152s
real	0m1.149s
real	0m1.151s
real	0m1.154s
real	0m1.147s
real	0m1.148s

You've got two groups of response times: d < 1s and 1s < d < 2s. This is a synthetic reproducer, but that's exactly what can occur on a compute node when a lot of jobs start at the same time.

We "patched" the listen call from src/responder/common/responder_common.c using a listen function wrapper in an .so file (it was simpler in our context) using 128 as backlog parameter and checked that we only had 99% <1s answers after the change. (We used this recipe and patchelf --add-needed instead of LD_PRELOAD: https://access.redhat.com/solutions/3314151).

What do you think about it?

Thank you for your help.

Jean-Baptiste


alexey-tikhonov commented Mar 7, 2022

Hi,

thanks for your test. I wouldn't object to a backlog size increase.

I'd suggest opening a PR. Not sure if it's worth making the size configurable...

jbd added a commit to jbd/sssd that referenced this issue Mar 7, 2022
The previous value (10) could introduce delays in responder answers in some heavily used environments.

See SSSD#6035 for test and details.
pbrezina pushed a commit that referenced this issue Mar 10, 2022
The previous value (10) could introduce delays in responder answers in some heavily used environments.

See #6035 for test and details.

Reviewed-by: Alexey Tikhonov <atikhono@redhat.com>
Reviewed-by: Sumit Bose <sbose@redhat.com>
shridhargadekar pushed a commit to shridhargadekar/sssd that referenced this issue Apr 1, 2022
The previous value (10) could introduce delays in responder answers in some heavily used environments.

See SSSD#6035 for test and details.

Reviewed-by: Alexey Tikhonov <atikhono@redhat.com>
Reviewed-by: Sumit Bose <sbose@redhat.com>
alexey-tikhonov self-assigned this Feb 21, 2023
alexey-tikhonov commented:

Hi @sumit-bose,

what would you say about deducing the listen() backlog size from the fd_limit responder config setting?
For example, fd_limit / 10?
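A minimal sketch of that idea (hypothetical helper name and clamping bounds; nothing that exists in SSSD today):

/* Sketch only: derive the listen() backlog from the responder's fd_limit
 * setting instead of using a fixed constant. */
static int backlog_from_fd_limit(int fd_limit)
{
    int backlog = fd_limit / 10;

    if (backlog < 10) {
        backlog = 10;     /* never go below the historical default */
    }
    if (backlog > 128) {
        backlog = 128;    /* cap at the currently hardcoded value */
    }

    return backlog;
}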

sumit-bose commented:

Hi,

this would work as well. But it looks like PR #6036 already increased the size to 128, a value which causes no issues for the reporter, and we just forgot to close this ticket. So I would wait until we come across a use case where a larger or more flexible backlog size is needed.

bye,
Sumit

alexey-tikhonov commented:

Ok, thank you.

So: fixed via 91e8c4f

alexey-tikhonov added the "Closed: Fixed" label and removed the "Future work" label Feb 22, 2023