
PAM responder delays with too many requests #6035

Closed
jbd opened this issue Mar 7, 2022 · 4 comments

Comments


jbd commented Mar 7, 2022

Hello everybody,

tl;dr: it seems that the hardcoded backlog value (10) used in the listen() call in src/responder/common/responder_common.c can introduce delays.

https://github.com/SSSD/sssd/blob/2.6.3/src/responder/common/responder_common.c#L887
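For context, the pattern in question looks roughly like the sketch below (hypothetical function name and simplified error handling; not the actual SSSD source). With a small backlog, connection attempts that arrive while the accept queue is already full can be held off by the kernel until the responder accepts pending connections, which would be consistent with the two timing groups shown further down.

#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

/* Sketch only: create a UNIX stream socket with a small hardcoded backlog. */
static int open_responder_socket(const char *sock_path)
{
    struct sockaddr_un addr = { .sun_family = AF_UNIX };
    int fd;

    fd = socket(AF_UNIX, SOCK_STREAM, 0);
    if (fd < 0) {
        return -1;
    }

    strncpy(addr.sun_path, sock_path, sizeof(addr.sun_path) - 1);

    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) != 0 ||
        listen(fd, 10) != 0) {   /* backlog hardcoded to 10 */
        close(fd);
        return -1;
    }

    return fd;
}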

We've got servers with 2x48 cores running sssd, with an LDAP provider. We are using the Slurm job scheduler to let users run their jobs on the cluster.

From time to time, we get timeout issues from the Slurm daemon on the compute node, which tries to retrieve the user environment by (literally) running /bin/su - <username> -c /usr/bin/env.

https://github.com/SchedMD/slurm/blob/slurm-21-08-6-1/src/common/env.c#L1978

This can take time for multiple reasons, but we were still able to reproduce the timeout problem without executing login scripts (and with pam_lastlog deactivated, since it can also introduce a one-second delay when it tries to lock /var/log/lastlog).

For example, here is a reproducer (without the su --login option, pam_lastlog deactivated):

# for i in {1..64}; do time /bin/su jbdenis -c /usr/bin/env & done |& grep real; wait
real	0m0.178s
real	0m0.184s
[...]   # a few dozen more like this
real	0m0.178s
real	0m0.174s
real	0m0.185s
real	0m0.178s
real	0m1.154s
real	0m1.154s
real	0m1.148s
real	0m1.156s
real	0m1.147s
real	0m1.146s
real	0m1.157s
real	0m1.157s
real	0m1.146s
real	0m1.150s
real	0m1.150s
real	0m1.149s
real	0m1.154s
real	0m1.152s
real	0m1.152s
real	0m1.149s
real	0m1.151s
real	0m1.154s
real	0m1.147s
real	0m1.148s

You've got two groups of response times: d < 1s and 1s < d < 2s. This is a synthetic reproducer, but that's exactly what can occur on a compute node when a lot of jobs start at the same time.

We "patched" the listen call from src/responder/common/responder_common.c using a listen function wrapper in an .so file (it was simpler in our context) using 128 as backlog parameter and checked that we only had 99% <1s answers after the change. (We used this recipe and patchelf --add-needed instead of LD_PRELOAD: https://access.redhat.com/solutions/3314151).

What do you think about it?

Thank you for your help.

Jean-Baptiste


alexey-tikhonov commented Mar 7, 2022

Hi,

thanks for your test. I wouldn't object to a backlog size increase.

I'd suggest opening a PR. Not sure if it's worth making the size configurable...

jbd added a commit to jbd/sssd that referenced this issue Mar 7, 2022
The previous value (10) could introduce delays in responder answers in some heavily used environments.

See SSSD#6035 for test and details.
pbrezina pushed a commit that referenced this issue Mar 10, 2022
The previous value (10) could introduce delays in responder answers in some heavily used environments.

See #6035 for test and details.

Reviewed-by: Alexey Tikhonov <atikhono@redhat.com>
Reviewed-by: Sumit Bose <sbose@redhat.com>
shridhargadekar pushed a commit to shridhargadekar/sssd that referenced this issue Apr 1, 2022
The previous value (10) could introduce delays in responder answers in some heavily used environments.

See SSSD#6035 for test and details.

Reviewed-by: Alexey Tikhonov <atikhono@redhat.com>
Reviewed-by: Sumit Bose <sbose@redhat.com>
alexey-tikhonov self-assigned this Feb 21, 2023
alexey-tikhonov commented:

Hi @sumit-bose,

what would you say about deducing the listen() backlog size from the fd_limit responder config setting?
For example, fd_limit / 10?
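A minimal sketch of that idea (hypothetical helper name and clamping bounds; nothing that exists in SSSD today):

/* Sketch only: derive the listen() backlog from the responder's fd_limit
 * setting instead of using a fixed constant. */
static int backlog_from_fd_limit(int fd_limit)
{
    int backlog = fd_limit / 10;

    if (backlog < 10) {
        backlog = 10;     /* never go below the historical default */
    }
    if (backlog > 128) {
        backlog = 128;    /* cap at the currently hardcoded value */
    }

    return backlog;
}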

sumit-bose commented:

Hi,

this would work as well. But it looks like PR #6036 already increased the size to 128, a value which causes no issues for the reporter, and we just forgot to close this ticket. So I would wait until we come across a use case where a larger or more flexible backlog size is needed.

bye,
Sumit

alexey-tikhonov commented:

Ok, thank you.

So: fixed via 91e8c4f

alexey-tikhonov added the "Closed: Fixed" label and removed the "Future work" label Feb 22, 2023