sssd doesn't restart properly after being killed by watchdog #6219
Hi,
Could you please check what operation is blocking in
I just realized the timestamps are the same - 2022-06-14 0:31:43 - so this is the same event.
Yes, all the logs I posted relate to the same event, which happened this night. Here are the lines from sssd_$domain.log at 2022-06-14 0:31:43.
I skipped a lot of similar lines in the middle where sysdb_set_entry_attr sets [ts_cache] attrs.
(2022-06-14 0:31:07): [be[$domain]] [pam_print_data] (0x0100): cli_pid: 88687
Did it really succeed?
Btw, this delay:
is also suspicious. After the first line is printed in
What filesystem and what kind of drive do you use for
Hmmm... I'm not 100% sure. My impression that it was OK is based on the ssh auth log: auth.log
Here are the lines from sssd_$domain.log before 0:31:43
I'll restart sssd with debug_level = 9.
About the sssd db files: /dev/mapper/vg-var on /var type xfs (rw,relatime,attr2,inode64,logbufs=8,logbsize=32k,noquota). The files don't seem to be huge, but I don't know how big they should be:
A little piece of information:
I restarted sssd with debug_level = 9, and tonight we had another WATCHDOG kill, but this time it was slightly different. sssd.log:
sssd_nss.log
sssd_pam.log
So it seems to me that sssd was killed by the WATCHDOG at 0:38 and 0:40. It tried to restart, but it didn't work until 1:34, when it suddenly started to work. Am I right in my analysis?

With the new debug_level I have a lot of lines in sssd_domain.log. Is it interesting to post them all? Should I search for a particular pattern, or for a specific timestamp?

I already noticed one detail: I see some messages at 0:38 for nss and pam, but there is nothing for that timestamp in sssd_domain.log. Just nothing. There are lines for 0:37 and then it restarts at 0:39. Not a big deal probably, but I was a little surprised.
What is in
What operation was running in
Could you please grep "Starting with debug level" in
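As a side note, the suggested grep is easy to demo. The marker line is logged every time an sssd process starts, so counting its occurrences shows how many (re)starts were recorded. The sample log lines below are fabricated for illustration, and /var/log/sssd/ is only the common default location, not confirmed in this thread:

```shell
# Count process starts by grepping for the startup marker line.
log=$(mktemp)
cat > "$log" <<'EOF'
(2022-06-16  0:39:01): [be[LDAP]] [server_setup] (0x3f7c0): Starting with debug level = 0x0070
(2022-06-16  1:34:12): [be[LDAP]] [server_setup] (0x3f7c0): Starting with debug level = 0x0070
EOF
count=$(grep -c "Starting with debug level" "$log")
echo "$count"   # number of recorded starts in the sample: 2
rm -f "$log"
```

On a real system you would point the grep at /var/log/sssd/sssd_$domain.log instead of the temporary file.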
This is "expected" - a stuck process (a target for the watchdog) can't write to the log.
Surprisingly, nothing. Here is the sssd.log:
The next line after 0:40 is at 0:32 this night (17.06). This night it was killed again, and this time it didn't restart properly. Here are the lines for grep "Starting with debug level":
Here are the lines before 0:38 (at 0:37:59 and a little before; I could get more if necessary):
before 0:40:06
before 0:40:37
In the logs that I posted there are no SSS_PAM_OPEN_SESSION messages. These messages should be printed only when a user tries to connect, am I right? There were two parts of the log with SSS_PAM_OPEN_SESSION messages, at (2022-06-16 0:36:55) and (2022-06-16 0:37:36), so it was before the first watchdog kill at (2022-06-16 0:38:26), and the users connected successfully.
I just realized you have
So when something goes wrong =>
You could try to increase
Another option could be to modify
Ok, thank you for your help.
Yes, I mentioned it in my first messages. I also put "timeout = 15" into the global [sssd] section. I thought it would also apply to [domain], but if it doesn't, it's a good idea: I will put a timeout into the domain section as well. I also have some new details about this issue.
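For readers following along, moving the watchdog timeout into the domain section would look roughly like this. This is only a sketch: the domain name and value are placeholders, and `timeout` here is the per-service/per-domain watchdog interval described in sssd.conf(5):

```ini
[sssd]
services = nss, pam
domains = example.com

[domain/example.com]
# The watchdog ticks every `timeout` seconds and kills the process
# after several missed ticks, so a larger value tolerates longer stalls.
timeout = 30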
So we partially solved the problem; at least our sssd doesn't stop every day.
a) The sssd_be process gets stuck because of high load (caused by Proxmox in our case, or some other reason). But could we do something about the restart process?
Could this option be useful in our case? Maybe if we restart sssd more slowly, we would have a better chance that it restarts successfully?
Frankly, it's not often I see
This is a per-service/per-domain option - see
I guess it's because the machine is still under high load and some basic operations
If you identified root cause correctly, then setting this timeout large enough (2*timeout > period of high load) should help.
The reason is that systemd only sees the "main" ("umbrella") process. Actually, there is a long-standing idea to get rid of this "main"
No, it doesn't affect the way
Thank you very much, it's clearer for me now. I still have a little question about declaring services in the [sssd] section. In the past I always used this option. I had a look in the man page and found
So I decided that it isn't needed anymore. In any case, it changes nothing about the original problem (sssd has difficulty restarting); it occurs in both cases.
P.S.: Tomorrow I can give the exact boot error message.
We have had a similar issue with CentOS Stream 8: the sssd LDAP child was killed by the watchdog during high load but was not properly restarted, so ssh to the server was broken and did not recover by itself until sssd was restarted via systemd.
That's all there is in the logs so far. Could you please tell me whether it's possible to configure the number of child restart attempts, or the timeouts between restarts, or to restart the whole process after some timeout when the child is unresponsive?
Update: probably
Sorry, it's impossible to tell without logs with debug_level=9. In your quote even the timestamps are cut off, so the order is unclear. With a debug level high enough, one could:
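For reference, raising the log level is a config change plus a service restart. A minimal sketch (the section names are illustrative; debug_level can be set globally and per section):

```ini
[sssd]
debug_level = 9

[domain/example.com]
# Level 9 includes trace-level messages, which is what is needed
# to see where a process is stuck before the watchdog fires.
debug_level = 9
```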
Sorry, I just wanted to mention quickly that we had something similar. I'm back with some logs: sssd.log:
sssd_LDAP.log:
sssd_nss.log:
sssd_pam.log:
13:27:26 is the time when we manually restarted sssd.
Unfortunately, the default log level isn't enough in this case.
Ok, I see, I just thought maybe it could give us a clue. Maybe we'll try to reproduce it with debug level 9. Thanks!
Exact same problem here. |
Editing sssd.conf:
fixes the problem. This will retry the connection to the provider for one week; otherwise the WATCHDOG kills it with its default settings and destroys the auth backend.
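The exact edit was not preserved in the quote above. As a hypothetical illustration only, the sssd.conf(5) knobs such a fix would typically touch look like this (all values are made up, and the domain name is a placeholder):

```ini
[domain/example.com]
# Watchdog interval: the process must respond within this window.
timeout = 60
# How long to wait before retrying the backend after going offline,
# and the cap on the exponential back-off between retries.
offline_timeout = 60
offline_timeout_max = 3600
```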
The main problem is that this configured default is incompatible with the default sssd.service unit, because:
(for me, on Fedora and Red Hat flavors) this will leave the stopped service in a dead state. This problem should be quite common, since an overloaded connection (e.g. during backups) can easily trigger this behavior.
Hi everybody,
for pid in
In the event that java again fills the memory and an OOM occurs.
Am I missing something? This method looks error-prone too. The best approach would be to find the source of this bad memory management and really fix the problem. Back to topic:
is defined in the sssd config block. Then in ./src/responder/common/responder_common.c:
later in ./src/sbus/connection/sbus_reconnect.c
and finally in the same file
The log files show this accordingly:
This kills the process with exit code 0, so as far as systemd is concerned everything is fine, and it does not really want to restart the service, because the watchdog returned 0 after the 3 default attempts. Am I missing something? See either
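Assuming the diagnosis above is right (a clean exit that Restart=on-failure ignores), one workaround is to tell systemd to restart regardless of exit status via a drop-in (systemctl edit sssd.service). This is a sketch, not the upstream unit, and the values are guesses:

```ini
# /etc/systemd/system/sssd.service.d/override.conf (drop-in sketch)
[Service]
# on-failure skips exit code 0; 'always' restarts even on a clean exit.
Restart=always
RestartSec=10s
```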
I've done some digging here, and we have
This in turn introduced a problem where the system would not finish booting in the event that sssd failed to start (e.g. due to a misconfiguration), which was addressed in a049ac7 by changing
In this issue, we see that in the event that sssd's child processes are not responding (e.g. due to I/O issues on virtualized hardware), it will kill itself and exit with a non-zero status; however, because it now has
I suspect that the
Ideally you want to set the start limit so that it is triggered by repeated starts with an invalid configuration, but not by timeouts. Of course, the better solution would be to remove the
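As an illustration of that trade-off (all values are guesses, not recommendations), the relevant knobs live in the [Unit] and [Service] sections of a drop-in:

```ini
# Drop-in sketch: allow a burst of restarts during a transient stall,
# but still stop retrying a persistently broken configuration.
[Unit]
StartLimitIntervalSec=120s
StartLimitBurst=5

[Service]
Restart=on-failure
RestartSec=5s
```

With these numbers, five failures within two minutes put the unit into the failed state; slower failure cycles keep restarting indefinitely.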
I've run into this issue when doing a zfs recv on a computer with very weak IO. Applying the suggested changes to sssd.service and restarting sssd immediately fixed the issue and allowed me to log back in as an AD user.
Any progress on this major problem? |
Recently we have been seeing some strange behavior; I hope somebody can find the solution. Thank you for your help in advance.
We are using sssd plugged into LDAP on our ssh server.
The server is installed on
Here are the version of sssd packages :
Here is sssd.conf :
Globally it works perfectly, but sometimes the sssd process is killed by the watchdog and then can't start up again.
This load by itself is also a strange thing, but it is probably not linked to sssd.
sssd.log (with debug_level=5)
I see these lines in the sssd_pam and sssd_nss logs:
sssd_pam.log
sssd_nss.log
I said "partially" because sssd.service has active status by itself, but it doesn't work properly and no authentication is possible.
sssd_pam.log
So, this is pretty serious, as it happens often enough and the ssh server becomes useless until we restart the sssd service.
Restarting the sssd service solves the problem, but I hope there is a better solution.
Here are some additional details/thoughts:
There was a "services" setting in the [sssd] part of this conf:
[sssd]
services = nss, pam
As I understand it, this option is not necessary anymore, so I commented this line out.
The problem described above was already present, and commenting this line didn't solve it.
But still I'd like to mention this change.
I added
timeout = 15
option into the [sssd] part of the config, but it didn't help either. There is a "RestartSec" option for systemd.
I'm wondering whether setting this option for the sssd-nss and sssd-pam services could change something (maybe the problem is that sssd-nss restarts too quickly)?
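For completeness, such a drop-in would look roughly like this. It assumes your distribution ships socket-activated responder units (e.g. sssd-nss.service); the 5-second value is an arbitrary guess, applied with systemctl edit sssd-nss.service:

```ini
# Drop-in sketch: wait a little before restarting the nss responder,
# instead of restarting it immediately after it dies.
[Service]
RestartSec=5s
```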