failover: tune up default timeouts #636
The defaults are the problem. Mostly I don't like that the LDAP operation timeout is 24 seconds, but the failover timeout of 18 seconds is also too long.
If the server was really slow or was dropping (not rejecting) packets, then the failover would take almost half a minute, which seems like too much.
Btw, why is the LDAP timeout the longest? I thought that the top-level timeout was the failover one, which gives you a server. Or is the LDAP timeout used for something like sdap_op, which includes the server resolution with failover and then the LDAP connection?
The third one is actually used in more contexts, but it is the one that fails the failover prematurely (together with the second one). So perhaps instead of increasing the timeouts we should rework the code so that we are able to fail over correctly (perhaps not use 2. and 3. at all). But this is a much bigger task.
My suggestion would be to lower the "lower level" timeouts and to do this only for the AD provider for a start.
The reasoning is that imo we can assume that an AD environment is "sufficiently well maintained" that there should be no need to wait more than 1s for a reply from a DNS server.
RESOLV_TIMEOUTMS, which is currently hardcoded to 2000 ms (2s), is actually the time SSSD will wait for a reply from a single DNS server. After that the next DNS server in the list is tried. It would make sense to make this configurable and lower it for the AD provider to 1000 ms (1s) or even a bit lower.
If we say we want to at least have the chance to try 3 DNS servers, dns_resolver_op_timeout can be 3s. (Btw, the man page entry for dns_resolver_op_timeout has to be fixed because it says "How long would SSSD talk to a single DNS server." but it is, as you said above, the "timeout for single dns query".)
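To make the arithmetic above concrete, here is a minimal sketch of how the per-server timeout and the per-query timeout relate. It assumes that DNS servers are tried strictly one after another and that each unresponsive server consumes its full timeout. The function names are hypothetical and are not part of SSSD.

```python
def servers_reachable(op_timeout_ms: int, per_server_timeout_ms: int) -> int:
    """How many DNS servers can each get a full timeout within one query.

    Assumes sequential tries, each waiting the full per-server timeout.
    """
    if per_server_timeout_ms <= 0:
        raise ValueError("per-server timeout must be positive")
    return op_timeout_ms // per_server_timeout_ms


def min_op_timeout_ms(servers_to_try: int, per_server_timeout_ms: int) -> int:
    """Smallest per-query timeout that still lets `servers_to_try` servers be tried."""
    return servers_to_try * per_server_timeout_ms


if __name__ == "__main__":
    # Proposed AD values: 1000 ms per server, chance to try 3 servers.
    print(min_op_timeout_ms(3, 1000))
    # Same goal with the current hardcoded 2000 ms per server.
    print(min_op_timeout_ms(3, 2000))
```

Under these assumptions, keeping the per-server wait at 2000 ms would need a per-query timeout of 6s to reach a third server, while 1000 ms fits into the suggested 3s.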
I would try to avoid touching ldap_opt_timeout since, as you said, it is used in various places. Maybe a dedicated failover timeout can be added for sdap_id_op.
@jhrozek mentioned that it might be possible to keep the c-ares state around, so that after a timeout talking to one DNS server, the state pointing at the next server that replied in time is used for upcoming requests as well. Currently a new DNS request will start with the same server that showed the timeout, so for sequences like looking up a SRV record and then resolving a single host from the returned list we have to wait twice for the first DNS server not replying. But since this is an improvement which is not strictly related to the default timeout values, it can of course be solved in a different ticket/PR.
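The cost of restarting at the unresponsive server can be illustrated with a toy model. This is only a sketch under stated assumptions: it does not use the c-ares API, the function and parameter names are hypothetical, and the 10 ms reply time is made up for illustration.

```python
from typing import List


def total_wait_ms(server_up: List[bool], n_queries: int, timeout_ms: int,
                  reply_ms: int, keep_state: bool) -> int:
    """Total wait for n_queries sequential lookups against an ordered server list.

    keep_state=False: every query starts again at the first server.
    keep_state=True: later queries start at the server that last replied.
    """
    start = 0
    total = 0
    for _ in range(n_queries):
        for idx in range(start, len(server_up)):
            if server_up[idx]:
                total += reply_ms
                if keep_state:
                    start = idx  # remember the server that answered
                break
            total += timeout_ms  # full timeout spent on a silent server
    return total


if __name__ == "__main__":
    servers = [False, True]  # first DNS server does not reply
    # Two lookups: the SRV record, then one host from the returned list.
    print(total_wait_ms(servers, 2, 2000, 10, keep_state=False))
    print(total_wait_ms(servers, 2, 2000, 10, keep_state=True))
```

In this model the SRV-then-host sequence pays the timeout twice without remembered state and only once with it.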
Ok, how about now? It adds new option
I had to increase the