failover: tune up default timeouts #636

Open: 4 commits to be merged into base: master

@pbrezina
Member

commented Aug 13, 2018

@jhrozek

Contributor

commented Aug 13, 2018

18 and 24 seconds seem like /really/ long timeouts. Can we use something lower? I can't imagine waiting 24 seconds until SSSD goes offline.

@pbrezina

Member Author

commented Aug 13, 2018

What would you like to have? These values allow SSSD to try 3 servers before the whole operation times out. Can we decrease the lowest timeout from 6 to, say, 3? I don't think so.
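
The "3 servers" figure follows from dividing the enclosing timeout by the per-attempt one. A minimal sketch of that arithmetic, assuming the 6 and 18 second values are the per-attempt and failover timeouts respectively (the mapping is inferred from later comments, and `attempts_within` is a hypothetical helper, not SSSD code):

```python
def attempts_within(budget_s: int, per_attempt_s: int) -> int:
    """Return how many attempts fit in the budget if every attempt times out."""
    if per_attempt_s <= 0:
        raise ValueError("per-attempt timeout must be positive")
    return budget_s // per_attempt_s


# Values from this thread, in seconds. Which option each one belongs to
# is an assumption made for illustration.
LOWEST_TIMEOUT = 6
FAILOVER_TIMEOUT = 18

print(attempts_within(FAILOVER_TIMEOUT, LOWEST_TIMEOUT))  # 3
```

Under the same model, a per-attempt timeout of 3 seconds would fit three attempts into 9 seconds rather than 18.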

@pbrezina

Member Author

commented Apr 1, 2019

Bump. The patch is trivial; we just need to agree on the defaults.

@pbrezina pbrezina force-pushed the pbrezina:timeouts branch from 58daa8f to 6beb8a4 Apr 2, 2019

@pbrezina

Member Author

commented Apr 2, 2019

Rebased on top of current master.

@jhrozek

Contributor

commented Apr 4, 2019

The defaults are the problem. Mostly I don't like that the LDAP operation timeout is 24 seconds, but the failover timeout of 18 seconds is also too long.

If the server was really slow or was dropping (not rejecting) packets, then the failover would take almost half a minute, which seems like too much.

Btw, why is the LDAP timeout the longest? I thought the top-level timeout was the failover one, which gives you a server. Or is the LDAP timeout used for something like sdap_op, which includes server resolution with failover and then the LDAP connection?

@pbrezina

Member Author

commented Apr 5, 2019

Yes.

  1. dns_resolver_op_timeout -- timeout for a single DNS query
  2. dns_resolver_timeout -- timeout for service resolution (it may include multiple DNS queries)
  3. ldap_opt_timeout -- timeout for the LDAP connection in the sdap_id_op code

The third one is actually used in more contexts, but it is the one that, together with the second, makes the failover fail prematurely. So perhaps instead of increasing the timeouts we should rework the code so that we are able to fail over correctly (perhaps not using 2. and 3. at all). But this is a much bigger task.
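
To illustrate how an enclosing timeout can cut failover short, here is a rough sketch under a deliberately simplified model: every service resolution hangs for the full dns_resolver_timeout, and ldap_opt_timeout caps the whole sequence. The function, the host names, and the 6/6 values are illustrative assumptions, not SSSD code or its actual defaults.

```python
def servers_attempted(servers, dns_resolver_timeout, ldap_opt_timeout):
    """Return the servers whose resolution starts before the outer timer fires.

    Simplified model: every resolution hangs until dns_resolver_timeout,
    and ldap_opt_timeout caps the whole sequence.
    """
    attempted = []
    elapsed = 0
    for server in servers:
        if elapsed >= ldap_opt_timeout:
            break  # the enclosing operation has already timed out
        attempted.append(server)
        elapsed += dns_resolver_timeout
    return attempted


# Hypothetical host names, used only for the example.
servers = ["dc1.example.com", "dc2.example.com", "dc3.example.com"]

# Equal inner and outer timeouts: the second server never gets a chance.
print(servers_attempted(servers, 6, 6))   # ['dc1.example.com']

# The 18/24 second values from this PR: the second server is started.
print(servers_attempted(servers, 18, 24))  # ['dc1.example.com', 'dc2.example.com']
```

In this model the outer timeout has to exceed the inner one by a clear margin before a second server is even started, which is why the values cannot be tuned independently.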

@sumit-bose

Contributor

commented May 9, 2019

Hi,

my suggestion would be to lower the "lower level" timeouts and to do this only for the AD provider for a start.

The reasoning is that imo we can assume that an AD environment is "sufficiently well maintained" that there should be no need to wait more than 1s for a reply from a DNS server.

RESOLV_TIMEOUTMS, which is currently hardcoded to 2000 ms (2s), is actually the time SSSD will wait for a reply from a single DNS server. After that the next DNS server in the list is tried. It would make sense to make this configurable and lower it for the AD provider to 1000 (1s) or even a bit lower.

If we say we want to at least have the chance to try 3 DNS servers, dns_resolver_op_timeout can be 3s. (Btw, the man page entry for dns_resolver_op_timeout has to be fixed because it says "How long would SSSD talk to a single DNS server." but it is, as you said above, the "timeout for single dns query").
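
The interaction between the per-server wait and dns_resolver_op_timeout can be sketched as a walk over the DNS server list. This is a simplified model, not the c-ares implementation; the function name and the latency values are assumptions for illustration, with `None` standing for a server that never replies.

```python
from typing import Optional, Sequence


def time_to_answer_ms(
    latencies_ms: Sequence[Optional[int]],
    server_timeout_ms: int,
    op_timeout_ms: int,
) -> Optional[int]:
    """Walk the DNS servers in order and return the elapsed time in ms
    when an answer arrives, or None if the query timeout is hit first."""
    elapsed = 0
    for latency in latencies_ms:
        if latency is not None and latency <= server_timeout_ms:
            elapsed += latency
            return elapsed if elapsed <= op_timeout_ms else None
        # No reply in time: give up on this server and try the next one.
        elapsed += server_timeout_ms
        if elapsed >= op_timeout_ms:
            return None
    return None


# Two silent servers followed by one that answers in 50 ms.
dns_servers = [None, None, 50]

print(time_to_answer_ms(dns_servers, 1000, 3000))  # 2050
print(time_to_answer_ms(dns_servers, 2000, 3000))  # None
```

With a 1s per-server wait the third server is reached inside a 3s query timeout; with the current 2s wait the query gives up before getting there.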

I would try to avoid touching ldap_opt_timeout since, as you said, it is used in various places. Maybe a dedicated failover timeout can be added for sdap_id_op.

@jhrozek mentioned that it might be possible to keep the c-ares state around so that when there was a timeout talking to one DNS server, the state with the next server which replied in time is used for upcoming requests as well. Currently a new DNS request will start with the same server which showed the timeout, so for sequences like looking up a SRV record and then resolving a single host from the returned list we have to wait twice for the first DNS server not replying. But since this is an improvement which is not strictly related to the default timeout values, this can of course be solved in a different ticket/PR.
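
The cost of waiting twice can be sketched with a small model of an SRV lookup followed by a host lookup. This does not use the c-ares API; the functions and the `remember` flag are hypothetical and only model the behaviour described above, with `None` standing for a DNS server that never replies.

```python
def lookup(latencies_ms, server_timeout_ms, start=0):
    """Try servers from index `start`; return (elapsed_ms, answering_index).

    answering_index is None when no server replies in time.
    """
    elapsed = 0
    for index in range(start, len(latencies_ms)):
        latency = latencies_ms[index]
        if latency is not None and latency <= server_timeout_ms:
            return elapsed + latency, index
        elapsed += server_timeout_ms
    return elapsed, None


def srv_then_host(latencies_ms, server_timeout_ms, remember):
    """Total time in ms for an SRV lookup followed by a host lookup."""
    srv_elapsed, answered_by = lookup(latencies_ms, server_timeout_ms)
    # With `remember`, the second lookup starts at the server that replied.
    start = answered_by if remember and answered_by is not None else 0
    host_elapsed, _ = lookup(latencies_ms, server_timeout_ms, start)
    return srv_elapsed + host_elapsed


# First DNS server is silent, second answers in 50 ms.
dns_servers = [None, 50]

print(srv_then_host(dns_servers, 1000, remember=False))  # 2100
print(srv_then_host(dns_servers, 1000, remember=True))   # 1100
```

In this model, remembering the responsive server saves one full per-server timeout for every follow-up query.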

HTH

bye,
Sumit

@pbrezina pbrezina force-pushed the pbrezina:timeouts branch from 6beb8a4 to 9c3eac5 Jun 11, 2019

@pbrezina

Member Author

commented Jun 11, 2019

Ok, how about now? It adds a new option, dns_resolver_server_timeout (in ms), that replaces the hardcoded RESOLV_TIMEOUTMS. The current defaults are:

  • dns_resolver_server_timeout: 1000ms
  • dns_resolver_op_timeout: 2s
  • dns_resolver_timeout: 4s
  • ldap_opt_timeout: 8s

I had to increase ldap_opt_timeout for the values to make any sense; otherwise it would not be possible to try resolving the next hostname/discovery domain if dns_resolver_timeout fails.

fo_resolve_service_send() (dns_resolver_timeout) first resolves SRV if needed, then resolves the hostname (both bounded by dns_resolver_op_timeout). So I think if we could remember the responsive DNS server in the resolver state, we could set dns_resolver_op_timeout = dns_resolver_timeout, set it to, say, three seconds, and keep ldap_opt_timeout = 6.
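
As a quick sanity check on how the layers nest, the sketch below computes how many inner timeouts fit into each enclosing one, for the defaults listed above and for the alternative just described. The helper is hypothetical, and keeping dns_resolver_server_timeout at 1000 ms in the alternative is an assumption, since the comment does not say.

```python
# Proposed defaults from this PR, innermost layer first, in milliseconds.
PROPOSED_MS = [
    ("dns_resolver_server_timeout", 1000),
    ("dns_resolver_op_timeout", 2000),
    ("dns_resolver_timeout", 4000),
    ("ldap_opt_timeout", 8000),
]

# Alternative if the responsive DNS server were remembered.
# The 1000 ms per-server value is assumed to stay unchanged.
ALTERNATIVE_MS = [
    ("dns_resolver_server_timeout", 1000),
    ("dns_resolver_op_timeout", 3000),
    ("dns_resolver_timeout", 3000),
    ("ldap_opt_timeout", 6000),
]


def nesting_ratios(layers):
    """For each adjacent (inner, outer) pair, return how many inner
    timeouts fit into the outer one."""
    return {
        f"{outer} / {inner}": outer_ms // inner_ms
        for (inner, inner_ms), (outer, outer_ms) in zip(layers, layers[1:])
    }


print(list(nesting_ratios(PROPOSED_MS).values()))     # [2, 2, 2]
print(list(nesting_ratios(ALTERNATIVE_MS).values()))  # [3, 1, 2]
```

A ratio of 1 between dns_resolver_timeout and dns_resolver_op_timeout is only workable if the SRV and hostname queries no longer both wait on an unresponsive DNS server, which is the precondition stated above.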
