Invalid failover logic in ldap ID provider #1785

sssd-bot · 2020-05-02T12:16:07Z

Cloned from Pagure issue: https://pagure.io/SSSD/sssd/issue/743

Created at 2010-12-18 20:26:11 by dpal
Closed as Invalid
Assigned to sbose

Current logic forces LDAP provider to retry same server is just had a communication error with.

(Sat Dec 18 11:20:03 2010) [sssd[be[foobar.com]]] [acctinfo_callback] (4): Request processed. Returned 0,0,Success
(Sat Dec 18 11:20:03 2010) [sssd[be[foobar.com]]] [sdap_process_result] (8): Trace: sh[0x936a158], connected[1], ops[(nil)], ldap[0x9aeded0]
(Sat Dec 18 11:20:03 2010) [sssd[be[foobar.com]]] [sdap_process_result] (8): Trace: ldap_result found nothing!

====> resuming on a completely new network


(Sat Dec 18 13:35:40 2010) [sssd[be[foobar.com]]] [sbus_dispatch] (9): dbus conn: 9366638
(Sat Dec 18 13:35:40 2010) [sssd[be[foobar.com]]] [sbus_dispatch] (9): Dispatching.


===> First request

(Sat Dec 18 13:35:40 2010) [sssd[be[foobar.com]]] [sbus_message_handler] (9): Received SBUS method [getAccountInfo]
(Sat Dec 18 13:35:40 2010) [sssd[be[foobar.com]]] [be_get_account_info] (4): Got request for [4099][1][name=postfix]
(Sat Dec 18 13:35:40 2010) [sssd[be[foobar.com]]] [sdap_id_op_connect_step] (9): reusing cached connection
(Sat Dec 18 13:35:40 2010) [sssd[be[foobar.com]]] [sdap_get_initgr_send] (9): Retrieving info for initgroups call
(Sat Dec 18 13:35:40 2010) [sssd[be[foobar.com]]] [sdap_get_generic_send] (6): calling ldap_search_ext with [(&(uid=postfix)(objectclass=posixAccount))][ou=Users,dc=foobar,dc=com].
(Sat Dec 18 13:35:40 2010) [sssd[be[foobar.com]]] [sdap_get_generic_send] (7): Requesting attrs: [objectClass]
(Sat Dec 18 13:35:40 2010) [sssd[be[foobar.com]]] [sdap_get_generic_send] (7): Requesting attrs: [uid]
(Sat Dec 18 13:35:40 2010) [sssd[be[foobar.com]]] [sdap_get_generic_send] (7): Requesting attrs: [userPassword]
(Sat Dec 18 13:35:40 2010) [sssd[be[foobar.com]]] [sdap_get_generic_send] (7): Requesting attrs: [uidNumber]
(Sat Dec 18 13:35:40 2010) [sssd[be[foobar.com]]] [sdap_get_generic_send] (7): Requesting attrs: [gidNumber]
(Sat Dec 18 13:35:40 2010) [sssd[be[foobar.com]]] [sdap_get_generic_send] (7): Requesting attrs: [gecos]
(Sat Dec 18 13:35:40 2010) [sssd[be[foobar.com]]] [sdap_get_generic_send] (7): Requesting attrs: [homeDirectory]
(Sat Dec 18 13:35:40 2010) [sssd[be[foobar.com]]] [sdap_get_generic_send] (7): Requesting attrs: [loginShell]
(Sat Dec 18 13:35:40 2010) [sssd[be[foobar.com]]] [sdap_get_generic_send] (7): Requesting attrs: [krbPrincipalName]
(Sat Dec 18 13:35:40 2010) [sssd[be[foobar.com]]] [sdap_get_generic_send] (7): Requesting attrs: [cn]
(Sat Dec 18 13:35:40 2010) [sssd[be[foobar.com]]] [sdap_get_generic_send] (7): Requesting attrs: [modifyTimestamp]
(Sat Dec 18 13:35:40 2010) [sssd[be[foobar.com]]] [sdap_get_generic_send] (7): Requesting attrs: [modifyTimestamp]
(Sat Dec 18 13:35:40 2010) [sssd[be[foobar.com]]] [sdap_get_generic_send] (7): Requesting attrs: [shadowLastChange]
(Sat Dec 18 13:35:40 2010) [sssd[be[foobar.com]]] [sdap_get_generic_send] (7): Requesting attrs: [shadowMin]
(Sat Dec 18 13:35:40 2010) [sssd[be[foobar.com]]] [sdap_get_generic_send] (7): Requesting attrs: [shadowMax]
(Sat Dec 18 13:35:40 2010) [sssd[be[foobar.com]]] [sdap_get_generic_send] (7): Requesting attrs: [shadowWarning]
(Sat Dec 18 13:35:40 2010) [sssd[be[foobar.com]]] [sdap_get_generic_send] (7): Requesting attrs: [shadowInactive]
(Sat Dec 18 13:35:40 2010) [sssd[be[foobar.com]]] [sdap_get_generic_send] (7): Requesting attrs: [shadowExpire]
(Sat Dec 18 13:35:40 2010) [sssd[be[foobar.com]]] [sdap_get_generic_send] (7): Requesting attrs: [shadowFlag]
(Sat Dec 18 13:35:40 2010) [sssd[be[foobar.com]]] [sdap_get_generic_send] (7): Requesting attrs: [krbLastPwdChange]
(Sat Dec 18 13:35:40 2010) [sssd[be[foobar.com]]] [sdap_get_generic_send] (7): Requesting attrs: [krbPasswordExpiration]
(Sat Dec 18 13:35:40 2010) [sssd[be[foobar.com]]] [sdap_get_generic_send] (7): Requesting attrs: [pwdAttribute]
(Sat Dec 18 13:35:40 2010) [sssd[be[foobar.com]]] [sdap_get_generic_send] (8): ldap_search_ext called, msgid = 48
(Sat Dec 18 13:35:41 2010) [sssd[be[foobar.com]]] [sbus_dispatch] (9): dbus conn: 9366638
(Sat Dec 18 13:35:41 2010) [sssd[be[foobar.com]]] [sbus_dispatch] (9): Dispatching.


===> Second request


(Sat Dec 18 13:35:41 2010) [sssd[be[foobar.com]]] [sbus_message_handler] (9): Received SBUS method [getAccountInfo]
(Sat Dec 18 13:35:41 2010) [sssd[be[foobar.com]]] [be_get_account_info] (4): Got request for [4097][1][idnumber=12785]
(Sat Dec 18 13:35:41 2010) [sssd[be[foobar.com]]] [sdap_id_op_connect_step] (9): reusing cached connection
(Sat Dec 18 13:35:41 2010) [sssd[be[foobar.com]]] [sdap_get_generic_send] (6): calling ldap_search_ext with [(&(uidNumber=12785)(objectclass=posixAccount))][ou=Users,dc=foobar,dc=com].
(Sat Dec 18 13:35:41 2010) [sssd[be[foobar.com]]] [sdap_get_generic_send] (7): Requesting attrs: [objectClass]
(Sat Dec 18 13:35:41 2010) [sssd[be[foobar.com]]] [sdap_get_generic_send] (7): Requesting attrs: [uid]
(Sat Dec 18 13:35:41 2010) [sssd[be[foobar.com]]] [sdap_get_generic_send] (7): Requesting attrs: [userPassword]
(Sat Dec 18 13:35:41 2010) [sssd[be[foobar.com]]] [sdap_get_generic_send] (7): Requesting attrs: [uidNumber]
(Sat Dec 18 13:35:41 2010) [sssd[be[foobar.com]]] [sdap_get_generic_send] (7): Requesting attrs: [gidNumber]
(Sat Dec 18 13:35:41 2010) [sssd[be[foobar.com]]] [sdap_get_generic_send] (7): Requesting attrs: [gecos]
(Sat Dec 18 13:35:41 2010) [sssd[be[foobar.com]]] [sdap_get_generic_send] (7): Requesting attrs: [homeDirectory]
(Sat Dec 18 13:35:41 2010) [sssd[be[foobar.com]]] [sdap_get_generic_send] (7): Requesting attrs: [loginShell]
(Sat Dec 18 13:35:41 2010) [sssd[be[foobar.com]]] [sdap_get_generic_send] (7): Requesting attrs: [krbPrincipalName]
(Sat Dec 18 13:35:41 2010) [sssd[be[foobar.com]]] [sdap_get_generic_send] (7): Requesting attrs: [cn]
(Sat Dec 18 13:35:41 2010) [sssd[be[foobar.com]]] [sdap_get_generic_send] (7): Requesting attrs: [modifyTimestamp]
(Sat Dec 18 13:35:41 2010) [sssd[be[foobar.com]]] [sdap_get_generic_send] (7): Requesting attrs: [modifyTimestamp]
(Sat Dec 18 13:35:41 2010) [sssd[be[foobar.com]]] [sdap_get_generic_send] (7): Requesting attrs: [shadowLastChange]
(Sat Dec 18 13:35:41 2010) [sssd[be[foobar.com]]] [sdap_get_generic_send] (7): Requesting attrs: [shadowMin]
(Sat Dec 18 13:35:41 2010) [sssd[be[foobar.com]]] [sdap_get_generic_send] (7): Requesting attrs: [shadowMax]
(Sat Dec 18 13:35:41 2010) [sssd[be[foobar.com]]] [sdap_get_generic_send] (7): Requesting attrs: [shadowWarning]
(Sat Dec 18 13:35:41 2010) [sssd[be[foobar.com]]] [sdap_get_generic_send] (7): Requesting attrs: [shadowInactive]
(Sat Dec 18 13:35:41 2010) [sssd[be[foobar.com]]] [sdap_get_generic_send] (7): Requesting attrs: [shadowExpire]
(Sat Dec 18 13:35:41 2010) [sssd[be[foobar.com]]] [sdap_get_generic_send] (7): Requesting attrs: [shadowFlag]
(Sat Dec 18 13:35:41 2010) [sssd[be[foobar.com]]] [sdap_get_generic_send] (7): Requesting attrs: [krbLastPwdChange]
(Sat Dec 18 13:35:41 2010) [sssd[be[foobar.com]]] [sdap_get_generic_send] (7): Requesting attrs: [krbPasswordExpiration]
(Sat Dec 18 13:35:41 2010) [sssd[be[foobar.com]]] [sdap_get_generic_send] (7): Requesting attrs: [pwdAttribute]
(Sat Dec 18 13:35:41 2010) [sssd[be[foobar.com]]] [sdap_get_generic_send] (8): ldap_search_ext called, msgid = 49
(Sat Dec 18 13:35:41 2010) [sssd[be[foobar.com]]] [sbus_dispatch] (9): dbus conn: 9366638
(Sat Dec 18 13:35:41 2010) [sssd[be[foobar.com]]] [sbus_dispatch] (9): Dispatching.



===> Third request



(Sat Dec 18 13:35:41 2010) [sssd[be[foobar.com]]] [sbus_message_handler] (9): Received SBUS method [getAccountInfo]
(Sat Dec 18 13:35:41 2010) [sssd[be[foobar.com]]] [be_get_account_info] (4): Got request for [4097][1][name=dpal]
(Sat Dec 18 13:35:41 2010) [sssd[be[foobar.com]]] [sdap_id_op_connect_step] (9): reusing cached connection
(Sat Dec 18 13:35:41 2010) [sssd[be[foobar.com]]] [sdap_get_generic_send] (6): calling ldap_search_ext with [(&(uid=dpal)(objectclass=posixAccount))][ou=Users,dc=foobar,dc=com].
(Sat Dec 18 13:35:41 2010) [sssd[be[foobar.com]]] [sdap_get_generic_send] (7): Requesting attrs: [objectClass]
(Sat Dec 18 13:35:41 2010) [sssd[be[foobar.com]]] [sdap_get_generic_send] (7): Requesting attrs: [uid]
(Sat Dec 18 13:35:41 2010) [sssd[be[foobar.com]]] [sdap_get_generic_send] (7): Requesting attrs: [userPassword]
(Sat Dec 18 13:35:41 2010) [sssd[be[foobar.com]]] [sdap_get_generic_send] (7): Requesting attrs: [uidNumber]
(Sat Dec 18 13:35:41 2010) [sssd[be[foobar.com]]] [sdap_get_generic_send] (7): Requesting attrs: [gidNumber]
(Sat Dec 18 13:35:41 2010) [sssd[be[foobar.com]]] [sdap_get_generic_send] (7): Requesting attrs: [gecos]
(Sat Dec 18 13:35:41 2010) [sssd[be[foobar.com]]] [sdap_get_generic_send] (7): Requesting attrs: [homeDirectory]
(Sat Dec 18 13:35:41 2010) [sssd[be[foobar.com]]] [sdap_get_generic_send] (7): Requesting attrs: [loginShell]
(Sat Dec 18 13:35:41 2010) [sssd[be[foobar.com]]] [sdap_get_generic_send] (7): Requesting attrs: [krbPrincipalName]
(Sat Dec 18 13:35:41 2010) [sssd[be[foobar.com]]] [sdap_get_generic_send] (7): Requesting attrs: [cn]
(Sat Dec 18 13:35:41 2010) [sssd[be[foobar.com]]] [sdap_get_generic_send] (7): Requesting attrs: [modifyTimestamp]
(Sat Dec 18 13:35:41 2010) [sssd[be[foobar.com]]] [sdap_get_generic_send] (7): Requesting attrs: [modifyTimestamp]
(Sat Dec 18 13:35:41 2010) [sssd[be[foobar.com]]] [sdap_get_generic_send] (7): Requesting attrs: [shadowLastChange]
(Sat Dec 18 13:35:41 2010) [sssd[be[foobar.com]]] [sdap_get_generic_send] (7): Requesting attrs: [shadowMin]
(Sat Dec 18 13:35:41 2010) [sssd[be[foobar.com]]] [sdap_get_generic_send] (7): Requesting attrs: [shadowMax]
(Sat Dec 18 13:35:41 2010) [sssd[be[foobar.com]]] [sdap_get_generic_send] (7): Requesting attrs: [shadowWarning]
(Sat Dec 18 13:35:41 2010) [sssd[be[foobar.com]]] [sdap_get_generic_send] (7): Requesting attrs: [shadowInactive]
(Sat Dec 18 13:35:41 2010) [sssd[be[foobar.com]]] [sdap_get_generic_send] (7): Requesting attrs: [shadowExpire]
(Sat Dec 18 13:35:41 2010) [sssd[be[foobar.com]]] [sdap_get_generic_send] (7): Requesting attrs: [shadowFlag]
(Sat Dec 18 13:35:41 2010) [sssd[be[foobar.com]]] [sdap_get_generic_send] (7): Requesting attrs: [krbLastPwdChange]
(Sat Dec 18 13:35:41 2010) [sssd[be[foobar.com]]] [sdap_get_generic_send] (7): Requesting attrs: [krbPasswordExpiration]
(Sat Dec 18 13:35:41 2010) [sssd[be[foobar.com]]] [sdap_get_generic_send] (7): Requesting attrs: [pwdAttribute]
(Sat Dec 18 13:35:41 2010) [sssd[be[foobar.com]]] [sdap_get_generic_send] (8): ldap_search_ext called, msgid = 50
(Sat Dec 18 13:36:34 2010) [sssd[be[foobar.com]]] [sbus_dispatch] (9): dbus conn: 9366638
(Sat Dec 18 13:36:34 2010) [sssd[be[foobar.com]]] [sbus_dispatch] (9): Dispatching.


===> Fourth request



(Sat Dec 18 13:36:34 2010) [sssd[be[foobar.com]]] [sbus_message_handler] (9): Received SBUS method [getAccountInfo]
(Sat Dec 18 13:36:34 2010) [sssd[be[foobar.com]]] [be_get_account_info] (4): Got request for [4099][1][name=ntp]
(Sat Dec 18 13:36:34 2010) [sssd[be[foobar.com]]] [sdap_id_op_connect_step] (9): reusing cached connection
(Sat Dec 18 13:36:34 2010) [sssd[be[foobar.com]]] [sdap_get_initgr_send] (9): Retrieving info for initgroups call
(Sat Dec 18 13:36:34 2010) [sssd[be[foobar.com]]] [sdap_get_generic_send] (6): calling ldap_search_ext with [(&(uid=ntp)(objectclass=posixAccount))][ou=Users,dc=foobar,dc=com].
(Sat Dec 18 13:36:34 2010) [sssd[be[foobar.com]]] [sdap_get_generic_send] (7): Requesting attrs: [objectClass]
(Sat Dec 18 13:36:34 2010) [sssd[be[foobar.com]]] [sdap_get_generic_send] (7): Requesting attrs: [uid]
(Sat Dec 18 13:36:34 2010) [sssd[be[foobar.com]]] [sdap_get_generic_send] (7): Requesting attrs: [userPassword]
(Sat Dec 18 13:36:34 2010) [sssd[be[foobar.com]]] [sdap_get_generic_send] (7): Requesting attrs: [uidNumber]
(Sat Dec 18 13:36:34 2010) [sssd[be[foobar.com]]] [sdap_get_generic_send] (7): Requesting attrs: [gidNumber]
(Sat Dec 18 13:36:34 2010) [sssd[be[foobar.com]]] [sdap_get_generic_send] (7): Requesting attrs: [gecos]
(Sat Dec 18 13:36:34 2010) [sssd[be[foobar.com]]] [sdap_get_generic_send] (7): Requesting attrs: [homeDirectory]
(Sat Dec 18 13:36:34 2010) [sssd[be[foobar.com]]] [sdap_get_generic_send] (7): Requesting attrs: [loginShell]
(Sat Dec 18 13:36:34 2010) [sssd[be[foobar.com]]] [sdap_get_generic_send] (7): Requesting attrs: [krbPrincipalName]
(Sat Dec 18 13:36:34 2010) [sssd[be[foobar.com]]] [sdap_get_generic_send] (7): Requesting attrs: [cn]
(Sat Dec 18 13:36:34 2010) [sssd[be[foobar.com]]] [sdap_get_generic_send] (7): Requesting attrs: [modifyTimestamp]
(Sat Dec 18 13:36:34 2010) [sssd[be[foobar.com]]] [sdap_get_generic_send] (7): Requesting attrs: [modifyTimestamp]
(Sat Dec 18 13:36:34 2010) [sssd[be[foobar.com]]] [sdap_get_generic_send] (7): Requesting attrs: [shadowLastChange]
(Sat Dec 18 13:36:34 2010) [sssd[be[foobar.com]]] [sdap_get_generic_send] (7): Requesting attrs: [shadowMin]
(Sat Dec 18 13:36:34 2010) [sssd[be[foobar.com]]] [sdap_get_generic_send] (7): Requesting attrs: [shadowMax]
(Sat Dec 18 13:36:34 2010) [sssd[be[foobar.com]]] [sdap_get_generic_send] (7): Requesting attrs: [shadowWarning]
(Sat Dec 18 13:36:34 2010) [sssd[be[foobar.com]]] [sdap_get_generic_send] (7): Requesting attrs: [shadowInactive]
(Sat Dec 18 13:36:34 2010) [sssd[be[foobar.com]]] [sdap_get_generic_send] (7): Requesting attrs: [shadowExpire]
(Sat Dec 18 13:36:34 2010) [sssd[be[foobar.com]]] [sdap_get_generic_send] (7): Requesting attrs: [shadowFlag]
(Sat Dec 18 13:36:34 2010) [sssd[be[foobar.com]]] [sdap_get_generic_send] (7): Requesting attrs: [krbLastPwdChange]
(Sat Dec 18 13:36:34 2010) [sssd[be[foobar.com]]] [sdap_get_generic_send] (7): Requesting attrs: [krbPasswordExpiration]
(Sat Dec 18 13:36:34 2010) [sssd[be[foobar.com]]] [sdap_get_generic_send] (7): Requesting attrs: [pwdAttribute]
(Sat Dec 18 13:36:34 2010) [sssd[be[foobar.com]]] [sdap_get_generic_send] (8): ldap_search_ext called, msgid = 51


=====> First response (timeout ofter 2 minutes, why two? Is this a bug in LDAP library?)
So it tries a different server...

(Sat Dec 18 13:36:41 2010) [sssd[be[foobar.com]]] [sdap_get_initgr_user] (9): Receiving info for the user
(Sat Dec 18 13:36:41 2010) [sssd[be[foobar.com]]] [sdap_id_op_done] (5): communication error on cached connection, moving to next server

=====> Here... Why it does not go offline right away??? Why it retries the same server that we just had a communication error with?
It looks like it is the only server configured for me.
From sssd.conf file:

[domain/foobar.com]
id_provider = ldap
ldap_uri = ldap://ldap.foobar.com
ldap_user_search_base = ou=Users,dc=foobar,dc=com
ldap_group_search_base = ou=Groups,dc=foobar,dc=com
enumerate = false


I looked in the code in the failover. I think it should in some way pass down information about which server we failed with.
It looks like it just starts over in this case and picks the same server. Instead it should detect that the fail over list
is empty (since I have only one LDAP server configured) and say that we are offline).


(Sat Dec 18 13:36:41 2010) [sssd[be[foobar.com]]] [sdap_id_op_done] (9): advising for connection retry #1


=====> IMO this is definitely a bug to advise to retry connection of the server we just timed out on.



(Sat Dec 18 13:36:41 2010) [sssd[be[foobar.com]]] [sdap_id_op_done] (9): releasing operation connection
(Sat Dec 18 13:36:41 2010) [sssd[be[foobar.com]]] [sdap_id_op_connect_step] (9): beginning to connect
(Sat Dec 18 13:36:41 2010) [sssd[be[foobar.com]]] [fo_resolve_service_send] (4): Trying to resolve service 'LDAP'
(Sat Dec 18 13:36:41 2010) [sssd[be[foobar.com]]] [get_server_status] (7): Status of server 'ldap.foobar.com' is 'working'
(Sat Dec 18 13:36:41 2010) [sssd[be[foobar.com]]] [get_port_status] (7): Port status of port 389 for server 'ldap.foobar.com' is 'neutral'
(Sat Dec 18 13:36:41 2010) [sssd[be[foobar.com]]] [get_server_status] (7): Status of server 'ldap.foobar.com' is 'working'
(Sat Dec 18 13:36:41 2010) [sssd[be[foobar.com]]] [be_resolve_server_done] (4): Found address for server ldap.foobar.com: [xxx.xxx.xxx.xxx]
(Sat Dec 18 13:36:41 2010) [sssd[be[foobar.com]]] [setup_ldap_connection_callbacks] (9): LDAP connection callbacks are not supported.
(Sat Dec 18 13:36:41 2010) [sssd[be[foobar.com]]] [sdap_get_rootdse_send] (9): Getting rootdse
(Sat Dec 18 13:36:41 2010) [sssd[be[foobar.com]]] [sdap_get_generic_send] (6): calling ldap_search_ext with [(objectclass=*)][].
(Sat Dec 18 13:36:41 2010) [sssd[be[foobar.com]]] [sdap_get_generic_send] (7): Requesting attrs: [*]
(Sat Dec 18 13:36:41 2010) [sssd[be[foobar.com]]] [sdap_get_generic_send] (7): Requesting attrs: [altServer]
(Sat Dec 18 13:36:41 2010) [sssd[be[foobar.com]]] [sdap_get_generic_send] (7): Requesting attrs: [namingContexts]
(Sat Dec 18 13:36:41 2010) [sssd[be[foobar.com]]] [sdap_get_generic_send] (7): Requesting attrs: [supportedControl]
(Sat Dec 18 13:36:41 2010) [sssd[be[foobar.com]]] [sdap_get_generic_send] (7): Requesting attrs: [supportedExtension]
(Sat Dec 18 13:36:41 2010) [sssd[be[foobar.com]]] [sdap_get_generic_send] (7): Requesting attrs: [supportedFeatures]
(Sat Dec 18 13:36:41 2010) [sssd[be[foobar.com]]] [sdap_get_generic_send] (7): Requesting attrs: [supportedLDAPVersion]
(Sat Dec 18 13:36:41 2010) [sssd[be[foobar.com]]] [sdap_get_generic_send] (7): Requesting attrs: [supportedSASLMechanisms]
(Sat Dec 18 13:36:41 2010) [sssd[be[foobar.com]]] [sdap_get_generic_send] (7): Requesting attrs: [defaultNamingContext]
(Sat Dec 18 13:36:41 2010) [sssd[be[foobar.com]]] [sdap_get_generic_send] (7): Requesting attrs: [lastUSN]
(Sat Dec 18 13:36:41 2010) [sssd[be[foobar.com]]] [sdap_get_generic_send] (7): Requesting attrs: [highestCommittedUSN]


=====> Second failure after yet another 2 minutes. But it is probably message 51 timed out.
Here we have a send timeout and we handle it correctly and immediately go offline.


(Sat Dec 18 13:38:33 2010) [sssd[be[foobar.com]]] [sdap_get_generic_send] (3): ldap_search_ext failed: Can't contact LDAP server
(Sat Dec 18 13:38:33 2010) [sssd[be[foobar.com]]] [sdap_get_generic_send] (3): Connection error: (null)
(Sat Dec 18 13:38:33 2010) [sssd[be[foobar.com]]] [sdap_install_ldap_callbacks] (8): Trace: sh[0x9357910], connected[1], ops[(nil)], fde[0x9b06700], ldap[0x9b065a0]
(Sat Dec 18 13:38:33 2010) [sssd[be[foobar.com]]] [fo_set_port_status] (4): Marking port 389 of server 'ldap.foobar.com' as 'not working'
(Sat Dec 18 13:38:33 2010) [sssd[be[foobar.com]]] [sdap_handle_release] (8): Trace: sh[0x9357910], connected[1], ops[(nil)], ldap[0x9b065a0], destructor_lock[0], release_memory[0]
(Sat Dec 18 13:38:33 2010) [sssd[be[foobar.com]]] [fo_resolve_service_send] (4): Trying to resolve service 'LDAP'
(Sat Dec 18 13:38:33 2010) [sssd[be[foobar.com]]] [get_server_status] (7): Status of server 'ldap.foobar.com' is 'working'
(Sat Dec 18 13:38:33 2010) [sssd[be[foobar.com]]] [get_port_status] (7): Port status of port 389 for server 'ldap.foobar.com' is 'not working'
(Sat Dec 18 13:38:33 2010) [sssd[be[foobar.com]]] [fo_resolve_service_send] (1): No available servers for service 'LDAP'
(Sat Dec 18 13:38:33 2010) [sssd[be[foobar.com]]] [sdap_id_op_connect_done] (1): Failed to connect, going offline (5 [Input/output error])
(Sat Dec 18 13:38:33 2010) [sssd[be[foobar.com]]] [be_mark_offline] (8): Going offline!
(Sat Dec 18 13:38:33 2010) [sssd[be[foobar.com]]] [be_run_offline_cb] (3): Going offline. Running callbacks.
(Sat Dec 18 13:38:33 2010) [sssd[be[foobar.com]]] [sdap_id_op_connect_done] (9): notify offline to op #1
(Sat Dec 18 13:38:33 2010) [sssd[be[foobar.com]]] [acctinfo_callback] (4): Request processed. Returned 1,11,Offline
(Sat Dec 18 13:38:33 2010) [sssd[be[foobar.com]]] [sdap_id_release_conn_data] (9): releasing unused connection
(Sat Dec 18 13:38:33 2010) [sssd[be[foobar.com]]] [sdap_id_op_done] (9): falling back to offline data...
(Sat Dec 18 13:38:33 2010) [sssd[be[foobar.com]]] [sdap_id_op_done] (9): releasing operation connection
(Sat Dec 18 13:38:33 2010) [sssd[be[foobar.com]]] [acctinfo_callback] (4): Request processed. Returned 1,11,Offline
(Sat Dec 18 13:38:33 2010) [sssd[be[foobar.com]]] [sdap_id_op_done] (9): falling back to offline data...
(Sat Dec 18 13:38:33 2010) [sssd[be[foobar.com]]] [sdap_id_op_done] (9): releasing operation connection
(Sat Dec 18 13:38:33 2010) [sssd[be[foobar.com]]] [acctinfo_callback] (4): Request processed. Returned 1,11,Offline
(Sat Dec 18 13:38:33 2010) [sssd[be[foobar.com]]] [sdap_get_initgr_user] (9): Receiving info for the user
(Sat Dec 18 13:38:33 2010) [sssd[be[foobar.com]]] [sdap_id_op_done] (9): falling back to offline data...
(Sat Dec 18 13:38:33 2010) [sssd[be[foobar.com]]] [sdap_id_op_done] (9): releasing operation connection
(Sat Dec 18 13:38:33 2010) [sssd[be[foobar.com]]] [sdap_id_release_conn_data] (9): releasing unused connection
(Sat Dec 18 13:38:33 2010) [sssd[be[foobar.com]]] [sdap_handle_release] (8): Trace: sh[0x936a158], connected[1], ops[(nil)], ldap[0x9aeded0], destructor_lock[0], release_memory[0]
(Sat Dec 18 13:38:33 2010) [sssd[be[foobar.com]]] [acctinfo_callback] (4): Request processed. Returned 1,11,Offline
(Sat Dec 18 13:38:33 2010) [sssd[be[foobar.com]]] [sbus_dispatch] (9): dbus conn: 9366638
(Sat Dec 18 13:38:33 2010) [sssd[be[foobar.com]]] [sbus_dispatch] (9): Dispatching.
(Sat Dec 18 13:38:33 2010) [sssd[be[foobar.com]]] [sbus_message_handler] (9): Received SBUS method [getAccountInfo]
(Sat Dec 18 13:38:33 2010) [sssd[be[foobar.com]]] [be_get_account_info] (4): Got request for [4097][1][name=dpal]
(Sat Dec 18 13:38:33 2010) [sssd[be[foobar.com]]] [be_get_account_info] (4): Request processed. Returned 1,11,Fast reply - offline
(Sat Dec 18 13:38:33 2010) [sssd[be[foobar.com]]] [sbus_dispatch] (9): dbus conn: 9366638
(Sat Dec 18 13:38:33 2010) [sssd[be[foobar.com]]] [sbus_dispatch] (9): Dispatching.
(Sat Dec 18 13:38:33 2010) [sssd[be[foobar.com]]] [sbus_message_handler] (9): Received SBUS method [getAccountInfo]
(Sat Dec 18 13:38:33 2010) [sssd[be[foobar.com]]] [be_get_account_info] (4): Got request for [4097][1][name=dpal]
(Sat Dec 18 13:38:33 2010) [sssd[be[foobar.com]]] [be_get_account_info] (4): Request processed. Returned 1,11,Fast reply - offline
...

Comments

Comment from dpal at 2010-12-18 23:00:21

Couple design ideas:

The context should keep a list of the servers that it communicated since the start of the SSSD. This list is updated when the connection is lost and SSSD detected that it needs to try to fail over. This list actually set of key value pairs where the key is the name of the host and the value is the time stamp of the moment the connection was identified as broken. Currently the fail over code (I am not sure since I have not dived into the details but it seems to) iterates through the servers read from the configuration file. If instead the list suggested above is populated at the start of the SSSD and carried in the context the fail over code would not need (and I would argue should not) to consult the configuration. Instead it will look at the list and pick the server that it tried longest time ago. If this time is still less then some predefined period say 60 seconds (configurable) it will not try to fail over but rather return offline status right away. The set can be implemented as hash table. However collection might be better since we can use a pinned iterator and that would allow less processing (sorting, matching etc.) to detect the which server to connect to.
Currently the code assumes that if one server is not available then this is a server outage and some other server might be available, however this is usually not the case nowadays. If one of the services is not available it is most likely due to the network issues around client and it should be expected that the client is offline rather than one of the servers. I suggest that we add a new parameter into the configuration file that would be named something like "client_failover_mode" and would have the following values:[[BR]][[BR]]
- mobile - this would mean that the client is a mobile device or laptop. It will assume by default that if it can't connect to a server it is completely offline. However next time it needs to connect it will try a different server from the list
- server (default) - assume that it is a server outage not a client networking problem (current mode)
- vm - I feel that for VMs we would need to have a different set of assumptions so I suspect we would have to do something different from the cases above. This can be deferred but the point is that the configuration should not be a simple Boolean. I imagine cases when retrying the same server once again would be preferable.

Comment from sbose at 2010-12-20 13:16:49

I think I've found the possilbe reason a second server was tired, see 'Avoid multiple initializations in LDAP provider' on sssd-devel list.

Which OpenLDAP version are you using 2.3.x or 2.4.x?

Comment from sgallagh at 2010-12-21 15:15:24

Fields changed

component: SSSD => Failover
milestone: NEEDS_TRIAGE => SSSD 1.5.1
owner: somebody => sgallagh

Comment from sgallagh at 2011-01-04 15:30:10

Fields changed

owner: sgallagh => sbose

Comment from sgallagh at 2011-01-27 19:15:01

Fields changed

milestone: SSSD 1.5.1 => SSSD 1.5.2
upgrade: => 0