Requests queued during transition from offline to online mode #3397

sssd-bot · 2020-05-02T13:04:32Z

Cloned from Pagure issue: https://pagure.io/SSSD/sssd/issue/2355

Created at 2014-06-06 14:13:41 by endzone
Closed as Fixed
Assigned to mzidek
Associated bugzillas
- https://bugzilla.redhat.com/show_bug.cgi?id=1110226

With "cache_credentials = true", users may experience periodic blocking of up to 6 seconds when SSSD tries to switch from offline to online mode while the LDAP server is unreachable.

The first request to SSSD after it has been offline for more than 60 seconds is answered immediately from the cache, but then triggers a reconnection attempt to the LDAP server.

All subsequent requests reaching SSSD during the connection phase are queued and answered once the connection succeeds or fails. If the LDAP server is unreachable, SSSD waits for 6 seconds before the connection attempt is aborted. This means that, in the worst case, the user may experience a delay of up to 6 seconds every 60 seconds.

See the following debug logs where the LDAP server is not responding, starting off in offline mode:

The first request to SSSD (which triggers the reconnection attempt) is answered right away in offline mode:

(Tue Jun 3 08:16:42 2014) [sssd[be[default]]] [be_get_account_info]
(0x0100): Got request for [4097][1][idnumber=10011]
(Tue Jun 3 08:16:42 2014) [sssd[be[default]]] [be_get_account_info]
(0x0100): Request processed. Returned 1,11,Fast reply - offline
(Tue Jun 3 08:16:42 2014) [sssd[be[default]]] [sdap_id_op_connect_step]
(0x4000): beginning to connect
...
SSSD is now trying to reconnect to the LDAP server.

Only subsequent requests, received while SSSD is trying to (re-)connect to the LDAP server, are queued until the connection attempt times out (at most 6 seconds). These pending requests cause the system to block:

...

(Tue Jun 3 08:16:42 2014) [sssd[be[default]]] [be_get_account_info]
(0x0100): Got request for [4097][1][name=brauchle]
(Tue Jun 3 08:16:42 2014) [sssd[be[default]]] [sdap_id_op_connect_step]
(0x4000): waiting for connection to complete
...
(Tue Jun 3 08:16:42 2014) [sssd[be[default]]] [be_get_account_info]
(0x0100): Got request for [4097][1][idnumber=10011]
(Tue Jun 3 08:16:42 2014) [sssd[be[default]]] [sdap_id_op_connect_step]
(0x4000): waiting for connection to complete
...

--> this is the time where the system may be unresponsive for 6 seconds <--

...

(Tue Jun 3 08:16:48 2014) [sssd[be[default]]] [sdap_id_op_connect_done]
(0x0020): Failed to connect, going offline (5 [Input/output error])
(Tue Jun 3 08:16:48 2014) [sssd[be[default]]] [be_mark_offline]
(0x2000): Going offline!
(Tue Jun 3 08:16:48 2014) [sssd[be[default]]] [be_run_offline_cb]
(0x0080): Going offline. Running callbacks.
(Tue Jun 3 08:16:48 2014) [sssd[be[default]]] [sdap_id_op_connect_done]
(0x4000): notify offline to op #1
(Tue Jun 3 08:16:48 2014) [sssd[be[default]]] [sdap_id_op_connect_done]
(0x4000): notify offline to op #2
(Tue Jun 3 08:16:48 2014) [sssd[be[default]]] [acctinfo_callback]
(0x0100): Request processed. Returned 1,11,Offline
(Tue Jun 3 08:16:48 2014) [sssd[be[default]]] [sdap_id_op_connect_done]
(0x4000): notify offline to op #3
(Tue Jun 3 08:16:48 2014) [sssd[be[default]]] [acctinfo_callback]
(0x0100): Request processed. Returned 1,11,Offline

After the connection attempt times out, the queued requests are answered with cached entries.

So why not keep the "offline" flag set to "true" until the LDAP connection attempt returns (successfully or not), and switch to online mode only if it succeeded?

As the first request (triggering the reconnection) is answered from the cache anyway, there is no point in keeping the subsequent ones pending until the connection is established successfully.

Possibly the startup phase (with cold caches) needs to be treated as a special case, in which incoming requests actually are queued?
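
To make the proposal concrete, here is a minimal sketch of the suggested behavior as a small state machine. It is not SSSD code: the `Backend` class, the `State` enum, the `lookup` and `connect_done` methods, and the dictionary-based cache and server are all hypothetical names chosen for illustration. It also assumes the offline period has already expired, so the first request starts a reconnection attempt.

```python
from dataclasses import dataclass, field
from enum import Enum


class State(Enum):
    OFFLINE = "offline"
    CONNECTING = "connecting"
    ONLINE = "online"


@dataclass
class Backend:
    """Toy model of a data provider; every name here is hypothetical."""
    cache: dict
    server: dict
    state: State = State.OFFLINE
    pending: list = field(default_factory=list)

    def lookup(self, key):
        """Return a reply right away; None means no entry or request queued."""
        if self.state is State.ONLINE:
            return self.server.get(key)
        if self.state is State.OFFLINE:
            # Fast reply from the cache, then start the reconnection attempt.
            self.state = State.CONNECTING
            return self.cache.get(key)
        # CONNECTING: the proposal is to keep answering as if still offline.
        if key in self.cache:
            return self.cache[key]
        # Cold cache: nothing to answer with, so queue until the attempt ends.
        self.pending.append(key)
        return None

    def connect_done(self, success):
        """Finish the attempt and answer every queued request."""
        self.state = State.ONLINE if success else State.OFFLINE
        source = self.server if success else self.cache
        replies = {key: source.get(key) for key in self.pending}
        self.pending.clear()
        return replies
```

In this model, a lookup for an already cached entry never waits on the connection attempt. Only lookups that miss the cache are held back until `connect_done` is called, which corresponds to the cold-cache special case mentioned above.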

Comments

Comment from endzone at 2014-06-06 14:19:32

Repost of the log files in readable format:

The *first* request to SSSD, answered from cache. Triggers reconnect afterwards:
>> (Tue Jun  3 08:16:42 2014) [sssd[be[default]]] [be_get_account_info]
>> (0x0100): Got request for [4097][1][idnumber=10011]
>> (Tue Jun  3 08:16:42 2014) [sssd[be[default]]] [be_get_account_info]
>> (0x0100): Request processed. Returned 1,11,Fast reply - offline
>> (Tue Jun  3 08:16:42 2014) [sssd[be[default]]] [sdap_id_op_connect_step]
>> (0x4000): beginning to connect
...
SSSD is now trying to reconnect to the LDAP server.

*Subsequent* requests are queued until the connection times out.
These pending requests are causing the system to block:

...
>> (Tue Jun  3 08:16:42 2014) [sssd[be[default]]] [be_get_account_info]
>> (0x0100): Got request for [4097][1][name=brauchle]
>> (Tue Jun  3 08:16:42 2014) [sssd[be[default]]] [sdap_id_op_connect_step]
>> (0x4000): waiting for connection to complete
...
>> (Tue Jun  3 08:16:42 2014) [sssd[be[default]]] [be_get_account_info]
>> (0x0100): Got request for [4097][1][idnumber=10011]
>> (Tue Jun  3 08:16:42 2014) [sssd[be[default]]] [sdap_id_op_connect_step]
>> (0x4000): waiting for connection to complete
...

--> this is the time where the system may be unresponsive for 6 seconds <--

...
>> (Tue Jun  3 08:16:48 2014) [sssd[be[default]]] [sdap_id_op_connect_done]
>> (0x0020): Failed to connect, going offline (5 [Input/output error])
>> (Tue Jun  3 08:16:48 2014) [sssd[be[default]]] [be_mark_offline]
>> (0x2000): Going offline!
>> (Tue Jun  3 08:16:48 2014) [sssd[be[default]]] [be_run_offline_cb]
>> (0x0080): Going offline. Running callbacks.
>> (Tue Jun  3 08:16:48 2014) [sssd[be[default]]] [sdap_id_op_connect_done]
>> (0x4000): notify offline to op #1
>> (Tue Jun  3 08:16:48 2014) [sssd[be[default]]] [sdap_id_op_connect_done]
>> (0x4000): notify offline to op #2
>> (Tue Jun  3 08:16:48 2014) [sssd[be[default]]] [acctinfo_callback]
>> (0x0100): Request processed. Returned 1,11,Offline
>> (Tue Jun  3 08:16:48 2014) [sssd[be[default]]] [sdap_id_op_connect_done]
>> (0x4000): notify offline to op #3
>> (Tue Jun  3 08:16:48 2014) [sssd[be[default]]] [acctinfo_callback]
>> (0x0100): Request processed. Returned 1,11,Offline

Comment from sbose at 2014-06-12 17:04:28

Fields changed

milestone: NEEDS_TRIAGE => SSSD 1.11.7

Comment from jhrozek at 2014-06-17 11:28:32

Ticket has been cloned to Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=1110226

rhbz: => [https://bugzilla.redhat.com/show_bug.cgi?id=1110226 1110226]

Comment from mzidek at 2014-06-18 14:04:46

Fields changed

owner: somebody => mzidek

Comment from mzidek at 2014-07-03 14:29:44

Fields changed

patch: 0 => 1