
SSSD AD Failover Failure #4345

Closed
sssd-bot opened this issue May 2, 2020 · 1 comment


sssd-bot commented May 2, 2020

Cloned from Pagure issue: https://pagure.io/SSSD/sssd/issue/3312

  • Created at 2017-02-20 11:32:58 by chrismwheeler
  • Assigned to nobody

I am trying to understand why our Red Hat Linux devices fail to fail over correctly when using SSSD authentication to AD: the logs show inaccessible servers being marked as 'working', and after a failure SSSD does not attempt to contact the server that is accessible.

My test setup, on IBM SoftLayer, involves a single Red Hat 6.x VM, authtest.testdom.local (IP 192.168.100.50), set up on a different VLAN to the AD servers, mgt-dc-001.testdom.local (192.168.200.133) and mgt-dc-002.testdom.local (192.168.200.134). The traffic is allowed or blocked by a network device, infra-vyatta.testdom.local, which allows me to simulate server failure. (Note - IPs and Hostnames have been changed for security)

I currently have logging at maximum, and have captured log files and network traffic while all communication with mgt-dc-001.testdom.local is blocked by the Vyatta device and all communication with mgt-dc-002.testdom.local is allowed.
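
For context, the failover-relevant part of the SSSD domain configuration has roughly the following shape (an illustrative sketch only, not the attached file; the actual option values are in the sssd.conf included in the debug archive):

[domain/testdom.local]
id_provider = ad
auth_provider = ad
ad_domain = testdom.local
# both domain controllers listed so that failover between them is possible
ad_server = mgt-dc-001.testdom.local, mgt-dc-002.testdom.local
# verbose debug output, matching the captured logs
debug_level = 10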

The attached logs show two events: the start of the sssd service at 10:53:11 and the log-on attempt at 10:54:21.

What I would like to understand is the following.

Why is port 88 of mgt-dc-001 marked as 'working', when the attached network capture clearly shows that it is not responding:

(Fri Feb 17 10:54:21 2017) [sssd[be[TESTDOM.LOCAL]]] [_be_fo_set_port_status] (0x8000): Setting status: PORT_WORKING. Called from: src/providers/krb5/krb5_auth.c: krb5_auth_done: 1039
(Fri Feb 17 10:54:21 2017) [sssd[be[TESTDOM.LOCAL]]] [fo_set_port_status] (0x0100): Marking port 88 of server 'mgt-dc-001.testdom.local' as 'working'
(Fri Feb 17 10:54:21 2017) [sssd[be[TESTDOM.LOCAL]]] [set_server_common_status] (0x0100): Marking server 'mgt-dc-001.testdom.local' as 'working'
(Fri Feb 17 10:54:21 2017) [sssd[be[TESTDOM.LOCAL]]] [fo_set_port_status] (0x0400): Marking port 88 of duplicate server 'mgt-dc-001.testdom.local' as 'working'

Because it is not responding, Kerberos drops into offline mode rather than trying mgt-dc-002.testdom.local:

(Fri Feb 17 10:54:21 2017) [sssd[be[TESTDOM.LOCAL]]] [krb5_auth_done] (0x0100): Backend is marked offline, retry later!

So it then tries to reset all services, as 'no server' is available:

(Fri Feb 17 10:54:48 2017) [sssd[be[TESTDOM.LOCAL]]] [fo_reset_services] (0x1000): Resetting all servers in all services

Whenever it checks port 88 on mgt-dc-001 it seems to mark it as working; it never reports checking mgt-dc-002, and it keeps falling back to offline authentication, which then fails.

Also, for the AD and LDAP services it frequently marks port 0 as not working, but if a port is specified it cannot resolve the name, because it attempts DNS resolution with the port appended to the hostname! I'm not sure whether this is related.

What we expect is for SSSD to detect that port 88 on mgt-dc-001.testdom.local is not working, mark it as 'not working', and then fail over to mgt-dc-002. This does happen, but only very occasionally. The network capture shows that port 88 traffic to mgt-dc-002 is answered while port 88 traffic to mgt-dc-001 is not, that after LDAP communication on port 389 to mgt-dc-001 fails it successfully connects to mgt-dc-002, and that DNS communication with mgt-dc-002 is succeeding.
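
As a cross-check on the Kerberos side, it is also possible to see which KDC SSSD is currently advertising to libkrb5 via its locator plugin (a diagnostic sketch, assuming the default pubconf path and the TESTDOM.LOCAL realm):

# KDC address last written by SSSD for the realm
cat /var/lib/sss/pubconf/kdcinfo.TESTDOM.LOCAL

# addresses of the two domain controllers, for comparison
getent hosts mgt-dc-001.testdom.local mgt-dc-002.testdom.local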

I think I have attached all relevant configuration and log files.

Am I doing something obviously wrong, or is failover in this version not working as it should?

Thanks

Comments


Comment from chrismwheeler at 2017-02-20 11:33:26

Config and Debug data
Auth_Debug.zip


Comment from jhrozek at 2017-02-22 18:25:30

Hmm, you might be right. I would expect this code:

(Fri Feb 17 10:54:21 2017) [sssd[be[TESTDOM.LOCAL]]] [_be_fo_set_port_status] (0x8000): Setting status: PORT_WORKING. Called from: src/providers/krb5/krb5_auth.c: krb5_auth_done: 1039                                                      
(Fri Feb 17 10:54:21 2017) [sssd[be[TESTDOM.LOCAL]]] [fo_set_port_status] (0x0100): Marking port 88 of server 'mgt-dc-001.testdom.local' as 'working'                                                                                        
(Fri Feb 17 10:54:21 2017) [sssd[be[TESTDOM.LOCAL]]] [set_server_common_status] (0x0100): Marking server 'mgt-dc-001.testdom.local' as 'working'                                                                                             
(Fri Feb 17 10:54:21 2017) [sssd[be[TESTDOM.LOCAL]]] [fo_set_port_status] (0x0400): Marking port 88 of duplicate server 'mgt-dc-001.testdom.local' as 'working'                                                                              
(Fri Feb 17 10:54:21 2017) [sssd[be[TESTDOM.LOCAL]]] [krb5_mod_ccname] (0x4000): Save ccname [FILE:/tmp/krb5cc_1763401190_ech4R3] for user [gb103690].                                                                                       
(Fri Feb 17 10:54:21 2017) [sssd[be[TESTDOM.LOCAL]]] [ldb] (0x4000): start ldb transaction (nesting: 0)                                                                                                                                      
(Fri Feb 17 10:54:21 2017) [sssd[be[TESTDOM.LOCAL]]] [ldb] (0x4000): start ldb transaction (nesting: 1)                                                                                                                                      
(Fri Feb 17 10:54:21 2017) [sssd[be[TESTDOM.LOCAL]]] [ldb] (0x4000): Added timed event "ltdb_callback": 0x111c640                                                                                                                            
                                                                                                                                                                                                                                             
(Fri Feb 17 10:54:21 2017) [sssd[be[TESTDOM.LOCAL]]] [ldb] (0x4000): Added timed event "ltdb_timeout": 0x111c770                                                                                                                             
                                                                                                                                                                                                                                             
(Fri Feb 17 10:54:21 2017) [sssd[be[TESTDOM.LOCAL]]] [ldb] (0x4000): Running timer event 0x111c640 "ltdb_callback"                                                                                                                           
                                                                                                                                                                                                                                             
(Fri Feb 17 10:54:21 2017) [sssd[be[TESTDOM.LOCAL]]] [ldb] (0x4000): Destroying timer event 0x111c770 "ltdb_timeout"                                                                                                                         
                                                                                                                                                                                                                                             
(Fri Feb 17 10:54:21 2017) [sssd[be[TESTDOM.LOCAL]]] [ldb] (0x4000): Ending timer event 0x111c640 "ltdb_callback"                                                                                                                            
                                                                                                                                                                                                                                             
(Fri Feb 17 10:54:21 2017) [sssd[be[TESTDOM.LOCAL]]] [ldb] (0x4000): commit ldb transaction (nesting: 1)                                                                                                                                     
(Fri Feb 17 10:54:21 2017) [sssd[be[TESTDOM.LOCAL]]] [ldb] (0x4000): commit ldb transaction (nesting: 0)                                                                                                                                     
(Fri Feb 17 10:54:21 2017) [sssd[be[TESTDOM.LOCAL]]] [krb5_auth_done] (0x0100): Backend is marked offline, retry later!                                                                                                                      
(Fri Feb 17 10:54:21 2017) [sssd[be[TESTDOM.LOCAL]]] [check_wait_queue] (0x1000): Wait queue for user [gb103690] is empty.                                                                                                                   
(Fri Feb 17 10:54:21 2017) [sssd[be[TESTDOM.LOCAL]]] [krb5_auth_queue_done] (0x1000): krb5_auth_queue request [0x1118f00] done.                                                                                                              
(Fri Feb 17 10:54:21 2017) [sssd[be[TESTDOM.LOCAL]]] [be_pam_handler_callback] (0x0100): Backend returned: (1, 9, <NULL>) [Provider is Offline]  

to reset the offline status when the port is marked as working.


Comment from lslebodn at 2017-02-23 15:01:50

Fields changed

description: unchanged except that the log excerpts were wrapped in {{{ }}} code blocks (full text identical to the issue description above)


Comment from jhrozek at 2017-02-23 15:39:52

We'll take a look once the core features for 1.15 are done, but as I said, this looks like a bug.

milestone: NEEDS_TRIAGE => SSSD 1.15.3


Comment from chrismwheeler at 2017-02-24 14:40:01

Metadata Update from @ChrisMWheeler:

  • Issue set to the milestone: SSSD 1.15.3

Comment from jhrozek at 2017-03-15 11:26:36

Metadata Update from @jhrozek:

  • Custom field design_review reset
  • Custom field mark reset
  • Custom field patch reset
  • Custom field review reset
  • Custom field sensitive reset
  • Custom field testsupdated reset
  • Issue close_status updated to: None
  • Issue set to the milestone: SSSD 1.15.4 (was: SSSD 1.15.3)

Comment from pbrezina at 2017-09-14 09:48:12

Hi, I'm sorry that it took so long, but unfortunately the debug logs are not available anymore. Do you still have this issue? Could you please attach new logs?

However, from the description, it does look similar to https://pagure.io/SSSD/sssd/issue/3401


Comment from pbrezina at 2017-09-14 09:48:12

Metadata Update from @pbrezina:

  • Custom field design_review reset (from false)
  • Custom field mark reset (from false)
  • Custom field patch reset (from false)
  • Custom field review reset (from false)
  • Custom field sensitive reset (from false)
  • Custom field testsupdated reset (from false)

Comment from chrismwheeler at 2017-09-14 10:48:28

Files re-uploaded on request:

Auth_Debug.zip

It does indeed look similar.


Comment from pbrezina at 2017-09-22 16:49:35

I see you have a very low timeout of 1 second, which from the logs doesn't seem to be enough. Can you try setting dns_resolver_timeout to a higher value? It also should be bigger than dns_resolver_op_timeout, which was not made clear until we mentioned it in the documentation in a recent version. See: https://pagure.io/SSSD/sssd/issue/3217#comment-223033

Set it for example to dns_resolver_timeout = 6, dns_resolver_op_timeout = 3.
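
A minimal sketch of where these options go in sssd.conf (the domain section name below is illustrative; use the actual domain section from your configuration):

[domain/testdom.local]
# overall time allowed for resolving a failover server; must be larger than the per-operation timeout
dns_resolver_timeout = 6
# timeout for a single DNS operation against one DNS server
dns_resolver_op_timeout = 3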


Comment from pbrezina at 2017-09-22 16:49:36

Metadata Update from @pbrezina:

  • Custom field design_review reset (from false)
  • Custom field mark reset (from false)
  • Custom field patch reset (from false)
  • Custom field review reset (from false)
  • Custom field sensitive reset (from false)
  • Custom field testsupdated reset (from false)

Comment from jhrozek at 2017-10-19 21:12:09

We've decided to rename the 1.15.4 release to 1.16.0, but since we need to release 1.16.0 tomorrow, I'm bumping this ticket to 1.16.1


Comment from jhrozek at 2017-10-19 21:12:15

Metadata Update from @jhrozek:

  • Custom field design_review reset (from false)
  • Custom field mark reset (from false)
  • Custom field patch reset (from false)
  • Custom field review reset (from false)
  • Custom field sensitive reset (from false)
  • Custom field testsupdated reset (from false)

Comment from jhrozek at 2017-10-19 21:14:07

Metadata Update from @jhrozek:

  • Custom field design_review reset (from false)
  • Custom field mark reset (from false)
  • Custom field patch reset (from false)
  • Custom field review reset (from false)
  • Custom field sensitive reset (from false)
  • Custom field testsupdated reset (from false)
  • Issue set to the milestone: SSSD 1.16.1 (was: SSSD 1.15.4)

Comment from chrismwheeler at 2017-11-07 16:57:45

I see you have a very low timeout of 1 second, which from the logs doesn't seem to be enough. Can you try setting dns_resolver_timeout to a higher value? It also should be bigger than dns_resolver_op_timeout, which was not made clear until we mentioned it in the documentation in a recent version. See: https://pagure.io/SSSD/sssd/issue/3217#comment-223033
Set it for example to dns_resolver_timeout = 6, dns_resolver_op_timeout = 3.

Unfortunately this doesn't resolve the Kerberos failover issues.

Still seeing the port marked as working when this is not true:
(Tue Nov 7 10:03:00 2017) [sssd[be[TESTDOM.LOCAL]]] [_be_fo_set_port_status] (0x8000): Setting status: PORT_WORKING. Called from: src/providers/krb5/krb5_auth.c: krb5_auth_done: 1093
(Tue Nov 7 10:03:00 2017) [sssd[be[TESTDOM.LOCAL]]] [fo_set_port_status] (0x0100): Marking port 0 of server 'mgt-dc-001.testdom.local' as 'working'

Which leads to failure without trying the backup server:
(Tue Nov 7 10:03:00 2017) [sssd[be[SANDBOX.LOCAL]]] [krb5_auth_done] (0x0100): Backend is marked offline, retry later!


Comment from jhrozek at 2017-12-15 19:59:08

Metadata Update from @jhrozek:

  • Custom field design_review reset (from false)
  • Custom field mark reset (from false)
  • Custom field patch reset (from false)
  • Custom field review reset (from false)
  • Custom field sensitive reset (from false)
  • Custom field testsupdated reset (from false)
  • Issue tagged with: postpone-to-2-0

Comment from jhrozek at 2018-01-08 17:56:34

Metadata Update from @jhrozek:

  • Custom field design_review reset (from false)
  • Custom field mark reset (from false)
  • Custom field patch reset (from false)
  • Custom field review reset (from false)
  • Custom field sensitive reset (from false)
  • Custom field testsupdated reset (from false)
  • Issue untagged with: postpone-to-2-0
  • Issue set to the milestone: SSSD 2.0 (was: SSSD 1.16.1)

Comment from jhrozek at 2018-08-13 10:14:22

Metadata Update from @jhrozek:

  • Custom field design_review reset (from false)
  • Custom field mark reset (from false)
  • Custom field patch reset (from false)
  • Custom field review reset (from false)
  • Custom field sensitive reset (from false)
  • Custom field testsupdated reset (from false)
  • Issue set to the milestone: SSSD 2.1 (was: SSSD 2.0)

Comment from jhrozek at 2019-02-22 15:51:32

Metadata Update from @jhrozek:

  • Custom field design_review reset (from false)
  • Custom field mark reset (from false)
  • Custom field patch reset (from false)
  • Custom field review reset (from false)
  • Custom field sensitive reset (from false)
  • Custom field testsupdated reset (from false)
  • Issue set to the milestone: SSSD 2.2 (was: SSSD 2.1)

Comment from jhrozek at 2019-06-13 23:23:28

Metadata Update from @jhrozek:

  • Custom field design_review reset (from false)
  • Custom field mark reset (from false)
  • Custom field patch reset (from false)
  • Custom field review reset (from false)
  • Custom field sensitive reset (from false)
  • Custom field testsupdated reset (from false)
  • Issue set to the milestone: SSSD 2.3 (was: SSSD 2.2)

Comment from thalman at 2020-03-11 12:12:47

Metadata Update from @thalman:

  • Custom field design_review reset (from false)
  • Custom field mark reset (from false)
  • Custom field patch reset (from false)
  • Custom field review reset (from false)
  • Custom field sensitive reset (from false)
  • Custom field testsupdated reset (from false)
  • Issue tagged with: Canditate to close

Comment from pbrezina at 2020-03-11 12:12:56

I'm sorry that this took such a long time. Do you still have issues?


Comment from pbrezina at 2020-03-11 12:12:56

Metadata Update from @pbrezina:

  • Custom field design_review reset (from false)
  • Custom field mark reset (from false)
  • Custom field patch reset (from false)
  • Custom field review reset (from false)
  • Custom field sensitive reset (from false)
  • Custom field testsupdated reset (from false)
  • Issue untagged with: Canditate to close

Comment from chrismwheeler at 2020-03-11 13:17:27

Fortunately our AD servers haven't gone down, so we haven't experienced it.

Is there reason to believe the issue has been fixed? I could set up a test environment to simulate the issue if required.

@andreboscatto
Contributor

Dear Contributor/User,

Recognizing the importance of addressing enhancements, bugs, and issues for the SSSD project's quality and reliability, we also need to consider our long-term goals and resource constraints.

After thoughtful consideration, regrettably, we are unable to address this request at this time. To avoid any misconception, we're closing it; however, we encourage continued collaboration and contributions from anyone interested.

We apologize for any inconvenience and appreciate your understanding of our resource limitations. While you're welcome to open a new issue (or reopen this one), immediate attention may not be guaranteed due to competing priorities.

Thank you once again for sharing your feedback. We look forward to ongoing collaboration to deliver the best possible solutions, supporting in any way we can.

Best regards,
André Boscatto

@andreboscatto closed this as not planned on Nov 20, 2023.