
SSSD AD Failover Failure #4345

Closed
sssd-bot opened this issue May 2, 2020 · 1 comment


sssd-bot commented May 2, 2020

Cloned from Pagure issue: https://pagure.io/SSSD/sssd/issue/3312

  • Created at 2017-02-20 11:32:58 by chrismwheeler
  • Assigned to nobody

I am trying to understand why our Red Hat Linux devices fail to fail over correctly when using SSSD authentication to AD: the logs show inaccessible servers being marked as 'working', and after a failure SSSD does not attempt to contact the server that is accessible.

My test setup, on IBM SoftLayer, involves a single Red Hat 6.x VM, authtest.testdom.local (IP 192.168.100.50), set up on a different VLAN to the AD servers, mgt-dc-001.testdom.local (192.168.200.133) and mgt-dc-002.testdom.local (192.168.200.134). The traffic is allowed or blocked by a network device, infra-vyatta.testdom.local, which allows me to simulate server failure. (Note - IPs and Hostnames have been changed for security)

I currently have logging at maximum, and have captured log files and network traffic while all communication with mgt-dc-001.testdom.local is blocked by the Vyatta device and all communication with mgt-dc-002.testdom.local is allowed.
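
For context, the failover-relevant part of the SSSD domain configuration has roughly the following shape (an illustrative sketch only, not the attached file; the actual option values are in the sssd.conf included in the debug archive):

[domain/testdom.local]
id_provider = ad
auth_provider = ad
ad_domain = testdom.local
# both domain controllers listed so that failover between them is possible
ad_server = mgt-dc-001.testdom.local, mgt-dc-002.testdom.local
# verbose debug output, matching the captured logs
debug_level = 10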

The attached logs show two events: the start of the sssd service at 10:53:11 and the log-on attempt at 10:54:21.

What I would like to understand is the following.

Why is port 88 of mgt-dc-001 marked as 'working', when the attached network capture clearly shows that it is not responding:

(Fri Feb 17 10:54:21 2017) [sssd[be[TESTDOM.LOCAL]]] [_be_fo_set_port_status] (0x8000): Setting status: PORT_WORKING. Called from: src/providers/krb5/krb5_auth.c: krb5_auth_done: 1039
(Fri Feb 17 10:54:21 2017) [sssd[be[TESTDOM.LOCAL]]] [fo_set_port_status] (0x0100): Marking port 88 of server 'mgt-dc-001.testdom.local' as 'working'
(Fri Feb 17 10:54:21 2017) [sssd[be[TESTDOM.LOCAL]]] [set_server_common_status] (0x0100): Marking server 'mgt-dc-001.testdom.local' as 'working'
(Fri Feb 17 10:54:21 2017) [sssd[be[TESTDOM.LOCAL]]] [fo_set_port_status] (0x0400): Marking port 88 of duplicate server 'mgt-dc-001.testdom.local' as 'working'

Because it is not responding, Kerberos drops into offline mode rather than trying mgt-dc-002.testdom.local:

(Fri Feb 17 10:54:21 2017) [sssd[be[TESTDOM.LOCAL]]] [krb5_auth_done] (0x0100): Backend is marked offline, retry later!

So it then tries to reset all services, as 'no server' is available:

(Fri Feb 17 10:54:48 2017) [sssd[be[TESTDOM.LOCAL]]] [fo_reset_services] (0x1000): Resetting all servers in all services

Whenever it checks port 88 on mgt-dc-001 it seems to mark it as working; it never reports checking mgt-dc-002, and it keeps falling back to offline authentication, which then fails.

Also, for the AD and LDAP services it frequently marks port 0 as not working, but if a port is specified it cannot resolve the name, because it attempts DNS resolution with the port appended to the hostname! I'm not sure whether this is related.

What we expect is for SSSD to detect that port 88 on mgt-dc-001.testdom.local is not working, mark it as 'not working', and then fail over to mgt-dc-002. This does happen, but only very occasionally. The network capture shows that port 88 traffic to mgt-dc-002 is answered while port 88 traffic to mgt-dc-001 is not, that after LDAP communication on port 389 to mgt-dc-001 fails it successfully connects to mgt-dc-002, and that DNS communication with mgt-dc-002 is succeeding.
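
As a cross-check on the Kerberos side, it is also possible to see which KDC SSSD is currently advertising to libkrb5 via its locator plugin (a diagnostic sketch, assuming the default pubconf path and the TESTDOM.LOCAL realm):

# KDC address last written by SSSD for the realm
cat /var/lib/sss/pubconf/kdcinfo.TESTDOM.LOCAL

# addresses of the two domain controllers, for comparison
getent hosts mgt-dc-001.testdom.local mgt-dc-002.testdom.local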

I think I have attached all relevant configuration and log files.

Am I doing something obviously wrong, or is failover in this version not working as it should?

Thanks

Comments


Comment from chrismwheeler at 2017-02-20 11:33:26

Config and Debug data
Auth_Debug.zip


Comment from jhrozek at 2017-02-22 18:25:30

Hmm, you might be right. I would expect this code:

(Fri Feb 17 10:54:21 2017) [sssd[be[TESTDOM.LOCAL]]] [_be_fo_set_port_status] (0x8000): Setting status: PORT_WORKING. Called from: src/providers/krb5/krb5_auth.c: krb5_auth_done: 1039                                                      
(Fri Feb 17 10:54:21 2017) [sssd[be[TESTDOM.LOCAL]]] [fo_set_port_status] (0x0100): Marking port 88 of server 'mgt-dc-001.testdom.local' as 'working'                                                                                        
(Fri Feb 17 10:54:21 2017) [sssd[be[TESTDOM.LOCAL]]] [set_server_common_status] (0x0100): Marking server 'mgt-dc-001.testdom.local' as 'working'                                                                                             
(Fri Feb 17 10:54:21 2017) [sssd[be[TESTDOM.LOCAL]]] [fo_set_port_status] (0x0400): Marking port 88 of duplicate server 'mgt-dc-001.testdom.local' as 'working'                                                                              
(Fri Feb 17 10:54:21 2017) [sssd[be[TESTDOM.LOCAL]]] [krb5_mod_ccname] (0x4000): Save ccname [FILE:/tmp/krb5cc_1763401190_ech4R3] for user [gb103690].                                                                                       
(Fri Feb 17 10:54:21 2017) [sssd[be[TESTDOM.LOCAL]]] [ldb] (0x4000): start ldb transaction (nesting: 0)                                                                                                                                      
(Fri Feb 17 10:54:21 2017) [sssd[be[TESTDOM.LOCAL]]] [ldb] (0x4000): start ldb transaction (nesting: 1)                                                                                                                                      
(Fri Feb 17 10:54:21 2017) [sssd[be[TESTDOM.LOCAL]]] [ldb] (0x4000): Added timed event "ltdb_callback": 0x111c640                                                                                                                            
                                                                                                                                                                                                                                             
(Fri Feb 17 10:54:21 2017) [sssd[be[TESTDOM.LOCAL]]] [ldb] (0x4000): Added timed event "ltdb_timeout": 0x111c770                                                                                                                             
                                                                                                                                                                                                                                             
(Fri Feb 17 10:54:21 2017) [sssd[be[TESTDOM.LOCAL]]] [ldb] (0x4000): Running timer event 0x111c640 "ltdb_callback"                                                                                                                           
                                                                                                                                                                                                                                             
(Fri Feb 17 10:54:21 2017) [sssd[be[TESTDOM.LOCAL]]] [ldb] (0x4000): Destroying timer event 0x111c770 "ltdb_timeout"                                                                                                                         
                                                                                                                                                                                                                                             
(Fri Feb 17 10:54:21 2017) [sssd[be[TESTDOM.LOCAL]]] [ldb] (0x4000): Ending timer event 0x111c640 "ltdb_callback"                                                                                                                            
                                                                                                                                                                                                                                             
(Fri Feb 17 10:54:21 2017) [sssd[be[TESTDOM.LOCAL]]] [ldb] (0x4000): commit ldb transaction (nesting: 1)                                                                                                                                     
(Fri Feb 17 10:54:21 2017) [sssd[be[TESTDOM.LOCAL]]] [ldb] (0x4000): commit ldb transaction (nesting: 0)                                                                                                                                     
(Fri Feb 17 10:54:21 2017) [sssd[be[TESTDOM.LOCAL]]] [krb5_auth_done] (0x0100): Backend is marked offline, retry later!                                                                                                                      
(Fri Feb 17 10:54:21 2017) [sssd[be[TESTDOM.LOCAL]]] [check_wait_queue] (0x1000): Wait queue for user [gb103690] is empty.                                                                                                                   
(Fri Feb 17 10:54:21 2017) [sssd[be[TESTDOM.LOCAL]]] [krb5_auth_queue_done] (0x1000): krb5_auth_queue request [0x1118f00] done.                                                                                                              
(Fri Feb 17 10:54:21 2017) [sssd[be[TESTDOM.LOCAL]]] [be_pam_handler_callback] (0x0100): Backend returned: (1, 9, <NULL>) [Provider is Offline]  

to reset the offline status when the port is marked as working.


Comment from lslebodn at 2017-02-23 15:01:50

Fields changed

description: unchanged except that the log excerpts were wrapped in {{{ }}} code blocks (full text identical to the issue description above)


Comment from jhrozek at 2017-02-23 15:39:52

We'll take a look once the core features for 1.15 are done, but as I said, this looks like a bug.

milestone: NEEDS_TRIAGE => SSSD 1.15.3


Comment from chrismwheeler at 2017-02-24 14:40:01

Metadata Update from @ChrisMWheeler:

  • Issue set to the milestone: SSSD 1.15.3

Comment from jhrozek at 2017-03-15 11:26:36

Metadata Update from @jhrozek:

  • Custom field design_review reset
  • Custom field mark reset
  • Custom field patch reset
  • Custom field review reset
  • Custom field sensitive reset
  • Custom field testsupdated reset
  • Issue close_status updated to: None
  • Issue set to the milestone: SSSD 1.15.4 (was: SSSD 1.15.3)

Comment from pbrezina at 2017-09-14 09:48:12

Hi, I'm sorry that it took so long, but unfortunately the debug logs are not available anymore. Do you still have this issue? Could you please attach new logs?

However, from the description, it does look similar to https://pagure.io/SSSD/sssd/issue/3401


Comment from pbrezina at 2017-09-14 09:48:12

Metadata Update from @pbrezina:

  • Custom field design_review reset (from false)
  • Custom field mark reset (from false)
  • Custom field patch reset (from false)
  • Custom field review reset (from false)
  • Custom field sensitive reset (from false)
  • Custom field testsupdated reset (from false)

Comment from chrismwheeler at 2017-09-14 10:48:28

Files re-uploaded on request:

Auth_Debug.zip

It does indeed look similar.


Comment from pbrezina at 2017-09-22 16:49:35

I see you have a very low timeout of 1 second, which from the logs doesn't seem to be enough. Can you try setting dns_resolver_timeout to a higher value? It also should be bigger than dns_resolver_op_timeout, which was not made clear until we mentioned it in the documentation in a recent version. See: https://pagure.io/SSSD/sssd/issue/3217#comment-223033

Set it for example to dns_resolver_timeout = 6, dns_resolver_op_timeout = 3.
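
A minimal sketch of where these options go in sssd.conf (the domain section name below is illustrative; use the actual domain section from your configuration):

[domain/testdom.local]
# overall time allowed for resolving a failover server; must be larger than the per-operation timeout
dns_resolver_timeout = 6
# timeout for a single DNS operation against one DNS server
dns_resolver_op_timeout = 3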


Comment from pbrezina at 2017-09-22 16:49:36

Metadata Update from @pbrezina:

  • Custom field design_review reset (from false)
  • Custom field mark reset (from false)
  • Custom field patch reset (from false)
  • Custom field review reset (from false)
  • Custom field sensitive reset (from false)
  • Custom field testsupdated reset (from false)

Comment from jhrozek at 2017-10-19 21:12:09

We've decided to rename the 1.15.4 release to 1.16.0, but since we need to release 1.16.0 tomorrow, I'm bumping this ticket to 1.16.1


Comment from jhrozek at 2017-10-19 21:12:15

Metadata Update from @jhrozek:

  • Custom field design_review reset (from false)
  • Custom field mark reset (from false)
  • Custom field patch reset (from false)
  • Custom field review reset (from false)
  • Custom field sensitive reset (from false)
  • Custom field testsupdated reset (from false)

Comment from jhrozek at 2017-10-19 21:14:07

Metadata Update from @jhrozek:

  • Custom field design_review reset (from false)
  • Custom field mark reset (from false)
  • Custom field patch reset (from false)
  • Custom field review reset (from false)
  • Custom field sensitive reset (from false)
  • Custom field testsupdated reset (from false)
  • Issue set to the milestone: SSSD 1.16.1 (was: SSSD 1.15.4)

Comment from chrismwheeler at 2017-11-07 16:57:45

I see you have a very low timeout of 1 second, which from the logs doesn't seem to be enough. Can you try setting dns_resolver_timeout to a higher value? It also should be bigger than dns_resolver_op_timeout, which was not made clear until we mentioned it in the documentation in a recent version. See: https://pagure.io/SSSD/sssd/issue/3217#comment-223033
Set it for example to dns_resolver_timeout = 6, dns_resolver_op_timeout = 3.

Unfortunately this doesn't resolve the Kerberos failover issues.

Still seeing the port marked as working when this is not true:
(Tue Nov 7 10:03:00 2017) [sssd[be[TESTDOM.LOCAL]]] [_be_fo_set_port_status] (0x8000): Setting status: PORT_WORKING. Called from: src/providers/krb5/krb5_auth.c: krb5_auth_done: 1093
(Tue Nov 7 10:03:00 2017) [sssd[be[TESTDOM.LOCAL]]] [fo_set_port_status] (0x0100): Marking port 0 of server 'mgt-dc-001.testdom.local' as 'working'

Which leads to failure without trying the backup server:
(Tue Nov 7 10:03:00 2017) [sssd[be[SANDBOX.LOCAL]]] [krb5_auth_done] (0x0100): Backend is marked offline, retry later!


Comment from jhrozek at 2017-12-15 19:59:08

Metadata Update from @jhrozek:

  • Custom field design_review reset (from false)
  • Custom field mark reset (from false)
  • Custom field patch reset (from false)
  • Custom field review reset (from false)
  • Custom field sensitive reset (from false)
  • Custom field testsupdated reset (from false)
  • Issue tagged with: postpone-to-2-0

Comment from jhrozek at 2018-01-08 17:56:34

Metadata Update from @jhrozek:

  • Custom field design_review reset (from false)
  • Custom field mark reset (from false)
  • Custom field patch reset (from false)
  • Custom field review reset (from false)
  • Custom field sensitive reset (from false)
  • Custom field testsupdated reset (from false)
  • Issue untagged with: postpone-to-2-0
  • Issue set to the milestone: SSSD 2.0 (was: SSSD 1.16.1)

Comment from jhrozek at 2018-08-13 10:14:22

Metadata Update from @jhrozek:

  • Custom field design_review reset (from false)
  • Custom field mark reset (from false)
  • Custom field patch reset (from false)
  • Custom field review reset (from false)
  • Custom field sensitive reset (from false)
  • Custom field testsupdated reset (from false)
  • Issue set to the milestone: SSSD 2.1 (was: SSSD 2.0)

Comment from jhrozek at 2019-02-22 15:51:32

Metadata Update from @jhrozek:

  • Custom field design_review reset (from false)
  • Custom field mark reset (from false)
  • Custom field patch reset (from false)
  • Custom field review reset (from false)
  • Custom field sensitive reset (from false)
  • Custom field testsupdated reset (from false)
  • Issue set to the milestone: SSSD 2.2 (was: SSSD 2.1)

Comment from jhrozek at 2019-06-13 23:23:28

Metadata Update from @jhrozek:

  • Custom field design_review reset (from false)
  • Custom field mark reset (from false)
  • Custom field patch reset (from false)
  • Custom field review reset (from false)
  • Custom field sensitive reset (from false)
  • Custom field testsupdated reset (from false)
  • Issue set to the milestone: SSSD 2.3 (was: SSSD 2.2)

Comment from thalman at 2020-03-11 12:12:47

Metadata Update from @thalman:

  • Custom field design_review reset (from false)
  • Custom field mark reset (from false)
  • Custom field patch reset (from false)
  • Custom field review reset (from false)
  • Custom field sensitive reset (from false)
  • Custom field testsupdated reset (from false)
  • Issue tagged with: Canditate to close

Comment from pbrezina at 2020-03-11 12:12:56

I'm sorry that this took such a long time. Do you still have issues?


Comment from pbrezina at 2020-03-11 12:12:56

Metadata Update from @pbrezina:

  • Custom field design_review reset (from false)
  • Custom field mark reset (from false)
  • Custom field patch reset (from false)
  • Custom field review reset (from false)
  • Custom field sensitive reset (from false)
  • Custom field testsupdated reset (from false)
  • Issue untagged with: Canditate to close

Comment from chrismwheeler at 2020-03-11 13:17:27

Fortunately our AD servers haven't gone down, so we haven't experienced it.

Is there reason to believe the issue has been fixed? I could set up a test environment to simulate the issue if required.

@andreboscatto
Contributor

Dear Contributor/User,

Recognizing the importance of addressing enhancements, bugs, and issues for the SSSD project's quality and reliability, we also need to consider our long-term goals and resource constraints.

After thoughtful consideration, regrettably, we are unable to address this request at this time. To avoid any misconception, we're closing it; however, we encourage continued collaboration and contributions from anyone interested.

We apologize for any inconvenience and appreciate your understanding of our resource limitations. While you're welcome to open a new issue (or reopen this one), immediate attention may not be guaranteed due to competing priorities.

Thank you once again for sharing your feedback. We look forward to ongoing collaboration to deliver the best possible solutions, supporting in any way we can.

Best regards,
André Boscatto

@andreboscatto closed this as not planned on Nov 20, 2023.