Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ad: use parallel cldap ping for site discovery #5300

Closed
wants to merge 7 commits into from

Conversation

pbrezina
Copy link
Member

This PR adds support for CLDAP (connectionless LDAP over UDP) and makes
use of it during site discovery.

The discovery process is improved, it now sends the CLDAP ping to all
discovered domain controllers at once to avoid potential timeouts when
some of them are unreachable (which is quite common in real environment).
The first received reply is used.

The netlogon information is now requested only when SSSD starts and when
we are recovering from offline state To avoid swamping network with many
ping. This is quite fine given the site and forest information is not
dynamic but usually stable.

Additional improvement is that if we already know client's site we ping
only in-site controllers first. We were already doing that but now we
also fallback to off-site controllers in case none in-site controller
is unreachable. Previously we would just fail.

There is a bug in openldap [1] that makes SSSD wait for timeout in-case none
of the controller is reachable, but as long as at least one is reachble
the resolution will be fast.

Resolves:
#3743
#5215

[1] https://bugs.openldap.org/show_bug.cgi?id=9328

@abbra
Copy link
Contributor

abbra commented Sep 17, 2020

@pbrezina I see linking failures:

/usr/bin/ld: /tmp/nss-srv-tests.hGpNig.ltrans0.ltrans.o: in function `__wrap_sss_packet_get_body':
/builddir/build/BUILD/sssd-2.3.2/src/tests/cmocka/test_nss_srv.c:103: undefined reference to `sss_packet_get_body'
/usr/bin/ld: /tmp/nss-srv-tests.hGpNig.ltrans1.ltrans.o: in function `__wrap_sss_cmd_done':
/builddir/build/BUILD/sssd-2.3.2/src/tests/cmocka/test_nss_srv.c:139: undefined reference to `sss_packet_get_body'
/usr/bin/ld: /tmp/nss-srv-tests.hGpNig.ltrans1.ltrans.o: in function `__wrap_sss_ncache_check_user':
/builddir/build/BUILD/sssd-2.3.2/src/tests/cmocka/test_nss_srv.c:170: undefined reference to `sss_ncache_check_user'
/usr/bin/ld: /tmp/nss-srv-tests.hGpNig.ltrans1.ltrans.o: in function `__wrap_sss_ncache_check_upn':
/builddir/build/BUILD/sssd-2.3.2/src/tests/cmocka/test_nss_srv.c:185: undefined reference to `sss_ncache_check_upn'
/usr/bin/ld: /tmp/nss-srv-tests.hGpNig.ltrans1.ltrans.o: in function `__wrap_sss_ncache_check_uid':
/builddir/build/BUILD/sssd-2.3.2/src/tests/cmocka/test_nss_srv.c:200: undefined reference to `sss_ncache_check_uid'
/usr/bin/ld: /tmp/nss-srv-tests.hGpNig.ltrans1.ltrans.o: in function `__wrap_sss_ncache_check_sid':
/builddir/build/BUILD/sssd-2.3.2/src/tests/cmocka/test_nss_srv.c:213: undefined reference to `sss_ncache_check_sid'
/usr/bin/ld: /tmp/nss-srv-tests.hGpNig.ltrans1.ltrans.o: in function `__wrap_sss_ncache_check_cert':
/builddir/build/BUILD/sssd-2.3.2/src/tests/cmocka/test_nss_srv.c:226: undefined reference to `sss_ncache_check_cert'
collect2: error: ld returned 1 exit status

@alexey-tikhonov
Copy link
Member

@pbrezina I see linking failures:

/usr/bin/ld: /tmp/nss-srv-tests.hGpNig.ltrans0.ltrans.o: in function `__wrap_sss_packet_get_body':

...

cmocka based tests aren't compatible with LTO linker flag.

This only affects few tests.

In Fedora spec-file it is worked around quite bluntly:
https://src.fedoraproject.org/rpms/sssd/c/ca22aded04cabb6a71acb2b9271497b104bd6189?branch=master

@sumit-bose
Copy link
Contributor

Hi,

thanks for the patches. My tests with various configurations with multiple existing, not-existing or not responding DCs went well. Coverity didn't had anything to complain either.

While reading the patches I was wondering if it would make sense to create a macro for "cldap" since the string is used in a comparison? What do you think? This can be extended of course to the "ldap", "ldaps" and "ldapi" strings when used to indicate an LDAP schema. Since this won't improve the readability I'm fine with keeping it as it is.

bye,
Sumit

@sumit-bose
Copy link
Contributor

Hi,

maybe there should be some limit in the for-loop so that SSSD does not accidentally tries to ping hundreds of DC in a larger AD environment.

adcli implements a scheme described in https://winprotocoldoc.blob.core.windows.net/productionwindowsarchives/WinArchive/%5bMS-DISO%5d.pdf section 5.4.5.3 where first 5 DC are pinged with a 0.4s timeout, then the next 5 are pinged with a 0.2s timeout and finally all other with a timeout of 0.1s. Unfortunately I haven't found any newer document how Windows clients are doing CLDAP pings.

bye,
Sumit

@pbrezina
Copy link
Member Author

Do you mean there is 100ms timeout for the DC's to deliver a reply? That's quite low, isn't it?

All Windows clients uses CLDAP (UDP) for LDAP ping. Even though AD
also supports LDAP ping over TCP IPA does not therefore it is crusial
for us to perform the ping over CLDAP protocol.

Resolves:
SSSD#5215
@sumit-bose
Copy link
Contributor

Do you mean there is 100ms timeout for the DC's to deliver a reply? That's quite low, isn't it?

If I understand the MFST document correctly the client still sends the pings sequentially and the timeout is the time beofre sending the next ping. I guess the client would still accept replies from other DC pinged earlier.

So it idea is more the other way round. The DCs are split in batches to not ping all DCs at once. Then give the first batch of DC a longer time to reply to increase the chance that the reply is received in that time, for the second batch wait already a bit shorter and for the rest even shorter to make sure that the waiting time does not depend too much on the number of DC.

bye,
Sumit

@pbrezina
Copy link
Member Author

So do you suggest to implement 3 batches that are sent at time T (5 servers), T+400ms (5 servers), T+600ms (remainder) or to iterate over it sequentially with slow delay before going to next dc (to ping first five servers will take 4005, the next five 2005 and the remaining 100*X)?

@sumit-bose
Copy link
Contributor

So do you suggest to implement 3 batches that are sent at time T (5 servers), T+400ms (5 servers), T+600ms (remainder) or to iterate over it sequentially with slow delay before going to next dc (to ping first five servers will take 400_5, the next five 200_5 and the remaining 100*X)?

Hi,

I was thinking of keeping the parallel pings but doing it in batches, so that in the typical/positive case we do not have to send pings to all DCs.

bye,
Sumit

@pbrezina
Copy link
Member Author

Done. I added ad_cldap_ping_parallel_batch and logic around it.

@sumit-bose
Copy link
Contributor

Hi,

thanks for the changes, the patches are working well and with debug_microseconds = True one can actually see the time SSSD is waiting for replies for the batches.

To make sure people will understand in future why the batch sizes and waiting times were chosen this way I wonder if you can add a reference to [MS-DISO] section 5.4.5.3 to the comment?

bye,
Sumit

Previous implementation would not fallback to the off-site domain
controllers. This would cause problems if the site actually changed.
Site and forest information is stable not dynamic. To avoid spamming
network with cldap pings all the time we will renew netlogon information
only when SSSD starts and when we are recovering from an offline state
to detect possible change (e.g. user moves to another location with laptop).
@pbrezina
Copy link
Member Author

pbrezina commented Oct 1, 2020

Done. I only changed comment in ad_cldap_ping_parallel_batch.

@sumit-bose
Copy link
Contributor

Hi,

thanks, ACK.

bye,
Sumit

@pbrezina pbrezina added the Ready to push Ready to push label Oct 2, 2020
@pbrezina
Copy link
Member Author

pbrezina commented Oct 2, 2020

Pushed PR: #5300

  • master
    • f0d6507 - tevent: correctly handle req timeout error
    • 9fdf5cf - ad: renew site information only when SSSD was previously offline
    • a62a13a - man: fix typo in failover description
    • fcfd834 - ad: if all in-site dc are unreachable try off-site controllers
    • 1889ca6 - ad: connect to the first available server for cldap ping
    • 8265674 - ad: use cldap for site and forrest discover (perform CLDAP ping)
    • 414593c - ldap: add support for cldap and udp connections

@pbrezina pbrezina added Pushed and removed Accepted labels Oct 2, 2020
@sumit-bose
Copy link
Contributor

Hi @pbrezina,

I came across an issue with current_forest and current_site. After some calls to ad_cldap_ping_send() with renew_site == False the value was just garbage. I guess something like

diff --git a/src/providers/ad/ad_cldap_ping.c b/src/providers/ad/ad_cldap_ping.c
index cd23065..ac6f72d 100644
--- a/src/providers/ad/ad_cldap_ping.c
+++ b/src/providers/ad/ad_cldap_ping.c
@@ -597,8 +597,8 @@ struct tevent_req *ad_cldap_ping_send(TALLOC_CTX *mem_ctx,
     }
 
     if (!srv_ctx->renew_site) {
-        state->site = srv_ctx->current_site;
-        state->forest = srv_ctx->current_forest;
+        state->site = talloc_strdup(state, srv_ctx->current_site);
+        state->forest = talloc_strdup(state, srv_ctx->current_forest);
         DEBUG(SSSDBG_TRACE_FUNC,
               "CLDAP ping is not necessary, using site '%s' and forest '%s'\n",
               state->site != NULL ? state->site : "unknown",

is needed to make sure that due to the talloc_steal() in the *_recv() function the values from src_ctx do not loose their parent.

bye,
Sumit

@alexey-tikhonov
Copy link
Member

JFTR, the above comment was addressed via 37ba37a

@pbrezina pbrezina deleted the adsite-cldap-parallel branch April 13, 2022 10:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants