
Database chaining doesn't fail to other servers in farm when bind fails and disallowing anonymous bind #4666

Closed
glbyers opened this issue Mar 9, 2021 · 20 comments

@glbyers

glbyers commented Mar 9, 2021

In our environment, we'd like to use a chaining backend to push BIND operations up to the masters by way of the consumer (rather than a client referral). We'd like to do this to ensure password lockout attributes are propagated to all consumers equally via our standard replication agreements. This is described here: https://directory.fedoraproject.org/docs/389ds/howto/howto-chainonupdate.html.

Note: we do not have hubs in our topology, just masters and consumers, so there is no intermediate chaining.

We tested this process in our environment and it worked beautifully until we took it to production. Currently we have just two masters, and both sit on over-subscribed hardware that suffers from I/O starvation (and hence very long I/O wait times) at certain times of the day. The plan is to scale out our masters eventually, but we're a little hamstrung by other projects and priorities. The starvation is generally short-lived and hits each master at a different time of day. However, it seems that once both nsfarmservers have "failed", the consumer never attempts to retry them. This leads to bind errors as follows:

ldapwhoami -x -D "<binddn>" -W
Enter LDAP Password:
ldap_bind: Operations error (1)
        additional info: FARM SERVER TEMPORARY UNAVAILABLE

Except it is not temporary. It never recovers, even though all members of nsfarmservers are now healthy again.

I tested various combinations of the chaining tuning params without success and, after further debugging, confirmed that the problem always starts after a bind operation timeout. Looking into the chaining plugin code, I see that an operation timeout results in a call to cb_ping_farm to find another available server in the pool. However, it performs this search (the comment is telling):


    /* NOTE: This will fail if we implement the ability to disable
       anonymous bind */
    rc = ldap_search_ext_s(ld, NULL, LDAP_SCOPE_BASE, "objectclass=*", attrs, 1, NULL,
                           NULL, &timeout, 1, &result);
    if (LDAP_SUCCESS != rc) {
        slapi_ldap_unbind(ld);
        cb_update_failed_conn_cpt(cb);
        return LDAP_SERVER_DOWN;
    }

So basically, because we've disallowed anonymous bind for anything but the rootdse, it will always fail to find another available server. I confirmed this by allowing anonymous bind on our masters while the issue was present; subsequent binds on the consumers then started working again.
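
To see the failure mode outside the plugin, the two probes can be reproduced with python-ldap (a minimal sketch; the host and suffix are placeholders, and it assumes the master has nsslapd-allow-anonymous-access: rootdse):

import ldap

l = ldap.initialize("ldaps://<master1>:636")  # placeholder master URL

# Probe 1: what cb_ping_farm effectively does -- an anonymous base search
# on a non-rootdse DN. With anonymous access restricted to the rootdse,
# this raises an LDAP error, so the server looks "down" to the consumer.
try:
    l.search_s("<suffix>", ldap.SCOPE_BASE, "(objectclass=*)", ["1.1"])
except ldap.LDAPError as err:
    print("suffix probe failed:", err)

# Probe 2: the same anonymous search against the rootDSE (empty base),
# which is still permitted and succeeds.
print(l.search_s("", ldap.SCOPE_BASE, "(objectclass=*)", ["1.1"]))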

I made & tested the following change in our environment to ensure the search test in cb_ping_farm always uses the rootdse, for which we allow anonymous binds (via the nsslapd-allow-anonymous-access attribute in cn=config):

diff -urN a/ldap/servers/plugins/chainingdb/cb_conn_stateless.c b/ldap/servers/plugins/chainingdb/cb_conn_stateless.c
--- a/ldap/servers/plugins/chainingdb/cb_conn_stateless.c       2020-03-17 04:52:57.000000000 +1000
+++ b/ldap/servers/plugins/chainingdb/cb_conn_stateless.c       2021-03-08 14:04:48.413647052 +1000
@@ -883,7 +883,7 @@
     /* NOTE: This will fail if we implement the ability to disable
        anonymous bind */
-    rc = ldap_search_ext_s(ld, NULL, LDAP_SCOPE_BASE, "objectclass=*", attrs, 1, NULL,
+    rc = ldap_search_ext_s(ld, "", LDAP_SCOPE_BASE, "objectclass=*", attrs, 1, NULL,
                            NULL, &timeout, 1, &result);
     if (LDAP_SUCCESS != rc) {
         slapi_ldap_unbind(ld);

My tests have all been successful.

I am running the stress tool on both our development masters to simulate I/O starvation (stress --io 1 --hdd 1 --hdd-bytes 2G), and on one of the clients, I run some simple code in a loop to trigger the original problem:

import ldap
import getpass
import random

if __name__ == "__main__":
    binddn = '<bind-id>'
    bindpw = getpass.getpass()

    while True:
        # Roughly 10% of iterations bind with a bogus password, to exercise
        # the password-lockout updates that are chained to the masters.
        r = int(random.random() * 100)
        l = ldap.initialize("ldaps://<consumer>")
        try:
            if r > 10:
                l.simple_bind_s(binddn, bindpw)
            else:
                l.simple_bind_s(binddn, "bogus")
        except ldap.INVALID_CREDENTIALS:
            continue
        except ldap.OPERATIONS_ERROR as err:
            # Stop as soon as the consumer wrongly declares the farm dead.
            if err.args[0].get('info') == 'FARM SERVER TEMPORARY UNAVAILABLE':
                raise
            print(err)
            continue
        except ldap.LDAPError:
            # Any other LDAP error is unexpected; let it propagate.
            raise

        print(l.whoami_s())
        l.unbind_s()
@glbyers glbyers changed the title Database chaining doesn't fail to otrher servers in farm when bind fails and disallowing anonymous bind Database chaining doesn't fail to other servers in farm when bind fails and disallowing anonymous bind Mar 9, 2021
@Firstyear (Contributor)

@glbyers which version did you say you needed this fixed for?

@glbyers (Author)

glbyers commented Mar 10, 2021

> @glbyers which version did you say you needed this fixed for?

@Firstyear that would help, wouldn't it... Sorry!

We're running 1.3.10, but I did notice this bug is still relevant in all 1.4 versions too.

@Firstyear (Contributor)

I think that to fix this we'll need to add a new config option. I don't think we've done a 1.3 release in a long time ... so I'm not sure if the fix would land there. Are you doing custom 1.3.10 builds? Or using pkgs from distro?

@mreynolds389 I think we'll need a new option in the chaining db that is a boolean controlling whether we use the rootdse or the target DN as the check DN. A boolean means less surface area to test and is a bit easier to document. Alternately, we can make this a config option where we add a check-target-dn instead. I suspect if this is a new option we'll probably target 1.4.5+ or 2.x here?

@Firstyear (Contributor)

@progier389 as well, would be good to know what you think about the config if it should be boolean or free text.

@mreynolds389 (Contributor)

I'm fine adding a new config option to chaining. We don't really have many tests for chaining at this time anyway. Can someone summarize what the config option would do?

@Firstyear (Contributor)

Firstyear commented Mar 10, 2021

@mreynolds389 There are two options:

On the chaining config, an option nsPingRootDSE: true|false. This would switch the ping DN between "" (the rootDSE) and the DN of the chaining target.

The other option is nsPingDn: <dn>. This would change the ping DN to the configured DN.

I'd also happily add some chaining tests in this process :) I don't think this will be a hard issue to resolve (thanks to @glbyers' amazing research)

The only question is which version we try to land this in :)

EDIT: these new options would be put onto the chaining config itself.

@mreynolds389 (Contributor)

> The only question is which version we try to land this in :)

Well 1.3.10 is no longer maintained. We can push the fix to that branch, but it's not going to land in any "official" build.

I'm fine with this landing in Fedora 32 which is 389-ds-base-1.4.3.x

@Firstyear (Contributor)

Let's target 1.4.3 then; if @glbyers is willing to do a custom build we can do the backport there too.

I also realised I phrased my options wrong. It's OR not AND. So nsPingDn OR nsPingRootDSE. @mreynolds389

@Firstyear Firstyear added this to the 1.4.3 milestone Mar 10, 2021
@Firstyear Firstyear self-assigned this Mar 10, 2021
@glbyers (Author)

glbyers commented Mar 10, 2021

> I think that to fix this we'll need to add a new config option. I don't think we've done a 1.3 release in a long time ... so I'm not sure if the fix would land there. Are you doing custom 1.3.10 builds? Or using pkgs from distro?

@Firstyear, we run 389-ds-base in RHEL 7 (from their base repo). However, we're not running either their IPA solution or RHDS, so it is completely unsupported. We acknowledge that and have enabled anonymous binds against our masters as a workaround. We have tight ACIs, so this was an acceptable workaround for us, even if not ideal. At some point in the near future, we'll be moving to 1.4.

@Firstyear (Contributor)

Right, let's focus on 1.4 then. :)

@progier389 (Contributor)

As I said on the mailing list, I do not think that this is the right way to fix the issue.

IMHO we should keep searching for the chaining backend DN but accept return codes other than LDAP_SUCCESS,
typically LDAP_INAPPROPRIATE_AUTH and LDAP_NO_SUCH_OBJECT (to catch the nsslapd-allow-anonymous-access: off case and the ACL deny cases).

Here are some reasons (stronger than those I gave in the mail ;-)):

  1. Avoid having to manage a new config param (and make life easier for administrators)

  2. The proposed fix is incomplete:
    it still fails if nsslapd-allow-anonymous-access: off

  3. Using the chaining backend DN allows detecting that the server is unavailable if the suffix is in referral mode,
    even in the ACL deny case (but not in the nsslapd-allow-anonymous-access: off case, because the LDAP_INAPPROPRIATE_AUTH
    error is returned before mapping tree selection)

    Note: for the test case, I think we should check all combinations of nsslapd-allow-anonymous-access: off
    and ACL allow/deny read access on the backend suffix.
    The easiest way to make a server unresponsive is to suspend its process with SIGSTOP
    and resume it with SIGCONT, as sketched below.
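
A minimal sketch of that suspend/resume trick (assuming a single local ns-slapd process and that pidof is on the path; adjust for your instance):

import os
import signal
import subprocess
import time

# Find the ns-slapd PID (assumes exactly one local instance).
pid = int(subprocess.check_output(["pidof", "ns-slapd"]).split()[0])

os.kill(pid, signal.SIGSTOP)  # server stops responding; connections hang
time.sleep(60)                # long enough for the chained bind to time out
os.kill(pid, signal.SIGCONT)  # server resumes; the farm should be retried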

@Firstyear (Contributor)

This is a good thought actually @progier389. So long as we get any response, we know the server is there - which means it is alive, at the least.

I think I'll take your approach since it does not require adding more config options :) Appreciate your advice mate!

@Firstyear (Contributor)

@glbyers Can I confirm one extra detail? What is the nsmultiplexorbinddn you are using in the chaining configuration? I'm building a reproduction test case now and want to be sure I have an accurate test for your issue. Thanks!

@Firstyear (Contributor)

#4669

Started to add a rough test case here, but it's not failing, so I think I'm missing something.

@glbyers (Author)

glbyers commented Mar 12, 2021

> @glbyers Can I confirm one extra detail? What is the nsmultiplexorbinddn you are using in the chaining configuration? I'm building a reproduction test case now and want to be sure I have an accurate test for your issue. Thanks!

Hi @Firstyear. I've documented below how I configured this. In addition, you'll need to create enough I/O stress on the masters that a single BIND request eventually times out (operation timeout). Once that occurs, you'll see the issue occur.

## Disable anonymous binds;
dn: cn=config
changetype: modify
replace: nsslapd-allow-anonymous-access
nsslapd-allow-anonymous-access: rootdse

## On masters, create a dedicated user for chaining backend
dn: cn=proxyauth,cn=config
objectClass: person
objectClass: top
cn: proxyauth
sn: manager
userPassword: xxxx

## Add ACI to suffix;
dn: <suffix>
changetype: modify
add: aci
aci: (targetattr = "*")(version 3.0; acl "Proxied authorization for database links"; allow (proxy) (userdn = "ldap:///cn=proxyauth,cn=config");)

## On all consumers, create chaining backend;
dn: cn=chainbe1,cn=chaining database,cn=plugins,cn=config
objectclass: top
objectclass: extensibleObject
objectclass: nsBackendInstance
nsslapd-suffix: <suffix>
nsfarmserverurl: ldaps://<master1>:636 <master2>:636/
nsMultiplexorBindDN: cn=proxyauth,cn=config
nsMultiplexorCredentials: <bindpw>
nsCheckLocalACI: on
nsConnectionLife: 30
cn: chainbe1

## On all consumers, add the backend and repl_chain_on_update function
dn: cn="<suffix>",cn=mapping tree,cn=config
changetype: modify
add: nsslapd-backend
nsslapd-backend: chainbe1
-
add: nsslapd-distribution-plugin
nsslapd-distribution-plugin: libreplication-plugin
-
add: nsslapd-distribution-funct
nsslapd-distribution-funct: repl_chain_on_update

## On all servers, enable global password policy
dn: cn=config
changetype: modify
replace: passwordIsGlobalPolicy
passwordIsGlobalPolicy: on
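
With the above in place, one way to confirm that lockout attributes really do propagate is to fail a bind via one consumer and read the retry counter via another (a hedged sketch; hosts, suffix and the test entry are placeholders, and reading passwordRetryCount requires a suitably privileged bind):

import ldap

user = "uid=testuser,<suffix>"  # placeholder test entry

# Fail a bind against one consumer; with chain-on-update the resulting
# passwordRetryCount update is written on a master, not locally.
c1 = ldap.initialize("ldaps://<consumer1>")
try:
    c1.simple_bind_s(user, "bogus")
except ldap.INVALID_CREDENTIALS:
    pass

# Read the counter via a second consumer; replication should have
# propagated the updated lockout attributes back down from the master.
c2 = ldap.initialize("ldaps://<consumer2>")
c2.simple_bind_s("cn=Directory Manager", "<password>")
print(c2.search_s(user, ldap.SCOPE_BASE, "(objectclass=*)",
                  ["passwordRetryCount", "retryCountResetTime"]))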

@Firstyear (Contributor)

Hey @glbyers, it took me a bit, but I think I know why I was unable to reproduce this - and I have a workaround for you.

Because the call to ldap_search_ext_s sets the DN to NULL, it uses the BASE value from /etc/openldap/ldap.conf as the target DN.

This means you can either:

  • unset /etc/openldap/ldap.conf BASE, meaning that it defaults to "", or the rootdse.
  • You can set LDAPBASE in the systemd environment overrides by:
systemctl edit dirsrv@instancename

[Service]
Environment="LDAPBASE=''"

Both of these allow you to change the basedn that's targeted for this check.

@progier389 because of this, I think we should actually change the CB ping_farm code to:

  • Explicitly use the configured suffix as the target, rather than relying on environmental configuration
  • Use the configured multiplexor credentials to perform the ping farm check

Thoughts? Neither of these changes touches the configuration, but I'm not sure if this is considered too much of a "change" of behaviour.

PS: I can still add the rc == LDAP_INAPPROPRIATE_AUTH when anonymous is used.
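
In python-ldap terms, the proposed probe would look roughly like this (a sketch only; the real fix lives in the plugin's C code in cb_ping_farm, and the host, suffix and credentials are placeholders):

import ldap

l = ldap.initialize("ldaps://<master1>:636")
l.set_option(ldap.OPT_NETWORK_TIMEOUT, 5)

# Bind with the multiplexor credentials instead of anonymously...
l.simple_bind_s("cn=proxyauth,cn=config", "<bindpw>")

# ...and probe the configured chaining suffix explicitly, instead of a
# NULL base that falls back to BASE from /etc/openldap/ldap.conf.
l.search_ext_s("<suffix>", ldap.SCOPE_BASE, "(objectclass=*)",
               ["1.1"], timeout=5)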

@progier389 (Contributor)

I agree.
P.S. I do not think the rc == LDAP_INAPPROPRIATE_AUTH test is still useful, because the chained operation would then fail anyway.

Firstyear added a commit to Firstyear/389-ds-base that referenced this issue Mar 16, 2021
Bug Description: cb_ping_farm had a combination of issues that made it
possible to fail in high load or odd situations. First it used anonymous
binds instead of the same credentials as the chaining process. Second
it used a NULL search DN, meaning it would use the default BASE configured
in /etc/openldap/ldap.conf. Depending on per-site configuration this
could cause the cb_ping_farm check to fail infinitely until restart
of the instance.

Fix Description: Change chaining cb_ping_farm to bind with the same
credentials as the chaining configuration, and change the target base
dn to the DN of the suffix that we are chaining to.

fixes: 389ds#4666

Author: William Brown <william@blackhats.net.au>

Review by: ???
@Firstyear (Contributor)

@progier389 Thanks for your help with this! I've updated this PR with the suggestions we have discussed. It's probably worth @mreynolds389 being involved to decide which version we should target this PR at, as it is a behavioural change (so we may consider keeping it 2.x-only), and we now have a workaround for 1.4.x and 1.3.x.

@glbyers (Author)

glbyers commented Mar 16, 2021

> Hey @glbyers, it took me a bit, but I think I know why I was unable to reproduce this - and I have a workaround for you.
>
> Because the call to ldap_search_ext_s sets the DN to NULL, it uses the BASE value from /etc/openldap/ldap.conf as the target DN.
>
> This means you can either:
>
> * unset /etc/openldap/ldap.conf BASE, meaning that it defaults to "", or the rootdse.
>
> * You can set LDAPBASE in the systemd environment overrides by:
> systemctl edit dirsrv@instancename
>
> [Service]
> Environment="LDAPBASE=''"
>
> Both of these allow you to change the basedn that's targeted for this check.

Thanks @Firstyear. I will set LDAPBASE as an environment variable in the service & test in our dev environment. Good find!

mreynolds389 pushed a commit that referenced this issue Mar 1, 2023
#4669)

Bug Description: cb_ping_farm had a combination of issues that made it
possible to fail in high load or odd situations. First it used anonymous
binds instead of the same credentials as the chaining process. Second
it used a NULL search DN, meaning it would use the default BASE configured
in /etc/openldap/ldap.conf. Depending on per-site configuration this
could cause the cb_ping_farm check to fail infinitely until restart
of the instance.

Fix Description: Change chaining cb_ping_farm to bind with the same
credentials as the chaining configuration, and change the target base
dn to the DN of the suffix that we are chaining to.

fixes: #4666

Author: William Brown <william@blackhats.net.au>

Review by: @progier389
@mreynolds389 (Contributor)

1432752..cdf5ca5 389-ds-base-1.4.3 -> 389-ds-base-1.4.3
