
Database chaining doesn't fail to other servers in farm when bind fails and disallowing anonymous bind #4666

Closed
glbyers opened this issue Mar 9, 2021 · 20 comments

@glbyers

glbyers commented Mar 9, 2021

In our environment, we'd like to use a chaining backend to push BIND operations up to the masters by way of the consumer (rather than a client referral). We'd like to do this to ensure password lockout attributes are propagated to all consumers equally via our standard replication agreements. This is described here: https://directory.fedoraproject.org/docs/389ds/howto/howto-chainonupdate.html.

Note: we do not have hubs in our topology, just masters and consumers, so there is no intermediate chaining.

We tested this process in our environment and it worked beautifully until we took it to production. Currently we have just two masters, and both sit on over-subscribed hardware that suffers from I/O starvation (and hence very long I/O wait times) at certain times of the day. The plan is to scale out our masters eventually, but we're a little hamstrung by other projects and priorities. The starvation is generally short-lived and hits each master at a different time of day. However, it seems that once both nsfarmservers have "failed", the consumer never attempts to retry them. This leads to bind errors as follows:

ldapwhoami -x -D "<binddn>" -W
Enter LDAP Password:
ldap_bind: Operations error (1)
        additional info: FARM SERVER TEMPORARY UNAVAILABLE

Except it is not temporary. It never recovers, even though all members of nsfarmservers are now healthy again.

I tested various combinations of the chaining tuning params without success and, after further debugging, confirmed that the problem always starts after a bind operation timeout. Looking into the chaining plugin code, I see that an operation timeout results in a call to cb_ping_farm to find another available server in the pool. However, it performs this search (the comment is telling):


    /* NOTE: This will fail if we implement the ability to disable
       anonymous bind */
    rc = ldap_search_ext_s(ld, NULL, LDAP_SCOPE_BASE, "objectclass=*", attrs, 1, NULL,
                           NULL, &timeout, 1, &result);
    if (LDAP_SUCCESS != rc) {
        slapi_ldap_unbind(ld);
        cb_update_failed_conn_cpt(cb);
        return LDAP_SERVER_DOWN;
    }

So basically, because we've disallowed anonymous bind for anything but the rootdse, it will always fail to find another available server. I confirmed this by allowing anonymous bind on our masters while the issue was present; subsequent binds on the consumers then started working again.
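
To see the failure mode outside the plugin, the two probes can be reproduced with python-ldap (a minimal sketch; the host and suffix are placeholders, and it assumes the master has nsslapd-allow-anonymous-access: rootdse):

import ldap

l = ldap.initialize("ldaps://<master1>:636")  # placeholder master URL

# Probe 1: what cb_ping_farm effectively does -- an anonymous base search
# on a non-rootdse DN. With anonymous access restricted to the rootdse,
# this raises an LDAP error, so the server looks "down" to the consumer.
try:
    l.search_s("<suffix>", ldap.SCOPE_BASE, "(objectclass=*)", ["1.1"])
except ldap.LDAPError as err:
    print("suffix probe failed:", err)

# Probe 2: the same anonymous search against the rootDSE (empty base),
# which is still permitted and succeeds.
print(l.search_s("", ldap.SCOPE_BASE, "(objectclass=*)", ["1.1"]))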

I made & tested the following change in our environment to ensure the search test in cb_ping_farm always uses the rootdse, for which we allow anonymous binds (via the nsslapd-allow-anonymous-access attribute in cn=config):

diff -urN a/ldap/servers/plugins/chainingdb/cb_conn_stateless.c b/ldap/servers/plugins/chainingdb/cb_conn_stateless.c
--- a/ldap/servers/plugins/chainingdb/cb_conn_stateless.c       2020-03-17 04:52:57.000000000 +1000
+++ b/ldap/servers/plugins/chainingdb/cb_conn_stateless.c       2021-03-08 14:04:48.413647052 +1000
@@ -883,7 +883,7 @@
     /* NOTE: This will fail if we implement the ability to disable
        anonymous bind */
-    rc = ldap_search_ext_s(ld, NULL, LDAP_SCOPE_BASE, "objectclass=*", attrs, 1, NULL,
+    rc = ldap_search_ext_s(ld, "", LDAP_SCOPE_BASE, "objectclass=*", attrs, 1, NULL,
                            NULL, &timeout, 1, &result);
     if (LDAP_SUCCESS != rc) {
         slapi_ldap_unbind(ld);

My tests have all been successful.

I am running the stress tool on both our development masters to simulate I/O starvation (stress --io 1 --hdd 1 --hdd-bytes 2G), and on one of the clients, I run some simple code in a loop to trigger the original problem:

import ldap
import getpass
import random

if __name__ == "__main__":
    binddn = '<bind-id>'
    bindpw = getpass.getpass()

    while True:
        # Roughly 10% of iterations bind with a bogus password, to exercise
        # the password-lockout updates that are chained to the masters.
        r = int(random.random() * 100)
        l = ldap.initialize("ldaps://<consumer>")
        try:
            if r > 10:
                l.simple_bind_s(binddn, bindpw)
            else:
                l.simple_bind_s(binddn, "bogus")
        except ldap.INVALID_CREDENTIALS:
            continue
        except ldap.OPERATIONS_ERROR as err:
            # Stop as soon as the consumer wrongly declares the farm dead.
            if err.args[0].get('info') == 'FARM SERVER TEMPORARY UNAVAILABLE':
                raise
            print(err)
            continue
        except ldap.LDAPError:
            # Any other LDAP error is unexpected; let it propagate.
            raise

        print(l.whoami_s())
        l.unbind_s()
@glbyers glbyers changed the title Database chaining doesn't fail to otrher servers in farm when bind fails and disallowing anonymous bind Database chaining doesn't fail to other servers in farm when bind fails and disallowing anonymous bind Mar 9, 2021
@Firstyear (Contributor)

@glbyers which version did you say you needed this fixed for?

@glbyers (Author)

glbyers commented Mar 10, 2021

> @glbyers which version did you say you needed this fixed for?

@Firstyear that would help, wouldn't it... Sorry!

We're running 1.3.10, but I did notice this bug is still relevant in all 1.4 versions too.

@Firstyear (Contributor)

I think that to fix this we'll need to add a new config option. I don't think we've done a 1.3 release in a long time ... so I'm not sure if the fix would land there. Are you doing custom 1.3.10 builds? Or using pkgs from distro?

@mreynolds389 I think we'll need a new option in the chaining db that is a boolean controlling whether we use the rootdse or the target DN as the check DN. A boolean means less surface area to test and is a bit easier to document. Alternately, we can make this a config option where we add a check-target-dn instead. I suspect if this is a new option we'll probably target 1.4.5+ or 2.x here?

@Firstyear (Contributor)

@progier389 as well, would be good to know what you think about the config if it should be boolean or free text.

@mreynolds389 (Contributor)

I'm fine adding a new config option to chaining. We don't really have many tests for chaining at this time anyway. Can someone summarize what the config option would do?

@Firstyear (Contributor)

Firstyear commented Mar 10, 2021

@mreynolds389 There are two options:

On the chaining config, an option nsPingRootDSE: true|false. This would switch the ping DN between "" (the rootDSE) and the DN of the chaining target.

The other option is nsPingDn: <dn>. This would change the ping DN to the configured DN.

I'd also happily add some chaining tests in this process :) I don't think this will be a hard issue to resolve (thanks to @glbyers' amazing research)

The only question is which version we try to land this in :)

EDIT: these new options would be put onto the chaining config itself.

@mreynolds389 (Contributor)

> The only question is which version we try to land this in :)

Well 1.3.10 is no longer maintained. We can push the fix to that branch, but it's not going to land in any "official" build.

I'm fine with this landing in Fedora 32 which is 389-ds-base-1.4.3.x

@Firstyear (Contributor)

Let's target 1.4.3 then; if @glbyers is willing to do a custom build we can do the backport there too.

I also realised I phrased my options wrong. It's OR not AND. So nsPingDn OR nsPingRootDSE. @mreynolds389

@Firstyear Firstyear added this to the 1.4.3 milestone Mar 10, 2021
@Firstyear Firstyear self-assigned this Mar 10, 2021
@glbyers (Author)

glbyers commented Mar 10, 2021

> I think that to fix this we'll need to add a new config option. I don't think we've done a 1.3 release in a long time ... so I'm not sure if the fix would land there. Are you doing custom 1.3.10 builds? Or using pkgs from distro?

@Firstyear, we run 389-ds-base in RHEL 7 (from their base repo). However, we're not running either their IPA solution or RHDS, so it is completely unsupported. We acknowledge that and have enabled anonymous binds against our masters as a workaround. We have tight ACIs, so this was an acceptable workaround for us, even if not ideal. At some point in the near future, we'll be moving to 1.4.

@Firstyear (Contributor)

Right, let's focus on 1.4 then. :)

@progier389 (Contributor)

As I said on the mailing list, I do not think that this is the right way to fix the issue.

IMHO we should keep searching for the chaining backend DN but accept return codes other than LDAP_SUCCESS,
typically LDAP_INAPPROPRIATE_AUTH and LDAP_NO_SUCH_OBJECT (to catch the nsslapd-allow-anonymous-access: off case and the ACL deny cases).

Here are some reasons (stronger than those I gave in the mail ;-)):

  1. Avoid having to manage a new config param (and make life easier for administrators)

  2. The proposed fix is incomplete:
    it still fails if nsslapd-allow-anonymous-access: off

  3. Using the chaining backend DN allows detecting that the server is unavailable if the suffix is in referral mode,
    even in the ACL deny case (but not in the nsslapd-allow-anonymous-access: off case, because the LDAP_INAPPROPRIATE_AUTH
    error is returned before mapping tree selection)

    Note: for the test case, I think we should check all combinations of nsslapd-allow-anonymous-access: off
    and ACL allow/deny read access on the backend suffix.
    The easiest way to make a server unresponsive is to suspend its process with SIGSTOP
    and resume it with SIGCONT, as sketched below.
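
A minimal sketch of that suspend/resume trick (assuming a single local ns-slapd process and that pidof is on the path; adjust for your instance):

import os
import signal
import subprocess
import time

# Find the ns-slapd PID (assumes exactly one local instance).
pid = int(subprocess.check_output(["pidof", "ns-slapd"]).split()[0])

os.kill(pid, signal.SIGSTOP)  # server stops responding; connections hang
time.sleep(60)                # long enough for the chained bind to time out
os.kill(pid, signal.SIGCONT)  # server resumes; the farm should be retried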

@Firstyear (Contributor)

This is a good thought actually @progier389. So long as we get any response, we know the server is there - which means it is alive, at the least.

I think I'll take your approach since it does not require adding more config options :) Appreciate your advice mate!

@Firstyear (Contributor)

@glbyers Can I confirm one extra detail? What is the nsmultiplexorbinddn you are using in the chaining configuration? I'm building a reproduction test case now and want to be sure I have an accurate test for your issue. Thanks!

@Firstyear (Contributor)

#4669

Started to add a rough test case here, but it's not failing, so I think I'm missing something.

@glbyers (Author)

glbyers commented Mar 12, 2021

> @glbyers Can I confirm one extra detail? What is the nsmultiplexorbinddn you are using in the chaining configuration? I'm building a reproduction test case now and want to be sure I have an accurate test for your issue. Thanks!

Hi @Firstyear. I've documented below how I configured this. In addition, you'll need to create enough I/O stress on the masters that a single BIND request eventually times out (operation timeout). Once that occurs, you'll see the issue occur.

## Disable anonymous binds;
dn: cn=config
changetype: modify
replace: nsslapd-allow-anonymous-access
nsslapd-allow-anonymous-access: rootdse

## On masters, create a dedicated user for chaining backend
dn: cn=proxyauth,cn=config
objectClass: person
objectClass: top
cn: proxyauth
sn: manager
userPassword: xxxx

## Add ACI to suffix;
dn: <suffix>
changetype: modify
add: aci
aci: (targetattr = "*")(version 3.0; acl "Proxied authorization for database links"; allow (proxy) (userdn = "ldap:///cn=proxyauth,cn=config");)

## On all consumers, create chaining backend;
dn: cn=chainbe1,cn=chaining database,cn=plugins,cn=config
objectclass: top
objectclass: extensibleObject
objectclass: nsBackendInstance
nsslapd-suffix: <suffix>
nsfarmserverurl: ldaps://<master1>:636 <master2>:636/
nsMultiplexorBindDN: cn=proxyauth,cn=config
nsMultiplexorCredentials: <bindpw>
nsCheckLocalACI: on
nsConnectionLife: 30
cn: chainbe1

## On all consumers, add the backend and repl_chain_on_update function
dn: cn="<suffix>",cn=mapping tree,cn=config
changetype: modify
add: nsslapd-backend
nsslapd-backend: chainbe1
-
add: nsslapd-distribution-plugin
nsslapd-distribution-plugin: libreplication-plugin
-
add: nsslapd-distribution-funct
nsslapd-distribution-funct: repl_chain_on_update

## On all servers, enable global password policy
dn: cn=config
changetype: modify
replace: passwordIsGlobalPolicy
passwordIsGlobalPolicy: on
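
With the above in place, one way to confirm that lockout attributes really do propagate is to fail a bind via one consumer and read the retry counter via another (a hedged sketch; hosts, suffix and the test entry are placeholders, and reading passwordRetryCount requires a suitably privileged bind):

import ldap

user = "uid=testuser,<suffix>"  # placeholder test entry

# Fail a bind against one consumer; with chain-on-update the resulting
# passwordRetryCount update is written on a master, not locally.
c1 = ldap.initialize("ldaps://<consumer1>")
try:
    c1.simple_bind_s(user, "bogus")
except ldap.INVALID_CREDENTIALS:
    pass

# Read the counter via a second consumer; replication should have
# propagated the updated lockout attributes back down from the master.
c2 = ldap.initialize("ldaps://<consumer2>")
c2.simple_bind_s("cn=Directory Manager", "<password>")
print(c2.search_s(user, ldap.SCOPE_BASE, "(objectclass=*)",
                  ["passwordRetryCount", "retryCountResetTime"]))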

@Firstyear (Contributor)

Hey @glbyers, it took me a bit, but I think I know why I was unable to reproduce this - and I have a workaround for you.

Because the call to ldap_search_ext_s sets the DN to NULL, it uses the BASE value from /etc/openldap/ldap.conf as the target DN.

This means you can either:

  • unset /etc/openldap/ldap.conf BASE, meaning that it defaults to "", or the rootdse.
  • You can set LDAPBASE in the systemd environment overrides by:
systemctl edit dirsrv@instancename

[Service]
Environment="LDAPBASE=''"

Both of these allow you to change the basedn that's targeted for this check.

@progier389 because of this, I think we should actually change the CB ping_farm code to:

  • Explicitly use the configured suffix as the target, rather than relying on environmental configuration
  • Use the configured multiplexor credentials to perform the ping farm check

Thoughts? Neither of these changes touches the configuration, but I'm not sure if this is considered too much of a "change" of behaviour.

PS: I can still add the rc == LDAP_INAPPROPRIATE_AUTH when anonymous is used.
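
In python-ldap terms, the proposed probe would look roughly like this (a sketch only; the real fix lives in the plugin's C code in cb_ping_farm, and the host, suffix and credentials are placeholders):

import ldap

l = ldap.initialize("ldaps://<master1>:636")
l.set_option(ldap.OPT_NETWORK_TIMEOUT, 5)

# Bind with the multiplexor credentials instead of anonymously...
l.simple_bind_s("cn=proxyauth,cn=config", "<bindpw>")

# ...and probe the configured chaining suffix explicitly, instead of a
# NULL base that falls back to BASE from /etc/openldap/ldap.conf.
l.search_ext_s("<suffix>", ldap.SCOPE_BASE, "(objectclass=*)",
               ["1.1"], timeout=5)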

@progier389 (Contributor)

I agree.
P.S. I do not think the rc == LDAP_INAPPROPRIATE_AUTH test is still useful, because the chained operation would then fail anyway.

Firstyear added a commit to Firstyear/389-ds-base that referenced this issue Mar 16, 2021
Bug Description: cb_ping_farm had a combination of issues that made it
possible to fail in high load or odd situations. First it used anonymous
binds instead of the same credentials as the chaining process. Second
it used a NULL search DN, meaning it would use the default BASE configured
in /etc/openldap/ldap.conf. Depending on per-site configuration this
could cause the cb_ping_farm check to fail infinitely until restart
of the instance.

Fix Description: Change chaining cb_ping_farm to bind with the same
credentials as the chaining configuration, and change the target base
dn to the DN of the suffix that we are chaining to.

fixes: 389ds#4666

Author: William Brown <william@blackhats.net.au>

Review by: ???
@Firstyear (Contributor)

@progier389 Thanks for your help with this! I've updated this PR with the suggestions we have discussed. It's probably worth @mreynolds389 being involved to decide which version we should target this PR at, as it is a behavioural change (so we may consider keeping it 2.x-only), and we now have a workaround for 1.4.x and 1.3.x.

@glbyers (Author)

glbyers commented Mar 16, 2021

> Hey @glbyers, it took me a bit, but I think I know why I was unable to reproduce this - and I have a workaround for you.
>
> Because the call to ldap_search_ext_s sets the DN to NULL, it uses the BASE value from /etc/openldap/ldap.conf as the target DN.
>
> This means you can either:
>
> * unset /etc/openldap/ldap.conf BASE, meaning that it defaults to "", or the rootdse.
>
> * You can set LDAPBASE in the systemd environment overrides by:
> systemctl edit dirsrv@instancename
>
> [Service]
> Environment="LDAPBASE=''"
>
> Both of these allow you to change the basedn that's targeted for this check.

Thanks @Firstyear. I will set LDAPBASE as an environment variable in the service & test in our dev environment. Good find!

mreynolds389 pushed a commit that referenced this issue Mar 1, 2023
#4669)

Bug Description: cb_ping_farm had a combination of issues that made it
possible to fail in high load or odd situations. First it used anonymous
binds instead of the same credentials as the chaining process. Second
it used a NULL search DN, meaning it would use the default BASE configured
in /etc/openldap/ldap.conf. Depending on per-site configuration this
could cause the cb_ping_farm check to fail infinitely until restart
of the instance.

Fix Description: Change chaining cb_ping_farm to bind with the same
credentials as the chaining configuration, and change the target base
dn to the DN of the suffix that we are chaining to.

fixes: #4666

Author: William Brown <william@blackhats.net.au>

Review by: @progier389
@mreynolds389 (Contributor)

1432752..cdf5ca5 389-ds-base-1.4.3 -> 389-ds-base-1.4.3
