Database chaining doesn't fail over to other servers in the farm when a bind fails and anonymous bind is disallowed #4666
@glbyers which version did you say you needed this fixed for?
@Firstyear that would help, wouldn't it.. Sorry! We're running 1.3.10, but I did notice this bug is still relevant in all 1.4 versions too.
I think that to fix this we'll need to add a new config option. I don't think we've done a 1.3 release in a long time ... so I'm not sure the fix would land there. Are you doing custom 1.3.10 builds, or using pkgs from your distro? @mreynolds389 I think we'll need a new option in the chaining db: a boolean of whether we should use the rootdse or the target dn as the check dn. A boolean means less surface area to test and is a bit easier to document. Alternately, we can make this a config option where we add a check-target-dn instead. I suspect if this is a new option we'll probably target 1.4.5+ or 2.x here?
@progier389 as well, would be good to know what you think about the config - whether it should be boolean or free text.
I'm fine adding a new config option to chaining. We don't really have many tests for chaining at this time anyway. Can someone summarize what the config option would do?
@mreynolds389 There are two options: on the chaining config, an option of nsPingRootDSE. The other option is nsPingDn. I'd also happily add some chaining tests in this process :) I don't think this will be a hard issue to resolve (thanks to @glbyers' amazing research). The only question is which version we try to land this in :) EDIT: these new options would be put onto the chaining config itself.
Well 1.3.10 is no longer maintained. We can push the fix to that branch, but it's not going to land in any "official" build. I'm fine with this landing in Fedora 32, which is 389-ds-base-1.4.3.x
Let's target 1.4.3 then; if @glbyers is willing to do a custom build, we can do the backport there too. I also realised I phrased my options wrong: it's OR, not AND. So nsPingDn OR nsPingRootDSE. @mreynolds389
@Firstyear, we run 389-ds-base in rhel7 (from their base repo). However, we're not running either their IPA solution or RHDS, so it is completely unsupported. We acknowledge that and have enabled anonymous binds against our masters as a workaround. We have tight ACIs, so this was an acceptable workaround for us, even if not ideal. At some point in the near future, we'll be moving to 1.4 |
Right, let's focus on 1.4 then. :)
As I said on the mailing list, I do not think that is the right way to fix this issue. IMHO we should keep searching for the chaining backend DN but accept return codes other than LDAP_SUCCESS. Here are some reasons (stronger than those I gave in the mail ;-)):
This is a good thought actually @progier389. So long as we get any response, we know the server is there - so that would mean the server is alive at the least. I think I'll take your approach since it does not require adding more config options :) Appreciate your advice, mate!
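The idea agreed on above - any LDAP-level answer proves the farm server is alive, even if the result code is an error - can be sketched as follows. This is a minimal illustration in Python rather than the server's C; the `farm_is_alive` name and the exact partition of result codes are assumptions for illustration, not the actual cb_ping_farm patch.

```python
# Sketch: judge farm-server liveness from the result code of the ping search.
# Standard LDAP result codes (RFC 4511):
LDAP_SUCCESS = 0
LDAP_NO_SUCH_OBJECT = 32
LDAP_INAPPROPRIATE_AUTH = 48

# OpenLDAP client-library (API) codes, meaning no answer arrived at all:
LDAP_SERVER_DOWN = -1
LDAP_TIMEOUT = -5
LDAP_CONNECT_ERROR = -11

def farm_is_alive(rc: int) -> bool:
    """A server that returned *any* LDAP result code answered us, so it is
    alive; only transport-level failures mean the farm server is down."""
    return rc not in (LDAP_SERVER_DOWN, LDAP_TIMEOUT, LDAP_CONNECT_ERROR)
```

Under this rule, a farm server that rejects the anonymous ping with err=48 (inappropriateAuthentication) is still counted as available, which is exactly the failure mode reported in this issue.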
@glbyers Can I confirm one extra detail? What is the nsmultiplexorbinddn you are using in the chaining configuration? I'm building a reproduction test case now and want to be sure I have an accurate test for your issue. Thanks!
Started to add a rough test case here, but it's not failing, so I think I'm missing something.
Hi @Firstyear. I've documented below how I configured this. In addition, you'll need to create enough I/O stress on the masters that a single BIND request eventually times out (operation timeout). Once that occurs, you'll note the issue occurring.
Hey @glbyers, it took me a bit, but I think I know why I was unable to reproduce this - and I have a workaround for you. Because the call to ldap_search_ext_s sets the DN to NULL, it uses the BASE value from /etc/openldap/ldap.conf as the target DN to access. This means you can either:
Both of these allow you to change the basedn that's targeted for this check. @progier389 because of this, I think we should actually change the CB ping_farm code to:
Thoughts? Neither of these changes touches the configuration, but I'm not sure if this is considered too much of a "change" of behaviour. PS: I can still add the rc == LDAP_INAPPROPRIATE_AUTH check for when anonymous is used.
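The proposed reshape of the ping logic - bind with the chaining configuration's own credentials and probe the chained suffix rather than a NULL base - can be illustrated like this. This is a Python sketch under stated assumptions, not the real C change: the `Connection` methods stand in for libldap calls, and the field names mirror (but are not) the actual chaining attributes.

```python
from dataclasses import dataclass

LDAP_SUCCESS = 0
LDAP_SERVER_DOWN = -1  # OpenLDAP API code: no answer at all

@dataclass
class ChainingConfig:
    farm_url: str
    suffix: str          # the chained suffix, e.g. "dc=example,dc=com"
    binddn: str          # cf. nsmultiplexorbinddn
    credentials: str     # cf. nsmultiplexorcredentials

def ping_farm(conn, cfg: ChainingConfig) -> bool:
    """Return True if the farm server answered the probe at all."""
    # Bind as the multiplexor identity rather than anonymously, so servers
    # that disallow anonymous access still answer the health check.
    rc = conn.simple_bind(cfg.binddn, cfg.credentials)
    if rc == LDAP_SERVER_DOWN:
        return False
    # Probe the suffix we actually chain to, not a NULL base that libldap
    # would silently rewrite to the BASE from /etc/openldap/ldap.conf.
    rc = conn.search_base(cfg.suffix, "(objectClass=*)")
    # Any LDAP-level answer, success or error, proves the server is up.
    return rc != LDAP_SERVER_DOWN
```

The key design point is that both the identity and the target DN now come from the chaining configuration itself, so the check no longer depends on per-host openldap client defaults.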
I agree.
Bug Description: cb_ping_farm had a combination of issues that made it possible to fail in high load or odd situations. First, it used anonymous binds instead of the same credentials as the chaining process. Second, it used a NULL search DN, meaning it would use the default BASE configured in /etc/openldap/ldap.conf. Depending on per-site configuration, this could cause the cb_ping_farm check to fail infinitely until restart of the instance. Fix Description: Change chaining cb_ping_farm to bind with the same credentials as the chaining configuration, and change the target base dn to the DN of the suffix that we are chaining to. fixes: 389ds#4666 Author: William Brown <william@blackhats.net.au> Review by: ???
@progier389 Thanks for your help with this! I've updated this PR with the suggestions we have discussed. It's probably worth @mreynolds389 being involved to decide which version we should target this PR into, as it is a behavioural change (so we may consider keeping it 2.x only), given we now have a workaround for 1.4.x and 1.3.x.
Thanks @Firstyear. I will set LDAPBASE as an environment variable in the service and test in our dev environment. Good find!
#4669) Bug Description: cb_ping_farm had a combination of issues that made it possible to fail in high load or odd situations. First, it used anonymous binds instead of the same credentials as the chaining process. Second, it used a NULL search DN, meaning it would use the default BASE configured in /etc/openldap/ldap.conf. Depending on per-site configuration, this could cause the cb_ping_farm check to fail infinitely until restart of the instance. Fix Description: Change chaining cb_ping_farm to bind with the same credentials as the chaining configuration, and change the target base dn to the DN of the suffix that we are chaining to. fixes: #4666 Author: William Brown <william@blackhats.net.au> Review by: @progier389
In our environment, we'd like to use a chaining backend to push BIND operations up to masters by way of the consumer (rather than client referral). We'd like to do this to ensure password lockout attributes are propagated to all consumers equally via our standard replication agreements. This is described here - https://directory.fedoraproject.org/docs/389ds/howto/howto-chainonupdate.html.
NOTE, we do not have hubs in our topology. Just masters and consumers, so no intermediate chaining.
We tested this process in our environment and it worked beautifully until we took it to production. Currently, we have just 2 masters, and they both sit on some over-subscribed hardware that suffers from I/O starvation at certain times of the day. The plan is to scale out our masters eventually, but we're a little hamstrung by other projects and priorities. Chaining worked extremely well until the time of day when the masters suffered I/O starvation and, hence, very long I/O wait times. This is generally short-lived and happens at alternate times of the day for each master. However, it seems that once both nsfarmservers have "failed", the consumer never attempts to retry them. This leads to bind errors as follows:
Except it is not temporary. It never recovers, even though all members of nsfarmservers are now healthy again.
I tested various combinations of the chaining tuning params without success, and after further debugging, confirmed that it always starts after a bind operation timeout. Looking into the chaining plugin code, I see that an operation timeout results in a call to cb_ping_farm to see if we can find another available server in the pool. However, it performs this operation (the comment is telling):
So basically, because we've disallowed anonymous bind for anything but the rootDSE, it will always fail to find another available server. I confirmed this by allowing anonymous bind on our masters while the issue was present; subsequent binds on the consumers then started working again.
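The interaction described above - an anonymous search that only succeeds against the rootDSE - can be captured in a toy model. This is an illustrative assumption-laden sketch, not server code: it models the three documented values of nsslapd-allow-anonymous-access and treats an empty base DN as the rootDSE.

```python
# Toy model of how nsslapd-allow-anonymous-access gates an anonymous search,
# showing why the old anonymous ping fails everywhere except the rootDSE.

LDAP_SUCCESS = 0
LDAP_INAPPROPRIATE_AUTH = 48  # "Anonymous access is not allowed"

def anon_search_result(policy: str, base_dn: str) -> int:
    """policy is 'on', 'off', or 'rootdse'; base_dn '' is the rootDSE."""
    if policy == "on":
        return LDAP_SUCCESS
    if policy == "rootdse" and base_dn == "":
        return LDAP_SUCCESS
    return LDAP_INAPPROPRIATE_AUTH
```

With policy set to `rootdse`, the ping's anonymous search against the suffix gets err=48, so cb_ping_farm concludes every farm server is unavailable, while a rootDSE probe (or any non-anonymous probe) would have answered.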
I made and tested the following change in our environment to ensure the search test in ns_farm_ping always uses the rootDSE, for which we allow anonymous binds (via the nsslapd-allow-anonymous-access attribute in cn=config):
My tests have all been successful.
I am running the stress tool on both our development masters to simulate I/O starvation (stress --io 1 --hdd 1 --hdd-bytes 2G), and on one of the clients, I run some simple code in a loop to trigger the original problem: