New failover implementation #8566
Code Review
This pull request implements a new failover mechanism for SSSD, introducing prioritized server groups, parallelized candidate server discovery, and a transaction-based API for automated retries. It also provides a minimal provider implementation to demonstrate the new architecture. Critical logic bugs were identified in the server group resolution logic, where duplicate detection causes premature loop exit, and in the address change detection function, which currently returns inverted results.
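The two classes of bug named in the summary can be illustrated with a small standalone sketch. This is not the code from the pull request: every name below (`example_collect_unique`, `example_addr_changed`) is hypothetical, and the data types are simplified to plain strings and byte buffers. It only shows the intended behavior: a duplicate should skip one entry rather than end the loop, and a "changed" predicate should return true when the values differ.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copy each distinct string from in[] to out[],
 * preserving order. Returns the number of entries written to out[]. */
static size_t example_collect_unique(const char **in, size_t n,
                                     const char **out)
{
    size_t count = 0;

    for (size_t i = 0; i < n; i++) {
        bool duplicate = false;

        for (size_t j = 0; j < count; j++) {
            if (strcmp(in[i], out[j]) == 0) {
                duplicate = true;
                break; /* leaves only the inner search loop */
            }
        }

        if (duplicate) {
            /* Skip this entry only. Using `break` here instead would
             * silently drop every entry after the first duplicate. */
            continue;
        }

        out[count++] = in[i];
    }

    return count;
}

/* Hypothetical helper: true when the two raw addresses differ.
 * Both pointers are assumed to be non-NULL. */
static bool example_addr_changed(const void *old_addr, size_t old_len,
                                 const void *new_addr, size_t new_len)
{
    if (old_len != new_len) {
        return true;
    }

    /* memcmp() returns 0 for equal buffers, so "changed" is `!= 0`.
     * Writing `== 0` here would invert the result. */
    return memcmp(old_addr, new_addr, old_len) != 0;
}
```

With the input `{"a", "b", "a", "c"}`, the first helper keeps `a`, `b` and `c`; a version that breaks out of the outer loop on the duplicate would stop after `b`.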
| #include "providers/failover/ldap/failover_ldap.h" | ||
|
|
||
| static errno_t | ||
| find_password_expiration_attributes(TALLOC_CTX *mem_ctx, |
Check warning
Code scanning / CodeQL
Poorly documented large function Warning
| switch (ar->entry_type & BE_REQ_TYPE_MASK) { | ||
| case BE_REQ_SERVICES: | ||
| DEBUG(SSSDBG_TRACE_FUNC, "Executing BE_REQ_SERVICES request\n"); | ||
|
|
||
| subreq = minimal_services_get_send(state, be_ctx->ev, fctx, id_ctx, | ||
| sdom, ar->filter_value, | ||
| ar->extra_value, ar->filter_type, | ||
| noexist_delete); | ||
| break; | ||
| default: /*fail*/ | ||
| ret = EINVAL; | ||
| state->err = "Invalid request type"; | ||
| DEBUG(SSSDBG_OP_FAILURE, | ||
| "Unexpected request type: 0x%X [%s:%s] in %s\n", | ||
| ar->entry_type, ar->filter_value, | ||
| ar->extra_value?ar->extra_value:"-", | ||
| ar->domain); | ||
| goto done; | ||
| } |
Check notice
Code scanning / CodeQL
No trivial switch statements Note
| switch (state->ar->entry_type & BE_REQ_TYPE_MASK) { | ||
| case BE_REQ_SERVICES: | ||
| err = "Service lookup failed"; | ||
| ret = minimal_services_get_recv(subreq); | ||
| break; | ||
| default: /* fail */ | ||
| ret = EINVAL; | ||
| break; | ||
| } |
Check notice
Code scanning / CodeQL
No trivial switch statements Note
| // TODO handle how to yield ERR_SERVER_FAILED | ||
| // ret = sdap_id_op_done(state->op, ret, &dp_error); | ||
| // if (dp_error == DP_ERR_OK && ret != EOK) { | ||
| // /* retry */ | ||
| // ret = minimal_services_get_retry(req); | ||
| // if (ret != EOK) { | ||
| // tevent_req_error(req, ret); | ||
| // return; | ||
| // } |
Check notice
Code scanning / CodeQL
Commented-out code Note
| // /* Return to the mainloop to retry */ | ||
| // return; | ||
| // } | ||
| // state->sdap_ret = ret; |
Check notice
Code scanning / CodeQL
Commented-out code Note
|
@pbrezina, is it expected CI fails to build? |
0570a63 to 2a2c475
|
@pbrezina, |
|
FreeBSD CI doesn't have required headers installed: While 'minimal' isn't going to be merged in main repo branches, this 'fail to build' can hide other issues. |
1fe8626 to ed511ad
…ider (cherry picked from commit 0f5f3b6)
so it can be directly modified
…can easily pass new fctx
So it can be modified later.
…kup and user authentication
…pec file Add the sssd-minimal provider package to the spec file following the same pattern as other providers (ldap, ipa, ad, etc.). This packages the libsss_minimal.so library that was added in recent commits. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
|
Now it is fixed. There were missing headers in noinst_HEADERS and some other problems. I reordered the commits, and every commit that is for testing only is clearly marked to not go to master. Reviewers need to pay attention only to the "failover" commit; the other commits are just for testing and demonstration. |
And also disable codeql for the minimal provider. The provider is for testing only, it does not make sense to fix any issue there.
This crafts and implements the new failover interface; it does not provide a complete implementation of the failover mechanism yet. It brings the code to a state where the public and private interfaces are stable, working and testable, so the following tasks can be split and worked on in parallel.

What is missing at this stage:
- server configuration and discovery (failover_server_group/batch/vtable_op)
- server selection mechanism (sss_failover_vtable_op_server_next)
- kerberos authentication
- sharing servers between IPA/AD LDAP and KDC
- online/offline callbacks (resolve callback should not be needed)

But, importantly, it is now possible to start refactoring SSSD code to use the new failover implementation.
|
@pbrezina, would it be difficult to include a 'system' test using "minimal" provider and covering any failover scenario? If it's difficult then disregard as test would be discarded eventually. |
|
|
||
| ### Failover Context | ||
|
|
||
| * [sss_failover.c]() |
Actual files are named without 'sss_' prefix.
Periodic refreshes are also not yet implemented, right? |
| } | ||
|
|
||
| /* Switch the attempt_req state to caller_req state so it is used seamlessly | ||
| * by the user. This is quite a hack and the attempt_state must stay |
Indeed. No other reasonable way?
| errno_t ret; | ||
|
|
||
| state = tevent_req_data(req, struct sss_failover_transaction_state); | ||
| state->attempts++; |
Would it make sense to limit attempts?
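One possible shape for such a limit, as a minimal sketch only. The cap value, the `EIO` error code, and all `example_*` names are assumptions made for illustration; the pull request defines none of them, and the real `struct sss_failover_transaction_state` carries more than a counter.

```c
#include <errno.h>

/* Hypothetical cap; the pull request does not define one. */
#define EXAMPLE_MAX_ATTEMPTS 3

/* Hypothetical stand-in for the attempt counter kept in
 * struct sss_failover_transaction_state. */
struct example_transaction_state {
    unsigned int attempts;
};

/* Account for one more attempt.
 * Returns 0 if the attempt may proceed, or EIO (a placeholder error
 * code) once the cap has been reached. The counter is not incremented
 * past the cap. */
static int
example_transaction_next_attempt(struct example_transaction_state *state)
{
    if (state->attempts >= EXAMPLE_MAX_ATTEMPTS) {
        return EIO;
    }

    state->attempts++;
    return 0;
}
```

Checking the counter before incrementing keeps `attempts` equal to the number of attempts actually started, which is convenient for debug messages.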
| } | ||
|
|
||
| void | ||
| sss_failover_server_mark_reachable(struct sss_failover_server *srv) |
Looks like it's not used anywhere.
Shouldn't it be called from sss_failover_ping_done()?
|
|
||
| state->current_group++; | ||
| ret = sss_failover_refresh_candidates_group_next(req); | ||
| if (ret != EOK) { |
I think it should also return for ret == EOK as well?
Otherwise it will be tevent_req_done(req); below?
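To make the concern concrete, here is a simplified model of the control flow, not the actual function from the pull request. The `example_*` names and the `struct example_req` fields are hypothetical stand-ins for the tevent request and for `sss_failover_refresh_candidates_group_next()`; the point is only that the callback returns after successfully starting the next group, so the request is not marked done prematurely.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical, simplified model of a request. */
struct example_req {
    bool done;        /* request finished successfully */
    int error;        /* non-zero if the request finished with an error */
    int groups_left;  /* server groups not yet tried */
};

/* Hypothetical stand-in for starting work on the next group.
 * Returns 0 if a group was started, ENOENT if none is left. */
static int example_group_next(struct example_req *req)
{
    if (req->groups_left == 0) {
        return ENOENT;
    }

    req->groups_left--;
    return 0;
}

/* Hypothetical completion callback for one group. */
static void example_group_done(struct example_req *req, bool found_candidates)
{
    int ret;

    if (!found_candidates) {
        ret = example_group_next(req);
        if (ret != 0) {
            req->error = ret;
            return;
        }

        /* The next group is now in progress and its own callback will
         * finish the request. Returning here avoids falling through
         * to the success path below. */
        return;
    }

    req->done = true;
}
```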
| bool addr_changed, | ||
| bool reuse_connection, | ||
| bool authenticate_connection, | ||
| bool read_rootdse, |
Those are pretty much LDAP specific.
Is this a good fit for abstract vtable API?
|
I did only an overview / preliminary round and am not at all sure whether any of my comments are valid. |
This pull request is intended to be a start of a "failover" feature branch where other developers will be able to contribute.
The main failover logic works, compiles, and can be tested using a "minimal" provider that is included as an example. The purpose of the "minimal" provider is only to test the failover without the need to port full provider code, and it will be removed prior to pushing the contents to the master branch. See how to set it up in `minimal-provider-notes.txt` and see the switch to new failover in commit `minimal: switch to new failover for service lookup and user authentication` - this is the minimal set of changes to get it working, but the real port should get, and will require, more refactoring.

The work is still not finished and there is missing functionality. This functionality, however, can be implemented in small areas of code and should not require larger changes or glue across the whole code base, so this is ready for review. Remaining work is tracked at [1]. Feel free to take any of these tickets and open new tickets when you find something missing.

When reviewing, you can start with `src/providers/failover/readme.md`, which provides high-level documentation of the code. And of course do not forget the design page [2].

Thanks, Pavel