After cleanALLruv, replication is looping on keep alive DEL #2522
Comments
Comment from tbordaz (@tbordaz) at 2017-11-22 12:23:17 |
Comment from tbordaz (@tbordaz) at 2017-11-22 12:23:18 Metadata Update from @tbordaz:
|
Comment from tbordaz (@tbordaz) at 2017-11-22 12:31:24 Note that the problem is not fatal for replication but can create long replication delays. If one replica does not receive direct updates, the DEL of the keep-alive entry will remain its maxcsn. In conclusion, the problem may delay replication. |
Comment from tbordaz (@tbordaz) at 2017-11-22 12:33:50 Some possibilities to fix it:
|
Comment from firstyear (@Firstyear) at 2017-11-22 14:01:50 I think that perhaps the second is the better idea, because in general if we have a non-keep-alive entry that's deleted twice we'll hit this too. Batch deletes could be worse. So I think we should have the replicated DEL NOOP able to write to the RUV, because we really have processed that changelog up to that CSN, and we should not keep replaying it. To put it a different way: if this were another operation type and we needed to NOOP, would we let it replay over and over? I think we would let it write the RUV. I think that option one is a "duct tape" solution for a problem that could occur in other ways, which is why I suggest we fix it properly as above. However, if we did points 1 and 2 I would also be happy with this solution .... Finally, I think we should delete the keep-alive entry, because we are saying "the master is gone, let's remove it". So deleting the keep-alive is correct. Hope that helps! |
Comment from tbordaz (@tbordaz) at 2017-11-22 14:15:43 @Firstyear thanks for your comments. Allowing a NOOP to update the RUV looks like the most generic solution. I think it will work, but it is more difficult to evaluate the impacts. The two other solutions are more specific to cleanAllRuv. |
Comment from lkrispen (@elkris) at 2017-11-22 14:21:13 The problem is not specific to DEL, nor to cleanallruv. The other scenario I know of is after total init. There are cases where the keepalive entry is contained in the total update, but it is sent again after the init is complete. Also there can be MODs to entries before they were sent; those MODs will also be replayed after the init. About the other solutions: the keep-alive DEL is triggered by cleanallruv inside the MMR plugin, so the FIXUP flag could be a solution to prevent it from being replicated, similar to how we prevent replicating the delete of tombstones. I agree that it should be removed since it is an artefact of a removed master. One option could be to delete it only on the server where the cleanallruv task is started. I think we can distinguish between a cleanallruv started directly and one received as an extop. |
Comment from lkrispen (@elkris) at 2017-11-22 14:27:33 @tbordaz Just read your comment after sending mine. I think you have a good point: we should avoid having many DELs, even if we could manage the repeated-replay problem with a NOOP mechanism (we should investigate this independently). So your idea is interesting. Whether receiving or starting a cleanallruv, delete the keepalive immediately. It will get a CSN with a RID that will not be removed, and it will be replicated. |
Comment from tbordaz (@tbordaz) at 2017-11-24 08:23:55 No conclusion yet for the final fix, but there are multiple possibilities to fix it. Regarding a workaround: a simple one is to do, after a cleanAllRuv (or a series of cleanAllRuvs), a dummy update on each replica. The dummy update (not filtered by fractional replication) will have a CSN larger than the CSN of the keep-alive DEL, so the next replication session will start from this dummy update. |
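For reference, such a dummy update could look like the following LDIF, applied with `ldapmodify` against each replica. This is only an illustration: the suffix `dc=example,dc=com` and the choice of the `description` attribute are assumptions — use your own suffix and any attribute that is not excluded by your fractional replication configuration.

```ldif
# Hypothetical dummy update after cleanAllRuv (adjust DN and attribute
# to your deployment; the attribute must not be fractionally excluded)
dn: dc=example,dc=com
changetype: modify
replace: description
description: dummy update after cleanAllRuv
```

Applying this once per replica gives each replica a fresh maxcsn above the keep-alive DEL, so subsequent replication sessions no longer start from the looping DEL.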
Comment from firstyear (@Firstyear) at 2017-11-24 09:01:07
Yes, I completely agree with this.
I think this is also a good suggestion. Can I suggest that the fix be:
I think this is a nice, clean and complete solution to the situation. :) |
Comment from lkrispen (@elkris) at 2017-11-24 09:23:58
I agree with this, but I would split it.
|
Comment from tbordaz (@tbordaz) at 2017-11-24 09:34:34 @elkris I agree with your proposal to split the issue. We can keep this ticket for limiting the number of DELs on cleanAllRuv. Now I have two concerns:
|
Comment from firstyear (@Firstyear) at 2017-11-24 10:33:40 Sounds great! Can you open the NOOP ticket so we don't forget it :) Thanks. |
Comment from tbordaz (@tbordaz) at 2017-11-24 11:01:04 @Firstyear @elkris new ticket for NOOP handling is #2525 |
Comment from tbordaz (@tbordaz) at 2017-11-24 11:01:29 Metadata Update from @tbordaz:
|
Comment from tbordaz (@tbordaz) at 2017-12-01 16:40:28 |
Comment from tbordaz (@tbordaz) at 2017-12-01 16:40:44 Metadata Update from @tbordaz:
|
Comment from lkrispen (@elkris) at 2017-12-01 16:52:19 The patch looks good, and I agree with always setting the flag to true if the task is created at startup in replica_check_for_tasks. But would it be so difficult to just append the flag to the taskinfo stored in the dse.ldif and parse it like the other attrs? |
Comment from tbordaz (@tbordaz) at 2017-12-01 17:15:03 @elkris thanks for the review. |
Comment from firstyear (@Firstyear) at 2017-12-01 17:19:42 But then we are exposing an implementation detail of the DEL of the keep-alive entry in the configuration. That doesn't seem right to me. I think it's better if we set the data->original_task flag based on the origin of the task, not from attrs in the config .... |
Comment from lkrispen (@elkris) at 2017-12-01 17:24:16 I think you miss the point: if the server crashes or is shut down before the task is completed, it will read the stored task description in the replica object and recreate the task. But then it no longer knows whether it is the origin or not. We already temporarily store the task attributes like RID, force, .. in an encoded attr in the replica object, which is there as long as the task lives. It is nothing you can configure. |
Comment from firstyear (@Firstyear) at 2017-12-01 17:33:20 Hmmm okay. Maybe I am missing something then. I will see the final patch and investigate more :) Thank you, |
Comment from tbordaz (@tbordaz) at 2018-01-10 15:40:05 |
Comment from tbordaz (@tbordaz) at 2018-01-10 15:43:49 |
Comment from tbordaz (@tbordaz) at 2018-01-10 15:47:21 New patch and test case implementing @elkris's proposal (#2522#comment-482445) to store in the attribute nsds5ReplicaCleanRUV (of the replica) the flag indicating whether the cleanAllRuv was received directly (Original) or through extop (Propagated). The test case is in a separate patch because it will need python3 rework |
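To illustrate the idea of persisting the Original/Propagated flag alongside the other task attributes, here is a minimal parsing sketch. The colon-separated field layout (`rid:maxcsn:force:original`) is a hypothetical encoding invented for illustration — it is not the actual on-disk format 389-ds-base uses for nsds5ReplicaCleanRUV.

```python
# Hypothetical sketch: decoding a cleanAllRuv task value persisted in the
# replica object. The field layout (rid:maxcsn:force:original) is an
# illustration only, not the real 389-ds-base encoding.

def parse_cleanruv_value(value: str) -> dict:
    """Split an encoded task value into its fields."""
    rid, maxcsn, force, original = value.split(":")
    return {
        "rid": int(rid),
        "maxcsn": maxcsn,
        "force": force == "yes",
        # The new flag: was the task started here (Original) or
        # received via the extended operation (Propagated)?
        "original": original == "Original",
    }

task = parse_cleanruv_value("7:5a2f00aa000000070000:no:Original")
print(task["rid"], task["original"])
```

The point of the encoding is exactly what @elkris describes: after a crash and restart, replica_check_for_tasks can recreate the task with its origin preserved instead of guessing.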
Comment from spichugi (@droideck) at 2018-01-10 16:10:33 As we agreed before (in IRC), we can push the C code first (after the review by devs), but the test case code is a NOGO for now. The test case code needs to be ported to Python 3, cleaned of boilerplate code, the docstrings should be in the right format, and the test function and file names should be changed. All points are discussed in #2568#comment-486206. Once we go through BZ verification, we can refactor and push the patch. |
Comment from lkrispen (@elkris) at 2018-01-10 16:14:33 I think that in the patch the comment starting at line 149 is no longer valid; you now know if it is original or not |
Comment from tbordaz (@tbordaz) at 2018-01-10 16:44:05 Oops! You are right. Here is the updated patch |
Comment from lkrispen (@elkris) at 2018-01-11 17:29:17 Metadata Update from @elkris:
|
Comment from tbordaz (@tbordaz) at 2018-01-12 15:22:10 To ssh://pagure.io/389-ds-base.git |
Comment from tbordaz (@tbordaz) at 2018-01-12 15:23:33 Metadata Update from @tbordaz:
|
Comment from tbordaz (@tbordaz) at 2018-01-12 15:23:40 Metadata Update from @tbordaz:
|
Comment from tbordaz (@tbordaz) at 2018-01-12 16:53:36 To ssh://pagure.io/389-ds-base.git |
Comment from tbordaz (@tbordaz) at 2018-01-12 17:46:29 To ssh://pagure.io/389-ds-base.git |
Comment from tbordaz (@tbordaz) at 2019-03-25 10:39:23 Test case pushed on behalf of @aadhikari. Thanks |
Comment from tbordaz (@tbordaz) at 2019-03-25 10:39:44 Metadata Update from @tbordaz:
|
Cloned from Pagure issue: https://pagure.io/389-ds-base/issue/49463
Issue Description
When cleanAllRuv is launched, it spawns cleanAllRuv on all replicas.
Each replica will clean its changelog and database RUV, but in addition will DEL the keep-alive entry of the target ReplicaID.
So for the same (keep-alive) entry there will be as many DELs as there are replicas.
Those DELs are replicated. Each DEL will attempt to delete an entry that is already a tombstone.
Locally this update is considered a NOOP.
The problem is that the NOOP does not update the RUV, so the supplier will loop, sending the DEL again and again.
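The loop above can be sketched with a toy model (illustrative only, not the actual 389-ds-base code): the supplier restarts each session from the consumer's RUV, and a NOOP that never advances the RUV means the same DEL is selected every time. CSNs are modeled as plain integers here.

```python
# Toy model of the replication loop: a NOOP that does not advance the
# consumer RUV causes the supplier to resend the same DEL forever.

def apply_update(consumer_ruv, csn, is_noop, noop_updates_ruv):
    """Apply one update on the consumer; return the new RUV maxcsn."""
    if is_noop and not noop_updates_ruv:
        return consumer_ruv          # RUV unchanged -> DEL will be resent
    return max(consumer_ruv, csn)    # RUV advances past the update

def sessions_spent_replaying(noop_updates_ruv, max_sessions=5):
    """Count sessions that replay the DEL (a NOOP on the consumer)."""
    consumer_ruv, del_csn = 100, 101  # DEL of the keep-alive entry
    for session in range(1, max_sessions + 1):
        if consumer_ruv >= del_csn:   # supplier has nothing left to send
            return session - 1
        consumer_ruv = apply_update(consumer_ruv, del_csn,
                                    is_noop=True,
                                    noop_updates_ruv=noop_updates_ruv)
    return max_sessions               # still looping when we gave up

print(sessions_spent_replaying(noop_updates_ruv=False))  # 5 (still looping)
print(sessions_spent_replaying(noop_updates_ruv=True))   # 1 (converges)
```

This is why the two proposed fixes work: either the DEL is never replicated (no NOOP occurs), or the NOOP is allowed to update the RUV (the loop terminates after one session).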
Package Version and Platform
This problem is a side effect of https://fedorahosted.org/389/ticket/48278
It exists since 1.3.6
Steps to reproduce
Run the attached test case
Actual results
The DEL of the cleaned RID can be replicated several times
Expected results
Either the DEL should not be replicated, or it should update the RUV