After cleanALLruv, replication is looping on keep alive DEL #2522
Comments
Comment from tbordaz (@tbordaz) at 2017-11-22 12:23:17 |
Comment from tbordaz (@tbordaz) at 2017-11-22 12:23:18 Metadata Update from @tbordaz:
|
Comment from tbordaz (@tbordaz) at 2017-11-22 12:31:24 Note that the problem is not fatal for replication but can create long replication delays. If one replica does not receive direct updates, the DEL of the keep-alive entry will remain its maxcsn. In conclusion, the problem may delay replication. |
Comment from tbordaz (@tbordaz) at 2017-11-22 12:33:50 Some possibilities to fix it:
|
Comment from firstyear (@Firstyear) at 2017-11-22 14:01:50 I think that perhaps the second is the better idea, because in general if we have a non-keep-alive entry that's deleted twice we'll hit this too. Batch deletes could be worse. So I think we should have the replicated DEL NOOP able to write to the RUV, because we really have processed that changelog up to that CSN, and we should not keep replaying it. To put it a different way: if this were another operation type and we needed to NOOP, would we let it replay over and over? I think we would let it write the RUV. I think that option one is a "duct tape" solution for a problem that could occur in other ways, which is why I suggest we fix it properly as above. However, if we did points 1 and 2 I would also be happy with this solution .... Finally, I think we should delete the keep-alive entry, because we are saying "the master is gone, let's remove it". So deleting the keep-alive is correct. Hope that helps! |
Comment from tbordaz (@tbordaz) at 2017-11-22 14:15:43 @Firstyear thanks for your comments. Allowing a NOOP to update the RUV looks like the most generic solution. I think it will work, but it is more difficult to evaluate the impacts. The two other solutions are more specific to cleanAllRuv. |
Comment from lkrispen (@elkris) at 2017-11-22 14:21:13 The problem is not specific to DEL, nor to cleanallruv. The other scenario I know of is after total init. There are cases where the keepalive entry is contained in the total update, but it is sent again after the init is complete. Also there can be MODs to entries before they were sent; those MODs will also be replayed after the init. About the other solutions: the keep-alive DEL is triggered by cleanallruv inside the MMR plugin, so the FIXUP flag could be a solution to prevent it from being replicated, similar to how we prevent replicating the delete of tombstones. I agree that it should be removed since it is an artefact of a removed master. One option could be to delete it only on the server where the cleanallruv task is started. I think we can distinguish between a cleanallruv started directly and one received as an extop. |
Comment from lkrispen (@elkris) at 2017-11-22 14:27:33 @tbordaz Just read your comment after sending mine. I think you have a good point: we should avoid having many DELs, even if we could manage the repeated-replay problem with a NOOP mechanism (we should investigate this independently). So your idea is interesting. Whether receiving or starting a cleanallruv, delete the keepalive immediately. It will get a CSN with a RID that will not be removed, and it will be replicated. |
Comment from tbordaz (@tbordaz) at 2017-11-24 08:23:55 No conclusion yet for the final fix, but there are multiple possibilities to fix it. Regarding a workaround: a simple one is to do, after a cleanAllRuv (or a series of cleanAllRuvs), a dummy update on each replica. The dummy update (not filtered by fractional replication) will have a CSN larger than the CSN of the keep-alive DEL, so the next replication session will start from this dummy update. |
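For reference, such a dummy update could look like the following LDIF, applied with `ldapmodify` against each replica. This is only an illustration: the suffix `dc=example,dc=com` and the choice of the `description` attribute are assumptions — use your own suffix and any attribute that is not excluded by your fractional replication configuration.

```ldif
# Hypothetical dummy update after cleanAllRuv (adjust DN and attribute
# to your deployment; the attribute must not be fractionally excluded)
dn: dc=example,dc=com
changetype: modify
replace: description
description: dummy update after cleanAllRuv
```

Applying this once per replica gives each replica a fresh maxcsn above the keep-alive DEL, so subsequent replication sessions no longer start from the looping DEL.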
Comment from firstyear (@Firstyear) at 2017-11-24 09:01:07
Yes, I completely agree with this.
I think this is also a good suggestion. Can I suggest that the fix be:
I think this is a nice, clean and complete solution to the situation. :) |
Comment from lkrispen (@elkris) at 2017-11-24 09:23:58
I agree with this, but I would split it.
|
Comment from tbordaz (@tbordaz) at 2017-11-24 09:34:34 @elkris I agree with your proposal to split the issue. We can keep this ticket for limiting the number of DELs on cleanAllRuv. Now I have two concerns:
|
Comment from firstyear (@Firstyear) at 2017-11-24 10:33:40 Sounds great! Can you open the NOOP ticket so we don't forget it :) Thanks. |
Comment from tbordaz (@tbordaz) at 2017-11-24 11:01:04 @Firstyear @elkris new ticket for NOOP handling is #2525 |
Comment from tbordaz (@tbordaz) at 2017-11-24 11:01:29 Metadata Update from @tbordaz:
|
Comment from tbordaz (@tbordaz) at 2017-12-01 16:40:28 |
Comment from tbordaz (@tbordaz) at 2017-12-01 16:40:44 Metadata Update from @tbordaz:
|
Comment from lkrispen (@elkris) at 2017-12-01 16:52:19 The patch looks good, and I agree with always setting the flag to true if the task is created at startup in replica_check_for_tasks. But would it be so difficult to just append the flag to the taskinfo stored in the dse.ldif and parse it like the other attrs? |
Comment from tbordaz (@tbordaz) at 2017-12-01 17:15:03 @elkris thanks for the review. |
Comment from firstyear (@Firstyear) at 2017-12-01 17:19:42 But then we are exposing an implementation detail of the DEL of the keep-alive entry in the configuration. That doesn't seem right to me. I think it's better if we set the data->original_task flag based on the origin of the task, not from attrs in the config .... |
Comment from lkrispen (@elkris) at 2017-12-01 17:24:16 I think you miss the point: if the server crashes or is shut down before the task is completed, it will read the stored task description in the replica object and recreate the task. But then it no longer knows whether it is the origin or not. We already temporarily store the task attributes like RID, force, .. in an encoded attr in the replica object, which is there as long as the task lives. It is nothing you can configure. |
Comment from firstyear (@Firstyear) at 2017-12-01 17:33:20 Hmmm okay. Maybe I am missing something then. I will see the final patch and investigate more :) Thank you, |
Comment from tbordaz (@tbordaz) at 2018-01-10 15:40:05 |
Comment from tbordaz (@tbordaz) at 2018-01-10 15:43:49 |
Comment from tbordaz (@tbordaz) at 2018-01-10 15:47:21 New patch and test case implementing @elkris's proposal (#2522#comment-482445) to store in the attribute nsds5ReplicaCleanRUV (of the replica) the flag indicating whether the cleanAllRuv was received directly (Original) or through extop (Propagated). The test case is in a separate patch because it will need python3 rework |
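To illustrate the idea of persisting the Original/Propagated flag alongside the other task attributes, here is a minimal parsing sketch. The colon-separated field layout (`rid:maxcsn:force:original`) is a hypothetical encoding invented for illustration — it is not the actual on-disk format 389-ds-base uses for nsds5ReplicaCleanRUV.

```python
# Hypothetical sketch: decoding a cleanAllRuv task value persisted in the
# replica object. The field layout (rid:maxcsn:force:original) is an
# illustration only, not the real 389-ds-base encoding.

def parse_cleanruv_value(value: str) -> dict:
    """Split an encoded task value into its fields."""
    rid, maxcsn, force, original = value.split(":")
    return {
        "rid": int(rid),
        "maxcsn": maxcsn,
        "force": force == "yes",
        # The new flag: was the task started here (Original) or
        # received via the extended operation (Propagated)?
        "original": original == "Original",
    }

task = parse_cleanruv_value("7:5a2f00aa000000070000:no:Original")
print(task["rid"], task["original"])
```

The point of the encoding is exactly what @elkris describes: after a crash and restart, replica_check_for_tasks can recreate the task with its origin preserved instead of guessing.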
Comment from spichugi (@droideck) at 2018-01-10 16:10:33 As we agreed before (in IRC), we can push the C code first (after the review by devs), but the test case code is a NOGO for now. The test case code needs to be ported to Python 3, cleaned of boilerplate code, the docstrings should be in the right format, and the test function and file names should be changed. All points are discussed in #2568#comment-486206. Once we go through BZ verification, we can refactor and push the patch. |
Comment from lkrispen (@elkris) at 2018-01-10 16:14:33 I think that in the patch the comment starting at line 149 is no longer valid; you now know if it is original or not |
Comment from tbordaz (@tbordaz) at 2018-01-10 16:44:05 Oops! You are right. Here is the updated patch |
Comment from lkrispen (@elkris) at 2018-01-11 17:29:17 Metadata Update from @elkris:
|
Comment from tbordaz (@tbordaz) at 2018-01-12 15:22:10 To ssh://pagure.io/389-ds-base.git |
Comment from tbordaz (@tbordaz) at 2018-01-12 15:23:33 Metadata Update from @tbordaz:
|
Comment from tbordaz (@tbordaz) at 2018-01-12 15:23:40 Metadata Update from @tbordaz:
|
Comment from tbordaz (@tbordaz) at 2018-01-12 16:53:36 To ssh://pagure.io/389-ds-base.git |
Comment from tbordaz (@tbordaz) at 2018-01-12 17:46:29 To ssh://pagure.io/389-ds-base.git |
Comment from tbordaz (@tbordaz) at 2019-03-25 10:39:23 Test case pushed on behalf of @aadhikari. Thanks |
Comment from tbordaz (@tbordaz) at 2019-03-25 10:39:44 Metadata Update from @tbordaz:
|
Cloned from Pagure issue: https://pagure.io/389-ds-base/issue/49463
Issue Description
When cleanAllRuv is launched, it spawns cleanAllRuv on all replicas.
Each replica will clean its changelog and database RUV, but in addition will DEL the keep-alive entry of the target ReplicaID.
So for the same (keep-alive) entry there will be as many DELs as there are replicas.
Those DELs are replicated. Each DEL will attempt to delete an entry that is already a tombstone.
Locally this update is considered a NOOP.
The problem is that the NOOP does not update the RUV, so the supplier will loop, sending the DEL again and again.
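The loop above can be sketched with a toy model (illustrative only, not the actual 389-ds-base code): the supplier restarts each session from the consumer's RUV, and a NOOP that never advances the RUV means the same DEL is selected every time. CSNs are modeled as plain integers here.

```python
# Toy model of the replication loop: a NOOP that does not advance the
# consumer RUV causes the supplier to resend the same DEL forever.

def apply_update(consumer_ruv, csn, is_noop, noop_updates_ruv):
    """Apply one update on the consumer; return the new RUV maxcsn."""
    if is_noop and not noop_updates_ruv:
        return consumer_ruv          # RUV unchanged -> DEL will be resent
    return max(consumer_ruv, csn)    # RUV advances past the update

def sessions_spent_replaying(noop_updates_ruv, max_sessions=5):
    """Count sessions that replay the DEL (a NOOP on the consumer)."""
    consumer_ruv, del_csn = 100, 101  # DEL of the keep-alive entry
    for session in range(1, max_sessions + 1):
        if consumer_ruv >= del_csn:   # supplier has nothing left to send
            return session - 1
        consumer_ruv = apply_update(consumer_ruv, del_csn,
                                    is_noop=True,
                                    noop_updates_ruv=noop_updates_ruv)
    return max_sessions               # still looping when we gave up

print(sessions_spent_replaying(noop_updates_ruv=False))  # 5 (still looping)
print(sessions_spent_replaying(noop_updates_ruv=True))   # 1 (converges)
```

This is why the two proposed fixes work: either the DEL is never replicated (no NOOP occurs), or the NOOP is allowed to update the RUV (the loop terminates after one session).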
Package Version and Platform
This problem is a side effect of https://fedorahosted.org/389/ticket/48278
It exists since 1.3.6
Steps to reproduce
Run the attached test case
Actual results
The DEL of the cleaned RID can be replicated several times
Expected results
Either the DEL should not be replicated, or it should update the RUV