
PR - Ticket 49463 After cleanALLruv, replication is looping on keep alive DEL #3281

Closed · 389-ds-bot opened this issue Sep 13, 2020 · 71 comments
Labels: merged (Migration flag - PR), pr (Migration flag - PR)

389-ds-bot commented Sep 13, 2020

Cloned from Pagure Pull-Request: https://pagure.io/389-ds-base/pull-request/50222

  • Created at 2019-02-14 16:08:04 by aadhikari
  • Merged at 2019-03-25 10:35:54

Bug Description: When cleanAllRuv is launched, it spawns cleanAllRuv on all replicas.
Each replica will clean its changelog and database RUV, but in addition
will DEL the keep-alive entry of the target ReplicaID.

Fix Description: The test case covers all the scenarios to be tested for the fix.

Resolves: #2522

Review by: ??


Comment from tbordaz (@tbordaz) at 2019-02-14 16:40:32

The test looks good. Should it be integrated into the replication test suite or kept as a regression ticket test case?


Comment from firstyear (@Firstyear) at 2019-02-15 01:10:31

I don't think you need any of this logging boilerplate; it's in topologies now.


Comment from firstyear (@Firstyear) at 2019-02-15 01:11:00

There is already a userAccount create helper in the code you can use.
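For illustration, a minimal sketch of creating a user through lib389's user account helpers (the instance handle and property values here are illustrative, not taken from the PR):

from lib389.idm.user import UserAccounts
from lib389._constants import DEFAULT_SUFFIX

users = UserAccounts(inst, DEFAULT_SUFFIX)   # inst: a connected DirSrv instance
user = users.create(properties={
    'uid': 'testuser',
    'cn': 'testuser',
    'sn': 'user',
    'uidNumber': '1000',
    'gidNumber': '2000',
    'homeDirectory': '/home/testuser',
})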


Comment from firstyear (@Firstyear) at 2019-02-15 01:11:12

Please don't use getEntry, it's going to be removed.


Comment from firstyear (@Firstyear) at 2019-02-15 01:11:42

There is an error log parsing and manipulation API already in lib389 that allows applying a search to the log to find errors.


Comment from firstyear (@Firstyear) at 2019-02-15 01:12:08

Don't use raw modify_s. We also already have replica objects you can use to set up replication and fractional attributes, so you should use those.


Comment from firstyear (@Firstyear) at 2019-02-15 01:12:53

Don't use raw modify; you should be updating the various objects. Don't use ensure_bytes outside of core lib389 (and even then it should be rare). Basically, if you ever need "ensure bytes", you are probably doing something in a "non-correct" way.
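
As a minimal illustration of the point (the object and attribute names are illustrative, not from the PR):

# Instead of a raw modify with manual byte handling:
#   conn.modify_s(dn, [(ldap.MOD_REPLACE, 'description', ensure_bytes('new value'))])
# update the entry through its lib389 object, which handles encoding for you:
user.replace('description', 'new value')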


Comment from firstyear (@Firstyear) at 2019-02-15 01:13:05

There is an access log searching api in the lib389 code already.


Comment from firstyear (@Firstyear) at 2019-02-15 01:13:50

Why is this sleep present? If you are waiting for replication, there is a replication wait task. You should never need a sleep in tests.


Comment from firstyear (@Firstyear) at 2019-02-15 01:14:30

Remove unused boilerplate from copy-paste.


Comment from firstyear (@Firstyear) at 2019-02-15 01:16:09

I want to point out generally that if you are adding "helpers" to make this test case easier, it's likely they will be needed to make other tests easier too. This means creating users, log checks, etc. It is also likely that these things already exist in lib389, because you probably aren't the first person to require this kind of interaction. There is lots of functionality in lib389 to make writing tests quick, simple, and compact. It's probably worth your time to have a quick read of the library to find many things that will help you. :)


Comment from aadhikari at 2019-02-15 05:25:47

> I want to point out generally that if you are adding "helpers" to make this test case easier, it's likely they will be needed to make other tests easier too. This means creating users, log checks, etc. It is also likely that these things already exist in lib389, because you probably aren't the first person to require this kind of interaction. There is lots of functionality in lib389 to make writing tests quick, simple, and compact. It's probably worth your time to have a quick read of the library to find many things that will help you. :)

@Firstyear it would be really nice if you drafted a mail listing all the things that will either be removed or should not be done, like a list of do's and don'ts. It would really help anyone writing test cases. It's confusing when modify_s is used practically everywhere in the existing tests, so it would be nice to have an example or convention we can all follow. Also, for example: getEntry will be removed, so what should its replacement be? We could probably fix many tests that are already upstream. Thanks!


Comment from firstyear (@Firstyear) at 2019-02-15 05:31:33

I'm literally adding code-level warnings for this, so they can be found more easily during development and testing (#3285). Soon I'm going to start addressing these warnings in existing tests (but it will take ... a long time ...).
As well, @droideck has a document on how to write tests with lib389, I think.

The tl;dr is basically that everything should be passed through a DSLdapObject(s) type, so you don't "get" an entry; you would do something like:

thing = DSLdapObjects().get(name)

And similar. There are plenty of reasons for this. It makes tests more compact, it allows encapsulating logic into the code (i.e. what attributes can I set? There are getters/setters for that!). It handles Python 3 for you, it handles lots of sticky cases with renames and modifies, and it has a stateful version (ensure this thing exists rather than needing to check, etc.).

Often lots of these things already exist, but it's a matter of familiarity. Part of it is you making an effort to read and understand what the libraries provide. Part of it is reviews like this, where I can explain how the library works (I wrote most of it, I think ...). Every time I give these reviews, they are a chance to learn, so you don't make the mistake again.
Part of it is also that soon, I'll start to embed warnings through the libraries so we'll have an automated list of bad behaviours we can check.
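
For illustration, a concrete version of the DSLdapObjects pattern above using a typed collection (the collection, instance handle, and rdn value are illustrative):

from lib389.idm.user import UserAccounts
from lib389._constants import DEFAULT_SUFFIX

# Retrieve an existing entry through its typed collection rather than a raw search.
user = UserAccounts(inst, DEFAULT_SUFFIX).get('testuser')
user.get_attr_val_utf8('uid')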


Comment from aadhikari at 2019-02-15 05:50:07

@Firstyear many times our whole focus is on quick porting, as this was already in the ticket (https://fedorapeople.org/groups/389ds/github_attachments/a51cc3d9137740fe66aa5bb98ef78e38f8c7937d8d197cfb1cc86a1c2b23a1a5-0001-Ticket-49463-testcase.patch),

so we quickly want to have this. I see your point and we will make the changes, but instead of you spending effort during review, we could save both effort and time if the information were shared beforehand. I understand there are many cases, but at least some of the legacy ones could be mentioned so we can have quality code on the first review itself :) BTW, the "how to write tests using lib389" guide is a little old and would need updating, as I can see statements like standalone.modify_s(USER_DN, [(ldap.MOD_REPLACE, 'jpegPhoto', ensure_bytes(var_with_content))]), which I think we want to avoid. If there is any newer document other than this: https://fedorapeople.org/~droideck/html/guidelines.html, then please do share it. It will be really helpful.


Comment from tbordaz (@tbordaz) at 2019-02-15 08:57:29

Just out of curiosity, what is the recommended way to retrieve an entry, knowing its DN?


Comment from tbordaz (@tbordaz) at 2019-02-15 09:05:04

I agree that sleep is a bad idea, as it slows down a full suite.
I do not recall exactly why it was needed. My guess is that previous updates create keep_alive entries, and the sleep gives time to replicate them everywhere, so that later, during cleanallruv, each replica will do a DEL of the removed keep_alive (when the bug is not fixed).

So the right approach would be to do a search of the keep_alive entries. Have we got a lib389 helper for that?


Comment from tbordaz (@tbordaz) at 2019-02-15 09:08:20

Here, to be honest, I do not recall or cannot guess why it is needed. The RA threads should be stopped when the DEL returns.
You may try removing it.


Comment from aadhikari at 2019-02-15 12:02:51

> I agree that sleep is a bad idea, as it slows down a full suite.
> I do not recall exactly why it was needed. My guess is that previous updates create keep_alive entries, and the sleep gives time to replicate them everywhere, so that later, during cleanallruv, each replica will do a DEL of the removed keep_alive (when the bug is not fixed).
> So the right approach would be to do a search of the keep_alive entries. Have we got a lib389 helper for that?

Sure, I will remove the sleep there.


Comment from firstyear (@Firstyear) at 2019-02-18 00:37:20

> so we quickly wanna have this, and I see your point we will do the changes but

Quick always leads to rushed results and mistakes. It is better to take your time, and you will have a better experience. There are long-term consequences to "quick" and "rush", one of which is that reviews take longer.

> If there is any new document other than this: https://fedorapeople.org/~droideck/html/guidelines.html then please do share. It will be really helpful.

If you want to see "good" tests, look at src/lib389/lib389/tests/idm/
These all exclusively use lib389.


Comment from aadhikari at 2019-02-19 13:22:03

rebased onto 09be3e19c93e484204a95f4e3ec63eb0c60135d3


Comment from aadhikari at 2019-02-19 14:47:17

rebased onto 7f0b51298a5c38575e6196cadd6548407e66e641


Comment from aadhikari at 2019-02-19 14:58:36

@Firstyear, we have made some of the changes you suggested. Can you help here: what would be the best way to avoid these sleeps? @tbordaz had asked about them and explained their purpose. Thanks!


Comment from firstyear (@Firstyear) at 2019-02-20 00:52:12

@aadhikari In the replication libraries of lib389, there is a "wait for replication" function you can call that waits for a replication event to complete and then continues. Please use that.


Comment from firstyear (@Firstyear) at 2019-02-20 01:00:54

More specifically:

from lib389.replica import ReplicationManager

repman = ReplicationManager(DEFAULT_SUFFIX)
repman.wait_for_replication(from_instance, to_instance)

i.e.

repman.wait_for_replication(topology.m1, topology.m2)


Comment from firstyear (@Firstyear) at 2019-02-20 01:01:19

The entire ReplicationManager class may just help you in general to understand replication configuration; it's really good.


Comment from aadhikari at 2019-02-20 19:13:52

rebased onto ab6559c2cd1b9b5d3f714183c49e848776828f91


Comment from aadhikari at 2019-02-20 19:19:59

@Firstyear changes have been made; please have a look.


Comment from aadhikari at 2019-02-22 08:50:05

> This is an infinite loop risk if the pattern isn't found. Put the errorlog buffered to false, then you don't need a loop at all.

@Firstyear there is no "buffered to false" option for the error log, just for the access log. Also, with looping, it takes around 318 iterations to actually have the results in the log. Restarting the instance, either in the loop or outside it, didn't help, so I have used a small sleep.


Comment from aadhikari at 2019-02-22 08:50:24

> Remove boilerplate.

Do you mean the comments?


Comment from aadhikari at 2019-02-22 08:58:08

> Can you comment with an example string of what this is attempting to match in the assert positive and negative case?

@Firstyear a sample entry: 4:5c6fa6fe000000040000:no:1. Here 5c6fa6fe000000040000 will change, but the rest will stay the same, so we are checking whether we have an entry that starts with 4: and ends with :no:1. replica.get_attr_val_utf8('nsds5replicacleanruv') does have the actual value; we are just comparing it with the regex. @tbordaz can you please give more details here? I might be missing something.
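
For reference, a minimal sketch of the kind of check being described (the sample value comes from this comment; the exact pattern used in the test is an assumption):

import re

# Example nsds5replicacleanruv value from the comment above.
value = '4:5c6fa6fe000000040000:no:1'

# Hypothetical pattern: replica id 4, a changing CSN-like token, then ':no:1'.
assert re.match(r'^4:[0-9a-f]+:no:1$', value) is not None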


Comment from firstyear (@Firstyear) at 2019-02-25 02:53:47

@aadhikari I meant for you to put that into the code, so we can see what you are trying to do and why...


Comment from firstyear (@Firstyear) at 2019-02-25 02:54:26

> Remove boilerplate.
>
> Do you mean the comments?

Yes.

Also, now that I check, I think the errorlog isn't buffered; only the access log is, so I'm still not sure this sleep is needed.


Comment from aadhikari at 2019-02-25 05:55:27

> @aadhikari I meant for you to put that into the code, so we can see what you are trying to do and why...

@Firstyear we are already printing the value in the code.


Comment from aadhikari at 2019-02-25 05:58:04

> This is an infinite loop risk if the pattern isn't found. Put the errorlog buffered to false, then you don't need a loop at all.
>
> @Firstyear there is no "buffered to false" option for the error log, just for the access log. Also, with looping, it takes around 318 iterations to actually have the results in the log. Restarting the instance, either in the loop or outside it, didn't help, so I have used a small sleep.

@Firstyear yeah, there is no error log buffer setting in the source; I talked to @droideck about this issue and even he can't find it in the source. Above are some observations, please have a look. I didn't find any solution; personally not a fan of sleep :)


Comment from firstyear (@Firstyear) at 2019-02-25 06:07:14

Reading log.c in the main server core, the errorlog is not buffered. However, that doesn't mean the content is flushed to disk after an operation. The right solution could be to fsync the errorlog after a write, but that imposes a performance penalty. A sleep doesn't really fix the issue either. Perhaps we need to amend the lib389 errorlog parsing function to fsync before reading?


Comment from aadhikari at 2019-02-25 06:16:17

> Reading log.c in the main server core, the errorlog is not buffered. However, that doesn't mean the content is flushed to disk after an operation. The right solution could be to fsync the errorlog after a write, but that imposes a performance penalty. A sleep doesn't really fix the issue either. Perhaps we need to amend the lib389 errorlog parsing function to fsync before reading?

@Firstyear the sleep gives us a little time before reading the content; also, it's only failing when we search for .Original task deletes Keep alive entry .4. If you run the code you will see these logs are generated after a long pause.


Comment from firstyear (@Firstyear) at 2019-02-25 06:20:39

Let me think about this for a bit. I'm wondering if the issue is some async replication behaviour and if there is another way to wait on it.


Comment from aadhikari at 2019-02-25 08:51:15

rebased onto 86359a5f69e4874b74bdcfd433fd2581ed3175b1


Comment from firstyear (@Firstyear) at 2019-02-26 03:16:38

It may be better to put the sleep after the .match(), so it checks "as soon" as possible.
An option to speed this up could be to have a back-off too. Right now this is 50 seconds max, so you could do something like:

time.sleep( (i/2.0)**2 )

That should give you a sleep timer of

>>> for i in range(0, 10):
...     ((i/2.0) ** 2)
...
0.0
0.25
1.0
2.25
4.0
6.25
9.0
12.25
16.0
20.25

The idea being that you do many quicker checks to start, then "back off" if it's taking longer.
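
As a minimal sketch of that back-off loop (the error-log match() helper is the one discussed earlier in this thread; the function name and attempt count are illustrative):

import time

def wait_for_log_pattern(inst, pattern, attempts=10):
    # Poll the error log with a quadratic back-off instead of a fixed sleep.
    for i in range(attempts):
        if inst.ds_error_log.match(pattern):
            return True
        time.sleep((i / 2.0) ** 2)   # 0.0, 0.25, 1.0, 2.25, ... seconds
    return False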


Comment from firstyear (@Firstyear) at 2019-02-26 03:22:03

Hmmm, now that I think about it, you should not use this cleanAllRuv helper; you should use:

from lib389.tasks import CleanAllRUVTask
task = CleanAllRUVTask(M1).create(properties={'replica-force-cleaning': 'yes'})
task.wait()

Look at your pytest warnings: you should have a bunch of them from the cleanruv task, as it uses deprecated lib389 APIs. The above example should work without warnings.


Comment from firstyear (@Firstyear) at 2019-02-26 03:23:01

Why not set a timeout on the task and use pytest.raises()?


Comment from aadhikari at 2019-02-26 07:43:25

rebased onto 8db9c5cf6f2f926a39a2d7a87f8726253fa9d46d


Comment from aadhikari at 2019-02-26 07:45:33

@Firstyear Changes have been made, but now the whole test is taking more time to complete.


Comment from firstyear (@Firstyear) at 2019-02-26 07:52:36

Can you debug and print what timing event was the one taken for the error log?


Comment from aadhikari at 2019-02-27 07:49:27

> Can you debug and print what timing event was the one taken for the error log?

@Firstyear what exactly are you asking here? The overall execution time has increased. Earlier, with the loop, it used to take 1 iteration, or at most 2, but now I can see 6 iterations before the error log entry appears. But I still think this can't be the real reason for the overall time increasing this much.


Comment from tbordaz (@tbordaz) at 2019-02-27 09:58:00

My understanding is that replica.present will do an active lookup (up to 20 SRCHes on the replica) without pausing.
It looks very intensive; could we not pause between each search (maybe 0.1s)?


Comment from aadhikari at 2019-02-28 07:49:45

> My understanding is that replica.present will do an active lookup (up to 20 SRCHes on the replica) without pausing.
> It looks very intensive; could we not pause between each search (maybe 0.1s)?

@tbordaz pause as in a time.sleep?


Comment from tbordaz (@tbordaz) at 2019-02-28 09:55:35

@aadhikari, yes. It is very convenient to put some delays during specific steps of a test case. So far, having to wait 1s instead of a few msec was not a big deal, but as many test cases do the same, it ends up being slow overall.
I do not know how we can sleep for milliseconds, but in this case it would be useful instead of an active search.


Comment from firstyear (@Firstyear) at 2019-03-01 02:54:11

You can pass floats to sleep in Python, e.g. 0.5.

Generally though, I think it would be great if we could find a way to "block until the task is ready for check" to avoid sleeps.
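
For illustration, a minimal sketch of the suggestion with a sub-second pause between the searches discussed above (the attempt count, attribute, and 0.1s delay are illustrative; replica.present is the lookup named earlier in this thread):

import time

for _ in range(20):
    if replica.present('nsds5replicacleanruv'):
        break
    time.sleep(0.1)   # fractional seconds are accepted by time.sleep()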


Comment from aadhikari at 2019-03-01 05:27:09

> You can pass floats to sleep in Python, e.g. 0.5.
> Generally though, I think it would be great if we could find a way to "block until the task is ready for check" to avoid sleeps.

@Firstyear that's why I used a loop, to avoid sleep, as per our discussion of not using it wherever we can, but as per @tbordaz's suggestion I will use sleep with some float value.


Comment from aadhikari at 2019-03-18 12:21:34

rebased onto d8936be2785c3e722de13f8b63edc7270e327b0a


Comment from aadhikari at 2019-03-18 12:23:57

@Firstyear @tbordaz Changes have been made accordingly.


Comment from firstyear (@Firstyear) at 2019-03-19 02:52:13

Looks good from here. @tbordaz, if you are happy, ack and merge, thanks :)


Comment from aadhikari at 2019-03-25 10:27:41

rebased onto 28a5ddb


Comment from tbordaz (@tbordaz) at 2019-03-25 10:31:26

@aadhikari great job. ACK


Comment from tbordaz (@tbordaz) at 2019-03-25 10:35:54

Pull-Request has been merged by tbordaz


Patch: 50222.patch
