
PR - Ticket 49463 After cleanALLruv, replication is looping on keep alive DEL #3281

Closed · 389-ds-bot opened this issue Sep 13, 2020 · 71 comments
Labels: merged (Migration flag - PR), pr (Migration flag - PR)

389-ds-bot commented Sep 13, 2020

Cloned from Pagure Pull-Request: https://pagure.io/389-ds-base/pull-request/50222

  • Created at 2019-02-14 16:08:04 by aadhikari
  • Merged at 2019-03-25 10:35:54

Bug Description: When cleanAllRuv is launched, it spawns cleanAllRuv on all replicas.
Each replica will clean its changelog and database RUV, but in addition
will DEL the keep-alive entry of the target ReplicaID.

Fix Description: The test case covers all the scenarios to be tested for the fix.

Resolves: #2522

Review by: ??


Comment from tbordaz (@tbordaz) at 2019-02-14 16:40:32

The test looks good. Should it be integrated into the replication test suite or kept as a regression ticket test case?


Comment from firstyear (@Firstyear) at 2019-02-15 01:10:31

I don't think you need any of this logging boilerplate; it's in topologies now.


Comment from firstyear (@Firstyear) at 2019-02-15 01:11:00

There is already a userAccount create helper in the code you can use.
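For illustration, a minimal sketch of creating a user through lib389's user account helpers (the instance handle and property values here are illustrative, not taken from the PR):

from lib389.idm.user import UserAccounts
from lib389._constants import DEFAULT_SUFFIX

users = UserAccounts(inst, DEFAULT_SUFFIX)   # inst: a connected DirSrv instance
user = users.create(properties={
    'uid': 'testuser',
    'cn': 'testuser',
    'sn': 'user',
    'uidNumber': '1000',
    'gidNumber': '2000',
    'homeDirectory': '/home/testuser',
})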


Comment from firstyear (@Firstyear) at 2019-02-15 01:11:12

Please don't use getEntry, it's going to be removed.


Comment from firstyear (@Firstyear) at 2019-02-15 01:11:42

There is an error log parsing and manipulation API already in lib389 that allows applying a search to the log to find errors.


Comment from firstyear (@Firstyear) at 2019-02-15 01:12:08

Don't use raw modify_s. We also already have replica objects you can use to set up replication and fractional attributes, so you should use those.


Comment from firstyear (@Firstyear) at 2019-02-15 01:12:53

Don't use raw modify; you should be updating the various objects. Don't use ensure_bytes outside of core lib389 (and even then it should be rare). Basically, if you ever need "ensure bytes", you are probably doing something in a "non-correct" way.
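
As a minimal illustration of the point (the object and attribute names are illustrative, not from the PR):

# Instead of a raw modify with manual byte handling:
#   conn.modify_s(dn, [(ldap.MOD_REPLACE, 'description', ensure_bytes('new value'))])
# update the entry through its lib389 object, which handles encoding for you:
user.replace('description', 'new value')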


Comment from firstyear (@Firstyear) at 2019-02-15 01:13:05

There is an access log searching api in the lib389 code already.


Comment from firstyear (@Firstyear) at 2019-02-15 01:13:50

Why is this sleep present? If you are waiting for replication, there is a replication wait task. You should never need a sleep in tests.


Comment from firstyear (@Firstyear) at 2019-02-15 01:14:30

Remove unused boilerplate from copy-paste.


Comment from firstyear (@Firstyear) at 2019-02-15 01:16:09

I want to point out generally that if you are adding "helpers" to make this test case easier, it's likely they will be needed to make other tests easier too. This means creating users, log checks, etc. It is also likely that these things already exist in lib389, because you probably aren't the first person to require this kind of interaction. There is lots of functionality in lib389 to make writing tests quick, simple, and compact. It's probably worth your time to have a quick read of the library to find many things that will help you. :)


Comment from aadhikari at 2019-02-15 05:25:47

> I want to point out generally that if you are adding "helpers" to make this test case easier, it's likely they will be needed to make other tests easier too. This means creating users, log checks, etc. It is also likely that these things already exist in lib389, because you probably aren't the first person to require this kind of interaction. There is lots of functionality in lib389 to make writing tests quick, simple, and compact. It's probably worth your time to have a quick read of the library to find many things that will help you. :)

@Firstyear it would be really nice if you drafted a mail listing all the things that will either be removed or should not be done, like a list of do's and don'ts. It would really help anyone writing test cases. It's confusing when modify_s is used practically everywhere in the existing tests, so it would be nice to have an example or convention we can all follow. Also, for example: getEntry will be removed, so what should its replacement be? We could probably fix many tests that are already upstream. Thanks!


Comment from firstyear (@Firstyear) at 2019-02-15 05:31:33

I'm literally adding code-level warnings for this, so they can be found more easily during development and testing (#3285). Soon I'm going to start addressing these warnings in existing tests (but it will take ... a long time ...).
As well, @droideck has a document on how to write tests with lib389, I think.

The tl;dr is basically that everything should be passed through a DSLdapObject(s) type, so you don't "get" an entry; you would do something like:

thing = DSLdapObjects().get(name)

And similar. There are plenty of reasons for this. It makes tests more compact, it allows encapsulating logic into the code (i.e. what attributes can I set? There are getters/setters for that!). It handles Python 3 for you, it handles lots of sticky cases with renames and modifies, and it has a stateful version (ensure this thing exists rather than needing to check, etc.).

Often lots of these things already exist, but it's a matter of familiarity. Part of it is you making an effort to read and understand what the libraries provide. Part of it is reviews like this, where I can explain how the library works (I wrote most of it, I think ...). Every time I give these reviews, they are a chance to learn, so you don't make the mistake again.
Part of it is also that soon, I'll start to embed warnings through the libraries so we'll have an automated list of bad behaviours we can check.
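
For illustration, a concrete version of the DSLdapObjects pattern above using a typed collection (the collection, instance handle, and rdn value are illustrative):

from lib389.idm.user import UserAccounts
from lib389._constants import DEFAULT_SUFFIX

# Retrieve an existing entry through its typed collection rather than a raw search.
user = UserAccounts(inst, DEFAULT_SUFFIX).get('testuser')
user.get_attr_val_utf8('uid')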


Comment from aadhikari at 2019-02-15 05:50:07

@Firstyear many times our whole focus is on quick porting, as this was already in the ticket (https://fedorapeople.org/groups/389ds/github_attachments/a51cc3d9137740fe66aa5bb98ef78e38f8c7937d8d197cfb1cc86a1c2b23a1a5-0001-Ticket-49463-testcase.patch),

so we quickly want to have this. I see your point and we will make the changes, but instead of you spending effort during review, we could save both effort and time if the information were shared beforehand. I understand there are many cases, but at least some of the legacy ones could be mentioned so we can have quality code on the first review itself :) BTW, the "how to write tests using lib389" guide is a little old and would need updating, as I can see statements like standalone.modify_s(USER_DN, [(ldap.MOD_REPLACE, 'jpegPhoto', ensure_bytes(var_with_content))]), which I think we want to avoid. If there is any newer document other than this: https://fedorapeople.org/~droideck/html/guidelines.html, then please do share it. It will be really helpful.


Comment from tbordaz (@tbordaz) at 2019-02-15 08:57:29

Just out of curiosity, what is the recommended way to retrieve an entry, knowing its DN?


Comment from tbordaz (@tbordaz) at 2019-02-15 09:05:04

I agree that sleep is a bad idea, as it slows down a full suite.
I do not recall exactly why it was needed. My guess is that previous updates create keep_alive entries, and the sleep gives time to replicate them everywhere, so that later, during cleanallruv, each replica will do a DEL of the removed keep_alive (when the bug is not fixed).

So the right approach would be to do a search of the keep_alive entries. Have we got a lib389 helper for that?


Comment from tbordaz (@tbordaz) at 2019-02-15 09:08:20

Here, to be honest, I do not recall or cannot guess why it is needed. The RA threads should be stopped when the DEL returns.
You may try removing it.


Comment from aadhikari at 2019-02-15 12:02:51

> I agree that sleep is a bad idea, as it slows down a full suite.
> I do not recall exactly why it was needed. My guess is that previous updates create keep_alive entries, and the sleep gives time to replicate them everywhere, so that later, during cleanallruv, each replica will do a DEL of the removed keep_alive (when the bug is not fixed).
> So the right approach would be to do a search of the keep_alive entries. Have we got a lib389 helper for that?

Sure, I will remove the sleep there.


Comment from firstyear (@Firstyear) at 2019-02-18 00:37:20

> so we quickly wanna have this, and I see your point we will do the changes but

Quick always leads to rushed results and mistakes. It is better to take your time, and you will have a better experience. There are long-term consequences to "quick" and "rush", one of which is that reviews take longer.

> If there is any new document other than this: https://fedorapeople.org/~droideck/html/guidelines.html then please do share. It will be really helpful.

If you want to see "good" tests, look at src/lib389/lib389/tests/idm/
These all exclusively use lib389.


Comment from aadhikari at 2019-02-19 13:22:03

rebased onto 09be3e19c93e484204a95f4e3ec63eb0c60135d3


Comment from aadhikari at 2019-02-19 14:47:17

rebased onto 7f0b51298a5c38575e6196cadd6548407e66e641


Comment from aadhikari at 2019-02-19 14:58:36

@Firstyear, we have made some of the changes you suggested. Can you help here: what would be the best way to avoid these sleeps? @tbordaz had asked about them and explained their purpose. Thanks!


Comment from firstyear (@Firstyear) at 2019-02-20 00:52:12

@aadhikari In the replication libraries of lib389, there is a "wait for replication" function you can call that waits for a replication event to complete and then continues. Please use that.


Comment from firstyear (@Firstyear) at 2019-02-20 01:00:54

More specifically:

from lib389.replica import ReplicationManager

repman = ReplicationManager(DEFAULT_SUFFIX)
repman.wait_for_replication(from_instance, to_instance)

i.e.

repman.wait_for_replication(topology.m1, topology.m2)


Comment from firstyear (@Firstyear) at 2019-02-20 01:01:19

The entire ReplicationManager class may just help you in general to understand replication configuration; it's really good.


Comment from aadhikari at 2019-02-20 19:13:52

rebased onto ab6559c2cd1b9b5d3f714183c49e848776828f91


Comment from aadhikari at 2019-02-20 19:19:59

@Firstyear changes have been made; please have a look.


Comment from aadhikari at 2019-02-22 08:50:05

> This is an infinite loop risk if the pattern isn't found. Put the errorlog buffered to false, then you don't need a loop at all.

@Firstyear there is no "buffered to false" option for the error log, just for the access log. Also, with looping, it takes around 318 iterations to actually have the results in the log. Restarting the instance, either in the loop or outside it, didn't help, so I have used a small sleep.


Comment from aadhikari at 2019-02-22 08:50:24

> Remove boilerplate.

Do you mean the comments?


Comment from aadhikari at 2019-02-22 08:58:08

> Can you comment with an example string of what this is attempting to match in the assert positive and negative case?

@Firstyear a sample entry: 4:5c6fa6fe000000040000:no:1. Here 5c6fa6fe000000040000 will change, but the rest will stay the same, so we are checking whether we have an entry that starts with 4: and ends with :no:1. replica.get_attr_val_utf8('nsds5replicacleanruv') does have the actual value; we are just comparing it with the regex. @tbordaz can you please give more details here? I might be missing something.
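
For reference, a minimal sketch of the kind of check being described (the sample value comes from this comment; the exact pattern used in the test is an assumption):

import re

# Example nsds5replicacleanruv value from the comment above.
value = '4:5c6fa6fe000000040000:no:1'

# Hypothetical pattern: replica id 4, a changing CSN-like token, then ':no:1'.
assert re.match(r'^4:[0-9a-f]+:no:1$', value) is not None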


Comment from firstyear (@Firstyear) at 2019-02-25 02:53:47

@aadhikari I meant for you to put that into the code, so we can see what you are trying to do and why...


Comment from firstyear (@Firstyear) at 2019-02-25 02:54:26

> Remove boilerplate.
>
> Do you mean the comments?

Yes.

Also, now that I check, I think the errorlog isn't buffered; only the access log is, so I'm still not sure this sleep is needed.


Comment from aadhikari at 2019-02-25 05:55:27

> @aadhikari I meant for you to put that into the code, so we can see what you are trying to do and why...

@Firstyear we are already printing the value in the code.


Comment from aadhikari at 2019-02-25 05:58:04

> This is an infinite loop risk if the pattern isn't found. Put the errorlog buffered to false, then you don't need a loop at all.
>
> @Firstyear there is no "buffered to false" option for the error log, just for the access log. Also, with looping, it takes around 318 iterations to actually have the results in the log. Restarting the instance, either in the loop or outside it, didn't help, so I have used a small sleep.

@Firstyear yeah, there is no error log buffer setting in the source; I talked to @droideck about this issue and even he can't find it in the source. Above are some observations, please have a look. I didn't find any solution; personally not a fan of sleep :)


Comment from firstyear (@Firstyear) at 2019-02-25 06:07:14

Reading log.c in the main server core, the errorlog is not buffered. However, that doesn't mean the content is flushed to disk after an operation. The right solution could be to fsync the errorlog after a write, but that imposes a performance penalty. A sleep doesn't really fix the issue either. Perhaps we need to amend the lib389 errorlog parsing function to fsync before reading?


Comment from aadhikari at 2019-02-25 06:16:17

> Reading log.c in the main server core, the errorlog is not buffered. However, that doesn't mean the content is flushed to disk after an operation. The right solution could be to fsync the errorlog after a write, but that imposes a performance penalty. A sleep doesn't really fix the issue either. Perhaps we need to amend the lib389 errorlog parsing function to fsync before reading?

@Firstyear the sleep gives us a little time before reading the content; also, it's only failing when we search for .Original task deletes Keep alive entry .4. If you run the code you will see these logs are generated after a long pause.


Comment from firstyear (@Firstyear) at 2019-02-25 06:20:39

Let me think about this for a bit. I'm wondering if the issue is some async replication behaviour and if there is another way to wait on it.


Comment from aadhikari at 2019-02-25 08:51:15

rebased onto 86359a5f69e4874b74bdcfd433fd2581ed3175b1


Comment from firstyear (@Firstyear) at 2019-02-26 03:16:38

It may be better to put the sleep after the .match(), so it checks "as soon" as possible.
An option to speed this up could be to have a back-off too. Right now this is 50 seconds max, so you could do something like:

time.sleep( (i/2.0)**2 )

That should give you a sleep timer of

>>> for i in range(0, 10):
...     ((i/2.0) ** 2)
...
0.0
0.25
1.0
2.25
4.0
6.25
9.0
12.25
16.0
20.25

The idea being that you do many quicker checks to start, then "back off" if it's taking longer.
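
As a minimal sketch of that back-off loop (the error-log match() helper is the one discussed earlier in this thread; the function name and attempt count are illustrative):

import time

def wait_for_log_pattern(inst, pattern, attempts=10):
    # Poll the error log with a quadratic back-off instead of a fixed sleep.
    for i in range(attempts):
        if inst.ds_error_log.match(pattern):
            return True
        time.sleep((i / 2.0) ** 2)   # 0.0, 0.25, 1.0, 2.25, ... seconds
    return False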


Comment from firstyear (@Firstyear) at 2019-02-26 03:22:03

Hmmm, now that I think about it, you should not use this cleanAllRuv helper; you should use:

from lib389.tasks import CleanAllRUVTask
task = CleanAllRUVTask(M1).create(properties={'replica-force-cleaning': 'yes'})
task.wait()

Look at your pytest warnings: you should have a bunch of them from the cleanruv task, as it uses deprecated lib389 APIs. The above example should work without warnings.


Comment from firstyear (@Firstyear) at 2019-02-26 03:23:01

Why not set a timeout on the task and use pytest.raises()?


Comment from aadhikari at 2019-02-26 07:43:25

rebased onto 8db9c5cf6f2f926a39a2d7a87f8726253fa9d46d


Comment from aadhikari at 2019-02-26 07:45:33

@Firstyear Changes have been made, but now the whole test is taking more time to complete.


Comment from firstyear (@Firstyear) at 2019-02-26 07:52:36

Can you debug and print what timing event was the one taken for the error log?


Comment from aadhikari at 2019-02-27 07:49:27

> Can you debug and print what timing event was the one taken for the error log?

@Firstyear what exactly are you asking here? The overall execution time has increased. Earlier, with the loop, it used to take 1 iteration, or at most 2, but now I can see 6 iterations before the error log entry appears. But I still think this can't be the real reason for the overall time increasing this much.


Comment from tbordaz (@tbordaz) at 2019-02-27 09:58:00

My understanding is that replica.present will do an active lookup (up to 20 SRCHes on the replica) without pausing.
It looks very intensive; could we not pause between each search (maybe 0.1s)?


Comment from aadhikari at 2019-02-28 07:49:45

> My understanding is that replica.present will do an active lookup (up to 20 SRCHes on the replica) without pausing.
> It looks very intensive; could we not pause between each search (maybe 0.1s)?

@tbordaz pause as in a time.sleep?


Comment from tbordaz (@tbordaz) at 2019-02-28 09:55:35

@aadhikari, yes. It is very convenient to put some delays during specific steps of a test case. So far, having to wait 1s instead of a few msec was not a big deal, but as many test cases do the same, it ends up being slow overall.
I do not know how we can sleep for milliseconds, but in this case it would be useful instead of an active search.


Comment from firstyear (@Firstyear) at 2019-03-01 02:54:11

You can pass floats to sleep in Python, e.g. 0.5.

Generally though, I think it would be great if we could find a way to "block until the task is ready for check" to avoid sleeps.
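
For illustration, a minimal sketch of the suggestion with a sub-second pause between the searches discussed above (the attempt count, attribute, and 0.1s delay are illustrative; replica.present is the lookup named earlier in this thread):

import time

for _ in range(20):
    if replica.present('nsds5replicacleanruv'):
        break
    time.sleep(0.1)   # fractional seconds are accepted by time.sleep()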


Comment from aadhikari at 2019-03-01 05:27:09

> You can pass floats to sleep in Python, e.g. 0.5.
> Generally though, I think it would be great if we could find a way to "block until the task is ready for check" to avoid sleeps.

@Firstyear that's why I used a loop, to avoid sleep, as per our discussion of not using it wherever we can, but as per @tbordaz's suggestion I will use sleep with some float value.


Comment from aadhikari at 2019-03-18 12:21:34

rebased onto d8936be2785c3e722de13f8b63edc7270e327b0a


Comment from aadhikari at 2019-03-18 12:23:57

@Firstyear @tbordaz Changes have been made accordingly.


Comment from firstyear (@Firstyear) at 2019-03-19 02:52:13

Looks good from here. @tbordaz, if you are happy, ack and merge, thanks :)


Comment from aadhikari at 2019-03-25 10:27:41

rebased onto 28a5ddb


Comment from tbordaz (@tbordaz) at 2019-03-25 10:31:26

@aadhikari great job. ACK


Comment from tbordaz (@tbordaz) at 2019-03-25 10:35:54

Pull-Request has been merged by tbordaz


Patch: 50222.patch
