PR - Ticket 49463 After cleanALLruv, replication is looping on keep alive DEL #3281
Comments
Comment from tbordaz (@tbordaz) at 2019-02-14 16:40:32 The test looks good. Should it be integrated into the replication test suite or kept as a regression ticket test case? |
Comment from firstyear (@Firstyear) at 2019-02-15 01:10:31 I don't think you need any of this logging boilerplate, it's in topologies now. |
Comment from firstyear (@Firstyear) at 2019-02-15 01:11:00 There is already a userAccount create helper in the code you can use. |
Comment from firstyear (@Firstyear) at 2019-02-15 01:11:12 Please don't use getEntry, it's going to be removed. |
Comment from firstyear (@Firstyear) at 2019-02-15 01:11:42 There is already an errorlog parsing and manipulation api in lib389 that allows applying a search to the log to find errors. |
Comment from firstyear (@Firstyear) at 2019-02-15 01:12:08 Don't use raw modify_s. We also already have replica objects you can use to set up replication and fractional attributes, so you should use those. |
Comment from firstyear (@Firstyear) at 2019-02-15 01:12:53 Don't use raw modify, you should be updating the various objects. Don't use ensure_bytes outside of core lib389 (and even then it should be rare). Basically, if you ever need "ensure bytes", you are probably doing something in a "non-correct" way. |
Comment from firstyear (@Firstyear) at 2019-02-15 01:13:05 There is an access log searching api in the lib389 code already. |
Comment from firstyear (@Firstyear) at 2019-02-15 01:13:50 Why is this sleep present? If you are waiting for replication, there is a replication wait task. You should never need a sleep in tests. |
Comment from firstyear (@Firstyear) at 2019-02-15 01:14:30 Remove un-used boilerplate from copy-paste. |
Comment from firstyear (@Firstyear) at 2019-02-15 01:16:09 I want to point out generally, that if you are adding "helpers" to make this test case easier, it's likely they will be needed to make other tests easier too. This means creating users, log checks, etc. It is also likely that these things already exist in lib389, because you probably aren't the first person to require this kind of interaction. There is lots of functionality in lib389 to make writing tests quick, simple, and compact. It's probably worth your time to have a quick read of the library to find many things that will help you. :) |
Comment from aadhikari at 2019-02-15 05:25:47
@Firstyear It would be really nice if you drafted a mail listing all the things which will either be removed or should not be done, like do's and don'ts. It would really help anyone who is writing test cases. It's confusing when we are using modify_s almost everywhere in the tests, so it would be really nice to have an example or convention which we can all follow. Also, for example: getEntry will be removed, so what should be its replacement? We could probably fix many tests which are already upstream. Thanks! |
Comment from firstyear (@Firstyear) at 2019-02-15 05:31:33 I'm literally adding code level warnings for this, so they can be found easier during devel and testing. #3285 Soon I'm going to start addressing these warnings in existing tests (but it will take ... a long time ....). The tl;dr is basically, that everything should be passed through a DSLdapObject(s) type, so you don't "get" an entry, you would do something like:
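The code block that originally followed this comment did not survive the clone; a minimal sketch of the DSLdapObject style it describes might look like this (the instance name `standalone`, the user RDN, and the attribute values are assumptions for illustration; this needs lib389 and a running directory server):

```python
# Sketch only: assumes lib389 is installed and a live instance is
# bound as `standalone`.
from lib389.idm.user import UserAccounts
from lib389._constants import DEFAULT_SUFFIX

users = UserAccounts(standalone, DEFAULT_SUFFIX)
user = users.get('testuser')                      # instead of getEntry
user.replace('description', 'demo value')        # instead of raw modify_s
desc = user.get_attr_val_utf8('description')     # handles python3 bytes for you
```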
And similar. There are plenty of reasons for this. It makes more compact tests, it allows encapsulating logic into the code (IE what attributes can I set? there are get/set for that!). It handles python3 for you, it handles lots of sticky cases with renames and modifies, it has a stateful version (ensure this thing exists rather than needing to check etc). Often lots of these things already exist, but it's a matter of familiarity. Part of it is you making an effort to read and understand what the libraries provide. Part of it is reviews like this, where I can explain how the library works (I wrote most of it I think ...). Every time I give these reviews, they are times to learn and then you don't make the mistake again. |
Comment from aadhikari at 2019-02-15 05:50:07 @Firstyear many times our whole focus is on quick porting, as this was already in the ticket (https://fedorapeople.org/groups/389ds/github_attachments/a51cc3d9137740fe66aa5bb98ef78e38f8c7937d8d197cfb1cc86a1c2b23a1a5-0001-Ticket-49463-testcase.patch), so we quickly want to have this. I see your point and we will make the changes, but instead of you taking the effort while reviewing, we can save both effort and time if the information is passed on beforehand. I understand there are many cases, but at least some of the legacy ones can be mentioned so we can have quality code from the first review itself :) BTW the "how to write tests using lib389" doc is a little old and would need updating, as I can see statements like: standalone.modify_s(USER_DN, [(ldap.MOD_REPLACE, 'jpegPhoto', ensure_bytes(var_with_content))]) which I think we want to avoid. If there is any newer document other than this: https://fedorapeople.org/~droideck/html/guidelines.html then please do share, it will be really helpful. |
Comment from tbordaz (@tbordaz) at 2019-02-15 08:57:29 Just for curiosity, what is the recommended way to retrieve an entry knowing its DN ? |
Comment from tbordaz (@tbordaz) at 2019-02-15 09:05:04 I agree that sleep is a bad idea as it slows down a full suite. So the right approach would be to do a search of the keep_alive entries; have we got a lib389 helper for that? |
Comment from tbordaz (@tbordaz) at 2019-02-15 09:08:20 Here, to be honest, I do not recall or can guess why it is needed. The RA threads should be stopped when the DEL returns. |
Comment from aadhikari at 2019-02-15 12:02:51
Sure I will remove the sleep there. |
Comment from firstyear (@Firstyear) at 2019-02-18 00:37:20
Quick always leads to rushed results and mistakes. It is better to take your time and you will have a better experience. There are long term consequences to "quick" and "rush", one of which is reviews take longer.
If you want to see "good" tests, look at src/lib389/lib389/tests/idm/ |
Comment from aadhikari at 2019-02-19 13:22:03 rebased onto 09be3e19c93e484204a95f4e3ec63eb0c60135d3 |
Comment from aadhikari at 2019-02-19 14:47:17 rebased onto 7f0b51298a5c38575e6196cadd6548407e66e641 |
Comment from aadhikari at 2019-02-19 14:58:36 @Firstyear, we have made some changes suggested by you. Can you help here: what would be the best way to avoid these sleeps? @tbordaz had asked about and explained the purpose of them. Thanks! |
Comment from firstyear (@Firstyear) at 2019-02-20 00:52:12 @aadhikari In the replication libraries of lib389, there is a "wait for replication" function you can call that waits for a replication event, and then continues. Use that please. |
Comment from firstyear (@Firstyear) at 2019-02-20 01:00:54 More specific:
IE
|
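The specific example referenced above was stripped in the clone; a hedged sketch of the suggested wait might look like this (the instance names `supplier1`/`supplier2` are assumptions; it requires lib389 and two live replicated instances):

```python
# Sketch only: assumes lib389 is installed and two supplier
# instances are available as supplier1/supplier2.
from lib389.replica import ReplicationManager
from lib389._constants import DEFAULT_SUFFIX

repl = ReplicationManager(DEFAULT_SUFFIX)
# Blocks until a change made on supplier1 is visible on supplier2,
# replacing any time.sleep() in the test.
repl.wait_for_replication(supplier1, supplier2)
```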
Comment from firstyear (@Firstyear) at 2019-02-20 01:01:19 The entire ReplicationManager class may just help you in general to understand replication configuration, it's really good. |
Comment from aadhikari at 2019-02-20 19:13:52 rebased onto ab6559c2cd1b9b5d3f714183c49e848776828f91 |
Comment from aadhikari at 2019-02-20 19:19:59 @Firstyear changes have been made, please have a look |
Comment from aadhikari at 2019-02-22 08:50:05
@Firstyear there is no "buffered: false" setting for the error log, just for the access log. Also, with looping, it takes about 318 iterations to actually have the results in the log. Restarting the instance either in the loop or outside didn't help, so I have used a small sleep. |
Comment from aadhikari at 2019-02-22 08:50:24
Do you mean the comments? |
Comment from aadhikari at 2019-02-22 08:58:08
@Firstyear a sample entry: 4:5c6fa6fe000000040000:no:1, so here 5c6fa6fe000000040000 will change, but the rest will stay the same, so we are checking whether we have an entry which starts with "4:" and ends with ":no:1" (excluding the changing part). replica.get_attr_val_utf8('nsds5replicacleanruv') does have the actual value; we are just comparing it with the regex. @tbordaz can you please give more details here, I might be missing something in particular. |
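The check described above can be sketched with a plain regex (the sample value is taken from the comment; treating the middle CSN token as lowercase hex is an assumption):

```python
import re

# Matches "4:<csn>:no:1" where the middle CSN token varies per run.
CLEANRUV_RE = re.compile(r'^4:[0-9a-f]+:no:1$')

sample = '4:5c6fa6fe000000040000:no:1'
print(bool(CLEANRUV_RE.match(sample)))  # → True
```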
Comment from firstyear (@Firstyear) at 2019-02-25 02:53:47 @aadhikari I meant to put that into the code so we can see what you are trying to do and why... |
Comment from firstyear (@Firstyear) at 2019-02-25 02:54:26
Yes. Also, I think the errorlog isn't buffered now that I check, only the access log is, so I'm still not sure this sleep is needed. |
Comment from aadhikari at 2019-02-25 05:55:27
|
Comment from aadhikari at 2019-02-25 05:58:04
@Firstyear yeah, there is no error log buffer in the source. I talked to @droideck about this issue and even he can't find it in the source. Above are some observations, please have a look. I didn't find any solution; personally not a fan of sleep :) |
Comment from firstyear (@Firstyear) at 2019-02-25 06:07:14 Reading log.c in the main server core, errorlog is not buffered. However, that doesn't mean that the content is flushed to disk after an operation. The right solution could be to fsync on the errorlog after a write, but that imposes a performance penalty. But sleep doesn't fix the issue either. Perhaps we need to amend the lib389 errorlog parsing function to fsync before read? |
Comment from aadhikari at 2019-02-25 06:16:17
@Firstyear sleep gives us a little time to read the content; also, it's only failing when we search for ".Original task deletes Keep alive entry .4.". If you run the code you will see these logs are generated after a long pause. |
Comment from firstyear (@Firstyear) at 2019-02-25 06:20:39 Let me think about this for a bit. I'm wondering if the issue is some async replication behaviour and if there is another way to wait on it. |
Comment from aadhikari at 2019-02-25 08:51:15 rebased onto 86359a5f69e4874b74bdcfd433fd2581ed3175b1 |
Comment from firstyear (@Firstyear) at 2019-02-26 03:16:38 May be better to put the sleep after the .match()? So it checks "as soon" as possible.
That should give you a sleep timer that grows each iteration.
The idea being you do many quicker checks to start, then "back off" if it's taking longer. |
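The back-off idea described above can be sketched in plain Python (the helper name and parameters are assumptions for illustration, not lib389 API):

```python
import time

def wait_until(predicate, attempts=10, delay=0.1, factor=2):
    """Poll predicate, checking quickly at first and backing off
    (the delay grows by `factor` each try) if it's taking longer."""
    for _ in range(attempts):
        if predicate():
            return True
        time.sleep(delay)
        delay *= factor
    return False
```

A test would then call something like `wait_until(lambda: pattern.search(errlog.read()))` instead of a fixed sleep.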
Comment from firstyear (@Firstyear) at 2019-02-26 03:22:03 Hmmm, now I think about it, you should not use this cleanAllRuv, but you should use the lib389 CleanAllRUVTask instead. 💯
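The task example that originally accompanied this comment was stripped in the clone; a hedged sketch of using the task object might look like this (assumes lib389, a live supplier instance bound as `supplier`, and that replica ID 4 is the one being cleaned):

```python
# Sketch only: needs a running 389-ds supplier instance.
from lib389.tasks import CleanAllRUVTask
from lib389._constants import DEFAULT_SUFFIX

task = CleanAllRUVTask(supplier)
task.create(properties={
    'replica-id': '4',
    'replica-base-dn': DEFAULT_SUFFIX,
    'replica-force-cleaning': 'no',
})
task.wait()  # block until the task finishes instead of sleeping
```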
Look at your pytest warnings: You should have a bunch of them from the cleanruv task as it uses deprecated lib389 apis. The above example should work without warnings. |
Comment from firstyear (@Firstyear) at 2019-02-26 03:23:01 Why not set a timeout on the task and use with pytest.raises()? |
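The pytest.raises() pattern looks like this (the failing operation and the exception type are generic stand-ins, not the actual lib389 calls):

```python
import pytest

def abort_running_task():
    # Stand-in for an operation the server is expected to reject.
    raise RuntimeError("task is still running")

def test_abort_rejected():
    # The test passes only if the expected exception is raised,
    # and fails if the block completes without raising.
    with pytest.raises(RuntimeError):
        abort_running_task()
```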
Comment from aadhikari at 2019-02-26 07:43:25 rebased onto 8db9c5cf6f2f926a39a2d7a87f8726253fa9d46d |
Comment from aadhikari at 2019-02-26 07:45:33 @Firstyear Changes have been made, but now the whole test is taking more time to complete. |
Comment from firstyear (@Firstyear) at 2019-02-26 07:52:36 Can you debug and print what timing event was the one taken for the error log? |
Comment from aadhikari at 2019-02-27 07:49:27
|
Comment from tbordaz (@tbordaz) at 2019-02-27 09:58:00 My understanding is that replica.present will do an active lookup (up to 20 SRCH on the replica) without pause. |
Comment from aadhikari at 2019-02-28 07:49:45
@tbordaz pause as in a time.sleep? |
Comment from tbordaz (@tbordaz) at 2019-02-28 09:55:35 @aadhikari, yes. It is very convenient to put some delays during specific steps of a testcase. So far having to wait 1s instead of a few msec was not a big deal, but as many testcases do the same, in the end it ends up being slow. |
Comment from firstyear (@Firstyear) at 2019-03-01 02:54:11 You can put floats in Python for the sleep, i.e. 0.5. Generally though, I think it would be great if we could find a way to "block until the task is ready for check" to avoid sleeps. |
Comment from aadhikari at 2019-03-01 05:27:09
|
Comment from aadhikari at 2019-03-18 12:21:34 rebased onto d8936be2785c3e722de13f8b63edc7270e327b0a |
Comment from aadhikari at 2019-03-18 12:23:57 @Firstyear @tbordaz Changes have been made accordingly. |
Comment from firstyear (@Firstyear) at 2019-03-19 02:52:13 looks good from here, @tbordaz if you are happy, ack and merge thanks :) |
Comment from aadhikari at 2019-03-25 10:27:41 rebased onto 28a5ddb |
Comment from tbordaz (@tbordaz) at 2019-03-25 10:31:26 @aadhikari great job. ACK |
Comment from tbordaz (@tbordaz) at 2019-03-25 10:35:54 Pull-Request has been merged by tbordaz |
Cloned from Pagure Pull-Request: https://pagure.io/389-ds-base/pull-request/50222
Bug Description: When cleanAllRuv is launched, it spawns cleanAllRuv on all replicas.
Each replica will clean its changelog and database RUV, but in addition
will DEL the keep alive entry of the target ReplicaID.
Fix Description: The test case covers all the scenarios to be tested for the fix.
Resolves: #2522
Review by: ??