Is ldbm_txn_ruv_modify_context still required? #564
Comments
Comment from tbordaz (@tbordaz) at 2013-02-06 21:51:18 == Here is the current status ==
== Here are the next steps ==
Comment from tbordaz (@tbordaz) at 2013-02-07 02:18:17 Review of the fix for ticket564
Comment from rmeggins (@richm) at 2013-02-08 02:14:07 The code to detect and handle the RUV in the disorderly shutdown case looks good. I would change this:
Not sure what the current rules are for using C++ style comments in C code, but to maximize portability, please use C style comments. As for the modify_context code: if we are getting rid of that, I would like to see a better way to make sure we update the database RUV inside the operation transaction.
Comment from tbordaz (@tbordaz) at 2013-02-08 21:26:55 Thanks for the review. I will modify the disorderly shutdown part accordingly. Since 543633 (1.2.8), the database RUV update is done inside the op txn. In order to be sure that the update of the RUV stays inside the op txn, we may
Comment from tbordaz (@tbordaz) at 2013-02-08 21:41:48 Oops, I forgot... another option is to create a configuration parameter that enables/disables the update of the RUV entry in the op txn.
Comment from nhosoi (@nhosoi) at 2013-02-09 06:05:10 I ran quite an intensive test using servers with your patch applied.
Comment from lkrispen (@elkris) at 2013-02-11 21:42:13 Noriko, did you also try to repeatedly kill M1, to see if it will correctly recreate its ruv and keep the changelog?
Comment from nhosoi (@nhosoi) at 2013-02-11 22:33:40 Replying to [comment:10 elkris]:
I'd say that side is not tested enough. I killed it just a couple of times, since I had to restart ldclt each time. I have to ask Thierry to do more testing on it... ;)
Comment from tbordaz (@tbordaz) at 2013-02-13 20:50:08 Noriko, I want to thank you so much for the extensive tests you did. I know how difficult it is to test this corner case, and the effort it requires to run/check such tests for a long time. I sincerely appreciate your help. About the tests Ludwig suggested: I did them during the unit tests, so I do not have a real available test case. I instrumented the code to trigger a crash (on demand) during a backend update (ldbm_back_modify). The crash occurred after txn_open and the updates of DB+index, but before the txn commit. A test I did not do: M1 under ldclt load, stop M2, keep M1 under ldclt load, kill M1. best regards
Comment from tbordaz (@tbordaz) at 2013-02-14 22:05:19 == Here is the current status ==
== Here are the next steps ==
Comment from tbordaz (@tbordaz) at 2013-02-27 22:48:22 '''Here is the current status'''
'''Here are the next steps'''
Comment from tbordaz (@tbordaz) at 2013-03-01 20:10:20 '''Here is the current status'''
'''Here are the next steps'''
Comment from tbordaz (@tbordaz) at 2013-03-26 21:05:43 '''Here are the next steps'''
Comment from tbordaz (@tbordaz) at 2013-04-24 18:16:21 '''Here is the current status'''
The script called is /usr/bin/repl-monitor.pl. It retrieves all the replicas on a given instance and, for each of them, searches for the replication status with the following lookup:
'''Here are the next steps'''
Comment from tbordaz (@tbordaz) at 2013-04-24 21:34:19 Here is the current status
Here are the next steps
Comment from rmeggins (@richm) at 2013-04-24 23:53:56 Replying to [comment:20 tbordaz]:
That's not good. What if a customer has custom-written monitoring apps that use this RUV entry?
Comment from tbordaz (@tbordaz) at 2013-04-25 19:28:01
Comment from tbordaz (@tbordaz) at 2013-06-05 17:48:43 attachment
Comment from tbordaz (@tbordaz) at 2013-06-07 22:51:23
git merge ticket564
git push origin master
commit 7f5268f
Comment from tbordaz (@tbordaz) at 2017-02-11 23:09:16 Metadata Update from @tbordaz:
Cloned from Pagure issue: https://pagure.io/389-ds-base/issue/564
It was introduced in the fix for 543633, to fix this situation:
The incremental protocol will continue once tickled as described above, but the last change made before the crash will be missing, and this is not detected by the protocol.
But with the current code, the changelog is written inside the same transaction as the changes to the database, so there cannot be an inconsistent state after recovery: either both are recovered or both are undone.
It should also be verified if part 1) of the original bug description is still true:
In general we need to find a way to ensure consistency of the database ruv with the applied changes and the changelog after a (disorderly) shutdown: whether it should be regularly updated from the replica ruv by the housekeeping thread, or whether it is possible without performance loss inside the modify operation.
It should also be considered whether a solution relies on serialization by the backend lock, or whether it is valid if changes can be applied in parallel.