osd: auto repair EC pool #6196
@guangyy the failure from the bot is not very informative. You can repeat it by running ./run-make-check.sh locally. It will then be easier to figure out which test is responsible for the timeout.
@dachary , yep, I will do that (looks like it was caused by the new code change). Will update once I have more information.
state_set(PG_STATE_REPAIR);
scrubber.must_repair = false;
I think this line should stay.
I removed it because later, when scrubbing finishes, I would like to distinguish whether the repair request was manual or automatic. In the former case we always go ahead and fix the corruptions (via recovery); in the latter we apply a threshold and cancel the repair for manual triage (e.g. if there are lots of corruptions in a PG, there might be a broken disk that needs replacement).
I didn't find a better way than using this flag for this purpose (adding a new flag should work but seems like overkill). Do you see any potential problem with using this flag this way?
I'd rather keep must_scrub, must_deepscrub, and must_repair functioning consistently with each other. All you need to do is make sure that auto_repair is set to false when doing a must_repair, so that later if auto_repair is true you know it can only be an auto_repair.
Good point! Let me do that then. Thanks.
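A minimal sketch of the invariant agreed on in this thread, assuming a simplified stand-in for the scrubber state. The struct `ScrubFlags` and the helpers `request_manual_repair` and `request_auto_repair` are hypothetical names used only for illustration; only the flag names `must_repair` and `auto_repair` come from the discussion, and this is not the code in the PR.

```cpp
// Hypothetical stand-in for the scrubber's flags; not Ceph's real struct.
struct ScrubFlags {
  bool must_repair = false;  // an operator explicitly requested a repair
  bool auto_repair = false;  // a scheduled scrub is allowed to repair on its own
};

// Operator-requested repair: clear auto_repair so the two flags never overlap.
inline void request_manual_repair(ScrubFlags& f) {
  f.must_repair = true;
  f.auto_repair = false;
}

// Scheduled scrub with auto repair enabled: it only applies when no manual
// repair is pending, so auto_repair == true can only mean an automatic repair.
inline void request_auto_repair(ScrubFlags& f) {
  if (!f.must_repair) {
    f.auto_repair = true;
  }
}
```

With this arrangement `must_repair` keeps behaving like `must_scrub` and `must_deepscrub`, and later code can test `auto_repair` alone to tell the two kinds of request apart.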
To be honest I'm not sure I like adding this complication to the scrubbing stuff. In the next release (Jewel) there are going to be improvements to save the objects with scrub errors and allow an administrator to control each individual repair. This change runs counter to that approach.
Now that I see that this is for EC only, I can see the benefit.
}

// the part that actually finalizes a scrub
void PG::scrub_finish()
{
  bool repair = state_test(PG_STATE_REPAIR);
  // if the repair request comes from auto-repair and large number of errors,
  // we would like to cancel auto-repair
  if (repair && !scrubber.must_repair
if (repair && scrubber.auto_repair...
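The check under review can be sketched as a small decision function. This is an illustration under stated assumptions, not the PR's implementation: `RepairDecision`, `decide_repair`, and the `max_auto_repair_errors` parameter are hypothetical names, and the exact comparison used in the real code is not shown in this conversation.

```cpp
// Hypothetical outcome of the check made when a scrub finishes.
enum class RepairDecision { NoRepair, Repair, CancelAutoRepair };

// `repair` mirrors state_test(PG_STATE_REPAIR); `auto_repair` mirrors
// scrubber.auto_repair. `max_auto_repair_errors` is an assumed threshold
// parameter, not a real Ceph option name.
inline RepairDecision decide_repair(bool repair, bool auto_repair,
                                    unsigned errors,
                                    unsigned max_auto_repair_errors) {
  if (!repair) {
    return RepairDecision::NoRepair;
  }
  // Too many errors on an automatic repair: leave it for manual triage.
  if (auto_repair && errors > max_auto_repair_errors) {
    return RepairDecision::CancelAutoRepair;
  }
  // Manual requests, and automatic ones under the threshold, proceed.
  return RepairDecision::Repair;
}
```

A manual repair is never cancelled by the threshold, which matches the intent described earlier in the thread.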
Hi @dzafman , updated according to the review comments, please review again. Thanks!
LGTM. What about testing?
Thanks @dzafman , I tested locally (with a vstart cluster), but I didn't find a good way to automate the test and put it into 'make check'. The main challenge is that I couldn't find a way to inject the scrub settings and have them take effect immediately: with the default settings, the deep scrub is scheduled a week later, and injecting settings (scrub_interval, etc.) does not change that. Any suggestions?
Set osd_scrub_min_interval, osd_scrub_max_interval and osd_deep_scrub_interval to the same value. Also set osd_scrub_interval_randomize_ratio to 0 to turn off randomization. Setting the intervals to, say, 120 (or even less) should cause a deep scrub on a test setup to occur every 2 minutes.
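The suggestion above could be written as a ceph.conf fragment along these lines. This is a sketch for a throwaway test cluster, assuming the options are placed in the [osd] section; the value 120 is the example from the comment, not a recommended production setting.

```ini
[osd]
# Assumed test-only values: identical intervals, no randomization
osd_scrub_min_interval = 120
osd_scrub_max_interval = 120
osd_deep_scrub_interval = 120
osd_scrub_interval_randomize_ratio = 0
```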
…crubbing Signed-off-by: Guang Yang <yguang@yahoo-inc.com>
Fixes: #12754 Signed-off-by: Guang Yang <yguang@yahoo-inc.com>
Signed-off-by: Guang Yang <yguang@yahoo-inc.com>
Signed-off-by: Guang Yang <yguang@yahoo-inc.com>
Hi @dzafman , test cases have been added for the change, and 'make check' passes locally on my host.
@liewegas I actually ran this in wip-zafman-testing but didn't mark it.
osd: auto repair EC pool Reviewed-by: David Zafman <dzafman@redhat.com>
Fixes: #12754
Signed-off-by: Guang Yang <yguang@yahoo-inc.com>