osd: auto repair EC pool #6196

Merged
merged 4 commits into from Oct 20, 2015

Conversation

guangyy
Contributor

@guangyy guangyy commented Oct 7, 2015

Fixes: #12754

Signed-off-by: Guang Yang yguang@yahoo-inc.com

@ghost

ghost commented Oct 8, 2015

@guangyy the failure from the bot is not very informative. You can repeat it by running ./run-make-check.sh locally. It will then be easier to figure out which test is responsible for the timeout.

@guangyy
Contributor Author

guangyy commented Oct 8, 2015

@dachary , yep, I will do that (looks like it was caused by the new code change). Will update once I have more information.

state_set(PG_STATE_REPAIR);
scrubber.must_repair = false;
Contributor

I think this line should stay.

Contributor Author

I removed it because later, when scrubbing finishes, I would like to distinguish whether the repair request was manual or automatic. In the former case we always go ahead and fix the corruptions (via recovery); in the latter case we apply a threshold and cancel the repair for manual triage (e.g. if there are lots of corruptions in a PG, there might be a broken disk that needs replacement).

I didn't find a better way than using this flag for this purpose (adding a new flag should work but seems overkill). Do you see any potential problem with using this flag for such a purpose?

Contributor

I'd rather keep must_scrub, must_deepscrub, and must_repair functioning consistently with each other. All you need to do is make sure that auto_repair is set to false when doing a must_repair, so that later if auto_repair is true you know it can only be an auto_repair.

Contributor Author

Good point! Let me do that then. Thanks.
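
The arrangement agreed on in this thread can be sketched roughly as follows. This is a simplified illustration, not the code from the PR: only the flag names `must_repair` and `auto_repair` come from the discussion, while `ScrubFlags`, `request_manual_repair`, `request_auto_repair`, `keep_repair`, and `max_auto_repair_errors` are hypothetical names introduced here.

```cpp
// Hypothetical, simplified stand-in for the scrubber state discussed above.
// Only the names must_repair and auto_repair come from the thread.
struct ScrubFlags {
  bool must_repair = false;  // an operator explicitly requested a repair
  bool auto_repair = false;  // the repair was triggered automatically
};

// Manual request: must_repair stays set (consistent with must_scrub and
// must_deepscrub) and auto_repair is forced to false.
inline ScrubFlags request_manual_repair() {
  ScrubFlags f;
  f.must_repair = true;
  f.auto_repair = false;
  return f;
}

// Automatic request: only auto_repair is set.
inline ScrubFlags request_auto_repair() {
  ScrubFlags f;
  f.auto_repair = true;
  return f;
}

// Decision at the end of a scrub. `repair` plays the role of the
// PG_STATE_REPAIR test; `max_auto_repair_errors` is an assumed threshold.
// An automatic repair is cancelled when too many errors were found, so an
// operator can triage it; a manual repair always goes ahead.
inline bool keep_repair(bool repair, const ScrubFlags& f,
                        int errors_found, int max_auto_repair_errors) {
  if (repair && f.auto_repair && errors_found > max_auto_repair_errors) {
    return false;
  }
  return repair;
}
```

Because a manual request clears `auto_repair`, seeing `auto_repair == true` at the end of the scrub can only mean the repair was automatic, so no extra flag is needed to tell the two cases apart.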

@dzafman
Contributor

dzafman commented Oct 8, 2015

To be honest I'm not sure I like adding this complication to the scrubbing stuff. In the next release (Jewel) there are going to be improvements to save the objects with scrub errors and allow an administrator to control each individual repair. This change runs counter to that approach.

@dzafman
Contributor

dzafman commented Oct 8, 2015

Now that I see that this is for EC only, I can see the benefit.

@dzafman dzafman self-assigned this Oct 8, 2015
@guangyy
Contributor Author

guangyy commented Oct 13, 2015

@dzafman , @tchaikov , thanks for the review:)

}

// the part that actually finalizes a scrub
void PG::scrub_finish()
{
bool repair = state_test(PG_STATE_REPAIR);
// if the repair request comes from auto-repair and large number of errors,
// we would like to cancel auto-repair
if (repair && !scrubber.must_repair
Contributor

if (repair && scrubber.auto_repair...

@guangyy
Contributor Author

guangyy commented Oct 14, 2015

Hi @dzafman , updated according to the review comments, please help to review again. Thanks!

@dzafman
Contributor

dzafman commented Oct 14, 2015

LGTM. What about testing?

@guangyy
Contributor Author

guangyy commented Oct 14, 2015

Thanks @dzafman. I tested locally (with a vstart cluster), but I didn't find a good way to automate the test and put it into 'make check'. The main challenge I came across is that I couldn't find a way to inject the scrub settings and have them take effect immediately (with the default settings, the deep scrub is scheduled for a week later, and injecting settings such as scrub_interval does not change that). Any suggestions?

@dzafman
Contributor

dzafman commented Oct 14, 2015

Set osd_scrub_min_interval, osd_scrub_max_interval and osd_deep_scrub_interval to the same value. Also set osd_scrub_interval_randomize_ratio to 0 to turn off randomization. Setting the intervals to, say, 120 (or even less) should cause a deep scrub on a test setup to occur every 2 minutes.
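
As a configuration fragment, the suggestion above might look like the following. The option names and the value 120 are taken from the comment; putting them in the `[osd]` section of a test cluster's ceph.conf is an assumption, and the same values could be supplied by whatever mechanism the test harness uses.

```ini
[osd]
# All three intervals share one value, in seconds (120 is the example above).
osd_scrub_min_interval = 120
osd_scrub_max_interval = 120
osd_deep_scrub_interval = 120
# 0 turns off randomization of the scrub schedule.
osd_scrub_interval_randomize_ratio = 0
```

These values are only sensible on a throwaway test cluster, since they make every PG deep scrub every couple of minutes.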

Guang Yang added 4 commits October 16, 2015 20:29
…crubbing

Signed-off-by: Guang Yang <yguang@yahoo-inc.com>
Fixes: #12754
Signed-off-by: Guang Yang <yguang@yahoo-inc.com>
Signed-off-by: Guang Yang <yguang@yahoo-inc.com>
Signed-off-by: Guang Yang <yguang@yahoo-inc.com>
@guangyy
Contributor Author

guangyy commented Oct 16, 2015

Hi @dzafman, test cases have been added for the change, and 'make check' passes locally on my host.

@dzafman
Contributor

dzafman commented Oct 20, 2015

@liewegas I actually ran this in wip-zafman-testing but didn't mark it.

dzafman added a commit that referenced this pull request Oct 20, 2015
osd: auto repair EC pool

Reviewed-by: David Zafman <dzafman@redhat.com>
@dzafman dzafman merged commit 6e002c6 into ceph:master Oct 20, 2015
5 participants