osd: auto repair EC pool #6196

guangyy · 2015-10-07T23:13:53Z

Signed-off-by: Guang Yang <yguang@yahoo-inc.com>

loic-bot · 2015-10-08T04:48:01Z

FAILURE: http://jenkins.ceph.dachary.org/job/ceph/8234/

FAILURE http://jenkins.ceph.dachary.org/job/ceph/LABELS=centos-7&&x86_64/8234/

ghost · 2015-10-08T06:28:12Z

@guangyy the failure from the bot is not very informative. You can reproduce it by running ./run-make-check.sh locally. It will then be easier to figure out which test is responsible for the timeout.

guangyy · 2015-10-08T16:07:05Z

@dachary , yep, I will do that (looks like it was caused by the new code change). Will update once I have more information.

dzafman · 2015-10-08T17:03:32Z

src/osd/PG.cc

    state_set(PG_STATE_REPAIR);
-    scrubber.must_repair = false;


I think this line should stay.

guangyy · 2015-10-13

I removed it because later, when the scrub finishes, I would like to distinguish whether the repair request was manual or automatic. In the former case we always go ahead and fix the corruptions (via recovery); in the latter case we apply a threshold and cancel the repair for manual triage (e.g. if there are many corruptions in a PG, there might be a broken disk that needs replacement).

I didn't find a better way than using this flag for this purpose (adding a new flag should work but seems like overkill). Do you see any potential problem with using the flag this way?

dzafman · 2015-10-13

I'd rather keep must_scrub, must_deepscrub, and must_repair functioning consistently with each other. All you need to do is make sure that auto_repair is set to false when doing a must_repair, so that later if auto_repair is true you know it can only be an auto_repair.

guangyy · 2015-10-13

Good point! Let me do that then. Thanks.
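
For readers following the thread, the arrangement agreed on here can be sketched roughly as below. This is a minimal illustration under stated assumptions, not the code from the PR: `ScrubFlags`, `begin_scrub`, and both parameter names are hypothetical, and only the `must_repair` and `auto_repair` flag names come from the discussion.

```cpp
// Hypothetical sketch; these are not the real Ceph PG/scrubber types.
struct ScrubFlags {
  bool must_repair;  // an operator explicitly requested the repair
  bool auto_repair;  // the repair was triggered automatically by a scrub
};

// Choose the flags when a scrub starts. A manual request always wins, so
// auto_repair can only end up true when must_repair is false.
constexpr ScrubFlags begin_scrub(bool operator_requested_repair,
                                 bool pool_allows_auto_repair) {
  return ScrubFlags{operator_requested_repair,
                    !operator_requested_repair && pool_allows_auto_repair};
}
```

With this split, `must_repair` keeps the same meaning as `must_scrub` and `must_deepscrub`, and later code can treat `auto_repair == true` as proof that nobody asked for the repair by hand.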

dzafman · 2015-10-08T18:14:15Z

To be honest I'm not sure I like adding this complication to the scrubbing stuff. In the next release (Jewel) there are going to be improvements to save the objects with scrub errors and allow an administrator to control each individual repair. This change runs counter to that approach.

dzafman · 2015-10-08T18:15:16Z

Now that I see that this is for EC only, I can see the benefit.

guangyy · 2015-10-13T18:22:15Z

@dzafman , @tchaikov , thanks for the review:)

dzafman · 2015-10-13T18:29:35Z

src/osd/PG.cc

 }

 // the part that actually finalizes a scrub
 void PG::scrub_finish() 
 {
  bool repair = state_test(PG_STATE_REPAIR);
+  // if the repair request comes from auto-repair and large number of errors,
+  // we would like to cancel auto-repair
+  if (repair && !scrubber.must_repair


if (repair && scrubber.auto_repair...
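
A rough sketch of the check this comment points toward, assuming the cancellation is keyed on `auto_repair` and an error-count threshold. The function name `should_repair` and the parameters `num_errors` and `max_auto_repair_errors` are hypothetical stand-ins; the thread does not show the final condition or the name of the threshold option.

```cpp
// Hypothetical sketch of the decision made when a scrub finishes.
// Returns true if the repair should go ahead.
constexpr bool should_repair(bool repair_state,   // PG is in the repair state
                             bool auto_repair,    // repair was automatic
                             unsigned num_errors, // errors found by the scrub
                             unsigned max_auto_repair_errors) {
  // A manual repair always proceeds. An automatic repair is cancelled when
  // the scrub found more errors than the threshold, leaving it for triage.
  return repair_state &&
         !(auto_repair && num_errors > max_auto_repair_errors);
}
```

Testing `auto_repair` directly, rather than `!must_repair`, keeps the cancellation from applying to any repair an operator asked for.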

guangyy · 2015-10-14T21:48:32Z

Hi @dzafman , I updated the change according to the review comments. Please take another look. Thanks!

dzafman · 2015-10-14T23:13:01Z

LGTM. What about testing?

guangyy · 2015-10-14T23:38:15Z

Thanks @dzafman. I tested locally (with a vstart cluster), but I didn't find a good way to automate the test and put it into 'make check'. The main challenge is that I couldn't find a way to inject the scrub settings and have them take effect immediately: with the default settings the deep scrub is scheduled a week out, and injecting settings (scrub_interval, etc.) does not change that. Any suggestions?

dzafman · 2015-10-14T23:57:38Z

Set osd_scrub_min_interval, osd_scrub_max_interval and osd_deep_scrub_interval to the same value. Also set osd_scrub_interval_randomize_ratio to 0 to turn off randomization. Setting the intervals to, say, 120 (or even less) should cause a deep scrub on a test setup to occur every 2 minutes.
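
As a sketch, those settings could be written as a ceph.conf fragment for a throwaway test cluster. The option names are the ones listed above; placing them under `[osd]` and the value 120 are assumptions for illustration, and the values are not suitable for a production cluster.

```ini
[osd]
# Test-only values (assumed): deep scrub roughly every 2 minutes
osd_scrub_min_interval = 120
osd_scrub_max_interval = 120
osd_deep_scrub_interval = 120
# Disable randomization so the schedule is predictable
osd_scrub_interval_randomize_ratio = 0
```

Setting these in the configuration before the OSDs start avoids the problem described earlier, where injected settings did not affect a scrub that was already scheduled.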

guangyy · 2015-10-16T20:36:49Z

Hi @dzafman , test cases have been added for the change, and 'make check' passes locally on my host.

loic-bot · 2015-10-16T23:09:32Z

SUCCESS: http://jenkins.ceph.dachary.org/job/ceph/8434/

dzafman · 2015-10-20T18:08:24Z

@liewegas I actually ran this in wip-zafman-testing but didn't mark it.

dzafman reviewed Oct 8, 2015

dzafman self-assigned this Oct 8, 2015

tchaikov added feature core labels Oct 9, 2015

dzafman reviewed Oct 13, 2015

Guang Yang added 4 commits October 16, 2015 20:29

pg: only queue for recovery if there is any objects to repair after s…

1079636

…crubbing Signed-off-by: Guang Yang <yguang@yahoo-inc.com>

pg: add auto-repair for EC pool

8c8e1b7

Fixes: #12754 Signed-off-by: Guang Yang <yguang@yahoo-inc.com>

test: add integration test for the auto repair feature

fde458a

Signed-off-by: Guang Yang <yguang@yahoo-inc.com>

osd: off-by-one when check deep scrubbing

e826213

Signed-off-by: Guang Yang <yguang@yahoo-inc.com>

dzafman added the needs-qa label Oct 19, 2015

liewegas added the wip-sage-testing label Oct 20, 2015

dzafman removed needs-qa wip-sage-testing labels Oct 20, 2015

dzafman added a commit that referenced this pull request Oct 20, 2015

Merge pull request #6196 from guangyy/wip-12754

6e002c6

osd: auto repair EC pool Reviewed-by: David Zafman <dzafman@redhat.com>

dzafman merged commit 6e002c6 into ceph:master Oct 20, 2015
