
Handle EC recovery read errors #9304

Merged: 9 commits into ceph:master from wip-13937 on Oct 28, 2016

Conversation

@dzafman (Contributor) commented May 24, 2016:

I end up with an active+degraded PG that doesn't repair if I remove an object on all nodes in a way where it gets recognized as missing. With this fix the OSD doesn't crash. I still need to figure out how to get recovery to finish. I assume I can have an error that will have to be discovered later by scrubbing.

@tchaikov (Contributor) commented:

@dzafman is this supposed to be a (not yet completed) fix for http://tracker.ceph.com/issues/13937?

dzafman added a commit to dzafman/ceph that referenced this pull request on May 25, 2016
Fixes: ceph#9304

Signed-off-by: David Zafman <dzafman@redhat.com>
@dzafman (Contributor, Author) commented May 25, 2016:

@tchaikov I updated the proper commit to show that this pull request does indeed try to address 13937.

@dzafman (Contributor, Author) commented May 25, 2016:

@athanatos Any suggestions on how to move forward are appreciated.

dzafman added a commit to dzafman/ceph that referenced this pull request on May 26, 2016
Fixes: ceph#9304

Signed-off-by: David Zafman <dzafman@redhat.com>
@dzafman (Contributor, Author) commented May 26, 2016:

@athanatos I updated to make failed_push() common. I did the following to create the scenario:

- PG state: active+degraded
- ./rados -p ecpool rm benchmark_data_TrustyTahr_30314_object286 (lost object), interrupted with ^C
- ceph pg #.# scrub (waiting for clean): the scrub didn't start
- rados -p ecpool cleanup: hung or was very slow, interrupted with ^C
- ceph pg #.# mark_unfound_lost delete
- log: "pg has 1 objects unfound and apparently lost marking"
- PG state: active+clean+scrubbing

There are 9 removes "waiting for scrub", and some removes completed even with the PG going through this. The remove of the lost object itself is "waiting for missing object", but it should have been cleaned up by "mark_unfound_lost delete".

@athanatos (Contributor) commented:

@dzafman I'm not sure what you mean by the last part. Do you mean that the requests stayed hung?

@dzafman (Contributor, Author) commented May 26, 2016:

@athanatos Yes, they show up in the dump_ops_in_flight output.

@athanatos (Contributor) commented:

@dzafman Sounds like a bug in master to me, probably worth tracking down.

@tchaikov self-assigned this on Jun 3, 2016
dzafman added a commit to dzafman/ceph that referenced this pull request on Jun 3, 2016
Fixes: ceph#9304

Signed-off-by: David Zafman <dzafman@redhat.com>
dzafman added a commit that referenced this pull request on Jun 9, 2016
Fixes: #9304

Signed-off-by: David Zafman <dzafman@redhat.com>
@dzafman force-pushed the wip-13937 branch 2 times, most recently from 3ae9b08 to 48e034b, on June 9, 2016 at 20:05
    @@ -500,6 +500,8 @@ void ECBackend::continue_recovery_op(
          assert(!op.recovery_progress.first);
          dout(10) << __func__ << ": canceling recovery op for obj " << op.hoid
                   << dendl;
    +     // XXX: Doing this on the read error handling just loops in recovery
    +     // How would this case ever actually happen?

@athanatos (Contributor) commented on this diff, Jul 22, 2016:
You can't just use cancel_pull() in the read error handling because you know something the parent doesn't: that some of the sources are faulty. Here, I think we are failing because a source went down (not in the up or acting set, or it would have triggered peering to restart) and we no longer have enough copies. I think it's ok to just cancel the recovery op here because the parent would already have updated its missing_loc etc when it got the log update. Either the object later becomes recoverable again and recovery will retry or not in which case it will remain unfound.

@dzafman (Contributor, Author) replied:

You added this code in d9106ce to fix tracker 8161. I tried to use an equivalent version in OnRecoveryReadComplete::finish() only to discover that it didn't work. I'll have to look into whether I can do something like my new _failed_push() to handle this case too.

@athanatos (Contributor) replied:

Yep, that's what I was talking about. Like I said, here, it's ok because ReplicatedPG/PG already knows that the source died and won't retry until the object is found again. You need to call something which updates missing_loc (which I think you do earlier in this series with _failed_push).

Is this code causing a bug?
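
To make the shape of the fix concrete, here is a minimal, self-contained sketch of the failed-push flow the reviewers describe: on a read error, the backend reports the failing shards to the parent, which drops them from missing_loc so recovery either retries from the remaining locations or leaves the object unfound. Every type and name below (pg_shard_t, hobject_t, MissingLoc, RecoveryParent) is a simplified stand-in for the Ceph classes, not the merged code:

    // Hedged sketch of the failed-push flow; simplified stand-in types.
    #include <list>
    #include <map>
    #include <set>
    #include <string>
    #include <tuple>

    struct pg_shard_t {
      int osd, shard;
      bool operator<(const pg_shard_t& o) const {
        return std::tie(osd, shard) < std::tie(o.osd, o.shard);
      }
    };
    using hobject_t = std::string;

    struct MissingLoc {
      // object -> shards still believed to hold a readable copy
      std::map<hobject_t, std::set<pg_shard_t>> locs;
      void remove_location(const hobject_t& oid, const pg_shard_t& shard) {
        locs[oid].erase(shard);
      }
      bool recoverable(const hobject_t& oid) const {
        auto it = locs.find(oid);
        return it != locs.end() && !it->second.empty();
      }
    };

    struct RecoveryParent {
      MissingLoc missing_loc;
      std::set<hobject_t> recovering;

      // On a recovery read error, stop tracking the op and drop the shards
      // that returned errors from missing_loc. Recovery can retry from the
      // remaining locations; if none remain, the object stays unfound until
      // a good copy reappears.
      void failed_push(const std::list<pg_shard_t>& from, const hobject_t& oid) {
        recovering.erase(oid);
        for (const auto& shard : from)
          missing_loc.remove_location(oid, shard);
      }
    };

    int main() {
      RecoveryParent pg;
      hobject_t oid = "obj";
      pg.recovering.insert(oid);
      pg.missing_loc.locs[oid] = { {0, 0}, {1, 1} };
      pg.failed_push({ {0, 0} }, oid);  // the shard on osd.0 hit a read error
      return pg.missing_loc.recoverable(oid) ? 0 : 1;  // still readable via osd.1
    }

Routing the error through the parent rather than cancelling locally is exactly the asymmetry described above: the backend knows which sources returned errors, and only the parent can record that in missing_loc.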

@dzafman (Contributor, Author) replied:

I get it now. I didn't see this code fail in my testing. I'll remove the comment.

@athanatos (Contributor) replied:

K. It probably wouldn't hurt to add an explanatory comment if you want to. I had to spend 20 minutes trying to figure out why it didn't get handled by check_recovery_sources...
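
For reference, the explanatory comment being asked for might read something like this (a suggestion consistent with the discussion above, not the text that was actually committed):

    // Cancelling the recovery op here is safe: the parent already learned
    // via the log update that the source went away and has updated its
    // missing_loc, so it will not retry until the object is found again.
    // Either the object later becomes recoverable and recovery retries,
    // or it remains unfound.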

@tchaikov (Contributor) commented:

reviewed.

@dzafman force-pushed the wip-13937 branch 2 times, most recently from 9b0191f to 82cdb93, on October 12, 2016 at 23:33
    {
      assert(recovering.count(soid));
      recovering.erase(soid);
      missing_loc.remove_location(soid, from);
      for (list<pg_shard_t>::const_iterator i = from.begin(); i != from.end(); ++i)

Review comment (Contributor):
could use range-based loop.
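
A range-based version of the loop above might look like the following; the loop body is an assumption (the original is truncated here), chosen to match the failed_push discussion earlier in the review:

    for (const auto& shard : from)
      missing_loc.remove_location(soid, shard);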

    recovery_ops.erase(hoid);

    list<pg_shard_t> fl;
    for (map<pg_shard_t, int>::iterator i = res.errors.begin();

Review comment (Contributor):
could use range-based loop.
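
Similarly for the map iteration; a sketch that assumes the loop collects the shards that reported errors into fl, which fits the surrounding failed-push flow:

    list<pg_shard_t> fl;
    for (const auto& err : res.errors)
      fl.push_back(err.first);  // err.first is the pg_shard_t that returned an error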

@athanatos (Contributor) commented:

I think this looks ok to me

Commit messages in this pull request:

- This reverts commit 5bc5533.
  Conflicts:
  	src/test/Makefile.am (no longer exists)
  	src/test/erasure-code/Makefile.am (no longer exists)
  Signed-off-by: David Zafman <dzafman@redhat.com>

- Remove unused struct
  Signed-off-by: David Zafman <dzafman@redhat.com>

- Signed-off-by: David Zafman <dzafman@redhat.com>

- Fixes: http://tracker.ceph.com/issues/13937
  Signed-off-by: David Zafman <dzafman@redhat.com>

- Signed-off-by: David Zafman <dzafman@redhat.com>

- Signed-off-by: David Zafman <dzafman@redhat.com>

- Caused by: 70e000a
  Signed-off-by: David Zafman <dzafman@redhat.com>

- Low space broke test, saw "flags nearfull,pauserd,pausewr...."
  Signed-off-by: David Zafman <dzafman@redhat.com>

- Fix for broken test-erasure-code.sh and test-erasure-eio.sh
  Signed-off-by: David Zafman <dzafman@redhat.com>
@dzafman assigned ghost on Oct 28, 2016
@dzafman changed the title from "DNM: Handle EC recovery read errors" to "Handle EC recovery read errors" on Oct 28, 2016
@ghost commented Oct 28, 2016:

Reviewed-by: Loic Dachary <ldachary@redhat.com>

@tchaikov (Contributor) commented:

lgtm also.

@tchaikov merged commit e7291ce into ceph:master on Oct 28, 2016
@tchaikov (Contributor) commented:

@dzafman shall we backport this to jewel?

@dzafman deleted the wip-13937 branch on November 2, 2016 at 00:25
@smithfarm (Contributor) commented:

@dzafman Ping re: jewel backport. Users are asking about it.
