mon: Go into ERR state if multiple PGs are stuck inactive #7253
@@ -1 +1 @@
Subproject commit 47fbf8c6ae1fb4fca171ac86e98821a67fd32031
Oops, be careful that the submodule isn't changed.
Indeed. They always sneak in silently.
Hmm, is there any case where we wouldn't want to go to ERR if there are PGs stuck inactive? Or where only 1 inactive PG is okay, but 2 isn't? I'd lean toward just adding the ERR case unconditionally...
@liewegas I don't know. But I didn't want to change the current way Ceph behaves. I'm fine with going to ERR state if one or more PGs are inactive. But other use cases which are plain RADOS storage might not be? Maybe clusters with >1000 OSDs don't want this? We could always make it configurable and set the default to 1.
We could always make it configurable and set the default to 1.
That sounds good to me!
Should the option be |
OK, @liewegas I've set the default to 1. And good feedback @theanalyst, I changed the config option's name.
@wido Thanks! Also, the second commit is missing a Signed-off-by.
If >=X PGs are stuck inactive longer than 'mon_pg_stuck_threshold'
we go into ERR state.
This is useful for situations where one or more PGs stay stuck in
peering or undersized state due to an OSD failure.
RBD volumes can become fully unresponsive if one or more PGs are inactive.
Fixes: ceph#13923
Signed-off-by: Wido den Hollander <wido@42on.com>
@theanalyst I rebased and force-pushed. It's now one commit and it is also signed off.
Yep, looks good.
mon: go into ERR state if multiple PGs are stuck inactive
Reviewed-by: Sage Weil <sage@redhat.com>
Reviewed-by: Abhishek Lekshmanan <abhishek@suse.com>
If >=X PGs are stuck inactive longer than 'mon_pg_stuck_threshold'
we go into ERR state.
This is useful for situations where one or more PGs stay stuck in
peering or undersized state due to an OSD failure.
RBD volumes can become fully unresponsive if one or more PGs are inactive.
Fixes: #13923
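
For readers who want the rule from the description in executable form, here is a rough Python sketch of the decision it describes. It is not the monitor's actual C++ code. `stuck_threshold_s` stands in for 'mon_pg_stuck_threshold', and `min_inactive_for_err` is a hypothetical name for the "X" option discussed above (its real name is not shown in this thread). Falling back to a warning when fewer than X PGs are stuck is also an assumption.

```python
from dataclasses import dataclass
from enum import Enum


class Health(Enum):
    OK = "HEALTH_OK"
    WARN = "HEALTH_WARN"
    ERR = "HEALTH_ERR"


@dataclass
class PGState:
    pgid: str
    active: bool
    seconds_since_active: float  # how long the PG has not been active


def pg_health(pgs, stuck_threshold_s, min_inactive_for_err=1):
    """Return (health, stuck_pgs) for a list of PGState objects.

    stuck_threshold_s    -- stands in for 'mon_pg_stuck_threshold'
    min_inactive_for_err -- hypothetical name for the "X" in the description;
                            the thread agrees on a default of 1
    """
    # A PG counts as stuck inactive once it has been inactive for longer
    # than the threshold.
    stuck = [
        pg for pg in pgs
        if not pg.active and pg.seconds_since_active > stuck_threshold_s
    ]
    if not stuck:
        return Health.OK, stuck
    # ">= X PGs stuck inactive" escalates to ERR.
    if len(stuck) >= min_inactive_for_err:
        return Health.ERR, stuck
    # Assumption: below the limit, stuck PGs are only a warning.
    return Health.WARN, stuck


if __name__ == "__main__":
    # PG IDs and timings below are made-up illustration values.
    pgs = [
        PGState("1.0", active=True, seconds_since_active=0),
        PGState("1.1", active=False, seconds_since_active=900),
        PGState("1.2", active=False, seconds_since_active=30),
    ]
    health, stuck = pg_health(pgs, stuck_threshold_s=300)
    print(health.value, [pg.pgid for pg in stuck])
```

With the agreed default of 1, a single PG that stays inactive past the threshold is enough to report ERR; raising the limit would restore the more lenient behaviour that very large clusters might prefer.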