mon: Go into ERR state if multiple PGs are stuck inactive #7253

Merged
merged 1 commit into ceph:master from mon-err-stuck-pg on Jan 27, 2016

@wido wido commented Jan 15, 2016

If >=X PGs are stuck inactive longer than 'mon_pg_stuck_threshold'
we go into ERR state.

This is useful for situations where one or more PGs stay stuck in
peering or undersized state due to an OSD failure.

RBD volumes can become fully unresponsive if one or more PGs are inactive.

Fixes: #13923
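
The rule described above can be sketched roughly as follows. This is a minimal illustration, not the monitor's actual code: the `PG` record, the `stuck_inactive` and `health` helpers, and the `min_inactive` parameter (standing in for X) are all hypothetical names, and reporting WARN below the count is an assumption. The 300-second value is only an example input for the stuck threshold.

```python
from dataclasses import dataclass


# Hypothetical names throughout; this is not Ceph's monitor code.
@dataclass
class PG:
    pgid: str
    active: bool
    last_active: float  # time (in seconds) at which the PG was last active


def stuck_inactive(pgs, now, stuck_threshold):
    """Return PGs that have been inactive for longer than stuck_threshold seconds."""
    return [pg for pg in pgs
            if not pg.active and now - pg.last_active > stuck_threshold]


def health(pgs, now, stuck_threshold, min_inactive):
    """ERR once at least min_inactive PGs are stuck; WARN for fewer (assumed)."""
    stuck = len(stuck_inactive(pgs, now, stuck_threshold))
    if stuck == 0:
        return "HEALTH_OK"
    if stuck >= min_inactive:
        return "HEALTH_ERR"
    return "HEALTH_WARN"


pgs = [
    PG("1.0", active=True, last_active=1000.0),
    PG("1.1", active=False, last_active=500.0),  # inactive for 500 s
    PG("1.2", active=False, last_active=950.0),  # inactive for 50 s, not yet stuck
]
status = health(pgs, now=1000.0, stuck_threshold=300, min_inactive=1)
print(status)
```

Only PG 1.1 has been inactive for longer than the threshold here, so the result depends entirely on how `min_inactive` compares to a stuck count of one.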

@@ -1 +1 @@
Subproject commit 47fbf8c6ae1fb4fca171ac86e98821a67fd32031
Member

oops, be careful the submodule isn't changed

Member Author

Indeed. They always sneak in silently.

@liewegas
Member

Hmm, is there any case where we wouldn't want to go to ERR if there are PGs stuck inactive? Or where only 1 inactive PG is okay, but 2 aren't?

I'd lean toward just adding the err case unconditionally...

@wido
Member Author

wido commented Jan 16, 2016

@liewegas I don't know. But I didn't want to change the current way Ceph behaves. I'm fine with going to ERR state if one or more PGs are inactive. But other use cases, such as plain RADOS storage, might not want that?

Maybe clusters with >1000 OSDs don't want this?

We could always make it configurable and set the default to 1.
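
A rough sketch of that trade-off, assuming a hypothetical `severity` helper and an illustrative `min_inactive` parameter (neither is the actual option or code): with a default of 1, any stuck inactive PG yields ERR, which matches the unconditional behavior suggested earlier, while a larger cluster could raise the value. Returning WARN below the threshold is an assumption.

```python
# Hypothetical helper; the parameter name is illustrative, not the final option name.
def severity(stuck_count: int, min_inactive: int = 1) -> str:
    """Map a count of stuck inactive PGs to a health level."""
    if stuck_count <= 0:
        return "OK"
    # Below the threshold the sketch assumes a warning rather than an error.
    return "ERR" if stuck_count >= min_inactive else "WARN"


# Compare the default of 1 with a more tolerant setting of 5.
for count in (0, 1, 2, 5):
    print(count, severity(count), severity(count, min_inactive=5))
```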

@wido wido force-pushed the mon-err-stuck-pg branch 2 times, most recently from 9fa3419 to c5dd81d Compare January 16, 2016 12:02
@liewegas
Member

liewegas commented Jan 17, 2016 via email

@theanalyst
Member

Should the option be mon_pg_min_inactive_num or just mon_pg_min_inactive since this specifies a minimum number?

@wido
Member Author

wido commented Jan 19, 2016

Ok, @liewegas I've set the default to 1.

And good feedback @theanalyst, I changed the config option's name.

@theanalyst
Member

@wido Thanks! Also, the second commit is missing a Signed-off-by.

If >=X PGs are stuck inactive longer than 'mon_pg_stuck_threshold'
we go into ERR state.

This is useful for situations where one or more PGs stay stuck in
peering or undersized state due to an OSD failure.

RBD volumes can become fully unresponsive if one or more PGs are inactive.

Fixes: ceph#13923
Signed-off-by: Wido den Hollander <wido@42on.com>
@wido
Member Author

wido commented Jan 19, 2016

@theanalyst I rebased and force-pushed. It's now one commit and also signed off.

@theanalyst
Member

Yep, looks good

liewegas added a commit that referenced this pull request Jan 27, 2016
mon: go into ERR state if multiple PGs are stuck inactive

Reviewed-by: Sage Weil <sage@redhat.com>
Reviewed-by: Abhishek Lekshmanan <abhishek@suse.com>
@liewegas liewegas merged commit e19dd6a into ceph:master Jan 27, 2016