
osd: various changes for preventing internal ENOSPC condition #13425

Merged
merged 6 commits into from Mar 6, 2017

Conversation

dzafman
Contributor

@dzafman dzafman commented Feb 15, 2017

- Don't log "not implemented", since this should never be seen when properly configured
- Increase mon_osd_min_in_ratio to 75% for most clusters
- Add a warning when the difference in usage percent between OSDs exceeds mon_warn_osd_usage_percent (10%)

@tchaikov
Contributor

[ RUN      ] TestMockImageReplayer.DecodeError
/home/jenkins-build/build/workspace/ceph-pull-requests/src/tools/rbd_mirror/ImageReplayer.cc: In function 'rbd::mirror::ImageReplayer<ImageCtxT>::~ImageReplayer() [with ImageCtxT = librbd::{anonymous}::MockTestImageCtx]' thread 7fdd96cf1c40 time 2017-02-15 03:06:26.332065
/home/jenkins-build/build/workspace/ceph-pull-requests/src/tools/rbd_mirror/ImageReplayer.cc: 316: FAILED assert(m_replay_status_formatter == nullptr)
 ceph version 12.0.0-347-g1953a57 (1953a576ce6c5280ba29c6d5d8774ef9bc6321e2)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x10e) [0x7fdd8d1d865e]
 2: (()+0x3c3ac9) [0x7fdd96547ac9]
 3: (()+0x3c3ad9) [0x7fdd96547ad9]
 4: (rbd::mirror::TestMockImageReplayer::TearDown()+0x16) [0x7fdd965530f6]
 5: (void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*)+0x33) [0x7fdd966d99f3]
 6: (testing::Test::Run()+0x80) [0x7fdd966cd620]
 7: (testing::TestInfo::Run()+0x9e) [0x7fdd966cd6fe]
 8: (testing::TestCase::Run()+0xa5) [0x7fdd966cd805]
 9: (testing::internal::UnitTestImpl::RunAllTests()+0x248) [0x7fdd966cdab8]
 10: (testing::UnitTest::Run()+0x54) [0x7fdd966cdd74]
 11: (main()+0xee) [0x7fdd964f83ce]
 12: (__libc_start_main()+0xf5) [0x7fdd8bf5ff45]
 13: (()+0x38349a) [0x7fdd9650749a]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

https://jenkins.ceph.com/job/ceph-pull-requests/18471/consoleFull#-243017103d63714d2-c8d8-41fc-a9d4-8dee30be4c32

@dillaman, is this caused by an environmental issue, and hence a false alarm?

retest this please.

@dillaman

@tchaikov Yes -- no way that error could be related

OPTION(mon_osd_min_in_ratio, OPT_DOUBLE, .3) // min osds required to be in to mark things out
OPTION(mon_osd_max_small_cluster, OPT_INT, 10) // Maximum number of osds that constitutes a small cluster for min in ratio
OPTION(mon_osd_small_min_in_ratio, OPT_DOUBLE, .30) // min osds required to be in to mark things out in small cluster
OPTION(mon_osd_min_in_ratio, OPT_DOUBLE, .75) // min osds required to be in to mark things out
Member

@ktdreyer ktdreyer Feb 15, 2017

The new mon osd min in ratio default will need to be updated in doc/rados/configuration/mon-osd-interaction.rst. And it would be great to document all these new options as well.

Contributor Author

@ktdreyer Fixed the doc for the default of mon_osd_min_in_ratio, but we decided not to add any other configs.

Member

I don't understand why we would add configs but not document them. Why not document and add a huge warning?

Contributor Author

@ktdreyer The code doesn't look like that anymore. I commented on it because I did have to adjust the default for the config that did change. Look at e8fd3e2

Member

@dzafman d52105a adds mon_warn_osd_usage_percent without a doc change

Contributor Author

@ktdreyer Oops. Yes, thanks.

@dzafman dzafman force-pushed the wip-15912 branch 2 times, most recently from 38d853d to 586f4a3 Compare February 17, 2017 23:13
@dzafman dzafman changed the title Various changes for preventing internal ENOSPC condition [DNM] Various changes for preventing internal ENOSPC condition Feb 17, 2017
@dzafman
Contributor Author

dzafman commented Feb 17, 2017

@liewegas I added a fix for BlueStore and changed its error message.

@dzafman
Contributor Author

dzafman commented Feb 18, 2017

@ktdreyer There isn't an existing document where the information about mon_warn_osd_usage_percent could be added, so I filed http://tracker.ceph.com/issues/18986

@dzafman dzafman force-pushed the wip-15912 branch 2 times, most recently from 32e8a3c to 178bdd3 Compare February 20, 2017 17:58
detail->push_back(make_pair(HEALTH_WARN, ss.str()));
}
}

Member

PGMonitor is going to be turned off in luminous, so we probably want to add this to ceph-mgr (unless it's needed for a jewel backport... I'm thinking not?)

@@ -626,6 +626,9 @@ OPTION(osd_backfill_full_ratio, OPT_FLOAT, 0.85)
// Seconds to wait before retrying refused backfills
OPTION(osd_backfill_retry_interval, OPT_DOUBLE, 10.0)

// Seconds to wait before retrying refused backfills
Member

s/backfills/recovery/

src/osd/OSD.h Outdated
// -- Recovery Request Scheduling --
Mutex recovery_request_lock;
SafeTimer recovery_request_timer;

Member

We should be able to reuse the existing timer and lock (maybe renamed recovery_* instead of backfill_*).

Contributor

Agreed, no real reason not to.

assert(0);
}
}

Member

This doesn't seem like enough. We've already gotten the reservation... what we really need is a way to abort recovery or backfill. Until we have that figured out, I don't think this patch makes sense.

Contributor

Yeah, you'll need to wire up those events to release the reservations etc.

@liewegas
Member

I made some changes to the way these fullness thresholds are handled in #13615. I don't think anything conflicts, but it might make sense to have a discussion about how the recovery/backfill aborts should work in the context of that change. Basically, I'm wondering whether OSDs should either (1) set flags for "too full for backfill" or similar in the OSDMap (similar to nearfull), or (2) have the recovery and backfill reservation requests include an estimate of the size of the PG. (The latter sounds nicer, except that we don't account for omap space utilization, so it's not a complete solution.)

@dzafman
Contributor Author

dzafman commented Feb 24, 2017

@liewegas For item (1): on the one hand, it would be nice if nearfull (which defaults to 85%) were the same as "too full for backfill" (which also defaults to 85%), so we wouldn't need another indicator; on the other hand, it might be better to be able to configure backfill separately, in which case it would be nice to know when we are too full to start a backfill.

Signed-off-by: David Zafman <dzafman@redhat.com>
Signed-off-by: David Zafman <dzafman@redhat.com>
@liewegas
Member

liewegas commented Feb 24, 2017 via email

@liewegas
Member

liewegas commented Feb 24, 2017 via email

@@ -472,6 +472,8 @@ class FileJournal :

void set_wait_on_full(bool b) { wait_on_full = b; }

off64_t get_journal_size_estimate(void);
Contributor

No need for (void).
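
In C++ an empty parameter list already declares a function that takes no arguments, so the declaration above can simply be:

off64_t get_journal_size_estimate();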

@dzafman dzafman changed the title [DNM] Various changes for preventing internal ENOSPC condition Various changes for preventing internal ENOSPC condition Feb 24, 2017
@dzafman
Contributor Author

dzafman commented Feb 24, 2017

@liewegas I pulled the more complex commit for recovery/backfill. I don't have additional tests for this, but I think including it in a wip-testing branch rados run is sufficient.

@@ -318,6 +318,7 @@ OPTION(mon_crush_min_required_version, OPT_STR, "firefly")
OPTION(mon_warn_on_crush_straw_calc_version_zero, OPT_BOOL, true) // warn if crush straw_calc_version==0
OPTION(mon_warn_on_osd_down_out_interval_zero, OPT_BOOL, true) // warn if 'mon_osd_down_out_interval == 0'
OPTION(mon_warn_on_cache_pools_without_hit_sets, OPT_BOOL, true)
OPTION(mon_warn_osd_usage_percent, OPT_FLOAT, .10) // warn if difference in usage percent between OSDs exceeds specified percent
Member

Add this to the docs, and add both of these settings changes to PendingReleaseNotes.

.10 seems like a pretty low default for this - 45%-55% is a pretty flat distribution, especially for small clusters. How about something like .4?
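
To illustrate the point above: under the .10 default, a spread such as 45% usage on the least-full OSD and 56% on the most-full already exceeds the threshold and trips the warning, even though that distribution is quite flat; with .40, the warning fires only for genuinely lopsided spreads such as 30% vs. 75%.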

Contributor Author

@jdurgin Thanks, I was hoping we could come up with a better value for this. .4 sounds reasonable.

if (g_conf->mon_warn_osd_usage_percent) {
  float max_osd_perc_avail = 0.0, min_osd_perc_avail = 1.0;
  for (auto p = pg_map.osd_stat.begin(); p != pg_map.osd_stat.end(); ++p) {
    float perc_avail = ((float)(p->second.kb - p->second.kb_avail)) / ((float)p->second.kb);
Member

should guard against bad stats causing divide by 0
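
A minimal sketch of such a guard, extending the loop quoted above (it treats kb == 0 as bad or missing stats; the min/max bookkeeping lines are assumed from the variables declared earlier in the hunk, not taken from the patch):

for (auto p = pg_map.osd_stat.begin(); p != pg_map.osd_stat.end(); ++p) {
  if (p->second.kb == 0)
    continue;  // bad or missing stats; skip to avoid a divide by zero
  float perc_avail = ((float)(p->second.kb - p->second.kb_avail)) /
                     ((float)p->second.kb);
  max_osd_perc_avail = std::max(max_osd_perc_avail, perc_avail);
  min_osd_perc_avail = std::min(min_osd_perc_avail, perc_avail);
}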

@dzafman dzafman changed the title Various changes for preventing internal ENOSPC condition DNM: Various changes for preventing internal ENOSPC condition Feb 27, 2017
…cent (10%)

Signed-off-by: David Zafman <dzafman@redhat.com>
This really only works after the journal drains, because
we adjust for the journal.

Signed-off-by: David Zafman <dzafman@redhat.com>
…fs min reserved

This fixes OSD crashes because checking osd_failsafe_full_ratio won't work
without accurate statfs information.

Signed-off-by: David Zafman <dzafman@redhat.com>
@dzafman dzafman changed the title DNM: Various changes for preventing internal ENOSPC condition Various changes for preventing internal ENOSPC condition Feb 27, 2017
@liewegas liewegas changed the title Various changes for preventing internal ENOSPC condition osd: various changes for preventing internal ENOSPC condition Feb 28, 2017
@dzafman
Contributor Author

dzafman commented Mar 4, 2017

Testing passed; the 2 failures are already in the tracker, and the 2 dead jobs were infrastructure issues.

http://pulpito.ceph.com/dzafman-2017-03-03_09:17:05-rados-wip-15912-distro-basic-smithi

@tchaikov tchaikov self-requested a review March 6, 2017 15:50
@@ -318,6 +318,7 @@ OPTION(mon_crush_min_required_version, OPT_STR, "firefly")
OPTION(mon_warn_on_crush_straw_calc_version_zero, OPT_BOOL, true) // warn if crush straw_calc_version==0
OPTION(mon_warn_on_osd_down_out_interval_zero, OPT_BOOL, true) // warn if 'mon_osd_down_out_interval == 0'
OPTION(mon_warn_on_cache_pools_without_hit_sets, OPT_BOOL, true)
OPTION(mon_warn_osd_usage_percent, OPT_FLOAT, .40) // warn if difference in usage percent between OSDs exceeds specified percent
Contributor

@dzafman, could you talk more about this change? I didn't catch why the issue could be resolved by increasing this option. Thanks!

Contributor Author

mon_warn_osd_usage_percent is a new config parameter that warns when OSDs aren't being evenly used. This can be caused by many different things, like too few PGs, bad weighting, or even storing non-OSD data on an OSD disk. Being aware of this situation lets someone take action before the fullest OSD runs out of space and crashes while the cluster as a whole seems to have plenty of space available.

mon_osd_min_in_ratio is the config parameter that was increased. Say a network failure partitions the cluster and a significant part of it goes down. If we allow the automatic mechanism to mark too many of the OSDs out, an attempt to recover onto the remaining nodes could result in running out of space.
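
As a minimal self-contained sketch (a hypothetical helper illustrating the idea, not the actual OSDMonitor code) of how such a floor gates the automatic mark-out:

// Refuse to auto-mark another OSD "out" if doing so would drop the
// fraction of "in" OSDs below the configured floor (0.75 after this PR).
bool can_mark_out(int num_osds, int num_in, double min_in_ratio = 0.75) {
  if (num_osds <= 0)
    return false;
  return (double)(num_in - 1) / (double)num_osds >= min_in_ratio;
}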

Contributor

Very clear, thanks!

@@ -718,6 +718,14 @@ int FileStore::statfs(struct store_statfs_t *buf0)
}
buf0->total = buf.f_blocks * buf.f_bsize;
buf0->available = buf.f_bavail * buf.f_bsize;
// Adjust for writes pending in the journal
if (journal) {
  uint64_t estimate = journal->get_journal_size_estimate();
Contributor

@dzafman, as I thought, this change is meant to prevent the ENOSPC condition, am I right? Actually, I think this change resolves the issue only partially: statfs will be called in update_osd_stat, which is called on each heartbeat (5s). But when the throughput from clients is very high, buf0 may be out of date. Do you agree?

Contributor Author

@liupan1111 It isn't so much that buf0 is out of date; it is that the journal may contain data pending to be written to the store, so buf0 is not up to date with respect to writes that have been received but not yet flushed. The journal itself uses a fixed amount of space, so that is already accounted for. The function name get_journal_size_estimate() is misleading: it returns not the actual size of the journal file or block device, but the amount of unflushed data pending to be written. That data will consume space when it flushes, so we should reflect it when determining how much more client I/O to accept.
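
Since the hunk quoted above is truncated, here is a hedged sketch of how the adjustment presumably completes (the clamped subtraction is an assumption, not the verbatim patch):

if (journal) {
  // Bytes accepted into the journal but not yet flushed to the store;
  // they will consume store space once flushed, so stop advertising
  // them as available.
  uint64_t estimate = journal->get_journal_size_estimate();
  if (buf0->available >= estimate)
    buf0->available -= estimate;
  else
    buf0->available = 0;
}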

Contributor

@dzafman One question: could this statfs double-count? The case: some data is written to the journal disk and also written to the page cache of the data disk (not yet flushed to the physical data disk). Would that data be counted both in estimate and in the used part of buf0?

Contributor Author

@liupan1111 No. A filesystem (e.g. xfs, ext4) that has accepted a write must have allocated the space and tracked it in memory, so statfs is always accurate. The page cache might delay the raw data getting onto the disk, but that doesn't affect the filesystem's space accounting.

Contributor

@liupan1111 there are two cases:

1. journal and data are co-located in the same filesystem (fs for short), in which case FileStore::statfs() should return the space still available for new data, so:

   fs.available = fs.all - data.used - journal.max_size  // the space used by the journal is pre-allocated
   data.all = fs.all - journal.max_size
   buf0.available = fs.available - journal.used

2. journal and data live in different filesystems, in which case:

   fs.available = fs.all - data.used
   data.all = fs.all
   buf0.available = fs.available - journal.used

so I think the change is correct.
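
To make case 1 concrete with made-up numbers, say fs.all = 1000 GB, journal.max_size = 10 GB, data.used = 300 GB, and journal.used = 4 GB:

   fs.available   = 1000 - 300 - 10 = 690 GB
   data.all       = 1000 - 10      = 990 GB
   buf0.available = 690 - 4        = 686 GB

so the 4 GB of journaled-but-unflushed data is no longer advertised as available.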

Contributor

Yes, there is a small time window, after the journal is committed and before its apply callback is called, where the journal data is counted both in (data.total - data.available) and in journal.used. Is that what you mean?

Contributor

@tchaikov Yup! Sorry for my poor English; that is exactly what I wanted to say.

Contributor

In addition, the OSD only calls statfs every 5 seconds (on each heartbeat), so that is another source of inaccuracy, especially under high throughput.

Contributor

I don't have an estimate of how large the time window is, but I think it's very small. If you believe this would put us in trouble, could you show us more evidence or statistics?

Contributor

I was just studying this PR and noticed this possible inaccuracy; I was just curious about it. Yes, I agree this window shouldn't be large. I haven't got any evidence of it, but I will keep it in mind and test it when I hit a similar case.
