osd: various changes for preventing internal ENOSPC condition #13425
Conversation
https://jenkins.ceph.com/job/ceph-pull-requests/18471/consoleFull#-243017103d63714d2-c8d8-41fc-a9d4-8dee30be4c32 , @dillaman is this caused by an environmental issue, and hence a false alarm? retest this please.
@tchaikov Yes -- no way that error could be related
src/common/config_opts.h
Outdated
OPTION(mon_osd_min_in_ratio, OPT_DOUBLE, .3) // min osds required to be in to mark things out
OPTION(mon_osd_max_small_cluster, OPT_INT, 10) // Maximum number of osds that constitutes a small cluster for min in ratio
OPTION(mon_osd_small_min_in_ratio, OPT_DOUBLE, .30) // min osds required to be in to mark things out in small cluster
OPTION(mon_osd_min_in_ratio, OPT_DOUBLE, .75) // min osds required to be in to mark things out
The new mon_osd_min_in_ratio default will need to be updated in doc/rados/configuration/mon-osd-interaction.rst. And it would be great to document all these new options as well.
@ktdreyer Fixed the doc for the default of mon_osd_min_in_ratio, but we decided not to add any other configs.
I don't understand why we would add configs but not document them. Why not document them and add a huge warning?
@ktdreyer Oops. Yes, thanks.
Force-pushed from 38d853d to 586f4a3
@liewegas I added a fix for BlueStore and changed its error message.
@ktdreyer There isn't an existing document to add the information about mon_warn_osd_usage_percent to, so I filed http://tracker.ceph.com/issues/18986
Force-pushed from 32e8a3c to 178bdd3
detail->push_back(make_pair(HEALTH_WARN, ss.str()));
  }
}
PGMonitor is going to be turned off in luminous, so we probably want to add this to ceph-mgr (unless it's needed for a jewel backport... I'm thinking not?)
src/common/config_opts.h
Outdated
@@ -626,6 +626,9 @@ OPTION(osd_backfill_full_ratio, OPT_FLOAT, 0.85)
// Seconds to wait before retrying refused backfills
OPTION(osd_backfill_retry_interval, OPT_DOUBLE, 10.0)

// Seconds to wait before retrying refused backfills
s/backfills/recovery/
src/osd/OSD.h
Outdated
// -- Recovery Request Scheduling --
Mutex recovery_request_lock;
SafeTimer recovery_request_timer;
We should be able to reuse the existing timer and lock (maybe rename it to recovery_* instead of backfill_*).
Agreed, no real reason not to.
src/osd/PrimaryLogPG.cc
Outdated
assert(0);
  }
}
This doesn't seem like enough. We've already gotten the reservation... what we really need is a way to abort recovery or backfill. Until we have that figured out, I don't think this patch makes sense.
Yeah, you'll need to wire up those events to release the reservations, etc.

I made some changes to the way these fullness thresholds are handled in #13615. I don't think anything conflicts, but it might make sense to have a discussion about how the recovery/backfill aborts should work in the context of that change. Basically, I'm wondering if OSDs should either (1) set flags for "too full for backfill" or similar in the OSDMap (similar to nearfull), or (2) the recovery and backfill reservation requests should include an estimate of the size of the pg. (The latter sounds nicer, except that we don't account for omap space utilization, so it's not a complete solution.)

@liewegas For item (1), on the one hand it would be nice if nearfull (defaults to 85%) were the same as "too full for backfill", which also defaults to 85%, so we wouldn't need another indicator; on the other hand it might be better to be able to configure backfill separately, and then it would be nice to know when we are too full to start a backfill.
Yeah, that was my initial thought. The only issue is that 'nearfull' is mostly just a threshold to notify the user "slow down, you're almost full!" with HEALTH_WARN. Mostly. I think CephFS might also switch to synchronous writes at this point, but I'm not sure that's actually a good idea (it's a bit drastic given the spread between nearfull and full).

I like all the other patches, BTW; I'd be happy to see those merge while we figure out a strategy for the hard parts.
src/os/filestore/FileJournal.h
Outdated
@@ -472,6 +472,8 @@ class FileJournal :

void set_wait_on_full(bool b) { wait_on_full = b; }

off64_t get_journal_size_estimate(void);
No need for (void).
@liewegas I pulled the more complex commit for recovery/backfill. I don't have additional tests for this, but I think including it in a wip-testing branch rados run is sufficient.
src/common/config_opts.h
Outdated
@@ -318,6 +318,7 @@ OPTION(mon_crush_min_required_version, OPT_STR, "firefly")
OPTION(mon_warn_on_crush_straw_calc_version_zero, OPT_BOOL, true) // warn if crush straw_calc_version==0
OPTION(mon_warn_on_osd_down_out_interval_zero, OPT_BOOL, true) // warn if 'mon_osd_down_out_interval == 0'
OPTION(mon_warn_on_cache_pools_without_hit_sets, OPT_BOOL, true)
OPTION(mon_warn_osd_usage_percent, OPT_FLOAT, .10) // warn if difference in usage percent between OSDs exceeds specified percent
Add this to the docs, and add both of these settings changes to PendingReleaseNotes.
.10 seems like a pretty low default for this - 45%-55% is a pretty flat distribution, especially for small clusters. How about something like .4?
@jdurgin Thanks, I was hoping we could come up with a better value for this. .4 sounds reasonable.
src/mon/PGMonitor.cc
Outdated
if (g_conf->mon_warn_osd_usage_percent) {
  float max_osd_perc_avail = 0.0, min_osd_perc_avail = 1.0;
  for (auto p = pg_map.osd_stat.begin(); p != pg_map.osd_stat.end(); ++p) {
    float perc_avail = ((float)(p->second.kb - p->second.kb_avail)) / ((float)p->second.kb);
should guard against bad stats causing divide by 0
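The hunk above computes a per-OSD used fraction, so one zero-capacity entry (e.g. an OSD that hasn't reported stats yet) would divide by zero. A minimal sketch of the suggested guard, assuming the pg_map.osd_stat fields shown in the hunk; I've renamed perc_avail to perc_used here since the expression computes the used fraction (the real patch may differ):

```cpp
// Sketch only: skip OSDs whose reported capacity is zero so the
// division below can never fault on bad or missing stats.
for (auto p = pg_map.osd_stat.begin(); p != pg_map.osd_stat.end(); ++p) {
  if (p->second.kb == 0)
    continue;  // bad stats -- avoid divide by zero
  float perc_used = ((float)(p->second.kb - p->second.kb_avail)) /
                    ((float)p->second.kb);
  max_osd_perc_avail = std::max(max_osd_perc_avail, perc_used);
  min_osd_perc_avail = std::min(min_osd_perc_avail, perc_used);
}
```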
…cent (10%) Signed-off-by: David Zafman <dzafman@redhat.com>
Fixes: http://tracker.ceph.com/issues/16878 Signed-off-by: David Zafman <dzafman@redhat.com>
This really only works after journal drains because we adjust for the journal. Signed-off-by: David Zafman <dzafman@redhat.com>
…fs min reserved This fixes OSD crashes because checking osd_failsafe_full_ratio won't work without accurate statfs information. Signed-off-by: David Zafman <dzafman@redhat.com>
Testing passed; 2 failures are in the tracker, 2 dead due to infrastructure: http://pulpito.ceph.com/dzafman-2017-03-03_09:17:05-rados-wip-15912-distro-basic-smithi
@@ -318,6 +318,7 @@ OPTION(mon_crush_min_required_version, OPT_STR, "firefly")
OPTION(mon_warn_on_crush_straw_calc_version_zero, OPT_BOOL, true) // warn if crush straw_calc_version==0
OPTION(mon_warn_on_osd_down_out_interval_zero, OPT_BOOL, true) // warn if 'mon_osd_down_out_interval == 0'
OPTION(mon_warn_on_cache_pools_without_hit_sets, OPT_BOOL, true)
OPTION(mon_warn_osd_usage_percent, OPT_FLOAT, .40) // warn if difference in usage percent between OSDs exceeds specified percent
@dzafman, could you talk more about this change? I didn't catch why the issue could be resolved by increasing this option. Thanks!
mon_warn_osd_usage_percent is a new config parameter that warns when OSDs aren't being evenly used. This can be caused by many different things, like not enough PGs, bad weighting, or even storing non-OSD data on an OSD disk. Being aware of this situation could allow someone to take action before the most-full OSD runs out of space and crashes while the cluster seems to have lots of space available overall.
mon_osd_min_in_ratio is the config parameter that was increased. Say a network failure causes the cluster to partition and a significant part of the cluster goes down. If we allow the automatic mechanism to mark too many of the OSDs out, an attempt to recover with the remaining nodes could result in running out of space.
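To illustrate the mechanism, here is a hypothetical sketch of a min-in ratio gate (illustrative names, not the actual OSDMonitor code):

```cpp
// Hypothetical: refuse to auto-mark an OSD "out" if doing so would
// drop the in/total ratio below the configured floor.
bool can_mark_out(int num_in, int num_osds, double min_in_ratio) {
  if (num_osds <= 0)
    return false;
  // Ratio of "in" OSDs if one more were marked out.
  double ratio_after = (double)(num_in - 1) / num_osds;
  return ratio_after >= min_in_ratio;
}
```

With the new 0.75 default, a 20-OSD cluster would never auto-mark OSDs out once only 15 remain in, versus going all the way down to 6 with the old 0.3 default.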
Very clear, thanks!
@@ -718,6 +718,14 @@ int FileStore::statfs(struct store_statfs_t *buf0)
}
buf0->total = buf.f_blocks * buf.f_bsize;
buf0->available = buf.f_bavail * buf.f_bsize;
// Adjust for writes pending in the journal
if (journal) {
  uint64_t estimate = journal->get_journal_size_estimate();
@dzafman, as I thought, this change is meant to prevent the ENOSPC condition, am I right? Actually, I think this change only partially resolves the issue: statfs is called in update_osd_stat, which runs on each heartbeat (5s). When the throughput from clients is very high, buf0 may be out of date. Do you agree?
@liupan1111 It isn't so much that buf0 is out of date; it's that the journal may contain data which is pending to be written to the store, so the store is not up to date with respect to writes that have been received but not yet flushed. The journal itself uses a fixed amount of space, so that is already accounted for. The function name get_journal_size_estimate() is misleading because it isn't the actual size of the journal file or block device, but rather the amount of unflushed data pending to be written. This data will consume space when it flushes, so we should reflect that in determining how much more client I/O we should accept.
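The hunk above is cut off after the get_journal_size_estimate() call; here is a sketch of the adjustment being described, where the clamp to zero is my assumption rather than the patch's exact code:

```cpp
// Reserve space for journaled-but-unflushed writes: they will consume
// store space once they land, so don't report that space as available.
if (journal) {
  uint64_t estimate = journal->get_journal_size_estimate();
  buf0->available = (buf0->available > estimate)
                      ? buf0->available - estimate
                      : 0;  // clamp rather than underflow (assumption)
}
```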
@dzafman One question: could this statfs double-count? The case is: some data is written to the journal disk and also written to the page cache of the data disk (not yet flushed to the physical data disk). Wouldn't that data be included both in the estimate and in the used part of buf0?
@liupan1111 No. A filesystem (e.g. xfs, ext4) that has accepted a write must have allocated the space and tracked it in memory, so statfs is always accurate. The page cache might delay the raw data reaching the disk, but it doesn't affect the filesystem's space accounting.
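A minimal standalone illustration of that point (not FileStore code; the path is arbitrary): statfs(2) reports the filesystem's own space accounting, which already covers writes still sitting in the page cache.

```cpp
#include <sys/vfs.h>   // statfs(2) on Linux
#include <cstdint>
#include <cstdio>

int main() {
  struct statfs buf;
  if (statfs("/var/lib/ceph", &buf) < 0)  // example path
    return 1;
  // Same arithmetic FileStore uses: block counts times block size.
  uint64_t total = (uint64_t)buf.f_blocks * buf.f_bsize;
  uint64_t avail = (uint64_t)buf.f_bavail * buf.f_bsize;
  printf("total=%llu avail=%llu\n",
         (unsigned long long)total, (unsigned long long)avail);
  return 0;
}
```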
@liupan1111 there are two cases:
- journal and data are co-located in the same filesystem (fs for short), in which case FileStore::statfs() should return the space still available for new data, so:
  fs.available = fs.all - data.used - journal.max_size // the space used by the journal is pre-allocated
  data.all = fs.all - journal.max_size
  buf0.available = fs.available - journal.used
- journal and data live in different filesystems, in which case:
  fs.available = fs.all - data.used
  data.all = fs.all
  buf0.available = fs.available - journal.used

So I think the change is correct; a worked example with concrete numbers follows below.
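A worked example with made-up numbers, using the same notation as above:

```
fs.all = 100 GB, journal.max_size = 10 GB, data.used = 40 GB, journal.used = 2 GB

co-located:  fs.available   = 100 - 40 - 10 = 50 GB  // journal space is pre-allocated
             buf0.available = 50 - 2        = 48 GB  // pending journal writes still to land
separate:    fs.available   = 100 - 40      = 60 GB
             buf0.available = 60 - 2        = 58 GB
```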
Yes, there is a small time window, after the journal entry is committed and before its apply callback is called, where the journaled data is counted both in (data.total - data.available) and in (journal.used). Is that what you mean?
@tchaikov Yup! Sorry for my poor English; that is exactly what I meant.
In addition, the OSD will call statfs every 5 seconds (each heartbeat), so that is another source of inaccuracy, especially under high throughput.
I don't have an estimate of how large the time window is, but I think it's very small. If you believe this could put us in trouble, could you show us more evidence or statistics?
I just studied this PR and noticed this possible inaccuracy; I was just curious about it. Yes, I agree the window shouldn't be large. I haven't got any evidence of it, but I will keep this in mind and test it when I hit a similar case.
Don't log "not implemented" since this should never be seen when properly configured
Increase mon_osd_min_in_ratio for most clusters to 75%
Add warning when the usage on different OSDs exceeds mon_warn_osd_usage_percent (10%)