osd: various changes for preventing internal ENOSPC condition #13425
Conversation
https://jenkins.ceph.com/job/ceph-pull-requests/18471/consoleFull#-243017103d63714d2-c8d8-41fc-a9d4-8dee30be4c32 , @dillaman is this caused by an environmental issue, and hence a false alarm? retest this please.
@tchaikov Yes -- no way that error could be related
src/common/config_opts.h
Outdated
OPTION(mon_osd_min_in_ratio, OPT_DOUBLE, .3) // min osds required to be in to mark things out
OPTION(mon_osd_max_small_cluster, OPT_INT, 10) // Maximum number of osds that constitutes a small cluster for min in ratio
OPTION(mon_osd_small_min_in_ratio, OPT_DOUBLE, .30) // min osds required to be in to mark things out in small cluster
OPTION(mon_osd_min_in_ratio, OPT_DOUBLE, .75) // min osds required to be in to mark things out
The new mon_osd_min_in_ratio default will need to be updated in doc/rados/configuration/mon-osd-interaction.rst. And it would be great to document all these new options as well.
@ktdreyer Fixed the doc for the default of mon_osd_min_in_ratio, but we decided not to add any other configs.
I don't understand why we would add configs but not document them. Why not document them and add a huge warning?
@ktdreyer Oops. Yes, thanks.
Force-pushed from 38d853d to 586f4a3
@liewegas I added a fix for BlueStore and changed its error message.
@ktdreyer There isn't an existing document to add the information about mon_warn_osd_usage_percent to, so I filed http://tracker.ceph.com/issues/18986
Force-pushed from 32e8a3c to 178bdd3
detail->push_back(make_pair(HEALTH_WARN, ss.str()));
  }
}
PGMonitor is going to be turned off in luminous, so we probably want to add this to ceph-mgr (unless it's needed for a jewel backport... I'm thinking not?)
src/common/config_opts.h
Outdated
@@ -626,6 +626,9 @@ OPTION(osd_backfill_full_ratio, OPT_FLOAT, 0.85)
// Seconds to wait before retrying refused backfills
OPTION(osd_backfill_retry_interval, OPT_DOUBLE, 10.0)

// Seconds to wait before retrying refused backfills
s/backfills/recovery/
src/osd/OSD.h
Outdated
// -- Recovery Request Scheduling --
Mutex recovery_request_lock;
SafeTimer recovery_request_timer;
We should be able to reuse the existing timer and lock (maybe rename it to recovery_* instead of backfill_*).
Agreed, no real reason not to.
src/osd/PrimaryLogPG.cc
Outdated
assert(0);
  }
}
This doesn't seem like enough. We've already gotten the reservation... what we really need is a way to abort recovery or backfill. Until we have that figured out, I don't think this patch makes sense.
Yeah, you'll need to wire up those events to release the reservations, etc.

I made some changes to the way these fullness thresholds are handled in #13615. I don't think anything conflicts, but it might make sense to have a discussion about how the recovery/backfill aborts should work in the context of that change. Basically, I'm wondering if OSDs should either (1) set flags for "too full for backfill" or similar in the OSDMap (similar to nearfull), or (2) the recovery and backfill reservation requests should include an estimate of the size of the pg. (The latter sounds nicer, except that we don't account for omap space utilization, so it's not a complete solution.)

@liewegas For item (1), on the one hand it would be nice if nearfull (defaults to 85%) were the same as "too full for backfill", which also defaults to 85%, so we wouldn't need another indicator; on the other hand it might be better to be able to configure backfill separately, and then it would be nice to know when we are too full to start a backfill.
Yeah, that was my initial thought. The only issue is that 'nearfull' is mostly just a threshold to notify the user "slow down, you're almost full!" with HEALTH_WARN. Mostly. I think CephFS might also switch to synchronous writes at this point, but I'm not sure that's actually a good idea (it's a bit drastic given the spread between nearfull and full).

I like all the other patches, BTW; I'd be happy to see those merge while we figure out a strategy for the hard parts.
src/os/filestore/FileJournal.h
Outdated
@@ -472,6 +472,8 @@ class FileJournal :

void set_wait_on_full(bool b) { wait_on_full = b; }

off64_t get_journal_size_estimate(void);
No need for (void).
@liewegas I pulled the more complex commit for recovery/backfill. I don't have additional tests for this, but I think including it in a wip-testing branch rados run is sufficient.
src/common/config_opts.h
Outdated
@@ -318,6 +318,7 @@ OPTION(mon_crush_min_required_version, OPT_STR, "firefly")
OPTION(mon_warn_on_crush_straw_calc_version_zero, OPT_BOOL, true) // warn if crush straw_calc_version==0
OPTION(mon_warn_on_osd_down_out_interval_zero, OPT_BOOL, true) // warn if 'mon_osd_down_out_interval == 0'
OPTION(mon_warn_on_cache_pools_without_hit_sets, OPT_BOOL, true)
OPTION(mon_warn_osd_usage_percent, OPT_FLOAT, .10) // warn if difference in usage percent between OSDs exceeds specified percent
Add this to the docs, and add both of these settings changes to PendingReleaseNotes.
.10 seems like a pretty low default for this - 45%-55% is a pretty flat distribution, especially for small clusters. How about something like .4?
@jdurgin Thanks, I was hoping we could come up with a better value for this. .4 sounds reasonable.
src/mon/PGMonitor.cc
Outdated
if (g_conf->mon_warn_osd_usage_percent) {
  float max_osd_perc_avail = 0.0, min_osd_perc_avail = 1.0;
  for (auto p = pg_map.osd_stat.begin(); p != pg_map.osd_stat.end(); ++p) {
    float perc_avail = ((float)(p->second.kb - p->second.kb_avail)) / ((float)p->second.kb);
should guard against bad stats causing divide by 0
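The hunk above computes a per-OSD used fraction, so one zero-capacity entry (e.g. an OSD that hasn't reported stats yet) would divide by zero. A minimal sketch of the suggested guard, assuming the pg_map.osd_stat fields shown in the hunk; I've renamed perc_avail to perc_used here since the expression computes the used fraction (the real patch may differ):

```cpp
// Sketch only: skip OSDs whose reported capacity is zero so the
// division below can never fault on bad or missing stats.
for (auto p = pg_map.osd_stat.begin(); p != pg_map.osd_stat.end(); ++p) {
  if (p->second.kb == 0)
    continue;  // bad stats -- avoid divide by zero
  float perc_used = ((float)(p->second.kb - p->second.kb_avail)) /
                    ((float)p->second.kb);
  max_osd_perc_avail = std::max(max_osd_perc_avail, perc_used);
  min_osd_perc_avail = std::min(min_osd_perc_avail, perc_used);
}
```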
…cent (10%) Signed-off-by: David Zafman <dzafman@redhat.com>
Fixes: http://tracker.ceph.com/issues/16878 Signed-off-by: David Zafman <dzafman@redhat.com>
This really only works after journal drains because we adjust for the journal. Signed-off-by: David Zafman <dzafman@redhat.com>
…fs min reserved This fixes OSD crashes because checking osd_failsafe_full_ratio won't work without accurate statfs information. Signed-off-by: David Zafman <dzafman@redhat.com>
Testing passed; 2 failures are in the tracker, 2 dead due to infrastructure: http://pulpito.ceph.com/dzafman-2017-03-03_09:17:05-rados-wip-15912-distro-basic-smithi
@@ -318,6 +318,7 @@ OPTION(mon_crush_min_required_version, OPT_STR, "firefly")
OPTION(mon_warn_on_crush_straw_calc_version_zero, OPT_BOOL, true) // warn if crush straw_calc_version==0
OPTION(mon_warn_on_osd_down_out_interval_zero, OPT_BOOL, true) // warn if 'mon_osd_down_out_interval == 0'
OPTION(mon_warn_on_cache_pools_without_hit_sets, OPT_BOOL, true)
OPTION(mon_warn_osd_usage_percent, OPT_FLOAT, .40) // warn if difference in usage percent between OSDs exceeds specified percent
@dzafman, could you talk more about this change? I didn't catch why the issue could be resolved by increasing this option. Thanks!
mon_warn_osd_usage_percent is a new config parameter that warns when OSDs aren't being evenly used. This can be caused by many different things, like not enough PGs, bad weighting, or even storing non-OSD data on an OSD disk. Being aware of this situation could allow someone to take action before the most-full OSD runs out of space and crashes while the cluster seems to have lots of space available overall.
mon_osd_min_in_ratio is the config parameter that was increased. Say a network failure causes the cluster to partition and a significant part of the cluster goes down. If we allow the automatic mechanism to mark too many of the OSDs out, an attempt to recover with the remaining nodes could result in running out of space.
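To illustrate the mechanism, here is a hypothetical sketch of a min-in ratio gate (illustrative names, not the actual OSDMonitor code):

```cpp
// Hypothetical: refuse to auto-mark an OSD "out" if doing so would
// drop the in/total ratio below the configured floor.
bool can_mark_out(int num_in, int num_osds, double min_in_ratio) {
  if (num_osds <= 0)
    return false;
  // Ratio of "in" OSDs if one more were marked out.
  double ratio_after = (double)(num_in - 1) / num_osds;
  return ratio_after >= min_in_ratio;
}
```

With the new 0.75 default, a 20-OSD cluster would never auto-mark OSDs out once only 15 remain in, versus going all the way down to 6 with the old 0.3 default.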
Very clear, thanks!
@@ -718,6 +718,14 @@ int FileStore::statfs(struct store_statfs_t *buf0)
}
buf0->total = buf.f_blocks * buf.f_bsize;
buf0->available = buf.f_bavail * buf.f_bsize;
// Adjust for writes pending in the journal
if (journal) {
  uint64_t estimate = journal->get_journal_size_estimate();
@dzafman, as I thought, this change is meant to prevent the ENOSPC condition, am I right? Actually, I think this change only partially resolves the issue: statfs is called in update_osd_stat, which runs on each heartbeat (5s). When the throughput from clients is very high, buf0 may be out of date. Do you agree?
@liupan1111 It isn't so much that buf0 is out of date; it's that the journal may contain data which is pending to be written to the store, so the store is not up to date with respect to writes that have been received but not yet flushed. The journal itself uses a fixed amount of space, so that is already accounted for. The function name get_journal_size_estimate() is misleading because it isn't the actual size of the journal file or block device, but rather the amount of unflushed data pending to be written. This data will consume space when it flushes, so we should reflect that in determining how much more client I/O we should accept.
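The hunk above is cut off after the get_journal_size_estimate() call; here is a sketch of the adjustment being described, where the clamp to zero is my assumption rather than the patch's exact code:

```cpp
// Reserve space for journaled-but-unflushed writes: they will consume
// store space once they land, so don't report that space as available.
if (journal) {
  uint64_t estimate = journal->get_journal_size_estimate();
  buf0->available = (buf0->available > estimate)
                      ? buf0->available - estimate
                      : 0;  // clamp rather than underflow (assumption)
}
```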
@dzafman One question: could this statfs double-count? The case is: some data is written to the journal disk and also written to the page cache of the data disk (not yet flushed to the physical data disk). Wouldn't that data be included both in the estimate and in the used part of buf0?
@liupan1111 No. A filesystem (e.g. xfs, ext4) that has accepted a write must have allocated the space and tracked it in memory, so statfs is always accurate. The page cache might delay the raw data reaching the disk, but it doesn't affect the filesystem's space accounting.
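A minimal standalone illustration of that point (not FileStore code; the path is arbitrary): statfs(2) reports the filesystem's own space accounting, which already covers writes still sitting in the page cache.

```cpp
#include <sys/vfs.h>   // statfs(2) on Linux
#include <cstdint>
#include <cstdio>

int main() {
  struct statfs buf;
  if (statfs("/var/lib/ceph", &buf) < 0)  // example path
    return 1;
  // Same arithmetic FileStore uses: block counts times block size.
  uint64_t total = (uint64_t)buf.f_blocks * buf.f_bsize;
  uint64_t avail = (uint64_t)buf.f_bavail * buf.f_bsize;
  printf("total=%llu avail=%llu\n",
         (unsigned long long)total, (unsigned long long)avail);
  return 0;
}
```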
@liupan1111 there are two cases:
- journal and data are co-located in the same filesystem (fs for short), in which case FileStore::statfs() should return the space still available for new data, so:
  fs.available = fs.all - data.used - journal.max_size // the space used by the journal is pre-allocated
  data.all = fs.all - journal.max_size
  buf0.available = fs.available - journal.used
- journal and data live in different filesystems, in which case:
  fs.available = fs.all - data.used
  data.all = fs.all
  buf0.available = fs.available - journal.used

So I think the change is correct; a worked example with concrete numbers follows below.
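A worked example with made-up numbers, using the same notation as above:

```
fs.all = 100 GB, journal.max_size = 10 GB, data.used = 40 GB, journal.used = 2 GB

co-located:  fs.available   = 100 - 40 - 10 = 50 GB  // journal space is pre-allocated
             buf0.available = 50 - 2        = 48 GB  // pending journal writes still to land
separate:    fs.available   = 100 - 40      = 60 GB
             buf0.available = 60 - 2        = 58 GB
```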
Yes, there is a small time window, after the journal entry is committed and before its apply callback is called, where the journaled data is counted both in (data.total - data.available) and in (journal.used). Is that what you mean?
@tchaikov Yup! Sorry for my poor English; that is exactly what I meant.
In addition, the OSD will call statfs every 5 seconds (each heartbeat), so that is another source of inaccuracy, especially under high throughput.
I don't have an estimate of how large the time window is, but I think it's very small. If you believe this could put us in trouble, could you show us more evidence or statistics?
I just studied this PR and noticed this possible inaccuracy; I was just curious about it. Yes, I agree the window shouldn't be large. I haven't got any evidence of it, but I will keep this in mind and test it when I hit a similar case.
Don't log "not implemented" since this should never be seen when properly configured
Increase mon_osd_min_in_ratio for most clusters to 75%
Add warning when the usage on different OSDs exceeds mon_warn_osd_usage_percent (10%)