os/bluestore: fix bugs in bluefs and bdev flush #13911
Conversation
Signed-off-by: Sage Weil <sage@redhat.com>
flush() may be called from multiple racing threads (notably, rocksdb can call fsync via bluefs at any time), and we need to make sure that if one thread sees the io_since_flush flag and does an actual flush, other racing threads also wait until that flush is complete. This is accomplished with a simple mutex!

Also, set the flag on IO *completion*, since flush is only a promise about completed IOs, not submitted IOs. Document.

Fixes: http://tracker.ceph.com/issues/19251

Signed-off-by: Sage Weil <sage@redhat.com>
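The flush-serialization scheme described above can be sketched as follows. The names `io_since_flush` and `flush_mutex` come from the patch; the `Device` struct, `on_aio_complete()`, and the `hw_flushes` counter are hypothetical stand-ins for the real KernelDevice code, used only to show the locking pattern:

```cpp
#include <cassert>
#include <mutex>

// Sketch of the serialized-flush pattern, assuming a simplified device
// model. Not the actual KernelDevice implementation.
struct Device {
  bool io_since_flush = false;   // set on aio *completion*, not submission
  std::mutex flush_mutex;
  int hw_flushes = 0;            // stands in for the real device flush

  void on_aio_complete() {
    std::lock_guard<std::mutex> l(flush_mutex);
    io_since_flush = true;
  }

  void flush() {
    // Racing callers serialize here: whichever thread sees the flag
    // first clears it and performs the flush; followers block on the
    // mutex until that flush completes, so flush() never returns
    // before previously completed IOs are stable.
    std::lock_guard<std::mutex> l(flush_mutex);
    if (io_since_flush) {
      io_since_flush = false;
      ++hw_flushes;              // real code would flush the block device
    }
  }
};
```

Two threads calling `flush()` after one completed IO result in exactly one device flush, with the second caller blocked until it is done.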
We need to flush any new writes on any fsync(). Notably, this includes the rocksdb log. However, previously _fsync was only doing a bdev flush if we also had a dirty bluefs journal and called into _sync_and_flush_log. If we didn't, we weren't doing a flush() at all, which could lead to corrupted data.

Fix this by moving the first flush_bdev *out* of _sync_and_flush_log. (The second one is there to flush the bluefs journal; the first one was to ensure prior writes are stable.) Instead, flush prior writes in all of the callers prior to calling _sync_and_flush_log. This includes _fsync (fixing the bug by covering the non-journal-flush path) as well as several other callers.

Fixes: http://tracker.ceph.com/issues/19250

Signed-off-by: Sage Weil <sage@redhat.com>
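The corrected control flow can be sketched as below. The function names follow the commit message, but the bodies (and the `bdev_flushes` counter) are illustrative stand-ins, not the actual BlueFS implementation:

```cpp
#include <cassert>

// Sketch of the fix: flush_bdev is called by _fsync *before*
// _sync_and_flush_log, so prior writes are flushed even when the
// bluefs journal is clean. Counters replace real device flushes.
struct BlueFS {
  int bdev_flushes = 0;
  bool journal_dirty = false;

  void flush_bdev() { ++bdev_flushes; }   // make prior writes durable

  void _sync_and_flush_log() {
    // ... write out the dirty bluefs journal ...
    flush_bdev();       // remaining flush: makes the journal update stable
    journal_dirty = false;
  }

  void _fsync() {
    flush_bdev();       // the fix: always flush prior writes first,
                        // even on the non-journal-flush path
    if (journal_dirty)
      _sync_and_flush_log();
  }
};
```

Before the fix, an _fsync with a clean journal performed no bdev flush at all; here it always performs at least one.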
// aio completion notification will not return before that aio is
// stable on disk: whichever thread sees the flag first will block
// followers until the aio is stable.
std::lock_guard<std::mutex> l(flush_mutex);
I suggest we add a perf counter for contention on this flush_mutex lock. I suspect it will cause a serious performance hit...
We could fix this now.

In practice it is very rare that multiple threads call flush... I only saw it after quite a bit of testing, and it only happened because of rocksdb (compaction?). And if two threads *do* collide on this lock, the whole point is that they *must* block in order to ensure their data is stable on disk.

Unless I'm misunderstanding what you mean by 'lock contention' in this case?
OH, I got the actual idea.
http://tracker.ceph.com/issues/19251
http://tracker.ceph.com/issues/19250
These will get backported to kraken too.