os/bluestore: fix bugs in bluefs and bdev flush #13911
Conversation
Signed-off-by: Sage Weil <sage@redhat.com>
flush() may be called from multiple racing threads (notably, rocksdb can call fsync via bluefs at any time), and we need to make sure that if one thread sees the io_since_flush flag and does an actual flush, other racing threads also wait until that flush is complete. This is accomplished with a simple mutex!

Also, set the flag on IO *completion*, since flush is only a promise about completed IOs, not submitted IOs. Document.

Fixes: http://tracker.ceph.com/issues/19251

Signed-off-by: Sage Weil <sage@redhat.com>
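The flush-serialization scheme described above can be sketched as follows. The names `io_since_flush` and `flush_mutex` come from the patch; the `Device` struct, `on_aio_complete()`, and the `hw_flushes` counter are hypothetical stand-ins for the real KernelDevice code, used only to show the locking pattern:

```cpp
#include <cassert>
#include <mutex>

// Sketch of the serialized-flush pattern, assuming a simplified device
// model. Not the actual KernelDevice implementation.
struct Device {
  bool io_since_flush = false;   // set on aio *completion*, not submission
  std::mutex flush_mutex;
  int hw_flushes = 0;            // stands in for the real device flush

  void on_aio_complete() {
    std::lock_guard<std::mutex> l(flush_mutex);
    io_since_flush = true;
  }

  void flush() {
    // Racing callers serialize here: whichever thread sees the flag
    // first clears it and performs the flush; followers block on the
    // mutex until that flush completes, so flush() never returns
    // before previously completed IOs are stable.
    std::lock_guard<std::mutex> l(flush_mutex);
    if (io_since_flush) {
      io_since_flush = false;
      ++hw_flushes;              // real code would flush the block device
    }
  }
};
```

Two threads calling `flush()` after one completed IO result in exactly one device flush, with the second caller blocked until it is done.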
We need to flush any new writes on any fsync(). Notably, this includes the rocksdb log. However, previously _fsync was only doing a bdev flush if we also had a dirty bluefs journal and called into _sync_and_flush_log. If we didn't, we weren't doing a flush() at all, which could lead to corrupted data.

Fix this by moving the first flush_bdev *out* of _sync_and_flush_log. (The second one is there to flush the bluefs journal; the first one was to ensure prior writes are stable.) Instead, flush prior writes in all of the callers prior to calling _sync_and_flush_log. This includes _fsync (fixing the bug by covering the non-journal-flush path) as well as several other callers.

Fixes: http://tracker.ceph.com/issues/19250

Signed-off-by: Sage Weil <sage@redhat.com>
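The corrected control flow can be sketched as below. The function names follow the commit message, but the bodies (and the `bdev_flushes` counter) are illustrative stand-ins, not the actual BlueFS implementation:

```cpp
#include <cassert>

// Sketch of the fix: flush_bdev is called by _fsync *before*
// _sync_and_flush_log, so prior writes are flushed even when the
// bluefs journal is clean. Counters replace real device flushes.
struct BlueFS {
  int bdev_flushes = 0;
  bool journal_dirty = false;

  void flush_bdev() { ++bdev_flushes; }   // make prior writes durable

  void _sync_and_flush_log() {
    // ... write out the dirty bluefs journal ...
    flush_bdev();       // remaining flush: makes the journal update stable
    journal_dirty = false;
  }

  void _fsync() {
    flush_bdev();       // the fix: always flush prior writes first,
                        // even on the non-journal-flush path
    if (journal_dirty)
      _sync_and_flush_log();
  }
};
```

Before the fix, an _fsync with a clean journal performed no bdev flush at all; here it always performs at least one.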
// aio completion notification will not return before that aio is
// stable on disk: whichever thread sees the flag first will block
// followers until the aio is stable.
std::lock_guard<std::mutex> l(flush_mutex);
I suggest we add a perf counter for contention on this flush_mutex lock. I suspect it will cause a serious performance hit...
We could fix this now.

In practice it is very rare that multiple threads call flush... I only saw it after quite a bit of testing, and it only happened because of rocksdb (compaction?). And if two threads *do* collide on this lock, the whole point is that they *must* block in order to ensure their data is stable on disk.

Unless I'm misunderstanding what you mean by 'lock contention' in this case?
OH, I got the actual idea.
http://tracker.ceph.com/issues/19251
http://tracker.ceph.com/issues/19250
These will get backported to kraken too.