journal: increase concurrency/parallelism of journal recorder #10445
Conversation
@@ -53,7 +53,7 @@ namespace {
 class ThreadPoolSingleton : public ThreadPool {
  public:
   explicit ThreadPoolSingleton(CephContext *cct)
-    : ThreadPool(cct, "librbd::thread_pool", "tp_librbd", 1,
+    : ThreadPool(cct, "librbd::thread_pool", "tp_librbd", 16,
I am figuring this is for test purposes: you could revert commit 6c0ab75 and temporarily re-use "rbd_op_threads" for the journal below.
@dillaman followed your suggestions and pushed the new changes. I added the new changes as a new commit on purpose, just to maintain the history of changes.
@rjfd Since you are removing the fixed thread limit, we should also fix the issue with journal flush events: https://github.com/ceph/ceph/blob/master/src/librbd/AioImageRequest.cc#L456. To keep the IO consistent, since the flush might be racing with predecessor write events for a journal append, we would actually want to
Force-pushed from 42fc0fb to 2c99908
@dillaman pushed all the changes and I think it's ready for another review iteration
                           length, flush_entry);
 }

 template <typename I>
 uint64_t Journal<I>::append_io_events(journal::EventType event_type,
-                                      const Bufferlists &bufferlists,
+                                      Bufferlists *bufferlists,
Can use rvalue move semantics to avoid the need to create a vector on the heap.
I'll need C++14 extensions to use move semantics in the capture list of a lambda function.
@rjfd Assuming ObjectRecorder's per-event invocation of librados::aio_operate is the main bottleneck, how about adding an atomic ref count for "incoming events" to ObjectRecorder that is incremented by JournalRecorder before attempting to acquire the ObjectRecorder's lock and decremented by ObjectRecorder::append. Within ObjectRecorder::flush_appends, it could use that atomic counter as a hint for whether or not another IO is incoming and should thus delay the flush (up to some maximum). The effect is that you should start batching single-event append operations into multiple-event append operations. Alternatively, just always execute the librados::aio_operate append operation from the op_work_queue and, again, potentially batch up multiple events into a single librados call.
Force-pushed from 649590b to ffca4e2
@rjfd Don't switch to C++14 -- stay at C++11. You actually can use move semantics with C++11 via a wrapper (https://github.com/dillaman/ceph/blob/wip-librbd-cache/src/librbd/cache/file/AioFile.cc#L29). However, if you move the "async" part down into journal::ObjectRecorder, you won't need to make any changes in librbd.
@rjfd I don't think Journal::ThreadPoolSingleton is still required since everything should be async now.
@dillaman too late on switching to C++14 :P
@dillaman regarding
@rjfd Yes, those config values provide optional batching -- but this would provide batching when you have multiple in-flight events and would always be enabled. I prefer the second option I listed, where you just queue an async "do work" callback (if not already scheduled), since it takes "aio_operate" out of the write path and it is self-throttling with respect to the speed at which aio_operate calls can be invoked.
@dillaman thanks, will go with the second option
Also ran more tests of the RBD suite using the filter "journal"; the results look good: http://pulpito.ceph.com/rdias-2016-09-15_10:42:48-rbd-wip-rdias-testing---basic-smithi/
@@ -59,8 +61,11 @@ JournalRecorder::JournalRecorder(librados::IoCtx &ioctx,

   uint8_t splay_width = m_journal_metadata->get_splay_width();
   for (uint8_t splay_offset = 0; splay_offset < splay_width; ++splay_offset) {
     m_object_locks.push_back(shared_ptr<Mutex>(new Mutex("ObjectRecorder::m_lock::" +
         std::to_string(splay_offset))));
Minor: indentation
   uint64_t object_number = splay_offset + (m_current_set * splay_width);
-  m_object_ptrs[splay_offset] = create_object_recorder(object_number);
+  m_object_ptrs[splay_offset] = create_object_recorder(object_number,
+                                                       m_object_locks[splay_offset]);
Minor: indentation
@@ -201,6 +213,7 @@ void JournalRecorder::open_object_set() {
   create_next_object_recorder(object_recorder);
Add an "else" branch that will unlock the object_recorder if a create wasn't required. Then create_next_object_recorder can become create_next_object_recorder_unlock, ObjectRecorder::append(..., bool unlock) can be removed (merged), and all the extra if (unlock) conditions can be removed.
@@ -262,7 +279,7 @@ void JournalRecorder::create_next_object_recorder(
                 new_object_recorder->get_object_number());
   }

-  new_object_recorder->append(append_buffers);
+  new_object_recorder->append(append_buffers, false);

   m_object_ptrs[splay_offset] = new_object_recorder;
Move to before the append_unlock call above (once changed per the comment above).
@@ -59,10 +69,20 @@ bool ObjectRecorder::append(const AppendBuffers &append_buffers) {
       last_flushed_future = iter->first;
     }
   }

+  if (unlock) {
Delay the unlock so that you don't need to (potentially) immediately re-lock/unlock the lock below.
+  if (!m_aio_scheduled) {
+    m_op_work_queue->queue(new FunctionContext(
+      [this] (int r) {
+        Mutex::Locker locker(*m_lock);
Minor: this is probably a large enough chunk of logic to deserve its own method instead of using a lambda.
@dillaman pushed the new changes.
@rjfd The rbd-mirror stress test case failed pretty early (http://pulpito.ceph.com/jdillaman-2016-09-20_10:23:26-rbd-wip-jd-testing---basic-smithi/), but I haven't gotten a chance to determine why. I'll try to reproduce locally and see if it's related to the changes here. Also, I think before we remove the rbd_op_threads limit, we should get some teuthology test case permutations where it overrides the config value to some value > 1 (e.g. 8 threads) for the qemu_xfstests and ceph_test_librbd cases.
@rjfd Still sorting through the test failures. From what I can easily see, running your branch the stress test's random write is able to inject more IOs / second, but it appears like the actual number of journal events is way out of proportion. For example, it wrote 7600 IOs, which at a maximum should have been around 15,000 events -- but I am seeing nearly 200,000 events in the journal.
@rjfd Correction -- it is working just fine (I thought the max IO size was 24K not 24M) and generating the correct number of events. The only issue is that the current use of
@dillaman reverted the rbd_op_threads commit. The thread pool is now created with size 1 (hardcoded) |
@rjfd I ran this through bench-write and it actually looks like performance is slower: wip-15259 branch:
master branch:
I'll take a deeper look at it this weekend.
@rjfd It turns out that commit 600a9e7, which combined the threads, is the cause of the slowdown. Therefore, I would suggest reverting that commit (and removing the increase of journal threads to 16 in b8bd24f#diff-d6120e3403affa49fa548a4f35928259R154).
Signed-off-by: Ricardo Dias <rdias@suse.com>
@dillaman reverted the commit. Using the journal thread pool with size 1 again.
@dillaman found a new deadlock caused by the
@rjfd It would be safe to drop the lock around "ctx.wait()".
Signed-off-by: Ricardo Dias <rdias@suse.com>
Signed-off-by: Ricardo Dias <rdias@suse.com>
Signed-off-by: Ricardo Dias <rdias@suse.com>
@dillaman done!
This is a preliminary solution to the problem of increasing the throughput of write operations to RBD images' journals.
The current solution might still have data races in some fields of the JournalRecorder class that were protected by JournalRecorder::m_lock.
Also, this implementation still needs testing; not even "make check" was run against it.
Related tracker ticket: http://tracker.ceph.com/issues/15259
Signed-off-by: Ricardo Dias rdias@suse.com