
journal: increase concurrency/parallelism of journal recorder #10445

Merged (4 commits, Sep 27, 2016)

Conversation

@rjfd (Contributor) commented Jul 26, 2016

This is a preliminary solution to the problem of increasing the throughput of write operations to RBD images' journals.

The current solution might still have data races on some fields of the JournalRecorder class that were previously protected by JournalRecorder::m_lock.
Also, this implementation still needs testing; not even "make check" has been run against it.

Related tracker ticket: http://tracker.ceph.com/issues/15259

Signed-off-by: Ricardo Dias <rdias@suse.com>

@@ -53,7 +53,7 @@ namespace {
class ThreadPoolSingleton : public ThreadPool {
public:
explicit ThreadPoolSingleton(CephContext *cct)
-    : ThreadPool(cct, "librbd::thread_pool", "tp_librbd", 1,
+    : ThreadPool(cct, "librbd::thread_pool", "tp_librbd", 16,

I am figuring this is for test purposes: you could revert commit 6c0ab75 and temporarily re-use "rbd_op_threads" for the journal below.

@rjfd (Contributor, Author) commented Jul 27, 2016

@dillaman I followed your suggestions and pushed the new changes.

I added the new changes as a new commit on purpose, just to maintain the history of changes. I'll merge the commits when we are done with this.

@dillaman commented:

@rjfd Since you are removing the fixed thread limit, we should also fix the issue with journal flush events: https://github.com/ceph/ceph/blob/master/src/librbd/AioImageRequest.cc#L456

To keep the IO consistent, since the flush might be racing with predecessor write events for a journal append, we would actually want to call flush_async_operations first, then m_aio_comp->start_op(true) (to start tracking the flush op order), and then, as part of the async callback from flush_async_operations, append the flush journal event, since at that point we know all predecessor ops have been flushed to the journal. We might need to put the start_op call within the async callback to avoid a race with the callback.
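
A minimal, self-contained sketch of the sequencing described above, using hypothetical stand-in types (FlushTracker, JournalStub) rather than the real AioImageRequest/Journal classes; only the ordering of flush_async_operations, start_op and the journal append is taken from the comment.

#include <functional>
#include <iostream>

struct JournalStub {
  void append_flush_event() { std::cout << "flush event appended to journal\n"; }
};

struct FlushTracker {
  JournalStub journal;

  // Stand-in for flush_async_operations(): invokes the callback once all
  // in-flight predecessor write ops have been flushed to the journal.
  void flush_async_operations(std::function<void(int)> on_finish) {
    std::cout << "flushing predecessor async operations\n";
    on_finish(0);
  }

  // Stand-in for m_aio_comp->start_op(true).
  void start_op() { std::cout << "start tracking flush op order\n"; }

  void handle_flush() {
    // 1. flush predecessors first; 2. start_op inside the async callback
    //    (avoiding a race with the callback); 3. only then append the
    //    flush journal event.
    flush_async_operations([this](int r) {
      start_op();
      journal.append_flush_event();
    });
  }
};

int main() {
  FlushTracker tracker;
  tracker.handle_flush();
  return 0;
}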

@rjfd rjfd force-pushed the wip-15259 branch 4 times, most recently from 42fc0fb to 2c99908 Compare September 6, 2016 14:39
@rjfd (Contributor, Author) commented Sep 6, 2016

@dillaman I pushed all the changes and I think it's ready for another review iteration.

length, flush_entry);
}

template <typename I>
uint64_t Journal<I>::append_io_events(journal::EventType event_type,
-                                      const Bufferlists &bufferlists,
+                                      Bufferlists *bufferlists,

Can use rvalue move semantics to avoid the need to create a vector on the heap.

@rjfd (Contributor, Author) replied:

I'll need C++14 extensions to use move semantics in the capture list of a lambda function.

@dillaman commented Sep 6, 2016

@rjfd Assuming ObjectRecorder's per-event invocation of librados::aio_operate is the main bottleneck, how about adding an atomic ref count for "incoming events" to ObjectRecorder that is incremented by JournalRecorder before attempting to acquire the ObjectRecorder's lock and decremented by ObjectRecorder::append? Within ObjectRecorder::flush_appends, it could use that atomic counter as a hint for whether another IO is incoming and should thus delay the flush (up to some maximum). The effect is that you should start batching single-event append operations into multi-event append operations. Alternatively, just always execute the librados::aio_operate append operation from the op_work_queue and again potentially batch up multiple events into a single librados call.
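
A minimal sketch of the first idea (the atomic "incoming events" hint), assuming simplified stand-in types instead of the real journal::ObjectRecorder/JournalRecorder API; the member names and the deferral cap are illustrative only.

#include <atomic>
#include <cstdint>
#include <iostream>

struct ObjectRecorderStub {
  std::atomic<uint32_t> in_flight_appends{0};
  uint32_t deferred_flushes = 0;
  uint32_t max_deferrals = 8;                    // illustrative cap

  // The producer (JournalRecorder in the real code) bumps the counter before
  // trying to take the recorder's lock.
  void notify_incoming() { ++in_flight_appends; }

  void append() {
    --in_flight_appends;
    flush_appends();
  }

  void flush_appends() {
    // A non-zero counter hints that another IO is incoming, so delay the
    // flush (up to some maximum) and batch the events into one RADOS call.
    if (in_flight_appends.load() > 0 && deferred_flushes < max_deferrals) {
      ++deferred_flushes;
      std::cout << "deferring flush, batching events\n";
      return;
    }
    deferred_flushes = 0;
    std::cout << "issuing one librados::aio_operate for the batch\n";
  }
};

int main() {
  ObjectRecorderStub rec;
  rec.notify_incoming();
  rec.notify_incoming();
  rec.append();   // deferred: one more event is still incoming
  rec.append();   // flushes the batched events in a single call
  return 0;
}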

@rjfd rjfd force-pushed the wip-15259 branch 2 times, most recently from 649590b to ffca4e2 Compare September 7, 2016 14:27
@dillaman commented Sep 7, 2016

@rjfd Don't switch to c++14 -- stay at c++11. You actually can use move semantics with c++11 via a wrapper (https://github.com/dillaman/ceph/blob/wip-librbd-cache/src/librbd/cache/file/AioFile.cc#L29). However, if you move the "async" part down into journal::ObjectRecorder, you won't need to make any changes in librbd.
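
For reference, a minimal self-contained sketch of the C++11 move-capture workaround being pointed at: a small copyable wrapper whose copy constructor actually moves its payload, so a lambda (and a std::function built from it) can take ownership of a buffer without a heap-allocated copy. The wrapper name is illustrative, not the one used in the linked AioFile.cc.

#include <functional>
#include <iostream>
#include <utility>
#include <vector>

// Copyable holder that moves its contents on "copy" -- enough for C++11
// lambdas, which cannot move-capture directly.
template <typename T>
struct MoveOnCopy {
  mutable T value;
  explicit MoveOnCopy(T &&v) : value(std::move(v)) {}
  MoveOnCopy(const MoveOnCopy &other) : value(std::move(other.value)) {}
  MoveOnCopy &operator=(const MoveOnCopy &) = delete;
};

int main() {
  std::vector<int> buffers{1, 2, 3};

  // "Copying" the wrapper into the capture list moves the vector instead.
  MoveOnCopy<std::vector<int>> wrapped(std::move(buffers));
  std::function<void()> cb = [wrapped]() {
    std::cout << "captured " << wrapped.value.size() << " buffers\n";
  };
  cb();
  return 0;
}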

@dillaman commented Sep 7, 2016

@rjfd I don't think Journal::ThreadPoolSingleton is still required since everything should be async now.

@rjfd (Contributor, Author) commented Sep 7, 2016

@dillaman too late on switching to C++14 :P
I'll revert that! (gitbuilders didn't like it either)

@rjfd (Contributor, Author) commented Sep 8, 2016

@dillaman regarding

Within ObjectRecorder::flush_appends, it could use that atomic counter as a hint for whether or not another IO is incoming and should thus delay the flush (up to some maximum)

ObjectRecorder::flush_appends already has some form of batching, controlled by the rbd_journal_object_flush_interval and rbd_journal_object_flush_bytes config options. What would be the benefit of such a hint (the incoming-events counter) compared to using the config options?

@dillaman commented Sep 8, 2016

@rjfd Yes, those config values provide optional batching -- but this would provide batching whenever you have multiple in-flight events, and it would always be enabled. I prefer the second option I listed, where you just queue an async "do work" callback (if not already scheduled), since it takes "aio_operate" out of the write path and is self-throttling with respect to the speed at which aio_operate calls can be invoked.
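
A minimal sketch of this second option, with simplified stand-ins for ObjectRecorder and op_work_queue: the append path only queues a single "do work" callback when none is scheduled yet, and that callback later issues one aio_operate-style write for the whole batch. All names here are illustrative, not the real classes.

#include <functional>
#include <iostream>
#include <mutex>
#include <queue>
#include <vector>

struct WorkQueueStub {
  std::queue<std::function<void()>> items;
  void queue(std::function<void()> fn) { items.push(std::move(fn)); }
  void drain() { while (!items.empty()) { items.front()(); items.pop(); } }
};

struct BatchingRecorderStub {
  std::mutex lock;
  std::vector<int> pending_events;
  bool aio_scheduled = false;
  WorkQueueStub *work_queue = nullptr;

  void append(int event) {
    std::lock_guard<std::mutex> locker(lock);
    pending_events.push_back(event);
    if (!aio_scheduled) {
      aio_scheduled = true;                       // schedule exactly one flush
      work_queue->queue([this] { send_appends(); });
    }
  }

  // Runs on the work queue, off the write path, and is self-throttling:
  // a new callback is only queued after this one has run.
  void send_appends() {
    std::lock_guard<std::mutex> locker(lock);
    std::cout << "one aio_operate for " << pending_events.size() << " events\n";
    pending_events.clear();
    aio_scheduled = false;
  }
};

int main() {
  WorkQueueStub wq;
  BatchingRecorderStub rec;
  rec.work_queue = &wq;
  rec.append(1);
  rec.append(2);    // batched: no second callback is queued
  wq.drain();       // a single callback writes both events
  return 0;
}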

@rjfd (Contributor, Author) commented Sep 8, 2016

@dillaman thanks, will go with the second option

@rjfd (Contributor, Author) commented Sep 15, 2016

I also ran more tests of the RBD suite using the "journal" filter, and the results look good:

http://pulpito.ceph.com/rdias-2016-09-15_10:42:48-rbd-wip-rdias-testing---basic-smithi/

@@ -59,8 +61,11 @@ JournalRecorder::JournalRecorder(librados::IoCtx &ioctx,

uint8_t splay_width = m_journal_metadata->get_splay_width();
for (uint8_t splay_offset = 0; splay_offset < splay_width; ++splay_offset) {
m_object_locks.push_back(shared_ptr<Mutex>(new Mutex("ObjectRecorder::m_lock::"+
std::to_string(splay_offset))));

Minor: indentation

uint64_t object_number = splay_offset + (m_current_set * splay_width);
-    m_object_ptrs[splay_offset] = create_object_recorder(object_number);
+    m_object_ptrs[splay_offset] = create_object_recorder(object_number,
+                                                         m_object_locks[splay_offset]);

Minor: indentation

@@ -201,6 +213,7 @@ void JournalRecorder::open_object_set() {
create_next_object_recorder(object_recorder);

Add an "else" branch that will unlock the object_recorder if a create wasn't required --- then create_next_object_recorder can become create_next_object_recorder_unlock and ObjectRecord::append(..., bool unblock) can be removed (merged) and all the extra if (unlock) conditions can be removed.

@@ -262,7 +279,7 @@ void JournalRecorder::create_next_object_recorder(
new_object_recorder->get_object_number());
}

-    new_object_recorder->append(append_buffers);
+    new_object_recorder->append(append_buffers, false);

m_object_ptrs[splay_offset] = new_object_recorder;

Move to before the append_unlock call above (once changed per the comment above)

@@ -59,10 +69,20 @@ bool ObjectRecorder::append(const AppendBuffers &append_buffers) {
last_flushed_future = iter->first;
}
}

if (unlock) {

Delay the unlock so that you don't need to (potentially) immediately re-lock/unlock the lock below.

if (!m_aio_scheduled) {
m_op_work_queue->queue(new FunctionContext(
[this] (int r) {
Mutex::Locker locker(*m_lock);

Minor: this is probably a large enough chunk of logic to deserve its own method instead of using a lambda.

@rjfd (Contributor, Author) commented Sep 16, 2016

@dillaman pushed the new changes.

@dillaman commented:

@rjfd The rbd-mirror stress test case failed pretty early (http://pulpito.ceph.com/jdillaman-2016-09-20_10:23:26-rbd-wip-jd-testing---basic-smithi/), but I haven't gotten a chance to determine why. I'll try to reproduce locally and see if it's related to the changes here.

Also, I think before we remove the rbd_op_threads limit, we should get some teuthology test case permutations where it overrides the config value to some value > 1 (e.g. 8 threads) for the qemu_xfstests and ceph_test_librbd cases.

@dillaman commented:

@rjfd Still sorting through the test failures. From what I can easily see, running your branch the stress test's random write is able to inject more IOs / second, but it appears like the actual number of journal events is way out of proportion. For example, it wrote 7600 IOs, which at a maximum should have been around 15,000 events -- but I am seeing nearly 200,000 events in the journal.

@dillaman commented:

@rjfd Correction -- it is working just fine (I thought the max IO size was 24K not 24M) and generating the correct number of events. The only issue is that the current use of wait_for_snap_present in rbd_mirror_stress needs to wait for replay to complete before testing for the snap (since given the much larger number of events, replay can take longer).

@rjfd (Contributor, Author) commented Sep 22, 2016

@dillaman reverted the rbd_op_threads commit. The thread pool is now created with size 1 (hardcoded)

@rjfd (Contributor, Author) commented Sep 23, 2016

@dillaman commented:

@rjfd I ran this through bench-write and it actually looks like performance is slower:

wip-15259 branch:

bench type write io_size 4096 io_threads 16 bytes 268435456 pattern random
SEC OPS OPS/SEC BYTES/SEC
1 3993 4008.61 16419282.51
2 5580 2797.47 11458424.09
3 7152 2382.04 9756846.92
4 7643 1897.39 7771691.76
5 8139 1607.77 6585431.73
6 8594 920.04 3768468.85
7 9158 713.03 2920579.08
8 9696 504.86 2067913.75
9 10193 513.71 2104168.28
10 10790 537.57 2201870.20

master branch:

bench type write io_size 4096 io_threads 16 bytes 268435456 pattern random
SEC OPS OPS/SEC BYTES/SEC
1 2231 2246.16 9200268.87
2 4993 2503.23 10253225.23
3 6863 2291.92 9387686.15
4 8661 2164.36 8865222.36
5 10450 2090.23 8561580.69
6 12087 1966.68 8055530.50
7 13639 1729.52 7084110.80
8 15321 1691.22 6927241.44
9 16990 1666.70 6826819.93
10 18812 1674.03 6856825.58

I'll take a deeper look at it this weekend.

@dillaman commented:

@rjfd It turns out that commit 600a9e7 is the cause of the slowdown. By combining the threads, ObjectRecorder::send_append_aio effectively blocks incoming IO processing within AioImageRequestWQ. Even increasing the number of threads doesn't help since eventually the journal will be the largest bottleneck.

Therefore, I would suggest reverting that commit (and removing the increase of journal threads to 16 in b8bd24f#diff-d6120e3403affa49fa548a4f35928259R154).

@rjfd (Contributor, Author) commented Sep 26, 2016

@dillaman reverted the commit. The journal thread pool with size 1 is being used again.

@rjfd (Contributor, Author) commented Sep 26, 2016

@dillaman I found a new deadlock caused by ObjectRecorder::send_appends_aio and the reinstated journal thread pool of size 1.
When demoting an image (Journal::demote), the demote call waits for the flush of the append operation while holding m_lock. At the same time, Journal::handle_metadata_updated is called in the context of the only thread available in the journal thread pool and blocks while trying to acquire m_lock. Since the only thread of the journal thread pool is blocked, ObjectRecorder::send_appends_aio is never executed, and thus Journal::demote keeps waiting forever.

@dillaman commented:

It would be safe to drop the lock around "ctx.wait()" and "flush_ctx.wait()" -- synchronous methods shouldn't wait on conditions while holding locks.
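
A minimal standalone sketch of that pattern, with stand-in types rather than the real Journal<I>::demote() code: the synchronous method drops its own lock before waiting on the flush completion, so the single journal worker thread can acquire it and complete the callback instead of deadlocking.

#include <chrono>
#include <condition_variable>
#include <iostream>
#include <mutex>
#include <thread>

struct DemoteStub {
  std::mutex m_lock;                 // stand-in for Journal::m_lock
  std::mutex cond_lock;
  std::condition_variable cond;
  bool flushed = false;

  void demote() {
    std::unique_lock<std::mutex> locker(m_lock);
    std::cout << "appending demotion event\n";

    locker.unlock();                 // drop m_lock before waiting
    {
      std::unique_lock<std::mutex> wait_lock(cond_lock);
      cond.wait(wait_lock, [this] { return flushed; });
    }
    locker.lock();                   // reacquire once the flush completed
    std::cout << "demotion flushed\n";
  }

  // Runs on the single journal worker thread; needs m_lock to make progress.
  void handle_metadata_updated() {
    std::lock_guard<std::mutex> locker(m_lock);
    std::lock_guard<std::mutex> wait_locker(cond_lock);
    flushed = true;
    cond.notify_all();
  }
};

int main() {
  DemoteStub journal;
  std::thread worker([&journal] {
    std::this_thread::sleep_for(std::chrono::milliseconds(10));
    journal.handle_metadata_updated();
  });
  journal.demote();                  // would deadlock if m_lock were held across the wait
  worker.join();
  return 0;
}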


@rjfd (Contributor, Author) commented Sep 26, 2016

@dillaman done!
