
journal: increase concurrency/parallelism of journal recorder #10445

Merged (4 commits, Sep 27, 2016)

Conversation

@rjfd (Contributor) commented Jul 26, 2016

This is a preliminary solution to the problem of increasing the throughput of write operations to RBD images' journals.

The current solution might still have data races on some fields of the JournalRecorder class that were previously protected by JournalRecorder::m_lock.
Also, this implementation still needs testing; not even "make check" has been run against it.

Related tracker ticket: http://tracker.ceph.com/issues/15259

Signed-off-by: Ricardo Dias <rdias@suse.com>

@@ -53,7 +53,7 @@ namespace {
class ThreadPoolSingleton : public ThreadPool {
public:
explicit ThreadPoolSingleton(CephContext *cct)
-    : ThreadPool(cct, "librbd::thread_pool", "tp_librbd", 1,
+    : ThreadPool(cct, "librbd::thread_pool", "tp_librbd", 16,

I am figuring this is for test purposes: you could revert commit 6c0ab75 and temporarily re-use "rbd_op_threads" for the journal below.

@rjfd (Contributor, Author) commented Jul 27, 2016

@dillaman I followed your suggestions and pushed the new changes.

I added the new changes as a new commit on purpose, just to maintain the history of changes. I'll merge the commits when we are done with this.

@dillaman commented:

@rjfd Since you are removing the fixed thread limit, we should also fix the issue with journal flush events: https://github.com/ceph/ceph/blob/master/src/librbd/AioImageRequest.cc#L456

To keep the IO consistent, since the flush might be racing with predecessor write events for a journal append, we would actually want to call flush_async_operations first, then m_aio_comp->start_op(true) (to start tracking the flush op order), and then, as part of the async callback from flush_async_operations, append the flush journal event, since at that point we know all predecessor ops have been flushed to the journal. We might need to put the start_op call within the async callback to avoid a race with the callback.
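
A minimal, self-contained sketch of the sequencing described above, using hypothetical stand-in types (FlushTracker, JournalStub) rather than the real AioImageRequest/Journal classes; only the ordering of flush_async_operations, start_op and the journal append is taken from the comment.

#include <functional>
#include <iostream>

struct JournalStub {
  void append_flush_event() { std::cout << "flush event appended to journal\n"; }
};

struct FlushTracker {
  JournalStub journal;

  // Stand-in for flush_async_operations(): invokes the callback once all
  // in-flight predecessor write ops have been flushed to the journal.
  void flush_async_operations(std::function<void(int)> on_finish) {
    std::cout << "flushing predecessor async operations\n";
    on_finish(0);
  }

  // Stand-in for m_aio_comp->start_op(true).
  void start_op() { std::cout << "start tracking flush op order\n"; }

  void handle_flush() {
    // 1. flush predecessors first; 2. start_op inside the async callback
    //    (avoiding a race with the callback); 3. only then append the
    //    flush journal event.
    flush_async_operations([this](int r) {
      start_op();
      journal.append_flush_event();
    });
  }
};

int main() {
  FlushTracker tracker;
  tracker.handle_flush();
  return 0;
}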

@rjfd rjfd force-pushed the wip-15259 branch 4 times, most recently from 42fc0fb to 2c99908 Compare September 6, 2016 14:39
@rjfd (Contributor, Author) commented Sep 6, 2016

@dillaman I pushed all the changes and I think it's ready for another review iteration.

length, flush_entry);
}

template <typename I>
uint64_t Journal<I>::append_io_events(journal::EventType event_type,
-                                      const Bufferlists &bufferlists,
+                                      Bufferlists *bufferlists,

Can use rvalue move semantics to avoid the need to create a vector on the heap.

@rjfd (Contributor, Author) replied:

I'll need C++14 extensions to use move semantics in the capture list of a lambda function.

@dillaman commented Sep 6, 2016

@rjfd Assuming ObjectRecorder's per-event invocation of librados::aio_operate is the main bottleneck, how about adding an atomic ref count for "incoming events" to ObjectRecorder that is incremented by JournalRecorder before attempting to acquire the ObjectRecorder's lock and decremented by ObjectRecorder::append? Within ObjectRecorder::flush_appends, it could use that atomic counter as a hint for whether another IO is incoming and should thus delay the flush (up to some maximum). The effect is that you should start batching single-event append operations into multi-event append operations. Alternatively, just always execute the librados::aio_operate append operation from the op_work_queue and again potentially batch up multiple events into a single librados call.
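
A minimal sketch of the first idea (the atomic "incoming events" hint), assuming simplified stand-in types instead of the real journal::ObjectRecorder/JournalRecorder API; the member names and the deferral cap are illustrative only.

#include <atomic>
#include <cstdint>
#include <iostream>

struct ObjectRecorderStub {
  std::atomic<uint32_t> in_flight_appends{0};
  uint32_t deferred_flushes = 0;
  uint32_t max_deferrals = 8;                    // illustrative cap

  // The producer (JournalRecorder in the real code) bumps the counter before
  // trying to take the recorder's lock.
  void notify_incoming() { ++in_flight_appends; }

  void append() {
    --in_flight_appends;
    flush_appends();
  }

  void flush_appends() {
    // A non-zero counter hints that another IO is incoming, so delay the
    // flush (up to some maximum) and batch the events into one RADOS call.
    if (in_flight_appends.load() > 0 && deferred_flushes < max_deferrals) {
      ++deferred_flushes;
      std::cout << "deferring flush, batching events\n";
      return;
    }
    deferred_flushes = 0;
    std::cout << "issuing one librados::aio_operate for the batch\n";
  }
};

int main() {
  ObjectRecorderStub rec;
  rec.notify_incoming();
  rec.notify_incoming();
  rec.append();   // deferred: one more event is still incoming
  rec.append();   // flushes the batched events in a single call
  return 0;
}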

@rjfd rjfd force-pushed the wip-15259 branch 2 times, most recently from 649590b to ffca4e2 Compare September 7, 2016 14:27
@dillaman commented Sep 7, 2016

@rjfd Don't switch to c++14 -- stay at c++11. You actually can use move semantics with c++11 via a wrapper (https://github.com/dillaman/ceph/blob/wip-librbd-cache/src/librbd/cache/file/AioFile.cc#L29). However, if you move the "async" part down into journal::ObjectRecorder, you won't need to make any changes in librbd.
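
For reference, a minimal self-contained sketch of the C++11 move-capture workaround being pointed at: a small copyable wrapper whose copy constructor actually moves its payload, so a lambda (and a std::function built from it) can take ownership of a buffer without a heap-allocated copy. The wrapper name is illustrative, not the one used in the linked AioFile.cc.

#include <functional>
#include <iostream>
#include <utility>
#include <vector>

// Copyable holder that moves its contents on "copy" -- enough for C++11
// lambdas, which cannot move-capture directly.
template <typename T>
struct MoveOnCopy {
  mutable T value;
  explicit MoveOnCopy(T &&v) : value(std::move(v)) {}
  MoveOnCopy(const MoveOnCopy &other) : value(std::move(other.value)) {}
  MoveOnCopy &operator=(const MoveOnCopy &) = delete;
};

int main() {
  std::vector<int> buffers{1, 2, 3};

  // "Copying" the wrapper into the capture list moves the vector instead.
  MoveOnCopy<std::vector<int>> wrapped(std::move(buffers));
  std::function<void()> cb = [wrapped]() {
    std::cout << "captured " << wrapped.value.size() << " buffers\n";
  };
  cb();
  return 0;
}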

@dillaman commented Sep 7, 2016

@rjfd I don't think Journal::ThreadPoolSingleton is still required since everything should be async now.

@rjfd (Contributor, Author) commented Sep 7, 2016

@dillaman too late on switching to C++14 :P
I'll revert that! (gitbuilders didn't like it either)

@rjfd (Contributor, Author) commented Sep 8, 2016

@dillaman regarding

Within ObjectRecorder::flush_appends, it could use that atomic counter as a hint for whether or not another IO is incoming and should thus delay the flush (up to some maximum)

ObjectRecorder::flush_appends already has some form of batching, controlled by the rbd_journal_object_flush_interval and rbd_journal_object_flush_bytes config options. What would be the benefit of such a hint (the incoming-events counter) compared to using the config options?

@dillaman commented Sep 8, 2016

@rjfd Yes, those config values provide optional batching -- but this would provide batching whenever you have multiple in-flight events, and it would always be enabled. I prefer the second option I listed, where you just queue an async "do work" callback (if not already scheduled), since it takes "aio_operate" out of the write path and is self-throttling with respect to the speed at which aio_operate calls can be invoked.
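
A minimal sketch of this second option, with simplified stand-ins for ObjectRecorder and op_work_queue: the append path only queues a single "do work" callback when none is scheduled yet, and that callback later issues one aio_operate-style write for the whole batch. All names here are illustrative, not the real classes.

#include <functional>
#include <iostream>
#include <mutex>
#include <queue>
#include <vector>

struct WorkQueueStub {
  std::queue<std::function<void()>> items;
  void queue(std::function<void()> fn) { items.push(std::move(fn)); }
  void drain() { while (!items.empty()) { items.front()(); items.pop(); } }
};

struct BatchingRecorderStub {
  std::mutex lock;
  std::vector<int> pending_events;
  bool aio_scheduled = false;
  WorkQueueStub *work_queue = nullptr;

  void append(int event) {
    std::lock_guard<std::mutex> locker(lock);
    pending_events.push_back(event);
    if (!aio_scheduled) {
      aio_scheduled = true;                       // schedule exactly one flush
      work_queue->queue([this] { send_appends(); });
    }
  }

  // Runs on the work queue, off the write path, and is self-throttling:
  // a new callback is only queued after this one has run.
  void send_appends() {
    std::lock_guard<std::mutex> locker(lock);
    std::cout << "one aio_operate for " << pending_events.size() << " events\n";
    pending_events.clear();
    aio_scheduled = false;
  }
};

int main() {
  WorkQueueStub wq;
  BatchingRecorderStub rec;
  rec.work_queue = &wq;
  rec.append(1);
  rec.append(2);    // batched: no second callback is queued
  wq.drain();       // a single callback writes both events
  return 0;
}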

@rjfd (Contributor, Author) commented Sep 8, 2016

@dillaman thanks, will go with the second option

@rjfd (Contributor, Author) commented Sep 15, 2016

I also ran more tests of the RBD suite using the "journal" filter, and the results look good:

http://pulpito.ceph.com/rdias-2016-09-15_10:42:48-rbd-wip-rdias-testing---basic-smithi/

@@ -59,8 +61,11 @@ JournalRecorder::JournalRecorder(librados::IoCtx &ioctx,

uint8_t splay_width = m_journal_metadata->get_splay_width();
for (uint8_t splay_offset = 0; splay_offset < splay_width; ++splay_offset) {
m_object_locks.push_back(shared_ptr<Mutex>(new Mutex("ObjectRecorder::m_lock::"+
std::to_string(splay_offset))));

Minor: indentation

uint64_t object_number = splay_offset + (m_current_set * splay_width);
-    m_object_ptrs[splay_offset] = create_object_recorder(object_number);
+    m_object_ptrs[splay_offset] = create_object_recorder(object_number,
+                                                         m_object_locks[splay_offset]);

Minor: indentation

@@ -201,6 +213,7 @@ void JournalRecorder::open_object_set() {
create_next_object_recorder(object_recorder);

Add an "else" branch that will unlock the object_recorder if a create wasn't required --- then create_next_object_recorder can become create_next_object_recorder_unlock and ObjectRecord::append(..., bool unblock) can be removed (merged) and all the extra if (unlock) conditions can be removed.

@@ -262,7 +279,7 @@ void JournalRecorder::create_next_object_recorder(
new_object_recorder->get_object_number());
}

-    new_object_recorder->append(append_buffers);
+    new_object_recorder->append(append_buffers, false);

m_object_ptrs[splay_offset] = new_object_recorder;

Move to before the append_unlock call above (once changed per the comment above)

@@ -59,10 +69,20 @@ bool ObjectRecorder::append(const AppendBuffers &append_buffers) {
last_flushed_future = iter->first;
}
}

if (unlock) {

Delay the unlock so that you don't need to (potentially) immediately re-lock/unlock the lock below.

if (!m_aio_scheduled) {
m_op_work_queue->queue(new FunctionContext(
[this] (int r) {
Mutex::Locker locker(*m_lock);

Minor: this is probably a large enough chunk of logic to deserve its own method instead of using a lambda.

@rjfd (Contributor, Author) commented Sep 16, 2016

@dillaman pushed the new changes.

@dillaman commented:

@rjfd The rbd-mirror stress test case failed pretty early (http://pulpito.ceph.com/jdillaman-2016-09-20_10:23:26-rbd-wip-jd-testing---basic-smithi/), but I haven't gotten a chance to determine why. I'll try to reproduce locally and see if it's related to the changes here.

Also, I think before we remove the rbd_op_threads limit, we should get some teuthology test case permutations where it overrides the config value to some value > 1 (e.g. 8 threads) for the qemu_xfstests and ceph_test_librbd cases.

@dillaman commented:

@rjfd Still sorting through the test failures. From what I can easily see, running your branch the stress test's random write is able to inject more IOs / second, but it appears like the actual number of journal events is way out of proportion. For example, it wrote 7600 IOs, which at a maximum should have been around 15,000 events -- but I am seeing nearly 200,000 events in the journal.

@dillaman commented:

@rjfd Correction -- it is working just fine (I thought the max IO size was 24K not 24M) and generating the correct number of events. The only issue is that the current use of wait_for_snap_present in rbd_mirror_stress needs to wait for replay to complete before testing for the snap (since given the much larger number of events, replay can take longer).

@rjfd (Contributor, Author) commented Sep 22, 2016

@dillaman reverted the rbd_op_threads commit. The thread pool is now created with size 1 (hardcoded)

@rjfd (Contributor, Author) commented Sep 23, 2016

@dillaman commented:

@rjfd I ran this through bench-write and it actually looks like performance is slower:

wip-15259 branch:

bench type write io_size 4096 io_threads 16 bytes 268435456 pattern random
SEC OPS OPS/SEC BYTES/SEC
1 3993 4008.61 16419282.51
2 5580 2797.47 11458424.09
3 7152 2382.04 9756846.92
4 7643 1897.39 7771691.76
5 8139 1607.77 6585431.73
6 8594 920.04 3768468.85
7 9158 713.03 2920579.08
8 9696 504.86 2067913.75
9 10193 513.71 2104168.28
10 10790 537.57 2201870.20

master branch:

bench type write io_size 4096 io_threads 16 bytes 268435456 pattern random
SEC OPS OPS/SEC BYTES/SEC
1 2231 2246.16 9200268.87
2 4993 2503.23 10253225.23
3 6863 2291.92 9387686.15
4 8661 2164.36 8865222.36
5 10450 2090.23 8561580.69
6 12087 1966.68 8055530.50
7 13639 1729.52 7084110.80
8 15321 1691.22 6927241.44
9 16990 1666.70 6826819.93
10 18812 1674.03 6856825.58

I'll take a deeper look at it this weekend.

@dillaman commented:

@rjfd It turns out that commit 600a9e7 is the cause of the slowdown. By combining the threads, ObjectRecorder::send_append_aio effectively blocks incoming IO processing within AioImageRequestWQ. Even increasing the number of threads doesn't help since eventually the journal will be the largest bottleneck.

Therefore, I would suggest reverting that commit (and removing the increase of journal threads to 16 in b8bd24f#diff-d6120e3403affa49fa548a4f35928259R154).

@rjfd (Contributor, Author) commented Sep 26, 2016

@dillaman reverted the commit. The journal thread pool with size 1 is being used again.

@rjfd (Contributor, Author) commented Sep 26, 2016

@dillaman I found a new deadlock caused by ObjectRecorder::send_appends_aio and the reinstated journal thread pool of size 1.
When demoting an image (Journal::demote), the demote call waits for the flush of the append operation while holding m_lock. At the same time, Journal::handle_metadata_updated is called in the context of the only thread available in the journal thread pool and blocks while trying to acquire m_lock. Since the only thread of the journal thread pool is blocked, ObjectRecorder::send_appends_aio is never executed, and thus Journal::demote keeps waiting forever.

@dillaman commented:

It would be safe to drop the lock around "ctx.wait()" and "flush_ctx.wait()" -- synchronous methods shouldn't wait on conditions while holding locks.
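
A minimal standalone sketch of that pattern, with stand-in types rather than the real Journal<I>::demote() code: the synchronous method drops its own lock before waiting on the flush completion, so the single journal worker thread can acquire it and complete the callback instead of deadlocking.

#include <chrono>
#include <condition_variable>
#include <iostream>
#include <mutex>
#include <thread>

struct DemoteStub {
  std::mutex m_lock;                 // stand-in for Journal::m_lock
  std::mutex cond_lock;
  std::condition_variable cond;
  bool flushed = false;

  void demote() {
    std::unique_lock<std::mutex> locker(m_lock);
    std::cout << "appending demotion event\n";

    locker.unlock();                 // drop m_lock before waiting
    {
      std::unique_lock<std::mutex> wait_lock(cond_lock);
      cond.wait(wait_lock, [this] { return flushed; });
    }
    locker.lock();                   // reacquire once the flush completed
    std::cout << "demotion flushed\n";
  }

  // Runs on the single journal worker thread; needs m_lock to make progress.
  void handle_metadata_updated() {
    std::lock_guard<std::mutex> locker(m_lock);
    std::lock_guard<std::mutex> wait_locker(cond_lock);
    flushed = true;
    cond.notify_all();
  }
};

int main() {
  DemoteStub journal;
  std::thread worker([&journal] {
    std::this_thread::sleep_for(std::chrono::milliseconds(10));
    journal.handle_metadata_updated();
  });
  journal.demote();                  // would deadlock if m_lock were held across the wait
  worker.join();
  return 0;
}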


@rjfd (Contributor, Author) commented Sep 26, 2016

@dillaman done!
