
librbd: batch object map updates during trim #11510

Merged
merged 1 commit into ceph:master from vshankar:wip-17356 on Nov 5, 2016

Conversation

@vshankar (Contributor)

Shrinking an image can result in a huge number of ObjectMap
updates -- two for each object being removed. This results
in lots of requests being sent to the OSDs, especially when
trimming a gigantic image. This situation can be optimized
by sending batch ObjectMap updates, significantly cutting
down the OSD traffic.

Fixes: http://tracker.ceph.com/issues/17356
Signed-off-by: Venky Shankar <vshankar@redhat.com>
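
To make the batching concrete, here is a minimal, self-contained C++ sketch (not librbd code -- StubObjectMap and its methods are invented stand-ins): it contrasts two per-object updates (pre and post) per trimmed object with two range updates covering the whole trimmed extent.

```cpp
#include <cstdint>
#include <iostream>
#include <vector>

// Hypothetical stand-in for an object map; it only counts how many
// "requests" (update calls) would be issued, it does not talk to OSDs.
enum ObjectState : uint8_t { OBJECT_EXISTS, OBJECT_PENDING, OBJECT_NONEXISTENT };

struct StubObjectMap {
  std::vector<uint8_t> states;
  uint64_t requests = 0;  // each call below models one request to the OSDs

  explicit StubObjectMap(uint64_t n) : states(n, OBJECT_EXISTS) {}

  // per-object update: one request per object
  void update(uint64_t object_no, uint8_t new_state) {
    states[object_no] = new_state;
    ++requests;
  }

  // batch update over [start, end): still a single request
  void update_range(uint64_t start, uint64_t end, uint8_t new_state) {
    for (uint64_t i = start; i < end; ++i) {
      states[i] = new_state;
    }
    ++requests;
  }
};

int main() {
  const uint64_t num_objects = 100000;
  const uint64_t trim_start = 0;

  // Per-object updates: two requests (pre + post) for every trimmed object.
  StubObjectMap per_object(num_objects);
  for (uint64_t i = trim_start; i < num_objects; ++i) {
    per_object.update(i, OBJECT_PENDING);      // pre update
    per_object.update(i, OBJECT_NONEXISTENT);  // post update
  }

  // Batched updates: two range requests cover the whole trimmed extent.
  StubObjectMap batched(num_objects);
  batched.update_range(trim_start, num_objects, OBJECT_PENDING);      // one pre
  batched.update_range(trim_start, num_objects, OBJECT_NONEXISTENT);  // one post

  std::cout << "per-object requests: " << per_object.requests << "\n"
            << "batched requests:    " << batched.requests << "\n";
  return 0;
}
```

For 100,000 trimmed objects this models 200,000 update requests in the per-object case versus 2 in the batched case.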

@dillaman dillaman changed the title librbd: batch ObjectMap updations upon trim librbd: batch object map updates during trim Oct 17, 2016
}

template <typename I>
void TrimRequest<I>::send_copyup_objects() {
void TrimRequest<I>::send_update_objectmap() {

@dillaman

Minor: this state doesn't seem to do anything besides pick which method to invoke next -- why not just combine it w/ send_pre_copyup and skip to send_pre_remove if copyup isn't needed.


Context *ctx = this->create_callback_context();
RWLock::WLocker object_map_locker(image_ctx.object_map_lock);
if (!image_ctx.object_map->aio_update(m_copyup_start, m_copyup_end,

@dillaman

I don't think this is correct. The purpose of the copy-up state is to ensure that if you have snapshots and if there are one-or-more blocks that haven't been copied up to the child image, we need to ensure that these snapshots have the backing parent image data just in case the child image is flattened.

This path would only be executed if you are shrinking an image and will not be executed upon an image removal (since you cannot have snapshots when you remove the image). Since we need to read each affected block from the parent image for copyup to know if the parent image has backing data, I am not sure we can optimize this path to a batch object map update.

@vshankar (Contributor Author)

Yeah, image removal updates the ObjectMap in batches. There were a couple of scenarios that updated the ObjectMap for each object: one during discard (aio_discard) and the other (as you mentioned) being shrinking an image which has snapshots.

This PR reduces the ObjectMap update traffic to the OSDs for the latter case, especially when the object exists (no copy required from its parent) and needs to be removed.

@@ -348,10 +348,6 @@ class AioObjectTrim : public AbstractAioObjectWrite {
virtual void pre_object_map_update(uint8_t *new_state) {
*new_state = OBJECT_PENDING;
}

virtual bool post_object_map_update() {

@dillaman

I didn't realize that the trim state machine already performed a batch update -- I think this removal is correct and is what was (incorrectly) causing per-object object map updates.

@vshankar (Contributor Author)

no worries...

@dillaman dillaman self-assigned this Oct 17, 2016
@vshankar (Contributor Author) left a comment

@dillaman I'll resend the PR with an updated commit message for clarity.

@dillaman

@vshankar It appears that all the fsx tests have failed due to IO mismatches:
http://pulpito.ceph.com/jdillaman-2016-10-24_09:21:03-rbd-wip-jd-testing-distro-basic-smithi/

*new_state = OBJECT_PENDING;

// object map state (before pre) decides if a post update is needed
need_post = m_ictx->object_map->update_required(m_object_no, *new_state);
@dillaman, Oct 27, 2016

Why is this needed? Shouldn't https://github.com/vshankar/ceph/blob/master/src/librbd/AioObjectRequest.cc#L485 cover the case where the PRE didn't update the object map (since https://github.com/vshankar/ceph/blob/master/src/librbd/ObjectMap.cc#L98 would essentially no-op the update if the object is already non-existent), and thus the POST would be a no-op as well (since it's still flagged as non-existent)?

@vshankar (Contributor Author)

The POST would not be a no-op, as the object state is still PENDING at this point (in AioObjectRequest). The batch post update (PENDING -> NONEXISTENT) is done in TrimRequest after the objects for a given range are trimmed.

If this is not the case, then I'm surely missing something in proofreading the code and testing the fix :( [reverting the post_object_map_update() virtual function, the post update happens for each object]

@vshankar (Contributor Author), Oct 28, 2016

Before send_pre() is called (for each object), the object state for the range [start, end) has already been moved to OBJECT_PENDING. Therefore, the pre operation for every object being trimmed becomes a no-op. But in send_post(), the object map state for every object is still OBJECT_PENDING, as the bulk object map update of [start, end) is done only after the objects in this range are trimmed (in TrimRequest::send_post_copyup()). Thus a post object map update would be issued for each object being trimmed.

This is the reason for introducing need_post in AioObjectTrim -- the post update is done only if the object state was not already OBJECT_PENDING during send_pre().
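
Here is a simplified model of the need_post behaviour described above, using hypothetical stand-in types rather than the actual AioObjectTrim/ObjectMap classes: once the batch pre has already marked the range OBJECT_PENDING, the per-object pre becomes a no-op, need_post stays false, and the per-object post is skipped, leaving only the single batch post.

```cpp
#include <cstdint>
#include <iostream>
#include <vector>

enum ObjectState : uint8_t { OBJECT_EXISTS, OBJECT_PENDING, OBJECT_NONEXISTENT };

// Hypothetical stand-in for the object map; not the librbd class.
struct StubObjectMap {
  std::vector<uint8_t> states;

  explicit StubObjectMap(uint64_t n) : states(n, OBJECT_EXISTS) {}

  // returns true only if setting new_state would actually change the entry
  bool update_required(uint64_t object_no, uint8_t new_state) const {
    return states[object_no] != new_state;
  }

  void update(uint64_t object_no, uint8_t new_state) {
    states[object_no] = new_state;
  }
};

// Hypothetical stand-in for the per-object trim request.
struct StubObjectTrim {
  StubObjectMap &object_map;
  uint64_t object_no;
  bool need_post;  // decided during the pre step, checked during the post step

  StubObjectTrim(StubObjectMap &map, uint64_t obj)
    : object_map(map), object_no(obj), need_post(false) {}

  void send_pre() {
    // object map state (before pre) decides whether a post update is needed
    need_post = object_map.update_required(object_no, OBJECT_PENDING);
    if (need_post) {
      object_map.update(object_no, OBJECT_PENDING);
    }
  }

  void send_post() {
    if (need_post) {
      object_map.update(object_no, OBJECT_NONEXISTENT);
    }
    // otherwise the caller issues one batch PENDING -> NONEXISTENT update
  }
};

int main() {
  StubObjectMap map(4);

  // Batch pre: the trim request marks the whole range OBJECT_PENDING up front.
  for (uint64_t i = 0; i < 4; ++i) {
    map.update(i, OBJECT_PENDING);
  }

  StubObjectTrim trim(map, 2);
  trim.send_pre();   // no-op: the object is already OBJECT_PENDING
  trim.send_post();  // skipped: need_post stayed false

  std::cout << "need_post = " << std::boolalpha << trim.need_post << "\n";
  return 0;
}
```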

@dillaman

It should only be set to PENDING if the current state is EXISTS/EXISTS_CLEAN

@vshankar (Contributor Author), Oct 28, 2016

I think that's kind of implicit: ->update_required() in send_pre() would become a no-op for the other cases. AioObjectTrim::pre_object_map_update() does something similar and sets need_post accordingly. Actually, I had initially thought of introducing AioObjectTrim::send_post() to set need_post, thereby avoiding a redundant call to ->update_required() and avoiding calling pre_object_map_update() under object_map_lock.

@vshankar (Contributor Author) commented Nov 2, 2016

@dillaman - I'm not sure I fully understand your concerns with this PR, so let me briefly explain the reasoning behind the changes you commented on:

AioObjectTrim needs to handle the case where the pre/post object map updates are already done by the caller (TrimRequest::send_{pre, post}_copyup()) and the case where the pre/post operations need to be done by AioObjectTrim itself (when called from TrimRequest::send_clean_boundary()). For the latter, we need the pre/post updates to go through in AioObjectTrim (as the object state needs to be moved from EXISTS to PENDING in pre and from PENDING to NONEXISTENT in post). So I still think the changes are required. Plus, fsx runs to completion successfully with these changes.

Let me know if this is completely off...

@dillaman commented Nov 2, 2016

@vshankar I think I'd rather see TrimRequest::<anon>::C_CopyupObject either pass a "disable object map updates" flag to AioObjectTrim or create a new class that explicitly disables object map updates since it was already handled in batch. Right now, I think the current implementation isn't very clear what it's attempting to handle. The AioObjectTrim was added for one special case instead of adding lots of conditions to the original AioObjectRemove. (not to mention, your need_post is named as if it is a local variable)
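
For illustration, a rough sketch of the suggested "disable object map updates" flag, with hypothetical names and signatures (not the real librbd API): the copyup path constructs the per-object request with the flag set because the trim state machine already performs the pre/post updates in batch, while the clean-boundary path leaves it unset and manages its own per-object updates.

```cpp
#include <cstdint>
#include <iostream>

// Hypothetical per-object trim request carrying an explicit flag; the real
// librbd class and constructor differ -- this only illustrates the shape
// of the suggestion.
struct StubObjectTrim {
  uint64_t object_no;
  bool disable_object_map_updates;

  StubObjectTrim(uint64_t obj, bool disable_updates)
    : object_no(obj), disable_object_map_updates(disable_updates) {}

  void send_pre() {
    if (disable_object_map_updates) {
      return;  // the caller already marked the whole range OBJECT_PENDING
    }
    // ... per-object pre update (EXISTS -> PENDING) would go here ...
  }

  void send_post() {
    if (disable_object_map_updates) {
      return;  // the caller issues one batch PENDING -> NONEXISTENT update
    }
    // ... per-object post update (PENDING -> NONEXISTENT) would go here ...
  }
};

int main() {
  // Copyup path: batch updates are handled by the trim state machine.
  StubObjectTrim batched(42, true);
  batched.send_pre();
  batched.send_post();

  // Clean-boundary path: the per-object request manages its own updates.
  StubObjectTrim standalone(7, false);
  standalone.send_pre();
  standalone.send_post();

  std::cout << "batched disables updates: " << std::boolalpha
            << batched.disable_object_map_updates << "\n";
  return 0;
}
```

The same effect could be achieved with a dedicated class instead of a flag, which is the alternative mentioned in the comment above.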

@vshankar (Contributor Author) commented Nov 2, 2016

@dillaman - Right, so it's more of an implementation concern. I'll make the required changes.

Thanks for reviewing.

Shrinking a clone which has snapshots and does not share
a majority of its objects with its parent (i.e., there are
fewer objects to be copied up) involves a huge number of
object map updates -- two (pre, post) per object. This
results in lots of requests being sent to the OSDs,
especially when trimming a gigantic image. This situation
can be optimized by sending batch ObjectMap updates for an
object range, thereby significantly cutting down OSD
traffic and resulting in faster trim times.

Fixes: http://tracker.ceph.com/issues/17356
Signed-off-by: Venky Shankar <vshankar@redhat.com>
@vshankar (Contributor Author) commented Nov 3, 2016

Teuthology jobs which failed earlier (fsx runs) have now passed: http://pulpito.ceph.com/vshankar-2016-11-03_13:12:19-rbd-wip-17356---basic-mira/

@dillaman commented Nov 3, 2016

lgtm

@dillaman dillaman merged commit 5e03f48 into ceph:master Nov 5, 2016
@vshankar vshankar deleted the wip-17356 branch November 9, 2016 06:38
smithfarm added a commit to smithfarm/ceph that referenced this pull request Aug 26, 2017
In master, the "batch update" change [1] was merged before
the "order concurrent updates" [2], while in jewel the latter
is already backported [3]. A partial backport of [1]
was attempted, but the automated cherry-pick missed some
parts of it which this commit is adding manually.

[1] ceph#11510
[2] ceph#12420
[3] ceph#12909

Signed-off-by: Mykola Golub <mgolub@mirantis.com>
Signed-off-by: Nathan Cutler <ncutler@suse.com>
smithfarm added a commit to smithfarm/ceph that referenced this pull request Aug 26, 2017
In master, the "batch update" change [1] was merged before the "order
concurrent updates" [2], while in jewel the latter is already
backported [3]. A backport of [1] to jewel was attempted, and was
necessarily applied on top of [3] - i.e. in the reverse order compared
to how the commits went into master. This reverse ordering caused the
automated cherry-pick to miss some parts of [1] which this commit is
adding manually.

[1] ceph#11510
[2] ceph#12420
[3] ceph#12909

Signed-off-by: Mykola Golub <mgolub@mirantis.com>
Signed-off-by: Nathan Cutler <ncutler@suse.com>