
librbd: batch object map updates during trim #11510

Merged
merged 1 commit into ceph:master from vshankar:wip-17356 on Nov 5, 2016

Conversation

@vshankar (Contributor)

Shrinking an image can result in a huge number of ObjectMap
updates -- two for each object being removed. This results
in lots of requests being sent to the OSDs, especially when
trimming a gigantic image. This situation can be optimized
by sending batch ObjectMap updates, significantly cutting
down the OSD traffic.

Fixes: http://tracker.ceph.com/issues/17356
Signed-off-by: Venky Shankar <vshankar@redhat.com>
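
To make the batching concrete, here is a minimal, self-contained C++ sketch (not librbd code -- StubObjectMap and its methods are invented stand-ins): it contrasts two per-object updates (pre and post) per trimmed object with two range updates covering the whole trimmed extent.

```cpp
#include <cstdint>
#include <iostream>
#include <vector>

// Hypothetical stand-in for an object map; it only counts how many
// "requests" (update calls) would be issued, it does not talk to OSDs.
enum ObjectState : uint8_t { OBJECT_EXISTS, OBJECT_PENDING, OBJECT_NONEXISTENT };

struct StubObjectMap {
  std::vector<uint8_t> states;
  uint64_t requests = 0;  // each call below models one request to the OSDs

  explicit StubObjectMap(uint64_t n) : states(n, OBJECT_EXISTS) {}

  // per-object update: one request per object
  void update(uint64_t object_no, uint8_t new_state) {
    states[object_no] = new_state;
    ++requests;
  }

  // batch update over [start, end): still a single request
  void update_range(uint64_t start, uint64_t end, uint8_t new_state) {
    for (uint64_t i = start; i < end; ++i) {
      states[i] = new_state;
    }
    ++requests;
  }
};

int main() {
  const uint64_t num_objects = 100000;
  const uint64_t trim_start = 0;

  // Per-object updates: two requests (pre + post) for every trimmed object.
  StubObjectMap per_object(num_objects);
  for (uint64_t i = trim_start; i < num_objects; ++i) {
    per_object.update(i, OBJECT_PENDING);      // pre update
    per_object.update(i, OBJECT_NONEXISTENT);  // post update
  }

  // Batched updates: two range requests cover the whole trimmed extent.
  StubObjectMap batched(num_objects);
  batched.update_range(trim_start, num_objects, OBJECT_PENDING);      // one pre
  batched.update_range(trim_start, num_objects, OBJECT_NONEXISTENT);  // one post

  std::cout << "per-object requests: " << per_object.requests << "\n"
            << "batched requests:    " << batched.requests << "\n";
  return 0;
}
```

For 100,000 trimmed objects this models 200,000 update requests in the per-object case versus 2 in the batched case.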

@dillaman dillaman changed the title librbd: batch ObjectMap updations upon trim librbd: batch object map updates during trim Oct 17, 2016
}

template <typename I>
void TrimRequest<I>::send_copyup_objects() {
void TrimRequest<I>::send_update_objectmap() {

@dillaman

Minor: this state doesn't seem to do anything besides pick which method to invoke next -- why not just combine it w/ send_pre_copyup and skip to send_pre_remove if copyup isn't needed.


Context *ctx = this->create_callback_context();
RWLock::WLocker object_map_locker(image_ctx.object_map_lock);
if (!image_ctx.object_map->aio_update(m_copyup_start, m_copyup_end,

@dillaman

I don't think this is correct. The purpose of the copy-up state is to ensure that if you have snapshots and if there are one-or-more blocks that haven't been copied up to the child image, we need to ensure that these snapshots have the backing parent image data just in case the child image is flattened.

This path would only be executed if you are shrinking an image and will not be executed upon an image removal (since you cannot have snapshots when you remove the image). Since we need to read each affected block from the parent image for copyup to know if the parent image has backing data, I am not sure we can optimize this path to a batch object map update.

@vshankar (Contributor Author)

Yeah, image removal updates the ObjectMap in batches. There were a couple of scenarios that updated the ObjectMap for each object: one during discard (aio_discard) and the other (as you mentioned) being shrinking an image which has snapshots.

This PR reduces the ObjectMap update traffic to the OSDs for the latter case, especially when the object exists (no copy required from its parent) and needs to be removed.

@@ -348,10 +348,6 @@ class AioObjectTrim : public AbstractAioObjectWrite {
virtual void pre_object_map_update(uint8_t *new_state) {
*new_state = OBJECT_PENDING;
}

virtual bool post_object_map_update() {

@dillaman

I didn't realize that the trim state machine already performed a batch update -- I think this removal is correct and is what was (incorrectly) causing per-object object map updates.

@vshankar (Contributor Author)

no worries...

@dillaman dillaman self-assigned this Oct 17, 2016
@vshankar (Contributor Author) left a comment

@dillaman I'll resend the PR with an updated commit message for clarity.

@dillaman

@vshankar It appears that all the fsx tests have failed due to IO mismatches:
http://pulpito.ceph.com/jdillaman-2016-10-24_09:21:03-rbd-wip-jd-testing-distro-basic-smithi/

*new_state = OBJECT_PENDING;

// object map state (before pre) decides if a post update is needed
need_post = m_ictx->object_map->update_required(m_object_no, *new_state);
@dillaman, Oct 27, 2016

Why is this needed? Shouldn't https://github.com/vshankar/ceph/blob/master/src/librbd/AioObjectRequest.cc#L485 cover the case where the PRE didn't update the object map (since https://github.com/vshankar/ceph/blob/master/src/librbd/ObjectMap.cc#L98 would essentially no-op the update if the object is already non-existent), and thus the POST would be a no-op as well (since it's still flagged as non-existent)?

@vshankar (Contributor Author)

The POST would not be a no-op, as the object state is still PENDING at this point (in AioObjectRequest). The batch post update (PENDING -> NONEXISTENT) is done in TrimRequest after the objects for a given range are trimmed.

If this is not the case, then I'm surely missing something in proofreading the code and testing the fix :( [reverting the post_object_map_update() virtual function, the post update happens for each object]

@vshankar (Contributor Author), Oct 28, 2016

Before send_pre() is called (for each object), the object state for the range [start, end) has already been moved to OBJECT_PENDING. Therefore, the pre operation for every object being trimmed becomes a no-op. But in send_post(), the object map state for every object is still OBJECT_PENDING, as the bulk object map update of [start, end) is done only after the objects in this range are trimmed (in TrimRequest::send_post_copyup()). Thus a post object map update would be issued for each object being trimmed.

This is the reason for introducing need_post in AioObjectTrim -- the post update is done only if the object state was not already OBJECT_PENDING during send_pre().
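
Here is a simplified model of the need_post behaviour described above, using hypothetical stand-in types rather than the actual AioObjectTrim/ObjectMap classes: once the batch pre has already marked the range OBJECT_PENDING, the per-object pre becomes a no-op, need_post stays false, and the per-object post is skipped, leaving only the single batch post.

```cpp
#include <cstdint>
#include <iostream>
#include <vector>

enum ObjectState : uint8_t { OBJECT_EXISTS, OBJECT_PENDING, OBJECT_NONEXISTENT };

// Hypothetical stand-in for the object map; not the librbd class.
struct StubObjectMap {
  std::vector<uint8_t> states;

  explicit StubObjectMap(uint64_t n) : states(n, OBJECT_EXISTS) {}

  // returns true only if setting new_state would actually change the entry
  bool update_required(uint64_t object_no, uint8_t new_state) const {
    return states[object_no] != new_state;
  }

  void update(uint64_t object_no, uint8_t new_state) {
    states[object_no] = new_state;
  }
};

// Hypothetical stand-in for the per-object trim request.
struct StubObjectTrim {
  StubObjectMap &object_map;
  uint64_t object_no;
  bool need_post;  // decided during the pre step, checked during the post step

  StubObjectTrim(StubObjectMap &map, uint64_t obj)
    : object_map(map), object_no(obj), need_post(false) {}

  void send_pre() {
    // object map state (before pre) decides whether a post update is needed
    need_post = object_map.update_required(object_no, OBJECT_PENDING);
    if (need_post) {
      object_map.update(object_no, OBJECT_PENDING);
    }
  }

  void send_post() {
    if (need_post) {
      object_map.update(object_no, OBJECT_NONEXISTENT);
    }
    // otherwise the caller issues one batch PENDING -> NONEXISTENT update
  }
};

int main() {
  StubObjectMap map(4);

  // Batch pre: the trim request marks the whole range OBJECT_PENDING up front.
  for (uint64_t i = 0; i < 4; ++i) {
    map.update(i, OBJECT_PENDING);
  }

  StubObjectTrim trim(map, 2);
  trim.send_pre();   // no-op: the object is already OBJECT_PENDING
  trim.send_post();  // skipped: need_post stayed false

  std::cout << "need_post = " << std::boolalpha << trim.need_post << "\n";
  return 0;
}
```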

@dillaman

It should only be set to PENDING if the current state is EXISTS/EXISTS_CLEAN

@vshankar (Contributor Author), Oct 28, 2016

I think that's kind of implicit: ->update_required() in send_pre() would become a no-op for the other cases. AioObjectTrim::pre_object_map_update() does something similar and sets need_post accordingly. Actually, I had initially thought of introducing AioObjectTrim::send_post() to set need_post, thereby avoiding a redundant call to ->update_required() and avoiding calling pre_object_map_update() under object_map_lock.

@vshankar (Contributor Author) commented Nov 2, 2016

@dillaman - I'm not sure I fully understand your concerns with this PR, so let me briefly explain the reasoning behind the changes you commented on:

AioObjectTrim needs to handle the case where the pre/post object map updates are already done by the caller (TrimRequest::send_{pre, post}_copyup()) and the case where the pre/post operations need to be done by AioObjectTrim itself (when called from TrimRequest::send_clean_boundary()). For the latter, we need the pre/post updates to go through in AioObjectTrim (as the object state needs to be moved from EXISTS to PENDING in pre and from PENDING to NONEXISTENT in post). So I still think the changes are required. Plus, fsx runs to completion successfully with these changes.

Let me know if this is completely off...

@dillaman commented Nov 2, 2016

@vshankar I think I'd rather see TrimRequest::<anon>::C_CopyupObject either pass a "disable object map updates" flag to AioObjectTrim or create a new class that explicitly disables object map updates since it was already handled in batch. Right now, I think the current implementation isn't very clear what it's attempting to handle. The AioObjectTrim was added for one special case instead of adding lots of conditions to the original AioObjectRemove. (not to mention, your need_post is named as if it is a local variable)
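
For illustration, a rough sketch of the suggested "disable object map updates" flag, with hypothetical names and signatures (not the real librbd API): the copyup path constructs the per-object request with the flag set because the trim state machine already performs the pre/post updates in batch, while the clean-boundary path leaves it unset and manages its own per-object updates.

```cpp
#include <cstdint>
#include <iostream>

// Hypothetical per-object trim request carrying an explicit flag; the real
// librbd class and constructor differ -- this only illustrates the shape
// of the suggestion.
struct StubObjectTrim {
  uint64_t object_no;
  bool disable_object_map_updates;

  StubObjectTrim(uint64_t obj, bool disable_updates)
    : object_no(obj), disable_object_map_updates(disable_updates) {}

  void send_pre() {
    if (disable_object_map_updates) {
      return;  // the caller already marked the whole range OBJECT_PENDING
    }
    // ... per-object pre update (EXISTS -> PENDING) would go here ...
  }

  void send_post() {
    if (disable_object_map_updates) {
      return;  // the caller issues one batch PENDING -> NONEXISTENT update
    }
    // ... per-object post update (PENDING -> NONEXISTENT) would go here ...
  }
};

int main() {
  // Copyup path: batch updates are handled by the trim state machine.
  StubObjectTrim batched(42, true);
  batched.send_pre();
  batched.send_post();

  // Clean-boundary path: the per-object request manages its own updates.
  StubObjectTrim standalone(7, false);
  standalone.send_pre();
  standalone.send_post();

  std::cout << "batched disables updates: " << std::boolalpha
            << batched.disable_object_map_updates << "\n";
  return 0;
}
```

The same effect could be achieved with a dedicated class instead of a flag, which is the alternative mentioned in the comment above.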

@vshankar (Contributor Author) commented Nov 2, 2016

@dillaman - Right, so it's more of an implementation concern. I'll make the required changes.

Thanks for reviewing.

Shrinking a clone which has snapshots and does not share
a majority of its objects with its parent (i.e., there are
fewer objects to be copied up) involves a huge number of
object map updates -- two (pre, post) per object. This
results in lots of requests being sent to the OSDs,
especially when trimming a gigantic image. This situation
can be optimized by sending batch ObjectMap updates for an
object range, thereby significantly cutting down OSD
traffic and resulting in faster trim times.

Fixes: http://tracker.ceph.com/issues/17356
Signed-off-by: Venky Shankar <vshankar@redhat.com>
@vshankar (Contributor Author) commented Nov 3, 2016

Teuthology jobs which failed earlier (fsx runs) have now passed: http://pulpito.ceph.com/vshankar-2016-11-03_13:12:19-rbd-wip-17356---basic-mira/

@dillaman commented Nov 3, 2016

lgtm

@dillaman dillaman merged commit 5e03f48 into ceph:master Nov 5, 2016
@vshankar vshankar deleted the wip-17356 branch November 9, 2016 06:38
smithfarm added a commit to smithfarm/ceph that referenced this pull request Aug 26, 2017
In master, the "batch update" change [1] was merged before
the "order concurrent updates" [2], while in jewel the latter
is already backported [3]. A partial backport of [1]
was attempted, but the automated cherry-pick missed some
parts of it which this commit is adding manually.

[1] ceph#11510
[2] ceph#12420
[3] ceph#12909

Signed-off-by: Mykola Golub <mgolub@mirantis.com>
Signed-off-by: Nathan Cutler <ncutler@suse.com>
smithfarm added a commit to smithfarm/ceph that referenced this pull request Aug 26, 2017
In master, the "batch update" change [1] was merged before the "order
concurrent updates" [2], while in jewel the latter is already
backported [3]. A backport of [1] to jewel was attempted, and was
necessarily applied on top of [3] - i.e. in the reverse order compared
to how the commits went into master. This reverse ordering caused the
automated cherry-pick to miss some parts of [1] which this commit is
adding manually.

[1] ceph#11510
[2] ceph#12420
[3] ceph#12909

Signed-off-by: Mykola Golub <mgolub@mirantis.com>
Signed-off-by: Nathan Cutler <ncutler@suse.com>