librbd: batch object map updates during trim #11510
Conversation
```diff
 }

 template <typename I>
-void TrimRequest<I>::send_copyup_objects() {
+void TrimRequest<I>::send_update_objectmap() {
```
Minor: this state doesn't seem to do anything besides pick which method to invoke next -- why not just combine it w/ `send_pre_copyup` and skip to `send_pre_remove` if copyup isn't needed.
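For illustration, the suggested combination might look roughly like this (a sketch of the review suggestion only, not the PR's code; `m_copyup_start`/`m_copyup_end` are the members shown in the snippets below):

```cpp
// Sketch of the suggested merge -- assumed member/method names.
template <typename I>
void TrimRequest<I>::send_pre_copyup() {
  if (m_copyup_start == m_copyup_end) {
    // no objects need parent data copied up: skip straight to removal
    send_pre_remove();
    return;
  }
  // ... otherwise issue the pre-copyup object map update for
  // [m_copyup_start, m_copyup_end) as before ...
}
```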
```cpp
Context *ctx = this->create_callback_context();
RWLock::WLocker object_map_locker(image_ctx.object_map_lock);
if (!image_ctx.object_map->aio_update(m_copyup_start, m_copyup_end,
```
I don't think this is correct. The purpose of the copy-up state is to ensure that if you have snapshots and one or more blocks haven't been copied up to the child image, those snapshots still have the backing parent image data in case the child image is flattened.
This path would only be executed when shrinking an image and will not be executed upon image removal (since you cannot have snapshots when you remove an image). Since we need to read each affected block from the parent image for copyup to know whether the parent image has backing data, I am not sure we can optimize this path to a batch object map update.
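In other words, the copy-up leg is inherently per-object. A conceptual sketch (not librbd code; `parent_has_data()` and `copy_up()` are hypothetical stand-ins):

```cpp
// Before trimming [copyup_start, copyup_end) on a clone with snapshots,
// each object needs its own parent read -- there is no range-level
// shortcut for deciding whether parent data must be preserved.
for (uint64_t ono = copyup_start; ono < copyup_end; ++ono) {
  if (parent_has_data(ono)) {   // requires a per-object parent read
    copy_up(ono);               // keep parent data visible to snapshots
  }
}
```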
Yeah, image removal already updates the ObjectMap in batches. There were a couple of scenarios that updated the ObjectMap for each object: one during discard (`aio_discard`) and the other (as you mentioned) being shrinking an image which has snapshots.
This PR reduces the ObjectMap update traffic to the OSDs for the latter case, especially when the object exists (no copy required from its parent) and needs to be removed.
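To make the traffic difference concrete, here is a tiny self-contained model (not librbd code; `ToyObjectMap` is a stand-in where each `update()` call counts as one OSD request):

```cpp
#include <cstdint>
#include <iostream>
#include <vector>

enum State : uint8_t { EXISTS, PENDING, NONEXISTENT };

struct ToyObjectMap {
  std::vector<State> states;
  uint64_t requests = 0;
  void update(uint64_t begin, uint64_t end, State s) {  // updates [begin, end)
    for (uint64_t i = begin; i < end; ++i) states[i] = s;
    ++requests;  // one round-trip regardless of range size
  }
};

int main() {
  const uint64_t n = 1000;  // objects being trimmed

  ToyObjectMap per_object{std::vector<State>(n, EXISTS)};
  for (uint64_t i = 0; i < n; ++i) {
    per_object.update(i, i + 1, PENDING);      // pre, one request per object
    per_object.update(i, i + 1, NONEXISTENT);  // post, one request per object
  }

  ToyObjectMap batched{std::vector<State>(n, EXISTS)};
  batched.update(0, n, PENDING);      // single pre for the whole range
  batched.update(0, n, NONEXISTENT);  // single post for the whole range

  std::cout << "per-object: " << per_object.requests       // 2000 requests
            << ", batched: " << batched.requests << "\n";  // 2 requests
}
```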
```diff
@@ -348,10 +348,6 @@ class AioObjectTrim : public AbstractAioObjectWrite {
   virtual void pre_object_map_update(uint8_t *new_state) {
     *new_state = OBJECT_PENDING;
   }
-
-  virtual bool post_object_map_update() {
```
I didn't realize that the trim state machine already performed a batch update -- I think this removal is correct and is what was (incorrectly) causing per-object object map updates.
no worries...
@dillaman I'll resend the PR with an updated commit message for clarity.
@vshankar It appears that all the fsx tests have failed due to IO mismatches:
```cpp
*new_state = OBJECT_PENDING;

// object map state (before pre) decides if a post update is needed
need_post = m_ictx->object_map->update_required(m_object_no, *new_state);
```
Why is this needed? Shouldn't https://github.com/vshankar/ceph/blob/master/src/librbd/AioObjectRequest.cc#L485 cover the case when the PRE didn't update the object map (since https://github.com/vshankar/ceph/blob/master/src/librbd/ObjectMap.cc#L98 would essentially no-op the update if the object is already non-existent), and thus the POST would be a no-op as well (since it's still flagged as non-existent)?
POST would not be a no-op, as the object state is still `PENDING` at this point (in `AioObjectRequest`). The batch post update (`PENDING` -> `NON_EXISTENT`) is done in TrimRequest after objects for a given range are trimmed.
If this is not the case, then I'm surely missing something in proofreading the code and testing the fix :( [reverting the `post_object_map_update()` virtual function, the post update happens for each object]
Before `send_pre()` is called (for each object), the object state for the range `[start, end)` has already been moved to `OBJECT_PENDING`. Therefore, the pre operation for every object being trimmed becomes a no-op. But in `send_post()`, the object map state for every object is still `OBJECT_PENDING`, as the bulk object map update of `[start, end)` is done after the objects in this range are trimmed (in `TrimRequest::send_post_copyup()`). Thus a post object map update is done for each object being trimmed.
This is the reason for introducing `need_post` in `AioObjectTrim` -- the post update is done only if the object state was not already `OBJECT_PENDING` during `send_pre()`.
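A rough sketch of that gating (the `update_required()` call is from the diff above; wiring `need_post` through `post_object_map_update()` is an assumption about the mechanism, not necessarily the PR's final shape):

```cpp
// Sketch only -- names beyond those in the diff are assumed.
virtual void pre_object_map_update(uint8_t *new_state) {
  *new_state = OBJECT_PENDING;
  // TrimRequest already bulk-updated [start, end) to OBJECT_PENDING, so
  // for trimmed objects this returns false and the per-object POST is
  // skipped; the bulk PENDING -> NON_EXISTENT update happens later in
  // TrimRequest.
  need_post = m_ictx->object_map->update_required(m_object_no, *new_state);
}

virtual bool post_object_map_update() {
  return need_post;  // per-object POST only when the PRE really updated
}
```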
It should only be set to `PENDING` if the current state is `EXISTS`/`EXISTS_CLEAN`
I think that's kind of implicit. `->update_required()` in `send_pre()` would become a no-op for the other cases. `AioObjectTrim::pre_object_map_update()` does something similar and sets `need_post` accordingly. Actually, I had initially thought to introduce an `AioObjectTrim::send_post()` that would set `need_post`, thereby avoiding the redundant call to `->update_required()` and avoiding the call to `pre_object_map_update()` under `object_map_lock`.
@dillaman - I'm unable to get hold of your concerns in this PR, so let me briefly explain the reasoning behind the changes where the concerns are:
Let me know if this is completely off...
@vshankar I think I'd rather see
@dillaman - Right, so it's more of an implementation point. I'll make the required changes. Thanks for reviewing.
Shrinking a clone which has snapshots and does not share the majority of its objects with its parent (i.e., there are fewer objects to be copied up) involves a huge number of object map updates -- two (pre, post) per object. This results in lots of requests being sent to the OSDs, especially when trimming a gigantic image. This situation can be optimized by sending batch ObjectMap updates for an object range, thereby significantly cutting down OSD traffic and resulting in faster trim times.
Fixes: http://tracker.ceph.com/issues/17356
Signed-off-by: Venky Shankar <vshankar@redhat.com>
Teuthology jobs which failed earlier (fsx runs) have now passed: http://pulpito.ceph.com/vshankar-2016-11-03_13:12:19-rbd-wip-17356---basic-mira/
lgtm
In master, the "batch update" change [1] was merged before the "order concurrent updates" [2], while in jewel the latter is already backported [3]. A partial backport of [1] was attempted, but the automated cherry-pick missed some parts of it which this commit is adding manually. [1] ceph#11510 [2] ceph#12420 [3] ceph#12909 Signed-off-by: Mykola Golub <mgolub@mirantis.com> Signed-off-by: Nathan Cutler <ncutler@suse.com>
In master, the "batch update" change [1] was merged before the "order concurrent updates" [2], while in jewel the latter is already backported [3]. A backport of [1] to jewel was attempted, and was necessarily applied on top of [3] - i.e. in the reverse order compared to how the commits went into master. This reverse ordering caused the automated cherry-pick to miss some parts of [1] which this commit is adding manually. [1] ceph#11510 [2] ceph#12420 [3] ceph#12909 Signed-off-by: Mykola Golub <mgolub@mirantis.com> Signed-off-by: Nathan Cutler <ncutler@suse.com>
Shrinking an image can result in a huge number of ObjectMap
updates -- two for each object being removed. This results
in lots of requests being sent to the OSDs, especially when
trimming a gigantic image. This situation can be optimized
by sending batch ObjectMap updates, significantly cutting
down the OSD traffic.
Fixes: http://tracker.ceph.com/issues/17356
Signed-off-by: Venky Shankar <vshankar@redhat.com>