Navigation Menu

Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

rgw: RGWMetaSyncShardCR drops stack refs on destruction #12605

Merged
merged 3 commits into from Jan 16, 2017

Conversation

cbodley
Copy link
Contributor

@cbodley cbodley commented Dec 21, 2016

if the coroutine is canceled before collect_children() can clean up
all of its child stacks, those stack refs will leak. store these
stacks as boost::intrusive_ptr so the ref is dropped automatically on
destruction

Fixes: http://tracker.ceph.com/issues/18300

@liewegas liewegas added this to the kraken milestone Dec 21, 2016
Copy link
Member

@yehudasa yehudasa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good, has it been tested?

@cbodley
Copy link
Contributor Author

cbodley commented Dec 22, 2016

i added a dout to the RGWMetaSyncShardCR destructor and killed the gateway in the middle of test_multi.py:

2016-12-22 11:11:18.958001 7fb24c500700  1 meta sync: RGWMetaSyncShardCR canceled, releasing 2 stacks
2016-12-22 11:11:18.958077 7fb24c500700  1 meta sync: RGWMetaSyncShardCR canceled, releasing 3 stacks

and the valgrind came out clean

@liewegas
Copy link
Member

rebased onto kraken and pushed wip-18300 to ceph-ci so we can run through rgw suite

@liewegas
Copy link
Member

http://pulpito.ceph.com/sage-2016-12-26_18:32:29-rgw-wip-sage-testing---basic-smithi/

still has some valgrind failures, both leaks and the invalid read. see /a/sage-2016-12-26_18:32:29-rgw-wip-sage-testing---basic-smithi/668072

@cbodley
Copy link
Contributor Author

cbodley commented Jan 3, 2017

thanks for running this @liewegas! looking through the results

@cbodley
Copy link
Contributor Author

cbodley commented Jan 4, 2017

i identified 4 different valgrind failures in those results. one was the invalid read bug that Matt's been working on. i opened tracker issues for the other three:

the original leak addressed by this PR did not reproduce

@cbodley
Copy link
Contributor Author

cbodley commented Jan 5, 2017

added two commits to address the use-after-free issue that was uncovered by the initial fix. teuthology run pending..

@cbodley cbodley changed the title rgw: RGWMetaSyncShardCR drops stack refs on destruction [DNM] rgw: RGWMetaSyncShardCR drops stack refs on destruction Jan 5, 2017
@cbodley
Copy link
Contributor Author

cbodley commented Jan 9, 2017

updates:

@cbodley
Copy link
Contributor Author

cbodley commented Jan 10, 2017

I didn't realize that there's a loop in RGWCloneMetaLogCoroutine::operate(), so state_read_shard_status() was being called multiple times with the same completion, and this was hitting an assertion in librados. This means the completion's lifetime has to be separate from RGWCloneMetaLogCoroutine. I pushed an updated version that allocates the completion separately, and cancels its callback in ~RGWCloneMetaLogCoroutine(). Will schedule another run of wip-cbodley-testing.

if the coroutine is canceled before collect_children() can clean up
all of its child stacks, those stack refs will leak. store these
stacks as boost::intrusive_ptr so the ref is dropped automatically on
destruction

Fixes: http://tracker.ceph.com/issues/18300

Signed-off-by: Casey Bodley <cbodley@redhat.com>
Signed-off-by: Casey Bodley <cbodley@redhat.com>
@cbodley
Copy link
Contributor Author

cbodley commented Jan 13, 2017

latest run had some leaks of RGWMetadataLogInfoCompletion. i added a fix for the interaction between boost::intrusive_ptr and RefCountedObj. i updated the wip-cbodley-testing branch (and included Matt's fixes from #12767) and will reschedule as soon as it's built

@cbodley cbodley changed the title [DNM] rgw: RGWMetaSyncShardCR drops stack refs on destruction rgw: RGWMetaSyncShardCR drops stack refs on destruction Jan 13, 2017
@cbodley
Copy link
Contributor Author

cbodley commented Jan 16, 2017

teuthology results were green, except for a single valgrind failure in the osd and an s3test failure due to timeouts on slow requests

@yehudasa yehudasa merged commit afa6cbf into ceph:master Jan 16, 2017
@smithfarm
Copy link
Contributor

@cbodley The jewel backport is non-trivial due to the use of boost::intrusive_ptr which is not available in jewel. If you want to do it, the backport tracker issue is http://tracker.ceph.com/issues/18563

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
4 participants