[prof-heap] Fix issues stemming from rb_gc_force_recycle #3366
Really happy to see this working! Left a few notes :)
[prof-heap] Remove internal include
Codecov Report
All modified and coverable lines are covered by tests ✅
Additional details and impacted files
@@            Coverage Diff            @@
##           master    #3366    +/-   ##
========================================
  Coverage   98.24%   98.24%
========================================
  Files        1254     1254
  Lines       73218    73350    +132
  Branches     3430     3437      +7
========================================
+ Hits        71931    72064    +133
+ Misses       1287     1286      -1
👍 Really happy that this seems to have gone from a really annoying issue to a small roadbump :)
What does this PR do?
When trying our new heap profiler in real applications running on Ruby 2.7, we noticed that the profiler quickly shuts down after detecting duplicate sampling of objects with the same id. This is not supposed to happen: starting with Ruby 2.7 the object id system was reworked to ensure uniqueness, and we rely on this as part of our object liveness tracking.
It turns out that Rubies prior to 3.1 support a feature called "force recycling", where an object slot can be manually given back to the heap outside of a GC cycle by calling `rb_gc_force_recycle`. However, the implementation of this feature is buggy and fails to clean up the object ids or finalizers associated with objects that undergo this recycling (in a way that will actually slowly leak map entries over time). This feature also caused enough problems for the maintenance of the garbage collection code that it was eventually removed in Ruby 3.1 and replaced with a no-op (https://bugs.ruby-lang.org/issues/18290).
Therefore, in Rubies < 3.1, if an object we were tracking was force recycled and a new object then used its slot, the new object would "inherit" the id of the original. From the perspective of our profiler, we'd be blind to this behaviour and continue reporting the original object as being alive.
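The id "inheritance" described above can be illustrated with a small toy model. This is only a sketch under stated assumptions: `toy_heap`, `toy_alloc`, `toy_object_id`, `toy_gc_free` and `toy_force_recycle` are hypothetical names invented for illustration, and the model only mimics the behaviour described in this PR (an id map keyed by slot that force recycling fails to clean up). It is not Ruby's actual GC code.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Toy model only: hypothetical names, not Ruby's real GC code. */
#define TOY_SLOTS 4

typedef struct {
    bool in_use[TOY_SLOTS];     /* slot currently holds an object */
    long slot_to_id[TOY_SLOTS]; /* stand-in for the id map; 0 = no entry */
    long next_id;               /* ids are handed out sequentially */
} toy_heap;

void toy_init(toy_heap *h) {
    memset(h, 0, sizeof(*h));
    h->next_id = 1;
}

/* Allocate into the first free slot; returns the slot index or -1 if full. */
int toy_alloc(toy_heap *h) {
    for (int i = 0; i < TOY_SLOTS; i++) {
        if (!h->in_use[i]) {
            h->in_use[i] = true;
            return i;
        }
    }
    return -1;
}

/* Return the id for the object in `slot`, creating one on first request.
 * An existing map entry always wins, even if it is stale. */
long toy_object_id(toy_heap *h, int slot) {
    assert(slot >= 0 && slot < TOY_SLOTS && h->in_use[slot]);
    if (h->slot_to_id[slot] == 0) {
        h->slot_to_id[slot] = h->next_id++;
    }
    return h->slot_to_id[slot];
}

/* Regular GC free: the slot and its id entry are both released. */
void toy_gc_free(toy_heap *h, int slot) {
    h->in_use[slot] = false;
    h->slot_to_id[slot] = 0;
}

/* Force recycle as described above: the slot is released but the id
 * entry is left behind, so the next occupant inherits it. */
void toy_force_recycle(toy_heap *h, int slot) {
    h->in_use[slot] = false;
}
```

In this model, an object allocated into a force-recycled slot reports the same id as the previous occupant, whereas an object allocated after a regular free receives a fresh id.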
Fortunately, as figured out by @ivoanjo in #3360 (comment), this implementation is buggy in a way that allows us to detect that this recycling actually happened. When an object id is requested for an object, that object has the `FL_SEEN_OBJ_ID` flag forcibly set. This is supposed to be an invariant, as evidenced by the assert in https://github.com/ruby/ruby/blob/4a8d7246d15b2054eacb20f8ab3d29d39a3e7856/gc.c#L4050. However, when a new object uses a force-recycled slot, it will inherit the object id from the object that was previously using that slot but will NOT be marked with the `FL_SEEN_OBJ_ID` flag. Consequently, when we check the liveness of a heap-tracked object, if we can successfully map its id to a reference but do not see `FL_SEEN_OBJ_ID` among that reference's flags, we can assume the object we were tracking was implicitly freed and clean up its record.
We have to be careful, though. Since allocations that reuse recycled slots still trigger allocation tracepoints, it is possible that we decide to start tracking an object that was allocated on such a recycled slot. In that case, the object will already be missing the `FL_SEEN_OBJ_ID` flag when we start tracking it. If we did nothing, the next time we checked liveness using the workaround described in the previous paragraph we'd immediately discard it as dead, even though it might very well be alive. Hence, when we detect a missing `FL_SEEN_OBJ_ID` flag on sampling, we need to force it back in, thus "fixing" the bug in the runtime. This also has the side effect of reducing the number of map entries leaked in the `obj_to_id` and `id_to_obj` objspace tables.
Motivation:
Handle problems introduced by `rb_gc_force_recycle` in Rubies < 3.1.
Additional Notes:
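The liveness check and the flag fix-up described under "What does this PR do?" can be sketched with a toy model. This is a minimal illustration under stated assumptions, not the code in this PR: `toy_heap`, `toy_start_tracking`, `toy_is_alive` and the other `toy_*` names are hypothetical, the `seen_obj_id` array merely stands in for the `FL_SEEN_OBJ_ID` flag, and the id map is a simplification of the `obj_to_id` and `id_to_obj` tables.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Toy model only: hypothetical names, not Ruby's or the profiler's code. */
#define TOY_SLOTS 4

typedef struct {
    bool in_use[TOY_SLOTS];      /* slot currently holds an object */
    bool seen_obj_id[TOY_SLOTS]; /* stand-in for FL_SEEN_OBJ_ID */
    long slot_to_id[TOY_SLOTS];  /* stand-in for the id maps; 0 = no entry */
    long next_id;
} toy_heap;

void toy_init(toy_heap *h) {
    memset(h, 0, sizeof(*h));
    h->next_id = 1;
}

/* New objects start without the flag; returns the slot index or -1. */
int toy_alloc(toy_heap *h) {
    for (int i = 0; i < TOY_SLOTS; i++) {
        if (!h->in_use[i]) {
            h->in_use[i] = true;
            h->seen_obj_id[i] = false;
            return i;
        }
    }
    return -1;
}

/* First request creates the id and sets the flag. A stale entry left by
 * force recycling is returned as-is and the flag is NOT set. */
long toy_object_id(toy_heap *h, int slot) {
    assert(slot >= 0 && slot < TOY_SLOTS && h->in_use[slot]);
    if (h->slot_to_id[slot] != 0) return h->slot_to_id[slot];
    h->slot_to_id[slot] = h->next_id++;
    h->seen_obj_id[slot] = true;
    return h->slot_to_id[slot];
}

/* Buggy force recycle: the slot is released, the id entry is kept. */
void toy_force_recycle(toy_heap *h, int slot) {
    h->in_use[slot] = false;
    h->seen_obj_id[slot] = false;
}

/* Map an id back to a live slot, or -1 if no live object carries it. */
int toy_id_to_slot(const toy_heap *h, long id) {
    for (int i = 0; i < TOY_SLOTS; i++) {
        if (h->in_use[i] && h->slot_to_id[i] == id) return i;
    }
    return -1;
}

/* Sampling: record the id and force the flag back in if it is missing. */
long toy_start_tracking(toy_heap *h, int slot) {
    long id = toy_object_id(h, slot);
    if (!h->seen_obj_id[slot]) h->seen_obj_id[slot] = true;
    return id;
}

/* Liveness: an id that resolves to a slot without the flag belongs to a
 * different object that inherited it, so the tracked one is gone. */
bool toy_is_alive(const toy_heap *h, long tracked_id) {
    int slot = toy_id_to_slot(h, tracked_id);
    if (slot < 0) return false;
    return h->seen_obj_id[slot];
}
```

In this sketch, a tracked object that is force recycled is reported as dead even after a new object takes over its slot and id, while an object that starts being tracked on a recycled slot has its flag restored and therefore survives later liveness checks.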
How to test the change?