[prof-heap] Fix issues stemming from rb_gc_force_recycle #3366

AlexJF · 2024-01-05T19:21:39Z

What does this PR do?

When trying our new heap profiler in real applications running in Ruby 2.7 we noticed the profiler quickly shuts down due to detecting duplicate sampling of objects with the same id. This is not supposed to happen since after Ruby 2.7 the object id system was reworked to ensure uniqueness and we rely on this as part of our object liveness tracking.

Turns out Rubies previous to 3.1 support a feature called "force recycling" where an object slot can be manually given back to the heap outside of a GC cycle by calling rb_gc_force_recycle. However, the implementation of this feature is buggy and fails to clean-up objects ids or finalizers associated with objects that undergo this recycling (in a way that will actually slowly leak map entries over time). This feature also brought enough problems to the management of garbage collection code that it was eventually removed from Ruby 3.1 and later and replaced with a no-op (https://bugs.ruby-lang.org/issues/18290).

Therefore, in Rubies < 3.1, if an object we were tracking were recycled and a new object uses its slot, this new object would "inherit" the id of the original and, from the perspective of our profiler, we'd be blind to this behaviour and continue reporting the original object as being alive.

Fortunately, as figured out by @ivoanjo in #3360 (comment), this implementation is buggy in a way that allows us to detect this recycling actually happened. When an object id is requested for an object, that object has the FL_SEEN_OBJ_ID flag forcibly set. This is supposed to be an invariant as evidenced by the assert in https://github.com/ruby/ruby/blob/4a8d7246d15b2054eacb20f8ab3d29d39a3e7856/gc.c#L4050. However, when a new object uses a force recycled slot, it will inherit the object id from the object that was previously using that slow but will NOT be marked with the FL_SEEN_OBJ_ID flag. Consequently, when we're checking the liveness of a heap tracked object, if we're able to successfully map an id to a reference and then check the flags of that reference, if we don't see FL_SEEN_OBJ_ID then we can assume the object we were tracking was implicitly freed and clean its record.

We have to be careful though. Since allocations that re-use recycled slots still trigger allocation tracepoints, it is possible that we decide to start tracking an object that was allocated on such a recycled slot. In this event, that object will already be missing the FL_SEEN_OBJ_ID flag when we start tracking it. If we did nothing, next time we checked liveness using the workaround described in the previous paragraph we'd immediately discard it as dead even though it might very well be alive. Hence, when we detect a missing FL_SEEN_OBJ_ID flag on sampling, we need to force it back in, thus "fixing" the bug in the runtime. This also had the side-effect of reducing the number of map entries leaked in the obj_to_id and id_to_obj objspace tables.

Motivation:
Handle problems introduced by gc_force_recycle in Rubies < 3.1.

Additional Notes:

How to test the change?

For Datadog employees:

If this PR touches code that signs or publishes builds or packages, or handles
credentials of any kind, I've requested a review from @DataDog/security-design-and-guidance.
This PR doesn't touch any of that.

Unsure? Have a question? Request a review!

ivoanjo

Really happy to see this working! Left a few notes :)

ext/ddtrace_profiling_native_extension/extconf.rb

ext/ddtrace_profiling_native_extension/heap_recorder.c

lib/datadog/profiling/component.rb

spec/datadog/profiling/stack_recorder_spec.rb

[prof-heap] Remove internal include

codecov-commenter · 2024-01-08T18:14:16Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (cfeff93) 98.24% compared to head (5e3af5a) 98.24%.
Report is 1 commits behind head on master.

Additional details and impacted files

@@           Coverage Diff            @@
##           master    #3366    +/-   ##
========================================
  Coverage   98.24%   98.24%            
========================================
  Files        1254     1254            
  Lines       73218    73350   +132     
  Branches     3430     3437     +7     
========================================
+ Hits        71931    72064   +133     
+ Misses       1287     1286     -1

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

ivoanjo

👍 Really happy that this seems to have gone from a really annoying issue to a small roadbump :)

ext/ddtrace_profiling_native_extension/extconf.rb

lib/datadog/profiling/component.rb

spec/datadog/profiling/stack_recorder_spec.rb

github-actions bot added the profiling Involves Datadog profiling label Jan 5, 2024

AlexJF mentioned this pull request Jan 5, 2024

[prof-heap] Misc improvements/fixes #3360

Merged

2 tasks

AlexJF force-pushed the alexjf/heap-profiler-force-recycle-fix branch from cb01363 to 2d89d6d Compare January 5, 2024 19:22

ivoanjo reviewed Jan 8, 2024

View reviewed changes

Base automatically changed from alexjf/heap-profiler-misc-fixes to master January 8, 2024 11:41

[prof-heap] Fix issues stemming from rb_gc_force_recycle

3b7a216

AlexJF force-pushed the alexjf/heap-profiler-force-recycle-fix branch 3 times, most recently from a4583e0 to f80cedf Compare January 8, 2024 17:25

[prof-heap] Address comments

304c77e

[prof-heap] Remove internal include

AlexJF force-pushed the alexjf/heap-profiler-force-recycle-fix branch from f80cedf to 304c77e Compare January 8, 2024 17:40

[prof-heap] Skip heap profiling tests on Rubies < 2.7

5e3af5a

AlexJF marked this pull request as ready for review January 8, 2024 18:31

AlexJF requested a review from a team as a code owner January 8, 2024 18:31

ivoanjo approved these changes Jan 9, 2024

View reviewed changes

ext/ddtrace_profiling_native_extension/extconf.rb Show resolved Hide resolved

lib/datadog/profiling/component.rb Outdated Show resolved Hide resolved

spec/datadog/profiling/stack_recorder_spec.rb Outdated Show resolved Hide resolved

AlexJF added 3 commits January 9, 2024 11:07

[prof-heap] Add some spec helpers to help with sample matching

26f1213

[prof-heap] Simplifying macro for workaround

61c2f9f

[prof-heap] Different way to simplify macro for workaround

61fa6bd

ivoanjo approved these changes Jan 9, 2024

View reviewed changes

AlexJF merged commit ffcf68c into master Jan 9, 2024
218 checks passed

AlexJF deleted the alexjf/heap-profiler-force-recycle-fix branch January 9, 2024 12:46

github-actions bot added this to the 1.19.0 milestone Jan 9, 2024

TonyCTHsu pushed a commit that referenced this pull request Jan 9, 2024

[prof-heap] Fix issues stemming from rb_gc_force_recycle (#3366)

81d6387

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[prof-heap] Fix issues stemming from rb_gc_force_recycle #3366

[prof-heap] Fix issues stemming from rb_gc_force_recycle #3366

AlexJF commented Jan 5, 2024 •

edited

ivoanjo left a comment

codecov-commenter commented Jan 8, 2024

ivoanjo left a comment

[prof-heap] Fix issues stemming from rb_gc_force_recycle #3366

[prof-heap] Fix issues stemming from rb_gc_force_recycle #3366

Conversation

AlexJF commented Jan 5, 2024 • edited

ivoanjo left a comment

Choose a reason for hiding this comment

codecov-commenter commented Jan 8, 2024

Codecov Report

ivoanjo left a comment

Choose a reason for hiding this comment

AlexJF commented Jan 5, 2024 •

edited