Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[PROF-9342] Introduce profiler workaround for Ruby Dir interruption bug #3720

Merged
merged 22 commits into from
Jul 2, 2024

Conversation

ivoanjo
Copy link
Member

@ivoanjo ivoanjo commented Jun 14, 2024

What does this PR do?

This PR introduces the "dir interruption" workaround to the profiler. This workaround targets an issue in the Ruby Dir class that was originally reported by a customer in #3450 .

In short, 13 of the Ruby Dir class APIs can be, in rare situations, interrupted by the profiler's sampling mechanism. This interruption makes Ruby incorrectly think that it has reached the end of a directory it is reading, so this could mean for the affected APIs:

  • Ruby would return that a directory is empty when it isn't
  • Ruby would return only some, but not all of the files in the directory

The "dir interruption" workaround prevents this issue from happening by monkey patching the affected APIs to wrap the critical regions with calls to CpuAndWallTimeWorker._native_hold_interruptions, CpuAndWallTimeWorker._native_resume_interruptions, preventing the profiler from interrupting those APIs at the wrong time.

Because all stable Ruby versions are known to be affected by this issue, the workaround is applied by default. This PR adds a new setting that can be used to turn it off:

  • DD_PROFILING_DIR_INTERRUPTION_WORKAROUND_ENABLED=false
  • Datadog.configure { |c| c.profiling.advanced.dir_interruption_workaround_enabled = false }

Motivation:

Avoid triggering bug out-of-the box.

Additional Notes:

I also plan to report the issue upstream, and hopefully work with the Ruby core team to fix it.

Update: reported upstream as https://bugs.ruby-lang.org/issues/20586 and https://bugs.ruby-lang.org/issues/20587 .

The "no signals" workaround documented in our FAQ also prevents this issue from happening, but at the cost of lower quality data gathering, and since this affects every Ruby version, we didn't want to have to toggle the "no signals" workaround enabled by default.

How to test the change?

This PR includes a lot of test coverage.

I've also tested the monkey patch with the Ruby standard library specs ( https://github.com/ruby/spec/tree/master/core/dir ) on my local machine. I'll follow up this PR with a PR on the https://github.com/datadog/prof-correctness/ to automatically add this to CI there, as otherwise it was a bit awkward to add to the dd-trace-rb repository.

For datadog folks, I've setup a simple GCS bucket than can be used to trigger this issue:

$ gcsfuse test_fs_dd_trace_rb fuse-testing/
$ ls fuse-testing/
hello.txt  test.html  test.txt
# Trigger issue without the workaround
$ DD_PROFILING_ENABLED=true DD_PROFILING_DIR_INTERRUPTION_WORKAROUND_ENABLED=false bundle exec ddprofrb exec ruby -e "Datadog::Profiling.wait_until_running; pp Dir.children('fuse-testing/')"
[]
# Trigger issue with workaround
$ DD_PROFILING_ENABLED=true DD_PROFILING_DIR_INTERRUPTION_WORKAROUND_ENABLED=true bundle exec ddprofrb exec ruby -e "Datadog::Profiling.wait_until_running; pp Dir.children('fuse-testing/')"
["hello.txt", "test.html", "test.txt"]
# The "no signals" workaround can be used instead of the "dir interruption" workaround
$ DD_PROFILING_ENABLED=true DD_PROFILING_DIR_INTERRUPTION_WORKAROUND_ENABLED=false DD_PROFILING_NO_SIGNALS_WORKAROUND_ENABLED=true bundle exec ddprofrb exec ruby -e "Datadog::Profiling.wait_until_running; pp Dir.children('fuse-testing/')"
W, [2024-06-14T15:30:09.108975 #948820]  WARN -- datadog: [datadog] Profiling "no signals" workaround enabled via configuration. Profiling data will have lower quality.
["hello.txt", "test.html", "test.txt"]

These will be needed to implement the dir interruption workaround.
I've separately tested these with ruby-spec, but haven't integrated
it yet for automated testing. Will do it a follow-up commit.
For future reference, here's what I get on my local machine:

```
ruby 3.1.4p223 (2023-03-30 revision 957bb7cb81) [x86_64-linux]
Warming up --------------------------------------
       hold / resume   391.856k i/100ms
Calculating -------------------------------------
       hold / resume      3.829M (± 1.9%) i/s -     38.402M in  10.033530s
```
Looks like the `sort:` was added later, and it wasn't particularly
critical to what we were testing so I chose to omit it.
@ivoanjo ivoanjo requested review from a team as code owners June 14, 2024 14:33
@github-actions github-actions bot added core Involves Datadog core libraries profiling Involves Datadog profiling labels Jun 14, 2024
@codecov-commenter
Copy link

codecov-commenter commented Jun 14, 2024

Codecov Report

Attention: Patch coverage is 58.54271% with 165 lines in your changes missing coverage. Please review.

Project coverage is 97.89%. Comparing base (4cd328d) to head (259e134).
Report is 7 commits behind head on master.

Files Patch % Lines
lib/datadog/profiling/ext/dir_monkey_patches.rb 21.96% 135 Missing ⚠️
...c/datadog/profiling/ext/dir_monkey_patches_spec.rb 82.14% 30 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #3720      +/-   ##
==========================================
- Coverage   98.11%   97.89%   -0.22%     
==========================================
  Files        1235     1237       +2     
  Lines       73749    74146     +397     
  Branches     3560     3594      +34     
==========================================
+ Hits        72357    72588     +231     
- Misses       1392     1558     +166     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

# <-- We're safe now while running customer code
yield entry_name
# <-- We'll go back to the Dir internals, critical region again
Datadog::Profiling::Collectors::CpuAndWallTimeWorker._native_hold_interruptions
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't it be in an ensure as well?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤔 I don't think so...? It's ok to deliver signals during exception handling, so if an exception gets raised in the client code, it should be fine for the profiler to continue operating from them on, since no more directory calls will be made by Ruby.

Or do you mean making sure that _native_hold_interruptions never raises or something similar?

Copy link
Contributor

@lloeki lloeki Jul 1, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really don't know, hence why I ask ;)

I'm basing myself on the following outline:

begin
  # [A+] enter our code: hold
  super do
    # [B-] exit our code back to not-our-code: resume
    yield
    # [B+] enter our code: hold
  end
ensure
  # [A-] exit our code: resume
end

On the super happy path the execution order is [A+] [B-] [B+] [A-].

But if yield raises it becomes [A+] [B-] [A-], which in my mind triggers some "consistency" alarms and makes me wonder about internal details:

  • what happens when _native_resume_signals is called twice in a row? is it a NOOP? is it doing work twice (e.g invoking a syscall) but idempotent?
  • there seems to be work to be protected after yield (because of that [B+]) and before method exit ([A-]) so I presume that work is in super, but if yield raises there may be work to protect too in a hypothetical ensure block inside that super too? Or is [B+] entirely superfluous even in the happy yield case?

Conversely, imagine an abstraction like so:

module SignalHolder
  def hold
	# hold
    yield
  ensure
    # resume
  end

  def ignore
    # resume
    yield
  ensure
    # hold
  end
end

And the code becoming:

SignalHolder.hold do
  super { SignalHolder.ignore { yield } }
end

The above is not even a suggestion it's merely to materialise and compare the kind of structural "consistency".

It could simply instead just need a comment to prevent the mind tripwire or help someone who would refactor this code in the future for some reason.

But really, I'd say it's fine to merge if you're confident and my concerns are unwarranted.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it, thanks for explaining!

So, in parts:

  • Yes, it's OK if the yield raises. Specifically:
    • We can call _native_resume_signals many times in a row.

      As you suspected, it leads to pthread_sigmask which I believe on Linux will lead to the rt_sigprocmask system call being called.
      I've included a benchmark in the PR of what the cost of this is, and here's a run I did:

      ruby 3.1.4p223 (2023-03-30 revision 957bb7cb81) [x86_64-linux]
      Warming up --------------------------------------
             hold / resume   391.856k i/100ms
      Calculating -------------------------------------
             hold / resume      3.829M (± 1.9%) i/s -     38.402M in  10.033530s
      

      ...so I think we're fine if we end up calling it a few more times than strictly needed.

    • The root of the issue in the Ruby VM happens because in the affected APIs Ruby silently interprets an interruption as "end of folder/empty folder". When an exception is being raised, the iteration will stop anyway, so there's no longer a concern of the interruption causing a semantic mismatch (at least that I can see).

I've pushed 259e134 to add this note to the code as in hindsight I agree it was not very clear before (and hopefully this clarifies it)

lib/datadog/profiling/ext/dir_monkey_patches.rb Outdated Show resolved Hide resolved
@@ -136,6 +136,7 @@ def self.allocation_count # rubocop:disable Lint/NestedMethodDefinition (On purp
return false unless supported?

require_relative 'profiling/ext/forking'
require_relative 'profiling/ext/dir_monkey_patches'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be more functionally named? ("what" instead of "how")

WDYT of dir_signal_masking?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I went with this name because I know it will show up in the profiling flamegraph for customers.

My thinking is that it's clearer if you see "oh, why is there a datadog thing here? ah, it's monkey patching my uses of dir". Signal masking makes me think a bit more along hte lines of "what is this signal masking thing? what is masking anyway?".

Copy link
Contributor

@lloeki lloeki Jul 1, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ha, interesting! I was thinking that if I saw that in a stack trace or something I would wonder "why is Datadog monkey patching dir stuff at all?". In a way "monkey patching" is how but not what/why. In a way it's also redundant with ext, the sort of conventional way to say "this folder contains things that extend stuff". dir_signal_fix maybe?

But at this stage that's becoming bikeshedding and certainly not a blocker. Feel free to move forward.

Adding links to upstream issues in the comment at the top of the file would help too I think (I reread it and with only the Datadog issue link it makes it like this is a Datadog issue when it's really an upstream issue)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the joke about naming things being hard applies here >_>. I suspect/fear just relying on ext/ may be too subtle (e.g. when you're looking a flamegraph), but am open to changing if folks get confused about the current name.

+1 I've added a link to the upstream ticket in 259e134.

The interruptions part is just a consequence of signal delivery.
Loic suggested this may provide better compatibility, and it seems
to make sense given we're monkey patching a quite core Ruby class.

This diff is way easier to review without whitespace. What I did was
copy/paste the module, remove the `ruby2_keywords`, add `**kwargs` to
the Ruby 3 variants and clean up any things that don't make sense for
Ruby 3 (e.g. testing for 2.5 for some methods).
…arks

...which I totally was almost forgetting.
ivoanjo added a commit to DataDog/prof-correctness that referenced this pull request Jun 19, 2024
…patch

**What does this PR do?**

This PR adds a new test scenario, the `ruby_dir_interruption_patch`.

**Motivation:**

This scenario tests DataDog/dd-trace-rb#3720
in a very unique way: by loading the profiler, and then running the
upstream Ruby test suite for the `Dir` class with our monkey patches
applied.

(For context, our "monkey patches" are a couple of Ruby modules which
wrap the behavior of the standard library `Dir` class, and thus we
need to be quite careful to make sure not to break any existing
functionality.)

**Additional Notes:**

This test does not actually generate or assert on any profiles, it only
checks that the profiler is running and the monkey patches were
correctly applied prior to running the Ruby test suite.

The `mspec_config.rb` file is also quite important, since it ensures
that both the profiler as well as the monkey patches were correctly
applied (e.g. to avoid the test passing trivially because e.g. the
profiler actually didn't start).

Because at time of writing
DataDog/dd-trace-rb#3720 is not yet merged,
this test is expected to fail with:

```
$ TEST_RUN_SECS=5 TEST_SCENARIOS="ruby_dir_interruption_patch" go test -v -run TestScenarios
=== RUN   TestScenarios
    correctness_test.go:210: Considering only scenarios in ruby_dir_interruption_patch
    correctness_test.go:225: Extract base image from: scenarios/ruby_dir_interruption_patch/Dockerfile
=== RUN   TestScenarios/scenarios/ruby_dir_interruption_patch
    correctness_test.go:244: Folder: scenarios/ruby_dir_interruption_patch
    correctness_test.go:245: Json file: scenarios/ruby_dir_interruption_patch/expected_profile.json
    correctness_test.go:246: Docker file: scenarios/ruby_dir_interruption_patch/Dockerfile
    correctness_test.go:248: Built test app with: test-app
    correctness_test.go:137: Running docker command with output /home/ivo.anjo/datadog/prof-correctness/data/ruby_dir_interruption_patch-1618380497
    correctness_test.go:138: '[/bin/sh -c bundle exec ddprofrb exec ruby mspec-master/bin/mspec-run --config mspec_config.rb spec-master/core/dir/]'
    correctness_test.go:151: Error running the test docker run -v /home/ivo.anjo/datadog/prof-correctness/data/ruby_dir_interruption_patch-1618380497:/app/data:rw -u 1000:1000 --security-opt seccomp=unconfined -e EXECUTION_TIME_SEC=5 -e DD_SERVICE=prof-correctness-scenarios/ruby_dir_interruption_patch test-app:latest - /app/mspec_config.rb:3:in `<top (required)>': uninitialized constant Datadog::Profiling::Ext::DirInstanceMonkeyPatches (NameError)

        if Dir.ancestors.first == Datadog::Profiling::Ext::DirInstanceMonkeyPatches &&
                                                         ^^^^^^^^^^^^^^^^^^^^^^^^^^
        	from /app/mspec-master/lib/mspec/utils/script.rb:91:in `load'
        	from /app/mspec-master/lib/mspec/utils/script.rb:91:in `block (2 levels) in try_load'
        	from /app/mspec-master/lib/mspec/utils/script.rb:86:in `each'
        	from /app/mspec-master/lib/mspec/utils/script.rb:86:in `block in try_load'
        	from /app/mspec-master/lib/mspec/utils/script.rb:85:in `each'
        	from /app/mspec-master/lib/mspec/utils/script.rb:85:in `try_load'
        	from /app/mspec-master/lib/mspec/utils/script.rb:102:in `load'
        	from /app/mspec-master/lib/mspec/commands/mspec-run.rb:34:in `block in options'
        	from /app/mspec-master/lib/mspec/utils/options.rb:111:in `process'
        	from /app/mspec-master/lib/mspec/utils/options.rb:143:in `parse'
        	from /app/mspec-master/lib/mspec/commands/mspec-run.rb:76:in `options'
        	from /app/mspec-master/lib/mspec/utils/script.rb:286:in `main'
        	from mspec-master/bin/mspec-run:7:in `<main>'
        I, [2024-06-19T08:58:37.927905 #7]  INFO -- datadog: [datadog] DATADOG CONFIGURATION - CORE - {"date":"2024-06-19T08:58:37Z","os_name":"x86_64-pc-linux","version":"2.1.0","lang":"ruby","lang_version":"3.3.3","env":null,"service":"prof-correctness-scenarios/ruby_dir_interruption_patch","dd_version":null,"debug":false,"tags":null,"runtime_metrics_enabled":false,"vm":"ruby-3.3.3","health_metrics_enabled":false,"profiling_enabled":true}
--- FAIL: TestScenarios (19.35s)
```

...e.g. the monkey patches are missing.

For local testing, you can tweak the `gems.rb` to pick up the new
branch:

```ruby
gem 'datadog', git: 'https://github.com/datadog/dd-trace-rb.git', branch: 'ivoanjo/prof-9342-dir-interruption-workaround'
```

and you'll see the test passing.

Furthermore, I've created a branch where I introduce a bug in
`Dir.home`:

```ruby
gem 'datadog', git: 'https://github.com/datadog/dd-trace-rb.git',  branch: 'ivoanjo/prof-9342-dir-interruption-workaround-deleteme'
```

and here's the Ruby test suite flagging this behavior issue:

```
$ TEST_RUN_SECS=5 TEST_SCENARIOS="ruby_dir_interruption_patch" go test -v -run TestScenarios
=== RUN   TestScenarios
    correctness_test.go:210: Considering only scenarios in ruby_dir_interruption_patch
    correctness_test.go:225: Extract base image from: scenarios/ruby_dir_interruption_patch/Dockerfile
=== RUN   TestScenarios/scenarios/ruby_dir_interruption_patch
    correctness_test.go:244: Folder: scenarios/ruby_dir_interruption_patch
    correctness_test.go:245: Json file: scenarios/ruby_dir_interruption_patch/expected_profile.json
    correctness_test.go:246: Docker file: scenarios/ruby_dir_interruption_patch/Dockerfile
    correctness_test.go:248: Built test app with: test-app
    correctness_test.go:137: Running docker command with output /home/ivo.anjo/datadog/prof-correctness/data/ruby_dir_interruption_patch-1872402346
    correctness_test.go:138: '[/bin/sh -c bundle exec ddprofrb exec ruby mspec-master/bin/mspec-run --config mspec_config.rb spec-master/core/dir/]'
    correctness_test.go:151: Error running the test docker run -v /home/ivo.anjo/datadog/prof-correctness/data/ruby_dir_interruption_patch-1872402346:/app/data:rw -u 1000:1000 --security-opt seccomp=unconfined -e EXECUTION_TIME_SEC=5 -e DD_SERVICE=prof-correctness-scenarios/ruby_dir_interruption_patch test-app:latest - I, [2024-06-19T09:04:55.521052 #7]  INFO -- datadog: [datadog] DATADOG CONFIGURATION - CORE - {"date":"2024-06-19T09:04:55Z","os_name":"x86_64-pc-linux","version":"2.1.0","lang":"ruby","lang_version":"3.3.3","env":null,"service":"prof-correctness-scenarios/ruby_dir_interruption_patch","dd_version":null,"debug":false,"tags":null,"runtime_metrics_enabled":false,"vm":"ruby-3.3.3","health_metrics_enabled":false,"profiling_enabled":true}
        Dir interruption patch is present!
        ruby 3.3.3 (2024-06-12 revision f1c7b6f435) [x86_64-linux]
        ...................................................................................................................................................................................................................................................................FFEEFFEF...............................................................

        1)
        Dir.home when called without arguments returns the current user's home directory, reading $HOME first FAILED
        Expected nil == "/rubyspec_home"
        to be truthy but was false
        /app/spec-master/core/dir/home_spec.rb:16:in `block (3 levels) in <top (required)>'
        /app/spec-master/core/dir/home_spec.rb:4:in `<top (required)>'

        2)
        Dir.home when called without arguments returns a non-frozen string FAILED
        Expected nil.frozen?
        to be falsy but was true
        /app/spec-master/core/dir/home_spec.rb:20:in `block (3 levels) in <top (required)>'
        /app/spec-master/core/dir/home_spec.rb:4:in `<top (required)>'

        (...etc...)

        32 files, 330 examples, 440 expectations, 5 failures, 3 errors, 0 tagged
--- FAIL: TestScenarios (19.42s)
```
@ivoanjo
Copy link
Member Author

ivoanjo commented Jun 19, 2024

I've updated the PR description with links to the upstream reports for this issue https://bugs.ruby-lang.org/issues/20586 / https://bugs.ruby-lang.org/issues/20587 .

@ivoanjo ivoanjo added this to the 2.2.0 milestone Jun 19, 2024
@ivoanjo ivoanjo merged commit dc2739d into master Jul 2, 2024
166 checks passed
@ivoanjo ivoanjo deleted the ivoanjo/prof-9342-dir-interruption-workaround branch July 2, 2024 09:00
@TonyCTHsu TonyCTHsu mentioned this pull request Jul 11, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Involves Datadog core libraries profiling Involves Datadog profiling
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants