
Propagate trace through Concurrent::Promises.future #1522

Merged: 4 commits merged into master on Nov 1, 2023

Conversation

lloeki
Contributor

@lloeki lloeki commented May 26, 2021

Argument passing is preserved through the composited executor.

Concurrent::Future has been moved to a prepend style to resolve an infinite call loop situation.
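For illustration, a minimal sketch of the prepend style of patching (hypothetical module and method names, not the actual dd-trace-rb patch):

```ruby
require 'concurrent-ruby'

# With prepend, the original method stays reachable via super, so the patch
# cannot end up calling itself even if patching logic runs more than once.
module FutureExecutionSketch
  def execute(*args)
    # ...wrap the executor / capture tracing context here...
    super
  end
end

Concurrent::Future.prepend(FutureExecutionSketch)
```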

@lloeki lloeki requested review from marcotc, ivoanjo, delner and a team May 26, 2021 11:30
@lloeki
Contributor Author

lloeki commented May 26, 2021

Duh, one lousy test on my part. Will fix.

ivoanjo
ivoanjo previously approved these changes May 27, 2021
Member

@ivoanjo ivoanjo left a comment

It took me quite a bit to untangle the many bug fixes and features in this one commit:

  • Support Concurrent::Promises::Future
  • Args were not being passed to tasks by the ContextCompositeExecutorService
  • Switch to prepend in FuturePatch to avoid (potential?) infinite recursion
  • Support wrapping the global executors (e.g. whenever the future gets an executor: :io argument; see the usage sketch after this list)
  • Fix piecemeal requires of concurrent-ruby modules in the integration specs, which are not supported by upstream
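For reference, this is the kind of plain concurrent-ruby usage behind the argument-passing and executor: :io items above (upstream API only, nothing dd-trace-rb specific):

```ruby
require 'concurrent-ruby'

# Arguments are handed to the task by the executor, so a wrapping executor
# service has to forward them; :io names one of the global executor pools.
Concurrent::Promises.future_on(:io, 1, 2) { |a, b| a + b }.value! # => 3

# The older Future API picks its executor via an option instead.
Concurrent::Future.execute(executor: :io) { 40 + 2 }.value # => 42
```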

Some of the items were alluded to in the commit message, but I admit that I couldn't quite understand what they were until I reverse-engineered them from the diff.

Having all these changes together made it quite hard for me to review the PR, as I had to separate them out for myself to understand why each change was there and why it was needed.

It may not be worth doing the separation after the fact for this PR (up to you), but I'd really like to see such a separation in the future, as it would make non-trivial changes such as this one a lot easier to review.

parent_context = datadog_configuration.tracer.provider.context

@composited_executor.post(*args) do
executor = @composited_executor.is_a?(Symbol) ? Concurrent.executor(@composited_executor) : @composited_executor
Member

I think we can move this check to #initialize, since these pools are global.

Contributor Author

The default executor pools are global but in Concurrent::Promises a custom executor instance can be passed as argument in some public API methods.
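For example, with the plain concurrent-ruby API:

```ruby
require 'concurrent-ruby'

# A future can run on a global pool referenced by name...
Concurrent::Promises.future_on(:io) { :work }

# ...or on a caller-supplied executor instance that only exists per call site,
# so it cannot be resolved once and for all when the patch is initialized.
pool = Concurrent::FixedThreadPool.new(2)
Concurrent::Promises.future_on(pool) { :work }
```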

Member

Interesting. If that's the case, how would it affect this line of code (vs if it was executed at initialization time)? e.g. would @composited_executor or Concurrent.executor(@composited_executor) change based on the promise configuration?

I wouldn't quite expect them to (??).

Contributor Author

Ah, I believe you are correct, my mind fumbled and for some reason I thought you were referring to another initializer.

Contributor Author

@lloeki lloeki May 28, 2021

Hmm, in fact I now recall it's the Concurrent.executor call behaviour I'm more concerned with. Doing the call here is consistent with the moment the same method is called in evaluate_to (it's just a little bit earlier in the call stack), whereas if it were called in initialize the timing would be noticeably different, including being called in another thread entirely. With the current implementation of concurrent-ruby it does not look like it could be a problem, but I feel it would introduce some brittleness by depending on a second internal behaviour.

Member

I find that usually things break in the ways we didn't think of (lolol) but it's harmless so I'm happy to keep it -- I just asked to make sure that I wasn't missing anything :)

lib/ddtrace/contrib/concurrent_ruby/patcher.rb (outdated review thread, resolved)
Comment on lines 12 to 20
# DEV We save an unmodified copy of Concurrent::Future.
let!(:unmodified_promises) { ::Concurrent::Promises.dup }
let!(:unmodified_future) { ::Concurrent::Future.dup }

# DEV We then restore Concurrent::Future, a dangerous game.
after do
::Concurrent.send(:remove_const, :Promises)
::Concurrent.const_set('Promises', unmodified_promises)
::Concurrent.send(:remove_const, :Future)
::Concurrent.const_set('Future', unmodified_future)
remove_patch!(:concurrent_ruby)
end
Member

I wonder if we could use RSpec's constant stubbing for this purpose, rather than doing our own homegrown constant stubbing implementation.

Contributor Author

My understanding is that we want to test on the real constant (integration testing) but we also want to be able to restore the original inheritance chain. (Ideally we would just remove the module from the chain, but that's not possible in Ruby; I have on occasion resorted to inserting dynamically created modules and emptying them afterwards.)

Are you suggesting we could do something like this to achieve a similar effect?

stub_const('Concurrent::Promises', ::Concurrent::Promises.dup)
stub_const('Concurrent::Future', ::Concurrent::Future.dup)

Member

@ivoanjo ivoanjo May 27, 2021

My understanding is that we want to test on the real constant (integration testing) but we also want to be able to restore the original inheritance chain. (Ideally we would just remove the module from the chain, but that's not possible in Ruby; I have on occasion resorted to inserting dynamically created modules and emptying them afterwards.)

To be honest, I think we could just dependency inject the modules to modify and have the production ones as defaults, but that's usually not what has been done in the gem historically (I also think we shouldn't have as many module functions) so I didn't suggest going that far :)

Are you suggesting we could do something like this to achieve a similar effect?

Yeah, I was thinking that they should be equivalent. I can't think of any cases where it wouldn't be -- but it may just be an unknown-unknown.

Member

This after block is pure dark magic.
It works, but no one knows if it will continue to work after the next change.

If you can find a better way to reset patching, please go ahead and make that change.

Alternatively, this is the only integration that does this: maybe such strict testing is not truly required?

Either way, you might also just leave it as is if you can't find a good alternative.

Contributor Author

Moved to stub_const which seems to work well, and is more readable.

I do have some facilities to handle this in the upcoming Graft instrumentation, but further change here feels way out of scope for this PR. The other ways I can come up with feel even hackier.

As for testing this, pragmatically I definitely found these tests useful to check that this PR works.

@lloeki
Contributor Author

lloeki commented May 27, 2021

avoid (potential?) infinite recursion

That triggered during an rspec run, so unless the behaviour is spec-induced it had a chance of happening for real.

Having all these changes together made it quite hard for me to review the PR, as I had to separate them out for myself to understand why each change was there and why it was needed.

It may not be worth doing the separation after the fact for this PR (up to you), but I'd really like to see such a separation in the future, as it would make non-trivial changes such as this one a lot easier to review.

I agree. It turns out the order of discovering the issues and corresponding fixes was basically the reverse of what would be a sensible sequence of changes for review, and by that point it wasn't quite straightforward to retcon these into non-broken individual commits; I'm sorry I didn't do that.

I tried to make the code comments relevant, but I could definitely have been more explicit in the commit message/PR description.

@ivoanjo
Member

ivoanjo commented May 27, 2021

That triggered during a rspec run, so unless the behaviour is spec-induced it had a chance of happening for real.

I'd be curious to see what happened -- was the patch applied more than once? Just to make sure we're fixing it properly with the prepend version.

I agree. It turns out the order of discovering the issues and corresponding fixes was basically reversed from what would be a sensible sequence of changes for review, and by that point it wasn't quite straightforward to retcon these into non-broken individual commits, which I feel sorry for not doing.

No worries, there's always the ideal and then there's reality. Hope you don't mind if I poke when it made my life harder, and please do the same and get revenge on my PRs as well :)

@lloeki
Contributor Author

lloeki commented May 27, 2021

I'd be curious to see what happened -- was the patch applied more than once? Just to make sure we're fixing it properly with the prepend version.

Yup, now that you mention it, the after step wasn't cleaning up correctly (now fixed), so the replaced method was the same as the replacing one, and it looped. It could also happen in the now-classic case of someone else using prepend on the same method, or a variety of other situations involving inheritance, for which the solution would be to use prepend through and through anyway.
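A toy reproduction of that loop (not the dd-trace-rb code) shows why applying an alias-based patch twice recurses, whereas prepending the same module a second time does not add it to the ancestry again:

```ruby
class Task
  def run
    'original'
  end
end

def apply_alias_patch!
  Task.class_eval do
    alias_method :run_without_patch, :run
    def run
      run_without_patch # calls whatever :run_without_patch points at *now*
    end
  end
end

apply_alias_patch!
apply_alias_patch! # second application: :run_without_patch now aliases the
                   # patched #run, so #run calls itself -> SystemStackError
Task.new.run
```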

Hope you don't mind if I poke when it made my life harder

No offense taken! It was definitely called for :)

@lloeki lloeki force-pushed the feature/1316-concurrent-promises-future branch 2 times, most recently from 9281c98 to 7a61b6b Compare May 27, 2021 14:17
@ivoanjo
Member

ivoanjo commented May 27, 2021

The JRuby 9.2.0.0 test failure looks like it is deterministic -- https://app.circleci.com/pipelines/github/DataDog/dd-trace-rb/3520/workflows/c571f62b-cf7f-44ed-bd05-2d134b54a8a8/jobs/138556/tests#failed-test-0 .

We have 9.2.0.0 because of #1409 but to be honest I'm not entirely a fan. At most, I'd suggest just skipping support for whatever is broken on those JRuby versions, and printing out a warning "did not patch concurrent ruby... your JRuby version is too old, please update".

@marcotc
Member

marcotc commented May 27, 2021

Regarding the lint failure, rebasing with master will fix it now.

Member

@marcotc marcotc left a comment

The PR looks very good and complete, kudos @lloeki!

There are a few discussions in the comments, but overall it's in :shipit: shape.

@lloeki lloeki force-pushed the feature/1316-concurrent-promises-future branch from 7a61b6b to 22c0550 Compare May 28, 2021 14:19
@lloeki
Contributor Author

lloeki commented May 28, 2021

Rebased!

@ivoanjo
Member

ivoanjo commented May 28, 2021

Can you look into #1522 (comment) as well?

@lloeki
Contributor Author

lloeki commented May 28, 2021

Can you look into #1522 (comment) as well?

I was just typing it out ;)

I'm not entirely a fan

I wasn't aware of this PR, and I'm not a fan either, not least because I have already faced non-conformant breakage on JRuby. TBH I'd rather officially support only the latest release of each minor, given the regular stream of "oops we were not quite conformant on that not-so-corner-case" fixes that land in JRuby. Not my call though.

At most, I'd suggest just skipping support for whatever is broken on those JRuby versions, and printing out a warning "did not patch concurrent ruby... your JRuby version is too old, please update".

Is there a sanctioned DSL-ish way to do this, or is this kind of thing made of hardcoded checks?

@ivoanjo
Member

ivoanjo commented May 28, 2021

I wasn't aware of this PR, and I'm not a fan either, not least because I have already faced non-conformant breakage on JRuby. TBH I'd rather officially support only the latest release of each minor, given the regular stream of "oops we were not quite conformant on that not-so-corner-case" fixes that land in JRuby. Not my call though.

Pfft I'll see your issue and raise you this one 🤣

For real though, since this looks like one of those "it breaks your app" kind of deal, I think it's worth doing the small detour, since people are known to get stuck on random versions and not update them.

(But I also don't think we should spend too much time searching for new ones -- e.g. let's not add all minor versions of JRuby and MRI to CI)

Is there a sanctioned DSL-ish way to do this, or is this kind of thing made of hardcoded checks?

#1503 had a few examples that got removed recently. My specific suggestion would be to find out which JRuby version fixed it (binary search 9.2 releases?) and then do something like

(in concurrent_ruby/integration.rb)

        def self.compatible?
          super && version >= MINIMUM_VERSION && !broken_jruby?
        end
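A possible shape for that helper, assuming the fixed JRuby version turned out to be, say, 9.2.1.0 (a placeholder; the real cutoff would have to be determined as suggested above):

```ruby
        def self.broken_jruby?
          RUBY_ENGINE == 'jruby' &&
            Gem::Version.new(JRUBY_VERSION) < Gem::Version.new('9.2.1.0')
        end
```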

@lloeki
Contributor Author

lloeki commented May 28, 2021

Thanks Ivo, I'll look into these examples and attempt something.

ivoanjo added a commit that referenced this pull request Sep 24, 2021
Prior to this change, the profiler tried to sample all threads each
time it decided to sample.

This could have a negative impact on the latency of Ruby services with
a lot of threads, because each sampling pass would be quite expensive.

Just as an illustration, consider the
`benchmarks/profiler_sample_loop` benchmark which tries to sample as
fast as possible 4 threads with very deep (500 frames+) stacks.

As the number of threads to sample increases, sampling performance
degrades:

| Sampled threads | Iterations per second                               |
|-----------------|-----------------------------------------------------|
| 4 threads       | 390.921  (± 6.1%) i/s -      3.920k in  10.075624s  |
| 16 threads      |  97.037  (± 6.2%) i/s -    972.000  in  10.072298s  |
| 32 threads      |  49.576  (± 4.0%) i/s -    496.000  in  10.038491s  |
| 64 threads      |  24.060  (± 8.3%) i/s -    240.000  in  10.033481s  |

(All numbers from my laptop: i7-1068NG7 + ruby 2.7.4p191 [x86_64-darwin19])

When combined with our dynamic sampling mechanism, if the profiler
starts to take longer to sample, it just samples less often.

BUT this means that it can impact service latency because suddenly we
may be spending 40ms+ of CPU time (64 thread example + 500 frames, very
extreme) which from experience (#1511, #1522) is not good.

Instead, this PR changes the stack sampler to sample up to a set
maximum of threads (16), randomly selected among all threads.

This means that adding more threads to the system causes very little
degradation:

| Sampled threads | Iterations per second                              |
|-----------------|----------------------------------------------------|
| 16 threads      | 104.859  (± 6.7%) i/s -      1.050k in  10.068781s |
| 32 threads      | 100.957  (± 5.9%) i/s -      1.010k in  10.047250s |
| 64 threads      |  98.098  (± 5.1%) i/s -    981.000  in  10.038244s |
| 256 threads     |  84.740  (± 8.3%) i/s -    848.000  in  10.074037s |

There's still a bit of degradation that I suspect is from pure VM
overhead -- 256+ threads on a single Ruby VM adds up.

Because we pick the threads to sample randomly, we'll eventually sample
all threads -- just not at once.
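A minimal sketch of that selection idea (hypothetical names, not the actual profiler code):

```ruby
MAX_THREADS_SAMPLED = 16

def threads_to_sample
  threads = Thread.list
  return threads if threads.size <= MAX_THREADS_SAMPLED

  # Each pass picks a different random subset, so across many passes every
  # thread still gets sampled -- just not all in the same pass.
  threads.sample(MAX_THREADS_SAMPLED)
end
```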

Finally, regarding the dynamic sampling mechanism: because the profiler
will not take longer to sample, it will sample more often, and thus
over a longer period we should take roughly the same samples.

One downside of this approach is that if there really are many threads,
the resulting wall clock times in a one minute profile may "drift" around
the 60 second mark, e.g. maybe we only sampled a thread once per
second and only 59 times, so we'll report 59s, but on the next report
we'll include the missing one, so then the result will be 61s.
I've observed 60 +- 1.68 secs for an app with ~65 threads, given the
default maximum of 16 threads.
This seems a reasonable enough margin of error given the improvement to
latency (especially on such a large application! -> even bigger latency
impact if we tried to sample all threads).

If it helps, I've used this trivial Ruby app for testing with many
threads:

```ruby
32.times do |n|
  eval("
    def sleeper#{n}
      sleep
    end
  ")

  Thread.new { send(:"sleeper#{n}") }
end

sleep
```
@ivoanjo
Member

ivoanjo commented Jul 13, 2022

@lloeki any update on this one?

@Stashchenko

We are facing the same issue. We use graphql-ruby, which now uses Concurrent::Promises.future, and our spans are not showing up in the span list.

@Stashchenko

@Stashchenko do you mind opening an issue for it? I can use it to chase down a few people to look at it :)

We already have issue #1316.

@ivoanjo
Member

ivoanjo commented Sep 28, 2023

I've requested the team take another look at this! Hopefully they'll be able to reply soon :)

@ekump ekump force-pushed the feature/1316-concurrent-promises-future branch from 22c0550 to 87eb055 Compare October 12, 2023 15:41
@lloeki lloeki requested a review from a team as a code owner October 12, 2023 15:41
@github-actions github-actions bot added the integrations (Involves tracing integrations) and tracing labels Oct 12, 2023
@ekump ekump force-pushed the feature/1316-concurrent-promises-future branch from 87eb055 to e72489b Compare October 12, 2023 15:44
@ekump
Contributor

ekump commented Oct 12, 2023

The JRuby 9.2.0.0 test failure looks like it is deterministic -- https://app.circleci.com/pipelines/github/DataDog/dd-trace-rb/3520/workflows/c571f62b-cf7f-44ed-bd05-2d134b54a8a8/jobs/138556/tests#failed-test-0 .

We have 9.2.0.0 because of #1409 but to be honest I'm not entirely a fan. At most, I'd suggest just skipping support for whatever is broken on those JRuby versions, and printing out a warning "did not patch concurrent ruby... your JRuby version is too old, please update".

@ivoanjo @lloeki JRuby CI is currently passing (JRuby 9.2.0 is no longer tested), I think this is good to merge?

@lloeki
Contributor Author

lloeki commented Oct 13, 2023

@ekump LGTM, thanks for picking it up!

@composited_executor.post(*args) do
# Pass the original arguments to the composited executor, which
# pushes them (possibly transformed) as block args
executor.post(*args) do |*block_args|
begin
original_context = tracer.provider.context
tracer.provider.context = parent_context
Member

I don't think this type of context sharing works as we expect anymore.

It's not safe to share the same context object between different threads. They will fight for who the active_span is at any point in time, and span parenting will likely get intertwined between all threads sharing the same context.

We should use TraceDigest propagation instead:

def continue_trace!(digest, &block)
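Roughly, the digest-based propagation pattern looks like this (a simplified sketch with hypothetical executor/task variables, not the integration's actual code):

```ruby
# In the thread that creates the future: capture an immutable digest of the
# active trace instead of handing out the mutable context object.
digest = Datadog::Tracing.active_trace && Datadog::Tracing.active_trace.to_digest

executor.post do
  # In the worker thread: continue the trace from the digest, so each thread
  # gets its own context and span parenting stays consistent.
  Datadog::Tracing.continue_trace!(digest) do
    task.call
  end
end
```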

Contributor

updated to use continue_trace!

expect(inner_span.parent_id).to eq(outer_span.span_id)
end

context 'when there are multiple futures with inner spans that have the same parent' do
Contributor

I added this new context to the PR at @marcotc's suggestion. This test fails with the previous implementation and passes when continue_trace! is used, confirming Marco's theory here.

base.class_eval do
alias_method :ns_initialize_without_datadog, :ns_initialize
remove_method(:ns_initialize)
def ns_initialize(value, opts)
Member

Nice cleanup!

Member

@marcotc marcotc left a comment

Awesome work!

lloeki and others added 4 commits November 1, 2023 09:54
Argument passing is preserved through the composited executor.

Concurrent::Future has been moved to a prepend style to resolve an
infinite call loop situation.
This allows a rollback of the effects of prepend upon the original
module being restored after each test.
@ekump ekump force-pushed the feature/1316-concurrent-promises-future branch from 5c1c3d2 to 8278d5d Compare November 1, 2023 13:54
@ekump ekump merged commit f3e4f7e into master Nov 1, 2023
218 checks passed
@ekump ekump deleted the feature/1316-concurrent-promises-future branch November 1, 2023 14:31
@github-actions github-actions bot added this to the 1.16.0 milestone Nov 1, 2023
@marcotc
Member

marcotc commented Nov 3, 2023

Hey @Stashchenko, this change has just been released in 1.16.0. Let us know if this addresses your issue!

@Stashchenko

Stashchenko commented Nov 6, 2023

Hey @Stashchenko, this change has just been released in 1.16.0. Let us know if this addresses your issue!

Just checked. Unfortunately, it doesn't work :(
With monkey patch:
image

With new 1.16 release:
image

In the second screenshot there is no Faraday trace :(

Our setup:

# frozen_string_literal: true

require 'datadog/statsd'
require 'ddtrace'
require 'datadog/appsec'
require 'faraday'

Datadog.configure do |c|
  service_name = 'my-service'
  c.service = service_name

  c.tracing.analytics.enabled = true
  c.tracing.instrument(:concurrent_ruby)
  c.tracing.instrument(:rails)
  c.tracing.instrument(:redis, service_name: "#{service_name}-redis")
  c.tracing.instrument(:active_support, cache_service: service_name)
  c.tracing.instrument(:active_record, service_name: "#{service_name}-postgres")
  c.tracing.instrument(:http, split_by_domain: true)
  c.tracing.instrument(:faraday, service_name:)

  c.runtime_metrics.enabled = true
  c.runtime_metrics.statsd = Datadog::Statsd.new('localhost', (ENV['STATSD_PORT'] || '8120').to_i)

  c.ci.enabled = ENV['CI'].present?
  c.ci.instrument :rspec

  c.appsec.instrument :rails
end

@Stashchenko

Stashchenko commented Nov 6, 2023

I tried to experiment with our monkey patch, and here is what I've found: I only made changes to ContextCompositeExecutorService#post.
In the initializers folder: ddtrace_concurrent_ruby_patch.rb

# frozen_string_literal: true

require 'concurrent/executor/executor_service'

module Datadog
  module Tracing
    module Contrib
      module ConcurrentRuby
        # This patches the Future - to wrap executor service using ContextCompositeExecutorService
        class ContextCompositeExecutorService
          extend Forwardable
          include Concurrent::ExecutorService

          attr_accessor :composited_executor

          def initialize(composited_executor)
            @composited_executor = composited_executor
          end

          # The post method runs the task within composited executor - in a
          # different thread. The original arguments are captured to be
          # propagated to the composited executor post method
          def post(*)
            tracer = Tracing.send(:tracer)
            parent_context = tracer.provider.context
            executor = @composited_executor
            executor = Concurrent.executor(@composited_executor) if @composited_executor.is_a?(Symbol)

            # Pass the original arguments to the composited executor, which
            # pushes them (possibly transformed) as block args
            executor.post(*) do |*block_args|
              original_context = tracer.provider.context
              tracer.provider.context = parent_context

              # Pass the executor-provided block args as they should have been
              # originally passed without composition, see ChainPromise#on_resolvable
              yield(*block_args)
            ensure
              # Restore context in case the current thread gets reused
              tracer.provider.context = original_context
            end
          end

          delegate %i[can_overflow? serialized?] => :composited_executor
        end
      end
    end
  end
end

And it started working:
image

Without patch (not showing):
image

Looks like the issue for us is somewhere in the def post(*) method, when we get the tracer. I'm not sure how it works yet; maybe this will be helpful for you. Otherwise, we have incompatible versions somewhere.

# Restore context in case the current thread gets reused
tracer.provider.context = original_context
end
digest = Tracing.active_trace.to_digest
Contributor

This is causing crashes on launch for us when Tracing.active_trace is nil

Contributor

@segiddins, oops, sorry for this. I am addressing it, and the fix should be released with 1.16.1, which is just around the corner.

Member

Proposed fix is up at #3242

@marcotc
Member

marcotc commented Nov 7, 2023

👋 @Stashchenko, with the 1.16 change, you can see that there's a "floating" block of tracing (the one with a big execute.graphql on the bottom right). This is because the instrumentation of Future is now thread safe.

With new 1.16 release: image

The original patch, where tracer.provider.context is manually passed to another concurrent-ruby context, is not thread-safe. It's not possible for two threads to modify the same tracer.provider.context in a sensible way. We added a test that demonstrates the issues caused by sharing a tracer.provider.context amongst different Futures (you get arbitrary threads creating spans that parent another thread's spans): https://github.com/DataDog/dd-trace-rb/pull/1522/files#diff-c44ebabaf45a6be710afd7ccb798cb12e505e0f0cbd30293c980b52cd5c12121R105-R136

The new implementation actually represents the accurate trace flow in a parallel application, and won't cause span parenting issues.

In the second screenshot there is not faraday trace :(

Now this is unexpected, given this change has nothing to do with Faraday.
Could you let us know how Faraday is related to the rest of the trace you provided?
It's very intriguing that a separate integration stopped working with this change.

@TonyCTHsu
Contributor

TonyCTHsu commented Nov 7, 2023

👋 @Stashchenko, I was not able to reproduce your case.

Could you provide me with something more so I can further investigate the missing Faraday span?

The snippet I used to mimic your behavior:

require 'ddtrace'
require 'graphql'
require 'faraday'
require 'concurrent-ruby'
require 'ostruct'
require 'pry'

class GraphQL::Types::BaseObject < GraphQL::Schema::Object
end

class PersonType < GraphQL::Types::BaseObject
  field :name, String, null: false
end

class PostType < GraphQL::Types::BaseObject
  field :id, ID, null: false
  field :author, PersonType

  def author
    Concurrent::Promises.future do
      Faraday.get 'https://www.google.com'

      OpenStruct.new(id: object.author_id, name: object.author_id)
    end.value
  end
end

class QueryType < GraphQL::Schema::Object
  field :post, PostType do
    argument :id, ID
  end

  # Then provide an implementation:
  def post(id:)
    OpenStruct.new(id: id, author_id: id)
  end
end

class Schema < GraphQL::Schema
  query QueryType

  # define schema lazy_resolve for Future
  lazy_resolve(Concurrent::Promises::Future, :value)
end

Datadog.configure do |c|
  c.service = 'Bug Reproduction'
  c.diagnostics.debug = true
  c.tracing.instrument :faraday
  c.tracing.instrument :concurrent_ruby
  c.tracing.instrument :graphql, schemas: [Schema]
end

query_string = <<-GraphqlQueryString
{
  p1: post(id: 1) { author { name } }
  p2: post(id: 2) { author { name } }
  p3: post(id: 3) { author { name } }
}
GraphqlQueryString

result_hash = Datadog::Tracing.trace("outer_span") do
  Schema.execute(query_string)
end

@Stashchenko

Stashchenko commented Nov 7, 2023

Hi @TonyCTHsu, you should not get the value right away, in the same method call.

Concurrent::Promises.future do
      Faraday.get 'https://www.google.com'

      OpenStruct.new(id: object.author_id, name: object.author_id)
end.value # This piece 

It probably runs the query immediately, leaving nothing for the lazy resolver to do.
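In other words, a sketch of the difference (based on the resolver from the snippet above):

```ruby
def author
  # Returning the Future itself lets the schema's
  # lazy_resolve(Concurrent::Promises::Future, :value) call #value later,
  # after all sibling fields have been enumerated.
  Concurrent::Promises.future do
    Faraday.get 'https://www.google.com'
    OpenStruct.new(id: object.author_id, name: object.author_id)
  end
  # Appending .value here would block and resolve the future immediately,
  # leaving nothing for the lazy resolver to defer.
end
```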

One more step to reproduce: try to get a list of posts with the author's name:

query_string = <<-GraphqlQueryString
{
   posts { author { name } }
}
GraphqlQueryString

That's what we have in our code.
Also, I don't know what exactly this line does: Faraday.get 'https://www.google.com'. Does it execute the Faraday GET, or wait for a real method call (i.e. a proxy object)?

Something like this should actually call Faraday:

Concurrent::Promises.future do
      response = Faraday.get 'https://api.publicapis.org/entries'
      value = JSON.parse(response.body).sample 
      OpenStruct.new(id: object.author_id, name: value['API'])
end

@Stashchenko

Stashchenko commented Nov 8, 2023

Hi @marcotc ,

Could you let us know how Faraday is related to the rest of the trace you provided?

It's not Faraday-related :(
Basically, any span I would like to create inside Promises.future is not showing up in the trace for us :(

@Stashchenko

Stashchenko commented Nov 8, 2023

Hi guys, I tried to reproduce @TonyCTHsu's example in a clean project.

class PersonType < BaseObject
  field :name, String, null: false
end

class PostType < BaseObject
  field :author, PersonType
  field :id, ID, null: false

  def author
    Concurrent::Promises.future do
      Datadog::Tracing.trace('example.trace') do
        response = Faraday.get 'https://api.publicapis.org/entries'
        value = JSON.parse(response.body)['entries'].sample
        OpenStruct.new(id: object.author_id, name: value['API'])
      end
    end
  end
end

field :posts, [PostType], null: false

def posts
  Array.new(10) { OpenStruct.new(id: SecureRandom.uuid, author_id: SecureRandom.uuid) }
end

And the result is: 🔥
image

Looks like something is incompatible inside our project. I will dig deeper to find out what the difference is. Sorry for the confusion, and thanks a lot for the support!
