Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adds tracing support to async boundaries in reactor-core #113

Merged
merged 6 commits into from
Feb 18, 2020
Merged

Adds tracing support to async boundaries in reactor-core #113

merged 6 commits into from
Feb 18, 2020

Conversation

bijukunjummen
Copy link
Contributor

@bijukunjummen bijukunjummen commented Dec 7, 2019

Spring's project-reactor provides a powerful set of operators that abstracts some of the async operations behind simple functions like map, flatMap. This PR adds support for propagating tracing details across thread boundaries when using reactor's operators.

@codecov-io
Copy link

codecov-io commented Dec 7, 2019

Codecov Report

Merging #113 into master will increase coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@             Coverage Diff              @@
##             master     #113      +/-   ##
============================================
+ Coverage     98.87%   98.88%   +<.01%     
- Complexity     1239     1252      +13     
============================================
  Files            68       70       +2     
  Lines          2935     2955      +20     
  Branches        407      407              
============================================
+ Hits           2902     2922      +20     
  Misses           19       19              
  Partials         14       14
Impacted Files Coverage Δ Complexity Δ
...bflux/WingtipsSpringBoot2WebfluxConfiguration.java 100% <100%> (ø) 16 <1> (+1) ⬆️
...pringboot2/webflux/WingtipsReactorInitializer.java 100% <100%> (ø) 4 <4> (?)
.../webflux/WingtipsSpringBoot2WebfluxProperties.java 100% <100%> (ø) 13 <2> (+2) ⬆️
...erwrapper/ScheduledExecutorServiceWithTracing.java 100% <100%> (ø) 6 <6> (?)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 0052a27...92909d8. Read the comment docs.

@nicmunroe
Copy link
Member

Hey @bijukunjummen ! Thanks for the contribution. We'll try to get to this soon. In the meantime, you'll need to sign the CLA before we can accept this PR.

@bijukunjummen
Copy link
Contributor Author

Hey @bijukunjummen ! Thanks for the contribution. We'll try to get to this soon. In the meantime, you'll need to sign the CLA before we can accept this PR.

Thanks @nicmunroe , I have signed the CLA now.

Co-authored-by: RDBreed <rafaelabreed@gmail.com>
@nicmunroe
Copy link
Member

@bijukunjummen I took a quick look, and here are a few things I noticed. These may or may not be showstoppers - hard for me to say since I don't personally do much with project reactor. But in any case:

  1. Project reactor seems to be doing things with caching executors. So if, for example, an ElasticScheduler (like the default Schedulers.elastic()) is created and used before Schedulers.addExecutorServiceDecorator(...) is called to register the new wingtips ScheduledExecutorServiceWithTracing, then the old regular executor is used forever and always and the decorator you wanted is ignored.
    i. This is probably not a big deal for many SB2 apps, as the WingtipsReactorConfiguration will probably do its thing before any of this caching takes effect...
    ii. ...But it does make tests potentially dependent on test ordering, since the hooks are all static and the tests run in the same JVM. If they execute in the wrong order, then we could get intermittent failures. For example, if you run the WingtipsSpringBoot2WebfluxConfigurationTest.reactor_async_trace_propagation(...) test by itself, they'll all fail. But if you reorder the dataprovider for that test so that MANUAL_IMPORT_ONLY is not first, then they all pass. So the fact that this branch builds successfully is by accident, as some other test must be setting the hook before this test runs. And I've definitely seen tests executed in different orders depending on platform. That makes me really uncomfortable.
    iii. As far as I can tell it's impossible to clear this caching once it's done. I'd like to be wrong, and maybe there's a proper official Project Reactor way to reset/clear this kind of cache? 🤷‍♂
  2. The only way the ScheduledExecutorServiceWithTracing wrapping works is if the correct tracing state is on the thread at the site of subscription, not where the Mono/Flux is defined. This is ok if you have full control over where your Mono/Flux is subscribed to, and you understand this subtlety. But it severely limits effectiveness in other scenarios. For example: in a SB2 endpoint you could return a Mono and have the WingtipsReactorConfiguration registered so you'd think it would all "just work". But it falls apart as soon as you do something that causes SB2 to subscribe on a different thread, like putting a @RequestBody arg into your endpoint method. I could see this leading to a lot of confusion.

That said I can see how this PR would be useful for some users in some scenarios, so I don't want to just say no to this PR. Thoughts?

@bijukunjummen
Copy link
Contributor Author

@bijukunjummen I took a quick look, and here are a few things I noticed. These may or may not be showstoppers - hard for me to say since I don't personally do much with project reactor. But in any case:

  1. Project reactor seems to be doing things with caching executors. So if, for example, an ElasticScheduler (like the default Schedulers.elastic()) is created and used before Schedulers.addExecutorServiceDecorator(...) is called to register the new wingtips ScheduledExecutorServiceWithTracing, then the old regular executor is used forever and always and the decorator you wanted is ignored.
    i. This is probably not a big deal for many SB2 apps, as the WingtipsReactorConfiguration will probably do its thing before any of this caching takes effect...
    ii. ...But it does make tests potentially dependent on test ordering, since the hooks are all static and the tests run in the same JVM. If they execute in the wrong order, then we could get intermittent failures. For example, if you run the WingtipsSpringBoot2WebfluxConfigurationTest.reactor_async_trace_propagation(...) test by itself, they'll all fail. But if you reorder the dataprovider for that test so that MANUAL_IMPORT_ONLY is not first, then they all pass. So the fact that this branch builds successfully is by accident, as some other test must be setting the hook before this test runs. And I've definitely seen tests executed in different orders depending on platform. That makes me really uncomfortable.
    iii. As far as I can tell it's impossible to clear this caching once it's done. I'd like to be wrong, and maybe there's a proper official Project Reactor way to reset/clear this kind of cache? 🤷‍♂
  2. The only way the ScheduledExecutorServiceWithTracing wrapping works is if the correct tracing state is on the thread at the site of subscription, not where the Mono/Flux is defined. This is ok if you have full control over where your Mono/Flux is subscribed to, and you understand this subtlety. But it severely limits effectiveness in other scenarios. For example: in a SB2 endpoint you could return a Mono and have the WingtipsReactorConfiguration registered so you'd think it would all "just work". But it falls apart as soon as you do something that causes SB2 to subscribe on a different thread, like putting a @RequestBody arg into your endpoint method. I could see this leading to a lot of confusion.

That said I can see how this PR would be useful for some users in some scenarios, so I don't want to just say no to this PR. Thoughts?

Thanks for your thoughtful feedback @nicmunroe. I will get back soon

@bijukunjummen
Copy link
Contributor Author

@bijukunjummen I took a quick look, and here are a few things I noticed. These may or may not be showstoppers - hard for me to say since I don't personally do much with project reactor. But in any case:

  1. Project reactor seems to be doing things with caching executors. So if, for example, an ElasticScheduler (like the default Schedulers.elastic()) is created and used before Schedulers.addExecutorServiceDecorator(...) is called to register the new wingtips ScheduledExecutorServiceWithTracing, then the old regular executor is used forever and always and the decorator you wanted is ignored.
    i. This is probably not a big deal for many SB2 apps, as the WingtipsReactorConfiguration will probably do its thing before any of this caching takes effect...

Yes, I think this is a fair assumption. I see this being useful only in Spring Boot 2 based projects and the initializer will ensure that the hook takes effect before any of the reactor library functions are used.

ii. ...But it does make tests potentially dependent on test ordering, since the hooks are all static and the tests run in the same JVM. If they execute in the wrong order, then we could get intermittent failures. For example, if you run the WingtipsSpringBoot2WebfluxConfigurationTest.reactor_async_trace_propagation(...) test by itself, they'll all fail. But if you reorder the dataprovider for that test so that MANUAL_IMPORT_ONLY is not first, then they all pass. So the fact that this branch builds successfully is by accident, as some other test must be setting the hook before this test runs. And I've definitely seen tests executed in different orders depending on platform. That makes me really uncomfortable.

Ah, nice catch. I think the reason was that for manual_import_only WingtipsSpringBoot2WebfluxConfiguration is the only configuration being imported and I had put the hook configuration into a separate configuration altogether. I have now added the initializer bean also into the same WingtipsSpringBoot2WebfluxConfiguration configuration. So the test should consistently pass and the order should not matter.

iii. As far as I can tell it's impossible to clear this caching once it's done. I'd like to be wrong, and maybe there's a proper official Project Reactor way to reset/clear this kind of cache? 🤷‍♂

Yes, I don't entirely understand how the internal caching works but will do a little more research.

  1. The only way the ScheduledExecutorServiceWithTracing wrapping works is if the correct tracing state is on the thread at the site of subscription, not where the Mono/Flux is defined.
    This is ok if you have full control over where your Mono/Flux is subscribed to, and you understand this subtlety. But it severely limits effectiveness in other scenarios. For example: in a SB2 endpoint you could return a Mono and have the WingtipsReactorConfiguration registered so you'd think it would all "just work". But it falls apart as soon as you do something that causes SB2 to subscribe on a different thread, like putting a @RequestBody arg into your endpoint method. I could see this leading to a lot of confusion.

That said I can see how this PR would be useful for some users in some scenarios, so I don't want to just say no to this PR. Thoughts?

Yes, good point. I believe though that ScheduledExecutorServiceWithTracing is always called at the point of subscription. I will double check and get back. I have a project which using a subset of this functionality with @RequestBody, I will check how the functionality works there and get back.

@bijukunjummen
Copy link
Contributor Author

The only way the ScheduledExecutorServiceWithTracing wrapping works is if the correct tracing state is on the thread at the site of subscription, not where the Mono/Flux is defined.
This is ok if you have full control over where your Mono/Flux is subscribed to, and you understand this subtlety. But it severely limits effectiveness in other scenarios. For example: in a SB2 endpoint you could return a Mono and have the WingtipsReactorConfiguration registered so you'd think it would all "just work". But it falls apart as soon as you do something that causes SB2 to subscribe on a different thread, like putting a @RequestBody arg into your endpoint method. I could see this leading to a lot of confusion.
That said I can see how this PR would be useful for some users in some scenarios, so I don't want to just say no to this PR. Thoughts?

Got a chance to test a method with normal Webflux endpoints. You are right, reactor does not appear to use the scheduler hooks at all for Webflux endpoints. But I noticed that WingtipsSpringWebfluxWebFilter that is part of WingtipsReactorConfiguration does work for those endpoints and any explicit subscribeOn, publishOn work off the hooks.

@nicmunroe
Copy link
Member

nicmunroe commented Dec 20, 2019

I think the reason was that for manual_import_only WingtipsSpringBoot2WebfluxConfiguration is the only configuration being imported and I had put the hook configuration into a separate configuration altogether. I have now added the initializer bean also into the same WingtipsSpringBoot2WebfluxConfiguration configuration. So the test should consistently pass and the order should not matter.

I was only using that as an example. If there's any other test now or in the future that uses project reactor without using the wingtips integration, and that test happens to run first, then it will cause failures.

I think the only way to truly guarantee tests always pass is to never use a built-in reusable Scheduler like Schedulers.elastic() when you need the wingtips integration. If you use a new scheduler created just for each test like Schedulers.newElastic("foo") instead, then I think the tests will be successful no matter the ordering.

Got a chance to test a method with normal Webflux endpoints. You are right, reactor does not appear to use the scheduler hooks at all for Webflux endpoints.

No, it does use the hooks - you can set breakpoints in CallableWithTracing's constructor and call() methods and see that ScheduledExecutorServiceWithTracing is used even for @RequestBody endpoints. The problem is that things like @RequestBody cause webflux to subscribe to the mono/flux on a thread that doesn't have tracing state attached. Since the request's tracing state isn't attached to the thread when the mono subscription occurs, the ScheduledExecutorServiceWithTracing has no tracing state to capture when it creates the CallableWithTracing. Set the breakpoints in CallableWithTracing and poke around on a SB2 endpoint with @RequestBody, and on a basic one that has no arguments - you'll see the difference and it should make sense.

Spring Sleuth gets around this by adding a bunch more hooks and magic. But they also cause the tracing thread-attach/detach logic to happen a lot during a request - way more than you might expect, which has a noticeable performance impact on efficient services that are high volume and low latency. So I'm hesitant to follow their lead, both for the performance impact and the maintenance hassle (the stuff they do is complex and advanced, and I'd be afraid of doing something subtly wrong that fundamentally breaks core project reactor functionality or performance).

@nicmunroe
Copy link
Member

As far as moving the wingtips+project reactor registration to WingtipsSpringBoot2WebfluxConfiguration, that seems good, except I'd want to see the ability to enable/disable it via application properties (WingtipsSpringBoot2WebfluxProperties).

I'm on the fence on whether it should be enabled or disabled by default. I'm leaning towards disabled due to the subtle behavior concerns we've been discussing - those issues make me think it should be opt-in for users who are aware of when it will work and when it won't.

Which brings up another point - these subtleties should be called out in the readme for this module.

@bijukunjummen
Copy link
Contributor Author

bijukunjummen commented Dec 21, 2019

As far as moving the wingtips+project reactor registration to WingtipsSpringBoot2WebfluxConfiguration, that seems good, except I'd want to see the ability to enable/disable it via application properties (WingtipsSpringBoot2WebfluxProperties).

I'm on the fence on whether it should be enabled or disabled by default. I'm leaning towards disabled due to the subtle behavior concerns we've been discussing - those issues make me think it should be opt-in for users who are aware of when it will work and when it won't.

Yes, I feel disabled could be the right default too. I have now added support to take in a property to enable the feature.

Which brings up another point - these subtleties should be called out in the readme for this module.

Yes, good call. Fixed readme also

@bijukunjummen
Copy link
Contributor Author

Getting a failing test - trying to figure out how to clear the scheduler cache, doesn't look like using a Schedulers.newElastic(..) is helping, checking why.

@nicmunroe
Copy link
Member

Getting a failing test - trying to figure out how to clear the scheduler cache, doesn't look like using a Schedulers.newElastic(..) is helping, checking why.

@bijukunjummen Looks like tests are passing, so did this get resolved? Using Schedulers.newElastic(..) definitely worked in my limited testing.

@bijukunjummen
Copy link
Contributor Author

Yes, it is good for one more round of review @nicmunroe - the failure was fixed after the hooks were explicitly cleared after every test:

Schedulers.removeExecutorServiceDecorator(...);

There is a new flag to enable the feature and the details of the flag has been documented.

@bijukunjummen
Copy link
Contributor Author

@nicmunroe, this should be good to merge now?

@nicmunroe
Copy link
Member

@bijukunjummen yes! Sorry. I'm going to do a squash-and-merge for this PR. I have some bikeshedding to do, but I'll do that in another followup PR.

Thanks you so much for contributing this, and for your patience! It's much appreciated.

@nicmunroe nicmunroe merged commit 11d924c into Nike-Inc:master Feb 18, 2020
@nicmunroe
Copy link
Member

@bijukunjummen this has been released in version 0.22.0. Thanks again!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants