Adds tracing support to async boundaries in reactor-core #113

bijukunjummen · 2019-12-07T05:34:43Z

Spring's project-reactor provides a powerful set of operators that abstracts some of the async operations behind simple functions like map, flatMap. This PR adds support for propagating tracing details across thread boundaries when using reactor's operators.

codecov-io · 2019-12-07T05:42:04Z

Codecov Report

Merging #113 into master will increase coverage by <.01%.
The diff coverage is 100%.

@@             Coverage Diff              @@
##             master     #113      +/-   ##
============================================
+ Coverage     98.87%   98.88%   +<.01%     
- Complexity     1239     1252      +13     
============================================
  Files            68       70       +2     
  Lines          2935     2955      +20     
  Branches        407      407              
============================================
+ Hits           2902     2922      +20     
  Misses           19       19              
  Partials         14       14

Impacted Files	Coverage Δ	Complexity Δ
...bflux/WingtipsSpringBoot2WebfluxConfiguration.java	`100% <100%> (ø)`	`16 <1> (+1)`	⬆️
...pringboot2/webflux/WingtipsReactorInitializer.java	`100% <100%> (ø)`	`4 <4> (?)`
.../webflux/WingtipsSpringBoot2WebfluxProperties.java	`100% <100%> (ø)`	`13 <2> (+2)`	⬆️
...erwrapper/ScheduledExecutorServiceWithTracing.java	`100% <100%> (ø)`	`6 <6> (?)`

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 0052a27...92909d8. Read the comment docs.

nicmunroe · 2019-12-09T21:08:28Z

Hey @bijukunjummen ! Thanks for the contribution. We'll try to get to this soon. In the meantime, you'll need to sign the CLA before we can accept this PR.

bijukunjummen · 2019-12-10T04:09:47Z

Hey @bijukunjummen ! Thanks for the contribution. We'll try to get to this soon. In the meantime, you'll need to sign the CLA before we can accept this PR.

Thanks @nicmunroe , I have signed the CLA now.

Co-authored-by: RDBreed <rafaelabreed@gmail.com>

nicmunroe · 2019-12-17T23:59:01Z

@bijukunjummen I took a quick look, and here are a few things I noticed. These may or may not be showstoppers - hard for me to say since I don't personally do much with project reactor. But in any case:

Project reactor seems to be doing things with caching executors. So if, for example, an ElasticScheduler (like the default Schedulers.elastic()) is created and used before Schedulers.addExecutorServiceDecorator(...) is called to register the new wingtips ScheduledExecutorServiceWithTracing, then the old regular executor is used forever and always and the decorator you wanted is ignored.
i. This is probably not a big deal for many SB2 apps, as the WingtipsReactorConfiguration will probably do its thing before any of this caching takes effect...
ii. ...But it does make tests potentially dependent on test ordering, since the hooks are all static and the tests run in the same JVM. If they execute in the wrong order, then we could get intermittent failures. For example, if you run the WingtipsSpringBoot2WebfluxConfigurationTest.reactor_async_trace_propagation(...) test by itself, they'll all fail. But if you reorder the dataprovider for that test so that MANUAL_IMPORT_ONLY is not first, then they all pass. So the fact that this branch builds successfully is by accident, as some other test must be setting the hook before this test runs. And I've definitely seen tests executed in different orders depending on platform. That makes me really uncomfortable.
iii. As far as I can tell it's impossible to clear this caching once it's done. I'd like to be wrong, and maybe there's a proper official Project Reactor way to reset/clear this kind of cache? 🤷‍♂
The only way the ScheduledExecutorServiceWithTracing wrapping works is if the correct tracing state is on the thread at the site of subscription, not where the Mono/Flux is defined. This is ok if you have full control over where your Mono/Flux is subscribed to, and you understand this subtlety. But it severely limits effectiveness in other scenarios. For example: in a SB2 endpoint you could return a Mono and have the WingtipsReactorConfiguration registered so you'd think it would all "just work". But it falls apart as soon as you do something that causes SB2 to subscribe on a different thread, like putting a @RequestBody arg into your endpoint method. I could see this leading to a lot of confusion.

That said I can see how this PR would be useful for some users in some scenarios, so I don't want to just say no to this PR. Thoughts?

bijukunjummen · 2019-12-18T01:20:27Z

@bijukunjummen I took a quick look, and here are a few things I noticed. These may or may not be showstoppers - hard for me to say since I don't personally do much with project reactor. But in any case:

Project reactor seems to be doing things with caching executors. So if, for example, an ElasticScheduler (like the default Schedulers.elastic()) is created and used before Schedulers.addExecutorServiceDecorator(...) is called to register the new wingtips ScheduledExecutorServiceWithTracing, then the old regular executor is used forever and always and the decorator you wanted is ignored.
i. This is probably not a big deal for many SB2 apps, as the WingtipsReactorConfiguration will probably do its thing before any of this caching takes effect...
ii. ...But it does make tests potentially dependent on test ordering, since the hooks are all static and the tests run in the same JVM. If they execute in the wrong order, then we could get intermittent failures. For example, if you run the WingtipsSpringBoot2WebfluxConfigurationTest.reactor_async_trace_propagation(...) test by itself, they'll all fail. But if you reorder the dataprovider for that test so that MANUAL_IMPORT_ONLY is not first, then they all pass. So the fact that this branch builds successfully is by accident, as some other test must be setting the hook before this test runs. And I've definitely seen tests executed in different orders depending on platform. That makes me really uncomfortable.
iii. As far as I can tell it's impossible to clear this caching once it's done. I'd like to be wrong, and maybe there's a proper official Project Reactor way to reset/clear this kind of cache? 🤷‍♂

The only way the ScheduledExecutorServiceWithTracing wrapping works is if the correct tracing state is on the thread at the site of subscription, not where the Mono/Flux is defined. This is ok if you have full control over where your Mono/Flux is subscribed to, and you understand this subtlety. But it severely limits effectiveness in other scenarios. For example: in a SB2 endpoint you could return a Mono and have the WingtipsReactorConfiguration registered so you'd think it would all "just work". But it falls apart as soon as you do something that causes SB2 to subscribe on a different thread, like putting a @RequestBody arg into your endpoint method. I could see this leading to a lot of confusion.

That said I can see how this PR would be useful for some users in some scenarios, so I don't want to just say no to this PR. Thoughts?

Thanks for your thoughtful feedback @nicmunroe. I will get back soon

bijukunjummen · 2019-12-20T02:26:27Z

@bijukunjummen I took a quick look, and here are a few things I noticed. These may or may not be showstoppers - hard for me to say since I don't personally do much with project reactor. But in any case:

Project reactor seems to be doing things with caching executors. So if, for example, an ElasticScheduler (like the default Schedulers.elastic()) is created and used before Schedulers.addExecutorServiceDecorator(...) is called to register the new wingtips ScheduledExecutorServiceWithTracing, then the old regular executor is used forever and always and the decorator you wanted is ignored.
i. This is probably not a big deal for many SB2 apps, as the WingtipsReactorConfiguration will probably do its thing before any of this caching takes effect...

Yes, I think this is a fair assumption. I see this being useful only in Spring Boot 2 based projects and the initializer will ensure that the hook takes effect before any of the reactor library functions are used.

ii. ...But it does make tests potentially dependent on test ordering, since the hooks are all static and the tests run in the same JVM. If they execute in the wrong order, then we could get intermittent failures. For example, if you run the WingtipsSpringBoot2WebfluxConfigurationTest.reactor_async_trace_propagation(...) test by itself, they'll all fail. But if you reorder the dataprovider for that test so that MANUAL_IMPORT_ONLY is not first, then they all pass. So the fact that this branch builds successfully is by accident, as some other test must be setting the hook before this test runs. And I've definitely seen tests executed in different orders depending on platform. That makes me really uncomfortable.

Ah, nice catch. I think the reason was that for manual_import_only WingtipsSpringBoot2WebfluxConfiguration is the only configuration being imported and I had put the hook configuration into a separate configuration altogether. I have now added the initializer bean also into the same WingtipsSpringBoot2WebfluxConfiguration configuration. So the test should consistently pass and the order should not matter.

iii. As far as I can tell it's impossible to clear this caching once it's done. I'd like to be wrong, and maybe there's a proper official Project Reactor way to reset/clear this kind of cache? 🤷‍♂

Yes, I don't entirely understand how the internal caching works but will do a little more research.

The only way the ScheduledExecutorServiceWithTracing wrapping works is if the correct tracing state is on the thread at the site of subscription, not where the Mono/Flux is defined.
This is ok if you have full control over where your Mono/Flux is subscribed to, and you understand this subtlety. But it severely limits effectiveness in other scenarios. For example: in a SB2 endpoint you could return a Mono and have the WingtipsReactorConfiguration registered so you'd think it would all "just work". But it falls apart as soon as you do something that causes SB2 to subscribe on a different thread, like putting a @RequestBody arg into your endpoint method. I could see this leading to a lot of confusion.

That said I can see how this PR would be useful for some users in some scenarios, so I don't want to just say no to this PR. Thoughts?

Yes, good point. I believe though that ScheduledExecutorServiceWithTracing is always called at the point of subscription. I will double check and get back. I have a project which using a subset of this functionality with @RequestBody, I will check how the functionality works there and get back.

bijukunjummen · 2019-12-20T15:20:40Z

The only way the ScheduledExecutorServiceWithTracing wrapping works is if the correct tracing state is on the thread at the site of subscription, not where the Mono/Flux is defined.
This is ok if you have full control over where your Mono/Flux is subscribed to, and you understand this subtlety. But it severely limits effectiveness in other scenarios. For example: in a SB2 endpoint you could return a Mono and have the WingtipsReactorConfiguration registered so you'd think it would all "just work". But it falls apart as soon as you do something that causes SB2 to subscribe on a different thread, like putting a @RequestBody arg into your endpoint method. I could see this leading to a lot of confusion.
That said I can see how this PR would be useful for some users in some scenarios, so I don't want to just say no to this PR. Thoughts?

Got a chance to test a method with normal Webflux endpoints. You are right, reactor does not appear to use the scheduler hooks at all for Webflux endpoints. But I noticed that WingtipsSpringWebfluxWebFilter that is part of WingtipsReactorConfiguration does work for those endpoints and any explicit subscribeOn, publishOn work off the hooks.

nicmunroe · 2019-12-20T18:13:10Z

I think the reason was that for manual_import_only WingtipsSpringBoot2WebfluxConfiguration is the only configuration being imported and I had put the hook configuration into a separate configuration altogether. I have now added the initializer bean also into the same WingtipsSpringBoot2WebfluxConfiguration configuration. So the test should consistently pass and the order should not matter.

I was only using that as an example. If there's any other test now or in the future that uses project reactor without using the wingtips integration, and that test happens to run first, then it will cause failures.

I think the only way to truly guarantee tests always pass is to never use a built-in reusable Scheduler like Schedulers.elastic() when you need the wingtips integration. If you use a new scheduler created just for each test like Schedulers.newElastic("foo") instead, then I think the tests will be successful no matter the ordering.

Got a chance to test a method with normal Webflux endpoints. You are right, reactor does not appear to use the scheduler hooks at all for Webflux endpoints.

No, it does use the hooks - you can set breakpoints in CallableWithTracing's constructor and call() methods and see that ScheduledExecutorServiceWithTracing is used even for @RequestBody endpoints. The problem is that things like @RequestBody cause webflux to subscribe to the mono/flux on a thread that doesn't have tracing state attached. Since the request's tracing state isn't attached to the thread when the mono subscription occurs, the ScheduledExecutorServiceWithTracing has no tracing state to capture when it creates the CallableWithTracing. Set the breakpoints in CallableWithTracing and poke around on a SB2 endpoint with @RequestBody, and on a basic one that has no arguments - you'll see the difference and it should make sense.

Spring Sleuth gets around this by adding a bunch more hooks and magic. But they also cause the tracing thread-attach/detach logic to happen a lot during a request - way more than you might expect, which has a noticeable performance impact on efficient services that are high volume and low latency. So I'm hesitant to follow their lead, both for the performance impact and the maintenance hassle (the stuff they do is complex and advanced, and I'd be afraid of doing something subtly wrong that fundamentally breaks core project reactor functionality or performance).

nicmunroe · 2019-12-20T18:19:50Z

As far as moving the wingtips+project reactor registration to WingtipsSpringBoot2WebfluxConfiguration, that seems good, except I'd want to see the ability to enable/disable it via application properties (WingtipsSpringBoot2WebfluxProperties).

I'm on the fence on whether it should be enabled or disabled by default. I'm leaning towards disabled due to the subtle behavior concerns we've been discussing - those issues make me think it should be opt-in for users who are aware of when it will work and when it won't.

Which brings up another point - these subtleties should be called out in the readme for this module.

bijukunjummen · 2019-12-21T05:34:28Z

As far as moving the wingtips+project reactor registration to WingtipsSpringBoot2WebfluxConfiguration, that seems good, except I'd want to see the ability to enable/disable it via application properties (WingtipsSpringBoot2WebfluxProperties).

I'm on the fence on whether it should be enabled or disabled by default. I'm leaning towards disabled due to the subtle behavior concerns we've been discussing - those issues make me think it should be opt-in for users who are aware of when it will work and when it won't.

Yes, I feel disabled could be the right default too. I have now added support to take in a property to enable the feature.

Which brings up another point - these subtleties should be called out in the readme for this module.

Yes, good call. Fixed readme also

bijukunjummen · 2019-12-21T06:23:27Z

Getting a failing test - trying to figure out how to clear the scheduler cache, doesn't look like using a Schedulers.newElastic(..) is helping, checking why.

nicmunroe · 2020-01-08T00:16:11Z

Getting a failing test - trying to figure out how to clear the scheduler cache, doesn't look like using a Schedulers.newElastic(..) is helping, checking why.

@bijukunjummen Looks like tests are passing, so did this get resolved? Using Schedulers.newElastic(..) definitely worked in my limited testing.

bijukunjummen · 2020-01-08T00:54:12Z

Yes, it is good for one more round of review @nicmunroe - the failure was fixed after the hooks were explicitly cleared after every test:

Schedulers.removeExecutorServiceDecorator(...);

There is a new flag to enable the feature and the details of the flag has been documented.

bijukunjummen · 2020-02-10T15:25:48Z

@nicmunroe, this should be good to merge now?

nicmunroe · 2020-02-18T01:14:00Z

@bijukunjummen yes! Sorry. I'm going to do a squash-and-merge for this PR. I have some bikeshedding to do, but I'll do that in another followup PR.

Thanks you so much for contributing this, and for your patience! It's much appreciated.

nicmunroe · 2020-02-19T22:30:26Z

@bijukunjummen this has been released in version 0.22.0. Thanks again!

Adds tracing support to async boundaries in reactor-core

f06a099

Co-authored-by: RDBreed <rafaelabreed@gmail.com>

Fixes location of reactor hook configuration

5e1a45f

Disables reactor support by default, additional tests

381a549

bijukunjummen added 3 commits December 21, 2019 06:11

Explicitly removes schedule decorator between tests

1af5d66

Adds one more explicit removal of decorator in test

19c17a4

Does not return a null bean

92909d8

nicmunroe merged commit 11d924c into Nike-Inc:master Feb 18, 2020

nicmunroe mentioned this pull request Feb 19, 2020

Clean up docs and tests for PR 113 #114

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Adds tracing support to async boundaries in reactor-core #113

Adds tracing support to async boundaries in reactor-core #113

bijukunjummen commented Dec 7, 2019 •

edited

Loading

codecov-io commented Dec 7, 2019 •

edited

Loading

nicmunroe commented Dec 9, 2019

bijukunjummen commented Dec 10, 2019

nicmunroe commented Dec 17, 2019

bijukunjummen commented Dec 18, 2019

bijukunjummen commented Dec 20, 2019

bijukunjummen commented Dec 20, 2019

nicmunroe commented Dec 20, 2019 •

edited

Loading

nicmunroe commented Dec 20, 2019

bijukunjummen commented Dec 21, 2019 •

edited

Loading

bijukunjummen commented Dec 21, 2019

nicmunroe commented Jan 8, 2020

bijukunjummen commented Jan 8, 2020

bijukunjummen commented Feb 10, 2020

nicmunroe commented Feb 18, 2020

nicmunroe commented Feb 19, 2020

Adds tracing support to async boundaries in reactor-core #113

Adds tracing support to async boundaries in reactor-core #113

Conversation

bijukunjummen commented Dec 7, 2019 • edited Loading

codecov-io commented Dec 7, 2019 • edited Loading

Codecov Report

nicmunroe commented Dec 9, 2019

bijukunjummen commented Dec 10, 2019

nicmunroe commented Dec 17, 2019

bijukunjummen commented Dec 18, 2019

bijukunjummen commented Dec 20, 2019

bijukunjummen commented Dec 20, 2019

nicmunroe commented Dec 20, 2019 • edited Loading

nicmunroe commented Dec 20, 2019

bijukunjummen commented Dec 21, 2019 • edited Loading

bijukunjummen commented Dec 21, 2019

nicmunroe commented Jan 8, 2020

bijukunjummen commented Jan 8, 2020

bijukunjummen commented Feb 10, 2020

nicmunroe commented Feb 18, 2020

nicmunroe commented Feb 19, 2020

bijukunjummen commented Dec 7, 2019 •

edited

Loading

codecov-io commented Dec 7, 2019 •

edited

Loading

nicmunroe commented Dec 20, 2019 •

edited

Loading

bijukunjummen commented Dec 21, 2019 •

edited

Loading