Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Initial Request Tracing Implementation #1093

Merged
merged 6 commits into from Apr 12, 2021
Merged

Conversation

tgianos
Copy link
Contributor

@tgianos tgianos commented Apr 12, 2021

Initial effort into supporting request tracing in Genie

  • Leverage Spring Cloud Sleuth and Brave
  • Cross process boundaries (server -> agent, agent -> job) via environment variables
  • Annotate important spans (job launch, agent state machine)
  • Tag job launch and agent exec spans with job id

Part of larger internal platform tracing effort as PoC

- Clean up tests
  - Remove unnecessary integration tests and replace with unit
- Clean up code formatting
- Other random things
Auto wires in Brave tracing
Adds hooks for grip tracing in agent and server
…thin Genie server and agent

- Trace propagator to be able to put trace information from server to agent and extract it
- Tag adapter for modifying tags applied to spans that will allow inheritors to adapt tags based on need (e.g. internal netflix tag modifications)
- Trace cleanup class for synchronously flushing brave spans before agent shut down
- auto configuration to expose components as beans to be used
- Extract Trace information from environment at startup if possible otherwise a new trace will be instantiated
- close spans at agent termination
- Add annotations for agent state machine execution
- Add tag for the genie job id when pertinent
- Add tag for the agent command that is being executed
- Propagate trace information from agent launchers into environment variables for agent processes
- Annotate job launch process to show phases
- Tag job launch span with job id
Exported as GENIE_B3_TRACE_ID_HIGH, GENIE_B3_TRACE_ID_LOW, GENIE_B3_PARENT_SPAN_ID and GENIE_B3_SAMPLED environment variables
Can be used by downstream user job clients to continue to propagate the trace (e.g. to spark or presto, etc)
@tgianos tgianos added this to the 4.0.0 milestone Apr 12, 2021
@tgianos tgianos self-assigned this Apr 12, 2021
@tgianos
Copy link
Contributor Author

tgianos commented Apr 12, 2021

@narayaruna @karkum @nvhoang this is a semi-large PR. I've tried to break it down into commits that should be self contained for easier review/understanding so I'd suggest navigating by commit to get a picture of what I did

@tgianos tgianos requested a review from xiao-chen April 12, 2021 00:38
@coveralls
Copy link

Coverage Status

Coverage decreased (-0.08%) to 93.835% when pulling b8dfd98 on tgianos:tracing into cbef9ee on Netflix:master.

@codecov
Copy link

codecov bot commented Apr 12, 2021

Codecov Report

Merging #1093 (b8dfd98) into master (cbef9ee) will decrease coverage by 0.09%.
The diff coverage is 93.78%.

Impacted file tree graph

@@             Coverage Diff              @@
##             master    #1093      +/-   ##
============================================
- Coverage     90.88%   90.79%   -0.10%     
- Complexity     3646     3669      +23     
============================================
  Files           451      457       +6     
  Lines         14098    14230     +132     
  Branches        986     1000      +14     
============================================
+ Hits          12813    12920     +107     
- Misses          850      869      +19     
- Partials        435      441       +6     
Impacted Files Coverage Δ Complexity Δ
...vices/impl/grpc/GRpcServicesAutoConfiguration.java 100.00% <ø> (ø) 5.00 <0.00> (ø)
...ent/launchers/AgentLaunchersAutoConfiguration.java 100.00% <ø> (ø) 9.00 <0.00> (ø)
...b/agent/launchers/impl/LocalAgentLauncherImpl.java 78.76% <80.00%> (+0.04%) 12.00 <0.00> (ø)
...internal/tracing/brave/BraveTracingComponents.java 83.33% <83.33%> (ø) 5.00 <5.00> (?)
...ing/brave/impl/EnvVarBraveTracePropagatorImpl.java 84.78% <84.78%> (ø) 12.00 <12.00> (?)
.../execution/process/impl/JobProcessManagerImpl.java 86.11% <93.75%> (+0.49%) 38.00 <5.00> (+1.00)
.../com/netflix/genie/agent/cli/GenieAgentRunner.java 98.43% <97.61%> (-1.57%) 9.00 <7.00> (+1.00) ⬇️
.../netflix/genie/agent/cli/CliAutoConfiguration.java 80.00% <100.00%> (-20.00%) 19.00 <1.00> (-5.00)
.../java/com/netflix/genie/agent/cli/InfoCommand.java 89.60% <100.00%> (ø) 24.00 <0.00> (ø)
...ie/agent/execution/ExecutionAutoConfiguration.java 100.00% <100.00%> (ø) 36.00 <2.00> (+1.00)
... and 24 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update cbef9ee...b8dfd98. Read the comment docs.

*
* @see TraceContext.Builder#traceId(long)
*/
static final String GENIE_JOB_B3_TRACE_ID_LOW_KEY = "GENIE_B3_TRACE_ID_LOW";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why call it 'trace_id_low' (instead of 'trace_id'), when B3 just calls it traceId?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I went back and forth on this a few times (and it's certainly not set in stone). Ideally I'd have just done GENIE_B3_TRACE_ID and used the hex string however the builder for TraceContext doesn't expose the method which would allow me to reconstitute the context from that single value. Hence I had to break it up and just felt it was a little clearer to use LOW and HIGH rather than absence of LOW representing low. That felt to me like a historical thing where they went from supporting only 64 bit identifiers to both that and 128 bit.

process.waitFor(KILL_CHECK_INTERVAL_MS, TimeUnit.MILLISECONDS);
}
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: these 2 method moves seem unnecessary

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah. Intellij auto formatting. oh well

Copy link
Contributor

@xiao-chen xiao-chen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM pending 2 minor comments. Thanks for the good work!

@@ -63,7 +63,7 @@
@Bean
@Lazy
@ConditionalOnMissingBean(AgentHeartBeatService.class)
public AgentHeartBeatService agentHeartBeatService(
public GrpcAgentHeartBeatServiceImpl agentHeartBeatService(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just curious why changing to *Impl instead of the interface? Will this change tie it to a specific implementation?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general for Spring it is a better practice to make the return type as specific as possible. This allows it to be used in the most injection scenarios. If for example someone had a @ConditionalOnBean(GrpcAgentHeartBeatServiceImpl) on some other bean definition because they explicitly needed that one even though technically there was a bean of this type available Spring would only know it as AgentHeartBeatService and thus that condition would fail.

It doesn't tie anything to a specific implementation just makes the type more concrete in the spring context.

@tgianos tgianos merged commit 02d8047 into Netflix:master Apr 12, 2021
@tgianos tgianos deleted the tracing branch April 12, 2021 16:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants