Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[OASIS-12] Add trace exporter to Datadog exporter #25759

Merged
merged 9 commits into from
Jun 20, 2024

Conversation

songy23
Copy link
Member

@songy23 songy23 commented May 21, 2024

What does this PR do?

Add trace exporter implementation to Datadog exporter in OTel agent.

Trace agent internal metrics are missing for now, will be added in a follow-up PR.

Motivation

Support trace ingestion in OTel Agent

Describe how to test/QA your changes

Build OTel agent and start the agent with Datadog exporter in the trace pipeline. Then verify traces are sent to DD.

@songy23 songy23 added this to the 7.55.0 milestone May 21, 2024
@songy23 songy23 added the team/opentelemetry OpenTelemetry team label May 21, 2024
Copy link

codecov bot commented May 21, 2024

Codecov Report

Attention: Patch coverage is 54.00000% with 69 lines in your changes missing coverage. Please review.

Project coverage is 51.24%. Comparing base (781d7f1) to head (f4bf27e).

Current head f4bf27e differs from pull request most recent head 0fc9425

Please upload reports for the commit 0fc9425 to get more accurate results.

Files Patch % Lines
...tlp/components/exporter/datadogexporter/factory.go 43.83% 38 Missing and 3 partials ⚠️
comp/trace/agent/agent.go 0.00% 15 Missing ⚠️
cmd/otel-agent/config/agent_config.go 70.00% 4 Missing and 2 partials ⚠️
...components/exporter/serializerexporter/exporter.go 50.00% 2 Missing and 1 partial ⚠️
.../components/exporter/serializerexporter/factory.go 62.50% 2 Missing and 1 partial ⚠️
pkg/trace/api/otlp.go 90.90% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##             main   #25759       +/-   ##
===========================================
+ Coverage   44.85%   51.24%    +6.38%     
===========================================
  Files        2354     1782      -572     
  Lines      273223   152885   -120338     
===========================================
- Hits       122559    78344    -44215     
+ Misses     140994    70094    -70900     
+ Partials     9670     4447     -5223     
Flag Coverage Δ
amzn_aarch64 51.55% <54.00%> (+5.86%) ⬆️
centos_x86_64 51.44% <54.00%> (+5.83%) ⬆️
ubuntu_aarch64 51.56% <54.00%> (+5.86%) ⬆️
ubuntu_x86_64 51.54% <54.00%> (+5.85%) ⬆️
windows_amd64 54.62% <54.00%> (+3.95%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@pr-commenter
Copy link

pr-commenter bot commented May 21, 2024

Test changes on VM

Use this command from test-infra-definitions to manually test this PR changes on a VM:

inv create-vm --pipeline-id=37232273 --os-family=ubuntu

Note: This applies to commit d082ad9

@pr-commenter
Copy link

pr-commenter bot commented May 21, 2024

Regression Detector

Regression Detector Results

Run ID: 687e1164-0f32-4ce1-9a0e-64eddd4555d3 Metrics dashboard Target profiles

Baseline: 4bddf58
Comparison: 6ae09a6

Performance changes are noted in the perf column of each table:

  • ✅ = significantly better comparison variant performance
  • ❌ = significantly worse comparison variant performance
  • ➖ = no significant change in performance

No significant changes in experiment optimization goals

Confidence level: 90.00%
Effect size tolerance: |Δ mean %| ≥ 5.00%

There were no significant changes in experiment optimization goals at this confidence level and effect size tolerance.

Fine details of change detection per experiment

perf experiment goal Δ mean % Δ mean % CI links
basic_py_check % cpu utilization +1.74 [-0.93, +4.42] Logs
file_tree memory utilization +1.17 [+1.06, +1.27] Logs
tcp_syslog_to_blackhole ingress throughput +0.73 [-11.96, +13.43] Logs
otel_to_otel_logs ingress throughput +0.15 [-0.66, +0.97] Logs
uds_dogstatsd_to_api ingress throughput +0.00 [-0.00, +0.00] Logs
tcp_dd_logs_filter_exclude ingress throughput -0.00 [-0.01, +0.01] Logs
idle memory utilization -0.20 [-0.24, -0.16] Logs
uds_dogstatsd_to_api_cpu % cpu utilization -1.78 [-2.65, -0.90] Logs
pycheck_1000_100byte_tags % cpu utilization -2.80 [-7.47, +1.88] Logs

Explanation

A regression test is an A/B test of target performance in a repeatable rig, where "performance" is measured as "comparison variant minus baseline variant" for an optimization goal (e.g., ingress throughput). Due to intrinsic variability in measuring that goal, we can only estimate its mean value for each experiment; we report uncertainty in that value as a 90.00% confidence interval denoted "Δ mean % CI".

For each experiment, we decide whether a change in performance is a "regression" -- a change worth investigating further -- if all of the following criteria are true:

  1. Its estimated |Δ mean %| ≥ 5.00%, indicating the change is big enough to merit a closer look.

  2. Its 90.00% confidence interval "Δ mean % CI" does not contain zero, indicating that if our statistical model is accurate, there is at least a 90.00% chance there is a difference in performance between baseline and comparison variants.

  3. Its configuration does not mark it "erratic".

@songy23 songy23 force-pushed the yang.song/otel-trace-exp-2 branch 2 times, most recently from 61854d4 to 8e86f44 Compare May 27, 2024 22:45
@songy23 songy23 modified the milestones: 7.55.0, 7.56.0 Jun 3, 2024
@songy23 songy23 force-pushed the yang.song/otel-trace-exp-2 branch from 8e86f44 to 0bb3e7d Compare June 6, 2024 21:45
@pr-commenter

This comment has been minimized.

@pr-commenter

This comment has been minimized.

@songy23 songy23 force-pushed the yang.song/otel-trace-exp-2 branch from e8e9eeb to f4bf27e Compare June 10, 2024 18:26
@pr-commenter

This comment has been minimized.

@pr-commenter

This comment has been minimized.

@songy23 songy23 force-pushed the yang.song/otel-trace-exp-2 branch from 0fc9425 to 7dfec1c Compare June 11, 2024 18:34
@pr-commenter

This comment has been minimized.

@pr-commenter

This comment has been minimized.

@songy23 songy23 force-pushed the yang.song/otel-trace-exp-2 branch 2 times, most recently from be68310 to f3f83ae Compare June 11, 2024 20:25
@pr-commenter

This comment has been minimized.

@pr-commenter

This comment has been minimized.

@pr-commenter

This comment has been minimized.

@pr-commenter

This comment has been minimized.

@songy23 songy23 requested a review from dineshg13 June 12, 2024 13:23
@songy23 songy23 force-pushed the yang.song/otel-trace-exp-2 branch from 7757a8e to 476c99a Compare June 12, 2024 16:25
@pr-commenter

This comment has been minimized.

@pr-commenter

This comment has been minimized.

@pr-commenter

This comment has been minimized.

@songy23 songy23 requested a review from dineshg13 June 14, 2024 20:35
type Component interface{}
type Component interface {
// SetStatsdClient sets the stats client of the underlying trace agent.
SetStatsdClient(mclient statsd.ClientInterface)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like a way to hack around FX's statsd by allowing us to change the Agent's internal state after startup. This is fragile and probably does not work as expected, given how many places we send the statsd client to on Agent construction:

func NewAgent(ctx context.Context, conf *config.AgentConfig, telemetryCollector telemetry.TelemetryCollector, statsd statsd.ClientInterface) *Agent {
dynConf := sampler.NewDynamicConfig()
log.Infof("Starting Agent with processor trace buffer of size %d", conf.TraceBuffer)
in := make(chan *api.Payload, conf.TraceBuffer)
statsChan := make(chan *pb.StatsPayload, 1)
oconf := conf.Obfuscation.Export(conf)
if oconf.Statsd == nil {
oconf.Statsd = statsd
}
timing := timing.New(statsd)
agnt := &Agent{
Concentrator: stats.NewConcentrator(conf, statsChan, time.Now(), statsd),
ClientStatsAggregator: stats.NewClientStatsAggregator(conf, statsChan, statsd),
Blacklister: filters.NewBlacklister(conf.Ignore["resource"]),
Replacer: filters.NewReplacer(conf.ReplaceTags),
PrioritySampler: sampler.NewPrioritySampler(conf, dynConf, statsd),
ErrorsSampler: sampler.NewErrorsSampler(conf, statsd),
RareSampler: sampler.NewRareSampler(conf, statsd),
NoPrioritySampler: sampler.NewNoPrioritySampler(conf, statsd),
ProbabilisticSampler: sampler.NewProbabilisticSampler(conf, statsd),
EventProcessor: newEventProcessor(conf, statsd),
StatsWriter: writer.NewStatsWriter(conf, statsChan, telemetryCollector, statsd, timing),
obfuscator: obfuscate.NewObfuscator(oconf),
In: in,
conf: conf,
ctx: ctx,
DebugServer: api.NewDebugServer(conf),
Statsd: statsd,
Timing: timing,
}
agnt.Receiver = api.NewHTTPReceiver(conf, dynConf, in, agnt, telemetryCollector, statsd, timing)
agnt.OTLPReceiver = api.NewOTLPReceiver(in, conf, statsd, timing)
agnt.RemoteConfigHandler = remoteconfighandler.New(conf, agnt.PrioritySampler, agnt.RareSampler, agnt.ErrorsSampler)
agnt.TraceWriter = writer.NewTraceWriter(conf, agnt.PrioritySampler, agnt.ErrorsSampler, agnt.RareSampler, telemetryCollector, statsd, timing)
return agnt
}

We should instead either modify the statsd.Component to support creating your custom statsd or create a new implementation of statsd.Component that you can use to provide your client to the agent.

Copy link
Member Author

@songy23 songy23 Jun 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually you're right that we shouldn't set the global statsd client here, but rather the one used in OTLP receiver. Fixed that.

For statsd.Component implementation - we cannot do it in this way, because the stats client that we want to supply is created after collector is started. The start sequence is fx -> collector -> collector internal telemetry -> statsd client for OTel. In fx we do not have the reference to the statsd client for OTel, thus it has to be set dynamically (like we discussed in the previous meeting).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apologies, it turns out we need to set the stats client used in the trace agent everywhere.

I'll set up some time to discuss. Doesn't seem straightforward to update the statsd reference everywhere.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool, I'm discussing this within the team too.

@pr-commenter

This comment has been minimized.

type Component interface{}
type Component interface {
// SetStatsdClient sets the stats client of the underlying trace agent.
SetStatsdClient(mclient statsd.ClientInterface)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool, I'm discussing this within the team too.

@pr-commenter

This comment has been minimized.

Copy link
Contributor

@knusbaum knusbaum left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks fine to me.

The SetOTelAttributeTranslator method feels off for the same reason as the statsd one, but it's only affecting the otlp receiver, so blast radius for it is minimal. Maybe this can be improved later.

Comment on lines +24 to +25
// SetOTelAttributeTranslator sets the OTel attributes translator of the underlying trace agent.
SetOTelAttributeTranslator(attrstrans *attributes.Translator)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This feels similarly hacky to the statsd setter. Ideally this could be handled without mutating the agent like we do with statsd.

Copy link
Member Author

@songy23 songy23 Jun 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dineshg13 and I talked a bit about the attributes translator before. This exists for a legacy reason and is not well structured right now, e.g. the attributes translator is in traceagent.config struct while it is not really a config. We have a backlog ticket to refactor & improve the OTLP receiver in trace agent in general, and this will be part of that effort.

@songy23
Copy link
Member Author

songy23 commented Jun 20, 2024

/merge

@dd-devflow
Copy link

dd-devflow bot commented Jun 20, 2024

🚂 MergeQueue: pull request added to the queue

The median merge time in main is 37m.

Use /merge -c to cancel this operation!

@dd-devflow
Copy link

dd-devflow bot commented Jun 20, 2024

MergeQueue: The build pipeline contains failing jobs for this merge request

Build pipeline has failing jobs for 755ac2f

Since those jobs are not marked as being allowed to fail, the pipeline will most likely fail.
Therefore, and to allow other builds to be processed, this merge request has been rejected before the end of the pipeline.

You should have a look at the pipeline, wait for the build to finish and investigate the failures.
When you are ready, re-add your pull request to the queue!

⚠️ Do NOT retry failed jobs directly.

They were executed on a temporary branch created by the merge queue. If you retry them, they are going to fail because the branch no longer exists.

Details

List of failed jobs:

Go to failed Gitlab pipeline.

If you need support, contact us on Slack #devflow with those details!

@pr-commenter

This comment has been minimized.

@pr-commenter
Copy link

pr-commenter bot commented Jun 20, 2024

Benchmarks

Benchmark execution time: 2024-06-20 20:24:48

Comparing candidate commit d082ad9 in PR branch yang.song/otel-trace-exp-2 with baseline commit 4bddf58 in branch main.

Found 0 performance improvements and 0 performance regressions! Performance is the same for 2 metrics, 1 unstable metrics.

@songy23
Copy link
Member Author

songy23 commented Jun 20, 2024

/merge

@dd-devflow
Copy link

dd-devflow bot commented Jun 20, 2024

🚂 MergeQueue: waiting for PR to be ready

This merge request is not mergeable yet, because of pending checks/missing approvals. It will be added to the queue as soon as checks pass and/or get approvals.
Note: if you pushed new commits since the last approval, you may need additional approval.
You can remove it from the waiting list with /remove command.

Use /merge -c to cancel this operation!

@dd-devflow
Copy link

dd-devflow bot commented Jun 20, 2024

🚂 MergeQueue: pull request added to the queue

The median merge time in main is 37m.

Use /merge -c to cancel this operation!

@dd-mergequeue dd-mergequeue bot merged commit b7ce861 into main Jun 20, 2024
215 of 223 checks passed
@dd-mergequeue dd-mergequeue bot deleted the yang.song/otel-trace-exp-2 branch June 20, 2024 21:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants