Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix default hostname/port not being used when mixing http and uds configuration + improve warning message #3037

Merged

Conversation

ivoanjo
Copy link
Member

@ivoanjo ivoanjo commented Aug 9, 2023

What does this PR do?:

This PR fixes a corner case bug introduced in #2806 when we added support for configuration unix domain socket (UDS) via the DD_TRACE_AGENT_URL environment variable.

Specifically, when a mix of http and uds configuration was specified, we printed a warning and used the http configuration instead.

But, if only the hostname or only the port + uds configuration was provided, we did not correctly use the default hostname and default port as a replacement.

That is:

  • DD_AGENT_HOST=10.128.186.61 DD_TRACE_AGENT_URL=unix:///var/run/datadog/apm.socket resulted in hostname="10.128.186.61", port=nil
  • DD_TRACE_AGENT_PORT=1234 DD_TRACE_AGENT_URL=unix:///var/run/datadog/apm.socket resulted in hostname=nil, port=1234

This led to a failure to communicate with the Datadog agent in these situations.

Motivation:

Fix issue when communicating with the agent.

Additional Notes:

We caught this issue while testing internal apps. On the plus side, we did not get any customer reports for this issue, so hopefully nobody was bitten by it.

Also included are a few tweaks to the warning message generated + tests for it. See individual commits for detail.

I've tested the fix in the KTG and can confirm data is showing up again for "candidate" releases.

How to test the change?:

Change includes test coverage. You can also compare the output of the AgentSettingsResolver before/after the issue was fixed:

 # Before
$ DD_AGENT_HOST=10.128.186.61 DD_TRACE_AGENT_URL=unix:///var/run/datadog/apm.socket bundle exec ruby -e "require 'ddtrace'; pp(Datadog::Core::Configuration::AgentSettingsResolver.call(Datadog.configuration))"
 #<struct Datadog::Core::Configuration::AgentSettingsResolver::AgentSettings
 adapter=:net_http,
 ssl=false,
 hostname="10.128.186.61",
 port=nil,
 uds_path=nil,
 timeout_seconds=nil,
 deprecated_for_removal_transport_configuration_proc=nil>
$ DD_TRACE_AGENT_PORT=1234 DD_TRACE_AGENT_URL=unix:///var/run/datadog/apm.socket bundle exec ruby -e "require 'ddtrace'; pp(Datadog::Core::Configuration::AgentSettingsResolver.call(Datadog.configuration))"
 #<struct Datadog::Core::Configuration::AgentSettingsResolver::AgentSettings
 adapter=:net_http,
 ssl=false,
 hostname=nil,
 port=1234,
 uds_path=nil,
 timeout_seconds=nil,
 deprecated_for_removal_transport_configuration_proc=nil>

 # After
$ DD_AGENT_HOST=10.128.186.61 DD_TRACE_AGENT_URL=unix:///var/run/datadog/apm.socket bundle exec ruby -e "require 'ddtrace'; pp(Datadog::Core::Configuration::AgentSettingsResolver.call(Datadog.configuration))"
 #<struct Datadog::Core::Configuration::AgentSettingsResolver::AgentSettings
 adapter=:net_http,
 ssl=false,
 hostname="10.128.186.61",
 port=8126,
 uds_path=nil,
 timeout_seconds=nil,
 deprecated_for_removal_transport_configuration_proc=nil>
$ DD_TRACE_AGENT_PORT=1234 DD_TRACE_AGENT_URL=unix:///var/run/datadog/apm.socket bundle exec ruby -e "require 'ddtrace'; pp(Datadog::Core::Configuration::AgentSettingsResolver.call(Datadog.configuration))"
 #<struct Datadog::Core::Configuration::AgentSettingsResolver::AgentSettings
 adapter=:net_http,
 ssl=false,
 hostname="127.0.0.1",
 port=1234,
 uds_path=nil,
 timeout_seconds=nil,
 deprecated_for_removal_transport_configuration_proc=nil>

…figuration

**What does this PR do?**:

This PR fixes a corner case bug introduced in #2806 when we added
support for configuration unix domain socket (UDS) via the
`DD_TRACE_AGENT_URL` environment variable.

Specifically, when a mix of http and uds configuration was specified,
we printed a warning and used the http configuration instead.

But, if only the hostname or only the port + uds configuration was
provided, we did not correctly use the default hostname and default
port as a replacement.

That is:

* `DD_AGENT_HOST=10.128.186.61 DD_TRACE_AGENT_URL=unix:///var/run/datadog/apm.socket`
  resulted in `hostname="10.128.186.61", port=nil`
* `DD_TRACE_AGENT_PORT=1234 DD_TRACE_AGENT_URL=unix:///var/run/datadog/apm.socket`
  resulted in `hostname=nil, port=1234`

This led to a failure to communicate with the Datadog agent in
these situations.

**Motivation**:

Fix issue when communicating with the agent.

**Additional Notes**:

We caught this issue while testing internal apps. On the plus side,
we did not get any customer reports for this issue, so hopefully
nobody was bitten by it.

**How to test the change?**:

Change includes test coverage. You can also compare the output of
the `AgentSettingsResolver` before/after the issue was fixed:

```
 # Before
$ DD_AGENT_HOST=10.128.186.61 DD_TRACE_AGENT_URL=unix:///var/run/datadog/apm.socket bundle exec ruby -e "require 'ddtrace'; pp(Datadog::Core::Configuration::AgentSettingsResolver.call(Datadog.configuration))"
 #<struct Datadog::Core::Configuration::AgentSettingsResolver::AgentSettings
 adapter=:net_http,
 ssl=false,
 hostname="10.128.186.61",
 port=nil,
 uds_path=nil,
 timeout_seconds=nil,
 deprecated_for_removal_transport_configuration_proc=nil>
$ DD_TRACE_AGENT_PORT=1234 DD_TRACE_AGENT_URL=unix:///var/run/datadog/apm.socket bundle exec ruby -e "require 'ddtrace'; pp(Datadog::Core::Configuration::AgentSettingsResolver.call(Datadog.configuration))"
 #<struct Datadog::Core::Configuration::AgentSettingsResolver::AgentSettings
 adapter=:net_http,
 ssl=false,
 hostname=nil,
 port=1234,
 uds_path=nil,
 timeout_seconds=nil,
 deprecated_for_removal_transport_configuration_proc=nil>

 # After
$ DD_AGENT_HOST=10.128.186.61 DD_TRACE_AGENT_URL=unix:///var/run/datadog/apm.socket bundle exec ruby -e "require 'ddtrace'; pp(Datadog::Core::Configuration::AgentSettingsResolver.call(Datadog.configuration))"
 #<struct Datadog::Core::Configuration::AgentSettingsResolver::AgentSettings
 adapter=:net_http,
 ssl=false,
 hostname="10.128.186.61",
 port=8126,
 uds_path=nil,
 timeout_seconds=nil,
 deprecated_for_removal_transport_configuration_proc=nil>
$ DD_TRACE_AGENT_PORT=1234 DD_TRACE_AGENT_URL=unix:///var/run/datadog/apm.socket bundle exec ruby -e "require 'ddtrace'; pp(Datadog::Core::Configuration::AgentSettingsResolver.call(Datadog.configuration))"
 #<struct Datadog::Core::Configuration::AgentSettingsResolver::AgentSettings
 adapter=:net_http,
 ssl=false,
 hostname="127.0.0.1",
 port=1234,
 uds_path=nil,
 timeout_seconds=nil,
 deprecated_for_removal_transport_configuration_proc=nil>
```
The `configured_hostname` and `configured_port` don't include the
defaults, so the resulting message was a bit confusing when defaults
were picked for them.

By using `hostname` and `port` instead, we make sure the warning
message shows the values that will be picked.

(I did notice a bit of a weird recursion where calling `hostname` and
`port` does call back into `mixed_http_and_uds?` via `should_use_uds?`,
but it's fine because we already set the `@mixed_http_and_uds` before
doing so. It is a bit awkward but... I feel in a way this is reflecting
the actual complexity of all these priorities and overrides and whatnot,
so I'll leave it to someone else to try to untangle a bit better in
the future.)
`uds_path` vs `parsed_url` is similar to `hostname` vs
`configured_hostname`: The `uds_path` is the actual final value that's
going to be included in the `AgentSettingsResolver` output, so it
will be `nil` when `mixed_http_and_uds?` is true.

In contrast, `parsed_url` contains the configured value which caused
the configuration mismatch, and so that's the correct one to print.
@ivoanjo ivoanjo requested a review from a team August 9, 2023 12:32
@github-actions github-actions bot added the core Involves Datadog core libraries label Aug 9, 2023
ivoanjo added a commit that referenced this pull request Aug 9, 2023
**What does this PR do?**:

This PR tweaks the `HttpTransport` error logging to include the
configuration being used.

For instance, it will now look like:

```
ERROR (lib/datadog/profiling/http_transport.rb:58:in `export') Failed to
report profiling data ({:agent=>"unix:///fake-apm-socket.socket"}):
failed ddog_prof_Exporter_send: error trying to connect: No such file
or directory (os error 2): No such file or directory (os error 2)
```

...I did notice the libdatadog error message seems to be repeated and
isn't particularly great; that's a separate issue for a future fix
(on the libdatadog side).

**Motivation**:

While investigating #3037 I noticed this information was not being
logged when the profiler failed to report, and I thought it would be
come in handy when debugging future issues.

**Additional Notes**:

N/A

**How to test the change?**:

The change includes test coverage. You can also manually trigger an
issue by running dd-trace-rb with an invalid configuration, e.g.
the above error message was triggered by running with

```bash
$ DD_PROFILING_ENABLED=true DD_TRACE_AGENT_URL=unix:///fake-apm-socket.socket \
  bundle exec ddtracerb exec ruby -e "sleep 1"
```
@marcotc
Copy link
Member

marcotc commented Aug 9, 2023

From technical point of view, this PR improves greatly the internal consistent of the agent resolver.
We are discussing internally if DD_TRACE_AGENT_URL should override both DD_AGENT_HOST and DD_TRACE_AGENT_PORT instead though.

Copy link
Contributor

@TonyCTHsu TonyCTHsu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for catching this!

@ivoanjo
Copy link
Member Author

ivoanjo commented Aug 11, 2023

Replying to the DD_TRACE_AGENT_URL concern, the current behavior is to prioritize it, but yeah that doesn't apply in this corner case.

Specifically, it's only in this bizarre corner case where conflicting configuration gets added (supplying both http and uds configuration) that there's a divergence.

The divergence right now is a bit more subtle, because it's not about DD_TRACE_AGENT_URL being downgraded vs DD_AGENT_HOST/DD_TRACE_AGENT_PORT. Instead, currently any configuration for http (including via code) mixed with uds via DD_TRACE_AGENT_URL is prioritized.

Thus, setting Datadog.configure { |c| c.agent.host = 'foo' } and DD_TRACE_AGENT_URL with uds means giving priority to the via-code configuration.

Usually, so far, we've been giving priority to any settings via code. So for this case, we'd either deprioritize settings-via-code vs a uds DD_TRACE_AGENT_URL env (:cry:) or we need to add extra complexity in this case to understand if the conflicting hostname/port is coming via env var or via code (as if our logic for this wasn't kinda complex :cry: ).

Since we've talked about this a bit via slack, and Tony approved this, and this PR is preventing us from seeing data on some of the test apps, I'm going to go ahead and press the merge button.

If we do decide to go in a separate direction, I'm perfectly happy with a revert -- my intention is getting it to work, so any way we can do it is fine by me :)

@ivoanjo ivoanjo merged commit 42f82ae into master Aug 11, 2023
205 checks passed
@ivoanjo ivoanjo deleted the ivoanjo/fix-missing-hostname-port-when-uds-was-specified branch August 11, 2023 16:23
@github-actions github-actions bot added this to the 1.13.1 milestone Aug 11, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Involves Datadog core libraries
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants