Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Corrupt msgpack encoding #970

Closed
ngraef opened this issue May 29, 2020 · 5 comments
Closed

Corrupt msgpack encoding #970

ngraef opened this issue May 29, 2020 · 5 comments

Comments

@ngraef
Copy link

ngraef commented May 29, 2020

Describe the bug

I noticed my agents are periodically crashing after a bunch of trace decoding errors.

Agent log
[...]
2020-05-28 22:21:20 UTC | TRACE | ERROR | (pkg/trace/api/api.go:379 in handleTraces) | Cannot decode v0.4 traces payload: msgp: attempted to decode type "array" with method for "str"
2020-05-28 22:21:20 UTC | TRACE | ERROR | (pkg/trace/api/api.go:379 in handleTraces) | Cannot decode v0.4 traces payload: msgp: attempted to decode type "array" with method for "str"
2020-05-28 22:21:20 UTC | TRACE | ERROR | (pkg/trace/api/api.go:379 in handleTraces) | Cannot decode v0.4 traces payload: msgp: attempted to decode type "float64" with method for "str"
2020-05-28 22:21:21 UTC | TRACE | ERROR | (pkg/trace/api/api.go:379 in handleTraces) | Cannot decode v0.4 traces payload: msgp: attempted to decode type "map" with method for "str"
2020-05-28 22:21:21 UTC | TRACE | ERROR | (pkg/trace/api/api.go:379 in handleTraces) | Cannot decode v0.4 traces payload: msgp: attempted to decode type "map" with method for "str"
2020-05-28 22:21:21 UTC | TRACE | ERROR | (pkg/trace/api/api.go:379 in handleTraces) | Cannot decode v0.4 traces payload: msgp: attempted to decode type "int" with method for "str"
2020-05-28 22:21:22 UTC | TRACE | ERROR | (pkg/trace/api/api.go:379 in handleTraces) | Cannot decode v0.4 traces payload: msgp: attempted to decode type "int" with method for "str"
2020-05-28 22:21:22 UTC | TRACE | ERROR | (pkg/trace/api/api.go:379 in handleTraces) | Cannot decode v0.4 traces payload: msgp: attempted to decode type "array" with method for "str"
2020-05-28 22:21:22 UTC | TRACE | ERROR | (pkg/trace/api/api.go:379 in handleTraces) | Cannot decode v0.4 traces payload: msgp: attempted to decode type "int" with method for "str"
2020-05-28 22:21:22 UTC | TRACE | ERROR | (pkg/trace/api/api.go:379 in handleTraces) | Cannot decode v0.4 traces payload: msgp: attempted to decode type "int" with method for "str"
2020-05-28 22:21:23 UTC | TRACE | ERROR | (pkg/trace/api/api.go:379 in handleTraces) | Cannot decode v0.4 traces payload: msgp: attempted to decode type "map" with method for "str"
2020-05-28 22:21:23 UTC | TRACE | ERROR | (pkg/trace/api/api.go:379 in handleTraces) | Cannot decode v0.4 traces payload: msgp: attempted to decode type "map" with method for "str"
2020-05-28 22:21:23 UTC | TRACE | ERROR | (pkg/trace/api/api.go:379 in handleTraces) | Cannot decode v0.4 traces payload: msgp: attempted to decode type "int" with method for "str"
2020-05-28 22:21:24 UTC | TRACE | ERROR | (pkg/trace/api/api.go:379 in handleTraces) | Cannot decode v0.4 traces payload: msgp: attempted to decode type "int" with method for "str"
2020-05-28 22:21:24 UTC | TRACE | ERROR | (pkg/trace/api/api.go:379 in handleTraces) | Cannot decode v0.4 traces payload: msgp: attempted to decode type "array" with method for "str"
2020-05-28 22:21:24 UTC | TRACE | ERROR | (pkg/trace/api/api.go:379 in handleTraces) | Cannot decode v0.4 traces payload: msgp: attempted to decode type "map" with method for "str"
2020-05-28 22:21:24 UTC | TRACE | ERROR | (pkg/trace/api/api.go:379 in handleTraces) | Cannot decode v0.4 traces payload: msgp: attempted to decode type "array" with method for "str"
2020-05-28 22:21:25 UTC | TRACE | ERROR | (pkg/trace/api/api.go:379 in handleTraces) | Cannot decode v0.4 traces payload: msgp: attempted to decode type "map" with method for "str"
2020-05-28 22:21:25 UTC | TRACE | ERROR | (pkg/trace/api/api.go:379 in handleTraces) | Cannot decode v0.4 traces payload: msgp: attempted to decode type "map" with method for "str"
2020-05-28 22:21:25 UTC | TRACE | ERROR | (pkg/trace/api/api.go:379 in handleTraces) | Cannot decode v0.4 traces payload: msgp: attempted to decode type "int" with method for "str"
2020-05-28 22:21:26 UTC | TRACE | ERROR | (pkg/trace/api/api.go:379 in handleTraces) | Cannot decode v0.4 traces payload: msgp: attempted to decode type "int" with method for "str"
2020-05-28 22:21:26 UTC | TRACE | ERROR | (pkg/trace/api/api.go:379 in handleTraces) | Cannot decode v0.4 traces payload: msgp: attempted to decode type "array" with method for "str"
2020-05-28 22:21:26 UTC | TRACE | ERROR | (pkg/trace/api/api.go:379 in handleTraces) | Cannot decode v0.4 traces payload: msgp: attempted to decode type "int" with method for "str"
2020-05-28 22:21:26 UTC | TRACE | ERROR | (pkg/trace/api/api.go:379 in handleTraces) | Cannot decode v0.4 traces payload: msgp: attempted to decode type "array" with method for "str"
2020-05-28 22:21:27 UTC | TRACE | ERROR | (pkg/trace/api/api.go:379 in handleTraces) | Cannot decode v0.4 traces payload: msgp: attempted to decode type "map" with method for "str"
2020-05-28 22:21:27 UTC | TRACE | ERROR | (pkg/trace/api/api.go:379 in handleTraces) | Cannot decode v0.4 traces payload: msgp: attempted to decode type "map" with method for "str"
2020-05-28 22:21:27 UTC | TRACE | ERROR | (pkg/trace/api/api.go:379 in handleTraces) | Cannot decode v0.4 traces payload: msgp: attempted to decode type "int" with method for "str"
2020-05-28 22:21:28 UTC | TRACE | ERROR | (pkg/trace/api/api.go:379 in handleTraces) | Cannot decode v0.4 traces payload: msgp: attempted to decode type "int" with method for "str"
2020-05-28 22:21:28 UTC | TRACE | ERROR | (pkg/trace/api/api.go:379 in handleTraces) | Cannot decode v0.4 traces payload: msgp: attempted to decode type "array" with method for "str"
2020-05-28 22:21:28 UTC | TRACE | ERROR | (pkg/trace/api/api.go:379 in handleTraces) | Cannot decode v0.4 traces payload: msgp: attempted to decode type "int" with method for "str"
2020-05-28 22:21:28 UTC | TRACE | ERROR | (pkg/trace/api/api.go:379 in handleTraces) | Cannot decode v0.4 traces payload: msgp: attempted to decode type "int" with method for "str"
2020-05-28 22:21:29 UTC | TRACE | ERROR | (pkg/trace/api/api.go:379 in handleTraces) | Cannot decode v0.4 traces payload: msgp: attempted to decode type "map" with method for "str"
2020-05-28 22:21:29 UTC | TRACE | ERROR | (pkg/trace/api/api.go:379 in handleTraces) | Cannot decode v0.4 traces payload: msgp: attempted to decode type "map" with method for "str"
2020-05-28 22:21:29 UTC | TRACE | ERROR | (pkg/trace/api/api.go:379 in handleTraces) | Cannot decode v0.4 traces payload: msgp: attempted to decode type "int" with method for "str"
2020-05-28 22:21:30 UTC | TRACE | ERROR | (pkg/trace/api/api.go:379 in handleTraces) | Cannot decode v0.4 traces payload: msgp: attempted to decode type "int" with method for "str"
2020-05-28 22:21:30 UTC | TRACE | ERROR | (pkg/trace/api/api.go:379 in handleTraces) | Cannot decode v0.4 traces payload: msgp: attempted to decode type "array" with method for "str"
2020-05-28 22:21:30 UTC | TRACE | ERROR | (pkg/trace/api/api.go:379 in handleTraces) | Cannot decode v0.4 traces payload: msgp: attempted to decode type "int" with method for "str"
2020-05-28 22:21:30 UTC | TRACE | ERROR | (pkg/trace/api/api.go:379 in handleTraces) | Cannot decode v0.4 traces payload: unexpected EOF
2020-05-28 22:21:31 UTC | TRACE | ERROR | (pkg/trace/api/api.go:379 in handleTraces) | Cannot decode v0.4 traces payload: msgp: attempted to decode type "map" with method for "str"
2020-05-28 22:21:31 UTC | TRACE | ERROR | (pkg/trace/api/api.go:379 in handleTraces) | Cannot decode v0.4 traces payload: msgp: attempted to decode type "map" with method for "str"
2020-05-28 22:21:31 UTC | TRACE | ERROR | (pkg/trace/api/api.go:379 in handleTraces) | Cannot decode v0.4 traces payload: msgp: attempted to decode type "int" with method for "str"
2020-05-28 22:21:32 UTC | TRACE | ERROR | (pkg/trace/api/api.go:379 in handleTraces) | Cannot decode v0.4 traces payload: msgp: attempted to decode type "int" with method for "str"
2020-05-28 22:21:32 UTC | TRACE | ERROR | (pkg/trace/api/api.go:379 in handleTraces) | Cannot decode v0.4 traces payload: msgp: attempted to decode type "array" with method for "str"
2020-05-28 22:21:32 UTC | TRACE | ERROR | (pkg/trace/api/api.go:379 in handleTraces) | Cannot decode v0.4 traces payload: msgp: attempted to decode type "int" with method for "str"
2020-05-28 22:21:32 UTC | TRACE | ERROR | (pkg/trace/api/api.go:379 in handleTraces) | Cannot decode v0.4 traces payload: unexpected EOF
2020-05-28 22:21:33 UTC | TRACE | ERROR | (pkg/trace/api/api.go:379 in handleTraces) | Cannot decode v0.4 traces payload: msgp: attempted to decode type "map" with method for "str"
2020-05-28 22:21:33 UTC | TRACE | ERROR | (pkg/trace/api/api.go:379 in handleTraces) | Cannot decode v0.4 traces payload: msgp: attempted to decode type "map" with method for "str"
2020-05-28 22:21:33 UTC | TRACE | ERROR | (pkg/trace/api/api.go:379 in handleTraces) | Cannot decode v0.4 traces payload: msgp: attempted to decode type "int" with method for "str"
2020-05-28 22:21:33 UTC | TRACE | INFO | (pkg/trace/info/stats.go:108 in LogStats) | [lang:nodejs lang_version:v12.16.1 interpreter:v8 tracer_version:0.18.1] -> traces received: 180, traces filtered: 0, traces amount: 94650 bytes, events extracted: 0, events sampled: 0
2020-05-28 22:21:33 UTC | TRACE | INFO | (pkg/trace/info/stats.go:108 in LogStats) | [lang:nodejs lang_version:v12.16.1 interpreter:v8 tracer_version:0.19.1] -> traces received: 30, traces filtered: 0, traces amount: 13440 bytes, events extracted: 0, events sampled: 0
2020-05-28 22:21:33 UTC | TRACE | WARN | (pkg/trace/info/stats.go:111 in LogStats) | [lang:nodejs lang_version:v12.16.1 interpreter:v8 tracer_version:0.19.1] -> traces_dropped(decoding_error:39697). Enable debug logging for more details.
2020-05-28 22:21:34 UTC | TRACE | ERROR | (pkg/trace/api/api.go:379 in handleTraces) | Cannot decode v0.4 traces payload: msgp: attempted to decode type "int" with method for "str"
2020-05-28 22:21:34 UTC | TRACE | ERROR | (pkg/trace/api/api.go:379 in handleTraces) | Cannot decode v0.4 traces payload: msgp: attempted to decode type "array" with method for "str"
2020-05-28 22:21:34 UTC | CORE | WARN | (pkg/logs/input/docker/launcher.go:268 in restartTailer) | Could not resume tailing container 7f1dfb34d821
seelog internal error: write /var/log/datadog/agent.log: cannot allocate memory
2020-05-28 22:21:36 UTC | CORE | WARN | (pkg/logs/auditor/auditor.go:172 in run) | write /opt/datadog-agent/run/registry.json: cannot allocate memory
seelog internal error: write /var/log/datadog/agent.log: cannot allocate memory
2020-05-28 22:21:37 UTC | PROCESS | CRITICAL | (pkg/process/util/signal_nowindows.go:21 in HandleSignals) | Caught signal 'terminated'; terminating.
2020-05-28 22:21:37 UTC | PROCESS | INFO | (pkg/forwarder/forwarder.go:243 in Stop) | stopping the Forwarder
2020-05-28 22:21:37 UTC | PROCESS | INFO | (pkg/forwarder/domain_forwarder.go:200 in Stop) | domainForwarder stopped
2020-05-28 22:21:37 UTC | PROCESS | INFO | (pkg/forwarder/forwarder.go:243 in Stop) | stopping the Forwarder
2020-05-28 22:21:37 UTC | CORE | INFO | (cmd/agent/app/run.go:104 in func2) | Received signal 'terminated', shutting down...
2020-05-28 22:21:37 UTC | TRACE | INFO | (main.go:23 in handleSignal) | received signal 15 (terminated)
2020-05-28 22:21:37 UTC | CORE | INFO | (pkg/collector/runner/runner.go:149 in Stop) | Runner is shutting down...
2020-05-28 22:21:37 UTC | CORE | INFO | (pkg/collector/python/subprocesses.go:48 in TerminateRunningProcesses) | Canceling all running python subprocesses
2020-05-28 22:21:37 UTC | CORE | INFO | (pkg/collector/runner/runner.go:166 in func1) | Stopping Check kubelet that is still running...
2020-05-28 22:21:37 UTC | CORE | INFO | (pkg/collector/runner/runner.go:166 in func1) | Stopping Check disk that is still running...
2020-05-28 22:21:37 UTC | CORE | INFO | (pkg/aggregator/aggregator.go:639 in run) | Stopping aggregator
ifelse: fatal: unable to spawn s6-test: Out of memory
2020-05-28 22:21:38 UTC | CORE | WARN | (pkg/logs/auditor/auditor.go:172 in run) | write /opt/datadog-agent/run/registry.json: cannot allocate memory
backtick: fatal: unable to exec importas: Out of memory

So I turned on debug logging in dd-trace in my app to inspect the trace payloads. Here's a sample span that was logged by the tracer:

Encoding trace: [{"trace_id":"31731c1fac15ceea","span_id":"31731c1fac15ceea","parent_id":"0000000000000000","name":"pg.query","resource":"INSERT INTO \"user_role\"(\"roleId\", \"userId\") VALUES ($1, $2)","error":0,"meta":{"service":"moleculer","version":"0.14.7","span.kind":"client","db.type":"postgres","db.name":"iam","db.user":"velocity","out.host":"postgres"},"metrics":{"_sample_rate":1,"out.port":5432,"_dd.agent_psr":1,"_sampling_priority_v1":1},"start":1590733517735480600,"duration":1671143,"service":"moleculer-postgres","type":"sql"}]
Added encoded trace to buffer: 91 8c a8 74 72 61 63 65 5f 69 64 cf 31 73 1c 1f ac 15 ce ea b9 6e 28 38 97 e4 3d 34 cf 31 73 1c 1f ac 15 ce ea 98 b4 51 b9 6c b6 c0 16 13 6b b4 d0 ee 4d 00 a8 64 75 72 61 74 69 6f 6e d3 00 00 00 00 00 19 7f e7 a5 65 72 72 6f 72 00 a9 70 61 72 65 6e 74 5f 69 64 cf 00 00 00 00 00 00 00 00 a4 6e 61 6d 65 a8 70 67 2e 71 75 65 72 79 a8 72 65 73 6f 75 72 63 65 d9 3b 49 4e 53 45 52 54 20 49 4e 54 4f 20 22 75 73 65 72 5f 72 6f 6c 65 22 28 22 72 6f 6c 65 49 64 22 2c 20 22 75 73 65 72 49 64 22 29 20 56 41 4c 55 45 53 20 28 24 31 2c 20 24 32 29 a7 73 65 72 76 69 63 65 b2 6d 6f 6c 65 63 75 6c 65 72 2d 70 6f 73 74 67 72 65 73 a4 74 79 70 65 a3 73 71 6c a4 6d 65 74 61 87 a7 73 65 72 76 69 63 65 a9 6d 6f 6c 65 63 75 6c 65 72 a7 76 65 72 73 69 6f 6e a6 30 2e 31 34 2e 37 a9 73 70 61 6e 2e 6b 69 6e 64 a6 63 6c 69 65 6e 74 a7 64 62 2e 74 79 70 65 a8 70 6f 73 74 67 72 65 73 a7 64 62 2e 6e 61 6d 65 a3 69 61 6d a7 64 62 2e 75 73 65 72 a8 76 65 6c 6f 63 69 74 79 a8 6f 75 74 2e 68 6f 73 74 a8 70 6f 73 74 67 72 65 73 a7 6d 65 74 72 69 63 73 84 ac 5f 73 61 6d 70 6c 65 5f 72 61 74 65 cb 3f f0 00 00 00 00 00 00 a8 6f 75 74 2e 70 6f 72 74 cb 40 b5 38 00 00 00 00 00 ad 5f 64 64 2e 61 67 65 6e 74 5f 70 73 72 cb 3f f0 00 00 00 00 00 00 b5 5f 73 61 6d 70 6c 69 6e 67 5f 70 72 69 6f 72 69 74 79 5f 76 31 cb 3f f0 00 00 00 00 00 00

The printed buffer appears to be corrupt:

91 = 1-item array
  8c = 12-entry map
    --
    a8 = 8-byte string
      74 72 61 63 65 5f 69 64 = "trace_id"
    cf = uint64
      31 73 1c 1f ac 15 ce ea
    --
    b9 = 25-byte string
      6e 28 38 97 e4 3d 34 cf 31 73 1c 1f ac 15 ce ea 98 b4 51 b9 6c b6 c0 16 13 = garbage
    6b = 107
    --
    b4 = 20-byte string
      d0 ee 4d 00 a8 64 75 72 61 74 69 6f 6e d3 00 00 00 00 00 19 = garbage, including the string "duration"
    7f = 127
    --
[...]

It should be:

91 = 1-item array
  8c = 12-entry map
    --
    a8 = 8-byte string
      74 72 61 63 65 5f 69 64 = "trace_id"
    cf = uint64
      31 73 1c 1f ac 15 ce ea
    --
-   b9 6e 28 38 97 e4 3d 34
+   a7 = 7-byte string
+     73 70 61 6e 5f 69 64 = "span_id"
    cf = uint64
      31 73 1c 1f ac 15 ce ea
    --
-   98 b4 51 b9 6c b6 c0 16 13 6b b4 d0 ee 4d 00
+   a5 = 5-byte string
+     73 74 61 72 74 = "start"
+   d3 = int64
+     16 13 6b b4 d0 ee 4d 18 = 1590733517735480600
    --
    a8 = 8-byte string
      64 75 72 61 74 69 6f 6e = "duration"
    d3 = int64
      00 00 00 00 00 19 7f e7 = 1671143
    --
    a5 = 5-byte string
      65 72 72 6f 72 = "error"
    00 = 0
    --
    a9 = 9-byte string
      70 61 72 65 6e 74 5f 69 64 = "parent_id"
    cf = uint64
      00 00 00 00 00 00 00 00
    --
    a4 = 4-byte string
      6e 61 6d 65 = "name"
    a8 = 8-byte string
      70 67 2e 71 75 65 72 79 = "pg.query"
    --
[...]

Oddly, when I add this span data to the encoder unit tests in the dd-trace project, it encodes correctly. So I'm stumped. Any idea why the printed trace and encoded buffer don't match? Is this the cause of the agent decoding errors?

Environment

  • Operating system: Linux / macOS 10.15
  • Node version: 12.16.1
  • Tracer version: 0.21.0
  • Agent version: 7.19.2
@ngraef ngraef added the bug Something isn't working label May 29, 2020
@bengl
Copy link
Collaborator

bengl commented Jun 2, 2020

This is certainly quite odd. For reference, we encode MessagePack manually, rather than using an off-the-shelf library.

A few questions:

  1. Are the trace decoding errors constant? By that I mean, is every single payload sent to the agent broken, or just some?
  2. Is it always those two fields that are broken? (i.e. span_id and start)
  3. Is a small reproduction case possible?

@ngraef
Copy link
Author

ngraef commented Jun 3, 2020

Turns out this is an issue with moleculer (microservice framework) implementation of a Datadog trace exporter. (See moleculerjs/moleculer#725 (comment).) Basically, it's overwriting the trace_id and span_id with 16-byte GUIDs and the MessagePack encoder is expecting them to be 8 bytes, so the copied id buffers overwrite the header tokens. This can be easily reproduced by setting longer ids in the encoder unit tests and viewing the decoded result.

Additionally, once a span with a bad id has been encoded, all future encodes will also be corrupt, even spans with valid ids (like the one in the original post's example), due to the reuse of a previously-corrupted headerBuffer. This can also be easily reproduced in the encoder unit tests.

I'm not sure if there's any work to be done here, given that the issue appears to be misuse of the library's private internals, but this may spark some ideas for improvement on your end. I will also likely file the agent crash as a memory leak issue.

@rochdev rochdev added community and removed bug Something isn't working labels Jun 3, 2020
@rochdev
Copy link
Member

rochdev commented Jun 3, 2020

Using these externally is definitely not supported since they are internal APIs. I left a comment on moleculerjs/moleculer#725 to try to see how we can improve the situation.

@bengl
Copy link
Collaborator

bengl commented Jun 4, 2020

#976 truncates the IDs to 8 bytes. This should prevent corrupting the headerBuffer with longer IDs. It's now merged so it will be in the next release.

@rochdev
Copy link
Member

rochdev commented Jun 11, 2020

Fix released in 0.21.1

@rochdev rochdev closed this as completed Jun 11, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants