Conversation

@astuyve astuyve commented Jun 27, 2024

  • Adds support for APM tracing via v4 and v5 endpoints, sending data as v7 to the backend

  • Mostly a copy-paste of the mini agent, but a fair amount changed because we can't flush on a schedule.

@astuyve astuyve requested a review from a team as a code owner June 27, 2024 20:04
Comment on lines +57 to +63
if let Some(response) = http_utils::verify_request_content_length(
&parts.headers,
MAX_CONTENT_LENGTH,
"Error processing traces",
) {
return response;
}
Contributor
Reading this made me wonder whether it would be better to change the function's return type to a `Result<(), E>`, or else rename `response` to something negative. It wasn't clear that verify_request_content_length returns Some(response) only on the failure path.

Contributor Author
We can, but we'd need to make the same change in the mini agent.
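
To illustrate the reviewer's suggestion, here is a hypothetical standalone sketch of a `Result`-returning check. It is not the real `http_utils::verify_request_content_length` signature (which takes a hyper `HeaderMap`); the header is stood in for by a plain `Option<&str>` so the sketch is self-contained.

```rust
// Hypothetical sketch: return Result so the error path is explicit,
// instead of Some(response) meaning "reject". Names and types here are
// illustrative, not the real http_utils API.

const MAX_CONTENT_LENGTH: usize = 10 * 1024 * 1024; // 10 MiB

fn verify_request_content_length(
    content_length: Option<&str>,
    max: usize,
) -> Result<(), String> {
    match content_length {
        None => Err("Missing content-length header".to_string()),
        Some(v) => match v.parse::<usize>() {
            Err(_) => Err(format!("Invalid content-length: {v}")),
            Ok(len) if len > max => Err(format!("Content length {len} exceeds limit {max}")),
            Ok(_) => Ok(()),
        },
    }
}

fn main() {
    assert!(verify_request_content_length(Some("1024"), MAX_CONTENT_LENGTH).is_ok());
    assert!(verify_request_content_length(Some("999999999999"), MAX_CONTENT_LENGTH).is_err());
    assert!(verify_request_content_length(None, MAX_CONTENT_LENGTH).is_err());
    // A caller's happy path then reads naturally with `?` or an early return:
    // verify_request_content_length(...)?;
}
```

With this shape the call site makes the rejection explicit (`if let Err(e) = ...` or `?`), which addresses the naming confusion the reviewer raised.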

Comment on lines 94 to 98
// trace_utils::set_serverless_root_span_tags(
// &mut chunk.spans[root_span_index],
// config.function_name.clone(),
// &config.env_type,
// );
Contributor
Should we remove this or keep it?

Contributor Author
Good question. As far as I can tell we need these tags on every span, not just the root span. We can discuss this on Slack.

Comment on lines +99 to +102
chunk.spans.retain(|span| {
(span.resource != "127.0.0.1" || span.resource != "0.0.0.0")
&& span.name != "dns.lookup"
});
Contributor
Might be good to refactor this into a function that specifies which spans we don't want?

Contributor Author
Can do, if you'd rather have that than a closure.

Contributor
Maybe not now, but I wonder whether in the future there will be more spans we want to ignore.
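
A minimal sketch of the suggested refactor, with the filter pulled into a named predicate so future ignore rules have one place to live. Note that the original clauses joined by `||` are always true (a resource can't equal both addresses at once), so this sketch assumes the intent was `&&`; the `Span` struct here is a stand-in for the real trace type.

```rust
// Hypothetical predicate-function refactor of the retain closure.
// `Span` is a simplified stand-in, not the real trace-protobuf type.

#[derive(Debug)]
struct Span {
    name: String,
    resource: String,
}

/// Returns true for spans we want to keep. Assumes the original
/// `!= ... || != ...` was meant to drop spans on either local address.
fn should_keep(span: &Span) -> bool {
    let is_local_resource = span.resource == "127.0.0.1" || span.resource == "0.0.0.0";
    !is_local_resource && span.name != "dns.lookup"
}

fn main() {
    let mut spans = vec![
        Span { name: "http.request".into(), resource: "GET /".into() },
        Span { name: "dns.lookup".into(), resource: "example.com".into() },
        Span { name: "tcp.connect".into(), resource: "127.0.0.1".into() },
    ];
    spans.retain(should_keep);
    assert_eq!(spans.len(), 1);
    assert_eq!(spans[0].name, "http.request");
}
```

The call site then becomes `chunk.spans.retain(should_keep);`, and adding a new ignore rule is a one-line change inside the predicate.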

Comment on lines +79 to +92
let trace_processor = self.trace_processor.clone();
let stats_processor = self.stats_processor.clone();
let endpoint_config = self.config.clone();
let tags_provider = self.tags_provider.clone();

let make_svc = make_service_fn(move |_| {
let trace_processor = trace_processor.clone();
let trace_tx = trace_tx.clone();

let stats_processor = stats_processor.clone();
let stats_tx = stats_tx.clone();

let endpoint_config = endpoint_config.clone();
let tags_provider = tags_provider.clone();
Contributor
Rust question: so many clones. Can't we just move them all the way down? 🤔

Contributor Author
Nah, we need to clone into make_service_fn and then again into service_fn. This all changed in hyper 1.x, so maybe we can look forward to fixing it when we upgrade?

Contributor
Sounds good!
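
A std-only sketch of why the double clone is needed. In hyper 0.14, `make_service_fn` is a factory that runs once per connection and `service_fn` runs once per request; each `move` closure must own its own handle. The closures below stand in for those two layers, and the config name is illustrative.

```rust
// Minimal stand-in for the hyper 0.14 two-layer closure pattern:
// clone once into the per-connection factory, again into the
// per-request closure, so each level owns its own Arc handle.

use std::sync::Arc;

fn main() {
    let shared_config = Arc::new(String::from("endpoint-config"));

    // Outer closure = per-connection factory (stand-in for make_service_fn).
    let config = shared_config.clone();
    let make_svc = move || {
        // Inner clone: each per-request closure (stand-in for service_fn)
        // takes ownership of its own handle, so the factory can be reused.
        let config = config.clone();
        move || format!("handling request with {config}")
    };

    let svc_a = make_svc();
    let svc_b = make_svc(); // factory stays usable because it kept its own clone
    assert_eq!(svc_a(), "handling request with endpoint-config");
    assert_eq!(svc_b(), "handling request with endpoint-config");
    // original + factory's handle + one per service closure
    assert_eq!(Arc::strong_count(&shared_config), 4);
}
```

Only cheap `Arc` reference counts are cloned, not the underlying data, which is why this pattern is tolerable despite the noise.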

Comment on lines +42 to +48
if let Some(response) = http_utils::verify_request_content_length(
&parts.headers,
MAX_CONTENT_LENGTH,
"Error processing trace stats",
) {
return response;
}
Contributor
Same comment as I left on the other file.

pub serverless_flush_strategy: FlushStrategy,
pub trace_enabled: bool,
pub serverless_trace_enabled: bool,
pub capture_lambda_payload: bool,
Contributor
I was going to ask whether we should remove this, but then I remembered that we are currently targeting Node + Python.

Contributor Author
Yeah, we need all of these.

Contributor

@duncanista duncanista left a comment
Left some comments – impressive work as always! 🔥

Contributor

@duncanista duncanista left a comment
LGTM – :shipit:

@astuyve astuyve merged commit a95e0a1 into main Jun 28, 2024
@astuyve astuyve deleted the aj/add-trace-agent branch June 28, 2024 15:29
lym953 added a commit that referenced this pull request Oct 21, 2025
# This PR
Change the trace request limit from 2 MiB to 50 MiB.

# Motivation
When the Node.js tracer layer sends the Lambda extension a request that's
between 2 MiB and 50 MiB, the extension closes the HTTP connection, and the
tracer gets an `EPIPE` error and breaks. (Maybe the tracer should handle
the error better, but that's out of scope for this PR.)

According to @rochdev:
> the agent is supposed to have a limit of 50mb

So let's change the limit on the agent side to match that expectation.

# Testing
Tested with a Node.js 22 Lambda with this handler:
```js
import tracer from 'dd-trace';
import crypto from 'crypto';
tracer.init();

function randomGarbage(len) {
  // low-compressibility payload (random bytes -> base64)
  return crypto.randomBytes(len).toString('base64');
}

export const handler = async (event) => {
  const SPANS = 3000;
  const TAG_BYTES_PER_SPAN = 20_000; // ~20 KB per span tag (base64 expands a bit)

  const root = tracer.startSpan('repro.root');
  root.setTag('dd.repro', 'true');

  for (let i = 0; i < SPANS; i++) {
    console.log(`Sending the ${i}-th span`);
    const span = tracer.startSpan('repro.child', { childOf: root });
    span.setTag('blob', randomGarbage(TAG_BYTES_PER_SPAN));
    span.finish();
  }
  root.finish();

  const response = {
    statusCode: 200,
    body: JSON.stringify('Hello from Lambda!'),
  };
  return response;
};
```

### Before:
There are errors like:
```
Error: write EPIPE
at WriteWrap.onWriteComplete [as oncomplete] (node:internal/stream_base_commons:95:16)
at WriteWrap.callbackTrampoline (node:internal/async_hooks:130:17)
```
```
LAMBDA_RUNTIME Failed to post handler success response. Http response code: 403. {"errorMessage":"State transition from Ready to InvocationErrorResponse failed for runtime. Error: State transition is not allowed","errorType":"InvalidStateTransition"}
```

### After
When Lambda's memory is 1024 MB, the error no longer happens.
When Lambda's memory is 512 MB, the invocation can fail due to OOM, but
I think that's a legitimate error. We can ask customers to increase the
memory limit for high-volume workloads like this.

# Notes
cc @astuyve who set a `MAX_CONTENT_LENGTH` of 10 MiB in
#294. This PR
increases it to 50 MiB as well.

Thanks @dougqh @duncanista @lucaspimentel @rochdev for discussion.

#899
Jira: https://datadoghq.atlassian.net/browse/SVLS-7777