Add APM tracing support #294
…ersion. Pass trace provider. Manual stats flushing. Custom create endpoint until we clean up that code in libdatadog.
```
if let Some(response) = http_utils::verify_request_content_length(
    &parts.headers,
    MAX_CONTENT_LENGTH,
    "Error processing traces",
) {
    return response;
}
```
Reading this made me wonder whether it would be better to change the function's return type to a `Result<(), Err>`, or else rename this `response` to something negative, since it wasn't clear that `verify_request_content_length` was supposed to return `Some(response)` to signal a failure.
We can, but we'd need to do it in the mini agent
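For illustration only, a minimal sketch of the `Result`-based shape suggested above (hypothetical helper name and behavior, not the actual libdatadog API):

```rust
use hyper::{Body, Response};

/// Hypothetical alternative: return Err(response) on failure instead of
/// Some(response), so the "negative" path is explicit at the call site.
fn check_request_content_length(
    headers: &hyper::HeaderMap,
    max_content_length: usize,
    error_prefix: &str,
) -> Result<(), Response<Body>> {
    let too_large = headers
        .get(hyper::header::CONTENT_LENGTH)
        .and_then(|v| v.to_str().ok())
        .and_then(|v| v.parse::<usize>().ok())
        .map_or(false, |len| len > max_content_length);

    if too_large {
        // Build the rejection response that the caller returns to the client.
        let response = Response::builder()
            .status(hyper::StatusCode::PAYLOAD_TOO_LARGE)
            .body(Body::from(format!("{error_prefix}: payload too large")))
            .expect("static response should build");
        return Err(response);
    }
    Ok(())
}
```

At the call site the failure path then reads as `if let Err(response) = check_request_content_length(...) { return response; }`, which makes the negative meaning explicit.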
```
// trace_utils::set_serverless_root_span_tags(
//     &mut chunk.spans[root_span_index],
//     config.function_name.clone(),
//     &config.env_type,
// );
```
Should we remove this or keep it?
Good question. As far as I can tell we need these tags on every span, not just the root span. We can discuss this on Slack a bit.
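For illustration, a sketch of what tagging every span in a chunk could look like, using simplified stand-in types and a hypothetical helper name (the tag keys are illustrative, not the exact keys the mini agent sets):

```rust
use std::collections::HashMap;

/// Simplified stand-in for the trace protobuf span type.
struct Span {
    meta: HashMap<String, String>,
}

/// Simplified stand-in for a trace chunk.
struct Chunk {
    spans: Vec<Span>,
}

/// Hypothetical helper: apply the serverless tags to every span in the
/// chunk, not only the root span.
fn set_serverless_span_tags(chunk: &mut Chunk, function_name: Option<String>, env_type: &str) {
    for span in &mut chunk.spans {
        if let Some(name) = &function_name {
            // Illustrative tag key for the function name.
            span.meta.insert("functionname".to_string(), name.clone());
        }
        // Illustrative tag key for the serverless environment type.
        span.meta.insert("_dd.origin".to_string(), env_type.to_string());
    }
}
```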
```
chunk.spans.retain(|span| {
    (span.resource != "127.0.0.1" && span.resource != "0.0.0.0")
        && span.name != "dns.lookup"
});
```
Might be good to refactor it into a function which specifies which spans we don't want?
Can do that if you'd rather have that than a closure.
Maybe not now, but I wonder whether in the future there will be more spans that we want to ignore.
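A sketch of the suggested refactor, with the filter pulled into a named predicate (hypothetical function name and a simplified span type; the real span type comes from the trace protobuf definitions):

```rust
/// Simplified stand-in for the trace protobuf span type.
struct Span {
    name: String,
    resource: String,
}

/// Hypothetical predicate that names the spans we want to drop.
fn is_ignored_span(span: &Span) -> bool {
    // Loopback "resource" spans and DNS lookups are noise in serverless traces.
    span.resource == "127.0.0.1"
        || span.resource == "0.0.0.0"
        || span.name == "dns.lookup"
}

// Call site, equivalent to the closure in the diff:
// chunk.spans.retain(|span| !is_ignored_span(span));
```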
```
let trace_processor = self.trace_processor.clone();
let stats_processor = self.stats_processor.clone();
let endpoint_config = self.config.clone();
let tags_provider = self.tags_provider.clone();

let make_svc = make_service_fn(move |_| {
    let trace_processor = trace_processor.clone();
    let trace_tx = trace_tx.clone();

    let stats_processor = stats_processor.clone();
    let stats_tx = stats_tx.clone();

    let endpoint_config = endpoint_config.clone();
    let tags_provider = tags_provider.clone();
```
Rust Q, sooo many clones, can't we just move them all the way down? 🤔
Nah, we need to clone into `make_service_fn` and then again into `service_fn`. This all changed in hyper 1.x, so maybe we can look forward to fixing it then?
Sg!
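For reference, a minimal sketch of the double-clone pattern being discussed, assuming hyper 0.14 and a placeholder `Arc` in place of the processors, config, and channel senders from the diff:

```rust
// Assumed deps: hyper = { version = "0.14", features = ["full"] },
// tokio = { version = "1", features = ["full"] }
use std::{convert::Infallible, sync::Arc};

use hyper::service::{make_service_fn, service_fn};
use hyper::{Body, Request, Response, Server};

#[tokio::main]
async fn main() -> Result<(), hyper::Error> {
    // Placeholder for the processors, config, and channel senders in the diff.
    let shared = Arc::new(String::from("shared state"));

    // Outer clone: make_service_fn runs once per connection, so the closure
    // must own its captures (`move`) and hand a fresh clone to each service.
    let make_svc = make_service_fn(move |_conn| {
        let shared = shared.clone();
        async move {
            // Inner clone: service_fn runs once per request on that
            // connection, so each request handler needs its own copy too.
            Ok::<_, Infallible>(service_fn(move |_req: Request<Body>| {
                let shared = shared.clone();
                async move { Ok::<_, Infallible>(Response::new(Body::from((*shared).clone()))) }
            }))
        }
    });

    Server::bind(&([127, 0, 0, 1], 8126).into())
        .serve(make_svc)
        .await
}
```

The outer clone runs once per connection and the inner one once per request, which is why each captured handle shows up twice.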
```
if let Some(response) = http_utils::verify_request_content_length(
    &parts.headers,
    MAX_CONTENT_LENGTH,
    "Error processing trace stats",
) {
    return response;
}
```
Same comment that I had for the other file
```
pub serverless_flush_strategy: FlushStrategy,
pub trace_enabled: bool,
pub serverless_trace_enabled: bool,
pub capture_lambda_payload: bool,
```
I was gonna say, should we remove this? Then I remembered that we are currently targeting Node + Python.
Yeah, we need all of these.
duncanista
left a comment
Left some comments – impressive work as always! 🔥
duncanista
left a comment
LGTM!
# This PR

Change the trace request limit from 2 MiB to 50 MiB.

# Motivation

When the Node.js tracer layer sends a request to the Lambda extension that's between 2 MiB and 50 MiB, the extension closes the HTTP connection, the tracer gets an `EPIPE` error and breaks. (Maybe the tracer should handle the error better, but that's out of scope of this PR.)

According to @rochdev:

> the agent is supposed to have a limit of 50mb

So let's change the limit on the agent side to match the expectation.

# Testing

Tested with a Node.js 22 Lambda with this handler:

```
import tracer from 'dd-trace';
import crypto from 'crypto';

tracer.init();

function randomGarbage(len) {
  // low-compressibility payload (random bytes -> base64)
  return crypto.randomBytes(len).toString('base64');
}

export const handler = async (event) => {
  const SPANS = 3000;
  const TAG_BYTES_PER_SPAN = 20_000; // ~20 KB per span tag (base64 expands a bit)

  const root = tracer.startSpan('repro.root');
  root.setTag('dd.repro', 'true');

  for (let i = 0; i < SPANS; i++) {
    console.log(`Sending the ${i}-th span`);
    const span = tracer.startSpan('repro.child', { childOf: root });
    span.setTag('blob', randomGarbage(TAG_BYTES_PER_SPAN));
    span.finish();
  }

  root.finish();

  const response = {
    statusCode: 200,
    body: JSON.stringify('Hello from Lambda!'),
  };
  return response;
};
```

### Before

There are errors like:

```
Error: write EPIPE
    at WriteWrap.onWriteComplete [as oncomplete] (node:internal/stream_base_commons:95:16)
    at WriteWrap.callbackTrampoline (node:internal/async_hooks:130:17)
```

```
LAMBDA_RUNTIME Failed to post handler success response. Http response code: 403.
{"errorMessage":"State transition from Ready to InvocationErrorResponse failed for runtime. Error: State transition is not allowed","errorType":"InvalidStateTransition"}
```

### After

When the Lambda's memory is 1024 MB, the error no longer happens.

When the Lambda's memory is 512 MB, the invocation can fail due to OOM. But I think that's a legit error. We can ask customers to increase the memory limit for high-volume workloads like this.

# Notes

cc @astuyve who set a `MAX_CONTENT_LENGTH` of 10 MiB in #294. This PR increases it to 50 MiB as well.

Thanks @dougqh @duncanista @lucaspimentel @rochdev for discussion.

#899

Jira: https://datadoghq.atlassian.net/browse/SVLS-7777
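For context, a change like this usually amounts to adjusting a single constant; a minimal sketch, assuming the limit lives in the `MAX_CONTENT_LENGTH` constant referenced above (its exact module is not shown here):

```rust
// Maximum accepted request body size for the trace endpoints.
// Raised to match the 50 MiB cap the agent is expected to allow.
pub const MAX_CONTENT_LENGTH: usize = 50 * 1024 * 1024;
```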
Adds support for APM tracing via the v4 and v5 endpoints, sending data to the backend as v7.
Mostly a copy-paste of the mini agent, but a bunch of things changed since we can't flush on a schedule.
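Since the extension cannot flush on a schedule, flushing has to be triggered explicitly. A minimal sketch of that idea, assuming a tokio mpsc channel drives a hypothetical aggregator task (names and types here are illustrative, not the PR's actual implementation):

```rust
use tokio::sync::mpsc;

/// Commands the aggregator loop understands. Flushing is driven by an
/// explicit message instead of an interval timer.
enum FlushCommand {
    /// Flush everything buffered so far (e.g. at the end of an invocation).
    Flush,
    /// Flush one last time and stop the loop (e.g. on SHUTDOWN).
    Stop,
}

/// Placeholder aggregator; stands in for the trace/stats buffers.
struct Aggregator {
    buffered: Vec<String>,
}

impl Aggregator {
    async fn flush(&mut self) {
        // The real implementation would send the buffered payloads to the
        // backend; this sketch just drains the buffer.
        println!("flushing {} buffered payloads", self.buffered.len());
        self.buffered.clear();
    }
}

async fn run_aggregator(mut aggregator: Aggregator, mut rx: mpsc::Receiver<FlushCommand>) {
    while let Some(command) = rx.recv().await {
        match command {
            FlushCommand::Flush => aggregator.flush().await,
            FlushCommand::Stop => {
                aggregator.flush().await;
                break;
            }
        }
    }
}

#[tokio::main]
async fn main() {
    let (tx, rx) = mpsc::channel(16);
    let handle = tokio::spawn(run_aggregator(Aggregator { buffered: vec![] }, rx));

    // Elsewhere, e.g. when an invocation completes, trigger a manual flush:
    tx.send(FlushCommand::Flush).await.expect("aggregator task alive");
    tx.send(FlushCommand::Stop).await.expect("aggregator task alive");
    handle.await.expect("aggregator task finished");
}
```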