Conversation

@astuyve astuyve commented Jun 27, 2024

  • Adds support for APM tracing via v4 and v5 endpoints, sending data as v7 to the backend

  • Mostly a copy-paste of the mini agent, but a fair amount changed because we can't flush on a schedule.

@astuyve astuyve requested a review from a team as a code owner June 27, 2024 20:04
Comment on lines +57 to +63
if let Some(response) = http_utils::verify_request_content_length(
&parts.headers,
MAX_CONTENT_LENGTH,
"Error processing traces",
) {
return response;
}
Contributor
Reading this made me wonder whether it would be better to change the function's return type to a `Result<(), E>`, or else rename `response` to something negative. It wasn't clear that verify_request_content_length returns Some(response) only on the failure path.

Contributor Author
We can, but we'd need to make the same change in the mini agent.
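
To illustrate the reviewer's suggestion, here is a hypothetical standalone sketch of a `Result`-returning check. It is not the real `http_utils::verify_request_content_length` signature (which takes a hyper `HeaderMap`); the header is stood in for by a plain `Option<&str>` so the sketch is self-contained.

```rust
// Hypothetical sketch: return Result so the error path is explicit,
// instead of Some(response) meaning "reject". Names and types here are
// illustrative, not the real http_utils API.

const MAX_CONTENT_LENGTH: usize = 10 * 1024 * 1024; // 10 MiB

fn verify_request_content_length(
    content_length: Option<&str>,
    max: usize,
) -> Result<(), String> {
    match content_length {
        None => Err("Missing content-length header".to_string()),
        Some(v) => match v.parse::<usize>() {
            Err(_) => Err(format!("Invalid content-length: {v}")),
            Ok(len) if len > max => Err(format!("Content length {len} exceeds limit {max}")),
            Ok(_) => Ok(()),
        },
    }
}

fn main() {
    assert!(verify_request_content_length(Some("1024"), MAX_CONTENT_LENGTH).is_ok());
    assert!(verify_request_content_length(Some("999999999999"), MAX_CONTENT_LENGTH).is_err());
    assert!(verify_request_content_length(None, MAX_CONTENT_LENGTH).is_err());
    // A caller's happy path then reads naturally with `?` or an early return:
    // verify_request_content_length(...)?;
}
```

With this shape the call site makes the rejection explicit (`if let Err(e) = ...` or `?`), which addresses the naming confusion the reviewer raised.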

Comment on lines 94 to 98
// trace_utils::set_serverless_root_span_tags(
// &mut chunk.spans[root_span_index],
// config.function_name.clone(),
// &config.env_type,
// );
Contributor
Should we remove this or keep it?

Contributor Author
Good question. As far as I can tell we need these tags on every span, not just the root span. We can discuss this on Slack.

Comment on lines +99 to +102
chunk.spans.retain(|span| {
(span.resource != "127.0.0.1" || span.resource != "0.0.0.0")
&& span.name != "dns.lookup"
});
Contributor
Might be good to refactor this into a function that specifies which spans we don't want?

Contributor Author
Can do, if you'd rather have that than a closure.

Contributor
Maybe not now, but I wonder whether in the future there will be more spans we want to ignore.
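
A minimal sketch of the suggested refactor, with the filter pulled into a named predicate so future ignore rules have one place to live. Note that the original clauses joined by `||` are always true (a resource can't equal both addresses at once), so this sketch assumes the intent was `&&`; the `Span` struct here is a stand-in for the real trace type.

```rust
// Hypothetical predicate-function refactor of the retain closure.
// `Span` is a simplified stand-in, not the real trace-protobuf type.

#[derive(Debug)]
struct Span {
    name: String,
    resource: String,
}

/// Returns true for spans we want to keep. Assumes the original
/// `!= ... || != ...` was meant to drop spans on either local address.
fn should_keep(span: &Span) -> bool {
    let is_local_resource = span.resource == "127.0.0.1" || span.resource == "0.0.0.0";
    !is_local_resource && span.name != "dns.lookup"
}

fn main() {
    let mut spans = vec![
        Span { name: "http.request".into(), resource: "GET /".into() },
        Span { name: "dns.lookup".into(), resource: "example.com".into() },
        Span { name: "tcp.connect".into(), resource: "127.0.0.1".into() },
    ];
    spans.retain(should_keep);
    assert_eq!(spans.len(), 1);
    assert_eq!(spans[0].name, "http.request");
}
```

The call site then becomes `chunk.spans.retain(should_keep);`, and adding a new ignore rule is a one-line change inside the predicate.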

Comment on lines +79 to +92
let trace_processor = self.trace_processor.clone();
let stats_processor = self.stats_processor.clone();
let endpoint_config = self.config.clone();
let tags_provider = self.tags_provider.clone();

let make_svc = make_service_fn(move |_| {
let trace_processor = trace_processor.clone();
let trace_tx = trace_tx.clone();

let stats_processor = stats_processor.clone();
let stats_tx = stats_tx.clone();

let endpoint_config = endpoint_config.clone();
let tags_provider = tags_provider.clone();
Contributor
Rust question: so many clones. Can't we just move them all the way down? 🤔

Contributor Author
Nah, we need to clone into make_service_fn and then again into service_fn. This all changed in hyper 1.x, so maybe we can look forward to fixing it when we upgrade?

Contributor
Sounds good!
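
A std-only sketch of why the double clone is needed. In hyper 0.14, `make_service_fn` is a factory that runs once per connection and `service_fn` runs once per request; each `move` closure must own its own handle. The closures below stand in for those two layers, and the config name is illustrative.

```rust
// Minimal stand-in for the hyper 0.14 two-layer closure pattern:
// clone once into the per-connection factory, again into the
// per-request closure, so each level owns its own Arc handle.

use std::sync::Arc;

fn main() {
    let shared_config = Arc::new(String::from("endpoint-config"));

    // Outer closure = per-connection factory (stand-in for make_service_fn).
    let config = shared_config.clone();
    let make_svc = move || {
        // Inner clone: each per-request closure (stand-in for service_fn)
        // takes ownership of its own handle, so the factory can be reused.
        let config = config.clone();
        move || format!("handling request with {config}")
    };

    let svc_a = make_svc();
    let svc_b = make_svc(); // factory stays usable because it kept its own clone
    assert_eq!(svc_a(), "handling request with endpoint-config");
    assert_eq!(svc_b(), "handling request with endpoint-config");
    // original + factory's handle + one per service closure
    assert_eq!(Arc::strong_count(&shared_config), 4);
}
```

Only cheap `Arc` reference counts are cloned, not the underlying data, which is why this pattern is tolerable despite the noise.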

Comment on lines +42 to +48
if let Some(response) = http_utils::verify_request_content_length(
&parts.headers,
MAX_CONTENT_LENGTH,
"Error processing trace stats",
) {
return response;
}
Contributor
Same comment as I left on the other file.

pub serverless_flush_strategy: FlushStrategy,
pub trace_enabled: bool,
pub serverless_trace_enabled: bool,
pub capture_lambda_payload: bool,
Contributor
I was going to ask whether we should remove this, but then I remembered that we are currently targeting Node + Python.

Contributor Author
Yeah, we need all of these.

Contributor

@duncanista duncanista left a comment
Left some comments – impressive work as always! 🔥

Contributor

@duncanista duncanista left a comment
LGTM – :shipit:

@astuyve astuyve merged commit a95e0a1 into main Jun 28, 2024
@astuyve astuyve deleted the aj/add-trace-agent branch June 28, 2024 15:29
lym953 added a commit that referenced this pull request Oct 21, 2025
# This PR
Change the trace request limit from 2 MiB to 50 MiB.

# Motivation
When the Node.js tracer layer sends the Lambda extension a request that's
between 2 MiB and 50 MiB, the extension closes the HTTP connection, and the
tracer gets an `EPIPE` error and breaks. (Maybe the tracer should handle
the error better, but that's out of scope for this PR.)

According to @rochdev:
> the agent is supposed to have a limit of 50mb

So let's change the limit on the agent side to match that expectation.

# Testing
Tested with a Node.js 22 Lambda with this handler:
```js
import tracer from 'dd-trace';
import crypto from 'crypto';
tracer.init();

function randomGarbage(len) {
  // low-compressibility payload (random bytes -> base64)
  return crypto.randomBytes(len).toString('base64');
}

export const handler = async (event) => {
  const SPANS = 3000;
  const TAG_BYTES_PER_SPAN = 20_000; // ~20 KB per span tag (base64 expands a bit)

  const root = tracer.startSpan('repro.root');
  root.setTag('dd.repro', 'true');

  for (let i = 0; i < SPANS; i++) {
    console.log(`Sending the ${i}-th span`);
    const span = tracer.startSpan('repro.child', { childOf: root });
    span.setTag('blob', randomGarbage(TAG_BYTES_PER_SPAN));
    span.finish();
  }
  root.finish();

  const response = {
    statusCode: 200,
    body: JSON.stringify('Hello from Lambda!'),
  };
  return response;
};
```

### Before:
There are errors like:
```
Error: write EPIPE
at WriteWrap.onWriteComplete [as oncomplete] (node:internal/stream_base_commons:95:16)
at WriteWrap.callbackTrampoline (node:internal/async_hooks:130:17)
```
```
LAMBDA_RUNTIME Failed to post handler success response. Http response code: 403. {"errorMessage":"State transition from Ready to InvocationErrorResponse failed for runtime. Error: State transition is not allowed","errorType":"InvalidStateTransition"}
```

### After
When Lambda's memory is 1024 MB, the error no longer happens.
When Lambda's memory is 512 MB, the invocation can fail due to OOM, but
I think that's a legitimate error. We can ask customers to increase the
memory limit for high-volume workloads like this.

# Notes
cc @astuyve who set a `MAX_CONTENT_LENGTH` of 10 MiB in
#294. This PR
increases it to 50 MiB as well.

Thanks @dougqh @duncanista @lucaspimentel @rochdev for discussion.

#899
Jira: https://datadoghq.atlassian.net/browse/SVLS-7777