Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Propagate trace context from SQS events #142

Merged
merged 9 commits into from
Sep 14, 2023
Merged

Conversation

purple4reina
Copy link
Contributor

@purple4reina purple4reina commented Sep 11, 2023

What does this PR do?

When this lambda function is a consumer for a SQS queue, collect the trace context passed back from the extension and apply it.

Motivation

Traces with a go SQS consumer were broken.

Testing Guidelines

Sample trace using these updates

Screenshot 2023-09-12 at 12 21 13 PM

Note that any child span created inside the lambda function currently will not appear in the trace. This is because the extension isn't reading sampling priority and therefore the tracer defaults to priority 0 (throw it away). See https://datadoghq.atlassian.net/browse/SVLS-3537.

Additional Notes

For some reason I am not seeing the aws.lambda span created for my sqs consumer. This should be created in the extension when universal instrumentation is turned on. I haven't been able to get to the bottom of this. However, I have been able to show that any child spans created in the SQS comsumer lambda will have the proper trace ID and show up properly in the trace.

Types of changes

  • Bug fix
  • New feature
  • Breaking change
  • Misc (docs, refactoring, dependency upgrade, etc.)

Checklist

  • This PR's description is comprehensive
  • This PR contains breaking changes that are documented in the description
  • This PR introduces new APIs or parameters that are documented and unlikely to change in the foreseeable future
  • This PR impacts documentation, and it has been updated (or a ticket has been logged)
  • This PR's changes are covered by the automated tests
  • This PR collects user input/sensitive content into Datadog

@purple4reina purple4reina requested a review from a team as a code owner September 11, 2023 17:41
parentId := traceId
if pid := response.Header.Get(string(DdParentId)); pid != "" {
parentId = pid
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the parent id is not set, later on we will throw out the found trace context completely.

@@ -126,21 +127,36 @@ func createDummySubsegmentForXrayConverter(ctx context.Context, traceCtx TraceCo
return nil
}

func getTraceContext(context map[string]string) (TraceContext, bool) {
func getTraceContext(ctx context.Context, headers map[string]string) (TraceContext, bool) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When the extension is in charge of parsing for trace context, it is included in the response from calling start invocation. It is then added to the context object. If present, it therefore should be read in case there is no trace context found in the event payload headers.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

by event payload headers, you mean the actual event that triggered the go lambda right?
Just wanna make sure I understand right, so we get trace context by either pinging the extension here which populates ctx

and headers could have trace context via some event such as api gw like this?

So to be sure we extract and attach context we do check both ways.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is exactly correct. Reading the "event payload headers" is hold over from before we added universal instrumentation. We could potentially get rid of it completely and just rely on the extension to do this work for us.

@@ -64,6 +64,10 @@ func (l *Listener) HandlerStarted(ctx context.Context, msg json.RawMessage) cont
return ctx
}

if l.universalInstrumentation && l.extensionManager.IsExtensionRunning() {
ctx = l.extensionManager.SendStartInvocationRequest(ctx, msg)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Calling SendStartInvocationRequest gives the extension the event payload. It then parses it for trace context, creates any inferred spans, and returns the trace context. Therefore, we need to get this trace context from the extension before starting the execution span here in the function.

@@ -149,7 +149,9 @@ func startFunctionExecutionSpan(ctx context.Context, mergeXrayTraces bool, isDdS
span.SetTag("_dd.parent_source", "xray")
}

return span
ctx = context.WithValue(ctx, extension.DdSpanId, fmt.Sprint(span.Context().SpanID()))
Copy link
Contributor Author

@purple4reina purple4reina Sep 11, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add the span id to the context for use later when calling end invocation on the extension. (see

req.Header.Set(string(DdSpanId), spanId)
)

The extension uses this span id to replace the span id of the aws.lambda span that it creates.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Exactly right. You'll notice that the tracer and the extension are creating aws.lambda spans. However, the extension knows to throw out the one made in the tracer. The extension will ensure that both of these aws.lambda spans have the same trace context (span id, trace, id, parent id).

We do it this way because we want to make sure the tracer has access to the aws.lambda span as the parent of any spans it creates itself.

@codecov-commenter
Copy link

codecov-commenter commented Sep 11, 2023

Codecov Report

Merging #142 (7ba9495) into main (ba44882) will increase coverage by 0.46%.
Report is 2 commits behind head on main.
The diff coverage is 88.46%.

❗ Current head 7ba9495 differs from pull request most recent head e44935f. Consider uploading reports for the commit e44935f to get more accurate results

@@            Coverage Diff             @@
##             main     #142      +/-   ##
==========================================
+ Coverage   83.13%   83.60%   +0.46%     
==========================================
  Files          13       13              
  Lines         848      860      +12     
==========================================
+ Hits          705      719      +14     
+ Misses        114      113       -1     
+ Partials       29       28       -1     
Files Changed Coverage Δ
internal/trace/listener.go 54.79% <50.00%> (+0.62%) ⬆️
internal/extension/extension.go 80.76% <100.00%> (+0.50%) ⬆️
internal/trace/context.go 80.00% <100.00%> (+2.90%) ⬆️

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

@zARODz11z
Copy link
Contributor

It seems your description sc shows the aws.lambda consumer span. Assuming the description is just out of date? if so np, just want to make sure I understand

For some reason I am not seeing the aws.lambda span created for my sqs consumer.

@purple4reina
Copy link
Contributor Author

It seems your description sc shows the aws.lambda consumer span. Assuming the description is just out of date? if so np, just want to make sure I understand

For some reason I am not seeing the aws.lambda span created for my sqs consumer.

Yes, great catch! I did get to the bottom of why I wasn't seeing the aws.lambda consumer span. It's due to sampling priority getting set to 0. I took that screenshot after putting in a hardcoded hacky fix.

@zARODz11z
Copy link
Contributor

np doing a full review now :)

Copy link
Contributor

@zARODz11z zARODz11z left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few questions, but everything looks good

@purple4reina purple4reina merged commit 74bb732 into main Sep 14, 2023
8 checks passed
@purple4reina purple4reina deleted the rey.abolofia/sqs-consumer branch September 14, 2023 17:05
peterdeme pushed a commit to spacelift-io/datadog-lambda-go that referenced this pull request Nov 15, 2023
* Default parent id to be trace id if not found elsewhere.

* Look for trace context in context object as well as headers.

* Apply trace context before starting the function execution span.

* Update signature in tests.

* Add spanid of execution span to context.

* Do not ignore priority "-128".

* Test that default parent id set to trace id.

* Test span id added to context.

* Test uses trace context from context object.
peterdeme added a commit to spacelift-io/datadog-lambda-go that referenced this pull request Dec 4, 2023
* Create codeql-analysis.yml (DataDog#100)

* Create codeql-analysis.yml

* Update codeql-analysis.yml

* Update run_integration_tests.sh

* Do not show error messages even if neither DD_API_KEY nor DD_KMS_API_KEY is set when Lambda Extension is running (DataDog#102)

* Bump version to 1.4.0

* Bump go + fasthttp + lint (DataDog#104)

* Consolidate serverless configurations into one place (DataDog#105)

* Update README.md

* Update README.md

* Bump dd-trace-go to latest version to address some vulnerabilities (DataDog#109)

* Bump dd-trace-go to latest version to address some vulnaribilities
* update go.sum with `go mod tidy`

* Bump version to 1.6.0

* bump codeql (DataDog#112)

* Bump dd-trace-go to v1.41 (DataDog#115)

* Bump version to 1.7.0

* [SLS-2330] Add support for universal instrumentation with the extension (DataDog#116)

add option to use universal instrumentation

* [EEP-444] include error in failed metric send log (DataDog#118)

Co-authored-by: Corey Griffin <CoreyGriffin@users.noreply.github.com>

* [SLS-2492] Upgrade aws sdk v2 (DataDog#113)

upgrade sdk

* Bump version to 1.8.0

* Use new account in integration tests (DataDog#119)

* set the architecture explicitely (DataDog#122)

* mask init runtime logs (DataDog#123)

* Update libs (DataDog#121)

* bump go 1.18 (DataDog#125)

* Retry sending trace payloads on failure. (DataDog#128)

* Bump version to 1.9.0

* Update DD Trace to  v1.51.0(DataDog#133)

* Bump go version to 1.20 (DataDog#140)

Bump go version to 1.20

* Upgrade version of dd-trace-go to v1.54.1 (DataDog#141)

* Bump version to 1.10.0

* Propagate trace context from SQS events (DataDog#142)

* Default parent id to be trace id if not found elsewhere.

* Look for trace context in context object as well as headers.

* Apply trace context before starting the function execution span.

* Update signature in tests.

* Add spanid of execution span to context.

* Do not ignore priority "-128".

* Test that default parent id set to trace id.

* Test span id added to context.

* Test uses trace context from context object.

* Bump version to 1.11.0

* feat: automate AppSec enablement setup (e.g: `AWS_LAMBDA_RUNTIME_API`) (DataDog#143)

* feat: honor AWS_LAMBDA_EXEC_WRAPPER when AWS Lambda does not

In order to simplify onboarding & make it more uniform across languages,
inspect the value of the `AWS_LAMBDA_EXEC_WRAPPER` environment variable
and apply select environment variable changes it perofrms upon
decorating a handler.

This is necessary/useful because that environment variable is not
honored by custom runtimes (`provided`, `provided.al2`) as well as the
`go1.x` runtime (which is a glorified provided runtime). The datadog
Lambda wrapper starts a proxy to inject ASM functionality directly on
the Lambda runtime API instead of having to manually instrument each and
every lambda handler/application, and modifies `AWS_LAMBDA_RUNTIME_API`
to instruct Lambda language runtime client libraries to go through it
instead of directly interacting with the Lambda control plane.

APPSEC-11534

* pivot to a different, cheaper strategy

* typo fix

* PR feedback

* minor fixups

* add warning in go1.x runtime if lambda.norpc build tag was not enabled

* Bump version to 1.12.0

* Re-add configs after upstream rebase

* Bump packages

* Remove deprecated `io/ioutil` calls

---------

Co-authored-by: Tian Chu <tian.chu@datadoghq.com>
Co-authored-by: Soshi Katsuta <skatsuta@users.noreply.github.com>
Co-authored-by: Maxime David <maxime.david@datadoghq.com>
Co-authored-by: kimi <47579703+kimi-p@users.noreply.github.com>
Co-authored-by: Kimi Wu <kimi.wu@datadoghq.com>
Co-authored-by: Dylan Yang <dylan.yang@datadoghq.com>
Co-authored-by: Corey Griffin <15809365+CoreyGriffin@users.noreply.github.com>
Co-authored-by: Corey Griffin <CoreyGriffin@users.noreply.github.com>
Co-authored-by: Marcin Rabenda <xrn.design@gmail.com>
Co-authored-by: Rey Abolofia <purple4reina@gmail.com>
Co-authored-by: Rey Abolofia <rey.abolofia@datadoghq.com>
Co-authored-by: Andrew Rodriguez <49878080+zARODz11z@users.noreply.github.com>
Co-authored-by: Ivan Topolcic <IvanTopolcic@users.noreply.github.com>
Co-authored-by: Romain Marcadier <romain.muller@telecomnancy.net>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants