Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

The new tracing enhancement seems to break filtered SNS/SQS subscriptions #232

Closed
dls314 opened this issue Feb 24, 2022 · 23 comments · Fixed by DataDog/datadog-lambda-js#269
Assignees

Comments

@dls314
Copy link
Contributor

dls314 commented Feb 24, 2022

Expected Behavior

Using the new tracing feature doesn't change how my system was working; even if it uses SNS/SQS message filtering

Actual Behavior

When the plugin / lambda layer adds in the SNS message attribute that enables tracing, existing SNS/SQS message filters don't seem to deliver the messages.

Steps to Reproduce the Problem

  1. Using some SNS/SQS filters that worked when using plugin version 3.4.0 and lambda layer version Datadog-Node12-x:66

  2. Update and re-build using plugin version 3.7.1 and lambda layer version Datadog-Node14-x:70

  3. Subscribe something non-filtered to the SNS topic (like an email address)

  4. Run some tests, see non-delivery to SQS that is filtered and delivery to email that is unfiltered and includes the _datadog message attribute

  5. (to confirm) downgrade plugin and re-build/deploy (in progress ). Downgrading the serverless-plugin-datadog to 3.6.0 fixed the issue for me

Specifications

  • Serverless Framework version: 2.30.3
  • Datadog Serverless Plugin version: 3.7.1
    • Datadog Lambda Layer version: Datadog-Node14-x:70
  • Lambda function runtime (Python 3.7, Node 10, etc.): Node 14.x

Stacktrace

No errors are shown, SNS isn't delivering because of something in the filtering. I'm not certain that you're able to put structured text into the SNS string message attribute in the way that this change is doing

I see a message with an attribute like this (taken from a JSON-email delivery, the escaping may be an artifact of that)

"MessageAttributes" : {
    "_datadog" : {"Type":"String","Value":"{\"x-datadog-trace-id\":\"--omitted--\",\"x-datadog-parent-id\":\"--omitted--\",\"x-datadog-sampling-priority\":\"1\",\"x-datadog-tags\":\"_dd.p.upstream_services=--omitted--\"}"},
    // other attributes are omitted here, but present in my testing
 }
@astuyve
Copy link
Contributor

astuyve commented Feb 24, 2022

Hi @dls314 - unfortunately MessageAttributes is our only option to pass trace context via SNS.

If you're using topic filters and Datadog, please try adding _datadog to the filter policy and try again with 3.7.x.
If you're working in an environment with Serverless Offline or other emulation, or are subscribing to a topic with messages coming from traced and non-traced functions, unfortunately you'll need to add two filter policies to your subscription.

This is a limitation of AWS (as documented here: https://docs.aws.amazon.com/sns/latest/dg/and-or-logic.html)
image

We will update our documentation to reflect this.

@astuyve astuyve self-assigned this Feb 24, 2022
@dls314
Copy link
Contributor Author

dls314 commented Feb 24, 2022

I have other message attributes that don't appear in my normal filters, so adding an existence filter on the _datadaog attribute seems odd, but I'll try it after seeing if downgrading to 3.6.0 gets me unstuck.

Do you have a CFN filter syntax snippet that you can share?

@astuyve
Copy link
Contributor

astuyve commented Feb 24, 2022

Can you share your current filters?

Here's a example:

{
  "type": [
    {
      "prefix: "my-prefix"
    }
  ],
  "_datadog": [
    {
      "exists: true
    }
  ]
}

@dls314
Copy link
Contributor Author

dls314 commented Feb 24, 2022

Usually I use a filter policy that matches 2 out of the 5 attributes on a message, that one looks something like this

        "FilterPolicy": {
          "schemaName": [
            "AnEventSchemaName"
          ],
          "schemaVersion": [
            "1"
          ]
        },

Messages that successfully pass that filter also have other attributes, and before the stringified JSON in the new _datadog property, this wasn't an issue.

Do you know if there's something special about the use of the stringified JSON that requires the addition of the exists filter for _datadog?

:-) This feels like a bug special case somewhere in a deserializer in SNS message filtering that's taking a different path when it hits a stringified JSON in the message attributes than it takes for simple strings

@astuyve
Copy link
Contributor

astuyve commented Feb 24, 2022

Gah! Okay, I'm sorry about that. We're working on ensuring our documentation is complete for trace propagation via MessageAttributes, including the workaround I described here. Until then I intend to leave this issue open so others might find it.

Again, I apologize for the inconvenience. I do hope the value of completed traces across sns and sqs is worthwhile - here is what it looks like completed:
image

@dls314
Copy link
Contributor Author

dls314 commented Feb 24, 2022

It's definitely worthwhile -- I wonder if "not working without an explicit filter" is AWS's plan here? I've opened a support ticket to ask that :-)

@astuyve
Copy link
Contributor

astuyve commented Feb 24, 2022

I wonder if "not working without an explicit filter" is AWS's plan here?
That's a great question, which we'll also follow up on internally.

Meanwhile, please do confirm if allowing _datadog does work for your application with 3.7.x.

Thanks again!

@dls314
Copy link
Contributor Author

dls314 commented Feb 24, 2022

screen

Maybe the _datadog attribute could be a binary type? That might get it ignored. Also the limit of 5 filters is going to hit me in some places if I add one for the _datadog attribute

https://docs.aws.amazon.com/sns/latest/dg/subscription-filter-policy-constraints.html

@astuyve
Copy link
Contributor

astuyve commented Feb 25, 2022

I see, it's interesting that there is a limit of 5 filters, when the limit for the number of message attributes is 10.

We could explore using a binary data type in this context.

@dls314
Copy link
Contributor Author

dls314 commented Feb 25, 2022

After looking at our use cases, I'm sorry, but we can't add an exists filter on the _datadog message attribute in most of our use cases.

In most places, our SNS topics are published to by some publishers that do come from lambda functions instrumented by the Datadog lambda layer, but those same topics are also published to by other publishers that may/do not use the Datadog lambda layer (e.g. directly from Lambda step functions, from test automation, etc).

I'll complete some testing to see if the exists filter on the _datadog message attribute works as mentioned in the thread here, but in the end, this is not an answer I can use.

If the binary data type doesn't work out, could the plugin add a flag to disable the new SNS tracing so I can stay on the latest plugin versions instead of pinning?

@dls314
Copy link
Contributor Author

dls314 commented Feb 25, 2022

I know this is not the right place (but the discussion is open) to discuss the dd-trace changes, but I'm not sure that using only the first message in a batch publication as the injectPath is going to work out over here: https://github.com/DataDog/dd-trace-js/blob/master/packages/datadog-plugin-aws-sdk/src/services/sns.js#L27

With filtering, some messages might be handled by one subscriber and some by another -- if only the first message in a batch is traced, there will be some holes.

@astuyve
Copy link
Contributor

astuyve commented Feb 25, 2022

After looking at our use cases, I'm sorry, but we can't add an exists filter on the _datadog message attribute in most of our use cases.
In most places, our SNS topics are published to by some publishers that do come from lambda functions instrumented by the Datadog lambda layer, but those same topics are also published to by other publishers that may/do not use the Datadog lambda layer (e.g. directly from Lambda step functions, from test automation, etc).

Right, as I pointed out above if you're in a situation where some messages have datadog trace context and others don't, the solution from AWS is to create two subscriptions pointed towards your function, one with a filter including the _datadog key, and one not.

There is a limit of 200 total filter policies per AWS account, but that can be increased via a quota request.

I know this is not the right place (but the discussion is open) to discuss the dd-trace changes, but I'm not sure that using only the first message in a batch publication as the injectPath is going to work out over here: https://github.com/DataDog/dd-trace-js/blob/master/packages/datadog-plugin-aws-sdk/src/services/sns.js#L27

Yes this was subject to significant internal debate and ultimately decided against to limit overhead.

We can discuss a solution where multiple messages could be instrumented, or perhaps could also provide these methods as helpers for users to inject trace context themselves (which is still an option)

@dls314
Copy link
Contributor Author

dls314 commented Feb 25, 2022

I was able to test the suggested change of adding an existence filter for the _datadog message attribute, and it hasn't helped. Messages are still not delivered through that filter.

It looks like 3.6.0 was before this feature, so for the moment I have to pin back there everywhere that I'm using message attributes and filters today.

I hope that a change to the Binary data type for the _datadog attribute is something that can work for dd-trace; will try again

@astuyve
Copy link
Contributor

astuyve commented Feb 28, 2022

Hi @dls314 - I've replicated this issue, it's baffling. Please bear with me as we seek clarity from AWS.

Unfortunately binary did not work in my tests.
Binary did work, we're going to move to that

@dls314
Copy link
Contributor Author

dls314 commented Feb 28, 2022

Unfortunately binary did not work in my tests.

That's unfortunate. I don't know if dd-trace has the hooks to de/serialize attribute content, but if so, and there's no better answer from AWS, maybe a base64 encode of the JSON would pass as a string or binary value.

For working around this issue,

Is version 3.6.0 of the plugin the last version before the change (downgrading there does work for me)?
Is version 69 of the lambda layer the last version before the change?
Is version 2.0.1 of dd-trace (https://github.com/DataDog/dd-trace-js/releases/tag/v2.0.1) the last version there before the change (I have some places where I'm not relying on plugin or the lambda layer)?

Is there anything else that should be pinned back?

@astuyve
Copy link
Contributor

astuyve commented Feb 28, 2022

Yes, your versions are all correct.

I've been able to get this to work using a base64 encoded string, which we'll utilize if we get no answer from AWS going forward.
Once again, thank you very much for your detailed report. I'll update this issue as I learn more, and will notify you when a new version is released.

@astuyve
Copy link
Contributor

astuyve commented Mar 1, 2022

Hi @dls314 - quick update here. I spent yesterday constructing a minimally reproducible example of this bug, and sent it to AWS https://github.com/astuyve/sns-filters.

The initial response from AWS Support is that they were able to confirm this is occurring, and that it is a bug.
We're hoping they can fix it quickly.

Regardless of if they fix this issue, we're going to b64 encode the trace context and pass it as binary instead, which should prevent the need for any filter changes.

Thanks again for your patience and detailed report!

@astuyve
Copy link
Contributor

astuyve commented Mar 3, 2022

Hi @dls314, @nvinchhi-extron, @gbo-actual, and anyone else watching this issue.

I've just released a new version of this library which handles trace context via b64 encoded strings in the Binary. Its version 5.72.0, and Layer version 72.
Please do give this a shot and let me know on this issue if it's resolved for you, as I can't wait for AWS to fix this on their side.

Thanks!

@dls314
Copy link
Contributor Author

dls314 commented Mar 4, 2022

...Layer version 72. Please do give this a shot and let me know on this issue if it's resolved for you....

Testing with arn:aws:lambda:us-east-1:464622532012:layer:Datadog-Node14-x:72, and I think the issue was resolved for me -- I had to apply the layer by hand here since the serverless plugin isn't updated yet. My tests pass, but I need to make sure that it was actually trying to send the message attributes (I think I'm using dd-trace from the lambda layer, but still making sure).

@astuyve
Copy link
Contributor

astuyve commented Mar 4, 2022

@dls314 Sounds good, I'll publish a new version of the datadog plugin today!

@dls314
Copy link
Contributor Author

dls314 commented Mar 4, 2022

It's working

image

@colesiegel
Copy link

colesiegel commented Sep 29, 2022

Gah! Okay, I'm sorry about that. We're working on ensuring our documentation is complete for trace propagation via MessageAttributes, including the workaround I described here. Until then I intend to leave this issue open so others might find it.

Again, I apologize for the inconvenience. I do hope the value of completed traces across sns and sqs is worthwhile - here is what it looks like completed: image

Hey there @astuyve, are there any demo examples of the sqs/sns trace context propagation with serverless-plugin-datadog? From what I can see in the documentation, this should all happen automatically assuming the plugin is installed, but I don't see any _datadog section appended to the MessageAttributes when I publish to an SNS topic. It's tough to know where I've made a mistake because I can't find any examples, only documentation notes stating that it should happen out of the box. Are there specific configurations/settings required in the plugin to support this feature?

@astuyve
Copy link
Contributor

astuyve commented Oct 3, 2022

Hi @colesiegel - No configuration changes should be required, but I'd recommend enabling debug logger for the tracer via DD_TRACE_DEBUG: true and examining the logs for the reason why we may not be able to inject trace context.

Here's the logic (for nodejs) which injects context into sns: https://github.com/DataDog/dd-trace-js/blob/master/packages/datadog-plugin-aws-sdk/src/services/sns.js#L20-L47 as you can see there are a couple of cases (eg: your sns message already has 10 MessageAttribute items) where we won't inject context.

If you're still experiencing this, please open a new issue with a minimally reproducible example so that we can triage and give more direct support.

Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants