Getting errors after trying to deploy Node.js lambdas with arm architecture #110

Closed
vvo opened this issue Jun 10, 2022 · 26 comments

vvo commented Jun 10, 2022

Expected Behavior

Selecting the arm architecture for a lambda should not fail the runtime.

Actual Behavior

When trying to deploy to the arm architecture, I got errors about the Datadog agent not being able to start.

Steps to Reproduce the Problem

  1. Add architecture: cdk.aws_lambda.Architecture.ARM_64, to a cdk.aws_lambda_nodejs.NodejsFunction
  2. cdk deploy
  3. runtime fails with:
RequestId: b6d7748b-1ec0-42dc-8cf6-513e295003d7 Error: fork/exec /opt/extensions/datadog-agent: exec format error
Extension.LaunchError
EXTENSION	Name: datadog-agent	State: LaunchError	Events: []	Error Type: UnknownError
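
For context, "exec format error" is what Linux reports when it is asked to execute a binary in a format it cannot run, most commonly one built for a different CPU architecture. A minimal sketch of how one could check which architecture an ELF binary (such as an extension unpacked from a layer) targets is below. The helper name is hypothetical and the header is synthetic; it is not part of any Datadog or AWS tooling.

```javascript
// Minimal sketch: report the CPU architecture an ELF binary was built for.
// Field offsets follow the ELF header layout; the helper name is hypothetical.
const ELF_MACHINES = { 0x3e: "x86_64", 0xb7: "arm64" };

function elfArchitecture(header) {
  // Every ELF file starts with the magic bytes 0x7f 'E' 'L' 'F'.
  if (header.length < 20 || header.readUInt32BE(0) !== 0x7f454c46) {
    throw new Error("not an ELF binary");
  }
  // Byte 5 (EI_DATA) gives the byte order: 1 = little-endian, 2 = big-endian.
  const littleEndian = header[5] === 1;
  // e_machine is a 16-bit field at offset 18.
  const machine = littleEndian ? header.readUInt16LE(18) : header.readUInt16BE(18);
  return ELF_MACHINES[machine] ?? `unknown (0x${machine.toString(16)})`;
}

// Usage against a synthetic 20-byte header. A real check would instead read
// the first 20 bytes of the extension binary from the unpacked layer.
const fakeArmHeader = Buffer.alloc(20);
fakeArmHeader.writeUInt32BE(0x7f454c46, 0);
fakeArmHeader[5] = 1;
fakeArmHeader.writeUInt16LE(0xb7, 18);
console.log(elfArchitecture(fakeArmHeader)); // arm64
```

If the value reported for the extension binary differs from the architecture of the sandbox that tries to run it, the launch fails in the way shown in the log above.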

Deployed arns:

  • arn:aws:lambda:us-east-2:464622532012:layer:Datadog-Node14-x:76
  • arn:aws:lambda:us-east-2:464622532012:layer:Datadog-Extension-ARM:21

npm list (ran on my machine):

> npm list
├─┬ [redacted-cdk-project-name]@0.0.0 -> ./services/[redacted-cdk-project-name]
│ ├── @aws-sdk/client-cloudwatch@3.100.0
│ ├── @aws-sdk/client-kinesis@3.100.0
│ ├── @aws-sdk/client-lambda@3.100.0
│ ├── @aws-sdk/client-s3@3.100.0
│ ├── @swc/core@1.2.182
│ ├── @swc/helpers@0.3.13
│ ├── @swc/jest@0.2.21
│ ├── @types/aws-lambda@8.10.97
│ ├── aws-cdk-lib@2.21.1
│ ├── aws-cdk@2.21.1
│ ├── constructs@10.0.126
│ ├── datadog-cdk-constructs-v2@0.2.0
│ ├── datadog-lambda-js@5.76.0
│ ├── dd-trace@2.5.0
│ ├── esbuild@0.14.36
│ ├── get-stream@6.0.1
│ ├── jest@28.1.0
│ ├── reflect-metadata@0.1.13
│ └── source-map-support@0.5.21
└── yaml@1.10.2 extraneous

Guess: because I list dd-trace in my own package.json, and because cdk deploy ran on an x86_64 machine (a GitHub Action), the agent that was loaded was not arm compatible?

Thanks!

Contributor

astuyve commented Jun 10, 2022

Hi @vvo! Thanks for reaching out. I'm going to chat with the team and try to reproduce, and then I'll get back to you.

astuyve self-assigned this Jun 10, 2022

Contributor

astuyve commented Jun 10, 2022

Hi @vvo can you verify in the lambda console that the architecture is correctly set to ARM/Graviton? The layer you've provided for the extension is indeed an ARM-specific build (arn:aws:lambda:us-east-2:464622532012:layer:Datadog-Extension-ARM:21), but I'm wondering if somehow the lambda function got stuck on x86?

Author

vvo commented Jun 11, 2022

@astuyve We had to revert so I can't check anymore. But here is an excerpt of the cdk synth that was done in GH action right before doing the deploy:

  [redacted]:
    Type: AWS::Lambda::Function
    Properties:
      Architectures:
        - arm64
      Handler: /opt/nodejs/node_modules/datadog-lambda-js/handler.handler
      Layers:
        - arn:aws:lambda:us-east-2:464622532012:layer:Datadog-Node14-x:76
        - arn:aws:lambda:us-east-2:464622532012:layer:Datadog-Extension-ARM:21
      MemorySize: 1024
      Runtime: nodejs14.x

So yes it was deployed with arm64. But again, from GH action (x86), and a project listing datadog js dependencies in package.json.
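
A check like the one done by eye on the excerpt above can be automated before deploying. The following is a rough sketch that assumes the synthesized template has already been parsed into a plain object. The "-ARM" suffix convention is inferred from the layer ARN in this thread, and the function name and return shape are hypothetical.

```javascript
// Sketch of a pre-deploy check on a synthesized template (already parsed into
// an object). Flags functions whose Datadog extension layer does not match the
// declared architecture. Names and return shape are hypothetical.
function findArchitectureMismatches(template) {
  const mismatches = [];
  for (const [logicalId, resource] of Object.entries(template.Resources ?? {})) {
    if (resource.Type !== "AWS::Lambda::Function") continue;
    const props = resource.Properties ?? {};
    // Lambda defaults to x86_64 when Architectures is omitted.
    const isArm = (props.Architectures ?? ["x86_64"]).includes("arm64");
    for (const layer of props.Layers ?? []) {
      if (typeof layer !== "string") continue; // skip Ref / Fn::Sub intrinsics
      // arn:aws:lambda:<region>:<account>:layer:<name>:<version>
      const layerName = layer.split(":")[6] ?? "";
      if (!layerName.startsWith("Datadog-Extension")) continue;
      if (layerName.endsWith("-ARM") !== isArm) mismatches.push({ logicalId, layer });
    }
  }
  return mismatches;
}

// The function from the excerpt above, with a placeholder logical ID.
const template = {
  Resources: {
    Fn: {
      Type: "AWS::Lambda::Function",
      Properties: {
        Architectures: ["arm64"],
        Layers: [
          "arn:aws:lambda:us-east-2:464622532012:layer:Datadog-Node14-x:76",
          "arn:aws:lambda:us-east-2:464622532012:layer:Datadog-Extension-ARM:21",
        ],
      },
    },
  },
};
console.log(findArchitectureMismatches(template)); // []
```

An empty result only says the template is internally consistent; it says nothing about what Lambda serves while an update is still in flight.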

Author

vvo commented Jun 13, 2022

Another bit of info: my cdk esbuild lambda settings include:

    externalModules: ['graphql/*', 'datadog-lambda-js', 'dd-trace'],

When you use the CDK constructs it is said the Datadog js agent will be installed/launched automatically. But if you need to do import { sendDistributionMetric } from 'datadog-lambda-js'; in your own code, then your only way is to add the package in package.json and then ignore it in esbuild. Otherwise you can't even create unit tests.

But I sense this is causing more harm than good maybe? Thanks!

Contributor

astuyve commented Jun 13, 2022

Hi @vvo - The agent and datadog-lambda-js are separate (and using it with webpack has some workaround).

The Agent is deployed via Lambda Extensions automatically when the layer is added; it's completely separate from datadog-lambda-js as well as dd-trace. The Agent is what seems to be crashing here, but I still suspect some kind of architecture mismatch.

Unfortunately I haven't been able to reproduce this on ARM, here's my service definition:

const cdk = require("aws-cdk-lib");
const { Construct } = require("constructs");
const apigateway = require("aws-cdk-lib/aws-apigateway");
const lambda = require("aws-cdk-lib/aws-lambda");
const s3 = require("aws-cdk-lib/aws-s3");
const { Datadog } = require('datadog-cdk-constructs-v2');

class WidgetService extends Construct {
  constructor(scope, id) {
    super(scope, id);

    const bucket = new s3.Bucket(this, "WidgetStore");

    const handler = new lambda.Function(this, "WidgetHandler", {
      runtime: lambda.Runtime.NODEJS_14_X,
      code: lambda.Code.fromAsset("resources"),
      architecture: lambda.Architecture.ARM_64,
      handler: "widgets.main",
      environment: {
        BUCKET: bucket.bucketName
      }
    });

    bucket.grantReadWrite(handler); // was: handler.role);

    const api = new apigateway.RestApi(this, "widgets-api", {
      restApiName: "Widget Service",
      description: "This service serves widgets."
    });

    const getWidgetsIntegration = new apigateway.LambdaIntegration(handler, {
      requestTemplates: { "application/json": '{ "statusCode": "200" }' }
    });

    api.root.addMethod("GET", getWidgetsIntegration); // GET /

    const datadog = new Datadog(this, "Datadog", {
      nodeLayerVersion: 77,
      addLayers: true,
      extensionLayerVersion: "21",
      apiKey: "<api key>", // placeholder: replace with your Datadog API key
    });
    datadog.addLambdaFunctions([handler]);
  }
}

module.exports = { WidgetService }

And to confirm, here's the lambda configuration:
image

And the traces sent to datadog:
image

I was wondering if you might be able to try again, and/or perhaps share the datadog configuration you've used in your module?

Thanks!

Author

vvo commented Jun 13, 2022

Hey there, did you deploy this from an arm machine (like a mac m1) or from an amd64 machine (like github actions)? Thanks!

Author

vvo commented Jun 13, 2022

Also, did you add the dd-trace dependency directly to your package.json like I did?

Contributor

astuyve commented Jun 13, 2022

I deployed from an Intel mac, but that wouldn't impact this because this library detects architecture based on the lambda handler setting, not the machine deploying it.

I can try the dd-trace dependency directly as well, but that's highly unlikely to be the cause, given that the log was emitted from the extension itself, which is run from /opt/extensions/datadog-extension and is separate from the tracer.
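
To illustrate the point about the layer being chosen from the function's architecture setting rather than from the deploying machine, here is a small sketch. It is not the library's actual implementation: the account ID and the "Datadog-Extension-ARM" name are copied from the ARNs quoted in this thread, while the x86 layer name and the function name are assumptions.

```javascript
// Sketch only: pick the extension layer from the function's architecture.
// Nothing here inspects the machine that runs the deploy.
const DATADOG_ACCOUNT_ID = "464622532012"; // taken from the ARNs in this thread

function extensionLayerArn(region, architecture, version) {
  // Assumption: the x86 layer is named "Datadog-Extension" (no suffix).
  const layerName = architecture === "arm64" ? "Datadog-Extension-ARM" : "Datadog-Extension";
  return `arn:aws:lambda:${region}:${DATADOG_ACCOUNT_ID}:layer:${layerName}:${version}`;
}

console.log(extensionLayerArn("us-east-2", "arm64", 21));
// arn:aws:lambda:us-east-2:464622532012:layer:Datadog-Extension-ARM:21
```

Because the only input is the declared architecture, deploying from an Intel Mac or an x86 GitHub Actions runner would produce the same ARN.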

astuyve assigned vvo and unassigned astuyve Jun 21, 2022

Contributor

astuyve commented Jun 22, 2022

Hey @vvo - any luck reproducing this here? I realize my minimal example likely does not resemble your project. If you're still experiencing this issue, it might be best to move to a support ticket where we can privately collaborate on your specific project.

Thanks!

Author

vvo commented Jun 22, 2022

@astuyve I am reluctant to try to redeploy in production as this broke our production last time. My next try is to create a second lambda, set it to x86, same configuration as the failed one, then switch it to arm. If this fails then I have a reproduction. Unless you have any other insight then this is what I will try, not sure when though as I deprioritized this part for now.

Contributor

astuyve commented Jun 22, 2022

Could you test it in a staging environment, or deploy a new stack?

Did this occur when flipping a lambda function from x86 to ARM?

Author

vvo commented Jun 22, 2022

> Could you test it in a staging environment, or deploy a new stack?

When doing so, it worked (separate AWS account). But I could not reproduce the x86 => arm failures. Only production suffered this.

> Did this occur when flipping a lambda function from x86 to ARM?

Yes, it failed when moving from x86 to arm.

Contributor

astuyve commented Jun 22, 2022

Okay thanks for that context!

I'll do more experimentation, but given that neither of us can reproduce this and it only seems to impact the move from x86 to ARM in your production environment (as you cannot reproduce this error even in staging), I'm not sure how much more I can help.

From your report, it appears the CloudFormation generated by this library is correctly applying the ARM layer when the ARM architecture is set for the function. Given that this does work when tested in isolation, I'm not sure what we could change on our side. At this point, I can only conclude that something on AWS's side is causing this issue during the switch from x86 to ARM - the CloudFormation looks right.

Given that, I might suggest switching your service from x86 to ARM without Datadog installed, and then, after the function is running on ARM successfully, applying this construct in a second deploy.

If I can reproduce this issue, I'll raise a ticket internally with AWS support.

Thanks again @vvo!

astuyve assigned astuyve and unassigned vvo Jun 22, 2022

Contributor

astuyve commented Jun 22, 2022

Hi @vvo - I was able to reproduce this successfully by repeatedly hitting an endpoint while the CloudFormation deploy processed. There was a brief window (7s in my tests) where it appears that Lambda was using the ARM extension alongside the x86 lambda instance:
image

I'll follow up when I know more.
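
The reproduction approach described above (hitting an endpoint repeatedly while a deploy is in progress) can be approximated with a small probe loop. This is a sketch under assumptions, not the script used in the thread: the function and option names are hypothetical, and the example URL is a placeholder.

```javascript
// Sketch: call `probe` repeatedly and report how long failures lasted.
// `probe` is any async function that resolves on success and rejects on failure.
async function measureFailureWindow(probe, { attempts, intervalMs }) {
  let firstFailure = null;
  let lastFailure = null;
  let failures = 0;
  for (let i = 0; i < attempts; i++) {
    try {
      await probe();
    } catch (err) {
      const now = Date.now();
      if (firstFailure === null) firstFailure = now;
      lastFailure = now;
      failures++;
    }
    // Wait before the next attempt so the endpoint is sampled at a steady rate.
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
  return {
    failures,
    // Time between the first and last failed attempt, 0 if nothing failed.
    windowMs: firstFailure === null ? 0 : lastFailure - firstFailure,
  };
}

// Example probe against a placeholder URL (global fetch requires Node 18+):
// const probe = async () => {
//   const res = await fetch("https://example.invalid/widgets");
//   if (!res.ok) throw new Error(`HTTP ${res.status}`);
// };
// measureFailureWindow(probe, { attempts: 600, intervalMs: 100 }).then(console.log);
```

Started just before the deploy, a loop like this gives a rough lower bound on the failure window; the resolution is limited by the interval and by request latency.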

Author

vvo commented Jun 22, 2022

Nice finding!

Author

vvo commented Jun 27, 2022

@astuyve I see DataDog/datadog-lambda-extension#67. Is this issue resolved?

Also, maybe there's no way to resolve it given how AWS works and in this case we can just say "first remove layers, then move to arm, then re-add layers"?

Contributor

astuyve commented Jun 27, 2022

Hey @vvo - good eye! In my testing, setting the compatible architectures flag prevented this issue from occurring - but another unrelated issue prevented us from doing a new release with the flag set. My intention is to leave this issue open until a new release is made, and then I'll close it after verifying the solution.

Thanks again for sticking with us through this process!


paco-sparta commented Jul 11, 2022

I'm seeing this issue in the JVM with both public.ecr.aws/lambda/java:11-arm64 and public.ecr.aws/lambda/java:11-x86_64 while using https://dtdg.co/latest-java-tracer

EDIT: It may need to be mandatory to set the architecture flag.

EDIT2: It only works on arm64 for me.

Author

vvo commented Sep 26, 2022

Hey, @astuyve, any news on this issue? We'd like to try again moving to arm.

Contributor

astuyve commented Sep 26, 2022

Hey @vinvol - I just tried this again and unfortunately even with the compatible architecture flag set for the extension, I still get launch errors for a few seconds when flipping from x86 -> ARM:
image

I'll reach out to AWS again and see if they've got any further recommendations, but for now I'd recommend switching from the Datadog Extension to the Datadog Forwarder during the migration to avoid possible interruptions.

Contributor

astuyve commented Oct 6, 2022

My ticket with AWS has been escalated. I'll keep you updated @vinvol.

Contributor

astuyve commented Nov 2, 2022

Hi @vinvol, AWS is still reviewing this. It's unclear what the delay is, but I want to assure you that not only are they looking at it, but I was able to reproduce this bug using the Serverless Framework as well: https://github.com/astuyve/lambda-architecture-bug


thiagosanches commented Nov 3, 2022

Hi @astuyve , good afternoon. Just my two cents here...

What if you try to deploy that serverless (https://github.com/astuyve/lambda-architecture-bug) as REGIONAL on ApiGateway? You can configure it using the endpointType property (see documentation below).

Recently, we were trying to use Lambda@Edge with arm64 and AWS complained about it during the CloudFormation execution. On serverless, by default, if you do not specify the type, all API gateways are deployed as EDGE and not REGIONAL. It seems there is no support yet for arm64 on Lambda@Edge.

By default, the Serverless Framework deploys your REST API using the EDGE endpoint configuration. If you would like to use the REGIONAL or PRIVATE configuration, set the endpointType parameter in your provider block.
https://www.serverless.com/framework/docs/providers/aws/events/apigateway/

Contributor

astuyve commented Nov 3, 2022

Hi @thiagosanches!

Thanks for this note. As you mentioned, I don't believe lambda@edge supports arm64 so this issue doesn't apply in that context.

I haven't experimented with the various configurations of API Gateway within the context of this issue because it doesn't seem particularly relevant. API Gateway is correctly proxying the HTTP request to my function, which we can determine because the function itself is receiving the request and throwing an exec format error.

In addition, as you can see in the code, I'm using API Gateway V2 (also called httpAPI), which is only available regionally (docs), so the endpointType configuration wouldn't apply here.

In this case, I'm only using API Gateway as it's a convenient way to trigger the function.

I can't say much more publicly at this time, but please be assured that AWS Support is aware of this issue.

I will share more information as soon as AWS give me permission to.

Thanks!
AJ

Contributor

astuyve commented Nov 7, 2022

Hi everyone, I've been cleared by AWS to share an update from their end regarding this bug. The full quote is as follows:

It looks like the Function CFN Construction is running both UpdateFunctionConfiguration and UpdateFunctionCode in parallel. For these functions (functions using Layers attempting to switch architectures), this seems to mean that UpdateFunctionCode is completing first which results in the change in architecture (the error you are receiving indicates architecture is being changed on the code artifact first).
This fails reliably for 10 seconds when you are invoking $LATEST for each Invoke that provisioned a new sandbox on this version because each update is processed serially and cached. This means the sandbox initialized with the UpdateFunctionCode only (since you're catching the function mid update) which means the extension cannot be executed since the extension is only for x86_64 until the version of the extension is updated.
Unfortunately, If we force the ordering the calls to be in reverse, then you will be unable to change your runtime consistently and could retrigger this (and other scenarios) in the same way. E.g. When changing from nodejs14.x -> nodejs16.x you can make the code backwards compatible but not forward compatible, you need to change the code first and the runtime second. Since the CDK Construct wraps both calls, we don't have a clear mechanism to differentiate these scenarios and required ordering.
You can work around this issue by first updating the Layers to the cross runtime compatible Layer and then updating the function runtime (performing this in two updates, which is safer in any case). We would also highly recommend that if you are using SAM / Serverless Toolkit to be using Aliases and versions to work around this problem, as invoking against $LATEST is always an unsafe operation since any update to $LATEST could break your production functions. Currently there isn't a workaround we can perform from the Lambda side without breaking other use cases.

I think there are a couple of key takeaways here. Firstly - this issue is a consequence of how CloudFormation and Lambda integrate; namely using UpdateFunctionCode and UpdateFunctionConfiguration in parallel. Unfortunately that means there's nothing in this library we can fix.
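
The ordering problem in the AWS explanation quoted above can be pictured with a toy model. This is not AWS code. It only encodes two facts: the architecture setting travels with UpdateFunctionCode and the layers travel with UpdateFunctionConfiguration, and a sandbox fails to launch when the two disagree. The field and helper names are hypothetical.

```javascript
// Toy model (not AWS code): a sandbox can only start if the extension binary
// in the layer matches the architecture of the function it is attached to.
function canLaunch(fn) {
  return fn.architecture === fn.extensionArchitecture;
}

// The two halves of the deploy, named after the API calls in the quote.
// Which field each call changes follows the AWS explanation above.
const updateFunctionCode = (fn) => ({ ...fn, architecture: "arm64" });
const updateFunctionConfiguration = (fn) => ({ ...fn, extensionArchitecture: "arm64" });

const before = { architecture: "x86_64", extensionArchitecture: "x86_64" };
const midDeploy = updateFunctionCode(before); // the code update lands first
const afterDeploy = updateFunctionConfiguration(midDeploy); // layers catch up

console.log(canLaunch(before)); // true
console.log(canLaunch(midDeploy)); // false: arm64 function, x86_64 extension
console.log(canLaunch(afterDeploy)); // true
```

Any invocation that provisions a new sandbox while the function is in the middle state hits the exec format error, which matches the short failure window observed earlier in the thread.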

The best recommendation I can offer is to pin lambda function versions to your API Gateway endpoints using the Version class. The documentation is available here. As the AWS engineer mentioned, this is a best practice for Lambda-backed API endpoints regardless of the architecture migration discussion we're having here.

Using Lambda versions would allow you to deploy a new version of your Lambda function using ARM, along with the ARM version of the Datadog Extension, and then flip over to it after both UpdateFunctionCode and UpdateFunctionConfiguration have finished.
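
Why pinning to a version helps can be shown with another toy model. This is neither the AWS SDK nor CDK code, and every class and method name is hypothetical: $LATEST is mutable, a published version is an immutable snapshot, and an alias is a pointer to one published version.

```javascript
// Toy model (not the AWS SDK): callers that go through an alias only ever see
// a published snapshot, never the half-updated $LATEST.
class FunctionModel {
  constructor(config) {
    this.latest = { ...config };
    this.versions = [];
    this.aliases = {};
  }
  update(change) {
    Object.assign(this.latest, change); // only touches $LATEST
  }
  publishVersion() {
    this.versions.push(Object.freeze({ ...this.latest }));
    return this.versions.length; // version numbers start at 1
  }
  pointAlias(name, version) {
    this.aliases[name] = version;
  }
  resolve(name) {
    return this.versions[this.aliases[name] - 1];
  }
}

const fn = new FunctionModel({ architecture: "x86_64", extensionArchitecture: "x86_64" });
fn.pointAlias("live", fn.publishVersion());

fn.update({ architecture: "arm64" }); // mid-deploy: $LATEST is inconsistent
console.log(fn.resolve("live").architecture); // x86_64: "live" callers are unaffected

fn.update({ extensionArchitecture: "arm64" }); // deploy finished
fn.pointAlias("live", fn.publishVersion());
console.log(fn.resolve("live").architecture); // arm64
```

The alias is repointed only after both updates have landed, so traffic moves from one consistent snapshot to another and never sees the mixed state.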

Alternatively, you could utilize the Datadog Forwarder during the x86 -> ARM migration, as that can be used in lieu of the Datadog Extension. After the deployment has fully finished, you can verify logs, metrics, profiles, and trace telemetry data for your function; and then follow up with a subsequent deploy migrating back to the Datadog Extension.

Finally on our end, we can explore methods of packaging the Datadog Extension in a way that supports either ARM or x86 architectures. Because our extension is a compiled binary, this would have some drawbacks (including possibly doubling the size of the extension) - but it's something we can investigate.

For now, I'll make a note in our documentation for both this library and other serverless deployment tool libraries before closing this issue.

Thank you again for your patience!

Contributor

astuyve commented Nov 17, 2022

This is now documented publicly: https://docs.datadoghq.com/serverless/configuration/?tab=datadogcli#migrating-between-x86-to-arm64-with-the-datadog-lambda-extension

I'm closing this issue, thanks for joining me on what's assuredly been an eye-opening journey.

astuyve closed this as completed Nov 17, 2022