ECSRun fails with boto InvalidParameterException needs awsvpc network mode #4243

Closed
marvin-robot opened this issue Mar 11, 2021 · 2 comments · Fixed by #4325


@marvin-robot

Opened from the Prefect Public Slack Community

leeca.jinlee: Hello all, trying out the latest Prefect version 0.14.12

Running into this error when attempting to run a flow using ECS agent and ECSRun:

botocore.errorfactory.InvalidParameterException: An error occurred (InvalidParameterException) when calling the RunTask operation: Task definition does not support launch_type FARGATE.

I have a working config for prefect agent that executes the flow without errors. However, this involves creating a task-definitions.yaml:

prefect agent ecs start -t token \
    -n aws-ecs-agent \
    -l label \
    --task-definition /path/to/task-definition.yaml \
    --cluster cluster_arn

task-definitions.yaml

networkMode: awsvpc
cpu: 1024
memory: 2048
taskRoleArn: task_role_arn
executionRoleArn: execution_role_arn

The flow runs without errors, so the error is not due to IAM permissions.

However, when running the ECS Agent using the --task-role-arn and --execution-role-arn CLI args, I run into the above-mentioned error. I have also tried running the Prefect agent with --launch-type FARGATE, which I believe is the default and does not need to be specified, but this does not work either.

prefect agent ecs start -t token \
    -n aws-ecs-agent \
    -l ecs \
    --task-role-arn task_role_arn \
    --execution-role-arn execution_role_arn \
    --cluster cluster_arn

I have also tried to pass in task_role_arn and execution_role_arn into the ECSRun() function within my flow, and ran into the same error.

Is there any way to run ECS Agent using CLI args without using the task-definition file?

leeca.jinlee: In the AWS docs for Task Definition, under Network Mode (https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-resource-ecs-taskdefinition.html#cfn-ecs-taskdefinition-networkmode), it says:

> If you are using the Fargate launch type, the awsvpc network mode is required.

However, it does not seem that there is a CLI arg for this that can be passed to prefect agent ecs start.

I have also tried passing in a networkConfiguration dict to run_task_kwargs arg in ECSRun():

"networkConfiguration": {
    "awsvpcConfiguration": {
        'assignPublicIp': 'ENABLED', 
        'subnets': ['subnet-1', 'subnet-2', 'subnet-3'], 
        'securityGroups': []
    }
}

Still running into the same error. It seems that the agent needs to know about awsvpc as the network mode, but there doesn't seem to be a way to tell it without using a task definition file.
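
For reference, the attempt described above can be sketched as a small helper. This is only an illustration: `build_network_configuration` is a hypothetical name, not part of Prefect or boto3, and the dict shape simply mirrors the snippet quoted in the thread. As reported above, passing this through run_task_kwargs did not resolve the error, which is consistent with the error message being about the task definition rather than the run-time network settings.

```python
from typing import Dict, List, Optional


def build_network_configuration(
    subnets: List[str],
    security_groups: Optional[List[str]] = None,
    assign_public_ip: bool = True,
) -> Dict[str, dict]:
    """Build a run_task_kwargs-style dict with an awsvpc network configuration.

    Hypothetical helper for illustration; the keys mirror the dict quoted
    in the thread above.
    """
    if not subnets:
        raise ValueError("awsvpc networking requires at least one subnet")
    return {
        "networkConfiguration": {
            "awsvpcConfiguration": {
                "assignPublicIp": "ENABLED" if assign_public_ip else "DISABLED",
                "subnets": list(subnets),
                "securityGroups": list(security_groups or []),
            }
        }
    }


# Placeholder subnet IDs, as in the thread.
run_task_kwargs = build_network_configuration(["subnet-1", "subnet-2", "subnet-3"])

# Assumed usage, mirroring the attempt described above:
#   ECSRun(run_task_kwargs=run_task_kwargs)
```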

michael054: Hey <@U01AYG8QZ4Y>, thanks for the thorough explanation. I'm going to open an issue for this in the Prefect Core repo as this looks like it may need a PR to address. <@ULVA73B9P> open "ECSRun fails with boto InvalidParameterException needs awsvpc network mode"

Original thread can be found here.

@zanieb

zanieb commented Mar 15, 2021

Additional info reported at https://prefect-community.slack.com/archives/CL09KU1K7/p1615830767346000

latest task definition

  "compatibilities": [
    "EC2"
  ],

with 0.14.6

  "compatibilities": [
    "EC2",
    "FARGATE"
  ],
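
A quick way to check a registered task definition for this difference is to look at its `compatibilities` list. The sketch below is a minimal illustration, assuming the task definition is already available as a dict (for example, the `taskDefinition` entry returned by boto3's `describe_task_definition`); `supports_fargate` is a hypothetical helper name.

```python
def supports_fargate(task_definition: dict) -> bool:
    """Return True if the task definition lists FARGATE as a compatibility."""
    # "compatibilities" may be missing or null, so fall back to an empty list.
    return "FARGATE" in (task_definition.get("compatibilities") or [])


# Trimmed-down versions of the two snippets reported above.
latest_definition = {"compatibilities": ["EC2"]}
definition_0_14_6 = {"compatibilities": ["EC2", "FARGATE"]}

print(supports_fargate(latest_definition))
print(supports_fargate(definition_0_14_6))
```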

@rbastian

From what I can tell the task definition submitted to ECS for 0.14.12 differs significantly from the task definition submitted for versions < 0.14.12.

Given this simple flow:

import prefect
from prefect.storage import S3
from prefect.run_configs import ECSRun
from prefect import task, Flow

RUN_CONFIG = ECSRun(
    image= '{redacted}.dkr.ecr.us-east-1.amazonaws.com/prefect-aws:latest',
    labels=["s3-flow-storage"],
    memory="512",
    cpu="256",
)
STORAGE = S3(bucket="prefect-rai-dev", stored_as_script=True)

@task
def say_hello():
    logger = prefect.context.get("logger")
    logger.info("Hello from prefect!")

with Flow("say_hello_flow", storage=STORAGE, run_config=RUN_CONFIG) as flow:
    say_hello()

Here is the task definition for 0.14.11 submitted by the Agent via boto3:

{
  "ipcMode": null,
  "executionRoleArn": "arn:aws:iam::{redacted}:role/ecsTaskExecutionRole",
  "containerDefinitions": [
    {
      "dnsSearchDomains": null,
      "environmentFiles": null,
      "logConfiguration": null,
      "entryPoint": null,
      "portMappings": [],
      "command": [
        "/bin/sh",
        "-c",
        "prefect execute flow-run"
      ],
      "linuxParameters": null,
      "cpu": 0,
      "environment": [
        {
          "name": "PREFECT__CLOUD__USE_LOCAL_SECRETS",
          "value": "false"
        },
        {
          "name": "PREFECT__CONTEXT__IMAGE",
          "value": "{redacted}.dkr.ecr.us-east-1.amazonaws.com/prefect-aws:latest"
        },
        {
          "name": "PREFECT__ENGINE__FLOW_RUNNER__DEFAULT_CLASS",
          "value": "prefect.engine.cloud.CloudFlowRunner"
        },
        {
          "name": "PREFECT__ENGINE__TASK_RUNNER__DEFAULT_CLASS",
          "value": "prefect.engine.cloud.CloudTaskRunner"
        }
      ],
      "resourceRequirements": null,
      "ulimits": null,
      "dnsServers": null,
      "mountPoints": [],
      "workingDirectory": null,
      "secrets": null,
      "dockerSecurityOptions": null,
      "memory": null,
      "memoryReservation": null,
      "volumesFrom": [],
      "stopTimeout": null,
      "image": "{redacted}.dkr.ecr.us-east-1.amazonaws.com/prefect-aws:latest",
      "startTimeout": null,
      "firelensConfiguration": null,
      "dependsOn": null,
      "disableNetworking": null,
      "interactive": null,
      "healthCheck": null,
      "essential": true,
      "links": null,
      "hostname": null,
      "extraHosts": null,
      "pseudoTerminal": null,
      "user": null,
      "readonlyRootFilesystem": null,
      "dockerLabels": null,
      "systemControls": null,
      "privileged": null,
      "name": "flow"
    }
  ],
  "placementConstraints": [],
  "memory": "512",
  "taskRoleArn": "arn:aws:iam::{redacted}:role/Prefect_Container_Role",
  "compatibilities": [
    "EC2",
    "FARGATE"
  ],
  "taskDefinitionArn": "arn:aws:ecs:us-east-1:{redacted}:task-definition/prefect-say-hello-flow:10",
  "family": "prefect-say-hello-flow",
  "requiresAttributes": [
    {
      "targetId": null,
      "targetType": null,
      "value": null,
      "name": "com.amazonaws.ecs.capability.ecr-auth"
    },
    {
      "targetId": null,
      "targetType": null,
      "value": null,
      "name": "com.amazonaws.ecs.capability.task-iam-role"
    },
    {
      "targetId": null,
      "targetType": null,
      "value": null,
      "name": "ecs.capability.execution-role-ecr-pull"
    },
    {
      "targetId": null,
      "targetType": null,
      "value": null,
      "name": "com.amazonaws.ecs.capability.docker-remote-api.1.18"
    },
    {
      "targetId": null,
      "targetType": null,
      "value": null,
      "name": "ecs.capability.task-eni"
    }
  ],
  "pidMode": null,
  "requiresCompatibilities": [],
  "networkMode": "awsvpc",
  "cpu": "256",
  "revision": 10,
  "status": "ACTIVE",
  "inferenceAccelerators": null,
  "proxyConfiguration": null,
  "volumes": []
}

Here is the task definition for 0.14.12 submitted by the Agent via boto3:

{
  "ipcMode": null,
  "executionRoleArn": null,
  "containerDefinitions": [
    {
      "dnsSearchDomains": null,
      "environmentFiles": null,
      "logConfiguration": null,
      "entryPoint": null,
      "portMappings": [],
      "command": null,
      "linuxParameters": null,
      "cpu": 0,
      "environment": [
        {
          "name": "PREFECT__CONTEXT__IMAGE",
          "value": "{redacted}.dkr.ecr.us-east-1.amazonaws.com/prefect-aws:latest"
        }
      ],
      "resourceRequirements": null,
      "ulimits": null,
      "dnsServers": null,
      "mountPoints": [],
      "workingDirectory": null,
      "secrets": null,
      "dockerSecurityOptions": null,
      "memory": null,
      "memoryReservation": null,
      "volumesFrom": [],
      "stopTimeout": null,
      "image": "{redacted}.dkr.ecr.us-east-1.amazonaws.com/prefect-aws:latest",
      "startTimeout": null,
      "firelensConfiguration": null,
      "dependsOn": null,
      "disableNetworking": null,
      "interactive": null,
      "healthCheck": null,
      "essential": true,
      "links": null,
      "hostname": null,
      "extraHosts": null,
      "pseudoTerminal": null,
      "user": null,
      "readonlyRootFilesystem": null,
      "dockerLabels": null,
      "systemControls": null,
      "privileged": null,
      "name": "flow"
    }
  ],
  "placementConstraints": [],
  "memory": "2048",
  "taskRoleArn": null,
  "compatibilities": [
    "EC2"
  ],
  "taskDefinitionArn": "arn:aws:ecs:us-east-1:{redacted}:task-definition/prefect-say-hello-flow:11",
  "family": "prefect-say-hello-flow",
  "requiresAttributes": [
    {
      "targetId": null,
      "targetType": null,
      "value": null,
      "name": "com.amazonaws.ecs.capability.ecr-auth"
    },
    {
      "targetId": null,
      "targetType": null,
      "value": null,
      "name": "com.amazonaws.ecs.capability.docker-remote-api.1.18"
    },
    {
      "targetId": null,
      "targetType": null,
      "value": null,
      "name": "ecs.capability.task-eni"
    }
  ],
  "pidMode": null,
  "requiresCompatibilities": [],
  "networkMode": "awsvpc",
  "cpu": "1024",
  "revision": 11,
  "status": "ACTIVE",
  "inferenceAccelerators": null,
  "proxyConfiguration": null,
  "volumes": []
}

The most significant changes in my view are the missing execution role and task role ARNs and the compatibilities section. From what I can tell, if the executionRoleArn is not populated, the compatibilities section automatically compiled by ECS will only have "EC2" as an option. Submitting an executionRoleArn, even a phony one, will get ECS to add "FARGATE" to the compatibilities section.
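
The comparison between the two revisions can be reproduced with a small top-level diff. This is a rough sketch, not the method used to produce the listings above: `diff_task_definitions` is a hypothetical helper, and the two dicts are trimmed copies of the 0.14.11 and 0.14.12 definitions with only a few fields kept.

```python
from typing import Any, Dict, Iterable, Tuple


def diff_task_definitions(
    old: Dict[str, Any],
    new: Dict[str, Any],
    ignore: Iterable[str] = ("revision", "taskDefinitionArn"),
) -> Dict[str, Tuple[Any, Any]]:
    """Return {key: (old_value, new_value)} for top-level fields that differ."""
    ignored = set(ignore)
    changes = {}
    for key in sorted(set(old) | set(new)):
        if key in ignored:
            continue  # revision numbers and ARNs always differ
        before, after = old.get(key), new.get(key)
        if before != after:
            changes[key] = (before, after)
    return changes


# Trimmed copies of the two task definitions shown above.
definition_0_14_11 = {
    "executionRoleArn": "arn:aws:iam::{redacted}:role/ecsTaskExecutionRole",
    "taskRoleArn": "arn:aws:iam::{redacted}:role/Prefect_Container_Role",
    "compatibilities": ["EC2", "FARGATE"],
    "networkMode": "awsvpc",
    "memory": "512",
    "cpu": "256",
    "revision": 10,
}
definition_0_14_12 = {
    "executionRoleArn": None,
    "taskRoleArn": None,
    "compatibilities": ["EC2"],
    "networkMode": "awsvpc",
    "memory": "2048",
    "cpu": "1024",
    "revision": 11,
}

changes = diff_task_definitions(definition_0_14_11, definition_0_14_12)
for key, (before, after) in changes.items():
    print(f"{key}: {before!r} -> {after!r}")
```

Note that networkMode is "awsvpc" in both revisions, so it does not appear in the diff; the role ARNs, compatibilities, cpu, and memory do.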

I can work around this issue by manually adding a task definition to the run config as shown below:

import prefect
from prefect.storage import S3
from prefect.run_configs import ECSRun
from prefect import task, Flow

import yaml

definition = yaml.safe_load(
    """
    networkMode: awsvpc
    cpu: 1024
    memory: 2048
    containerDefinitions:
        - name: flow
    executionRoleArn: aws:iam::{redacted}:role/Prefect_Task_Execution_Role
    """
)

RUN_CONFIG = ECSRun(
    image= '{redacted}.dkr.ecr.us-east-1.amazonaws.com/prefect-aws:latest',
    task_definition=definition,
    labels=["s3-flow-storage"],
    memory="512",
    cpu="256",
)
STORAGE = S3(bucket="prefect-rai-dev", stored_as_script=True)

@task
def say_hello():
    logger = prefect.context.get("logger")
    logger.info("Hello from prefect!")

with Flow("say_hello_flow", storage=STORAGE, run_config=RUN_CONFIG) as flow:
    say_hello()

Please note that the task execution role specified above DOES NOT EXIST, yet works.
Please note that the Agent is currently specifying both the executionRoleArn and taskRoleArn.
Please note that adding the executionRoleArn and taskRoleArn directly to the Run Config doesn't solve the problem.

In addition, there was some question about how the ECS/Fargate cluster was created. Originally we had created the cluster manually via the CLI. We did update our cluster definition to include the FARGATE and FARGATE_SPOT capacity providers and we added a default capacity provider strategy. None of those changes made any difference.

We also went back and created a new cluster using the "Getting Started" UI flow from the AWS console. This also didn't make any difference.

Lastly, on a lark, we changed our task execution role to be exactly "ecsTaskExecutionRole" thinking there might be some magic in the naming and that also didn't make any difference.

For completeness, here is our Agent Dockerfile:

FROM python:3.9-slim-buster

ENV PREFECT_VERSION=0.14.12

RUN apt-get update && apt-get install -y gcc

RUN pip install prefect[aws]==${PREFECT_VERSION}

COPY agent.py /agent.py
COPY --from=arpaulnet/s6-overlay-stage:2.0 / /

ENTRYPOINT ["/init"]

CMD ["python", "agent.py"]

Here is the agent.py:

import prefect
from prefect.agent.ecs import ECSAgent

TASK_ROLE_ARN = prefect.config.task.role.arn
EXECUTION_ROLE_ARN = prefect.config.execution.role.arn

logger = prefect.context.get("logger")
logger.info(f"task_role_arn={TASK_ROLE_ARN}, execution_role_arn={EXECUTION_ROLE_ARN}")

ECSAgent(
    name="RAI ECS/Fargate Agent",
    cluster="RAI",
    task_role_arn=TASK_ROLE_ARN,
    execution_role_arn=EXECUTION_ROLE_ARN,
    env_vars={"PREFECT__CONSUL__DNS": "consul.dev.drillinginfo.com"},
).start()

Here is the ECS cluster description:

{
    "clusters": [
        {
            "clusterArn": "arn:aws:ecs:us-east-1:{redacted}:cluster/RAI",
            "clusterName": "RAI",
            "status": "ACTIVE",
            "registeredContainerInstancesCount": 0,
            "runningTasksCount": 0,
            "pendingTasksCount": 1,
            "activeServicesCount": 0,
            "statistics": [],
            "tags": [],
            "settings": [
                {
                    "name": "containerInsights",
                    "value": "enabled"
                }
            ],
            "capacityProviders": [
                "FARGATE_SPOT",
                "FARGATE"
            ],
            "defaultCapacityProviderStrategy": []
        }
    ],
    "failures": []
}

sharkinsspatial added a commit to pangeo-forge/pangeo-forge-aws-bakery that referenced this issue Mar 27, 2021
sharkinsspatial added a commit to pangeo-forge/pangeo-forge-aws-bakery that referenced this issue Apr 8, 2021
* Update CDK

* Upgrade CDK

* Add Dockerfile for ECS Agent

* Add base bakery stack to deploy a cluster with 1 ECS agent running

* Add instructions on creating RUNNER TOKEN secret pre-deployment

* Upgrade prefect, add example flow

* Refactor out permissions for ecs tasks into a specific role

* Add ECS Run as we need to specify the image ECS users

* Begin to add DaskExecutor to Flow - Need to build a image with Dask deps

* Split out agent and worker docker images

* Import container from ecr repository that is pre-populated

* Migrate ECSRun to use dynamic task definition ref PrefectHQ/prefect/issues/4243

* Install dev deps via make due to pipenv locking bugs. Temporary hack.

* Include dask-cloudprovider dependencies in worker image.

* Stack role and export updates to support a DaskExecutor in test Flow.

* Inclued agent label environment variable for deployment.

* Initial test flow to validate DaskExecutor functionality.

* Add additional bucket for processing cache and target output.

* Add full zarr transform flow and move test flows to flow_test directory.

* Include deps for transform_flow to avoid pipenv locking issue.

* Add necessary dependencies for transform_flow execution by Dask workers.

* Move test flows to flow_test.

* Include local path for importing Flow tasks.

* Fix linting issues.

* Consolidate stack output value retrieval.

* Move dev dependencies into Pipfile.

* Propagate stack and flow tags to dynamically created ECS tasks.

* Pin version of Github Actions ubuntu to support pipenv install.

* Update directory paths for linting.

* isort linting fixes.

* Black formatting fixes.

* Pin dependencies used by test flows.

* Include detailed descriptions of new environment variables.

* Update ids and stack exports to correctly use identifier in formatting.

* Remove legacy comments from test flows and flow utils.

* Add pre-commit hooks for linting and direct fixes.

* Linting and formatting fixes to conform to pre-commit specs.

Co-authored-by: Ciaran Evans <ciaran@developmentseed.org>