Adds Terraform files as alternative Cloud Formation #26
Conversation
Co-authored-by: Gregory Hilston <66077036+GregHilstonHop@users.noreply.github.com> Co-authored-by: Ankur Kalra <64661093+ankur-hop@users.noreply.github.com>
Thanks @GregHilstonHop! We will start reviewing this!
This is really cool @GregHilstonHop, I've played with this a bit today. The doc is great; I've been able to run everything pretty easily. As someone who tried to get a working tf setup for AWS Batch at $JOB-1, I know it is far from trivial! I wonder if it would be possible to separate the VPC setup bits from the Metaflow-specific resources. In fact, it seems like it should be possible to turn the Metaflow-specific bits into a reusable module.
Hey @oavdeev, thanks for trying it out, very happy you're having a great experience and enjoying the documentation! You're absolutely right that setting up an AWS Batch environment is not a simple task :) I think you have a really good point about how we don't want to shoehorn people into being required to stand up a VPC exactly as the infra project does.

Since this pull request is rather large, clocking in at >3k lines, I'm wondering if you and the Metaflow team would be open to us creating a Github issue for your idea and tackling this on a subsequent PR. That way I/we/whoever attacks this followup can submit a much smaller and more focused PR. Thoughts, @oavdeev / Metaflow team?
I agree, this could be done as a followup. I've left some comments and created the issue here: #27
Taking @oavdeev's suggestion to make Metadata API authentication description more easily understandable Co-authored-by: Oleg Avdeev <oleg.v.avdeev@gmail.com>
Awesome! Thank you for creating that Github issue! I ended up taking all your advice and added it on this branch.
Great! I will test out the PR early next week!
Awesome, excited to hear your thoughts and feedback!
* add dynamo access to the batch service role like: https://github.com/Netflix/metaflow-tools/blob/master/aws/cloudformation/metaflow-cfn-template.yml#L1066 * fix: use variable since policy may not resolve
Hi @savingoyal, has anyone had a chance to look at this yet? We've done the work to move the metaflow bits to a standalone module, so if no one has had a chance to review yet, I'm wondering if it makes sense to update this PR with the modularized version instead of doing it separately for #27.
Sure, it would be great if you can update the PR with the modularized version. We can get started with the review right after!
@GregHilstonHop yes, I'll open a PR in your repo for this branch later today.
```hcl
statement {
  effect = "Allow"

  # TODO - reduce to Encrypt, Decrypt?
```
nit: looks like this TODO is done :)
```hcl
data "aws_availability_zones" "available" {
  state = "available"
}
```
Little micro-bug here. For accounts using AWS Local Zones, this occasionally returns those AZs for the VPC deployment, and Local Zones don't support RDS or NLB at the moment. A filter stanza like the following can correct that:

```hcl
filter {
  name   = "opt-in-status"
  values = ["opt-in-not-required"]
}
```
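Putting the suggested filter together with the data source above, the corrected block would look something like this (a sketch; only the filter stanza is new):

```hcl
data "aws_availability_zones" "available" {
  state = "available"

  # Exclude AWS Local Zones, which don't currently support RDS or NLB;
  # only AZs that are available by default are returned.
  filter {
    name   = "opt-in-status"
    values = ["opt-in-not-required"]
  }
}
```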
pr feedback
```hcl
api_stages {
  api_id = aws_api_gateway_rest_api.this.id
  stage  = aws_api_gateway_deployment.this.stage_name
```
Did you mean `stage = aws_api_gateway_stage.this.stage_name` here?
yes that makes more sense 👍🏻 I'll update
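For reference, the corrected block would presumably read:

```hcl
api_stages {
  api_id = aws_api_gateway_rest_api.this.id
  # Reference the stage resource rather than the deployment
  stage  = aws_api_gateway_stage.this.stage_name
}
```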
```hcl
resource "aws_lb_target_group" "db_migrate" {
  name = "${var.resource_prefix}dbtrgtgrp${var.resource_suffix}"
```
This might be a cascading issue that affects other stuff, but I'm going to draw attention to it here since I've hit it twice. Certain resources, target groups included, have a 32-character name limit. The baseline example.tfvars and this naming convention result in a name that ends up being 33 characters. I fix it by changing this line to `"${var.resource_prefix}dbtg${var.resource_suffix}"`. We should definitely ensure that the template works out-of-the-box with the provided variables, either by shortening this line or providing shorter variable defaults, but it's still worth calling out that these naming conventions teeter dangerously close to character limits, and even a wee bit of user fiddling in the labeling could result in overflow. It's not a deal-breaker, but I'd anticipate running into it in the wild occasionally.
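A minimal sketch of the shortened resource; the port, protocol, and vpc_id values here are illustrative assumptions, not the PR's actual arguments:

```hcl
resource "aws_lb_target_group" "db_migrate" {
  # "dbtg" instead of "dbtrgtgrp" keeps the rendered name under the
  # 32-character limit AWS imposes on target group names.
  name     = "${var.resource_prefix}dbtg${var.resource_suffix}"
  port     = 8080        # illustrative
  protocol = "HTTP"      # illustrative
  vpc_id   = var.vpc_id  # illustrative
}
```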
pr feedback: referencing wrong entity for stage_name
aws/terraform/README.md
Initialize the terraform:

`cd infra && terraform init`
`cd infra` is called out in three different spots, one for each section. I assume the intention was for it to indicate the specific project name, e.g. `cd infra`, `cd metaflow`, and `cd sagemaker`. Just a copy-paste typo, perhaps?
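If that reading is right, each section's instructions would presumably name its own directory, something like (a sketch of the corrected README text, assuming the three project directories sit side by side):

```
cd infra && terraform init
cd ../metaflow && terraform init
cd ../sagemaker && terraform init
```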
```hcl
  tags = var.standard_tags
}

resource "aws_lb_target_group" "this" {
```
Refer to the issue below: https://github.com/Netflix/metaflow-tools/pull/26/files#r677874253. Hit this problem with the character limit again with this resource when I tested on Gov Cloud. Gov Cloud's region identifier is longer (`us-gov-west-1`), so public sector users will hit this limit both for this resource as well as the DB Target Group I called out below.
```hcl
resources = [
  "arn:${var.iam_partition}:iam:${local.aws_region}:${local.aws_account_id}:role/${local.resource_prefix}*${local.resource_suffix}"
```
I've only gotten a chance to test this in Gov Cloud, and not in its entirety due to another issue I'll highlight elsewhere, but this wildcard structure seems to confuse Terraform pretty dramatically. I was able to get it to at least apply using the following syntax: `"arn:${var.iam_partition}:iam:${local.aws_region}:${local.aws_account_id}:role/${local.resource_prefix}-*-${local.resource_suffix}"`. I'm uncertain if the policies will work, though, because I'm stalled at the same KMS S3 issue we saw with the Batch user. I assume we'll just need to apply the same permission setup here.
I've created a shortened region name (4-character) mapping dict to hopefully resolve this.
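A hypothetical sketch of such a mapping; the actual region codes and local names in the PR may differ:

```hcl
locals {
  # Hypothetical 4-character short names used to keep rendered resource
  # names under AWS length limits; extend as more regions are needed.
  region_short_names = {
    "us-east-1"     = "use1"
    "us-west-2"     = "usw2"
    "us-gov-west-1" = "ugw1"
  }

  # Fall back to the full region name if no short name is defined.
  aws_region_short = lookup(local.region_short_names, local.aws_region, local.aws_region)
}
```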
```hcl
effect = "Allow"

actions = [
  "kms:Decrypt",
```
Placing this here as a highlight. I'm hitting the same S3 forbidden errors and KMS denials as I had for the Batch role. I assume we'll need the same KMS adjustment for this role as we did for that one.
I'm wondering about the resource specified here. It does look the same as what is present in the CloudFormation template, but it seems like it should be: `arn:${var.iam_partition}:kms:${local.aws_region}:${local.aws_account_id}:key/*`. The other KMS permissions are not using a wildcard and have a specific key resource.
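Spelled out as a full statement, the suggestion would look roughly like this (a sketch; the exact action list comes from the PR):

```hcl
statement {
  effect = "Allow"

  actions = [
    "kms:Decrypt",
  ]

  # Scope the statement to KMS key ARNs rather than the value carried
  # over from the CloudFormation template.
  resources = [
    "arn:${var.iam_partition}:kms:${local.aws_region}:${local.aws_account_id}:key/*"
  ]
}
```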
I think this is actually unrelated to the above. The "metaflow" policy on the custom role had a passrole permission with a region in the arn which was causing the role not to be fully generated. Pushed a commit to fix and verified role/policy is generated now
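For context, IAM is a global service, so role ARNs leave the region field empty; a PassRole statement without the region would look something like this sketch (wildcard placement as in the original):

```hcl
statement {
  effect  = "Allow"
  actions = ["iam:PassRole"]

  # IAM ARNs are global: the region field between the two colons is empty.
  resources = [
    "arn:${var.iam_partition}:iam::${local.aws_account_id}:role/${local.resource_prefix}*${local.resource_suffix}"
  ]
}
```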
Pr feedback
fix: reduce target group name length to avoid issues with resource naming limits
```hcl
resources = [
  "arn:${var.iam_partition}:events:${local.aws_region}:${local.aws_account_id}:rule:*",
```
I believe this needs to be `arn:${var.iam_partition}:events:${local.aws_region}:${local.aws_account_id}:rule/*`
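That is, with the rule name separated by a slash rather than a colon:

```hcl
resources = [
  # EventBridge rule ARNs use "rule/<name>", not "rule:<name>"
  "arn:${var.iam_partition}:events:${local.aws_region}:${local.aws_account_id}:rule/*",
]
```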
Hey Metaflow,
This pull request adds a Terraform stack for deploying Metaflow to AWS. As requested by the Metaflow CONTRIBUTING.md file, we've checked for an existing Github issue before creating our own and this pull request can be considered a direct response to the existing Metaflow Github issue #38. Additionally, as requested, we've reached out to the Metaflow team on the Gitter chatroom to ensure this submission provides value and is aligned with Metaflow's direction.
We're aware of the outstanding pull request 14 that also adds Terraform code. Since pull request 14 and the discussion on Metaflow Github issue #38 have been stale for some time, we decided we'd use this opportunity to share our Terraform implementation.
Technical Breakdown
We've designed the Terraform code to be composed of three separate projects:

- infra
- metaflow
- sagemaker-notebook

These are documented in the various README.md files included in this pull request, but for convenience we'll provide a high-level overview here.

infra

Mostly stands up and configures the VPC.

metaflow

This has been separated into four modules based on the service architecture. Depends on the output of the infra project.

- Computation: Sets up the AWS Batch resources to support remote experiment execution.
- Datastore: Sets up the RDS and S3 resources to provide storage for the Metadata Service.
- Metadata Service: Sets up the API Gateway, load balancer, and ECS cluster that power the Metadata Service.
- Step Functions: Sets up the Step Functions resources to provide production-grade remote experiment executions.

sagemaker-notebook

Sets up the Sagemaker Notebook AWS resources to provide a central location to interact with the Metadata Service.
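To illustrate how the projects compose, a downstream project can reuse the infra outputs along these lines. This is a hypothetical sketch: the backend, paths, and variable/output names here are assumptions, not the PR's actual wiring:

```hcl
# Read the infra project's state to reuse its networking outputs
# (hypothetical local backend and state path).
data "terraform_remote_state" "infra" {
  backend = "local"
  config = {
    path = "../infra/terraform.tfstate"
  }
}

module "metaflow" {
  source = "../metaflow" # hypothetical module path

  # Hypothetical variable and output names
  vpc_id     = data.terraform_remote_state.infra.outputs.vpc_id
  subnet_ids = data.terraform_remote_state.infra.outputs.subnet_ids
}
```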
How To Deploy
This is documented in the pull request and can be found under metaflow-tools/aws/terraform/README.md.