This repository was archived by the owner on Apr 20, 2023. It is now read-only.

Adds Terraform files as alternative Cloud Formation #26

Merged
merged 38 commits into from
Aug 11, 2021

Conversation

@GregHilstonHop
Contributor

@GregHilstonHop commented Mar 30, 2021

Hey Metaflow,

This pull request adds a Terraform stack for deploying Metaflow to AWS. As requested by Metaflow's CONTRIBUTING.md, we checked for an existing GitHub issue before creating our own; this pull request can be considered a direct response to the existing Metaflow GitHub issue #38. Additionally, as requested, we reached out to the Metaflow team in the Gitter chatroom to ensure this submission provides value and is aligned with Metaflow's direction.

We're aware of the outstanding pull request #14, which also adds Terraform code. Since pull request #14 and the discussion on Metaflow GitHub issue #38 have been stale for some time, we decided to use this opportunity to share our Terraform implementation.

Technical Breakdown

We've designed the Terraform code to be composed of three separate projects:

  1. infra
  2. metaflow
  3. sagemaker-notebook

These are documented in the various README.md files included in this pull request, but for convenience we'll provide a high-level overview here.
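At a high level, the layout looks roughly like this (a sketch based on the project names above; the exact file layout may differ):

```
metaflow-tools/aws/terraform/
├── infra/               # VPC and base networking
├── metaflow/            # computation, datastore, metadata service, step functions
└── sagemaker-notebook/  # notebook for interacting with the Metadata Service
```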

infra

Mostly stands up and configures the VPC.

metaflow

This has been separated into four modules based on the service architecture.

Depends on the outputs of the infra project.

Computation

Sets up the AWS Batch resources to support remote experiment execution.

Datastore

Sets up the RDS and S3 AWS resources to provide storage for the Metadata Service.

Metadata Service

Sets up the API Gateway, load balancer, and ECS cluster that power the Metadata Service.

Step Functions

Sets up the AWS Step Functions resources to provide productionized remote experiment executions.

sagemaker-notebook

Sets up the Amazon SageMaker Notebook resources to provide a central location to interact with the Metadata Service.

How To Deploy

This is documented in the pull request and can be found under metaflow-tools/aws/terraform/README.md

@savingoyal
Collaborator

Thanks @GregHilstonHop! We will start reviewing this!

@savingoyal savingoyal self-requested a review March 30, 2021 15:11
@oavdeev
Contributor

oavdeev commented Mar 31, 2021

This is really cool @GregHilstonHop , I've played with this a bit today. The doc is great, I've been able to run everything pretty easily. As someone who tried to get a working tf setup for AWS Batch at $JOB-1, I know it is far from trivial!

I wonder if it would be possible to separate the VPC setup bits from the Metaflow-specific resources in infra (like that IAM policy)? I imagine a lot of people already have some kind of preferred VPC configuration in place, maybe using a module like terraform-aws-vpc. It would be cool if this module played nicely with that kind of setup and had an option to provide subnet IDs and whatnot externally.

In fact, it seems like it should be possible to turn metaflow into a module that people can use directly from this GitHub repo. What do you think?
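A hypothetical sketch of how that could look from the consumer side (the module source path and input names here are illustrative, not this PR's actual interface):

```hcl
# Illustrative only: consuming metaflow as a module, with networking
# supplied externally, e.g. from the terraform-aws-vpc module.
module "metaflow" {
  source = "github.com/Netflix/metaflow-tools//aws/terraform/metaflow"

  vpc_id     = module.vpc.vpc_id          # assumed input name
  subnet_ids = module.vpc.private_subnets # assumed input name
}
```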

@GregHilstonHop
Contributor Author

Hey @oavdeev , thanks for trying it out, very happy you're having a great experience and enjoying the documentation!

You're absolutely right that setting up an AWS Batch environment is not a simple task :)

I think you have a really good point about not wanting to shoehorn people into standing up a VPC exactly as the infra project dictates. Separating the VPC from the Metaflow-specific resources is a really good idea and would make this much more approachable for those who already have AWS resources.

Since this pull request is rather large, clocking in at >3k lines, I'm wondering if you and the Metaflow team would be open to us creating a GitHub issue for your idea and tackling it in a subsequent PR.

That way, I/we/whoever attacks this followup can submit a much smaller and more focused PR.

Thoughts, @oavdeev / Metaflow team?

@oavdeev
Contributor

oavdeev commented Apr 2, 2021

That way, I/we/whoever attacks this followup can submit a much smaller and more focused PR.

I agree, this could be done as a followup. I've left some comments and created the issue here: #27

@GregHilstonHop
Contributor Author

I agree, this could be done as a followup. I've left some comments and created the issue here: #27

Awesome! Thank you for creating that Github issue!

I ended up taking all your advice and applying it on this branch.

@savingoyal
Collaborator

Great! I will test out the PR early next week!

@GregHilstonHop
Contributor Author

Awesome, excited to hear your thoughts and feedback!

@kldavis4
Contributor

Hi @savingoyal, has anyone had a chance to look at this yet?

We've done the work to move the metaflow bits into a standalone module, so if no one has had a chance to review yet, I'm wondering if it makes sense to update this PR with the modularized version instead of doing it separately for #27.

@savingoyal
Collaborator

Sure, it would be great if you can update the PR with the modularized version. We can get started with the review right after!

@GregHilstonHop
Contributor Author

@kldavis4 How do you want to go about updating this branch?

Do you want to submit a PR for the work you did for issue #27 to merge said work into this branch?

@kldavis4
Contributor

@GregHilstonHop yes, I'll open a PR in your repo for this branch later today

statement {
  effect = "Allow"

  # TODO - reduce to Encrypt, Decrypt?
Contributor

nit: looks like this TODO is done :)

*/
data "aws_availability_zones" "available" {
  state = "available"
}
Contributor

Little micro-bug here. For accounts using AWS Local Zones, this occasionally returns those AZs for the VPC deployment, and Local Zones don't support RDS or NLB at the moment. A filter stanza like the following can correct that:

filter {
  name  = "opt-in-status"
  values = ["opt-in-not-required"]
}
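Putting the suggestion together with the data source above, the full block would look like this (worth verifying against the AWS provider docs):

```hcl
data "aws_availability_zones" "available" {
  state = "available"

  # Exclude Local Zones, which don't support RDS or NLB.
  filter {
    name   = "opt-in-status"
    values = ["opt-in-not-required"]
  }
}
```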


api_stages {
  api_id = aws_api_gateway_rest_api.this.id
  stage  = aws_api_gateway_deployment.this.stage_name
Contributor

Did you mean stage = aws_api_gateway_stage.this.stage_name here?

Contributor

yes that makes more sense 👍🏻 I'll update

}

resource "aws_lb_target_group" "db_migrate" {
  name = "${var.resource_prefix}dbtrgtgrp${var.resource_suffix}"
Contributor

This might be a cascading issue that affects other resources, but I'm drawing attention to it here since I've hit it twice. Certain resources, target groups included, have a 32-character name limit. The baseline example.tfvars and this naming convention produce a name that ends up being 33 characters. I fix it by changing this line to "${var.resource_prefix}dbtg${var.resource_suffix}".

We should definitely ensure that the template works out of the box with the provided variables, either by shortening this line or by providing shorter variable defaults. It's still worth calling out that these naming conventions teeter dangerously close to character limits, and even a wee bit of user fiddling with the labeling could result in overflow. It's not a deal-breaker, but I'd anticipate running into it in the wild occasionally.
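One defensive option (a sketch, not necessarily the right fix for this PR) is to clamp the name to the 32-character limit with Terraform's substr function:

```hcl
# Sketch: truncate the target group name to AWS's 32-character limit,
# so longer prefixes/suffixes degrade gracefully instead of failing.
locals {
  db_tg_name = substr("${var.resource_prefix}dbtg${var.resource_suffix}", 0, 32)
}
```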


Initialize the terraform:

`cd infra && terraform init`
Contributor

cd infra is called out in three different spots, one for each section. I assume the intention was for it to indicate the specific project name, e.g. cd infra, cd metaflow, and cd sagemaker-notebook. Just a copy-paste typo, perhaps?

tags = var.standard_tags
}

resource "aws_lb_target_group" "this" {
Contributor

Refer to the issue below: https://github.com/Netflix/metaflow-tools/pull/26/files#r677874253. Hit this problem with character limit again with this resource when I tested on Gov Cloud. Gov Cloud's region identifier is longer (us-gov-west-1), so public sector users will hit this limit both for this resource as well as the DB Target Group I called out below.

]

resources = [
"arn:${var.iam_partition}:iam:${local.aws_region}:${local.aws_account_id}:role/${local.resource_prefix}*${local.resource_suffix}"
Contributor

I've only gotten a chance to test this in Gov Cloud, and not in its entirety due to another issue I'll highlight elsewhere, but this wildcard structure seems to confuse Terraform pretty dramatically. I was able to get it to at least apply using the following syntax: "arn:${var.iam_partition}:iam:${local.aws_region}:${local.aws_account_id}:role/${local.resource_prefix}-*-${local.resource_suffix}". I'm uncertain if the policies will work, though, because I'm stalled at the same KMS S3 issue we saw with the Batch user. I assume we'll just need to apply the same permission setup here.

Contributor

I've created a shortened region name (4-character) mapping dict to hopefully resolve this
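Such a mapping could be sketched like this (the region abbreviations shown are illustrative, not necessarily the ones in the commit):

```hcl
# Illustrative 4-character region abbreviations used to keep resource
# names under AWS character limits, including longer Gov Cloud regions.
locals {
  region_short_name = {
    "us-east-1"     = "use1"
    "us-west-2"     = "usw2"
    "us-gov-west-1" = "ugw1"
  }
}
```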

effect = "Allow"

actions = [
"kms:Decrypt",
Contributor

Placing this here as a highlight. I'm hitting the same S3 forbidden errors and KMS denials as I had for the Batch role. I assume we'll need the same KMS adjustment for this role as we did for that one.

Contributor

I'm wondering about the resource specified here. It does look the same as what is present in the CloudFormation template, but it seems like it should be: arn:${var.iam_partition}:kms:${local.aws_region}:${local.aws_account_id}:key/*

The other kms permissions are not using a wildcard and have a specific key resource

Contributor

I think this is actually unrelated to the above. The "metaflow" policy on the custom role had a PassRole permission with a region in the ARN, which was causing the role not to be fully generated. Pushed a commit to fix, and verified the role/policy is generated now.

]

resources = [
"arn:${var.iam_partition}:events:${local.aws_region}:${local.aws_account_id}:rule:*",
Contributor

I believe this needs to be arn:${var.iam_partition}:events:${local.aws_region}:${local.aws_account_id}:rule/*

@savingoyal savingoyal merged commit e6f5ba4 into Netflix:master Aug 11, 2021