
Create AWS deployment infrastructure #366

Closed
6 tasks done
choldgraf opened this issue Apr 28, 2021 · 22 comments
Labels: Enhancement (An improvement to something or creating something new.)
@choldgraf
Member

choldgraf commented Apr 28, 2021

Summary

We currently focus much of our deployment infrastructure around Google Cloud rather than AWS or Azure. We also have a few clients that would like their hubs working on AWS. We should improve our AWS deployment infrastructure and use these use-cases as forcing functions.

The two use-cases are:

Given that two groups wish to use this infrastructure now, and that AWS is extremely popular and will likely be a commonly-requested provider, I think we should prioritize this one.

Acceptance criteria

We should be able to spin up an AWS Pangeo-style hub with the same ease that we currently have with GKE.

Tasks to complete

A few that came to mind...

  • Get info about current pangeo-hub AWS deployment
  • Manually deploy a kops AWS cluster resembling the pangeo-hubs one
  • Adapt our deploy scripts to support AWS as well.
  • Prototype this with the OpenScapes hub, refinements if necessary.
  • Prototype this with the Carbon Plan hub, refinements if necessary.
  • Finalize and document the AWS setup

Pinging @jhamman and @consideRatio, who may be interested in tracking this (or helping out!).

@choldgraf choldgraf added Enhancement An improvement to something or creating something new. prio: high labels Apr 28, 2021
@choldgraf choldgraf added this to Ready to work 👍 in Deliverables Backlog via automation Apr 28, 2021
@choldgraf choldgraf moved this from Ready to work 👍 to In progress ⚡ in Deliverables Backlog Apr 28, 2021
@yuvipanda
Member

#368 is related

@damianavila
Contributor

#135 is also related

@damianavila
Contributor

#50 might be related as well...

@damianavila
Contributor

Another related one: 2i2c-org/farallon-image#28

@damianavila
Contributor

OK, I have looked into the pilot-hubs and pangeo-hubs codebases. I have also looked at the other issues referenced here.
This is the summary I was able to put together (alongside some questions, of course 😉):

  1. The pilot-hubs repo has a terraform-based GCP cluster (manual?) deployment without any documentation (or did I miss it somehow?)
  2. I suspect the GCP cluster was deployed manually (is there any issue or doc recording how that process was done?)
  3. I see https://github.com/2i2c-org/org-ops as, maybe, the place to actually host the (automated?) process to create the cluster. Is that the proper interpretation? Currently, I do not see a link between the content in the org-ops repo and the process to deploy the terraform template living inside the pilot-hubs repo (but I could be missing stuff since terraform is new to me).

I will continue my comments in a subsequent message, otherwise, it gets pretty long...

@yuvipanda
Member

The pilot-hubs repo has a terraform-based GCP cluster (manual?) deployment without any documentation (or did I miss it somehow?)

Correct! No docs, just #275. I can walk you through this if you would like, but it is a mess.

3. I see https://github.com/2i2c-org/org-ops as, maybe, the place to actually host the (automated?) process to create the cluster. Is that the proper interpretation? Currently, I do not see a link between the content in the org-ops repo and the process to deploy the terraform template living inside the pilot-hubs repo (but I could be missing stuff since terraform is new to me).

The intent of that repo is to only host infrastructure that's org-wide - so managing project access, terraform state workspaces, maybe a centralized grafana (if we get there), etc. Not for per-project terraform. I also continue to find automating terraform deployments terrifyingly complex and super easy to get wrong.
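For concreteness, the terraform workspace mechanism involved looks roughly like the sketch below; the workspace and var-file names are made-up placeholders, not the actual org-ops or pilot-hubs layout:

```sh
# Illustrative sketch only -- workspace and file names are placeholders.
terraform workspace new openscapes          # one isolated state per cluster
terraform workspace select openscapes
terraform plan  -var-file=openscapes.tfvars
terraform apply -var-file=openscapes.tfvars
```

Keeping one workspace per cluster is what makes a shared state backend usable across projects without the per-cluster states colliding.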

@yuvipanda
Member

I opened #369 fixing an issue (but it makes the code scarier!), and adding some docs on the current terraform workspaces in use.

@damianavila
Contributor

Assuming I know the answer to some of the questions I raised above, I think the plan for the item on the list ("Adapt our deploy scripts to support AWS as well") should be:

  1. Manually deploy the kops based cluster into AWS land
    a. Planning to use the standard Zero to JupyterHub docs to create the cluster with @yuvipanda's config files. Do you suggest any other resource?
    b. Planning to use the same guide to set up EFS support instead of the NFS VM story. I think that is what @yuvipanda did in the issue referenced above, and I agree with that. Any other thoughts? What is the story about using an NFS VM in GCP land instead of Filestore (maybe?)

  2. Start digging into the pilot-hubs repo to make it compatible with an AWS cluster:
    a. The Cluster class should be aware of AWS-based clusters
    b. Where are we pushing the image to? We are using quay.io, right?
    c. Looking at the Hub class, I do not see anything GCP-specific; am I missing something (that is most probably the case)?
    d. Still not sure how the NFS to EFS story will play out; I still need to think about that one (connected to 1.b), but I see some relevant config files here that I think I would need to work with...

General thoughts? 😜
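For reference, the manual kops deployment in step 1 might look roughly like the sketch below; the state-store bucket, cluster name, zones, and instance sizes are placeholders, not the actual Farallon or OpenScapes values:

```sh
# Hypothetical sketch of step 1 -- all names and sizes are placeholders.
export KOPS_STATE_STORE=s3://example-2i2c-kops-state   # S3 bucket must exist first
kops create cluster example.k8s.local \
  --zones us-west-2a \
  --node-count 2 \
  --node-size m5.xlarge \
  --master-size t3.medium \
  --yes
kops validate cluster --wait 10m
```

The kubeconfig that kops generates is what the deploy scripts would later consume (see the discussion of `kops export kubecfg` below in this thread).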

@yuvipanda
Member

1. b. Planning to use the same guide to set up EFS support instead of the NFS VM story. I think that is what @yuvipanda did in the issue referenced above and I agree with that. Any other thoughts? What is the story about using an NFS VM in GCP land instead of Filestore (maybe?)

On AWS, EFS is indeed used. On GCP, Filestore is extremely expensive - unlike EFS, it has a minimum disk size of 1TB, easily a few hundred dollars a month just on that. EFS is much closer to true pay-per-use.

2. a. The [Cluster class](https://github.com/2i2c-org/pilot-hubs/blob/7934081b6afaa4e03d49946c0943c63f599f400f/deployer/hub.py#L24) should be aware of AWS-based clusters

Yep! For kops I was thinking we can just ship the kubeconfig generated by `KUBECONFIG=secrets/farallon.yaml kops export kubecfg farallon-2i2c.k8s.local --admin`, but there are probably other options too.
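For illustration, making the Cluster class provider-aware could look something like the sketch below; the field names and methods are hypothetical, not the actual deployer/hub.py API:

```python
# Hypothetical sketch: a provider-aware Cluster that builds the right
# credential-acquisition command for kubectl. Field names are illustrative.
class Cluster:
    def __init__(self, spec):
        self.spec = spec  # parsed cluster entry, e.g. from a hubs.yaml

    def auth_command(self):
        provider = self.spec.get("provider", "gcp")
        if provider == "gcp":
            return [
                "gcloud", "container", "clusters", "get-credentials",
                self.spec["cluster"], "--zone", self.spec["zone"],
            ]
        if provider == "aws":
            # Point kubectl at a kops-exported kubeconfig kept in secrets/
            return ["kubectl", "--kubeconfig", self.spec["kubeconfig"]]
        raise ValueError(f"unknown provider: {provider}")
```

The design choice here is to keep the provider branch in one place so the rest of the deploy pipeline stays cloud-agnostic.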

b. Where are we pushing the image to? We are using quay.io, right?

So right now, that's the suggestion for cases when our users are building the docker image themselves. I think for us, we should try to get the image into ECR. It definitely makes node spin-up time much faster, and this is super important with dask.

2. d. Still not sure how the NFS to EFS story will play out, I still need to think about that one (connected to 1.b), but I see some relevant config files [here](https://github.com/2i2c-org/pilot-hubs/tree/master/hub-templates/base-hub/templates) that I think I would need to work with...

I just mount all EFS as NFS. Just setting something like https://github.com/2i2c-org/pangeo-hubs/blob/staging/deployments/farallon/config/common.yaml#L4 as our nfs.server should 'just work' with the NFS setup we have. https://pilot-hubs.2i2c.org/en/latest/topic/storage-layer.html has some info, particularly the client mounts might be useful.
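Concretely, that would be a config fragment along these lines, modeled on the linked common.yaml; the filesystem ID and region below are made-up placeholders following the DNS pattern AWS uses for EFS mount targets:

```yaml
# Hypothetical fragment -- the EFS filesystem ID and region are placeholders,
# and the exact key nesting depends on the chart's values schema.
nfs:
  server: fs-0123abcd.efs.us-west-2.amazonaws.com
  baseSharePath: /homes
```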

@damianavila
Contributor

The intent of that repo is to only host infrastructure that's org-wide - so managing project access, terraform state workspaces, maybe a centralized grafana (if we get there), etc. Not for per-project terraform.

OK, thanks for the clarification!

I also continue to find automating terraform deployments terrifyingly complex and super easy to get wrong.

Yep, I am kind of getting that as I read about it...

@damianavila
Contributor

On GCP, Filestore is extremely expensive - unlike EFS, there's a minimum disk size of 1TB. Easily a few hundred dollars a month just on that.

I supposed that was the case... thanks for confirming it!

@yuvipanda
Member

Filestore is also just NFSv3, and in general doesn't have a lot of the features of EFS that make it so desirable. SIGH

@damianavila damianavila self-assigned this Apr 30, 2021
yuvipanda added a commit to yuvipanda/pilot-hubs that referenced this issue May 3, 2021
Pangeo hubs have a `PANGEO_SCRATCH` env variable that
points to a GCS bucket, used to share data between users.
We implement that here too, but with a more generic `SCRATCH_BUCKET`
env var (`PANGEO_SCRATCH` is also set for backwards compat).
pangeo-data/pangeo-cloud-federation#610
has some more info on the use cases for `PANGEO_SCRATCH`

Right now, we use Google Config Connector
(https://cloud.google.com/config-connector/docs/overview)
to set this up. We create Kubernetes CRDs, and the connector
creates appropriate cloud resources to match them. We use this
to provision a GCP Service account and a Storage bucket for each
hub.

Since these are GCP specific, running them on AWS fails. This
PR puts them behind a switch, so we can work on getting things to
AWS.

Eventually, it should also support AWS resources via the
AWS Service broker (https://aws.amazon.com/partners/servicebroker/)

Ref 2i2c-org#366
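As an illustration of the Config Connector approach that commit describes, a per-hub bucket is declared as a Kubernetes resource roughly like this; the resource and project names are placeholders, and the actual templates live in the pilot-hubs repo:

```yaml
# Hypothetical Config Connector manifest -- the connector reconciles this CRD
# into a real GCS bucket. Names and project ID are illustrative only.
apiVersion: storage.cnrm.cloud.google.com/v1beta1
kind: StorageBucket
metadata:
  name: example-hub-scratch
  annotations:
    cnrm.cloud.google.com/project-id: example-gcp-project
spec:
  location: US
```

Since these CRDs only mean something to the GCP-specific connector, applying them on an AWS cluster fails, which is exactly why the commit gates them behind a switch.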
@yuvipanda
Member

#374 puts some GCP specific stuff behind a feature flag

@damianavila
Contributor

damianavila commented May 4, 2021

#379 (by @yuvipanda) collects several PRs toward this goal.

@jhamman

jhamman commented May 4, 2021

hey all -- just chiming in here to say that we're super interested in these developments, as we're looking to set up a new Pangeo-like hub on AWS in the near future. If there's anything we can do to help move things along, just let me know.

@yuvipanda
Member

We'll need to figure out how to manage 2i2c user access to these AWS credentials. I asked @jhamman for access here. We should make sure that:

  • The access we have is documented
  • How users can grant us this access is documented
  • How new engineers get this access is documented

yuvipanda added a commit to yuvipanda/pilot-hubs that referenced this issue May 4, 2021
https://pilot-hubs.2i2c.org/en/latest/topic/storage-layer.html
has more info on the nfs-share-creator. This PR adds support for
setting baseSharePath to `/`, which is sometimes needed on EFS.

Ref 2i2c-org#366
@damianavila
Contributor

Update: I have a kops-based cluster (mimicking the Farallon one) already deployed in OpenScapes AWS land.

@damianavila
Contributor

Also, #379 (supporting hubs deployment in AWS land from the pilot-hub repo) was merged today!

@yuvipanda
Member

#389 + #391 set up new kops-based clusters, plus a small script to set up EFS properly

@damianavila
Contributor

#453 deals with replication/validation + documentation of the current deployment story.

@damianavila
Contributor

#453 was merged, so I am ticking the last item in the first message of this thread and finally closing this one!!

Btw, there could be some other remaining things to be done but those are described and captured in follow-up issues.

Deliverables Backlog automation moved this from In progress ⚡ to Done 🎉 Jun 18, 2021
@choldgraf
Member Author

@choldgraf choldgraf moved this from Done 🎉 to Managed JupyterHub Service Launch in Deliverables Backlog Aug 8, 2021
Projects: Deliverables Backlog; Managed JupyterHubs Infrastructure
Development: no branches or pull requests
4 participants