
Create AWS deployment infrastructure #366

Closed
6 tasks done
choldgraf opened this issue Apr 28, 2021 · 22 comments
Labels: Enhancement (An improvement to something or creating something new.)
@choldgraf
Member

choldgraf commented Apr 28, 2021

Summary

We currently focus much of our deployment infrastructure around Google Cloud rather than AWS or Azure. We also have a few clients that would like their hubs working on AWS. We should improve our AWS deployment infrastructure and use these use-cases as forcing functions.

The two use-cases are:

Given that two groups wish to use this infrastructure now, and that AWS is extremely popular and will likely be a commonly-requested provider, I think we should prioritize this one.

Acceptance criteria

We should be able to spin up an AWS Pangeo-style hub with the same ease that we currently have with GKE.

Tasks to complete

A few that came to mind...

  • Get info about current pangeo-hub AWS deployment
  • Manually deploy a kops AWS cluster resembling the pangeo-hubs one
  • Adapt our deploy scripts to support AWS as well.
  • Prototype this with the OpenScapes hub, refinements if necessary.
  • Prototype this with the Carbon Plan hub, refinements if necessary.
  • Finalize and document the AWS setup

Pinging @jhamman and @consideRatio, who may be interested in tracking this (or helping out!).

@choldgraf choldgraf added Enhancement An improvement to something or creating something new. prio: high labels Apr 28, 2021
@choldgraf choldgraf added this to Ready to work 👍 in Deliverables Backlog via automation Apr 28, 2021
@choldgraf choldgraf moved this from Ready to work 👍 to In progress ⚡ in Deliverables Backlog Apr 28, 2021
@yuvipanda
Member

#368 is related

@damianavila
Contributor

#135 is also related

@damianavila
Contributor

#50 might be related as well...

@damianavila
Contributor

Another related one: 2i2c-org/farallon-image#28

@damianavila
Contributor

OK, I have looked into the pilot-hubs and pangeo-hubs codebases. I have also looked at the other issues referenced here.
This is the summary I was able to put together (alongside some questions, of course 😉):

  1. The pilot-hubs repo has a terraform-based GCP cluster (manual?) deployment without any documentation (or did I miss it somehow?)
  2. I suspect the GCP cluster was deployed manually (is there any issue or doc recording how that process was done?)
  3. I see https://github.com/2i2c-org/org-ops as, maybe, the place to actually host the (automated?) process to create the cluster. Is that the proper interpretation? Currently, I do not see a link between the content in the org-ops repo and the process to deploy the terraform template living inside the pilot-hubs repo (but I could be missing stuff since terraform is new to me).

I will continue my comments in a subsequent message, otherwise, it gets pretty long...

@yuvipanda
Member

The pilot-hubs repo has a terraform-based GCP cluster (manual?) deployment without any documentation (or did I miss it somehow?)

Correct! No docs, just #275. I can walk you through this if you would like, but it is a mess.

3. I see https://github.com/2i2c-org/org-ops as, maybe, the place to actually host the (automated?) process to create the cluster. Is that the proper interpretation? Currently, I do not see a link between the content in the org-ops repo and the process to deploy the terraform template living inside the pilot-hubs repo (but I could be missing stuff since terraform is new to me).

The intent of that repo is to only host infrastructure that's org-wide - so managing project access, terraform state workspaces, maybe a centralized grafana (if we get there), etc. Not for per-project terraform. I also continue to find automating terraform deployments terrifyingly complex and super easy to get wrong.
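For concreteness, the terraform workspace mechanism involved looks roughly like the sketch below; the workspace and var-file names are made-up placeholders, not the actual org-ops or pilot-hubs layout:

```sh
# Illustrative sketch only -- workspace and file names are placeholders.
terraform workspace new openscapes          # one isolated state per cluster
terraform workspace select openscapes
terraform plan  -var-file=openscapes.tfvars
terraform apply -var-file=openscapes.tfvars
```

Keeping one workspace per cluster is what makes a shared state backend usable across projects without the per-cluster states colliding.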

@yuvipanda
Member

I opened #369 fixing an issue (but it makes the code scarier!), and adding some docs on the current terraform workspaces in use.

@damianavila
Contributor

Assuming I know the answer to some of the questions I raised above, I think the plan for the item on the list ("Adapt our deploy scripts to support AWS as well") should be:

  1. Manually deploy the kops based cluster into AWS land
    a. Planning to use the standard Zero to JupyterHub docs to create the cluster with @yuvipanda's config files. Do you suggest any other resource?
    b. Planning to use the same guide to set up EFS support instead of the NFS VM story. I think that is what @yuvipanda did in the issue referenced above, and I agree with that. Any other thoughts? What is the story about using an NFS VM in GCP land instead of Filestore (maybe?)

  2. Start digging into the pilot-hubs repo to make it compatible with an AWS cluster:
    a. The Cluster class should be aware of AWS-based clusters
    b. Where are we pushing the image to? We are using quay.io, right?
    c. Looking at the Hub class, I do not see anything GCP-specific; am I missing something (that is most probably the case)?
    d. Still not sure how the NFS to EFS story will play out; I still need to think about that one (connected to 1.b), but I see some relevant config files here that I think I would need to work with...

General thoughts? 😜
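For reference, the manual kops deployment in step 1 might look roughly like the sketch below; the state-store bucket, cluster name, zones, and instance sizes are placeholders, not the actual Farallon or OpenScapes values:

```sh
# Hypothetical sketch of step 1 -- all names and sizes are placeholders.
export KOPS_STATE_STORE=s3://example-2i2c-kops-state   # S3 bucket must exist first
kops create cluster example.k8s.local \
  --zones us-west-2a \
  --node-count 2 \
  --node-size m5.xlarge \
  --master-size t3.medium \
  --yes
kops validate cluster --wait 10m
```

The kubeconfig that kops generates is what the deploy scripts would later consume (see the discussion of `kops export kubecfg` below in this thread).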

@yuvipanda
Member

1. b. Planning to use the same guide to set up EFS support instead of the NFS VM story. I think that is what @yuvipanda did in the issue referenced above and I agree with that. Any other thoughts? What is the story about using an NFS VM in GCP land instead of Filestore (maybe?)

On AWS, EFS is indeed used. On GCP, Filestore is extremely expensive - unlike EFS, it has a minimum disk size of 1TB, easily a few hundred dollars a month just on that. EFS is much closer to true pay-per-use.

2. a. The [Cluster class](https://github.com/2i2c-org/pilot-hubs/blob/7934081b6afaa4e03d49946c0943c63f599f400f/deployer/hub.py#L24) should be aware of AWS-based clusters

Yep! For kops I was thinking we can just ship the kubeconfig generated by `KUBECONFIG=secrets/farallon.yaml kops export kubecfg farallon-2i2c.k8s.local --admin`, but there are probably other options too.
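For illustration, making the Cluster class provider-aware could look something like the sketch below; the field names and methods are hypothetical, not the actual deployer/hub.py API:

```python
# Hypothetical sketch: a provider-aware Cluster that builds the right
# credential-acquisition command for kubectl. Field names are illustrative.
class Cluster:
    def __init__(self, spec):
        self.spec = spec  # parsed cluster entry, e.g. from a hubs.yaml

    def auth_command(self):
        provider = self.spec.get("provider", "gcp")
        if provider == "gcp":
            return [
                "gcloud", "container", "clusters", "get-credentials",
                self.spec["cluster"], "--zone", self.spec["zone"],
            ]
        if provider == "aws":
            # Point kubectl at a kops-exported kubeconfig kept in secrets/
            return ["kubectl", "--kubeconfig", self.spec["kubeconfig"]]
        raise ValueError(f"unknown provider: {provider}")
```

The design choice here is to keep the provider branch in one place so the rest of the deploy pipeline stays cloud-agnostic.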

b. Where are we pushing the image to? We are using quay.io, right?

So right now, that's the suggestion for cases when our users are building the docker image themselves. I think for us, we should try to get the image into ECR. It definitely makes node spin-up time much faster, and this is super important with dask.

2. d. Still not sure how the NFS to EFS story will play out, I still need to think about that one (connected to 1.b), but I see some relevant config files [here](https://github.com/2i2c-org/pilot-hubs/tree/master/hub-templates/base-hub/templates) that I think I would need to work with...

I just mount all EFS as NFS. Just setting something like https://github.com/2i2c-org/pangeo-hubs/blob/staging/deployments/farallon/config/common.yaml#L4 as our nfs.server should 'just work' with the NFS setup we have. https://pilot-hubs.2i2c.org/en/latest/topic/storage-layer.html has some info, particularly the client mounts might be useful.
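Concretely, that would be a config fragment along these lines, modeled on the linked common.yaml; the filesystem ID and region below are made-up placeholders following the DNS pattern AWS uses for EFS mount targets:

```yaml
# Hypothetical fragment -- the EFS filesystem ID and region are placeholders,
# and the exact key nesting depends on the chart's values schema.
nfs:
  server: fs-0123abcd.efs.us-west-2.amazonaws.com
  baseSharePath: /homes
```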

@damianavila
Contributor

The intent of that repo is to only host infrastructure that's org-wide - so managing project access, terraform state workspaces, maybe a centralized grafana (if we get there), etc. Not for per-project terraform.

OK, thanks for the clarification!

I also continue to find automating terraform deployments terrifyingly complex and super easy to get wrong.

Yep, I am kind of getting that as I read about it...

@damianavila
Contributor

On GCP, Filestore is extremely expensive - unlike EFS, there's a minimum disk size of 1TB. Easily a few hundred dollars a month just on that.

I supposed that was the case... thanks for confirming it!

@yuvipanda
Member

Filestore is also just NFSv3, and in general doesn't have a lot of the features of EFS that make it so desirable. SIGH

@damianavila damianavila self-assigned this Apr 30, 2021
yuvipanda added a commit to yuvipanda/pilot-hubs that referenced this issue May 3, 2021
Pangeo hubs have a `PANGEO_SCRATCH` env variable that
points to a GCS bucket, used to share data between users.
We implement that here too, but with a more generic `SCRATCH_BUCKET`
env var (`PANGEO_SCRATCH` is also set for backwards compat).
pangeo-data/pangeo-cloud-federation#610
has some more info on the use cases for `PANGEO_SCRATCH`

Right now, we use Google Config Connector
(https://cloud.google.com/config-connector/docs/overview)
to set this up. We create Kubernetes CRDs, and the connector
creates appropriate cloud resources to match them. We use this
to provision a GCP Service account and a Storage bucket for each
hub.

Since these are GCP specific, running them on AWS fails. This
PR puts them behind a switch, so we can work on getting things to
AWS.

Eventually, it should also support AWS resources via the
AWS Service broker (https://aws.amazon.com/partners/servicebroker/)

Ref 2i2c-org#366
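As an illustration of the Config Connector approach that commit describes, a per-hub bucket is declared as a Kubernetes resource roughly like this; the resource and project names are placeholders, and the actual templates live in the pilot-hubs repo:

```yaml
# Hypothetical Config Connector manifest -- the connector reconciles this CRD
# into a real GCS bucket. Names and project ID are illustrative only.
apiVersion: storage.cnrm.cloud.google.com/v1beta1
kind: StorageBucket
metadata:
  name: example-hub-scratch
  annotations:
    cnrm.cloud.google.com/project-id: example-gcp-project
spec:
  location: US
```

Since these CRDs only mean something to the GCP-specific connector, applying them on an AWS cluster fails, which is exactly why the commit gates them behind a switch.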
@yuvipanda
Member

#374 puts some GCP specific stuff behind a feature flag

@damianavila
Contributor

damianavila commented May 4, 2021

#379 (by @yuvipanda) collects several PRs toward this goal.

@jhamman

jhamman commented May 4, 2021

hey all -- just chiming in here to say that we're super interested in these developments, as we're looking to set up a new Pangeo-like hub on AWS in the near future. If there's anything we can do to help move things along, just let me know.

@yuvipanda
Member

We'll need to figure out how to manage 2i2c user access to these AWS credentials. I asked @jhamman for access here. We should make sure that:

  • The access we have is documented
  • How users can grant us this access is documented
  • How new engineers get this access is documented

yuvipanda added a commit to yuvipanda/pilot-hubs that referenced this issue May 4, 2021
https://pilot-hubs.2i2c.org/en/latest/topic/storage-layer.html
has more info on the nfs-share-creator. This PR adds support for
setting baseSharePath to `/`, which is sometimes needed on EFS.

Ref 2i2c-org#366
@damianavila
Contributor

Update: I have a kops-based cluster (mimicking the Farallon one) already deployed in OpenScapes AWS land.

@damianavila
Contributor

Also, #379 (supporting hubs deployment in AWS land from the pilot-hub repo) was merged today!

@yuvipanda
Member

#389 + #391 set up new kops-based clusters, plus a small script to set up EFS properly

@damianavila
Contributor

#453 deals with replication/validation + documentation of the current deployment story.

@damianavila
Contributor

#453 was merged, so I am ticking the last item in the first message of this thread and finally closing this one!!

Btw, there could be some other remaining things to be done but those are described and captured in follow-up issues.

Deliverables Backlog automation moved this from In progress ⚡ to Done 🎉 Jun 18, 2021
@choldgraf
Member Author

@choldgraf choldgraf moved this from Done 🎉 to Managed JupyterHub Service Launch in Deliverables Backlog Aug 8, 2021
Projects: Deliverables Backlog; Managed JupyterHubs Infrastructure
Development: no branches or pull requests
4 participants