Deploy and operate a BinderHub for Pangeo #919

Open
6 of 7 tasks
choldgraf opened this issue Jan 4, 2022 · 55 comments

Comments

@choldgraf
Member

choldgraf commented Jan 4, 2022

Description / problem to solve

Problem description
The Pangeo BinderHub has been down for about a month (due to crypto mining, but also because it did not have operational support to keep it going sustainably). The Pangeo community made heavy use of their Binder deployment, and it powered a lot of reproducible sharing (e.g., via gallery.pangeo.io).

Proposed solution
We should deploy a BinderHub on the 2i2c deployment infrastructure that can live in parallel to the JupyterHub we run for the Pangeo community. We'll need to make a few modifications to their setup (including using up-to-date binderhub versions and locking down auth more reliably).

What's the value and who would benefit
This would allow the Pangeo community to regain the use of their BinderHub, which would benefit many people!

Implementation guide and constraints

There are a few things that we should consider here:

  • We'll need to update our configuration and CI/CD infrastructure to be able to deploy a BinderHub chart, since we're currently assuming JupyterHub.
  • We want to make sure we don't deploy a BinderHub and immediately run into the same crypto-mining issues that Pangeo ran into before.

Here's a GitHub issue where @scottyhq describes the environment that was available on the Pangeo BinderHub: pangeo-data/pangeo-binder#195 (comment)

Updates and ongoing work

Here are a few major issues that would need to be tackled as part of this effort:

Admin

@choldgraf
Member Author

cc @rabernat and @sgibson91 - is there anything major here that I am missing? I believe that @sgibson91 is working with @consideRatio on #857 right now, which is laying the foundation for letting us deploy BinderHubs from the Pangeo cluster CI/CD.

@sgibson91
Member

@choldgraf I think this is a really nice outline of the work that needs to be done to get us into a position where we are ready to deploy a BinderHub. I'm happy with how this is, and we can add to the list of tasks as and when they arise.

@sgibson91
Member

pangeo-data/pangeo-binder#194

I'm also linking this issue as a future reminder to myself to ask about container registries for Pangeo Binder, but that is a ways down the road yet.

@alxmrs

alxmrs commented Mar 16, 2022

Here's a left-field suggestion: What if we don't implement a log-in system, because we never host a remote server for the jupyter notebook -- and the experience stays mostly the same?

Specifically, could we run the notebooks in the browser in a Wasm python environment via JupyterLite? Here, the demo notebooks could be hosted on a static webpage.

@alxmrs

alxmrs commented Mar 16, 2022

For context: The use case I had in mind was for distributing low-friction demos that don't require a log-in. This is related to the "whitelist vs blacklist" discussion around log-ins in today's meeting.

@choldgraf
Member Author

@alxmrs if we could define a subset of workflows and/or datasets that were possible to use in JupyterLite, this would definitely be a faster way to onboard people into the Pangeo community. I think the trick will be figuring out the "hand-off" between JupyterLite and a situation where you need a fully-loaded environment, so that it doesn't confuse or frustrate people.

But at the very least, it shouldn't be too hard to try a demo out. For example, here's the repository that serves the JupyterLite instance linked from try.jupyter.org:

https://github.com/jupyter/try-jupyter

That shouldn't be too hard to replicate for Pangeo's use-case. I bet you could curate a few notebooks that showed off basic functionality to get people started (but it probably wouldn't work for the more advanced things like Dask Gateway, Zarr, etc).

@sgibson91
Member

sgibson91 commented Apr 11, 2022

I'd like to start working on this in the next couple of weeks. @yuvipanda are there any strategy discussions we need to have?

Questions I have:

  • Is it sensible to reuse the existing pangeo-hubs cluster for this binderhub or is a new cluster needed?

@damianavila
Contributor

I presume this project might need a dedicated project board to collect all the associated issues.

@rabernat
Contributor

I am very happy to see this moving forward! 🤩

Is it sensible to reuse the existing pangeo-hubs cluster for this binderhub or is a new cluster needed?

This will be paid from the same grant that is covering the current GCP Pangeo Hub (EarthCube Pangeo Forge award). So they will go to the same billing account. If it is easier to put everything in one cluster, that's fine with me. From the "hub owner" perspective, it would still be useful to be able to segregate costs for the binder.

@yuvipanda
Member

So sorry for the delay, @sgibson91.

  • I think this should run in a different GCP project, and ideally a project that 2i2c bills for rather than one that Columbia manages. I can't find the issue where we discussed this, but I remember @choldgraf mentioning that we can set up a project ourselves and bill Columbia for it. Let's do that so we simplify our cloud access story? https://infrastructure.2i2c.org/en/latest/howto/cloud/new-gcp-project.html has info on setting up a new one. We can temporarily start out on the existing project to prototype if needed, but should then switch to the new project.
  • Structurally, I'd imagine we'd make a 'binderhub' helm-chart, which has a dependency on both binderhub and dask-gateway. We can use condition to disable dask-gateway in future binderhubs that don't need this. A big problem here is the lack of composability in helm, and we will have to duplicate a bunch of things from our basehub and daskhub chart values.yaml files :( Is there any way to avoid this? We'd also need to make sure the z2jh version matches what we have in basehub, and I'm not sure how exactly to do that either.
  • For Auth, let's use CILogon. And during development, I think we can restrict login (in a similar way to "Restrict cloudbank demo hub to edu & 2i2c users", #1218), but longer term we should open it up to everyone with CILogon access (many people do not have a .edu email address!). But this should protect us from miners while we figure that out.
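
To make the "restrict login during development" idea a bit more concrete, here is a minimal sketch of a domain allowlist check. It is only an illustration: it is not CILogon's or JupyterHub's actual configuration, and the function names and the example domain list (edu and 2i2c.org) are hypothetical.

```python
from typing import Iterable


def email_domain(email: str) -> str:
    """Return the lowercased domain of an email address, or "" if there is none."""
    if "@" not in email:
        return ""
    # Everything after the last "@" is treated as the domain.
    return email.strip().rpartition("@")[2].lower()


def is_allowed(email: str, allowed_domains: Iterable[str],
               allow_subdomains: bool = True) -> bool:
    """Check whether an email belongs to one of the allowed domains.

    With allow_subdomains=True, an entry such as "edu" also matches
    "columbia.edu", but never a lookalike such as "notedu".
    """
    domain = email_domain(email)
    if not domain:
        return False
    for allowed in allowed_domains:
        allowed = allowed.lower().lstrip(".")
        if domain == allowed:
            return True
        if allow_subdomains and domain.endswith("." + allowed):
            return True
    return False
```

Opening the hub up to everyone with CILogon access later would then amount to dropping the allowlist check rather than changing the login flow.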

@choldgraf
Member Author

Regarding 2i2c paying for cloud. I think that this would require a change to the contract that 2i2c has with Pangeo (which currently only covers personnel costs). Can we do two things:

  1. @rabernat could you confirm that the approach @yuvipanda describes above is what you'd like to go with?
  2. If "yes", then our next step on the admin side is to ask CS&S to request an amendment (or an addition?) to the current sub-award contract.

@sgibson91
Member

sgibson91 commented May 6, 2022

While we wait for @rabernat to update us on the contracting question, I believe the below issue is at least actionable. I will open an issue to track it.

  • Structurally, I'd imagine we'd make a 'binderhub' helm-chart, which has a dependency on both binderhub and dask-gateway. We can use condition to disable dask-gateway in future binderhubs that don't need this. A big problem here is the lack of composability in helm, and we will have to duplicate a bunch of things from our basehub and daskhub chart values.yaml files :( Is there any way to avoid this? We'd also need to make sure the z2jh version matches what we have in basehub, and I'm not sure how exactly to do that either.
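
For whoever picks this up, here is a rough sketch of the structure described in that bullet, written as plain Python dicts that mirror the shape of a Helm Chart.yaml. Everything specific in it is an assumption: the chart version, the repository URLs, the `dask-gateway.enabled` condition key, and the helper names are illustrative, not what our repo contains. The second helper only shows one possible way to check that the z2jh version matches basehub.

```python
from typing import Optional


def binderhub_wrapper_chart(binderhub_version: str, dask_gateway_version: str) -> dict:
    """Build a Chart.yaml-like dict for a wrapper chart with two dependencies."""
    return {
        "apiVersion": "v2",
        "name": "binderhub",
        "version": "0.1.0",  # placeholder version of the wrapper chart itself
        "dependencies": [
            {
                "name": "binderhub",
                "version": binderhub_version,
                # Assumed repository URL for the upstream chart.
                "repository": "https://jupyterhub.github.io/helm-chart/",
            },
            {
                "name": "dask-gateway",
                "version": dask_gateway_version,
                # Assumed repository URL for the upstream chart.
                "repository": "https://helm.dask.org/",
                # Helm skips this dependency when the values key is false.
                "condition": "dask-gateway.enabled",
            },
        ],
    }


def dependency_version(chart: dict, name: str) -> Optional[str]:
    """Return the pinned version of a named dependency, or None if absent."""
    for dep in chart.get("dependencies", []):
        if dep.get("name") == name:
            return dep.get("version")
    return None


def z2jh_versions_match(basehub_chart: dict, upstream_binderhub_chart: dict) -> bool:
    """True only if both charts pin the same version of the jupyterhub chart."""
    ours = dependency_version(basehub_chart, "jupyterhub")
    theirs = dependency_version(upstream_binderhub_chart, "jupyterhub")
    return ours is not None and ours == theirs
```

A check like `z2jh_versions_match` could run in CI so that a mismatch fails early, but it does not solve the duplicated values.yaml problem mentioned above.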

EDIT: Issue is here #1280

@rabernat
Contributor

rabernat commented May 6, 2022

I think that this would require a change to the contract that 2i2c has with Pangeo

Let's get the relationships straight. Pangeo has no contract with anyone. Columbia has a contract with 2i2c. AFAICT there are in fact 3 separate contracts now supporting Pangeo-related things (NSF Earthcube @ Columbia, LEAP @ Columbia, M2LInES @ NYU).

  • I think this should run in a different GCP project, and ideally a project that 2i2c bills for rather than one that Columbia manages. I can't find the issue where we discussed this, but I remember @choldgraf mentioning that we can set up a project ourselves and bill Columbia for it. Let's do that so we simplify our cloud access story?

This will be complicated to set up. We have only established such a contract already with NYU, not Columbia. It will require considerable administrative overhead. I would estimate 2 months to revise the existing contract. And there is always the possibility that Columbia may reject the proposal that 2i2c will bill us directly for cloud usage.

Because the cloud costs for this project are exempt from ICR, it is essential that the cloud bill be segregated from the "services" bill.

All that said, I'm fine with trying.

@yuvipanda
Member

@rabernat thanks for offering to try! I think it'll definitely simplify setup and longer term operations.

@choldgraf
Member Author

Thanks @rabernat for sharpening my language - I agree that we need to be clear what organizations are on each side of contracts!

For this case, it sounds like:

  1. It would be easier for 2i2c and operations in the long-term if we have control over the cloud infrastructure.
  2. However it might be complicated to set this up with Columbia.

So, how about I ask CS&S to investigate with the Columbia admin whether this would be complicated to set up. If it seems like it will be massively complicated, then we stick with the status quo and kick the can down the road. If it will not be complicated (say, will take ~ 1 month to set up) then we give this a shot.

If we do set this up, we'd also need the following constraints:

  • There is a separate invoice sent for cloud infrastructure (or @rabernat is it enough that it be a separate line item on a single invoice?)
  • This only applies to the BinderHub, not the JupyterHub we're already running
  • Anything else?

@rabernat
Contributor

  • There is a separate invoice sent for cloud infrastructure (or @rabernat is it enough that it be a separate line item on a single invoice?)

I think it would really be easiest if we got two separate invoices. Otherwise our admins will have to split the charge manually between two different accounts.

@choldgraf
Member Author

Hey all - I fleshed out some of the issues around the administrative / cloud payment challenges here, and added that to our list at the top. See some more conversation on that here:

@sgibson91
Member

We have a test Binder that is up and running on our pilot-hubs cluster! 🎉 All the infrastructure is there to make this repeatable, including auto-deployment through CI/CD. So the only thing blocking progress on reinstating the Pangeo Binder on GCP is the credits situation with Columbia.

@choldgraf
Member Author

Wanted to note that I heard recently from @cgentemann that there are several communities within the NASA ecosystem that would also benefit from having BinderHubs for their workshops and events. This isn't quite the Pangeo community, but it's a useful datapoint to know where people would find value in these Binder services.

The only catch is that all of their data lives in AWS, not in GCP. I don't know how difficult it would be to adapt our infrastructure to AWS as well but just wanted to note this.

@sgibson91
Member

I don't know how difficult it would be to adapt our infrastructure to AWS as well but just wanted to note this

At the minute, it's very hacked together to specifically work with Google Artifact Registry for image storage. We should absolutely fix that, but I actually think we could use an EKS cluster with a GAR, since the cluster and registry are connected through a service account that is provided as a username/password in the hub config, rather than through any k8s-level connection. It shouldn't be too much effort to get the BinderHub working on AWS. BinderHub is already cloud-agnostic; it's more about picking the right templates/config from basehub/daskhub to get the features the community need/want.
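
As a rough illustration of why a username/password connection is not tied to any one cloud, here is a minimal sketch that builds a Docker-style config.json from registry credentials. It assumes the registry accepts basic-auth credentials; the hostname and credentials are placeholders, the function name is hypothetical, and this is not a copy of how our hub config or BinderHub stores them.

```python
import base64
import json


def docker_config_json(registry_host: str, username: str, password: str) -> str:
    """Return a Docker config.json string holding basic-auth credentials.

    The "auth" field is the base64 encoding of "username:password",
    which is the format Docker clients read from config.json.
    """
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    return json.dumps({"auths": {registry_host: {"auth": token}}}, indent=2)


if __name__ == "__main__":
    # Placeholder values only; a real deployment would load these from a secret.
    print(docker_config_json("registry.example.com", "example-user", "example-password"))
```

Nothing in this structure refers to the cluster, which is why the same credentials could in principle be used from an EKS cluster as well as a GKE one.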

Generally, this BinderHub is sort of hacked together because we don't know how #1382 will pan out and it didn't feel beneficial to get a full solution for BinderHub up-and-running when it could all be torn down and refactored in the not-too-distant future.

@choldgraf
Member Author

That's helpful context! So it sounds like:

@sgibson91
Member

sgibson91 commented Jun 24, 2022

Yeah, BUT I also don't want us to start running a whole bunch of hacked together BinderHubs, as that is just loading us up for a giant migration effort when #1382 takes shape/lands. We should maybe cap ourselves at 2-3 (or some other reasonable amount)?

@rabernat
Contributor

FWIW, we have another zombie binder running on AWS, https://hub.aws-uswest2-binder.pangeo.io/. It is being run by a skeleton crew of @scottyhq.

As long as we are looking at AWS, I would be very happy to see a path towards moving this binder into a more stable situation. Perhaps we can kill multiple birds here.

@damianavila
Contributor

I actually think we could use an eks cluster with a GAR since the cluster and registry are connected through a service account that is provided as a username/password in the hub config, rather than any k8s-level connection.

Even if that is possible, maybe it makes sense to explore AWS ECR as well? I guess there would be some benefit to having everything in AWS land when it comes to fetching images...

@yuvipanda
Member

Specifying passwords as we have done is the only way binderhub can push to registries right now (I opened jupyterhub/binderhub#1506), and that's also mostly ok in this context I think. I also don't think your GAR setup is too hacky, @sgibson91! It could be extended to AWS without too much difficulty I think.

If we have the money to run other binderhubs, I think we can now.

I agree #1382 is the way to go but I also worry that's a long way away, and as long as we don't make decisions here that bind possible ways to pursue #1382 I think it's ok to get some more binderhubs running.

@damianavila
Contributor

  1. @yuvipanda get access to @scottyhq AWS for 2i2c engs (will be assigned to @yuvipanda).
  2. Reproducing Kube workflow in my binder-org repo (@sgibson91 to link hijacked issue)

@sgibson91
Member

sgibson91 commented Oct 26, 2022

I've been discussing this idea on this issue in the team-compass repo:

@sgibson91
Member

sgibson91 commented Nov 9, 2022

Current status:

We have $10k in AWS credits remaining from our original NASA ACCESS 2017 project and would love for 2i2c to use that account for prototyping and development of a binderhub on AWS.

Given the above caveat of Scott's AWS account, how/where should we track the setup of the AWS account associated with the Columbia grant?

@damianavila
Contributor

Given the above caveat of Scott's AWS account, how/where should we track the setup of the AWS account associated with the Columbia grant?

I do not fully understand your question, @sgibson91, can you elaborate a little bit more, thanks!

@sgibson91
Member

  • We are setting up the Binder to replace the AWS Binder the Pangeo community were originally operating, and funding it out of the Columbia grant.
  • There is only $10,000 USD on the account Scott gave us access to but it is not an active account. When that money is gone, it's gone. It's also not currently connected to the Columbia account.
  • We need a long-term account that will be paid for by the Columbia grant for the sustainability of the new AWS Binder. Who is/should be in charge of setting that up?

@damianavila
Contributor

Thanks for the additional context.

We need a long-term account that will be paid for by the Columbia grant for the sustainability of the new AWS Binder. Who is/should be in charge of setting that up?

That is a really good question, and I am not sure of the answer...
Given the previous experience, I would say let's make 2i2c responsible to create the AWS account and then pass through the costs, but that also exposes us to some additional risk, I would say. Additionally, I am not sure what is possible from the Columbia grant side, actually...

@sgibson91
Member

Right, and if we pass through costs like that I believe we actually have to change our contract with Columbia, as documented below regarding moving the GCP infrastructure to a 2i2c-managed project

@damianavila
Contributor

Adding @jmunroe into this conversation because there will be contract amendments involved/needed.

@sgibson91
Member

I opened the following upstream issue to track the technical deployment of the infrastructure to mybinder.org-deploy

@rabernat
Contributor

Great to see progress on this. Let me know how I can help.

@sgibson91
Member

sgibson91 commented Nov 28, 2022

@rabernat I think the biggest way you can help is with @jmunroe around the Columbia contract so that we can add cloud billing as a line item on invoicing. That will unblock us on two fronts:

  1. We can set up a sustainable AWS account for this deployment (atm, the plan is to use the account that Scott graciously gave us access to, but those credits will run out eventually and then we need to start billing the Columbia grant)
  2. We will be able to move the current GCP JupyterHub to a 2i2c-managed account and make that more sustainable too, ref: https://github.com/2i2c-org/meta/issues/279#issuecomment-1285294965

@rabernat
Contributor

With @yuvipanda we recently learned that Columbia AWS accounts have none of the restrictions of the GCP accounts. Anyone can get access. Does that change the calculation of the tradeoffs here?

@sgibson91
Member

sgibson91 commented Nov 28, 2022

The contracting change still needs to happen for the GCP deployment. I think the fact that AWS has fewer restrictions is why we decided to go with this binder deployment first. But it would be nice to have a sustainable source of credits/money for it.

@rabernat
Contributor

What's the definition of "sustainable" here? We have about one year of funding on the Moore Foundation award left.

@sgibson91
Member

I was just under the impression that this was supposed to be funded from that pot. If I can avoid having to do a migration between AWS projects in the future, I would prefer it.

@rabernat
Contributor

Sounds good 👍 . Just trying to weigh the relative costs of various technical workarounds vs. the cost of amending the subaward. We have lost admin staff at Columbia recently, so our ability to execute complex budgeting actions is really degraded.

@sgibson91
Member

I think setting up an AWS account attached to that pot of money is a quick win right now. However, when CUIT didn't respond to support our application to join the InCommon Federation, we ran out of other pathways around amending the subaward, in terms of the GCP deployment. I appreciate that it's going to take work, but 2i2c have also been trying to find a way to make working on that deployment less of a headache for a long time and have been repeatedly let down on the Columbia side of operations.

@choldgraf
Member Author

Hey all - I will put together a budget proposal and narrative that includes a line item for cloud costs, and see if we can get this arrangement settled quickly. If we can do this without many months of administrative slowness, then I think it would be worth it in order to reduce the stress of maintaining the infrastructure, and to give us more flexibility in access + configuration that will lead to a better service. I'll report back when we have an idea of how that process goes.

My plan will be for 2i2c to include a budget line item for expected cloud costs. This will be a conservative estimate, and we can include the actual cloud costs in our invoices as a direct pass-through.

I'll confirm with CS&S that they won't take any indirect costs on top of these cloud infrastructure costs.

@sgibson91
Member

sgibson91 commented Apr 19, 2023

Just an update to this thread. The credits Scott offered are now gone (see https://discourse.pangeo.io/t/aws-pangeo-jupyterhubs-to-shut-down-friday-march-17/3228), so we need to figure out how else to fund a Pangeo Binder.

@choldgraf
Member Author

choldgraf commented Apr 19, 2023

I think that means that the funding would need to come from the Columbia grants themselves, is that right? (maybe @rabernat can comment?)

If that is the case then I think we have two options [1]:

  1. Deploy on a Columbia cloud project
    • Unclear to me if this is an option. Maybe @2i2c-org/engineering thinks of it as a non-starter? Please advise.
  2. Deploy on a Pangeo cloud project.

I likely don't have the capacity to spearhead this, so we'll need somebody (@jmunroe @colliand @damianavila) to track and move this forward.

Footnotes

  1. Assuming there's not some other pot of money to fund the infrastructure.

@pnasrat
Contributor

pnasrat commented Apr 19, 2023

As I understand from onboarding, not all engineers have Columbia accounts, and we don't have a clear process to request one. From my perspective, if the whole team is not able to support a cluster/hub, then it is not sustainable to have just limited access to it.

Ref #1799

@sgibson91
Member

sgibson91 commented Apr 19, 2023

Also, it appears Yuvi and I can no longer log in to the Columbia emails we have anyway. Yuvi sent an email to Ryan and Julius. So there are major problems with going the Columbia account route.

@damianavila
Contributor

Yep, setting up in Columbia land is a no-go. This meta issue will deal with the pieces needed for a pass-through.
