Deploy support chart for the openscapes hub #827

Closed

Conversation

GeorgianaElena
Member

@GeorgianaElena commented Nov 11, 2021

After this PR is merged and the support chart is deployed, we should manually trigger the grafana deployer action in order to have the grafana dashboards available for this hub.

Ref: #810

Note:

(from jupyterhub/grafana-dashboards)

NOTE: ANY CHANGES YOU MAKE VIA THE GRAFANA UI WILL BE OVERWRITTEN NEXT TIME YOU RUN deploy.bash. TO MAKE CHANGES, EDIT THE JSONNET FILE AND DEPLOY AGAIN

So just a heads up that if we deploy the dashboards through the action, any changes made to the other hubs' Grafanas will be overwritten.
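For reference, a GitHub Actions workflow can only be triggered manually if it declares a `workflow_dispatch` trigger. A minimal sketch of what that looks like for a hypothetical dashboards deployer workflow (the workflow name, checkout step, and deploy command are placeholders, not the repo's actual workflow):

```yaml
# Hypothetical workflow sketch; only the workflow_dispatch trigger syntax is
# standard GitHub Actions, everything else here is a placeholder.
name: Deploy grafana dashboards
on:
  workflow_dispatch: {}

jobs:
  deploy-dashboards:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      # Placeholder for the real deploy step, e.g. running deploy.bash from
      # jupyterhub/grafana-dashboards against the hub's Grafana instance.
      - name: Deploy dashboards
        run: ./deploy.bash  # assumes the script and its credentials are available
```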

Contributor

@damianavila left a comment


LGTM, although I think there are two follow-ups:

  1. DNS entry for the grafana DNS you picked: Setup prometheus + grafana for carbonplan #533 (comment)
  2. The grafana dashboards you mentioned

@GeorgianaElena
Member Author

GeorgianaElena commented Nov 11, 2021

> LGTM, although I think there are two follow-ups:

Thanks @damianavila! :D

> DNS entry for the grafana DNS you picked: #533 (comment)

Waa, didn't know we had steps for this written somewhere! This is great <3! TY!
(We have them in the docs too https://infrastructure.2i2c.org/en/latest/howto/operate/grafana.html?highlight=support#deploy-the-support-chart btw 🎉 )

Update: I'm manually deploying the support chart now!

@GeorgianaElena
Member Author

@damianavila, I deployed the support chart manually, but it finished with a timeout error.
The support pods are in a pending state and I see this warning during scheduling:

0/3 nodes are available: 1 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 2 node(s) had taint {hub.jupyter.org_dedicated: user}, that the pod didn't tolerate.

Any ideas on how to spin up a node for the support pods?
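(For context, the two taints in that message are the standard kops master-node taint and the z2jh user-node taint. In a node's spec they look roughly like this; a sketch of the usual taint definitions, not output copied from this cluster:

```yaml
# Typical taints blocking the support pods; the effects shown are the common defaults.
spec:
  taints:
    # Applied by kops to the master node.
    - key: node-role.kubernetes.io/master
      effect: NoSchedule
    # Applied to user nodes so only user pods, which carry a matching
    # toleration, get scheduled there.
    - key: hub.jupyter.org_dedicated
      value: user
      effect: NoSchedule
```

A pod only schedules onto one of these nodes if it declares matching tolerations.)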

@damianavila
Contributor

Mmm... I was expecting most of the support-related pods to be scheduled on the master node that also serves as the core node in our kops-based deployments.
Wondering if we might need some workarounds to get some of the support-related pods running on the master node, as we do with CoreDNS 🤔

@yuvipanda might have more details/thoughts/ideas...

Also, wondering if we would face this one as well: #594
Notice that, in the old carbonplan PR (#533), a few more things were needed to set up support (check the LoadBalancer and traefik removals). And I think that PR was on top of some previous updates where the master node was actually split into master + core independent nodes: #532. So, adding support to kops-based deployments could be more complex than we originally thought... 😢

@damianavila
Contributor

> So, adding support to kops-based deployments could be more complex than we originally thought... 😢

Btw, if things get really complicated here, my inclination would be to default to the current stability and keep the status quo even when that means no support (nor grafana) stuff in the openscapes cluster. We are super close to the event start date and changes should be as minimal and simple as possible if we want to avoid surprises during the event...

@yuvipanda
Member

Yeah, you definitely need the same extra tolerations the CoreDNS pod gets on all the pods we schedule on the master nodes.

Might help to timebox the support node deployment setup. You don't have to use the nginx ingress for the hubs, so the load balancer can just stay for the hubs - we don't have to touch that. However, we must make sure that the extra pods being present on the master node - particularly Prometheus - don't put undue extra stress on the master node's components. So let's make sure they have CPU and memory limits set, and maybe are on a node by themselves?
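A minimal sketch of what that could look like in the support chart's values, assuming the chart exposes the upstream prometheus chart's `server` section (the value paths below are an assumption, not the chart's confirmed schema):

```yaml
# Hypothetical support-chart values; the prometheus.server path is an assumption.
prometheus:
  server:
    # Same toleration the CoreDNS pods carry, so Prometheus can schedule onto
    # the kops master node despite the node-role.kubernetes.io/master taint.
    tolerations:
      - key: node-role.kubernetes.io/master
        operator: Exists
        effect: NoSchedule
    # Explicit requests/limits so Prometheus can't starve the control-plane
    # components running on the same node.
    resources:
      requests:
        cpu: 200m
        memory: 512Mi
      limits:
        cpu: "1"
        memory: 2Gi
```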

@yuvipanda
Member

But let's think of this as 'adding Prometheus and Grafana' rather than 'migrating a kops-based deployment to use the support chart'. Let's migrate off kops soon instead!

@choldgraf
Member

From these conversations, how about this proposal:

Proposal

  • Try deploying the support resources for, say, 1 hour. If that doesn't work, then go with @damianavila's suggestion to focus more on doing what we think will make the infrastructure stable ahead of Monday's event. If somebody thinks "this will definitely take more than an hour", then we should just skip this bit and know that we'll need to pay extra attention via kubectl during the event.
  • Regarding EKS, I made this proposal here to remove kops from our infrastructure and would love to hear thoughts once we have the bandwidth to revisit it.

@damianavila
Contributor

damianavila commented Nov 12, 2021

This will definitely take more than 1 hour, IMHO (in fact, I have already spent more than 1 hour looking at some of these details).

And, more importantly, it implies modifications big enough that I am not comfortable pushing them forward on a Friday evening before the event starts on Monday.

I would suggest closing this PR now and focusing on the transition to EKS after the event.

@choldgraf
Member

> I would suggest closing this PR now and focusing on the transition to EKS after the event.

This sounds like a good plan to me. I agree this feels like we are doing too many things "for the first time and with uncertainty about whether it'll work" just before an event...

@damianavila
Contributor

Closing this one now to avoid confusion about what things are really active for the upcoming event.

Thanks for opening the PR, @GeorgianaElena!!
