Guidelines for using kops vs EKS #431

Closed
Tracked by #627
yuvipanda opened this issue May 25, 2021 · 15 comments
Labels
Enhancement An improvement to something or creating something new.

Comments

@yuvipanda
Member

yuvipanda commented May 25, 2021

We currently use kops to manage AWS clusters. This is primarily driven by clunkiness of EKS and lack of features (see aws/containers-roadmap#724 for example), and has worked out well for us. However, this does mean that we're responsible for the k8s master - and that's a pretty big responsibility! I also made the decision single-handedly at that time, and it's useful to properly evaluate it with some set criteria I think.

We initially started with EKS, and this issue documents some of the process behind switching to kops. However, it's not structured enough to give me confidence that it is the right thing to do, and to help re-evaluate the decision as EKS is fast moving.

Resolution

We've decided to use EKS instead of kops for AWS. See this comment for a rationale: #431 (comment)

@yuvipanda yuvipanda added the 🏷️ cost and Enhancement labels May 25, 2021
@yuvipanda yuvipanda added this to Ready to work 👍 in Activity Backlog May 25, 2021
@damianavila
Contributor

I think making this comparison is a great idea!
I will start filling some of those empty buckets soon (and maybe add some other rows).

@consideRatio
Member

This is primarily driven by clunkiness of EKS and lack of features (see aws/containers-roadmap#724 for example), and has worked out well for us.

I agree there's some clunkiness, but to me it mostly relates to the idea of managed nodes, and I've found that avoiding them makes EKS quite unproblematic. I have not used kops though. I'll try to list all the clunkiness I observe in the various options in this issue going forward.

@yuvipanda
Member Author

There's also two real ways to use EKS - via Terraform or eksctl. pangeo-data/terraform-deploy took the terraform approach, and our previous AWS setups used the eksctl approach. For terraform, you pretty much end up needing to use https://registry.terraform.io/modules/terraform-aws-modules/eks/aws/latest if you want to use things other than managed nodegroups. Or you need to use eksctl. Ideally, everything will be managed by Terraform - but I think right now, the only way to do that is to use that terraform module. It's a pretty big module as well, and I remember running into enough (essential) clunkiness when working on the terraform-deploy repo to move away from it...
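
To illustrate the eksctl route with something other than managed nodegroups, here is a minimal sketch of building an eksctl `ClusterConfig` from Python. The cluster name, region, instance type, sizes, and the `eksctl_cluster_config` helper are all hypothetical placeholders and not taken from any of our actual setups. It relies on eksctl's convention that `nodeGroups` entries are unmanaged, while `managedNodeGroups` would be the EKS-managed kind.

```python
import json


def eksctl_cluster_config(name, region, instance_type, min_size, max_size):
    """Build an eksctl ClusterConfig dict with one unmanaged node group.

    All argument values are illustrative; nothing here reflects a real cluster.
    """
    return {
        "apiVersion": "eksctl.io/v1alpha5",
        "kind": "ClusterConfig",
        "metadata": {"name": name, "region": region},
        # `nodeGroups` are unmanaged (self-managed) node groups.
        # Using `managedNodeGroups` instead would opt in to EKS-managed nodes.
        "nodeGroups": [
            {
                "name": "core",  # hypothetical node group name
                "instanceType": instance_type,
                "minSize": min_size,
                "maxSize": max_size,
            }
        ],
    }


# Hypothetical values, for illustration only.
config = eksctl_cluster_config("example-hub", "us-west-2", "m5.large", 1, 3)
print(json.dumps(config, indent=2))
```

Since JSON is valid YAML, the printed output should be usable as a config file for `eksctl create cluster -f`, though that is worth verifying against the eksctl version in use.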

@damianavila
Contributor

There's also two real ways to use EKS - via Terraform or eksctl.

CloudFormation as well if you are really buried in AWS land 😉.

Ideally, everything will be managed by Terraform

I like the idea of keeping things as agnostic of the cloud provider as we can, so terraform feels tempting...
Even more so since we are using terraform for the GCP stuff. Although I have to say, having played with kops in the last few weeks also gave me that agnostic feeling I always like to experience. And the setup was pretty straightforward (even without prior experience with kops), aside from some issues caused by the fast-moving configuration I was dealing with 😛.

One thing I am curious about (and I think experience will tell us) is the maintenance load that kops could bring, and whether that is too high a price to pay for the customizability you now have...

@yuvipanda
Member Author

One thing I am curious about (and I think experience will tell us) is the maintenance load that kops could bring, and whether that is too high a price to pay for the customizability you now have...

Yeah, I feel like this will ultimately be the primary differentiator. When a kops master fails, what do we do?

@damianavila
Contributor

When a kops master fails, what do we do?

We have a troubleshooting section here: https://kops.sigs.k8s.io/operations/troubleshoot/
But that's not enough. I think we should get experience from real failures... that means being exposed to kops for some time. How could we accelerate that learning phase?

@yuvipanda
Member Author

Something like https://github.com/Netflix/chaosmonkey maybe? I'm not sure.

@damianavila
Contributor

Yep, I was thinking of some tool to actually create random failures... so something like that project could help.
But I am not sure either; maybe it's still too early for even that process.

@yuvipanda
Member Author

@damianavila yeah, I agree that we're still early for that

@choldgraf
Member

choldgraf commented Nov 12, 2021

Just wanted to note this comment thread from the OpenScapes support stuff. It sounds like things with kops are complicated and require more special-casing and manual steps in general. What do people think about this proposal:

Proposal

Stop using kops, and migrate our current kops-based clusters to eks as soon as we can. Remove documentation about kops and replace it with eks-focused docs.

Rationale

I am sure that both options have pros/cons, and certain situations where one is better than the other. But, right now we are spread pretty thin in terms of the different clouds we use, and our bottleneck is human capacity, cloud-specific expertise, and information silos. We should be standardizing on the smallest possible subset of options for our deployments, and "choosing" between EKS vs. kops gives us an unnecessary degree of freedom that makes it harder for everybody to get on the same page. Moreover, it feels like kops requires more manual intervention, so eks is the service we should use.

Next steps

I suggest we take the following next steps:

  • Write a short rationale in this issue for why EKS is the right choice for us relative to kops.
  • Replace documentation about kops with EKS-focused docs instead.
  • Migrate pre-existing clusters to use eks

Thoughts?

Do others object to this proposal, or think we should take a different approach?

@choldgraf choldgraf changed the title from "Lay out guidelines for where we use kops, and where we use EKS" to "Guidelines for using kops vs EKS" Nov 12, 2021
@yuvipanda
Member Author

I'm in favor, @choldgraf

@yuvipanda
Member Author

When we were trying out kops, EKS was about $74 a month for the control plane
(now I think it's $44?). In addition, we would have had to run at least one
node for our hub infra. Together, this meant that for smaller users (like
Openscapes), the base cost of keeping the infrastructure running even with no
users can be pretty high - a few hundred dollars a month. With kops the idea
was that we could run the hub infra on the master nodes as well, cutting down
this cost significantly.

However, running a resilient kops control plane is actually more expensive than
what EKS charges! You need big boxes; k8s control plane processes aren't cheap.
We discovered this the hard way when CarbonPlan was scaling up their hub and the
k8s api would just stop responding. See
#524 and
#526. This convinced us to move
to EKS, as the cost saving goal was actually not met.
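
To make the base cost comparison above concrete, here is a rough sketch of the arithmetic. The $74 control plane fee is the figure quoted above; the hourly instance prices and the helper names are hypothetical placeholders, not real AWS prices. It only shows the shape of the trade-off: kops saves money when a small master is enough, and loses once the master needs to be a big box.

```python
from dataclasses import dataclass

HOURS_PER_MONTH = 730  # approximate average number of hours in a month


@dataclass
class BaseCost:
    """Monthly cost (USD) of keeping a cluster running with zero users."""

    control_plane_fee: float  # flat fee charged by the provider, if any
    instances: float  # always-on instances we pay for ourselves

    @property
    def total(self) -> float:
        return self.control_plane_fee + self.instances


def eks_base_cost(control_plane_fee: float, infra_node_hourly: float) -> BaseCost:
    # EKS: flat control plane fee plus at least one node for hub infra.
    return BaseCost(control_plane_fee, infra_node_hourly * HOURS_PER_MONTH)


def kops_base_cost(master_hourly: float, master_count: int = 1) -> BaseCost:
    # kops: no control plane fee, but we pay for the master instance(s).
    # Hub infra is assumed to run on the master node(s) as well.
    return BaseCost(0.0, master_hourly * master_count * HOURS_PER_MONTH)


# Hypothetical hourly prices, for illustration only.
eks = eks_base_cost(control_plane_fee=74.0, infra_node_hourly=0.10)
kops_small = kops_base_cost(master_hourly=0.10)  # the original cost-saving idea
kops_resilient = kops_base_cost(master_hourly=0.40)  # a "big box" master

for label, cost in [("EKS", eks), ("kops (small master)", kops_small),
                    ("kops (big master)", kops_resilient)]:
    print(f"{label}: ${cost.total:.0f}/month")
```

With these made-up prices, the small kops master is cheaper than EKS, but the larger master needed for a responsive API server costs more than the EKS fee plus an infra node.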

While Openscapes hasn't had this issue (they do not use their hub as much as
CarbonPlan does), it takes a lot of effort to maintain infra for both EKS and
kops. As such, we should just move everyone to EKS and abandon kops.

The issue of base cost reduction for users who only occasionally use their hubs
is still present, however - and something we should tackle. But kops is not
the solution.

@consideRatio
Member

A very big +1 for anything that makes us use fewer tech options; in practice, that means @choldgraf's suggestion! The cost of the added complexity certainly outweighs the cost of machines etc.

@choldgraf
Member

OK, I've updated the top comment here and rescoped #737 to cover migrating to eks. I'll close this one!

@damianavila
Contributor

Belated 👍 as well.
I think kops is actually a pretty interesting beast and I do think it could be an interesting approach in certain scenarios... but in our current state, we should build on top and leverage cloud providers' infrastructure as long as we can to be more efficient.
