basehub: rely on a single user-scheduler replica #3869

Merged

@consideRatio consideRatio commented Mar 27, 2024

The issue being fixed (#3865) includes the motivation and why I think it's safe enough to do.

This comment was marked as resolved.

@yuvipanda yuvipanda left a comment

If the scheduler is down, no user pods will be scheduled, so I'm concerned about moving this to 1 replica. Figuring out whether this is acceptable, and which failure modes we need to account for, requires more research, as node failure is not necessarily the only reason to make a tool HA.

Please bring this up in the next sprint planning meeting so we can prioritize this accordingly with the rest of our tasks.

@consideRatio
Member Author

> Please bring this up in the next sprint planning meeting so we can prioritize this accordingly with the rest of our tasks.

I'll try to do that, but if we don't manage to allocate time to review this in the next sprint, I propose we merge it without further review. I'm very confident that this change is fine to make without regressions.

I'm looking at shared cluster cost in AWS, and found that we have two core nodes running due to too many pods, where each node can schedule up to 57 pods. With this change, we can almost (still 2 pods too many) go below the need to have two core nodes.
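As a rough sketch of the arithmetic behind this, the number of core nodes needed is the pod count divided by the per-node limit, rounded up. The 57-pod limit and the "2 pods too many" figure come from the comment above; the function name is hypothetical, and the sketch assumes the per-node pod limit is the only constraint (it ignores CPU and memory requests).

```python
import math

PODS_PER_NODE = 57  # per-node pod limit quoted in the comment above


def core_nodes_needed(pod_count: int, pods_per_node: int = PODS_PER_NODE) -> int:
    """Return the minimum number of nodes needed to fit pod_count pods.

    Hypothetical helper: only the pod-count limit is considered.
    """
    return math.ceil(pod_count / pods_per_node)


# "Still 2 pods too many": after this change the core pool would hold
# 57 + 2 = 59 pods, which still requires a second node.
pods_after_change = PODS_PER_NODE + 2
print(core_nodes_needed(pods_after_change))      # 2
# Removing two more pods would let everything fit on a single core node.
print(core_nodes_needed(pods_after_change - 2))  # 1
```

Because the result is a ceiling, being even one pod over the limit costs a whole extra node, which is why trimming a few more core pods matters here.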

@consideRatio
Member Author

With this PR merged and #3999 resolved, we would cut core node costs by half in 2i2c-aws-us.

@consideRatio consideRatio force-pushed the pr/reduce-user-scheduler-replicas branch from af2bfd8 to 8c4e472 Compare May 7, 2024 13:13
@consideRatio
Member Author

I'll go for a merge here to save 2i2c-aws-us and catalystproject-africa almost 200 USD a month per cluster in core node pool costs. I'm very confident that this change is safe to make, @yuvipanda, and have self-reviewed my thinking about this a ~fourth time or so now.

@consideRatio consideRatio merged commit b8d0345 into 2i2c-org:main May 7, 2024
38 checks passed
github-actions bot commented May 7, 2024

🎉🎉🎉🎉

Monitor the deployment of the hubs here 👉 https://github.com/2i2c-org/infrastructure/actions/runs/8986088624

Successfully merging this pull request may close these issues.

user-scheduler: default to only one replica?
2 participants