Try spread pods in HA to different nodes on single zone deployments #7666

Open · wants to merge 4 commits into base: master

Conversation

@filintod (Contributor) commented Apr 2, 2024

Description

Add a soft (preferred) anti-affinity rule with topology key kubernetes.io/hostname and a weight of 50, so that zone anti-affinity still gets more priority when multiple zones are available.
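
For illustration, the rule in question could look roughly like the sketch below once rendered into a pod spec. This is not the chart's actual template; the label selector and the zone rule's weight of 100 are assumptions used only to show the relative weighting.

```yaml
# Sketch only: soft anti-affinity spreading replicas across zones first
# (higher weight) and across nodes second (weight 50), as proposed here.
# The app label and the zone weight are illustrative assumptions.
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100                     # existing zone rule (assumed weight)
        podAffinityTerm:
          topologyKey: topology.kubernetes.io/zone
          labelSelector:
            matchLabels:
              app: dapr-sentry          # hypothetical selector
      - weight: 50                      # new hostname rule from this PR
        podAffinityTerm:
          topologyKey: kubernetes.io/hostname
          labelSelector:
            matchLabels:
              app: dapr-sentry          # hypothetical selector
```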

Issue reference

Please reference the issue this PR will close: #7665

Checklist

Please make sure you've completed the relevant tasks for this PR, out of the following list:

filintod requested review from a team as code owners on April 2, 2024 04:47
Signed-off-by: Filinto Duran <filinto@diagrid.io>

codecov bot commented Apr 2, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 61.83%. Comparing base (091a204) to head (c1370dc).
Report is 3 commits behind head on master.

❗ Current head c1370dc differs from pull request most recent head 3db261a. Consider uploading reports for the commit 3db261a to get more accurate results

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #7666      +/-   ##
==========================================
+ Coverage   61.39%   61.83%   +0.44%     
==========================================
  Files         265      245      -20     
  Lines       22609    22418     -191     
==========================================
- Hits        13880    13863      -17     
+ Misses       7579     7393     -186     
- Partials     1150     1162      +12     


@JoshVanL (Contributor) commented Apr 2, 2024

@filintod, the kube-scheduler already scores based on node locality for a ReplicaSet. Setting this value seems to me like we are just adding more processing time to the scheduling process. The same goes for the existing zone affinity rule: do we actually need this?

@filintod (Contributor, Author) commented Apr 2, 2024

@JoshVanL I was seeing pods of the same Dapr service (i.e. dapr-sentry) running on the same node for some customers. It could also be that there was only a single node, but that seems strange for prod systems. Where do you see this default node anti-affinity for ReplicaSets? That was not the behavior in the past, but maybe it is nowadays. I looked in the code, but the score also takes general node load into account, and we might really want to separate these pods onto different nodes, especially because we don't have a Guaranteed QoS class defined for them.

Regarding the zone anti-affinity, that was already there. It is a fairly common best practice for high availability, but more so for stateful systems where you care about a whole zone going down, so it is probably not needed for all Dapr services (i.e. the stateless ones). On the other hand, if your load is spread across zones you might also get lower latency talking to the application in your own zone, although inter-zone latency should not be that large; it is another consideration, though.

@JoshVanL (Contributor) commented Apr 2, 2024

@filintod I think we should do some experimenting with also removing the zone affinity rules. The scheduler also takes this into account by default. By adding custom rules we change the default scoring, which might have unintended consequences and fight sane defaults.

I would have thought better pod priority is the thing to focus on to ensure uptime of the control plane.
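
For context, a pod-priority approach could look roughly like the sketch below. The class name and value are illustrative assumptions, not something defined by Dapr or this PR.

```yaml
# Hypothetical PriorityClass for the Dapr control plane; a Deployment would
# reference it via spec.template.spec.priorityClassName so its pods are
# evicted last under node pressure. Name and value are assumptions.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: dapr-control-plane-critical   # hypothetical name
value: 1000000                        # higher than ordinary app pods
globalDefault: false
preemptionPolicy: PreemptLowerPriority
description: "Keeps Dapr control-plane pods scheduled ahead of consuming apps under pressure."
```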

@filintod (Contributor, Author) commented Apr 2, 2024

One thing is that you cannot call it HA and have things running on the same node, and in some way the same goes for multi-zone. I need to look more at how the scoring works nowadays to see how much weight is given to each.

@JoshVanL (Contributor) commented Apr 2, 2024

I said uptime, not HA 🙂 I see these as two separate things: uptime is a subset property of achieving HA, but HA also incorporates the idea of replication.

The Dapr control plane is not sensitive to churn the way an application serving business traffic and needing network failover would be. A single-replica Dapr control plane can handle plenty; it just needs to be up, somewhere, and it needs a higher priority of being up than the consuming Dapr apps.

@filintod (Contributor, Author) commented Apr 2, 2024

Yes, what I meant is that in Dapr we cannot advertise HA, as we do here https://github.com/dapr/dapr/blob/master/charts/dapr/values.yaml#L29, when people might expect fault tolerance; ensuring spreading across different nodes/AZs is part of that, and the scheduler might not give it as high a priority as load.

Uptime is for sure important, and we should find ways to increase the priority of these services so that when pods have to be evicted they are not among the first to go, but that is probably separate from this PR.

@filintod (Contributor, Author) commented Apr 3, 2024

Btw, I thought I had seen the ReplicaSet node spread locality you mentioned, but it is actually different: https://kubernetes.io/docs/reference/scheduling/config/. So let me know if you can point me to where you see it. With the default plugins you could get balance if all nodes have somewhat balanced allocatable load, but over time that is usually not the case without some sort of rebalancing (there are tools for that, like the descheduler). If that is the case, we do need anti-affinity to tilt the balance toward not having all pods on the same node, if possible.
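
For reference, the linked docs describe the built-in spreading defaults of the PodTopologySpread plugin roughly as below (values approximate and worth checking against the cluster's scheduler version). Since these are soft constraints with a generous allowed skew, replicas can still land on the same node.

```yaml
# Approximate built-in defaults of the kube-scheduler PodTopologySpread
# plugin (from the Kubernetes scheduling configuration docs): spreading is
# only preferred (ScheduleAnyway), and a skew of 3 per node / 5 per zone is
# tolerated, so same-node placement is still possible.
defaultConstraints:
  - maxSkew: 3
    topologyKey: kubernetes.io/hostname
    whenUnsatisfiable: ScheduleAnyway
  - maxSkew: 5
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: ScheduleAnyway
defaultingType: System
```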

@daixiang0 (Member) commented:

> I think we should do some experimenting with also removing the zone affinity rules. The scheduler also takes this into account by default. By adding custom rules we change the default scoring, which might have unintended consequences and fight sane defaults.
>
> I would have thought better pod priority is the thing to focus on to ensure uptime of the control plane.

Yeah, it should not be done through pod affinity. Maybe we can add a check on the ready endpoint in the sidecar to ensure daprd is running.
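
A minimal sketch of that suggestion, assuming daprd's default public health endpoint; the path and port here are assumptions to verify against the sidecar injector's actual configuration.

```yaml
# Hypothetical readiness probe against the daprd sidecar health endpoint.
# Path/port assume daprd's default public HTTP health server; verify against
# the injector's actual settings before relying on this.
readinessProbe:
  httpGet:
    path: /v1.0/healthz
    port: 3501
  initialDelaySeconds: 3
  periodSeconds: 6
  failureThreshold: 3
```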

Labels: none yet
Projects: none yet
Development: successfully merging this pull request may close the issue "HA spread of pods on single zone deployments not preferring different nodes" (#7665)
5 participants