Try spread pods in HA to different nodes on single zone deployments #7666

Open · wants to merge 4 commits into base: master

Conversation

@filintod (Contributor) commented Apr 2, 2024

Description

Add a soft (preferred) anti-affinity rule with topology key kubernetes.io/hostname and a weight of 50, so that zone anti-affinity still gets more priority when multiple zones are available.
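
For illustration, the rule in question could look roughly like the sketch below once rendered into a pod spec. This is not the chart's actual template; the label selector and the zone rule's weight of 100 are assumptions used only to show the relative weighting.

```yaml
# Sketch only: soft anti-affinity spreading replicas across zones first
# (higher weight) and across nodes second (weight 50), as proposed here.
# The app label and the zone weight are illustrative assumptions.
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100                     # existing zone rule (assumed weight)
        podAffinityTerm:
          topologyKey: topology.kubernetes.io/zone
          labelSelector:
            matchLabels:
              app: dapr-sentry          # hypothetical selector
      - weight: 50                      # new hostname rule from this PR
        podAffinityTerm:
          topologyKey: kubernetes.io/hostname
          labelSelector:
            matchLabels:
              app: dapr-sentry          # hypothetical selector
```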

Issue reference

Please reference the issue this PR will close: #7665

Checklist

Please make sure you've completed the relevant tasks for this PR, out of the following list:

filintod requested review from a team as code owners on April 2, 2024 04:47
Signed-off-by: Filinto Duran <filinto@diagrid.io>

codecov bot commented Apr 2, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 61.83%. Comparing base (091a204) to head (c1370dc).
Report is 3 commits behind head on master.

❗ Current head c1370dc differs from pull request most recent head 3db261a. Consider uploading reports for the commit 3db261a to get more accurate results

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #7666      +/-   ##
==========================================
+ Coverage   61.39%   61.83%   +0.44%     
==========================================
  Files         265      245      -20     
  Lines       22609    22418     -191     
==========================================
- Hits        13880    13863      -17     
+ Misses       7579     7393     -186     
- Partials     1150     1162      +12     


@JoshVanL (Contributor) commented Apr 2, 2024

@filintod, the kube-scheduler already scores based on node locality for a ReplicaSet. Setting this value seems to me like we are just adding more processing time to the scheduling process. The same goes for the existing zone affinity rule: do we actually need this?

@filintod (Contributor, Author) commented Apr 2, 2024

@JoshVanL I was seeing pods of the same Dapr service (i.e. dapr-sentry) running on the same node for some customers. It could also be that there was only a single node, but that seems strange for prod systems. Where do you see this default node anti-affinity for ReplicaSets? That was not the behavior in the past, but maybe it is nowadays. I looked in the code, but the score also takes general node load into account, and we might really want to separate these pods onto different nodes, especially because we don't have a Guaranteed QoS class defined for them.

Regarding the zone anti-affinity, that was already there. It is a fairly common best practice for high availability, but more so for stateful systems where you care about a whole zone going down, so it is probably not needed for all Dapr services (i.e. the stateless ones). On the other hand, if your load is spread across zones you might also get lower latency talking to the application in your own zone, although inter-zone latency should not be that large; it is another consideration, though.

@JoshVanL (Contributor) commented Apr 2, 2024

@filintod I think we should do some experimenting with also removing the zone affinity rules. The scheduler also takes this into account by default. By adding custom rules we change the default scoring, which might have unintended consequences and fight sane defaults.

I would have thought better pod priority is the thing to focus on to ensure uptime of the control plane.
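
For context, a pod-priority approach could look roughly like the sketch below. The class name and value are illustrative assumptions, not something defined by Dapr or this PR.

```yaml
# Hypothetical PriorityClass for the Dapr control plane; a Deployment would
# reference it via spec.template.spec.priorityClassName so its pods are
# evicted last under node pressure. Name and value are assumptions.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: dapr-control-plane-critical   # hypothetical name
value: 1000000                        # higher than ordinary app pods
globalDefault: false
preemptionPolicy: PreemptLowerPriority
description: "Keeps Dapr control-plane pods scheduled ahead of consuming apps under pressure."
```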

@filintod (Contributor, Author) commented Apr 2, 2024

One thing is that you cannot call it HA and have things running on the same node, and in some way the same goes for multi-zone. I need to look more at how the scoring works nowadays to see how much weight is given to each.

@JoshVanL (Contributor) commented Apr 2, 2024

I said uptime, not HA 🙂 I see these as two separate things: uptime is a subset property of achieving HA, but HA also incorporates the idea of replication.

The Dapr control plane is not sensitive to churn the way an application serving business traffic and needing network failover would be. A single-replica Dapr control plane can handle plenty; it just needs to be up, somewhere, and it needs a higher priority of being up than the consuming Dapr apps.

@filintod (Contributor, Author) commented Apr 2, 2024

Yes, what I meant is that in Dapr we cannot advertise HA, as we do here https://github.com/dapr/dapr/blob/master/charts/dapr/values.yaml#L29, when people might expect fault tolerance; ensuring spreading across different nodes/AZs is part of that, and the scheduler might not give it as high a priority as load.

Uptime is for sure important, and we should find ways to increase the priority of these services so that when pods have to be evicted they are not among the first to go, but that is probably separate from this PR.

@filintod (Contributor, Author) commented Apr 3, 2024

Btw, I thought I had seen the ReplicaSet node spread locality you mentioned, but it is actually different: https://kubernetes.io/docs/reference/scheduling/config/. So let me know if you can point me to where you see it. With the default plugins you could get balance if all nodes have somewhat balanced allocatable load, but over time that is usually not the case without some sort of rebalancing (there are tools for that, like the descheduler). If that is the case, we do need anti-affinity to tilt the balance toward not having all pods on the same node, if possible.
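
For reference, the linked docs describe the built-in spreading defaults of the PodTopologySpread plugin roughly as below (values approximate and worth checking against the cluster's scheduler version). Since these are soft constraints with a generous allowed skew, replicas can still land on the same node.

```yaml
# Approximate built-in defaults of the kube-scheduler PodTopologySpread
# plugin (from the Kubernetes scheduling configuration docs): spreading is
# only preferred (ScheduleAnyway), and a skew of 3 per node / 5 per zone is
# tolerated, so same-node placement is still possible.
defaultConstraints:
  - maxSkew: 3
    topologyKey: kubernetes.io/hostname
    whenUnsatisfiable: ScheduleAnyway
  - maxSkew: 5
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: ScheduleAnyway
defaultingType: System
```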

@daixiang0 (Member) commented:

> I think we should do some experimenting with also removing the zone affinity rules. The scheduler also takes this into account by default. By adding custom rules we change the default scoring, which might have unintended consequences and fight sane defaults.
>
> I would have thought better pod priority is the thing to focus on to ensure uptime of the control plane.

Yeah, it should not be done through pod affinity. Maybe we can add a check on the ready endpoint in the sidecar to ensure daprd is running.
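
A minimal sketch of that suggestion, assuming daprd's default public health endpoint; the path and port here are assumptions to verify against the sidecar injector's actual configuration.

```yaml
# Hypothetical readiness probe against the daprd sidecar health endpoint.
# Path/port assume daprd's default public HTTP health server; verify against
# the injector's actual settings before relying on this.
readinessProbe:
  httpGet:
    path: /v1.0/healthz
    port: 3501
  initialDelaySeconds: 3
  periodSeconds: 6
  failureThreshold: 3
```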

Labels: none yet
Projects: none yet
Development: successfully merging this pull request may close the issue "HA spread of pods on single zone deployments not preferring different nodes" (#7665)
5 participants