Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix that Service routes may get lost when starting on Windows #4470

Merged
merged 1 commit into from
Mar 10, 2023

Conversation

hongliangl
Copy link
Contributor

Fix #4467

Signed-off-by: Hongliang Liu lhongliang@vmware.com

@hongliangl hongliangl added action/backport Indicates a PR that requires backports. action/release-note Indicates a PR that should be included in release notes. area/OS/windows Issues or PRs related to the Windows operating system. labels Dec 13, 2022
@codecov
Copy link

codecov bot commented Dec 13, 2022

Codecov Report

Merging #4470 (e8582e2) into main (b503a46) will increase coverage by 0.95%.
The diff coverage is 50.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #4470      +/-   ##
==========================================
+ Coverage   65.89%   66.84%   +0.95%     
==========================================
  Files         402      378      -24     
  Lines       57247    54978    -2269     
==========================================
- Hits        37723    36751     -972     
+ Misses      16822    15519    -1303     
- Partials     2702     2708       +6     
Flag Coverage Δ
integration-tests 34.69% <100.00%> (+0.10%) ⬆️
kind-e2e-tests 40.22% <50.00%> (-7.63%) ⬇️
unit-tests 53.27% <50.00%> (+2.23%) ⬆️
Impacted Files Coverage Δ
...gent/controller/noderoute/node_route_controller.go 58.50% <0.00%> (-0.79%) ⬇️
pkg/agent/route/route_linux.go 70.89% <100.00%> (-1.43%) ⬇️
...g/agent/controller/serviceexternalip/controller.go 35.83% <0.00%> (-45.74%) ⬇️
pkg/features/antrea_features.go 64.00% <0.00%> (-36.00%) ⬇️
pkg/agent/flowexporter/exporter/exporter.go 56.18% <0.00%> (-25.50%) ⬇️
...nt/apiserver/handlers/serviceexternalip/handler.go 29.62% <0.00%> (-22.23%) ⬇️
cmd/antrea-agent/options.go 3.55% <0.00%> (-20.31%) ⬇️
pkg/agent/ipassigner/responder/arp_responder.go 55.29% <0.00%> (-17.65%) ⬇️
...pportbundlecollection/support_bundle_controller.go 47.98% <0.00%> (-17.59%) ⬇️
pkg/agent/controller/networkpolicy/packetin.go 61.48% <0.00%> (-15.55%) ⬇️
... and 86 more

pkg/agent/controller/noderoute/node_route_controller.go Outdated Show resolved Hide resolved
Comment on lines 220 to 222
for _, ingressIP := range svc.Status.LoadBalancer.Ingress {
ingressIPNet := util.NewIPNet(net.ParseIP(ingressIP.IP)).String()
desiredClusterIPSvcIPs.Insert(ingressIPNet)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would cause potential issue by appending the ingress IPs to a set for cluster IPs. For example, it could cause the calculated ClusterIP CIDR much bigger than actual value.
If the needed set is supposed to contain both IPs, please rename the variable.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is only for Windows, Linux ClusterIP CIDR is not calculated from this.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By potential issue, I meant it could cause a bug in future if others think the set contains pure cluster IPs, which is mislead by the name.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got that. Then how about change desiredClusterIPSvcIPs to desiredSvcIPNets?

test/integration/agent/route_test.go Outdated Show resolved Hide resolved
@hongliangl hongliangl force-pushed the 20221213-fix-4467 branch 2 times, most recently from 186d6f5 to 19a1254 Compare December 13, 2022 08:49
wenyingd
wenyingd previously approved these changes Dec 13, 2022
Copy link
Contributor

@wenyingd wenyingd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@shettyg
Copy link
Contributor

shettyg commented Dec 13, 2022

Thanks @hongliangl !

It looks to me that addServiceRoute() still has the issue. The key at [1] is different than the key at [2].
[1]:

obj, found := c.hostRoutes.Load(svcIP.String())

[2]:
c.hostRoutes.Store(route.DestinationSubnet.String(), route)

@shettyg
Copy link
Contributor

shettyg commented Dec 13, 2022

Also, it looked to me that Reconcile() and Add addServiceRoute() can be called at the same time. So, addServiceRoute() can add a route and at the same time, Reconcile() can delete that added route. So there is a race condition.

@hongliangl
Copy link
Contributor Author

antrea/pkg/agent/route/route_windows.go

Good catch! I'll correct it.

@hongliangl
Copy link
Contributor Author

Also, it looked to me that Reconcile() and Add addServiceRoute() can be called at the same time. So, addServiceRoute() can add a route and at the same time, Reconcile() can delete that added route. So there is a race condition.

Yes, they can be called at the same time, but all routes can be installed correctly finally. For example:

For a Service, if Reconcile is called first, corresponding route will not be delete, and a mapping SvcIPNet -> route will be stored to sync.Map hostRoutes; then addServiceRoute is called, and the mapping SvcIPNet -> route is stored in hostRoutes, then do nothing.

For a Service, if addServiceRoute is called first and no mapping ipNetString -> route is found, then install the route and update SvcIPNet -> route to hostRoutes; then Reconcile is called, the route may be included or not in c.listRoutes(), but the route will be delete in Reconcile finally.

@shettyg
Copy link
Contributor

shettyg commented Dec 13, 2022

@hongliangl Consider the following scenario

  1. Reconcile() is about to be called with 500 services. Before Reconcile() calls listRoutes(), the following happens.
  2. 501st new service is created in k8s and addServiceRoute() is called.
  3. addServiceRoute() installs a new route
  4. Reconcile() calls listRoute() and gets newly added route in the list
  5. Reconcile() thinks that the newly added route is stale and deletes it.

@hongliangl
Copy link
Contributor Author

@hongliangl Consider the following scenario

  1. Reconcile() is about to be called with 500 services. Before Reconcile() calls listRoutes(), the following happens.
  2. 501st new service is created in k8s and addServiceRoute() is called.
  3. addServiceRoute() installs a new route
  4. Reconcile() calls listRoute() and gets newly added route in the list
  5. Reconcile() thinks that the newly added route is stale and deletes it.

You are right for this case, but if no new Service is created (without step 2), current code is ok. To cover the case you mentioned above, I'll change the code which is like in route_linux.go.

  • In Reconcile(), skip the route whose gateway is the virtual IP or the destination is the virtual IP.
  • When starting, every route for Service will installed unconditionally.

@shettyg
Copy link
Contributor

shettyg commented Dec 13, 2022

When starting, every route for Service will installed unconditionally.

Thanks. So with the new code, stale routes remain as-is, correct?

@hongliangl
Copy link
Contributor Author

When starting, every route for Service will installed unconditionally.

Thanks. So with the new code, stale routes remain as-is, correct?

Yes, in this way, at least there will be no ClusterIP that is unavailable.

@hongliangl hongliangl force-pushed the 20221213-fix-4467 branch 2 times, most recently from 2be6ee2 to e8582e2 Compare December 14, 2022 02:10
@hongliangl hongliangl added this to the Antrea v1.11 release milestone Jan 3, 2023
@hongliangl
Copy link
Contributor Author

hongliangl commented Mar 1, 2023

Do we need to backport this PR? If we don't need to backport this, I'll close this PR since #4221 can also fix the issue. @tnqn @shettyg @wenyingd

pkg/agent/controller/noderoute/node_route_controller.go Outdated Show resolved Hide resolved
svcIPNet := util.NewIPNet(svcIP)
obj, found := c.hostRoutes.Load(svcIPNet.String())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

L309 also needs update, can we add unit tests to cover them?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

L309 also needs update
Done

can we add unit tests to cover them?

Add some test code to verify the cached route in c.hostRoutes

pkg/agent/route/route_windows.go Outdated Show resolved Hide resolved
@hongliangl hongliangl force-pushed the 20221213-fix-4467 branch 3 times, most recently from 4e35382 to 2cac702 Compare March 10, 2023 04:46
Fix antrea-io#4467

Signed-off-by: Hongliang Liu <lhongliang@vmware.com>
Copy link
Member

@tnqn tnqn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@tnqn
Copy link
Member

tnqn commented Mar 10, 2023

/test-all
/test-windows-all

@tnqn tnqn merged commit cc2f40f into antrea-io:main Mar 10, 2023
This was referenced Mar 24, 2023
jainpulkit22 pushed a commit to urharshitha/antrea that referenced this pull request Apr 28, 2023
@hongliangl hongliangl deleted the 20221213-fix-4467 branch May 6, 2023 01:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
action/backport Indicates a PR that requires backports. action/release-note Indicates a PR that should be included in release notes. area/OS/windows Issues or PRs related to the Windows operating system.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Antrea1.4: Windows node service route logic looks to be broken
5 participants