Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix inability to access NodePort in particular case #3371

Merged
merged 1 commit into from Mar 4, 2022

Conversation

hongliangl
Copy link
Contributor

@hongliangl hongliangl commented Feb 28, 2022

When a Service NodePort and an Egress CRD has the same backend Pod, accessing
to the NodePort Service may fail in particular cases. Assume that the backend
Pod is on Node A and the Egress's external IP is on Node B. If an external
client (not any K8s Node) accesses the NodePort through IP of Node A where
the backend Pod is running, the access will fail. The root cause is that the
reply packets of NodePort is incorrectly matched by the flow installed by Egress
which is used to match the packets sourced from local Pods and destined for
tunneling to Node B. This PR fixes the issue by loading NXM_NX_REG0[0..3]
(PktSourceField, field to mark packet source) to NXM_NX_CT_MARK[0..3] when Service
connection is committed, then the reply packets of Service connection sourced
from Antrea gateway can be matched by NXM_NX_CT_MARK[0..3] and forced back to
Antrea gateway.

Fix #3362

Signed-off-by: Hongliang Liu lhongliang@vmware.com

@codecov-commenter
Copy link

codecov-commenter commented Feb 28, 2022

Codecov Report

Merging #3371 (bec3b78) into main (57ef15c) will decrease coverage by 7.17%.
The diff coverage is 44.84%.

❗ Current head bec3b78 differs from pull request most recent head a767cfe. Consider uploading reports for the commit a767cfe to get more accurate results

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #3371      +/-   ##
==========================================
- Coverage   60.82%   53.64%   -7.18%     
==========================================
  Files         268      239      -29     
  Lines       26734    34179    +7445     
==========================================
+ Hits        16261    18336    +2075     
- Misses       8690    14067    +5377     
+ Partials     1783     1776       -7     
Flag Coverage Δ
e2e-tests 53.64% <44.84%> (?)
kind-e2e-tests ?
unit-tests ?

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
pkg/agent/types/networkpolicy.go 81.08% <ø> (-2.26%) ⬇️
pkg/controller/egress/controller.go 0.00% <0.00%> (-88.45%) ⬇️
pkg/controller/ipam/validate.go 0.00% <0.00%> (-80.00%) ⬇️
pkg/flowaggregator/flowaggregator.go 63.57% <0.00%> (-3.63%) ⬇️
pkg/ovs/openflow/ofctrl_nxfields.go 67.24% <76.92%> (+1.85%) ⬆️
pkg/util/channel/channel.go 80.64% <80.64%> (ø)
pkg/agent/openflow/pipeline.go 66.27% <94.73%> (-8.38%) ⬇️
pkg/agent/cniserver/pod_configuration.go 42.47% <100.00%> (-11.22%) ⬇️
pkg/agent/cniserver/server.go 40.58% <100.00%> (-25.08%) ⬇️
pkg/agent/controller/egress/egress_controller.go 69.40% <100.00%> (+30.69%) ⬆️
... and 273 more

Copy link
Member

@tnqn tnqn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An e2e test case is required for issues like this to ensure it works as expected and won't be broken again.

@@ -2163,6 +2163,7 @@ func (c *client) snatCommonFlows(nodeIP net.IP, localSubnet net.IPNet, localGate
L3ForwardingTable.BuildFlow(priorityLow).
MatchProtocol(ipProto).
MatchRegMark(FromLocalRegMark).
MatchCTMark(NotServiceCTMark). // Don't match Service packets from local Pods.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't seem right. Service and Egress are not mutually exclusive:

  • A packet accessing a service whose backend pod is an external address should enforce Egress SNAT as well.
  • An inbound NodePort request handled by kube-proxy doesn't have ServiceCTMark set, but should not enforce Egress SNAT.

The real rule is: only request packets should enforce Egress, reply packets shouldn't.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right, I have updated the commit and could you have a look at the updated code to see that if it is right. I'll add e2e test in latter commit. Thanks.

// - Service request packets sourced from Antrea gateway and destined for external Endpoint.
// - Service response packets sourced from local Pods.
// Note that, when Egress is enabled, Service request packets sourced from local Pods and destined for external
// Endpoint should not be matched and such packets should be matched by Egress flow in L3ForwardingTable.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the comment makes sense but the code change doesn't seem reflecting it. How does the flow skip "Service request packets sourced from local Pods"?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a new priority priorityDefault 180 and changed this flow's priority to 180. The priority of the flow that matches the Egress packets is still 190, as a result, Service request packets sourced from local Pods will be matched by the flow with priority 190.

Copy link
Member

@tnqn tnqn Feb 28, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was confused by priorityDefault, thinking it as a priority higher than priorityLow. And I still think the name is confusing with priorityNormal.
Anyway, while I was reviewing the flows in L3Forwarding, I saw there is already a flow which should ensure reply packets wouldn't even go to SNAT table:

flows = append(flows, L3ForwardingTable.BuildFlow(priorityHigh).MatchProtocol(proto).
MatchRegMark(FromLocalRegMark).
MatchCTMark(FromGatewayCTMark).
MatchCTStateRpl(true).MatchCTStateTrk(true).
Action().SetDstMAC(localGatewayMAC).
Action().GotoTable(L3ForwardingTable.GetNext()).
Cookie(c.cookieAllocator.Request(category).Raw()).
Done(),
)

I think we should look at why this flow doesn't take effect.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@hongliangl you might miss this comment. I think the flows has considered this case but somehow doesn't work as expected

Copy link
Contributor Author

@hongliangl hongliangl Mar 1, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry to miss this comment. The reason why this flow doesn't work is that the connection is sourced from Antrea gateway and its type is NodePort. In table 105, we can see the first flow is used to forward Service traffic from Antrea gateway to table 106 directly, as a result, there is no CT mark FromGatewayCTMark on such connections.

 cookie=0x3c000000000000, table=105, priority=210,ct_mark=0x4/0x4,ip,reg0=0x1/0xf actions=resubmit(,106)
 cookie=0x3c000000000000, table=105, priority=200,ct_state=+new+trk,ip,reg0=0x1/0xf actions=ct(commit,table=108,zone=65520,exec(load:0x1->NXM_NX_CT_MARK[1]))
 cookie=0x3c000000000000, table=105, priority=200,ct_state=+new+trk,ip,reg0=0x5/0xf actions=ct(commit,table=108,zone=65520,exec(load:0x1->NXM_NX_CT_MARK[3]))
 cookie=0x3c000000000000, table=105, priority=190,ct_state=+new+trk,ip actions=ct(commit,table=108,zone=65520)
 cookie=0x3c000000000000, table=105, priority=190,ct_state=+trk,ct_mark=0x4/0x4,ip,reg4=0x20000/0x70000 actions=resubmit(,108)
 cookie=0x3c000000000000, table=105, priority=0 actions=resubmit(,108)

The second flow in table 105 is used to load CT mark FromGatewayCTMark, but Service connections sourced from Antrea gateway won't match this flow.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then should we ensure the packet have correct fromGateway mark regardless of whether it is for service, which can fix the issue by making the original design work as expected?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

pkg/agent/openflow/pipeline.go Outdated Show resolved Hide resolved
pkg/agent/openflow/pipeline.go Outdated Show resolved Hide resolved
@hongliangl hongliangl force-pushed the fix-nodeport-egress branch 2 times, most recently from 8dbd77e to 082bd88 Compare February 28, 2022 16:16
Copy link
Contributor

@jianjuns jianjuns left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When a Service NodePort and an Egress CRD has the same backend Pod, assumed
that the backend Pod is on Node A and the external IP of Egress is on Node
B.

When a Service NodePort and an Egress CRD has the same backend Pod, accesses to the NodePort Service may fail in particular cases. Assume that the backend Pod is on Node A and the Egress's external IP is on Node.

If an external client (not any K8s Nodes) accesses the NodePort through
the IP of the Node A in which the backend Pod is running, the access will
fail.

any K8s Nodes -> any K8s Node

the IP of the Node A in which -> IP of Node A where

pkg/agent/openflow/pipeline.go Outdated Show resolved Hide resolved
pkg/agent/openflow/pipeline.go Outdated Show resolved Hide resolved
test/e2e/proxy_test.go Show resolved Hide resolved
test/e2e/proxy_test.go Outdated Show resolved Hide resolved
Values: []string{workerNodeName(1)},
},
}
egress := data.createEgress(t, "test-egress", matchExpressions, map[string]string{"app": "agnhost"}, "", egressNodeIP)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think agnhost Pod has the labels that can match the matchExpressions.
Given egress is not enabled and the label selector won't select the backend pod, I think the test will pass regardless of whether the fix is valid. Could you ensure the test will fail without your change?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated. I have verified that without my change, the test will fail.

test/e2e/proxy_test.go Outdated Show resolved Hide resolved
@hongliangl hongliangl force-pushed the fix-nodeport-egress branch 4 times, most recently from bf86baf to cf43e0a Compare March 1, 2022 16:19
@hongliangl hongliangl changed the title Fix inability to access NodePort in specific case Fix inability to access NodePort in particular case Mar 1, 2022
@tnqn
Copy link
Member

tnqn commented Mar 1, 2022

@hongliangl could you check comment at #3371 (comment)?

@hongliangl hongliangl force-pushed the fix-nodeport-egress branch 2 times, most recently from c1c44c8 to f37aa63 Compare March 1, 2022 17:21
@hongliangl hongliangl requested review from tnqn and jianjuns March 2, 2022 04:50
@hongliangl hongliangl force-pushed the fix-nodeport-egress branch 5 times, most recently from 5317506 to 8102e25 Compare March 2, 2022 14:39
// CTMark (NXM_NX_CT_MARK)
// CTMark[0..3]: Field to mark the source of the connection. This field has the same bits and positions as PktSourceField
// for persisting the value from reg0 to CTMark when committing the first packet of the connection with CT action.
PktSourceCTMarkField = binding.NewCTMarkField(0, 3, "PacketSourceCTMark")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it should be named ConnSourceCTMarkField to avoid confusion. The reply packet of this connection has different source.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated

// is hairpin traffic.
// - When Egress is enabled:
// - Service request packets sourced from Antrea gateway and destined for external Endpoint.
// TODO: The priority of this flow is 190, and the priority of the flow to match Egress traffic from local Pods is
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's just use an issue to track it, instead of adding comments not related to the bug fix which may increase the difficulty of backport. I think all the comment change here could be reverted.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thank we can revert all comment change here except NodePort/LoadBalancer/ClusterIP response packets. since this kind of packets is not matched by this flow anymore.

@@ -342,6 +342,7 @@ type CTAction interface {
LoadToCtMark(mark *CtMark) CTAction
LoadToLabelField(value uint64, labelField *CtLabel) CTAction
MoveToLabel(fromName string, fromRng, labelRng *Range) CTAction
MoveToCtMark(fromRegField *RegField, ctMark *CtMarkField) CTAction
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be MoveToCtMarkField to be consistent with LoadToLabelField and LoadToRegField?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be renamed to MoveToCtMarkField.

@hongliangl
Copy link
Contributor Author

/test-all-features-conformance
/test-conformance
/test-e2e
/test-flexible-ipam-e2e
/test-multicluster-e2e
/test-networkpolicy
/test-windows-conformance
/test-windows-e2e
/test-windows-networkpolicy
/test-windows-proxyall-e2e
/test-integration

tnqn
tnqn previously approved these changes Mar 3, 2022
Copy link
Member

@tnqn tnqn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, @jianjuns @wenyingd could you take a look too?

@@ -1341,8 +1318,7 @@ func (c *client) l3FwdServiceDefaultFlowsViaGW(ipProto binding.Protocol, categor
// - NodePort/LoadBalancer request packets which pass through Antrea gateway and the Service Endpoint is not on
// local Pod CIDR or any remote Pod CIDRs.
// - ClusterIP request packets which are from Antrea gateway and the Service Endpoint is not on local Pod CIDR
// or any remote Pod CIDRs.
// - NodePort/LoadBalancer/ClusterIP response packets.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't true any more?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I missed something.

The following flow can match the reply packets of ClusterIP/NodePort/LoadBalancer connection initiated through Antrea gateway and Endpoint is a local Pod, but it cannot match the reply packets of ClusterIP/NodePort/LoadBalancer connection initiated through Antrea gateway and Endpoint is on external network.

 cookie=0x47000000000000,table=70, priority=210,ct_state=+rpl+trk,ct_mark=0x1/0xf,ip,reg0=0x2/0xf actions=mod_dl_dst:aa:c3:42:b5:e8:7e,resubmit(,80)

So we need to update the comment like this:

- NodePort/LoadBalancer/ClusterIP response packets from external network Endpoint.

}

func NewCTMark(value uint32, start, end uint32) *CtMark {
return &CtMark{value: value, rng: &Range{start, end}}
func NewCTMarkField(start, end uint32, name string) *CtMarkField {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder the name in the field is really useful, but I understand this is to aligh with RegField so I'm fine with keeping it in this PR.

@wenyingd do we really need name in RegField, CtMarkField? They look duplicate with the variable names which already represent their usage and the names in the structs are not used anywhere.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We use name string of register to generate flow key. For ct_mark, I don't think it is required as although we have multiple logical fields, they actually refer to the same OVS field.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We use name string of register to generate flow key. For ct_mark, I don't think it is required as although we have multiple logical fields, they actually refer to the same OVS field.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed name.

@hongliangl
Copy link
Contributor Author

/test-all-features-conformance
/test-conformance
/test-e2e
/test-flexible-ipam-e2e
/test-multicluster-e2e
/test-networkpolicy
/test-windows-conformance
/test-windows-e2e
/test-windows-networkpolicy
/test-windows-proxyall-e2e
/test-integration

@hongliangl
Copy link
Contributor Author

/test-windows-e2e

Copy link
Contributor

@jianjuns jianjuns left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@hongliangl
Copy link
Contributor Author

/test-windows-e2e
/test-windows-networkpolicy
/test-ipv6-all

When a Service NodePort and an Egress CRD has the same backend Pod, accessing
to the NodePort Service may fail in particular cases. Assume that the backend
Pod is on Node A and the Egress's external IP is on Node B. If an external
client (not any K8s Node) accesses the NodePort through IP of Node A where
the backend Pod is running, the access will fail. The root cause is that the
reply packets of NodePort is incorrectly matched by the flow installed by Egress
which is used to match the packets sourced from local Pods and destined for
tunneling to Node B. This PR fixes the issue by loading NXM_NX_REG0[0..3]
(PktSourceField, field to mark packet source) to NXM_NX_CT_MARK[0..3] when Service
connection is committed, then the reply packets of Service connection sourced
from Antrea gateway can be matched by NXM_NX_CT_MARK[0..3] and forced back to
Antrea gateway.

Signed-off-by: Hongliang Liu <lhongliang@vmware.com>
@hongliangl
Copy link
Contributor Author

/test-all-features-conformance
/test-conformance
/test-e2e
/test-flexible-ipam-e2e
/test-multicluster-e2e
/test-networkpolicy
/test-windows-conformance
/test-windows-e2e
/test-windows-networkpolicy
/test-windows-proxyall-e2e
/test-integration

@hongliangl
Copy link
Contributor Author

/test-all-features-conformance

Copy link
Member

@tnqn tnqn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@hongliangl
Copy link
Contributor Author

/test-ipv6-conformance
/test-ipv6-e2e
/test-ipv6-networkpolicy
/test-ipv6-only-conformance
/test-ipv6-only-e2e
/test-ipv6-only-networkpolicy

@hongliangl
Copy link
Contributor Author

/test-windows-networkpolicy

@hongliangl
Copy link
Contributor Author

/test-hw-offload

@tnqn tnqn added action/backport Indicates a PR that requires backports. action/release-note Indicates a PR that should be included in release notes. labels Mar 4, 2022
@tnqn tnqn merged commit 1594e68 into antrea-io:main Mar 4, 2022
@tnqn
Copy link
Member

tnqn commented Mar 4, 2022

@hongliangl could you backport this to previous versions if applicable?

@hongliangl
Copy link
Contributor Author

hongliangl commented Mar 4, 2022

@tnqn Sure, backporting this to 1.5 is ok, but to 1.4, we also need to backport PR #2800.

GraysonWu pushed a commit to GraysonWu/antrea that referenced this pull request Mar 10, 2022
When a Service NodePort and an Egress CRD has the same backend Pod, accessing
to the NodePort Service may fail in particular cases. Assume that the backend
Pod is on Node A and the Egress's external IP is on Node B. If an external
client (not any K8s Node) accesses the NodePort through IP of Node A where
the backend Pod is running, the access will fail. The root cause is that the
reply packets of NodePort is incorrectly matched by the flow installed by Egress
which is used to match the packets sourced from local Pods and destined for
tunneling to Node B. This PR fixes the issue by loading NXM_NX_REG0[0..3]
(PktSourceField, field to mark packet source) to NXM_NX_CT_MARK[0..3] when Service
connection is committed, then the reply packets of Service connection sourced
from Antrea gateway can be matched by NXM_NX_CT_MARK[0..3] and forced back to
Antrea gateway.

Signed-off-by: Hongliang Liu <lhongliang@vmware.com>
@hongliangl hongliangl deleted the fix-nodeport-egress branch April 28, 2022 11:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
action/backport Indicates a PR that requires backports. action/release-note Indicates a PR that should be included in release notes.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

NodePort service is not accessible when Egress (SNAT) is applied to the same Pods.
5 participants