Fix inability to access NodePort in particular case #3371
Conversation
Codecov Report
@@ Coverage Diff @@
## main #3371 +/- ##
==========================================
- Coverage 60.82% 53.64% -7.18%
==========================================
Files 268 239 -29
Lines 26734 34179 +7445
==========================================
+ Hits 16261 18336 +2075
- Misses 8690 14067 +5377
+ Partials 1783 1776 -7
Flags with carried forward coverage won't be shown.
An e2e test case is required for issues like this to ensure it works as expected and won't be broken again.
pkg/agent/openflow/pipeline.go
Outdated
@@ -2163,6 +2163,7 @@ func (c *client) snatCommonFlows(nodeIP net.IP, localSubnet net.IPNet, localGate
L3ForwardingTable.BuildFlow(priorityLow).
	MatchProtocol(ipProto).
	MatchRegMark(FromLocalRegMark).
	MatchCTMark(NotServiceCTMark). // Don't match Service packets from local Pods.
This doesn't seem right. Service and Egress are not mutually exclusive:
- A packet accessing a service whose backend pod is an external address should enforce Egress SNAT as well.
- An inbound NodePort request handled by kube-proxy doesn't have ServiceCTMark set, but should not enforce Egress SNAT.
The real rule is: only request packets should enforce Egress, reply packets shouldn't.
You are right. I have updated the commit; could you have a look at the updated code to see if it is right? I'll add an e2e test in a later commit. Thanks.
Force-pushed b64f02a to 5390552
pkg/agent/openflow/pipeline.go
Outdated
// - Service request packets sourced from Antrea gateway and destined for external Endpoint.
// - Service response packets sourced from local Pods.
// Note that, when Egress is enabled, Service request packets sourced from local Pods and destined for external
// Endpoint should not be matched and such packets should be matched by Egress flow in L3ForwardingTable.
I think the comment makes sense but the code change doesn't seem to reflect it. How does the flow skip "Service request packets sourced from local Pods"?
I added a new priority, priorityDefault (180), and changed this flow's priority to 180. The priority of the flow that matches the Egress packets is still 190; as a result, Service request packets sourced from local Pods will be matched by the flow with priority 190.
I was confused by priorityDefault, thinking of it as a priority higher than priorityLow. And I still think the name is confusing alongside priorityNormal.
Anyway, while I was reviewing the flows in L3Forwarding, I saw there is already a flow which should ensure reply packets wouldn't even reach the SNAT table:
antrea/pkg/agent/openflow/pipeline.go
Lines 1256 to 1264 in ceae206
flows = append(flows, L3ForwardingTable.BuildFlow(priorityHigh).MatchProtocol(proto).
	MatchRegMark(FromLocalRegMark).
	MatchCTMark(FromGatewayCTMark).
	MatchCTStateRpl(true).MatchCTStateTrk(true).
	Action().SetDstMAC(localGatewayMAC).
	Action().GotoTable(L3ForwardingTable.GetNext()).
	Cookie(c.cookieAllocator.Request(category).Raw()).
	Done(),
)
I think we should look at why this flow doesn't take effect.
@hongliangl you might have missed this comment. I think the flows have considered this case but somehow it doesn't work as expected.
Sorry for missing this comment. The reason why this flow doesn't work is that the connection is sourced from the Antrea gateway and its type is NodePort. In table 105, the first flow forwards Service traffic from the Antrea gateway directly to table 106; as a result, the CT mark FromGatewayCTMark is never set on such connections.
cookie=0x3c000000000000, table=105, priority=210,ct_mark=0x4/0x4,ip,reg0=0x1/0xf actions=resubmit(,106)
cookie=0x3c000000000000, table=105, priority=200,ct_state=+new+trk,ip,reg0=0x1/0xf actions=ct(commit,table=108,zone=65520,exec(load:0x1->NXM_NX_CT_MARK[1]))
cookie=0x3c000000000000, table=105, priority=200,ct_state=+new+trk,ip,reg0=0x5/0xf actions=ct(commit,table=108,zone=65520,exec(load:0x1->NXM_NX_CT_MARK[3]))
cookie=0x3c000000000000, table=105, priority=190,ct_state=+new+trk,ip actions=ct(commit,table=108,zone=65520)
cookie=0x3c000000000000, table=105, priority=190,ct_state=+trk,ct_mark=0x4/0x4,ip,reg4=0x20000/0x70000 actions=resubmit(,108)
cookie=0x3c000000000000, table=105, priority=0 actions=resubmit(,108)
The second flow in table 105 is used to load the CT mark FromGatewayCTMark, but Service connections sourced from the Antrea gateway won't match this flow.
Then should we ensure the packet has the correct FromGateway mark regardless of whether it is for a Service? That would fix the issue by making the original design work as expected.
Fixed.
Force-pushed 8dbd77e to 082bd88
When a Service NodePort and an Egress CRD has the same backend Pod, assumed
that the backend Pod is on Node A and the external IP of Egress is on Node
B.
When a Service NodePort and an Egress CRD have the same backend Pod, access to the NodePort Service may fail in particular cases. Assume that the backend Pod is on Node A and the Egress's external IP is on Node B.
If an external client (not any K8s Nodes) accesses the NodePort through
the IP of the Node A in which the backend Pod is running, the access will
fail.
any K8s Nodes -> any K8s Node
the IP of the Node A in which -> IP of Node A where
Force-pushed 082bd88 to bafbd03
test/e2e/proxy_test.go
Outdated
Values: []string{workerNodeName(1)}, | ||
}, | ||
} | ||
egress := data.createEgress(t, "test-egress", matchExpressions, map[string]string{"app": "agnhost"}, "", egressNodeIP) |
I don't think the agnhost Pod has labels that can match the matchExpressions.
Given Egress is not enabled and the label selector won't select the backend Pod, I think the test will pass regardless of whether the fix is valid. Could you ensure the test fails without your change?
Updated. I have verified that without my change, the test will fail.
Force-pushed bf86baf to cf43e0a
@hongliangl could you check the comment at #3371 (comment)?
Force-pushed c1c44c8 to f37aa63
Force-pushed 5317506 to 8102e25
pkg/agent/openflow/fields.go
Outdated
// CTMark (NXM_NX_CT_MARK)
// CTMark[0..3]: Field to mark the source of the connection. This field has the same bits and positions as PktSourceField
// for persisting the value from reg0 to CTMark when committing the first packet of the connection with CT action.
PktSourceCTMarkField = binding.NewCTMarkField(0, 3, "PacketSourceCTMark")
I think it should be named ConnSourceCTMarkField to avoid confusion. The reply packet of this connection has a different source.
Agreed.
Updated
pkg/agent/openflow/pipeline.go
Outdated
// is hairpin traffic.
// - When Egress is enabled:
//   - Service request packets sourced from Antrea gateway and destined for external Endpoint.
// TODO: The priority of this flow is 190, and the priority of the flow to match Egress traffic from local Pods is
Let's just use an issue to track it, instead of adding comments not related to the bug fix, which may increase the difficulty of backporting. I think all the comment changes here could be reverted.
I think we can revert all comment changes here except "NodePort/LoadBalancer/ClusterIP response packets", since this kind of packet is no longer matched by this flow.
pkg/ovs/openflow/interfaces.go
Outdated
@@ -342,6 +342,7 @@ type CTAction interface {
	LoadToCtMark(mark *CtMark) CTAction
	LoadToLabelField(value uint64, labelField *CtLabel) CTAction
	MoveToLabel(fromName string, fromRng, labelRng *Range) CTAction
	MoveToCtMark(fromRegField *RegField, ctMark *CtMarkField) CTAction
Should this be MoveToCtMarkField to be consistent with LoadToLabelField and LoadToRegField?
This should be renamed to MoveToCtMarkField.
Force-pushed 8f55123 to a767cfe
/test-all-features-conformance
@@ -1341,8 +1318,7 @@ func (c *client) l3FwdServiceDefaultFlowsViaGW(ipProto binding.Protocol, categor
// - NodePort/LoadBalancer request packets which pass through Antrea gateway and the Service Endpoint is not on
//   local Pod CIDR or any remote Pod CIDRs.
// - ClusterIP request packets which are from Antrea gateway and the Service Endpoint is not on local Pod CIDR
//   or any remote Pod CIDRs.
// - NodePort/LoadBalancer/ClusterIP response packets.
Is this no longer true?
I missed something.
The following flow can match the reply packets of a ClusterIP/NodePort/LoadBalancer connection initiated through the Antrea gateway whose Endpoint is a local Pod, but it cannot match the reply packets of such a connection whose Endpoint is on the external network.
cookie=0x47000000000000,table=70, priority=210,ct_state=+rpl+trk,ct_mark=0x1/0xf,ip,reg0=0x2/0xf actions=mod_dl_dst:aa:c3:42:b5:e8:7e,resubmit(,80)
So we need to update the comment like this:
- NodePort/LoadBalancer/ClusterIP response packets from external network Endpoint.
pkg/ovs/openflow/ofctrl_nxfields.go
Outdated
}

func NewCTMark(value uint32, start, end uint32) *CtMark {
	return &CtMark{value: value, rng: &Range{start, end}}
func NewCTMarkField(start, end uint32, name string) *CtMarkField {
I wonder whether the name in the field is really useful, but I understand this is to align with RegField, so I'm fine with keeping it in this PR.
@wenyingd do we really need name in RegField and CtMarkField? They look duplicate with the variable names, which already represent their usage, and the names in the structs are not used anywhere.
We use the name string of a register to generate the flow key. For ct_mark, I don't think it is required: although we have multiple logical fields, they all refer to the same OVS field.
Removed name.
Force-pushed a767cfe to 6b508e5
/test-all-features-conformance
/test-windows-e2e
LGTM
/test-windows-e2e
Force-pushed 6b508e5 to 77bad41
/test-all-features-conformance
/test-all-features-conformance
LGTM
/test-ipv6-conformance
/test-windows-networkpolicy
/test-hw-offload
@hongliangl could you backport this to previous versions if applicable?
When a Service NodePort and an Egress CRD have the same backend Pod, access
to the NodePort Service may fail in particular cases. Assume that the backend
Pod is on Node A and the Egress's external IP is on Node B. If an external
client (not any K8s Node) accesses the NodePort through the IP of Node A where
the backend Pod is running, the access will fail. The root cause is that the
reply packets of NodePort are incorrectly matched by the flow installed by Egress,
which is used to match packets sourced from local Pods and destined for
tunneling to Node B. This PR fixes the issue by loading NXM_NX_REG0[0..3]
(PktSourceField, the field to mark the packet source) to NXM_NX_CT_MARK[0..3] when a Service
connection is committed; then the reply packets of a Service connection sourced
from the Antrea gateway can be matched by NXM_NX_CT_MARK[0..3] and forced back to
the Antrea gateway.
Fix #3362
Signed-off-by: Hongliang Liu <lhongliang@vmware.com>