Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Do not panic when processing a PacketIn message for a denied connection #3447

Conversation

antoninbas
Copy link
Contributor

If the NetworkPolicy or rule cannot be found in the cache.
I assume that this is ok, since the existing log message used for this
scenario has such a high level (4).

Signed-off-by: Antonin Bas abas@vmware.com

If the NetworkPolicy or rule cannot be found in the cache.
I assume that this is ok, since the existing log message used for this
scenario has such a high level (4).

Signed-off-by: Antonin Bas <abas@vmware.com>
@antoninbas antoninbas requested review from heanlan and tnqn March 12, 2022 01:41
@antoninbas
Copy link
Contributor Author

This is what I observed in the Agent logs during testing:

I0311 22:19:34.507985       8 conntrack_connections.go:148] Conntrack polling successful
I0311 22:19:34.895983       8 config.go:168] Calling handler.OnEndpointsUpdate
I0311 22:19:34.896001       8 config.go:168] Calling handler.OnEndpointsUpdate
I0311 22:19:35.154719       8 ofctrl_bridge.go:275] Received packet: &{Header:{Version:4 Type:10 Length:220 Xid:0} BufferId:4294967295 TotalLen:74 Reason:0 TableId:100 Cookie:1407374883553280 Match:{Type:1 Length:117 Fields:[{Class:32768 Field:0 HasMask:false Length:4 ExperimenterID:0 Value:0xc0011cf7a0 Mask:<nil>} {Class:32768 Field:5 HasMask:false Length:2 ExperimenterID:0 Value:0xc0011cf7a4 Mask:<nil>} {Class:1 Field:0 HasMask:false Length:4 ExperimenterID:0 Value:0xc0011cf7a8 Mask:<nil>} {Class:1 Field:1 HasMask:false Length:4 ExperimenterID:0 Value:0xc0011cf7ac Mask:<nil>} {Class:1 Field:3 HasMask:false Length:4 ExperimenterID:0 Value:0xc0011cf7b0 Mask:<nil>} {Class:1 Field:4 HasMask:false Length:4 ExperimenterID:0 Value:0xc0011cf7b4 Mask:<nil>} {Class:1 Field:7 HasMask:false Length:4 ExperimenterID:0 Value:0xc0011cf7b8 Mask:<nil>} {Class:1 Field:105 HasMask:true Length:8 ExperimenterID:0 Value:0xc0011cf7bc Mask:0xc0011cf7c0} {Class:1 Field:106 HasMask:false Length:2 ExperimenterID:0 Value:0xc0011cf7c4 Mask:<nil>} {Class:1 Field:107 HasMask:false Length:4 ExperimenterID:0 Value:0xc0011cf7c8 Mask:<nil>} {Class:1 Field:120 HasMask:false Length:4 ExperimenterID:0 Value:0xc0008d6420 Mask:<nil>} {Class:1 Field:121 HasMask:false Length:4 ExperimenterID:0 Value:0xc0008d6450 Mask:<nil>} {Class:1 Field:119 HasMask:false Length:1 ExperimenterID:0 Value:0xc0011cf7cc Mask:<nil>} {Class:1 Field:124 HasMask:false Length:2 ExperimenterID:0 Value:0xc0011cf7ce Mask:<nil>} {Class:1 Field:125 HasMask:false Length:2 ExperimenterID:0 Value:0xc0011cf7f0 Mask:<nil>}]} pad:[] Data:{Delimiter:0 HWDst:9a:4c:51:ee:97:f8 HWSrc:66:03:72:9d:f5:42 VLANID:{TPID:0 PCP:0 DEI:0 VID:0} Ethertype:2048 Data:0xc00081f080}}
I0311 22:19:35.154829       8 packetin.go:119] "Received packetIn" reason=0 handler="dnsresponse"
I0311 22:19:35.154849       8 packetin.go:119] "Received packetIn" reason=0 handler="networkpolicy"
I0311 22:19:35.154881       8 packetin.go:162] Cannot find NetworkPolicy or rule that has ruleID 168428067
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x20 pc=0x1bc066b]

goroutine 484 [running]:
antrea.io/antrea/pkg/agent/controller/networkpolicy.(*Controller).storeDenyConnection(0xc000533290, 0xc0011b2140)
	/antrea/pkg/agent/controller/networkpolicy/packetin.go:166 +0x72b
antrea.io/antrea/pkg/agent/controller/networkpolicy.(*Controller).HandlePacketIn(0xc000dc38c0, 0xc0011b2140)
	/antrea/pkg/agent/controller/networkpolicy/packetin.go:63 +0x1d3
antrea.io/antrea/pkg/agent/openflow.(*client).parsePacketIn(0xc000537dc0, 0xc000f24888)
	/antrea/pkg/agent/openflow/packetin.go:120 +0x2ec
created by antrea.io/antrea/pkg/agent/openflow.(*client).subscribeFeaturePacketIn
	/antrea/pkg/agent/openflow/packetin.go:107 +0x125

@codecov-commenter
Copy link

codecov-commenter commented Mar 12, 2022

Codecov Report

Merging #3447 (3086ad5) into main (1cbf816) will decrease coverage by 9.74%.
The diff coverage is 0.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #3447      +/-   ##
==========================================
- Coverage   63.52%   53.77%   -9.75%     
==========================================
  Files         268      374     +106     
  Lines       26900    41448   +14548     
==========================================
+ Hits        17087    22287    +5200     
- Misses       7951    16808    +8857     
- Partials     1862     2353     +491     
Flag Coverage Δ
integration-tests 35.84% <ø> (?)
kind-e2e-tests 51.33% <0.00%> (-0.06%) ⬇️
unit-tests 42.44% <0.00%> (-0.03%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
pkg/agent/controller/networkpolicy/packetin.go 68.36% <0.00%> (-0.71%) ⬇️
pkg/apiserver/certificate/certificate.go 69.44% <0.00%> (-6.95%) ⬇️
pkg/controller/serviceexternalip/controller.go 66.47% <0.00%> (-5.12%) ⬇️
...agent/flowexporter/connections/deny_connections.go 83.07% <0.00%> (-3.08%) ⬇️
pkg/controller/egress/store/egressgroup.go 56.94% <0.00%> (-2.78%) ⬇️
pkg/controller/grouping/controller.go 67.24% <0.00%> (-1.73%) ⬇️
pkg/agent/controller/egress/egress_controller.go 69.26% <0.00%> (-1.47%) ⬇️
...gent/controller/noderoute/node_route_controller.go 54.91% <0.00%> (-0.28%) ⬇️
pkg/agent/openflow/client.go 57.31% <0.00%> (-0.21%) ⬇️
pkg/agent/route/route_linux.go 48.21% <0.00%> (-0.15%) ⬇️
... and 111 more

@antoninbas
Copy link
Contributor Author

antoninbas commented Mar 12, 2022

Thinking more about this, this issue is probably because of a separate bug. It's bad that we panic here, but we panic because the rule ID we extract from the PacketIn message is clearly invalid. As a reminder we allocate ids incrementally starting at 0, and 168428067 is way too high to make any sense. Looking at the rest of the logs, we only have installed 4 rules, with IDs 1,2,3 and 4.

See #3448

Copy link
Member

@tnqn tnqn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@antoninbas
Copy link
Contributor Author

/test-all

@antoninbas
Copy link
Contributor Author

/test-e2e

@antoninbas antoninbas merged commit 1d27432 into antrea-io:main Mar 16, 2022
@antoninbas antoninbas deleted the do-not-panic-for-packetin-with-unknown-policy-or-rule-id branch March 16, 2022 18:54
Atish-iaf pushed a commit to Atish-iaf/antrea that referenced this pull request Mar 17, 2022
…on (antrea-io#3447)

If the NetworkPolicy or rule cannot be found in the cache.
I assume that this is ok, since the existing log message used for this
scenario has such a high level (4).

[CI][e2e] Wait for Service to be realized in proxy tests

Change the mechanism to wait for Service to be realized from sleep for a fixed length of time into poll to make sure that the Service has been realized.

Fixes antrea-io#2275

Signed-off-by: Antonin Bas <abas@vmware.com>
Signed-off-by: Kumar Atish <atish.iaf@gmail.com>
Atish-iaf pushed a commit to Atish-iaf/antrea that referenced this pull request Mar 18, 2022
…on (antrea-io#3447)

If the NetworkPolicy or rule cannot be found in the cache.
I assume that this is ok, since the existing log message used for this
scenario has such a high level (4).

[CI][e2e] Wait for Service to be realized in proxy tests

Changed the mechanism to wait for Service to be realized from sleep for a fixed length of time into poll to make sure that the Service has been realized.

The testProxyHairpin was using nc to listen for TCP connections, and it exits immediately after one successful connection. So the Pod terminated after the waitForService tests. Maybe it is caused by the limitation of nc in busybox. To resolve this, busybox pod has been replaced with agnhost in testProxyHairpin.

Fixes antrea-io#2275

Signed-off-by: Antonin Bas <abas@vmware.com>
Signed-off-by: Kumar Atish <atish.iaf@gmail.com>
Atish-iaf pushed a commit to Atish-iaf/antrea that referenced this pull request Mar 18, 2022
…on (antrea-io#3447)

If the NetworkPolicy or rule cannot be found in the cache.
I assume that this is ok, since the existing log message used for this
scenario has such a high level (4).

[CI][e2e] Wait for Service to be realized in proxy tests

Changed the mechanism to wait for Service to be realized from sleep for a fixed length of time into poll to make sure that the Service has been realized.

The testProxyHairpin was using nc to listen for TCP connections, and it exits immediately after one successful connection. So the Pod terminated after the waitForService tests. Maybe it is caused by the limitation of nc in busybox. To resolve this, busybox pod has been replaced with agnhost in testProxyHairpin.

Fixes antrea-io#2275

Signed-off-by: Antonin Bas <abas@vmware.com>
Signed-off-by: Kumar Atish <atish.iaf@gmail.com>
Atish-iaf pushed a commit to Atish-iaf/antrea that referenced this pull request Mar 18, 2022
…on (antrea-io#3447)

If the NetworkPolicy or rule cannot be found in the cache.
I assume that this is ok, since the existing log message used for this
scenario has such a high level (4).

[CI][e2e] Wait for Service to be realized in proxy tests

Changed the mechanism to wait for Service to be realized from sleep for a fixed length of time into poll to make sure that the Service has been realized.

The testProxyHairpin was using nc to listen for TCP connections, and it exits immediately after one successful connection. So the Pod terminated after the waitForService tests. Maybe it is caused by the limitation of nc in busybox. To resolve this, busybox pod has been replaced with agnhost in testProxyHairpin.

Fixes antrea-io#2275

Signed-off-by: Antonin Bas <abas@vmware.com>
Signed-off-by: Kumar Atish <atish.iaf@gmail.com>
Atish-iaf pushed a commit to Atish-iaf/antrea that referenced this pull request Mar 21, 2022
…on (antrea-io#3447)

If the NetworkPolicy or rule cannot be found in the cache.
I assume that this is ok, since the existing log message used for this
scenario has such a high level (4).

[CI][e2e] Wait for Service to be realized in proxy tests

Changed the mechanism to wait for Service to be realized from sleep for a fixed length of time into poll to make sure that the Service has been realized.

The testProxyHairpin was using nc to listen for TCP connections, and it exits immediately after one successful connection. So the Pod terminated after the waitForService tests. Maybe it is caused by the limitation of nc in busybox. To resolve this, busybox pod has been replaced with agnhost in testProxyHairpin.

Fixes antrea-io#2275

Signed-off-by: Antonin Bas <abas@vmware.com>
Signed-off-by: Kumar Atish <atish.iaf@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants